PerlMonks  

In HTML , I Want to process only Data and Not tags

by sanPerl (Friar)
on Jul 25, 2006 at 19:51 UTC ( [id://563619]=perlquestion )

sanPerl has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I need to change all the data in an HTML file. Since the data sits between tags, I capture it like this. (Please assume that $htmlbuffer contains the contents of the HTML file.)
$htmlbuffer =~ s{>(.*?)<}{
    my $mydata = $1;
    $mydata =~ s/abcd/efgh/g;
    $mydata =~ s/yyy/zzz/g;
    ">$mydata<"
}egsx;

What happens here is that the replacement block runs many times over $htmlbuffer, once for every stretch of text between tags, just because I want to process the data between tags without disturbing the tags themselves. I face the same problem when processing XML data.
This slows the program down. I am sure some expert can suggest a better way.

Regards,
Sandeep
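
For readers following along, here is a minimal sketch of the same idea done in one pass per text chunk, using only core Perl. It assumes the two substitutions from the question are independent of each other (the single pass does not feed the output of one replacement into the next), and the names %replace and rewrite_text are made up for illustration. It still treats anything of the form <...> as a tag, so it has the same fragility as any regex-based approach.

```perl
use strict;
use warnings;

# Hypothetical lookup table; keys and values are taken from the question.
my %replace = ( abcd => 'efgh', yyy => 'zzz' );

# Build one alternation, longest keys first, and compile it once.
my $pattern = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %replace;
my $re = qr/($pattern)/;

sub rewrite_text {
    my ($html) = @_;

    # The capturing group makes split return the tags too, so the list
    # alternates: text, tag, text, tag, ...
    my @chunks = split /(<[^>]*>)/, $html;

    for my $i ( 0 .. $#chunks ) {
        next if $i % 2;                          # odd indices are tags
        $chunks[$i] =~ s/$re/$replace{$1}/g;     # one pass over the text
    }
    return join '', @chunks;
}

my $htmlbuffer = '<p class="abcd">abcd and yyy</p>';
print rewrite_text($htmlbuffer), "\n";
# prints: <p class="abcd">efgh and zzz</p>
```

The attribute value class="abcd" is left alone because it lives inside a tag chunk; only the even-numbered chunks are rewritten.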

Replies are listed 'Best First'.
Re: In HTML , I Want to process only Data and Not tags
by GrandFather (Saint) on Jul 25, 2006 at 20:39 UTC

      I second the vote for HTML::TreeBuilder, but I also would like to recommend XML::TreeBuilder. It uses the same handy API, which just makes my life so much simpler. There are most likely cases where other modules -- such as XML::Twig -- make more sense, but I don't know of them off the top of my head.
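
      A rough sketch of what that could look like with HTML::TreeBuilder, assuming the module is installed from CPAN. The helper name rewrite_text_nodes and the sample markup are invented for illustration; the two substitutions are the ones from the question. It walks the tree with content_refs_list, which hands back references to each child so text segments can be edited in place.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: apply the substitutions to text nodes only.
sub rewrite_text_nodes {
    my ($element) = @_;
    for my $item_ref ( $element->content_refs_list ) {
        if ( ref $$item_ref ) {
            rewrite_text_nodes($$item_ref);    # child element: descend
        }
        else {
            $$item_ref =~ s/abcd/efgh/g;       # plain text segment
            $$item_ref =~ s/yyy/zzz/g;
        }
    }
    return;
}

my $htmlbuffer = '<html><body><p class="abcd">abcd and yyy</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($htmlbuffer);
rewrite_text_nodes($tree);
my $output = $tree->as_HTML;
$tree->delete;    # free the tree once the output has been generated

print $output, "\n";
```

      Tag names and attribute values are never touched because they are not text children. Note that as_HTML re-serialises the whole document, so whitespace and implied tags may differ from the input, and text inside script or style elements is rewritten as well unless you skip those elements.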

Re: In HTML , I Want to process only Data and Not tags
by lorn (Monk) on Jul 25, 2006 at 20:25 UTC
      I think you mean to suggest something like:
      s/>[^>]+</.../
      But in general it's not a good idea to try to roll your own HTML or XML parsing solution when there are plenty of good ones out there.

        I'm sure it does. But what does it work for? As shown it is a match that doesn't capture anything and will match a < at the start of a line, followed by anything at all for as much as it can manage, until it finds a >. For example, all the following match:

        '<>'
        '<tag>'
        "< line of quoted text in an email using '<' instead of the more usual '>'"
        '<tag>the stuff OP wanted to retrieve</tag>'

        Note that what is matched isn't even what the OP wants to retrieve. The OP was after element data: the bit between a start tag and an end tag.

        BTW, the regex matches the whole last sample line, not just the start tag as you might have expected: .* is greedy.
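
        The pattern being discussed is not quoted here, so the sketch below assumes it has the shape /^<.*>/, purely to illustrate the greedy behaviour described above. The names $greedy, $bounded and first_match are made up for the example, and the bounded pattern is only a common tighter variant, not a recommendation to parse HTML this way.

```perl
use strict;
use warnings;

# Assumed stand-in for the pattern under discussion (not from the thread).
my $greedy  = qr/^<.*>/;
# Tighter variant: stop at the first '>' instead of the last one.
my $bounded = qr/^<[^>]*>/;

# Return the text a pattern matched, or undef if it did not match.
sub first_match {
    my ( $re, $string ) = @_;
    my ($matched) = $string =~ /($re)/;
    return $matched;
}

my @samples = (
    '<>',
    '<tag>',
    q{< line of quoted text in an email using '<' instead of the more usual '>'},
    '<tag>the stuff OP wanted to retrieve</tag>',
);

for my $sample (@samples) {
    printf "greedy:  %s\nbounded: %s\n\n",
        first_match( $greedy,  $sample ) // '(no match)',
        first_match( $bounded, $sample ) // '(no match)';
}
```

        For the last sample the greedy pattern returns the whole line, while the bounded one returns only '<tag>'. Neither returns the element data the OP was after.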


        DWIM is Perl's answer to Gödel
