PerlMonks  

Re: Parsing HTML files

by Tux (Monsignor)
on Nov 18, 2010 at 07:46 UTC (#872151)


in reply to Parsing HTML files

With HTML::TreeBuilder, as Your Mother already mentioned, you can do so, but please keep in mind that HTML may change. I have several monitors running that parse HTML constantly, and I have to change the code on a very regular basis because the people who generate or maintain the HTML keep changing it. So one true piece of advice: be very, very defensive in your parsing strategy and don't hardcode the sequence of events. The generator might add a div tag in between, or swap the order of text and image.

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);    # $html holds the HTML text itself, not a file name
    foreach my $img ($tree->look_down (_tag => "img")) {
        my $p = $img->parent;
        $p->tag eq "div" or next;    # <img> not inside a <div>
        my $txt = $p->as_text;
    }
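One way to make the loop above a bit more tolerant of wrapper elements is to search upwards for the nearest enclosing div instead of insisting on the direct parent. This is only a sketch: the sub name div_texts_for_images and the sample markup are made up for illustration, and it assumes that "the closest surrounding div" is really what you want.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the text of the nearest enclosing <div> for every <img>,
# no matter how many wrapper elements sit in between.
# (Hypothetical helper, not part of HTML::TreeBuilder.)
sub div_texts_for_images {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);

    my @texts;
    foreach my $img ($tree->look_down (_tag => "img")) {
        # look_up checks the element itself, then its parent, grandparent, ...
        my $div = $img->look_up (_tag => "div") or next;
        push @texts, $div->as_text;
    }
    $tree->delete;    # release the tree once we have the plain strings
    return @texts;
}

# Made-up sample: the first <img> is wrapped in <span> and <a>, the second has no <div>
my $sample = '<div><span><a href="#"><img src="a.png"></a></span> Caption A</div>'
           . '<p><img src="b.png"> no div here</p>';
print "$_\n" for div_texts_for_images ($sample);
```

If the generator later wraps the image in yet another element, the look_up call still finds the div, whereas the parent check silently skips the image.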

As you can see, this module offers you all the rope you need to hang yourself or to do what you need. It also offers a convenient way to generate nicely formatted HTML from parsed trees:

print $tree->as_HTML (undef, " ", {});

Enjoy, Have FUN! H.Merijn


Re^2: Parsing HTML files
by ajju on Nov 18, 2010 at 19:57 UTC
    Hi Tux, I added my $html = "htmlfilepath"; to your code. Running it gives the error below: Use of uninitialized value in subroutine entry at C:/Perl/site/lib/HTML/TreeBuilder.pm line 121.

      If you had taken the time to read the documentation, e.g. using "perldoc HTML::TreeBuilder", you would have seen (if the method name parse_content wasn't obvious enough already) that to parse a file, you should use the parse_file method.
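      For reference, a minimal sketch of the parse_file variant. The helper name title_of_file and the temporary demo file are assumptions made up for this example; in real code $path would simply be the path to your own HTML file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Hypothetical helper: parse the HTML file at $path and return its <title> text.
sub title_of_file {
    my ($path) = @_;
    -r $path or die "Cannot read $path\n";

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file ($path);    # takes a file name, unlike parse_content
    my $title = $tree->look_down (_tag => "title");
    my $text  = $title ? $title->as_text : undef;
    $tree->delete;
    return $text;
}

# Write a throw-away demo file so the sketch runs on its own
my ($fh, $path) = tempfile (SUFFIX => ".html", UNLINK => 1);
print $fh "<html><head><title>Demo</title></head><body><div>text</div></body></html>";
close $fh;

print title_of_file ($path), "\n";
```

      The point is only that parse_content wants the markup itself in a string, while parse_file wants the name of (or a handle on) a file that contains it.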


      Enjoy, Have FUN! H.Merijn
Re^2: Parsing HTML files
by aquarium (Curate) on Nov 18, 2010 at 22:31 UTC
    totally agree that scraping html is quite bad and unstable. my rough guide for scraping, from most to least desirable:
    • don't scrape...first find out whether there's a RESTful way to get the results instead, via some API or alternate (e.g. XML format) url
    • if the html is well formed (i.e. xhtml) then it is almost guaranteed to be well structured, with proper closing tags etc...so use one of the XML based parsers. you can easily get to specific elements via a well defined hierarchy once parsed
    • this stuff gets ugly when you start slurping whole html into a scalar and progressively hunt for markers from which to pull bits into variables for inspection. even here you can code a bit defensively by picking sane markers, e.g. "id" or "class" attributes. never anchor to text containing inline html styles or other bits that are likely to change frequently, like inline javascript or such.
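    a rough sketch of the "sane markers" idea, using HTML::TreeBuilder from earlier in this thread. the sub name extract_price, the id "price" and the class "price" are invented for the example; substitute whatever stable id or class your target page actually carries.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: find a value by id or class instead of by position in the page.
sub extract_price {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);

    # Prefer the id; fall back to any element whose class list contains "price"
    my $node = $tree->look_down (id => "price")
            || $tree->look_down (class => qr/\bprice\b/);
    my $text = $node ? $node->as_text : undef;
    $tree->delete;
    return $text;    # undef when no marker was found
}

# Made-up markup: the value can move around the page as long as the marker survives
my $text = extract_price ('<div><p>Today only</p><span class="big price">42</span></div>');
print defined $text ? "$text\n" : "no price marker found\n";
```

    returning undef when the marker is missing lets the caller notice that the page layout changed, rather than carrying on with the wrong bit of text.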
    finally, if you end up producing html in the CGI, don't mix actual output with styling. write a stylesheet instead. producing a well formed xhtml document in the CGI without inline styles leaves open the option of consuming that output later from another CGI or whatever. it's also much easier to change the look of the output via a stylesheet than by digging in perl code.
    there are frameworks for doing even fancier scraping, where you end up running a browser engine server side, to pretend that your program is a browser. this is necessary when a website dynamically produces most of its output with javascript. and naturally, because javascript is browser/client side code, you won't see the results of that unless you run it. this is pretty horrid stuff. although you can do automatic login and traverse a website and its results...it typically breaks as soon as absolutely anything changes on the website. A good/helpful website, even if fancily rendered with javascript, should provide a RESTful api to get data out. But some companies still insist on not being very helpful.
    the hardest line to type correctly is: stty erase ^H
