HTML Parser suggestions

by spatterson (Pilgrim)
on Jan 11, 2013 at 21:20 UTC

spatterson has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks, I'm returning to Perl after a long absence and need to parse some autogenerated HTML in a tree-based fashion, searching for specific class attributes on tags.

As there are loads of HTML parsing modules, which ones do fellow monks suggest?

Replies are listed 'Best First'.
Re: HTML Parser suggestions
by moritz (Cardinal) on Jan 11, 2013 at 21:44 UTC
Re: HTML Parser suggestions
by tobyink (Canon) on Jan 11, 2013 at 21:27 UTC

    I'm biased, but I'll suggest HTML::HTML5::Parser. It uses the HTML5 parsing algorithm, so when faced with messy tag-soup HTML it should very closely match how most desktop browsers parse the same markup.

    Quick example:

    use 5.010;
    use strict;
    use warnings;
    use HTML::HTML5::Parser;
    use XML::LibXML::QuerySelector;

    my @elements = HTML::HTML5::Parser::
        -> load_html(location => "http://www.perlmonks.org/?node_id=1012980")
        -> querySelectorAll("title");

    say for @elements;
Re: HTML Parser suggestions
by LanX (Saint) on Jan 11, 2013 at 21:28 UTC
    Parsing HTML is a frequently asked topic, and I suppose it can't be answered without more details about your specific problem.

    Searching the monastery shows plenty of discussions; maybe you want to dig in and ask again?

    EDIT: A quick look seems to suggest that HTML::TreeBuilder is popular.

    Cheers Rolf
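
    For the tree-based, class-attribute search described in the question, a minimal sketch with HTML::TreeBuilder might look like the following. The sample markup and the class name "item" are made up for illustration; only new_from_content, look_down, as_text and delete come from the module itself.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; real input would come from a file or HTTP response.
my $html = <<'HTML';
<html><body>
<div class="item first">One</div>
<p class="other">Skip</p>
<div class="item">Two</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down accepts attribute => regex pairs; the pattern matches "item"
# as one whitespace-separated token of the class attribute.
my @items = $tree->look_down(
    _tag  => 'div',
    class => qr/(?:^|\s)item(?:\s|$)/,
);

my @texts = map { $_->as_text } @items;
print "$_\n" for @texts;

# Free the tree explicitly once finished with it.
$tree->delete;
```

    The regex guards against partial matches such as "items" or "menu-item", which a plain substring test would let through.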

Re: HTML Parser suggestions
by blue_cowdawg (Monsignor) on Jan 11, 2013 at 21:22 UTC
        As there are loads of HTML parsing modules, which ones do fellow monks suggest?

    I've used HTML::Parser a few times.
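
    Note that HTML::Parser is event-driven rather than tree-based, so a class search means inspecting each start tag as it goes by. A rough sketch, assuming a made-up class name "item" and sample markup:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample markup for illustration.
my $html = '<div class="item first">One</div><p class="other">Skip</p>'
         . '<span class="item">Two</span>';

my @hits;    # tag names whose class list contains "item"

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            my $class = $attr->{class};
            return unless defined $class;
            # Compare against each whitespace-separated class token.
            push @hits, $tagname if grep { $_ eq 'item' } split ' ', $class;
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @hits;
```

    This avoids building a tree in memory, but anything that depends on parent or sibling relationships has to be tracked by hand in the handlers.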


Re: HTML Parser suggestions
by Anonymous Monk on Jan 12, 2013 at 02:58 UTC
