http://www.perlmonks.org?node_id=1055207


in reply to Re^2: The State of Web spidering in Perl
in thread The State of Web spidering in Perl

I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like //div[@id='blah']/p though? Do you explicitly maintain state?

You don't -- you'd use HTML::Parser only if you wanted to reinvent HTML::Tree. It's like XML::Parser: you'd use it only if you wanted to reinvent XML::Twig. Since both Tree and Twig already exist and do a fantastic job, don't waste your time reinventing them :)
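
To make that concrete, here is a minimal sketch of matching //div[@id='blah']/p on top of HTML::Tree without tracking any state yourself. It assumes the CPAN module HTML::TreeBuilder::XPath is installed; the helper name paragraphs_in_div and the sample markup are made up for illustration and are not taken from any of the linked nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;    # CPAN module: XPath queries over an HTML::Tree

# Hypothetical helper (not from the thread): return the text of every <p>
# that is a direct child of the <div> with the given id.
sub paragraphs_in_div {
    my ($html, $id) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    # $id is interpolated as-is, so this sketch assumes it contains no quotes.
    my @texts = map { $_->as_text } $tree->findnodes(qq{//div[\@id="$id"]/p});

    $tree->delete;    # release the tree once we are done with it
    return @texts;
}

# Made-up sample document for illustration only.
my $html = <<'HTML';
<html><body>
<div id="other"><p>skip me</p></div>
<div id="blah"><p>first</p><p>second</p></div>
</body></html>
HTML

print "$_\n" for paragraphs_in_div($html, 'blah');
```

The parser builds the whole tree first and the XPath engine does the matching, which is the point: the "state" lives in the tree, not in your callbacks.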

And now my linkdump of examples, docs, and tutorials ... because XML::Parser is low-level, you should parse HTML/XML with xpath/twig/dom:
Re: How to grab a portion of file with regex (don't)
HTML Parser suggestions
See also the real discouragement Oh Yes You Can Use Regexes to Parse HTML! and the real encouragement Re^2: parsing XML fragments (xml log files) with... a regex
How do I match XML, HTML, or other nasty, ugly things with a regex?
How do I remove HTML from a string?
Re: Parsing webpages

See htmltreexpather.pl, Parsing HTML / Re^4: Parsing HTML, A regex question, NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day, Re: Extracting HTML content between the h tags, Re^2: Help With Online Table Scraper, Re^4: web::scraper using an xpath, ...

See also htmltreexpather.pl and xpather.pl

xpather.pl
Re: Get Node Value from irregular XML (xpather.pl)
Re: Having trouble with siblings
Re^2: XML parsing and Lists
Re: Counting number of child nodes based on element value (typos)
Re^3: Extracting specific childnodes (xpath whitespace)
Re^3: Extracting specific childnodes (play xmllint --shell )
Re: How do i get value of an element if the next elememnt has specific value in XML::LibXML using Xpath?
Re: How to parse xml with namespase vale in XMl:LibXML? ( XPath error : Undefined namespace prefix )
Re^2: How to parse xml with namespase vale in XMl:LibXML? (xmllint --shell setns / xpathtester)

There is a better way :)
