Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

by jeffa (Chancellor)
on Aug 23, 2003 at 13:35 UTC ( #286050=note )

in reply to HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

If the HTML you are parsing happens to be valid XML (they call that XHTML these days ;)) then you can use the uber module XML::Twig:

use strict;
use XML::Twig;

my $t = XML::Twig->new(
    twig_handlers => { table => \&handler },
    pretty_print  => 'indented',
);
$t->parse(\*DATA);

# called each time a <table> element has been fully parsed
sub handler {
    my ($t, $table) = @_;
    # flush (print) the twig parsed so far when the table matches
    $table->flush
        if  $table->att('border') == 0
        and $table->att('align') eq 'center';
}

__DATA__
<html xmlns="">
<head>
  <title>XML::Twig table extract test</title>
</head>
<body>
<table><tr><td>
  <table border="0" align="center">
    <tr align="center">
      <td colspan="3"><a href="/a.gif"><img src="/a.png" alt="a" /> </a></td>
    </tr>
  </table>
</td></tr></table>
</body>
</html>

However, the HTML you posted is not valid XHTML, so I ran it through HTML Tidy and embedded the result inside another table for testing purposes. You can always fetch the web page and call HTML Tidy externally, or you can install XML::LibXML and use the technique merlyn presents in "HTML tidy, using XML::LibXML" to clean up the HTML before you parse it.
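
For what it's worth, here is a rough sketch of the XML::LibXML route. This is not merlyn's code, just one way it might look: the extract_tables helper and the sample markup are made up for illustration, and it assumes a reasonably recent XML::LibXML (one that provides load_html). The HTML parser in recover mode repairs the tag soup, and the resulting DOM can be queried with XPath directly:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse possibly sloppy HTML and return each
# matching <table>, serialized from the repaired DOM.
sub extract_tables {
    my ($html) = @_;

    # recover => 2 asks the parser to repair what it can without
    # reporting the errors it recovers from.
    my $doc = XML::LibXML->load_html(
        string  => $html,
        recover => 2,
    );

    return map { $_->toString }
        $doc->findnodes('//table[@border="0"][@align="center"]');
}

# Made-up tag soup: unquoted attributes, unclosed <td>/<tr>, bare <br>.
my $html = <<'HTML';
<html><body>
<table border=1><tr><td>skip me</table>
<table border=0 align=center><tr><td>keep me<br></table>
</body></html>
HTML

print "$_\n" for extract_tables($html);
```

The serialized output is well-formed, so it could also be handed on to XML::Twig or XML::XPath if you prefer those interfaces.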

So, why use something like XML::Twig instead of an HTML::Parser? Because you are extracting a whole subset of the HTML rather than individual tags or text. Another good candidate module for this kind of work is XML::XPath. The XPath language was designed to "address parts of an XML document". Here is a quick example that uses XPath (and the same DATA filehandle):

use strict;
use XML::XPath;

my $xpath = XML::XPath->new(ioref => \*DATA);
my $nodes = $xpath->find('//table[@border=0][@align="center"]');
print XML::XPath::XMLParser::as_string($nodes->get_nodelist);

How's that for 5 lines of code? ;) You can find a good XPath tutorial at, by the way.
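
If the page holds more than one matching table, or you want to dig further into each match, a slightly longer sketch might look like the following. The sample document and the href extraction are made up for illustration; the calls themselves (findnodes with an optional context node, getAttribute, as_string) are standard XML::XPath. Note that the predicate compares against the string "0" here instead of the number 0:

```perl
use strict;
use warnings;
use XML::XPath;

# Made-up, well-formed sample: two matching tables and one that is not.
my $xml = <<'XML';
<html>
  <body>
    <table border="0" align="center"><tr><td><a href="/a.gif">a</a></td></tr></table>
    <table border="1"><tr><td><a href="/skip.gif">skip</a></td></tr></table>
    <table border="0" align="center"><tr><td><a href="/b.gif">b</a></td></tr></table>
  </body>
</html>
XML

my $xpath = XML::XPath->new(xml => $xml);

# In list context findnodes returns the matching nodes themselves.
my @tables = $xpath->findnodes('//table[@border="0"][@align="center"]');

# Serialize each match separately.
for my $table (@tables) {
    print XML::XPath::XMLParser::as_string($table), "\n";
}

# A relative path plus a context node searches inside one match only.
my @hrefs = map { $_->getAttribute('href') }
            map { $xpath->findnodes('.//a', $_) } @tables;
print "links: @hrefs\n";
```

Passing each table back in as the context node keeps the second query scoped to that table, which is handy when the same tag names appear elsewhere on the page.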

It is important to use the right tool for the job, and I think that XML::Twig and XML::XPath are better tools for this job than the HTML parsers.

Hope this helps :)


(the triplet paradiddle with high-hat)
