HTML::Parser fun

by FreakyGreenLeaky (Sexton)
on Jun 04, 2008 at 12:19 UTC
FreakyGreenLeaky has asked for the wisdom of the Perl Monks concerning the following question:

Apologies for the long post. I'm stumped, and our usual (paid) Perl freelance go-to guy can't get to it.

I'm using HTML::Parser to extract links and their associated text, and it works except for a few issues. It's these few nigglies I need help with.

The script currently outputs:

 0. [refresh] ................................. [web.asite.com/tmpgifs/zz]
 1. [www.gone.com] ............................ [linktext]
 2. [www.gone2.com] ........................... [linktext2]
 3. [www.gone3.com] ........................... [linktext3 ]
 4. [a] ....................................... [bold word]
 5. [www.linkfromimage.com1] .................. [text next to image]
 6. [href] .................................... [www.linkfromimage.com2]
 7. [destinations/western-cape/map.aspx] ...... [abc]
 8. [destinations/free-state/map.aspx] ........ [Free State Map]
 9. [destinations/free-state/map2.aspx] ....... [Free State Map2]
10. [a] ....................................... [email link text]
But I need:
 0. [refresh] ................................. [web.asite.com/tmpgifs/zz]
 1. [www.gone.com] ............................ [linktext]
 2. [www.gone2.com] ........................... [linktext2]
 3. [www.gone3.com] ........................... [linktext3 ]
 4. [www.gone3.com] ........................... [bold word]
 5. [www.linkfromimage.com1] .................. [text next to image]
 6. [www.linkfromimage.com2] .................. []
 7. [destinations/western-cape/map.aspx] ...... [abc]
 8. [destinations/free-state/map.aspx] ........ [Free State Map]
 9. [destinations/free-state/map2.aspx] ....... [Free State Map2]


i.e., #4 and #6 must change, and #10 must go bye-bye.

Any pointers on what I'm doing wrong would be appreciated.

#!/usr/bin/perl
#
# Usage:
#   get_links_test {file.html}
#
use strict;
use warnings;
use HTML::Parser;

#----------------------------------------
# support variables for get_links().
my @get_links_result;
my $get_links_txt_hdlr_text_maxsz;
my $get_links_txt_hdlr_text_tallysz;
my %get_links_txt_hdlr_text_inside;
my @get_links_txt_hdlr_tags;

sub strdef {
    my ($str) = @_;
    return '' if not defined $str;
    return $str;
}

#----------------------------------------
# HTML::Parser support routine for get_links().
# NB: Caller must check if inside tag.
sub get_links_txt_stashit {
    my ( $inside, $element, $txt ) = @_;
    $txt = get_text_cleanit($txt);
    if ( ( defined $txt and $txt ne '' ) and $txt =~ m/\S+/ ) {

        # if this is <A>, and previous element was <HREF>,
        # then pop it off the stack and change it...
        # I'm sure there's a better way to achieve this.
        if ( $element eq 'a' ) {
            my $trec = pop @get_links_result;
            if ( defined $trec->{href} ) {

                # override some detail.
                $element = clean_url( $trec->{href} );
            }
            else {
                # else push it back.
                push @get_links_result, $trec;
            }
        }
        my $rec = {};
        $rec->{$element} = $txt;
        push @get_links_result, $rec;
        $get_links_txt_hdlr_text_tallysz += length $txt;
    }
    $inside->{$element} = 0;
}

sub get_text_cleanit {
    my ($txt) = @_;
    $txt =~ s/(?:^[[:punct:]]+|\s+|\n|\r)/ /gxm;
    $txt =~ s/(?:^\s+|^[[:punct:]]+|\s+$|[[:punct:]]+$)//gxm;
    return $txt;
}

sub clean_url {
    my $url = shift;
    $url =~ s#^http[s]*://##gxmi;
    $url =~ s#^ftp://##gxmi;
    $url =~ s#^gopher://##gxmi;
    $url =~ s#^news://##gxmi;
    $url =~ s#^file://##gxmi;
    $url =~ s#/$##gxm;
    return $url;
}

#----------------------------------------
# HTML::Parser support routine for get_links().
sub get_links_txt_hdlr {
    my ( $self, $txt ) = @_;

    # todo: something about doing this in a loop bothers me...
    # somehow, we need to know what the current event html tag is being
    # processed, then we can do away with this loop... 'tagname' appears to
    # be for start/end handlers only.
    foreach my $tag (@get_links_txt_hdlr_tags) {

        # skip these since they're handled by get_links_start_tag_hdlr().
        next if $tag eq 'img' or $tag eq 'href' or $tag eq 'meta';
        get_links_txt_stashit( \%get_links_txt_hdlr_text_inside, $tag, $txt )
          if $get_links_txt_hdlr_text_inside{$tag};    # process if inside tag.

        # exit early if got enough text.
        if ( $get_links_txt_hdlr_text_maxsz
            and $get_links_txt_hdlr_text_tallysz >
            $get_links_txt_hdlr_text_maxsz )
        {
            $self->eof;
            last;
        }
    }
}

#----------------------------------------
# HTML::Parser support routine for get_links().
sub get_links_start_tag_hdlr {
    my ( $self, $tag, $attr ) = @_;
    $get_links_txt_hdlr_text_inside{$tag} = 1;
    if ( $tag eq 'a' and defined $attr->{href} ) {
        my $txt = get_text_cleanit( $attr->{href} );
        return if not defined $txt or $txt eq '';

        # skip links to email, javascript, etc. - this is not working...
        if ( $attr->{href} =~    # [:;] to cope with erroneous HTML
            m/mailto[:;]|^javascript[:;]|^file[:;]|^news[:;]/i )
        {
            return;
        }
        my $rec = {};
        $rec->{href} = $txt;
        push @get_links_result, $rec;
        $get_links_txt_hdlr_text_tallysz += length $txt;
    }
    if ( $tag eq 'area' and defined $attr->{href} ) {
        my $rec   = {};
        my $txt   = get_text_cleanit( $attr->{href} );
        my $title = get_text_cleanit( strdef( $attr->{title} ) );
        my $alt   = get_text_cleanit( strdef( $attr->{alt} ) );
        return if not defined $txt or $txt eq '';
        my $content = strdef($title);
        $content = $alt if $content eq '';
        $rec->{$txt} = $content;
        push @get_links_result, $rec;
        $get_links_txt_hdlr_text_tallysz += length $txt;
    }
    elsif ( $tag eq 'meta'
        and defined $attr->{'http-equiv'}
        and lc( $attr->{'http-equiv'} ) eq 'refresh'
        and defined $attr->{content} )
    {
        my $tempurl = $attr->{content};
        $tempurl =~ s#^\s*\d+\s*\;\s*url\s*\=\s*[']*(?:http[s]*://)*##gi;
        $tempurl =~ s/[']+$//;
        my $txt = clean_url($tempurl);
        return if not defined $txt or $txt eq '';
        my $rec = {};
        $rec->{refresh} = $txt;
        push @get_links_result, $rec;
        $get_links_txt_hdlr_text_tallysz += length $txt;
    }
}

#----------------------------------------
# HTML::Parser support routine for get_links().
sub get_links_end_tag_hdlr {
    my ( $tag, $text ) = @_;
    $get_links_txt_hdlr_text_inside{$tag} = 0;
}

# new get_links()
sub get_links_new {
    my ($fn) = @_;
    my $html_parser = HTML::Parser->new( api_version => 3 );
    @get_links_txt_hdlr_tags = split '\s+',
        'title h1 h2 h3 h4 h5 h6 a b em i '
      . 'u caption th li dt dd blockquote '
      . 'pre area img meta';
    undef %get_links_txt_hdlr_text_inside;
    @get_links_result = ();
    $html_parser->handler( start => \&get_links_start_tag_hdlr,
        "self, tagname, attr" );
    $html_parser->handler( end => \&get_links_end_tag_hdlr, "tagname, dtext" );
    $html_parser->handler( text => \&get_links_txt_hdlr, "self,dtext" );
    $html_parser->report_tags( split '\s+', 'a area meta' );
    $html_parser->ignore_elements( split '\s+', 'script style comments' );
    eval { $html_parser->parse_file($fn); };
    return @get_links_result;
}

#-----------------------------------------------------------------
die "Usage: $0 {file.html}" if not defined $ARGV[0];
my @res = get_links_new( $ARGV[0] );
my $n = 0;
for my $hash (@res) {
    for my $tag ( keys %$hash ) {
        printf( "%2d%s", $n++,
            ". [$tag] " . '.' x ( 40 - length($tag) ) . " [$hash->{$tag}]\n" );
    }
}

#############################################
HTML file:

<html>
<meta name='description' content='this is some meta content'>
<meta name='keywords' content='cars bikes sales call wheels engine fast'>
<meta HTTP-EQUIV="Refresh" CONTENT="300;URL='http://web.asite.com/tmpgifs/zz/'">
<title>This is a title</title>
<p align=left>this is some P aligned text</p>
sfgsdfg sssss sjkfhsfh
<div>this is some DIV wrapped text</div>
<table><tr><td>now we have some text in a td/table</td></tr></table>
this is bold text: <b>www.abc1.com</b>
you can find it here: http://www.abc2.com/
you can find it here: www.abc3.com
you can find it here: henry@bbbbabc3.com
you can find it here: www.abc4.com
you can find it here: www.abc5.comvv
www.abc.pl.net abc.com abc.pl.com/?a=1 123.com www.aaaa.info
<a href=http://www.gone.com>linktext</a>
<a href=http://www.gone2.com>linktext2</a>
<a href=http://www.gone3.com>linktext3 (<b>bold word</b>)</a>
mailto:bob@somwhere.com
<img src=/xxx.png alt='alt text for img'>
<a href=www.linkfromimage.com1>text next to image<img src=/xxx.png alt='alt text for img'></a>
<a href=www.linkfromimage.com2><img src=/xxx.png alt='alt text for img'></a>
<area title='abc' shape="rect" coords="121,380,172,400" href="/destinations/western-cape/map.aspx">
<area shape="rect" coords="262,214,301,241" href="/destinations/free-state/map.aspx" title="Free State Map">
<area shape="rect" coords="262,214,301,241" href="/destinations/free-state/map2.aspx" alt="Free State Map2">
<a href='mailto:deepheat@bbb.com'>email link text</a>
</html>

Re: HTML::Parser fun
by Corion (Pope) on Jun 04, 2008 at 12:38 UTC

    I haven't read your code closely, but I get the feeling that you're trying to use HTML::Parser to extract text below tags, especially the text contained in <a> tags together with the href attribute. HTML::Parser is a rather unwieldy tool for extracting text below tags, in my opinion - I prefer to use XPath expressions for such tasks. Depending on what tools you have available, you might want to use XML::LibXML or Web::Scraper or HTML::TreeBuilder::XPath (which is what Web::Scraper uses) for running the XPath expressions. I'd write the link+text extraction using Web::Scraper like this:

     use strict;
     use Web::Scraper;
     use Data::Dumper;

     my $html = join "", <DATA>;

     # Invoked for a <a> tag
     my $link = scraper {
         process '//a' => 'href' => '@href';
         process '//a' => 'description' => 'TEXT';
     };

     my $page = scraper {
         process '//a[@href]' => 'links[]' => $link;
         process '//meta[@http-equiv]' => 'meta[]' => '@content';
         process '//area[@href]' => 'areas[]' => '@href';
     };

     my $info = $page->scrape($html);
     print Dumper $info;

     __DATA__
     ... your html ...

    Update: After some quick browsing of the CPAN for XPath, I found XML::XPathEngine, by mirod, and quite unsurprisingly XML::Twig already understands XPath expressions. So it should be quite feasible to jury-rig Web::Scraper to use XML::Twig instead of HTML::TreeBuilder as the underlying parsing engine. Or you might just use XML::Twig directly.

      Thanks for the suggestions, will check it out. I'm using HTML::Parser for performance reasons. Everything else that I've tried is several orders of magnitude slower.

        Of course it's important to arrive at the wrong answer as fast as possible :). Most likely, the solutions are all slow because they load the HTML into the DOM, which is slow for large enough HTML files.

        On the other hand, I had to look at your output, because I couldn't follow your code well enough to tell what you want to extract and what not. Your code hides the extraction rules quite deep, while XPath expressions reduce the code mostly to the extraction rules plus some boilerplate. Maybe you can keep the speed and gain some expressiveness by using a stream-oriented parser like XML::Twig, which is meant for applying downward rules without loading the whole document.
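
        For a flavour of what that could look like, here is a rough XML::Twig sketch (a sketch only: it assumes a version of XML::Twig that offers parsefile_html for HTML input, and the handler body is illustrative, not code from this thread):

        use strict;
        use warnings;
        use XML::Twig;

        # fire a handler for each <a> element that carries an href attribute
        my $twig = XML::Twig->new(
            twig_handlers => {
                'a[@href]' => sub {
                    my ( $t, $a ) = @_;
                    printf "%s --> %s\n", $a->trimmed_text, $a->att('href');
                    $t->purge;    # discard processed twigs to keep memory flat
                },
            },
        );

        $twig->parsefile_html( $ARGV[0] );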

        I have no benchmarks, but I would logically expect XML::LibXML to be as fast as or faster than HTML::Parser. They're both C-based, and libxml2 is more mature, with more eyeballs involved. The only issue I see is that while it can parse some broken HTML, it's not as flexible in that regard as HTML::Parser.

Re: HTML::Parser fun
by Fletch (Chancellor) on Jun 04, 2008 at 12:41 UTC

    Off hand:

    • Not posting the code and sample data here rather than off site
    • Not showing what you yourself have tried so far in resolving the problem
    • Mistakenly presuming that someone here is going to do your work for you for free [1]

    Update: Fwee, the wall of text was inlined.

    [1] Unfortunately someone probably will, which will just prove that bad behavior gets results here, fomenting more freeloading. But some people never learn.


      Update: Fwee, the wall of text was inlined

      Which was quickly corrected when I realised my error, being new here.

      Not posting the code and sample data here rather than off site

      I was unsure how to go about it. Not wanting to offend with a long code-post, I thought it best to post it on my scratch-pad.

      Not showing what you yourself have tried so far in resolving the problem

      Jeez, who peed in your tea? What else could I have posted to show what I've tried? Did you even bother to look at what was posted before slithering over to comment?

      Mistakenly presuming that someone here is going to do your work for you for free

      That's a lot to assume about someone you don't know. My first post was in the CB, where I mentioned that I was prepared to pay for assistance. Someone then suggested I post it here first.

      [1] Unfortunately someone probably will, which will just prove that bad behavior gets results here, fomenting more freeloading. But some people never learn.

      What a nice introduction to PerlMonks. You've made my day. I'm going to cling to the belief that your sour and cynical personality type is not too common here...
Re: HTML::Parser fun
by wfsp (Abbot) on Jun 04, 2008 at 13:30 UTC
    Here's my stab at the parsing using HTML::TreeBuilder. You should be able to hook out what you need.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use Data::Dumper;
    $Data::Dumper::Indent = 1;

    my $file_name = q{monk.html};
    my $t = HTML::TreeBuilder->new;
    $t->parse_file($file_name);

    my @meta = $t->look_down( q{_tag}, q{meta}, );
    print q{-} x 10, qq{\n};
    print qq{meta\n};
    print q{-} x 10, qq{\n};
    for my $ele (@meta){
        my %attr = $ele->all_external_attr;
        print Dumper \%attr;
        print q{-} x 10, qq{\n};
    }

    my @links = $t->look_down( q{_tag}, q{a}, );
    print q{-} x 10, qq{\n};
    print qq{links\n};
    print q{-} x 10, qq{\n};
    for my $ele (@links){
        print $ele->as_trimmed_text, qq{\n};
        my %attr = $ele->all_external_attr;
        print Dumper \%attr;
        print q{-} x 10, qq{\n};
    }

    my @areas = $t->look_down( q{_tag}, q{area}, );
    print q{-} x 10, qq{\n};
    print qq{areas\n};
    print q{-} x 10, qq{\n};
    for my $ele (@areas){
        my %attr = $ele->all_external_attr;
        print Dumper \%attr;
        print q{-} x 10, qq{\n};
    }
    output
    ----------
    meta
    ----------
    $VAR1 = {
      'content' => 'this is some meta content',
      'name' => 'description'
    };
    ----------
    $VAR1 = {
      'content' => 'cars bikes sales call wheels engine fast',
      'name' => 'keywords'
    };
    ----------
    $VAR1 = {
      'http-equiv' => 'Refresh',
      'content' => '300;URL=\'http://web.asite.com/tmpgifs/zz/\''
    };
    ----------
    ----------
    links
    ----------
    linktext
    $VAR1 = {
      'href' => 'http://www.gone.com'
    };
    ----------
    linktext2
    $VAR1 = {
      'href' => 'http://www.gone2.com'
    };
    ----------
    linktext3 (bold word)
    $VAR1 = {
      'href' => 'http://www.gone3.com'
    };
    ----------
    text next to image
    $VAR1 = {
      'href' => 'www.linkfromimage.com1'
    };
    ----------
    $VAR1 = {
      'href' => 'www.linkfromimage.com2'
    };
    ----------
    email link text
    $VAR1 = {
      'href' => 'mailto:deepheat@bbb.com'
    };
    ----------
    ----------
    areas
    ----------
    $VAR1 = {
      'href' => '/destinations/western-cape/map.aspx',
      'coords' => '121,380,172,400',
      'shape' => 'rect',
      'title' => 'abc'
    };
    ----------
    $VAR1 = {
      'href' => '/destinations/free-state/map.aspx',
      'coords' => '262,214,301,241',
      'title' => 'Free State Map',
      'shape' => 'rect'
    };
    ----------
    $VAR1 = {
      'alt' => 'Free State Map2',
      'href' => '/destinations/free-state/map2.aspx',
      'coords' => '262,214,301,241',
      'shape' => 'rect'
    };
    ----------
      Thanks for the info. I seem to recall testing HTML::TreeBuilder and finding it lagging behind HTML::Parser in terms of performance (HTML::TokeParser::Simple was the worst performer, but the easiest to use).

      Our problem is that such a performance penalty really hurts when we're processing hundreds of millions of files...

      Hence the choice of HTML::Parser. Now that I've had a taste of its performance benefits, I'm loath to let go.
Re: HTML::Parser fun
by toolic (Chancellor) on Jun 04, 2008 at 14:03 UTC
    This is completely unrelated to your problem, but I thought I'd offer a style point. Your splits could be replaced by the simpler qw. For example, this:
    $html_parser->report_tags( split '\s+', 'a area meta' );

    could also be coded as:

    $html_parser->report_tags( qw(a area meta) );
      Cool! I should remember that. My experience with perl is somewhat lacking.

      Speaking of which, what's up with the propensity of perl gurus to use qq{} and q{} so much when plain old '' or "" would do?
        ...what's up with the propensity of perl gurus to use qq{} and q{} so much when plain old '' or "" would do?

        I think it's mainly a scaling thing – unless your list contains only one or two items, qq{} will save your fingers a bunch of trips to the [Shift] and ['] keys. Sorry, misread the question in my haste. Please disregard.

        I can't speak for others, but here are a couple of the benefits I see.

        • Less escaping necessary. q{brace {fun}} does what I want. I can say q{"quote" 'fun'} too. You do need to escape braces in a q{} if they're unbalanced, but otherwise it just works.
        • It stands out a little more for small expressions (q{} vs '').
        • Even if I don't have to escape anything now, I might have to later. If I code with q{} today, it saves me having to change to it later.
        • Likewise, it's easier to go from q{} to qq{} than it is to go from '' to "" if my interpolation behavior needs to change. (A small example follows.)
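
        For instance (plain core Perl, nothing assumed):

        my $s = q{say "hello" to 'everyone'};    # no quote escaping needed
        my $t = qq{greeting: $s};                # q -> qq is a one-character change
        print "$t\n";
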
      I just recalled: the split is a leftover from the original code, which reads something like:
      # this is actually read in from a conf file
      $conf{tags} = 'a area meta';
      with subsequent use as:
      $html_parser->report_tags( split '\s+', $conf{tags});
      qw// doesn't allow variable interpolation; is there an alternative syntax to using split to get the list?
        $conf{tags} = [ qw'a area meta' ];
        ...report_tags( @{ $conf{tags} } );
Re: HTML::Parser fun
by Your Mother (Canon) on Jun 04, 2008 at 19:26 UTC

    Just to show you how darn easy, accurate, deep, and fast some of this is with XML::LibXML...

     #!/usr/bin/perl
     use strict;
     use warnings;
     use XML::LibXML;

     # This is a shortcut, see the docs for more formal usage.
     my $doc = XML::LibXML->new->parse_html_fh(*DATA);

     my $root = $doc->getDocumentElement;
     my ( $head ) = $root->findnodes("head");
     my ( $body ) = $root->findnodes("body");

     print "Head stuff...\n";
     for my $refresh ( $head->findnodes('meta[@http-equiv]') ) {
         print "\t", $refresh->getAttribute("content"), "\n";
     }

     print "\nBody stuff...\n";
     for my $link ( $body->findnodes('a[@href]') ) {
         printf("%25s --> %s\n",
                $link->textContent || $link->getAttribute("title") || "n/a",
                $link->getAttribute("href") );
     }

     # print $doc->serialize(1);

     __DATA__
     PUT YOUR HTML DOCUMENT DOWN HERE. Took it out for space.

    Reproducing the same output/report format you want is left as an exercise for the reader. :) The docs for the family of modules are terse but quite good once you see the big picture. There are options to allow more liberal/broken HTML to be parsed (or attempted, anyway).
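
    As a starting point for that exercise, something like this (a sketch reusing the $body node from the script above; my guess at the format, not a tested solution) would come close:

     # mimic the OP's dotted report for the <a href> links
     my $n = 0;
     for my $link ( $body->findnodes('a[@href]') ) {
         my $url  = $link->getAttribute("href");
         my $text = $link->textContent;
         printf "%2d. [%s] %s [%s]\n",
             $n++, $url, '.' x ( 40 - length $url ), $text;
     }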

      Thanks! I'll give that a try to see how it stacks up against HTML::Parser when crunching MBs of test HTML.
      Thanks for the info, Your Mother

      I've been testing XML::LibXML with various HTML files (our corpus comes in various sizes) to get some benchmarks, and I must say it's surprisingly quick (except for really large files, which aren't really relevant in my case). However:
      • This is a deal-killer: the HTML must be balanced, with nice </x> closing tags (which it often isn't in the real world), else it croaks without producing any output (HTML::Parser tolerates this kind of thing).
      HTML::Parser soldiers on despite missing tags, etc, and still produces useful output (required in our app).

      Some (unscientific) benchmarks:

      104KB HTML file processed 100 times (average of 3 runs)
      HTML::Parser: ~20s
      XML::LibXML: ~13s

      371KB HTML file processed 100 times
      HTML::Parser: ~51s
      XML::LibXML: ~30s

      550KB HTML file processed 100 times
      HTML::Parser: ~73s
      XML::LibXML: ~49s

      4.3MB HTML file processed once (silly, but interesting in a huh? kind of way)
      HTML::Parser: ~4s
      XML::LibXML: ~85s
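
      A harness along these lines (a sketch using the core Benchmark module; the per-iteration bodies are hypothetical stand-ins for the real extraction code, not the actual test script) can produce numbers of this shape:

      use strict;
      use warnings;
      use Benchmark qw(timethese);
      use HTML::Parser;
      use XML::LibXML;

      my $file = shift or die "Usage: $0 file.html\n";

      # slurp the file once so both parsers see identical input
      open my $fh, '<', $file or die "open $file: $!";
      my $html = do { local $/; <$fh> };

      timethese( 100, {
          'HTML::Parser' => sub {
              my $count = 0;
              my $p = HTML::Parser->new(
                  api_version => 3,
                  start_h     => [ sub { $count++ }, 'tagname' ],
              );
              $p->parse($html);
              $p->eof;
          },
          'XML::LibXML' => sub {
              my $parser = XML::LibXML->new();
              $parser->recover_silently(1);
              my $doc = $parser->parse_html_string($html);
          },
      } );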

      Conclusion: it looks like XML::LibXML is the way to go. My only concern (the one thing preventing me from switching over to XML::LibXML) is how to get it to be as tolerant of lazy/broken HTML as HTML::Parser is.

      I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test how tolerant it is).
        I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test it and see how tolerant it is).

        You can't. At least not in Perl. XML::LibXML uses libxml2, which does the XML (and HTML) parsing; that's what you would need to change.

        For the record, when I wanted to add HTML parsing to XML::Twig, I looked at HTML::Parser, XML::LibXML and tidy, and settled on HTML::Parser as the most robust and easy to use solution to get well-formed XML out of random HTML.

        Sorry I didn't include it in the first round. I had to look it up in the parser docs, under the HTML options in XML::LibXML::Parser. There are other options, but recover is probably what you need (recover_silently does the same without any warnings to STDERR). It's an argument to new or a method.

        # file named 'libxml-html-forgiving'
        use warnings;
        use strict;
        use XML::LibXML;

        my $corpus = join "", <DATA>;
        my $parser = XML::LibXML->new();

        # give command line an argument to hide errors
        @ARGV ? $parser->recover_silently(1) : $parser->recover(1);

        my $doc = $parser->parse_html_string($corpus);

        print "-" x 60, "\n";
        print "parse_html rendered with serialize_html\n";
        print "-" x 60, "\n";
        print $doc->serialize_html();

        print "-" x 60, "\n";
        print "parse rendered with serialize_html\n";
        print "-" x 60, "\n";
        my $doc2 = $parser->parse_string($corpus);
        print $doc2->serialize_html();

        __END__
        <p>
        Some HTML & a <b>problem with it > normal but deadly;
        <p>

        Then run with an arg to suppress errors (they go to STDERR, so they don't interfere with real output either way):

        moo@cow[48]~/bin>perl libxml-html-forgiving 1
        ------------------------------------------------------------
        parse_html rendered with serialize_html
        ------------------------------------------------------------
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
        <html><body><p>
        Some HTML &amp; a <b>problem with it &gt; normal but deadly;
        <p></p></b></p></body></html>
        ------------------------------------------------------------
        parse rendered with serialize_html
        ------------------------------------------------------------
        <p>
        Some HTML a problem with it &gt; normal but deadly;
        </p>

        Or without an arg to see all the feedback:

        moo@cow[49]~/bin>perl libxml-html-forgiving
        HTML parser error : htmlParseEntityRef: no name
        Some HTML & a <b>problem with it > normal but deadly;
                                                             ^
        ------------------------------------------------------------
        parse_html rendered with serialize_html
        ------------------------------------------------------------
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
        <html><body><p>
        Some HTML &amp; a <b>problem with it &gt; normal but deadly;
        <p></p></b></p></body></html>
        ------------------------------------------------------------
        parse rendered with serialize_html
        ------------------------------------------------------------
        :2: parser error : xmlParseEntityRef: no name
        Some HTML & a <b>problem with it > normal but deadly;
                      ^
        :4: parser error : Premature end of data in tag p line 3
        ^
        :4: parser error : Premature end of data in tag b line 2
        ^
        :4: parser error : Premature end of data in tag p line 1
        ^
        <p>
        Some HTML a problem with it &gt; normal but deadly;
        </p>
