
Re^2: HTML::Parser fun

by FreakyGreenLeaky (Sexton)
on Jun 05, 2008 at 14:54 UTC ( [id://690453] )


in reply to Re: HTML::Parser fun
in thread HTML::Parser fun

Thanks for the info, Your Mother

I've been benchmarking XML::LibXML against HTML files of various sizes from our corpus, and I must say it's surprisingly quick (except on really large files, which aren't relevant in my case). However:
  • this is a deal-killer: the HTML must be balanced, with proper </x> closing tags (which real-world HTML often lacks), or else it croaks without producing any output.
HTML::Parser, by contrast, soldiers on despite missing tags and the like, and still produces useful output, which our app requires.

Some (unscientific) benchmarks:

104KB HTML file processed 100 times (average of 3 runs)
HTML::Parser: ~20s
XML::LibXML: ~13s

371KB HTML file processed 100 times
HTML::Parser: ~51s
XML::LibXML: ~30s

550KB HTML file processed 100 times
HTML::Parser: ~73s
XML::LibXML: ~49s

4.3MB HTML file processed once (silly, but interesting in a huh? kind of way)
HTML::Parser: ~4s
XML::LibXML: ~85s
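
Timings like these can be reproduced with a small harness. What follows is only a minimal sketch, not the script used for the numbers above: the time_runs helper, the built-in sample string and the repeat count of 100 are assumptions for illustration. It uses the core Time::HiRes module and skips whichever parser is not installed. The XML::LibXML call is wrapped in eval because, as noted above, it can croak on unbalanced markup.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run $code $count times, return elapsed wall-clock seconds.
sub time_runs {
    my ($count, $code) = @_;
    my $start = time;
    $code->() for 1 .. $count;
    return time - $start;
}

# Input: the file named on the command line, else a tiny built-in sample.
my $html = "<html><body><p>sample <b>text</b></p></body></html>";
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;    # slurp the whole file
    $html = <$fh>;
}

# Register only the parsers that are actually installed.
my %parsers;
if (eval { require HTML::Parser; 1 }) {
    $parsers{'HTML::Parser'} = sub {
        my $p = HTML::Parser->new(api_version => 3);   # no handlers: parse only
        $p->parse($html);
        $p->eof;
    };
}
if (eval { require XML::LibXML; 1 }) {
    $parsers{'XML::LibXML'} = sub {
        # eval: without recovery options the parse may die on broken markup
        eval { XML::LibXML->new->parse_html_string($html) };
    };
}

for my $name (sort keys %parsers) {
    printf "%-14s %.3fs for 100 parses\n", $name, time_runs(100, $parsers{$name});
}
```

A parser with no handlers does less work than a real application would, so absolute numbers from a sketch like this are only useful for rough comparison.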

Conclusion: for the file sizes that matter to us, XML::LibXML looks like the way to go. The only thing keeping me from switching is that I don't know how to make it tolerate lazy/broken HTML the way HTML::Parser does.

I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test how tolerant it is).

Replies are listed 'Best First'.
Re^3: HTML::Parser fun
by Your Mother (Archbishop) on Jun 05, 2008 at 16:15 UTC

    Sorry I didn't include it in the first round. I had to look it up in the parser docs (XML::LibXML::Parser), under the HTML options. There are other options, but recover is probably what you need (recover_silently does the same without sending any warnings to STDERR). It can be passed as an argument to new or set with a method call.

    # file named 'libxml-html-forgiving'
    use warnings;
    use strict;
    use XML::LibXML;

    my $corpus = join "", <DATA>;

    my $parser = XML::LibXML->new();

    # give command line an argument to hide errors
    @ARGV ? $parser->recover_silently(1) : $parser->recover(1);

    my $doc = $parser->parse_html_string($corpus);

    print "-" x 60, "\n";
    print "parse_html rendered with serialize_html\n";
    print "-" x 60, "\n";
    print $doc->serialize_html();

    print "-" x 60, "\n";
    print "parse rendered with serialize_html\n";
    print "-" x 60, "\n";
    my $doc2 = $parser->parse_string($corpus);
    print $doc2->serialize_html();

    __END__
    <p>
    Some HTML & a <b>problem with it > normal but deadly;
    <p>

    Then run with an arg to suppress errors (which are going to STDERR so they don't interfere with real output either way)-

    moo@cow[48]~/bin>perl libxml-html-forgiving 1
    ------------------------------------------------------------
    parse_html rendered with serialize_html
    ------------------------------------------------------------
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><p>
    Some HTML &amp; a <b>problem with it &gt; normal but deadly;
    <p></p></b></p></body></html>
    ------------------------------------------------------------
    parse rendered with serialize_html
    ------------------------------------------------------------
    <p>
    Some HTML a problem with it &gt; normal but deadly;
    </p>

    Or without an arg to see all the feedback-

    moo@cow[49]~/bin>perl libxml-html-forgiving
    HTML parser error : htmlParseEntityRef: no name
    Some HTML & a <b>problem with it > normal but deadly;
               ^
    ------------------------------------------------------------
    parse_html rendered with serialize_html
    ------------------------------------------------------------
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><p>
    Some HTML &amp; a <b>problem with it &gt; normal but deadly;
    <p></p></b></p></body></html>
    ------------------------------------------------------------
    parse rendered with serialize_html
    ------------------------------------------------------------
    :2: parser error : xmlParseEntityRef: no name
    Some HTML & a <b>problem with it > normal but deadly;
               ^
    :4: parser error : Premature end of data in tag p line 3
    ^
    :4: parser error : Premature end of data in tag b line 2
    ^
    :4: parser error : Premature end of data in tag p line 1
    ^
    <p>
    Some HTML a problem with it &gt; normal but deadly;
    </p>
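
The script above sets recovery with a method call. Since the option can also be given as an argument to new, the same thing can be sketched in constructor form. This is a minimal sketch under assumptions: it needs XML::LibXML 1.70 or later, where new accepts option pairs and the load_html shortcut exists, and it uses the value 2 for recover, which the parser docs describe as recovering silently. The sample string is the one from the __END__ section above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Same broken sample as in the __END__ section of the script above.
my $html = "<p>\nSome HTML & a <b>problem with it > normal but deadly;\n<p>\n";

# Constructor form (assumes XML::LibXML >= 1.70):
# recover => 1 recovers and warns, recover => 2 recovers silently.
my $parser = XML::LibXML->new(recover => 2);
my $doc    = $parser->parse_html_string($html);

# One-shot form: parser options are passed alongside the input.
my $doc2 = XML::LibXML->load_html(string => $html, recover => 2);

# toStringHTML is the same serializer the script calls as serialize_html.
print $doc->toStringHTML;
```

Whether the repaired tree is what you want still has to be checked against your own corpus; recovery only guarantees that you get a document back, not that the structure matches the author's intent.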
      Thank you very much, Your Mother, I must have glossed over that bit in the docs with bleary eyes glassified by bashing my head against the trees of the forest... or something like that.

      I'm going to play around with this over the weekend to get comfy with the idea and if all goes well, it looks like I'll be retooling with XML::LibXML.

      Even though I didn't get my original question about HTML::Parser answered, it looks like I've learnt something new and better!

        You're most welcome. I don't know if XML::LibXML's a cure-all, but it's all I've been using for a couple of years for parsing (X)HTML when I don't need a stream (which is most of the time; otherwise I like HTML::TokeParser). It'll even validate documents against DTDs. And as a side effect of picking it up, you'll find you learn other useful stuff like XPath and JS/DOM hacking. Mine improved considerably through learning it.

Re^3: HTML::Parser fun
by mirod (Canon) on Jun 05, 2008 at 15:05 UTC
    I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test it and see how tolerant it is).

    You can't. At least not in Perl. XML::LibXML uses libxml2, which does the XML, and HTML, parsing. That's what you would need to change.

    For the record, when I wanted to add HTML parsing to XML::Twig, I looked at HTML::Parser, XML::LibXML and tidy, and settled on HTML::Parser as the most robust and easy to use solution to get well-formed XML out of random HTML.

      Yes, creamygoodness put me onto HTML::Parser some time ago, and I'm finding it hard to look back.

      I then wonder why Your Mother suggested "There are options to allow more liberal/broken HTML to be parsed (or attempted anyway)."?

      I wonder what options he/she was referring to?

      Any idea?
