PerlMonks  

Re^2: HTML::Parser fun

by FreakyGreenLeaky (Sexton)
on Jun 04, 2008 at 12:47 UTC (#690127)


in reply to Re: HTML::Parser fun
in thread HTML::Parser fun

Thanks for the suggestions, will check it out. I'm using HTML::Parser for performance reasons. Everything else that I've tried is several orders of magnitude slower.


Re^3: HTML::Parser fun
by Corion (Pope) on Jun 04, 2008 at 12:55 UTC

    Of course it's important to arrive at the wrong answer as fast as possible :). Most likely, the solutions are all slow because they load the HTML into the DOM, which is slow for large enough HTML files.

    On the other hand, I had to look at your output, because I couldn't tell from your code what you want to extract and what you don't. Your code hides the extraction rules quite deep, while the XPath expressions reduce the code mostly to the extraction rules and some boilerplate. Maybe you can keep the speed and gain some expressiveness by using a stream-oriented parser like XML::Twig, which is meant for applying rules to elements as they are parsed, without loading the whole document.
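
    A minimal sketch of the XML::Twig idea, not taken from the thread: the input string, the class="result" rule and the @links array are all hypothetical, and the input is assumed to be well-formed XHTML, since XML::Twig sits on top of an XML parser and will not accept tag soup. The point is only that the extraction rule sits in the handler key, and that purge lets already-handled parts of the document be freed.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: well-formed XHTML (assumption, see above).
my $xhtml = <<'XHTML';
<html><body>
  <p><a class="result" href="/one">First</a></p>
  <p><a class="nav" href="/home">Home</a></p>
  <p><a class="result" href="/two">Second</a></p>
</body></html>
XHTML

my @links;

my $twig = XML::Twig->new(
    twig_handlers => {
        # The extraction rule lives in the handler key, in one place.
        'a[@class="result"]' => sub {
            my ($t, $elt) = @_;
            # Copy out the strings we need before anything is freed.
            push @links, [ $elt->att('href'), $elt->text ];
            # Release the parts of the document parsed so far.
            $t->purge;
        },
    },
);
$twig->parse($xhtml);

print join("\t", @$_), "\n" for @links;
```

    For input on disk, parsefile could be used in place of parse; whether this approach keeps up with HTML::Parser on hundreds of millions of documents would have to be benchmarked.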

      Hmm, XML::Twig looks interesting, thanks!

      HTML::Parser is probably overkill for this simple task. I use it elsewhere to extract all HTML tags and their content, etc., and there its performance is excellent (we're processing hundreds of millions of HTML docs, hence my need for speed).
Re^3: HTML::Parser fun
by Your Mother (Canon) on Jun 04, 2008 at 18:58 UTC

    I have no benchmarks but would logically expect XML::LibXML to be as fast as or faster than HTML::Parser. They're both C and libxml is more mature with more eyeballs involved. The only issue I see is that while it can parse some broken HTML, it's not as flexible in that regard as HTML::Parser.
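
    For what it's worth, a small sketch of XML::LibXML coping with broken HTML. The input string is made up for illustration (unclosed tags, no html/body wrapper), and this says nothing about speed. It relies on the recover option of load_html, which tells libxml2 to carry on after parse errors.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical broken input: unclosed <p>, <b> and <a> tags.
my $html = '<p>Intro <b>bold<p><a href="/one">First</a><p><a href="/two">Second';

my $doc = XML::LibXML->load_html(
    string            => $html,
    recover           => 1,   # keep parsing after errors
    suppress_errors   => 1,   # do not report libxml2's complaints
    suppress_warnings => 1,
);

# The extraction rule is a single XPath expression.
my @hrefs = map { $_->value } $doc->findnodes('//a/@href');

print "$_\n" for @hrefs;
```

    How badly broken the markup can get before the recovered tree stops being useful is something to test against real documents.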
