Re^2: HTML::Parser fun
by FreakyGreenLeaky (Sexton)
on Jun 05, 2008 at 14:54 UTC
I've been testing XML::LibXML against HTML files of various sizes from our corpus to get some benchmarks, and I must say it's surprisingly quick (except on really large files, which aren't relevant in my case), with one caveat that I come to in the conclusion.
Some (unscientific) benchmarks:
104KB HTML file processed 100 times (average of 3 runs)
371KB HTML file processed 100 times
550KB HTML file processed 100 times
4.3MB HTML file processed once (silly, but interesting in a huh? kind of way)
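For anyone wanting to repeat this sort of timing, here is a minimal sketch using the core Benchmark module. It is not the script behind the figures in this post: the synthetic `$html` string is an assumed stand-in for a real corpus file, and the run count simply mirrors the "processed 100 times" setup above.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);
use XML::LibXML;

# Synthetic stand-in for a corpus file (an assumption; the runs above
# used real HTML files of 104KB to 4.3MB).
my $html = '<html><body>'
         . ( '<p>some <b>text</b></p>' x 5000 )
         . '</body></html>';

my $parser = XML::LibXML->new;
$parser->recover_silently(1);    # do not die on sloppy markup

# Time 100 complete parses of the same string.
my $runs = 100;
my $t = timeit( $runs, sub { $parser->parse_html_string($html) } );

printf "%d KB x %d parses: %s\n", length($html) / 1024, $runs, timestr($t);
```

Swapping the synthetic string for the contents of a real file, and averaging a few runs, would bring this closer to the setup described above.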
Conclusion: it looks like XML::LibXML is the way to go. My only concern (the reason preventing me from switching over to XML::LibXML) is how to get it to be tolerant of lazy/broken HTML the way HTML::Parser is.
I've had a gander at XML::LibXML but cannot see how to make it tolerant of real-world HTML (so that I can test just how tolerant it is).
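One thing that may be worth testing is the parser's recovery mode combined with its HTML parsing methods. The sketch below assumes that `recover_silently` and `parse_html_string` behave as documented for XML::LibXML::Parser. The helper name `parse_tolerant` and the sample markup are made up for illustration, and how closely the recovered tree matches what HTML::Parser would report is something only a test against the real corpus can show.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the original post): parse HTML in
# libxml2's recovering mode instead of dying on the first error.
sub parse_tolerant {
    my ($html) = @_;
    my $parser = XML::LibXML->new;
    $parser->recover_silently(1);    # recover from errors, suppress warnings
    return $parser->parse_html_string($html);
}

# Deliberately sloppy markup: unclosed <p> and <b>, an unquoted
# attribute value, and a stray closing tag.
my $broken = '<p>first <b>bold<p>second <a href=foo.html>link</a></div>';

my $doc = parse_tolerant($broken);

# Collect the href of every link that survived recovery.
my @hrefs = map { $_->getAttribute('href') } $doc->findnodes('//a[@href]');
print "$_\n" for @hrefs;
```

Using `$parser->recover(1)` instead keeps the recovery but still prints the parser's complaints, which could be handy for seeing which files in the corpus are the messy ones.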