PerlMonks
Thanks for the info, Your Mother
I've been testing XML::LibXML against HTML files of various sizes from our corpus to get some benchmarks, and I must say it's surprisingly quick (except on really large files, which isn't relevant in my case).
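For anyone who wants to repeat this kind of comparison, here is a minimal sketch using the core Benchmark module. It is not the exact script behind my numbers: the generated sample document, the two helper subs (count_with_html_parser, count_with_libxml) and the tag-counting task are all placeholders, and the input is deliberately well-formed so it says nothing about broken HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese);
use HTML::Parser ();
use XML::LibXML ();

# Placeholder input: a small, well-formed document repeated to bulk it up.
# Swap in a real file from your own corpus for meaningful numbers.
my $html = '<html><head><title>t</title></head><body>'
         . ( '<p>Hello <a href="/x">world</a></p>' x 500 )
         . '</body></html>';

# Hypothetical helper: count start tags with HTML::Parser's event API.
sub count_with_html_parser {
    my ($content) = @_;
    my $count = 0;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { $count++ }, 'tagname' ],
    );
    $p->parse($content);
    $p->eof;
    return $count;
}

# Hypothetical helper: count elements with XML::LibXML's HTML parser
# and a single XPath query over the resulting DOM.
sub count_with_libxml {
    my ($content) = @_;
    my $doc   = XML::LibXML->load_html( string => $content );
    my @nodes = $doc->findnodes('//*');
    return scalar @nodes;
}

# Process the same document 100 times with each parser and print timings.
timethese( 100, {
    'HTML::Parser' => sub { count_with_html_parser($html) },
    'XML::LibXML'  => sub { count_with_libxml($html) },
} );
```

Note that the two approaches do different amounts of work (event callbacks versus building a full DOM), so a harness like this only gives a rough feel, not a fair apples-to-apples figure.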
Some (unscientific) benchmarks:

104KB HTML file processed 100 times (average of 3 runs)
    HTML::Parser: ~20s
    XML::LibXML:  ~13s

371KB HTML file processed 100 times
    HTML::Parser: ~51s
    XML::LibXML:  ~30s

550KB HTML file processed 100 times
    HTML::Parser: ~73s
    XML::LibXML:  ~49s

4.3MB HTML file processed once (silly, but interesting in a huh? kind of way)
    HTML::Parser: ~4s
    XML::LibXML:  ~85s

Conclusion: it looks like XML::LibXML is the way to go.

My only concern (the reason preventing me from switching over to XML::LibXML) is how to get it to be tolerant of lazy/broken HTML the way HTML::Parser is. I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test how tolerant it is).

In reply to Re^2: HTML::Parser fun
by FreakyGreenLeaky