Re: how to quickly parse 50000 html documents?
by afoken (Parson) on Nov 26, 2010 at 10:48 UTC
Well, I'm tempted to answer "start by parsing one file, repeat that for the remaining 49,999 files".
No, really. Start with one HTML file, write readable code, DON'T optimize AT ALL. Use whatever seems reasonable. Don't slurp files yourself if the parsing module has a function to read from a file. Check whether your code works with a second HTML file, and a third. Fix bugs. Still, DON'T optimize.
svn commit (you may also use CVS, git, whatever. But make sure you can get back old versions of your code.)
Now, install Devel::NYTProf, and run perl -d:NYTProf yourscript.pl file1.html followed by nytprofhtml. Open nytprof/index.html and find out which code takes the most time to run. Look at everything with a red background. Optimize that code, and only that code.
Repeat until you find no more code to optimize.
Repeat with several other HTML files.
Be prepared to find modules (from CPAN) that are far from being optimized for speed. Try to switch to a different module if your script spends most of the time in a third-party module. Run NYTProf again after switching. Compare total time used before and after switching. Use whatever is faster. (For example, I learned during profiling that XML::LibXML was more than 10 times faster than XML::Twig with my problem and my data.)
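For a quick before/after number when switching modules, something like the following might do. It is only a rough sketch: it measures wall-clock time with Time::HiRes instead of reading the totals from NYTProf, and time_candidates and the candidate names are made up for illustration. Each candidate is a code reference that processes one file.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run every candidate over the same list of files
# and return a hash of candidate name => elapsed wall-clock seconds.
sub time_candidates {
    my ($candidates, @files) = @_;
    my %elapsed;
    for my $name (sort keys %$candidates) {
        my $start = time();
        # Same input for every candidate, so the numbers are comparable.
        $candidates->{$name}->($_) for @files;
        $elapsed{$name} = time() - $start;
    }
    return %elapsed;
}
```

Call it as time_candidates({ old => \&parse_with_old_module, new => \&parse_with_new_module }, @files), where both subs are your own wrappers. Run it a few times; a single run can be skewed by disk caching.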
Repeat profiling with several files at once, and find code that is called repeatedly without need. Eliminate that code if it slows down processing.
Note that HTML and XML are two different things that have very much in common. Perhaps XML::LibXML is able to parse your HTML documents (using the parse_html_file() method) well enough to be helpful, and faster than any pure Perl module could ever run. Check whether XML::LibXML can read your HTML documents at all, then compare the speed using NYTProf.
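A first test with XML::LibXML could look roughly like this. It is a minimal sketch, assuming you want the link targets from each document; extract_links is a made-up name, and the XPath expression is just an example of what you might extract. The recover option asks libxml2 to carry on past broken markup, which real-world HTML usually needs.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the href values of all <a> elements in one HTML file.
sub extract_links {
    my ($file) = @_;
    my $parser = XML::LibXML->new();
    # Let the module read the file itself; tolerate and silence markup errors.
    my $doc = $parser->parse_html_file($file, {
        recover           => 1,
        suppress_errors   => 1,
        suppress_warnings => 1,
    });
    # findnodes() returns attribute nodes here; value() gives the attribute text.
    return map { $_->value } $doc->findnodes('//a/@href');
}
```

If this dies or returns garbage for a fair share of your files, XML::LibXML is probably not the right tool for your data, no matter how fast it is.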
<update>If you have a multi-processor machine, try to run several jobs in parallel. Have a managing process that keeps N (or 2N) worker processes working, where N is the number of CPU cores.</update>
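The managing process could be sketched like this, using nothing but fork and wait. It is an illustration under assumptions, not a finished tool: run_parallel is a made-up name, the number of workers is passed in by hand (pick N or 2N yourself), and each child handles exactly one file. A module from CPAN such as Parallel::ForkManager does the same job with more comfort.

```perl
use strict;
use warnings;

# Hypothetical helper: call $job->($file) for every file, keeping at most
# $max_workers child processes alive at any time. Returns the number of
# jobs whose child exited with a non-zero status.
sub run_parallel {
    my ($max_workers, $job, @files) = @_;
    my %running;    # pid => file being processed by that child
    my $failed = 0;

    while (@files or %running) {
        # Start new workers while there is work left and a free slot.
        while (@files and keys(%running) < $max_workers) {
            my $file = shift @files;
            my $pid  = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: process one file, report success via exit status.
                my $ok = eval { $job->($file); 1 };
                exit($ok ? 0 : 1);
            }
            $running{$pid} = $file;
        }
        # Parent: block until any child finishes, then free its slot.
        my $pid = wait();
        last if $pid == -1;
        $failed++ if $? != 0;
        delete $running{$pid};
    }
    return $failed;
}
```

Usage would be something like run_parallel(4, \&parse_one_file, glob('*.html')), where parse_one_file is your own, already profiled, single-file routine. Forking once per file costs time too, so for 50000 small files it may pay off to hand each child a batch of files instead. Profile it.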
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)