afoken
<blockquote><p>So how do I parse these files quickly, reading all these values (stripped of dollar signs, commas, percentages) as quickly as possible?</p><p>I guess I'd use File::Slurp to store a file in a scalar, then HTML::TableExtract (How do I get the second occurrence?)? Or should I use a regex (how do I get the second occurrence?)? Or a template (how?)?</p></blockquote>
<p>Well, I'm tempted to answer "start by parsing one file, then repeat that for the remaining 49,999 files".</p>
<p>No, really. Start with one HTML file, write readable code, and DON'T optimize AT ALL. Use whatever seems reasonable. Don't slurp files yourself if the parsing module has a function to read from a file. Check whether your code works with a second HTML file, and a third. Fix bugs. Still, DON'T optimize.</p>
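<p>As a concrete starting point, here is a minimal sketch of such a readable first pass. It assumes the second table in the document is the one you want (HTML::TableExtract counts matches from zero, so <c>count =&gt; 1</c> selects the second one), and it lets the module read the file itself instead of slurping:</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# count => 1: take the second matching table (counting starts at 0)
my $te = HTML::TableExtract->new(count => 1);

# let the module read the file itself -- no manual slurping needed
$te->parse_file($ARGV[0]);

for my $table ($te->tables) {
    for my $row ($table->rows) {
        # strip dollar signs, commas, and percent signs from each cell
        my @clean = map {
            my $v = defined $_ ? $_ : '';
            $v =~ tr/$,%//d;
            $v;
        } @$row;
        print join("\t", @clean), "\n";
    }
}
```

<p>Note that <c>tr/$,%//d</c> deletes the unwanted characters in place; there is no need for three separate substitutions.</p>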
<p><c>svn commit</c> (you may also use CVS, git, whatever, but make sure you can get back old versions of your code.)</p>
<p>Now, install [mod://Devel::NYTProf], and run <c>perl -d:NYTProf yourscript.pl file1.html</c> followed by <c>nytprofhtml</c>. Open <c>nytprof/index.html</c> and find out which code takes the most time to run. Look at everything with a red background. Optimize that code, and only that code.</p>
<p>Repeat until you find no more code to optimize.</p>
<p>Repeat with several other HTML files.</p>
<p>Be prepared to find modules (from CPAN) that are far from being optimized for speed. Try to switch to a different module if your script spends most of the time in a third-party module. Run NYTProf again after switching. Compare total time used before and after switching. Use whatever is faster. (For example, I learned during profiling that [mod://XML::LibXML] was more than 10 times faster than [mod://XML::Twig] with <b>my</b> problem and <b>my</b> data.)</p>
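<p>To compare total time on a smaller scale, the core [mod://Benchmark] module can race two candidate approaches on the same data. A sketch, where the two subs are stand-ins for hypothetical wrappers around your real parsing modules:</p>

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# stand-in "parsers" for illustration only; replace the bodies with
# real wrappers around e.g. XML::LibXML and XML::Twig calls
sub parse_fast { my ($html) = @_; () = $html =~ /<td[^>]*>/g }
sub parse_slow { my ($html) = @_; () = split /</, $html }

my $html = '<table><tr><td>$1,234</td><td>5%</td></tr></table>' x 100;

# run each sub for at least 1 CPU second, then print a comparison chart
cmpthese(-1, {
    fast => sub { parse_fast($html) },
    slow => sub { parse_slow($html) },
});
```

<p>The chart shows iterations per second and the relative speedup, which is easier to compare than raw wall-clock numbers from two separate runs.</p>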
<p>Repeat profiling with several files at once, and find code that is called repeatedly without needing to be. Eliminate that code if it slows down processing.</p>
<p>Note that HTML and XML are two different things that have very much in common. Perhaps XML::LibXML can parse your HTML documents (using the <c>parse_html_file()</c> method) well enough to be helpful, and faster than any pure Perl module could ever run. Check whether XML::LibXML can read your HTML documents at all, then compare the speed using NYTProf.</p>
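<p>A sketch of that experiment, again assuming the second table holds the data. The <c>recover =&gt; 2</c> option tells libxml2 to silently tolerate the tag soup found in real-world HTML, and XPath positions start at 1, so <c>(//table)[2]</c> is the second table:</p>

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# recover => 2: parse sloppy real-world HTML without dying or warning
my $doc = $parser->parse_html_file($ARGV[0], { recover => 2 });

# XPath indices are 1-based: (//table)[2] selects the second table
for my $cell ($doc->findnodes('(//table)[2]//td')) {
    (my $value = $cell->textContent) =~ tr/$,%//d;  # strip $ , %
    print "$value\n";
}
```

<p>Because libxml2 is compiled C, the XPath query usually costs far less than walking the tree from pure Perl.</p>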
<p><update>If you have a multi-processor machine, try to run several jobs in parallel. Have a managing process that keeps N (or 2N) worker processes working, where N is the number of CPU cores.</update></p>
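<p>A minimal sketch of such a manager, using the CPAN module [mod://Parallel::ForkManager]. Here <c>process_file()</c> is a hypothetical stand-in for your per-file parsing code, and the worker count of 8 assumes a four-core machine (2N):</p>

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# hypothetical per-file worker -- plug in your real parsing code here
sub process_file { my ($file) = @_; print "parsed $file\n" }

# keep at most 8 child processes running (2N for N = 4 cores)
my $pm = Parallel::ForkManager->new(8);

for my $file (glob '*.html') {
    $pm->start and next;   # parent: fork a child, move on to next file
    process_file($file);   # child: handle exactly one file
    $pm->finish;           # child exits; manager may start another
}
$pm->wait_all_children;    # wait for the last workers to drain
```

<p>Because each file is independent, no locking is needed; the only shared resource is the disk, which is why more than 2N workers rarely helps.</p>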
<p>Alexander</p>
<div class="pmsig"><div class="pmsig-747201">
--<br>
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
</div></div>
873713