<?xml version="1.0" encoding="windows-1252"?>
<node id="873828" title="Re: how to quickly parse 50000 html documents?" created="2010-11-26 05:48:40" updated="2010-11-26 05:48:40">
<type id="11">
note</type>
<author id="747201">
afoken</author>
<data>
<field name="doctext">
&lt;blockquote&gt;&lt;p&gt;So how do I parse these files quickly, reading all these values (stripped of dollar signs, commas, percentages) as quickly as possible?&lt;/p&gt;&lt;p&gt;I guess I'd use File::Slurp to store a file in a scalar, then HTML::TableExtract (How do I get the second occurrence?)? Or should I use a regex (how do I get the second occurrence?)? Or a template (how?)?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Well, I'm tempted to answer "start by parsing one file, repeat that for the remaining 49,999 files".&lt;/p&gt;
&lt;p&gt;No, really. Start with one HTML file, write readable code, DON'T optimize AT ALL. Use whatever seems reasonable. Don't slurp files yourself if the parsing module has a function to read from a file. Check whether your code works with a second HTML file, and a third. Fix bugs. Still, DON'T optimize.&lt;/p&gt;
&lt;p&gt;&lt;c&gt;svn commit&lt;/c&gt; (you may also use CVS, git, whatever. But make sure you can get back old versions of your code.)&lt;/p&gt;
&lt;p&gt;Now, install [mod://Devel::NYTProf], and run &lt;c&gt;perl -d:NYTProf yourscript.pl file1.html&lt;/c&gt; followed by &lt;c&gt;nytprofhtml&lt;/c&gt;. Open &lt;c&gt;nytprof/index.html&lt;/c&gt; and find out which code takes the most time to run. Look at everything with a red background. Optimize that code, and only that code.&lt;/p&gt;
&lt;p&gt;Repeat until you find no more code to optimize.&lt;/p&gt;
&lt;p&gt;Repeat with several other HTML files.&lt;/p&gt;
&lt;p&gt;Be prepared to find modules (from CPAN) that are far from being optimized for speed. Try to switch to a different module if your script spends most of the time in a third-party module. Run NYTProf again after switching. Compare total time used before and after switching. Use whatever is faster. (For example, I learned during profiling that [mod://XML::LibXML] was more than 10 times faster than [mod://XML::Twig] with &lt;b&gt;my&lt;/b&gt; problem and &lt;b&gt;my&lt;/b&gt; data.)&lt;/p&gt;
&lt;p&gt;Repeat profiling with several files at once, and look for code that is called repeatedly without need. Eliminate that code if it slows down processing.&lt;/p&gt;
&lt;p&gt;Note that HTML and XML are two different things that have very much in common. Perhaps XML::LibXML is able to parse your HTML documents (using the &lt;c&gt;parse_html_file()&lt;/c&gt; method) well enough to be helpful, and faster than any pure Perl module could ever run. Check whether XML::LibXML can read your HTML documents at all, then compare the speed using NYTProf.&lt;/p&gt;
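&lt;p&gt;A minimal sketch of that idea, assuming the values sit in table cells and you want the second table of each document. The helper name &lt;c&gt;table_cells&lt;/c&gt; and the XPath expressions are made up for illustration, so adjust them to your markup. It uses &lt;c&gt;load_html()&lt;/c&gt;, the newer class-method interface to the same libxml2 HTML parser. Untested against your data, so profile before trusting that it is faster.&lt;/p&gt;
&lt;c&gt;
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the rows of the $n-th table (1-based) as
# array refs of cell texts, stripped of dollar signs, commas and percent signs.
sub table_cells {
    my ($doc, $n) = @_;
    my @rows;
    for my $tr ($doc-&gt;findnodes("(//table)[$n]//tr")) {
        my @cells;
        for my $cell ($tr-&gt;findnodes('./td | ./th')) {
            my $text = $cell-&gt;textContent;
            $text =~ tr/$,%//d;        # strip $ , %
            $text =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
            push @cells, $text;
        }
        push @rows, \@cells;
    }
    return \@rows;
}

for my $file (@ARGV) {
    # recover =&gt; 2 lets libxml2 repair sloppy HTML without printing warnings
    my $doc = XML::LibXML-&gt;load_html(location =&gt; $file, recover =&gt; 2);
    print join("\t", @$_), "\n" for @{ table_cells($doc, 2) };
}
&lt;/c&gt;
&lt;p&gt;The &lt;c&gt;(//table)[2]&lt;/c&gt; part is what picks the second occurrence. If your tables are nested, &lt;c&gt;//tr&lt;/c&gt; will also pick up rows of the inner tables, so you may need a stricter path.&lt;/p&gt;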
&lt;p&gt;&amp;lt;update&amp;gt;If you have a multi-processor machine, try to run several jobs in parallel. Have a managing process that keeps N (or 2N) worker processes working, where N is the number of CPU cores.&amp;lt;/update&amp;gt;&lt;/p&gt;
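&lt;p&gt;One way to sketch such a managing process is [mod://Parallel::ForkManager] (my pick for illustration, any fork-based scheme would do). &lt;c&gt;process_file&lt;/c&gt; and &lt;c&gt;process_all&lt;/c&gt; are placeholder names, and the worker count of 4 is an assumption you should replace with your own core count.&lt;/p&gt;
&lt;c&gt;
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical per-file worker; replace the body with your real parsing code.
sub process_file {
    my ($file) = @_;
    # ... parse $file and write the extracted values somewhere ...
    return;
}

# Keep at most $workers child processes busy until all files are done.
# Returns the number of children that have finished.
sub process_all {
    my ($workers, @files) = @_;
    my $pm   = Parallel::ForkManager-&gt;new($workers);
    my $done = 0;
    $pm-&gt;run_on_finish(sub { $done++ });   # runs in the parent for each reaped child
    for my $file (@files) {
        $pm-&gt;start and next;    # parent: fork a child, move on to the next file
        process_file($file);       # child: do the work
        $pm-&gt;finish;            # child: exit
    }
    $pm-&gt;wait_all_children;
    return $done;
}

my $workers = 4;   # assumption: set to your number of CPU cores (or twice that)
printf "%d files processed\n", process_all($workers, @ARGV);
&lt;/c&gt;
&lt;p&gt;This forks once per file, which is simple but not free. With 50000 small files it may pay off to hand each child a batch of files instead. As always, measure before and after.&lt;/p&gt;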
&lt;p&gt;Alexander&lt;/p&gt;
&lt;div class="pmsig"&gt;&lt;div class="pmsig-747201"&gt;
--&lt;br&gt;
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
&lt;/div&gt;&lt;/div&gt;</field>
<field name="root_node">
873713</field>
<field name="parent_node">
873713</field>
</data>
</node>
