Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

So how do I parse these files quickly, reading all these values (stripped of dollar signs, commas, percentages) as quickly as possible?

I guess I'd use File::Slurp to store a file in a scalar, then HTML::TableExtract (How do I get the second occurrence?)? Or should I use a regex (how do I get the second occurrence?)? Or a template (how?)?

Well, I'm tempted to answer "start by parsing one file, repeat that for the remaining 49.999 files".

No, really. Start with one HTML file, write readable code, DON'T optimize AT ALL. Use whatever seems to be reasonable. Don't slurp files yourself if the parsing module has a function to read from a file. Try if your code works with a second HTML file, and a third. Fix bugs. Still, DON'T optimize.

svn commit (you may also use CSV, git, whatever. But make sure you can get back old versions of your code.)

Now, install Devel::NTYProf, and run perl -d:NYTProf yourscript.pl file1.html followed by nytprofhtml. Open nytprof/index.html and find out which code takes the most time to run. Look at everything with a red background. Optimize that code, and only that code.

Repeat until you find no more code to optimize.

Repeat with several other HTML files.

Be prepared to find modules (from CPAN) that are far from being optimized for speed. Try to switch to a different module if your script spends most of the time in a third-party module. Run NYTProf again after switching. Compare total time used before and after switching. Use whatever is faster. (For example, I learned during profiling that XML::LibXML was more than 10 times faster than XML::Twig with my problem and my data.)

Repeat profiling with several files at once, find code that is called repeatedly without need to do so. Eleminate that code if it slows down processing.

Note that HTML and XML are two different things that have very much in common. Perhaps XML::LibXML is able to parse your HTML documents (using the parse_html_file() method) good enough to be helpful, and faster than any pure Perl module could ever run. Try if XML::LibXML can read your HTML documents at all, then compare the speed using NYTProf.

<update>If you have a multi-processor machine, try to run several jobs in parallel. Have a managing process that keeps N (or 2N) worker processes working, where N is the number of CPU cores.</update>

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re: how to quickly parse 50000 html documents? by afoken
in thread how to quickly parse 50000 html documents? by brengo

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others browsing the Monastery: (16)
    As of 2014-08-27 13:34 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The best computer themed movie is:











      Results (238 votes), past polls