Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

So how do I parse these files quickly, reading all these values (stripped of dollar signs, commas, percentages) as quickly as possible?

I guess I'd use File::Slurp to store a file in a scalar, then HTML::TableExtract (How do I get the second occurrence?)? Or should I use a regex (how do I get the second occurrence?)? Or a template (how?)?

Well, I'm tempted to answer "start by parsing one file, repeat that for the remaining 49.999 files".

No, really. Start with one HTML file, write readable code, DON'T optimize AT ALL. Use whatever seems to be reasonable. Don't slurp files yourself if the parsing module has a function to read from a file. Try if your code works with a second HTML file, and a third. Fix bugs. Still, DON'T optimize.

svn commit (you may also use CSV, git, whatever. But make sure you can get back old versions of your code.)

Now, install Devel::NTYProf, and run perl -d:NYTProf file1.html followed by nytprofhtml. Open nytprof/index.html and find out which code takes the most time to run. Look at everything with a red background. Optimize that code, and only that code.

Repeat until you find no more code to optimize.

Repeat with several other HTML files.

Be prepared to find modules (from CPAN) that are far from being optimized for speed. Try to switch to a different module if your script spends most of the time in a third-party module. Run NYTProf again after switching. Compare total time used before and after switching. Use whatever is faster. (For example, I learned during profiling that XML::LibXML was more than 10 times faster than XML::Twig with my problem and my data.)

Repeat profiling with several files at once, find code that is called repeatedly without need to do so. Eleminate that code if it slows down processing.

Note that HTML and XML are two different things that have very much in common. Perhaps XML::LibXML is able to parse your HTML documents (using the parse_html_file() method) good enough to be helpful, and faster than any pure Perl module could ever run. Try if XML::LibXML can read your HTML documents at all, then compare the speed using NYTProf.

<update>If you have a multi-processor machine, try to run several jobs in parallel. Have a managing process that keeps N (or 2N) worker processes working, where N is the number of CPU cores.</update>


Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re: how to quickly parse 50000 html documents? by afoken
in thread how to quickly parse 50000 html documents? by brengo

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the sunlight beams...

    How do I use this? | Other CB clients
    Other Users?
    Others making s'mores by the fire in the courtyard of the Monastery: (4)
    As of 2018-05-23 08:13 GMT
    Find Nodes?
      Voting Booth?