Here are a few suggestions based on my experience processing lots of very large log files (not a perfect match, but much of it applies here).

  • The first suggestion, which I think you've now tested, is to split the logic into separate programs. With multiple CPUs, if your processing is CPU-bound, splitting the work into multiple piped programs lets you take advantage of those CPUs, and will frequently outperform a single-program solution.

  • I also saw mention that it was a large file, 80GB or something like that? If it doesn't get updated frequently, you might want to try compressing the file. With a fast compression algorithm (gzip, lzo, etc.) you can often read and decompress the data faster than you could read the uncompressed data from disk. With multiple CPUs, this can result in a net improvement in performance. This is worth testing, of course, as it will depend heavily on your specific circumstances (disk speed, CPU speed, RAM and disk caching, etc.).

  • Another possibility, depending on many hardware/system factors, would be to split the work up along the lines of Map-Reduce. Split your file (physically or logically), process the chunks with separate programs, then combine the results. A naive example might be a program that gets the file size, breaks it into 10GB chunks, then forks a corresponding number of child processes, each of which works on its chunk of the file and returns its results to be aggregated at the end.

  • If you don't need to return a result set for every single record in the file every time, then the suggestion to try a real database is an excellent one. I love SQLite, and it can handle quite a bit (although 80GB might be pushing it), but if you only wanted to return results for a smaller matching subset of the data, you're almost certainly going to win big with a database.

  • If you really wanted to squeeze every bit of performance out of this and optimize it to the extreme, you'd do well to read about the Wide Finder projects/experiments kicked off by Tim Bray.

    It's worth googling for "Wide Finder" and checking out some of the other write-ups discussing it, too.
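
The split-and-combine idea from the list above can be sketched in shell. This is only a rough sketch under assumptions, not a tuned solution: it assumes GNU split (for the -n l/N option, which cuts on line boundaries), the function name process_in_chunks is made up, and the awk program is just a stand-in for the per-record work. It splits physically, so it needs scratch space about the size of the input; a logical split (each child seeking to its own byte offset) avoids the copy but takes more code.

```shell
# Sketch: split a TSV file into line-aligned chunks, process the chunks in
# parallel, then concatenate the per-chunk results in the original order.
# Assumes GNU split (-n l/N); process_in_chunks is a hypothetical name.
process_in_chunks() {
    input=$1
    nchunks=$2
    tmpdir=$(mktemp -d) || return 1

    # Split into N pieces without ever breaking a line in the middle.
    split -n "l/$nchunks" "$input" "$tmpdir/chunk." || return 1

    # "Map": one background awk per chunk; the shell runs them concurrently.
    # The glob is expanded once, before any .out file exists.
    for chunk in "$tmpdir"/chunk.*; do
        awk -F'\t' -v OFS='\t' '{ print $3, $1, $6 }' "$chunk" > "$chunk.out" &
    done
    wait    # block until every child has finished

    # "Reduce": split names the chunks aa, ab, ..., so the sorted glob
    # puts the results back in input order.
    cat "$tmpdir"/chunk.*.out
    rm -rf "$tmpdir"
}
```

Called as, say, process_in_chunks datafile.tsv 8 (the file name is just an example), it writes the combined result to standard output. Whether it beats a single pass depends on the same disk and CPU factors mentioned above.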

Much as I love Perl, I probably would have done something like this as a first shot for the described processing:

$ zcat datafile.gz | awk -F'\t' -v OFS='\t' '{print $3,$1,$6}' | gzip -c > output.gz
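
One detail worth checking with a pipeline like this: awk joins the comma-separated arguments of print with its output field separator (OFS), which defaults to a single space, so whether the result is still tab-separated depends on OFS. A quick check on a made-up sample row (the function names here are hypothetical):

```shell
# Made-up sample row with six tab-separated fields.
sample() { printf 'one\ttwo\tthree\tfour\tfive\tsix\n'; }

# Default OFS: the selected fields come out separated by single spaces.
default_ofs() { sample | awk -F'\t' '{ print $3, $1, $6 }'; }

# OFS set to a tab: the selected fields stay tab-separated.
tab_ofs() { sample | awk -F'\t' -v OFS='\t' '{ print $3, $1, $6 }'; }
```

If anything downstream expects tab-separated input, or the fields themselves can contain spaces, the second form is the safer one.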

In reply to Re: selecting columns from a tab-separated-values file by topher
in thread selecting columns from a tab-separated-values file by ibm1620
