Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Here are a few suggestions based on experience dealing with processing lots of very large log files (not a perfect match, but a lot of which would apply here).

  • First suggestion, which I think I saw that you've now tested, is to try splitting the logic into separate programs. With multiple CPU's, if you your processing is CPU bound, splitting the work into multiple piped programs can allow you to take advantage of those CPUs, and will frequently outperform a single-app solution.

  • I also saw mention that it was a large file, 80GB or something like that? If it doesn't get updated frequently, you might want to try compressing the file. Using a fast compression algorithm (gzip, lzo, etc.) you can often read and decompress data faster than you could read the uncompressed data from disk. With multiple CPU's, this can result in a net improvement in performance. This is worth testing, of course, as it will depend heavily on your specific circumstances (disk speed, CPU speed, RAM and disk caching, etc.)

  • Another possible suggestion, depending on many hardware/system factors, would be to split it up, along the lines of Map-Reduce. Split your file (physically, or logically), and process the chunks with separate programs, then combine the results. A naive example might be a program that gets the file size, breaks it into 10GB chunks, then forks into a corresponding number of child processes, each of which does work on it's chunk of the file, while returning the results to be aggregated at the end.

  • If you don't need to return a result set for every single record in the file every time, then the suggestion to try a real database is an excellent one. I love SQLite, and it can handle quite a bit (although 80GB might be pushing it), but if you only wanted to return results for a smaller matching subset of the data, you're almost certainly going to win big with a database.

  • If you really wanted to squeeze every bit of performance out of this, and optimize it to the extreme, you'd do well off to read about the Wide Finder Projects/Experiments kicked off by Tim Bray:

    It's worth googling for "Wide Finder" and checking out some of the other write-ups discussing it, too.

Much as I love Perl, I probably would have done something like this as a first shot for the described processing:

$ zcat datafile.gz | awk -F'\t' '{print $3,$1,$6}' | gzip -c > output.gz


In reply to Re: selecting columns from a tab-separated-values file by topher
in thread selecting columns from a tab-separated-values file by ibm1620

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (6)
As of 2024-04-23 14:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found