PerlMonks  

opinion - enormous input files

by donkeykong (Novice)
on Feb 01, 2009 at 04:04 UTC ( [id://740501]=perlquestion )

donkeykong has asked for the wisdom of the Perl Monks concerning the following question:

Dear all that can provide feedback, I have a question. I am working with an enormous file in which each line is made of random characters (e.g. abc123). My task is to read all the data just once and print out all the unique strings and the number of times they appear (so if abc123 shows up 1,000 times, the output would say "abc123: 1,000"). So my question is, in all your opinions, what is the most efficient and fastest way of doing this? Would you evaluate each line as you are reading the file, or would you read the whole file into an array or hash and then evaluate, or would you do it another way? I appreciate your input.

Replies are listed 'Best First'.
Re: opinion - enormous input files
by BrowserUk (Patriarch) on Feb 01, 2009 at 04:42 UTC

    It doesn't really matter how large the file is, but rather how many unique strings it contains.

    Something as simple as this:

    perl -nle"++$h{ $1 } while m[(\S+)]g } { print qq[$_ : $h{ $_ }] for keys %h" hugeFile

    will work quite well for any size of file if there are no more than a few (low) 10s of millions of unique strings to count.

    If the number of unique strings is larger than that, then you will likely run out of memory constructing the hash.
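
    For anyone who prefers a script to a one-liner, here is a minimal sketch of the same idea. The sub name count_strings and the in-memory sample data are my own stand-ins, not part of the one-liner above; in real use you would open hugeFile instead.

```perl
use strict;
use warnings;

# Count every whitespace-separated string read from a filehandle.
# Only the hash of unique strings is kept in memory, never the whole file.
# (count_strings is a hypothetical name chosen for this sketch.)
sub count_strings {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        # \S+ matches each run of non-whitespace characters on the line
        $count{$1}++ while $line =~ /(\S+)/g;
    }
    return \%count;
}

# Sample data standing in for hugeFile (an assumption for the demo).
my $sample = "abc123 xyz789\nabc123\nq1w2e3 abc123\n";
open my $fh, '<', \$sample or die "Cannot open in-memory data: $!";

my $count = count_strings($fh);
print "$_ : $count->{$_}\n" for sort keys %$count;
```

    To run it against a real file, replace the in-memory open with something like open my $fh, '<', 'hugeFile'. Memory use still grows with the number of unique strings, so the limit mentioned above applies here too.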

    The other way to go is to use your system's sort utility to order the strings. You can then process the sorted file line by line, counting consecutive matches and printing each count as you go, without having to build a data structure to hold them all.

    If there are multiple strings per line, pre-process the file line by line, split each line into strings, and output them one per line. Then feed that to your system sort (without -u, which would discard the duplicates you need to count).
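
    A rough sketch of the counting step for sorted input, assuming one string per line. The sub name count_sorted and its callback argument are hypothetical, and the in-memory data stands in for the output of your system sort.

```perl
use strict;
use warnings;

# Count runs of identical consecutive lines from an already-sorted filehandle.
# Calls $emit->($string, $count) once per distinct string, so only the
# current string and its counter are held in memory.
# (count_sorted is a hypothetical name chosen for this sketch.)
sub count_sorted {
    my ( $fh, $emit ) = @_;
    my ( $prev, $n );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( defined $prev && $line eq $prev ) {
            $n++;    # same string as the previous line: extend the run
        }
        else {
            $emit->( $prev, $n ) if defined $prev;    # previous run is complete
            ( $prev, $n ) = ( $line, 1 );             # start a new run
        }
    }
    $emit->( $prev, $n ) if defined $prev;            # flush the final run
    return;
}

# Sample sorted data standing in for the output of sort (an assumption).
my $sorted = "abc123\nabc123\nabc123\nxyz789\n";
open my $in, '<', \$sorted or die "Cannot open in-memory data: $!";
count_sorted( $in, sub { print "$_[0] : $_[1]\n" } );
```

    With the sample data this prints "abc123 : 3" and "xyz789 : 1". Passing \*STDIN instead of $in would let you pipe the output of sort straight into the script.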


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: opinion - enormous input files
by artist (Parson) on Feb 01, 2009 at 04:37 UTC
    If on *nix you can use
    sort file | uniq -c
    --Artist
      No, he wants to do it in perl:
      perl -e 'print `sort file | uniq -c`'
Re: opinion - enormous input files
by Lawliet (Curate) on Feb 01, 2009 at 04:24 UTC

    I would read each line one at a time and use a hash to keep track of how many occurrences there are for each string.

    And you didn't even know bears could type.

Re: opinion - enormous input files
by dsheroh (Monsignor) on Feb 01, 2009 at 14:11 UTC
    You've gotten good answers already regarding how to do the actual count of unique strings, but none have directly addressed the question of "Would you evaluate each line as you are reading the file, or would you read the whole file into an array or hash and then evaluate, or would you do it another way?"

    As a general rule, any time you want to go over a file and just do one thing to each line, your best option will typically be to evaluate each line as it is read. Reading it all into an array and then walking through the array to evaluate each line would just waste time (since you're making two passes over the data, even if you're only reading the file once) and memory (since you need to store the entire file in memory).
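
    To make the difference concrete, here is a small sketch of both styles side by side. The sub names count_streaming and count_slurped and the sample data are made up for illustration. Both produce the same counts, but the second holds every line in @lines before it starts counting.

```perl
use strict;
use warnings;

# Evaluate each line as it is read: one pass, one line in memory at a time.
sub count_streaming {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        chomp $line;
        $count{$line}++;
    }
    return \%count;
}

# Read everything first, then evaluate: the whole file sits in @lines,
# and the data is walked a second time to do the counting.
sub count_slurped {
    my ($fh) = @_;
    my @lines = <$fh>;
    chomp @lines;
    my %count;
    $count{$_}++ for @lines;
    return \%count;
}

# Sample data standing in for the real input file (an assumption).
my $data = "abc123\nxyz789\nabc123\n";

for my $counter ( \&count_streaming, \&count_slurped ) {
    open my $fh, '<', \$data or die "Cannot open in-memory data: $!";
    my $count = $counter->($fh);
    print "$_ : $count->{$_}\n" for sort keys %$count;
}
```

    On three lines the difference is invisible, of course. It only starts to matter once the file is large compared to available memory.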

Re: opinion - enormous input files
by weismat (Friar) on Feb 01, 2009 at 06:57 UTC
    Another trick could be to work with a RAM disk (on Windows, via an external program) or with swap on a *nix system.
    This can work very efficiently if the files fit into memory.

Node Type: perlquestion [id://740501]
Approved by Lawliet