opinion - enormous input files

donkeykong has asked for the wisdom of the Perl Monks concerning the following question:

Re: opinion - enormous input files by BrowserUk (Patriarch) on Feb 01, 2009 at 04:42 UTC
It doesn't really matter how large the file is, but rather how many unique strings it contains. Something as simple as this: `perl -nle"++$h{ $1 } while m[(\S+)]g } { print qq[$_ : $h{ $_ }] for keys %h" hugeFile` will work quite well for any size of file, provided there are no more than a few (low) tens of millions of unique strings to count. If the number of unique strings is larger than that, you will likely run out of memory constructing the hash.

The other way to go is to use your system's sort utility to order the strings. You can then process the sorted file line by line, counting the consecutive matches and outputting your counts without having to build a data structure to hold them all. If there are multiple strings per line, pre-process the file line by line, splitting each line into strings and writing them out one per line. Then feed that to your system sort. (Add -u only if you want the list of distinct strings rather than the counts, since it discards the duplicates.)
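The sorted-file approach described above could be sketched roughly as follows. This is only an illustration, not code from the post: the sub name `count_sorted_runs`, the callback interface, and the file names in the usage note are all assumptions. It expects input that is already sorted, with one string per line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count runs of identical consecutive lines on an already-sorted handle.
# $emit->($string, $count) is called once per distinct string, so memory
# use stays constant however many unique strings the input contains.
# (count_sorted_runs is a hypothetical name chosen for this sketch.)
sub count_sorted_runs {
    my ($in, $emit) = @_;
    my ($prev, $count);
    while (my $line = <$in>) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $count++;                           # same string as before: extend the run
            next;
        }
        $emit->($prev, $count) if defined $prev;  # run ended: report it
        ($prev, $count) = ($line, 1);             # start a new run
    }
    $emit->($prev, $count) if defined $prev;      # flush the final run
    return;
}

# Only runs when a (hypothetical) sorted file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    count_sorted_runs($fh, sub { print "$_[0] : $_[1]\n" });
    close $fh;
}
```

Assuming the script is saved as `count_runs.pl`, it might be used as `sort strings.txt > sorted.txt` followed by `perl count_runs.pl sorted.txt`. Only the current string and its running count are held in memory at any time.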
Re: opinion - enormous input files by artist (Parson) on Feb 01, 2009 at 04:37 UTC
If on *nix you can use `sort file | uniq -c`
Re^2: opinion - enormous input files by dsheroh (Monsignor) on Feb 01, 2009 at 14:02 UTC
No, he wants to do it in Perl: `` perl -e 'print `sort file | uniq -c`' ``
Re: opinion - enormous input files by Lawliet (Curate) on Feb 01, 2009 at 04:24 UTC
I would read each line one at a time and use a hash to keep track of how many occurrences there are for each string.
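A minimal sketch of that read-a-line, bump-a-hash idea in script form. The sub name `count_strings` is an assumption, as is the choice to treat every whitespace-delimited token as a "string", since the original question's exact format is not shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally every whitespace-delimited string read from a handle.
# Memory grows with the number of *unique* strings, not with file size.
# (count_strings is a hypothetical name chosen for this sketch.)
sub count_strings {
    my ($in) = @_;
    my %count;
    while (my $line = <$in>) {
        # /g in scalar context walks the line one token at a time
        $count{$1}++ while $line =~ /(\S+)/g;
    }
    return \%count;
}

# Only runs when a (hypothetical) input file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $count = count_strings($fh);
    close $fh;
    print "$_ : $count->{$_}\n" for sort keys %$count;
}
```

Only one line of the file is in memory at a time; the hash is the only structure that grows.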
Re: opinion - enormous input files by dsheroh (Monsignor) on Feb 01, 2009 at 14:11 UTC
You've gotten good answers already regarding how to do the actual count of unique strings, but none have directly addressed the question of "Would you evaluate each line as you are reading the file, or would you read the whole file into an array or hash and then evaluate, or would you do it another way?" As a general rule, any time you want to go over a file and just do one thing to each line, your best option will typically be to evaluate each line as it is read. Reading it all into an array and then walking through the array to evaluate each line would just waste time (since you're making two passes over the data, even if you're only reading the file once) and memory (since you need to store the entire file in memory).
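To make the contrast concrete, here is a small sketch of both styles doing the same job: counting the lines that match a pattern. The sub names and the task itself are assumptions picked for illustration. Both return the same number, but the second has to hold every line in memory before it starts evaluating.

```perl
use strict;
use warnings;

# Streaming: one pass over the data, only the current line is in memory.
sub count_matches_streaming {
    my ($in, $re) = @_;
    my $n = 0;
    while (my $line = <$in>) {
        $n++ if $line =~ $re;
    }
    return $n;
}

# Slurping: the whole file is copied into @lines first (list context reads
# everything), then the array is walked in a second pass.
sub count_matches_slurped {
    my ($in, $re) = @_;
    my @lines = <$in>;
    my $n = 0;
    for my $line (@lines) {
        $n++ if $line =~ $re;
    }
    return $n;
}
```

For a file of a few kilobytes the difference is negligible; for an enormous input the `@lines` array in the second version is the part that hurts.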
Re: opinion - enormous input files by weismat (Friar) on Feb 01, 2009 at 06:57 UTC
Another trick could be to work with a memory disk (on Windows, via an external program) or with swap on a *nix system. This can work very efficiently if the files fit into memory.