It doesn't really matter how large the file is, but rather how many unique strings it contains.
Something as simple as this:
perl -nle"++$h{ $1 } while m[(\S+)]g }
{ print qq[$_ : $h{ $_ }] for keys %h" hugeFile
will work quite well for a file of any size, provided there are no more than a few (low) tens of millions of unique strings to count.
If the number of unique strings is larger than that, then you will likely run out of memory constructing the hash.
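For readability, the one-liner expands to roughly the following script (a sketch; the file name count.pl is just illustrative, run it as perl count.pl hugeFile):

use strict;
use warnings;

my %h;
while ( <> ) {                            # -n: read the input file line by line
    chomp;                                # -l: strip the trailing newline
    ++$h{ $1 } while m[(\S+)]g;           # bump the count for every whitespace-separated string
}
print "$_ : $h{ $_ }\n" for keys %h;      # after EOF, emit each string with its count

The trailing report block is what the }{ trick in the one-liner buys you: the counting runs inside the implicit read loop, and the report runs once after it.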
The other way to go is to use your system's sort utility to order the strings. You can then process the sorted file line by line, counting consecutive identical lines and outputting the counts, without having to build a data structure to hold them all.
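A minimal sketch of that approach, assuming the input has already been sorted to one string per line (count_runs.pl is just an illustrative name):

use strict;
use warnings;

my ( $prev, $count );
while ( my $line = <> ) {
    chomp $line;
    if ( defined $prev and $line eq $prev ) {
        ++$count;                                    # still inside a run of identical strings
    }
    else {
        print "$prev : $count\n" if defined $prev;   # a run just ended; emit its count
        ( $prev, $count ) = ( $line, 1 );            # start counting the next string
    }
}
print "$prev : $count\n" if defined $prev;           # emit the final run

Only the current and previous lines are held in memory, so this works no matter how many unique strings there are.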
If there are multiple strings per line, pre-process the file line by line, splitting each line into strings and outputting them one per line. Then feed that to your system's sort (with -u if it supports it).
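Put together, the pipeline might look something like this (a sketch; count_runs.pl is the illustrative run-counting script above):

perl -nle"print for split" hugeFile | sort | perl count_runs.pl > counts.txt

Note that sort -u collapses duplicate lines, so use it only when you want the distinct strings themselves rather than their counts.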