|Syntactic Confectionery Delight|
Re: Large file, multi dimensional hash - out of memoryby sundialsvc4 (Abbot)
|on May 16, 2013 at 00:15 UTC||Need Help??|
DavidO’s suggestion seems the most-appropriate one to me ... such that “out of memory” ought never be an issue here.
If the file is known to be sorted by filename, then the entire set of records for any given name is known to be adjacent, and you only need to remember “the last name seen” and a hash to accumulate and/or to count extensions. When the name changes, the results for the previous current-name are output, the new current-name is captured, and the hash is reset ... taking proper care to ensure that nothing is lost from either group.
As an added bonus, the output file will always emerge sorted the same way. So, you can devise an entire processing sequence that relies upon sort-order, and every single one of them might be “very fast and virtually RAM-free.” You can furthermore rely upon the fact that, for any gaps that you detect in the sequence, there will be no records anywhere in the file that fall within that gap.
This is a very powerful, and very old, technique that was used as far back as Herman Hollerith’s tabulation of the US Census using punched cards. Especially when large amounts of data had to be processed on machines whose memory capacity was all but non-existent, the so-called “unit-record processing” was the only way such jobs could be done. Even today, the assumption (and maintaining) of a certain known sort-order is a powerful technique that is ideal for an application such as this. If you know that the file is already sorted, you don’t need much memory at all. (If you knew that it was sorted both by name and extension ... and you should check for this, because there’s a pretty good chance that it’s so!! ... your problem might require almost-no memory at all: the files are adjacent, and within each file, so are the extensions.)
When you are dealing with data files of this magnitude, and especially if you need to do several things to the data in successive steps, it might (or, might not) be justifiable to sort the data, just to pick-up an accumulation of efficiencies downstream. (The only way to establish that is to empirically test it, on your own hardware and network.) In any case, if the producer of that file gave it to you with a guaranteed sort-order, they just dealt you a huge favor.
In any case, your code must be defensive: if your algorithm fundamentally relies upon the proposition that the data is sorted, you must not assume. You’re looking for the key-values to change, and that the difference between old and new is “greater than.” If it is not, your code must detect it and die ... because the results won’t be reliable and no one else is in the position to know.
Note: On a Unix-ish system, the shell command sort -c filename can be used to quickly discover, once and for all, whether the entire file is or is not sorted. (It can also be used as a quick “sanity-check” in a script that invokes this Perl program, since it does understand space-delimited files and can be instructed to consider the entire file or only a particular key or keys. “Sniff” the data at the start of the job, and sound the alarm before eating spoiled meat, because any human can make a mistake.)