Re^3: sorting type question- space problemsby BrowserUk (Pope)
|on Sep 15, 2013 at 13:55 UTC||Need Help??|
Some records are big > 5MB and some are small (5 bytes) so in-memory sort is not an option :(
Actually, it is. Or at least, it may be so ... if the lengths of the keys in your OP is somewhat representative of your real data; and you have a couple of GB of ram available.
The lengths of the records is immaterial; the 5 million records can be represented in memory by the two keys + an (64-bit) integer offset of the start of each record's position within the file. If the two keys are less than ~8 bytes each, then the total memory requirement for 5 million anonymous subs each containing 2 keys + file offet is ~ 1.8GB.
This code builds an index of the two keys + file offset in an AoA; sorts that AoA in memory and then writes the output file by seeking the input file, reading the appropriate record and writing it to the output file. For a 5 million record file it takes a little over 2 minutes on my machine:
Science is about questioning the status quo. Questioning authority
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.