in reply to "Just use a hash": An overworked mantra?
When you are dealing with “huge” amounts of data, everything depends upon ... memory. Do you have it, or do you not. (And if it’s “virtual,” you don’t have it.)
Many computers these days have truly vast amounts of RAM and there are a roomful of computers thusly equipped. Under these circumstances, the chances are quite good that a tremendous structure can be built in RAM and that all of those pages will be (and will remain) present. If that is known to be the case, then “in-memory” solutions work just fine, and yes they do behave very nicely as BrowserUK points out (with his characteristic I-love big-fonts flair ...).
What becomes truly insidious about “in-memory” solutions, especially those based upon random-access data structures such as hashes, is when virtual-memory is constrained such that the entire block of data cannot fit into available physical RAM without incurring page faults. A hash data-structure does not exhibit any locality of reference; quite the opposite. Any reference to that structure could (worst-case) incur a page-fault, which suddenly transforms the entire algorithm from what you think is a fast, virtually I/O-free operation, into one that hammers your paging-device to death and brings the entire system to a screeching halt along with it.
If you plot the performance curve of a virtual-storage system as the stress which is placed upon it increases, you will observe a line that basically increases in a nice, more-or-less linear fashion u-n-t-i-l it “hits the wall,” the so-called thrash point. At this instant, the performance curve suddenly becomes exponential. And that, as I’ve said before (from Ghostbusters), is “real wrath-of-God stuff.”
BrowserUK is therefore entirely correct as long as you are well away from the thrash-point. (And today, you might well be able to “throw cheap silicon at it” and thereby avoid the thrash-point entirely. There is a reason why we have 64-bit systems now; soon to be 128. Chips are cheap.) But the punishment that can be inflicted, when and if it happens, is severe because it is exponential.
In passing ... it is quite interesting that sorting a multi-million record file should take “ten minutes,” which is quite inexcusable. There are interesting-looking articles here and also here. Also specifically to our point, A Fresh Look at Efficient Perl Sorting, although it does not concern disk-sorts.
A similar situation can happen with regard to accessing indexed files. Once again we are dealing with a random-access data structure which may require some n physical I/O operations to retrieve the data, and which rewards locality-of-reference by virtue of cacheing recently-used index pages in RAM while discarding others. Once again we have the “thrashing” phenomenon, albeit of a different kind and source. Plentiful memory tends to mask the problem once again. (Operating systems will dedicate leftover memory to file-buffering when there is no other competition for the space.)
When and if you hit a thrash-point problem, you will know. The difference can be a matter of many hours, or the difference between a job that finishes and one that does not. “Ten minutes” (or more...) becomes an acceptable price to pay when for example you are talking about a massive runs-through-the-night production batch job. And those, really, are the kind of situations I am talking about. Not the size of problem that can be effectively dealt-with by buying more chips. Obviously, “if you’ve got the RAM, flaunt it.”