|Problems? Is your data what you think it is?|
Re^2: In-place sort with order assignmentby BrowserUk (Pope)
|on Sep 21, 2010 at 15:07 UTC||Need Help??|
Could I perchance repeat my suggestion of, “do it the COBOL way?”
No. I read you the first time.
If you read the rest of the thread, you'd see that option has already been tested and comprehensively dismissed as a possibility.
As for all the "old-timer" logic. Been there, done that. (Don't ever wanna go back:).
In the mid-80's I sped up the sorting of 6 million University entrance examination results on twinned PDP-11/24s from 10 days to just over a day (by reversing the order of the joins involved).
Today, I've sorted 60% more records, in perl, in 108 seconds.
I can't start work with a sorted file, because the input file needs to be pre-processed to extract the 1 billion words from the 100 million lines. Those 1 billion words need to be uniq'd to reduce their volume by roughly 3 orders of magnitude before sorting.
I could (externally) sort the 1 billion words; then uniq them, but that means O(N log N) where N is 1 billion, rather than O(N log N) where N = 1 million. Now assuming that each record takes 0.0000113 to sort--that's based upon my Perl sort test, no external sort is going to achieve that--the trade of sorting before uniq'ing versus after is:
Now, those per item timings are based upon an in-memory sort. To get a true picture we'd need to use figures from an external sort. So, I sorted 1 million items and the sort time was 0.00001535; then I sort 10 million items and the time was 0.00002198. That's a 43% increase in time per item, for ten times the data.
So for 100 million, say 0.00003143. And for 1 billion say: 0.00004494. Which means externally sorting 1 billion items is likely to be closer to 5 days than 4. I hope you now see that uniq'ing before the sort is imperative.
Just as not everything new is better than the old, vice versa is also true.
It's simple physics. If you squeeze a large volume of data (~10GB), through a small hole (~1MB ram as used by GNU sort), you have to shuffle it on and off disc many times. "memory may be a disc too", but it's bloody fast compared to real disc. Conversely, disc makes bloody lousy memory.
“Don’t diddle the code to make it faster: choose a better algorithm.”
Uniq'ing then sorting in memory: 108 seconds.
Pre-external sorting and then uniq'ing: 4.68125 days.
Which do you see as the better algorithm?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.