http://www.perlmonks.org?node_id=861087


in reply to Re: In-place sort with order assignment
in thread In-place sort with order assignment

Could I perchance repeat my suggestion of, “do it the COBOL way?”

No. I read you the first time.

If you read the rest of the thread, you'd see that option has already been tested and comprehensively dismissed as a possibility.

As for all the "old-timer" logic: been there, done that. (Don't ever wanna go back :).

In the mid-80's I sped up the sorting of 6 million University entrance examination results on twinned PDP-11/24s from 10 days to just over a day (by reversing the order of the joins involved).

Today, I've sorted 60% more records, in Perl, in 108 seconds.

I can't start work with a sorted file, because the input file needs to be pre-processed to extract the 1 billion words from the 100 million lines. Those 1 billion words need to be uniq'd to reduce their volume by roughly 3 orders of magnitude before sorting.
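
A minimal sketch of that extract-then-uniq step (the filename is a placeholder; the word pattern is the one I mention testing with further down the thread):

  use strict;
  use warnings;

  # Stream the 100 million lines once; split each into words and uniq
  # them with a hash, so only the ~1 million unique terms reach the sort.
  my %seen;
  open my $in, '<', 'corpus.txt' or die "corpus.txt: $!";   # hypothetical input file
  while ( my $line = <$in> ) {
      ++$seen{$_} for grep length, split /[^A-Za-z0-9.'+-]+/, $line;
  }
  close $in;

  my @sorted = sort keys %seen;   # in-memory sort of ~1e6 items, not 1e9
  printf "%d unique terms\n", scalar @sorted;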

I could (externally) sort the 1 billion words and then uniq them, but that means O(N log N) where N = 1 billion, rather than O(N log N) where N = 1 million. Now, assuming that each record takes 0.0000113 seconds to sort--that's based upon my Perl sort test; no external sort is going to achieve that--the trade-off of sorting before uniq'ing versus after is (the figures below use log base 10):

  1. 1e9 * log10(1e9) * 0.0000113s = 101,700s ≈ 28.25 hours.
  2. 1e6 * log10(1e6) * 0.0000113s = 67.8s ≈ 1 minute 7.8 seconds.

Now, those per-item timings are based upon an in-memory sort. To get a true picture we'd need figures from an external sort. So, I sorted 1 million items and the time was 0.00001535 seconds per item; then I sorted 10 million items and the time was 0.00002198 seconds per item. That's a 43% increase in time per item, for ten times the data.

So for 100 million, say 0.00003143 seconds per item; and for 1 billion, say 0.00004494. That means externally sorting 1 billion items is likely to take closer to 5 days than 4. I hope you now see that uniq'ing before the sort is imperative.
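
For anyone who wants to check the extrapolation, here it is as a snippet (the 43%-per-tenfold growth factor comes from the two measurements above):

  use strict;
  use warnings;

  # Grow the measured per-item time (0.00002198s at N = 1e7) by 43%
  # for each further tenfold increase in N: 1e7 -> 1e8 -> 1e9.
  my $per_item = 0.00002198;
  $per_item *= 1.43 for 1 .. 2;
  printf "per item at 1e9: ~%.8f seconds\n", $per_item;   # ~0.0000449

  # Total cost: N * log10(N) * per-item time, as in the figures above.
  my $days = ( 1e9 * 9 * $per_item ) / 86400;
  printf "external sort of 1e9 items: ~%.2f days\n", $days;   # ~4.68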

Just as not everything new is better than the old, the reverse is also true.

It's simple physics. If you squeeze a large volume of data (~10GB) through a small hole (the ~1MB of RAM used by GNU sort), you have to shuffle it on and off disc many times. "Memory may be a disc too", but it's bloody fast compared to real disc. Conversely, disc makes bloody lousy memory.

“Don’t diddle the code to make it faster: choose a better algorithm.”

Uniq'ing then sorting in memory: 108 seconds.

Pre-sorting externally and then uniq'ing: an estimated 4.68125 days.

Which do you see as the better algorithm?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: In-place sort with order assignment
by Limbic~Region (Chancellor) on Sep 21, 2010 at 15:37 UTC
    BrowserUk,
    As I read through this post, a couple of my own posts popped into my head (What Is A Word? and How many words does it take?). When you say "word" and "line", do you mean English-ish text? I would be pretty surprised to see a dictionary of 1+ million real words, but I guess it is possible. If this process has to be repeated multiple times, and if most, if not all, words from one run to the next will be seen again, then perhaps there is an even faster way of doing this (using a pre-built dictionary).

    Cheers - L~R

      For "words" read 'searchable terms'.

      Which would include strings of digits; proper nouns; and in a generic solution, any language.

      For testing, I'm currently using split /[^A-Za-z0-9.'+-]+/, $line; but ultimately that should be configurable.
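
      A sketch of how that might be made configurable (the sub and its default are illustrative, not the actual interface):

        use strict;
        use warnings;

        # Build a tokenizer closure from a user-supplied delimiter
        # pattern, defaulting to the character class above.
        sub make_tokenizer {
            my $delim = shift // qr/[^A-Za-z0-9.'+-]+/;
            return sub { grep length, split $delim, $_[0] };
        }

        my $tokenize = make_tokenizer();   # or make_tokenizer( qr/\W+/ )
        print join( '|', $tokenize->(q{Don't panic: 42 widgets, 3.14 pies}) ), "\n";
        # Don't|panic|42|widgets|3.14|pies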

