comment on

Could I perchance repeat my suggestion of, “do it the COBOL way?”

No. I read you the first time.

If you read the rest of the thread, you'd see that option has already been tested and comprehensively dismissed as a possibility.

As for all the "old-timer" logic. Been there, done that. (Don't ever wanna go back:).

In the mid-80's I sped up the sorting of 6 million University entrance examination results on twinned PDP-11/24s from 10 days to just over a day (by reversing the order of the joins involved).

Today, I've sorted 60% more records, in perl, in 108 seconds.

I can't start work with a sorted file, because the input file needs to be pre-processed to extract the 1 billion words from the 100 million lines. Those 1 billion words need to be uniq'd to reduce their volume by roughly 3 orders of magnitude before sorting.

I could (externally) sort the 1 billion words; then uniq them, but that means O(N log N) where N is 1 billion, rather than O(N log N) where N = 1 million. Now assuming that each record takes 0.0000113 to sort--that's based upon my Perl sort test, no external sort is going to achieve that--the trade of sorting before uniq'ing versus after is:

1e9 * log 1e9 * 0.0000113 = 28.25 hours.
1e6 * log 1e6 * 0.0000113 = 1 minute 7.8 seconds.

Now, those per item timings are based upon an in-memory sort. To get a true picture we'd need to use figures from an external sort. So, I sorted 1 million items and the sort time was 0.00001535; then I sort 10 million items and the time was 0.00002198. That's a 43% increase in time per item, for ten times the data.

So for 100 million, say 0.00003143. And for 1 billion say: 0.00004494. Which means externally sorting 1 billion items is likely to be closer to 5 days than 4. I hope you now see that uniq'ing before the sort is imperative.

Just as not everything new is better than the old, vice versa is also true.

It's simple physics. If you squeeze a large volume of data (~10GB), through a small hole (~1MB ram as used by GNU sort), you have to shuffle it on and off disc many times. "memory may be a disc too", but it's bloody fast compared to real disc. Conversely, disc makes bloody lousy memory.

“Don’t diddle the code to make it faster: choose a better algorithm.”

Uniq'ing then sorting in memory: 108 seconds.

Pre-external sorting and then uniq'ing: 4.68125 days.

Which do you see as the better algorithm?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

RIP an inspiration; A true Folk's Guy

In reply to Re^2: In-place sort with order assignment by BrowserUk
in thread In-place sort with order assignment by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


There's more than one way to do things
	PerlMonks