in reply to Re^2: mathematical proof in thread mathematical proof
The result is that building either the hash or the array is O(n).
Ah yes, I see.
If the data structures are big enough that they live on disk
While there are other data structures available to him (such as disk-based structures and tries), I chose to speak only about the ones he mentioned (hashes and arrays) due to time constraints.
Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort.
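The "slightly clever" merge step can be sketched as follows: duplicates are dropped the moment two runs meet, so the later (larger) passes have less data to move. This is an illustrative sketch, not code from the thread; it assumes each input run is sorted and already duplicate-free, which holds if every pass merges this way (singleton runs are trivially duplicate-free).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge two sorted, internally duplicate-free runs, emitting each
# distinct value exactly once.  Equal heads are emitted once and both
# cursors advance, so duplicates vanish as early as possible.
sub merge_unique {
    my ($x, $y) = @_;
    my @out;
    my ($i, $j) = (0, 0);
    while ($i < @$x && $j < @$y) {
        my $cmp = $x->[$i] cmp $y->[$j];
        if    ($cmp < 0) { push @out, $x->[$i++] }
        elsif ($cmp > 0) { push @out, $y->[$j++] }
        else             { push @out, $x->[$i++]; $j++ }  # equal: emit once
    }
    # Leftover tail items are strictly greater than everything emitted.
    push @out, @{$x}[$i .. $#$x], @{$y}[$j .. $#$y];
    return \@out;
}

my $merged = merge_unique([qw(a c e)], [qw(b c f)]);
print "@$merged\n";  # a b c e f
```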
You'll reduce the size of large passes you never have to do with a hash.
You'll speed up the sort you don't have to do with a hash.
With a hash, you never have to deal with more than $num_duplicates items. With an array, you'll deal with at least $num_duplicates items. I don't understand why you say the array benefits more.
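For reference, the hash approach under discussion is a single O(n) pass: every input item costs one hash lookup/insert, and the hash itself never holds more than the number of distinct keys. (An illustrative sketch, not code from the thread.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One pass over the input; %seen stores only the distinct keys.
my @data = qw(a b a c b a);
my %seen;
my @uniq = grep { !$seen{$_}++ } @data;
print "@uniq\n";  # a b c
```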
Re^4: mathematical proof by tilly (Archbishop) on Feb 03, 2009 at 17:27 UTC 
Hashes and arrays can both be implemented as in-memory data structures or as on-disk data structures. Therefore it is perfectly reasonable to talk about what each looks like in memory and on disk.
In Perl the two options don't even look different for hashes; you just add an appropriate tie. The difference is slightly larger for arrays, because you don't want to use the built-in sort on a large array that lives on disk.
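The tie in question can be sketched like this: the dedup code is unchanged, but %seen now lives on disk. The sketch uses SDBM_File only because it ships with core Perl; DB_File or BerkeleyBD-backed modules are drop-in alternatives without SDBM's small record-size limits.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Tie %seen to an on-disk SDBM file; everything after the tie is the
# same code you would write for an in-memory hash.
my $dir = tempdir(CLEANUP => 1);
tie my %seen, 'SDBM_File', "$dir/seen", O_RDWR | O_CREAT, 0666
    or die "Cannot tie: $!";

my @uniq = grep { !$seen{$_}++ } qw(a b a c b a);
print "@uniq\n";  # a b c

untie %seen;
```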
As for my duplicates comment, I am not denying that a hash is better than an array whether or not there are duplicates. However, the speed of accessing a hash is basically independent of how many duplicates there are in your incoming data. (There are subtle variations depending on, for instance, whether you are just before or after a hash split, but let's ignore that.) The speed of doing a merge sort that eliminates duplicates ASAP varies greatly depending on the mix of duplicates in your incoming data. Therefore the array solution improves relative to the hash as you increase the number of duplicates. This doesn't make the array solution better than the hash; it just narrows the gap.
In fact, in the extreme case where you have a fixed number of distinct lines in your data, the array solution improves from O(n log(n)) to O(n). The hash solution does not improve; it is O(n) regardless.
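The extreme case can be made concrete with an instrumented bottom-up merge sort that drops duplicates at every merge and counts how many items the passes handle in total. With a fixed pool of distinct values, every run's length is capped by the pool size, so the count grows linearly in n instead of n log n. (A hypothetical sketch to illustrate the argument, not tilly's code.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bottom-up merge sort with duplicate elimination in every merge.
# Returns the sorted distinct values and the total number of items
# handled across all passes.
sub sort_unique_cost {
    my @runs = map { [$_] } @_;
    my $cost = 0;
    while (@runs > 1) {
        my @next;
        while (@runs) {
            my $x = shift @runs;
            my $y = shift(@runs) // [];   # odd run carried through
            my @m;
            while (@$x && @$y) {
                my $cmp = $x->[0] <=> $y->[0];
                if    ($cmp < 0) { push @m, shift @$x }
                elsif ($cmp > 0) { push @m, shift @$y }
                else             { push @m, shift @$x; shift @$y }
                $cost++;
            }
            $cost += @$x + @$y;           # count tail items moved
            push @m, @$x, @$y;
            push @next, \@m;
        }
        @runs = @next;
    }
    return ($runs[0], $cost);
}

# 16 items drawn from only two distinct values: runs collapse to
# length 2 after the first pass, so later passes stay tiny.
my ($sorted, $cost) = sort_unique_cost((1, 2) x 8);
print "@$sorted, cost=$cost\n";  # 1 2, cost=30  (vs. 64 = n log2 n)
```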
