in reply to
mathematical proof
Most hash insertions are O(1). When the hash has to be expanded, the insertion becomes O(N). Since Perl grows hashes exponentially, inserting N elements is O(N log N). ( See tilly's reply. Inserting N elements into a hash insertions is O(N). )
Most array appends are O(1). When the array has to be expanded, the append becomes O(N). Since Perl grows arrays exponentially, inserting N elements (one at a time) is O(N log N). ( See tilly's reply. Inserting N elements into a hash insertions is O(N). ) Inserting all N elements at once is just O(N), though. Once the array is loaded, you need to sort it, for a total of O(N log N).
So you end up with O(N log N) either way. The question that remains is what's factor, and how close to the bound (N log N) is the actual performance for your actual data.
The percentage of duplicates is going to matter. The more duplicates, the better the hash approach will be.
Re^2: mathematical proof by tilly (Archbishop) on Feb 03, 2009 at 05:02 UTC 
No, no, no. Perl's strategy for growing data structures makes the amortized average insert time O(1). So growing a data structure of size n is O(n), not O(n log(n)).
For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice. Half of them were around the previous time, and so have been around 3 times. And so on. So the total amount of moving is
n + n/2 + n/4 + ...
= (1 + 1/2 + 1/4 + ...) * n
= 2n
= O(n)
thanks to the wellknown geometric series.
With arrays the logic is similar except that each time we only grow an extra 1/5, and the overhead turns out to be 6n.
The result is that building either the hash or the array is O(n). But sorting the array is O(n log(n)). Therefore the hash scales better.
But (very big but) note that log(n) grows very slowly. If the data structures are big enough that they live on disk, then accessing your hash involves a disk seek, while merge sort involves accessing data on disk sequentially. Disk seeks are painfully slow, while accessing data on disk sequentially is orders of magnitude more efficient. That means that for data that lives on disk, the array approach is faster. Given the necessary numbers, generally by a couple of orders of magnitude.
But (very small but) hashes still scale better than arrays. So for very big data structures, hashing will again be faster than the array approach. However you are extremely unlikely to ever encounter a data set that large, if you did it probably wouldn't fit on your hard drive, and processing it sequentially by either approach would be too slow anyways. So you will want to parallelize your data processing. And when you do that, you will find that merge sort parallelizes better than hash lookups so arrays still win.
Thereby demonstrating that the theoretical bigO is not always the end of the story.
A final random note. Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort. Hashing, by contrast, goes at the same exact speed but will take up less memory. (Until you seek to disk...)
Update: Lots of minor edits to improve wording.  [reply] [d/l] 

For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice.
That assumes a hash split only after Ω(n) new inserts. But then in the worst case scenario where all elements hash to the same bucket, searching for a key has to walk a list, making the search linear instead of O(1).
I know hashes have changed in the past so that the linked list hanging of each bucket cannot grow unbounded. But that means that worst case, you either have hash splits in o(n) (lowercase o), or search times that aren't O(1).
Any "mathematical proof" will have to look at the C code and consider how often hash splits occur worst case, and how long the linked lists can be.
 [reply] 

As soon as someone talks about hash inserts being O(1) I assume that they are talking about the average case performance when your hash algorithm is working. I'm sorry I didn't make that explicit. If you wish to qualify everything I said about hashes with "in the average case", do so because that is correct. But if you wish to mix an analysis of the average case with comments about a worst case type of scenario, that's wrong.
Incidentally if you look at the worst case performance of a hash that uses linked lists for the buckets, then the hash access is O(n) which means that building a hash with n buckets cannot be better than O(n*n). Which clearly loses to the array solution. Conditional logic to try and reduce the likelyhood of worst case performance cannot improve this fact unless you replace the linked list in the bucket with something else (like a btree).
Changing the structure of the buckets will make the average case hash access O(1) and worst case O(log(n)) but complicates the code and makes the constant for a hash access worse. People tend to notice the average case performance, and so do not make that change. Real example: when abuse of the hashing function showed up as a security threat in Perl, multiple options were considered. In the end rather than making the worst case performance better, it was decided to randomize the hashing algorithm so that attackers cannot predict what data sets will cause performance problems.
 [reply] 




The result is that building either the hash or the array is O(n).
Ah yes, I see.
If the data structures are big enough that they live on disk
While there are other data structures available to him (such as diskbased structures and tries), I chose to only speak about the ones he mentioned (hashes and arrays) due to time constraints.
Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort.
You'll reduce the size of large passes you never have to do with a hash.
You'll speed up the sort you don't have to do with a hash.
With a hash, you never have to deal with more than $num_duplicates items. With an array, you'll deal with at least $num_duplicates items. I don't understand how come you say the array benefits more.
 [reply] 

Hashes and arrays can both be implemented as in memory data structures, or on disk data structures. Therefore it is perfectly reasonable to talk about what each looks like in memory and on disk.
In Perl the two options don't even look different for hashes, you just add an appropriate tie. The difference is slightly larger for arrays because you don't want to use the builtin sort on a large array that lives on disk.
As for my duplicates comment, I am not denying that a hash is better than an array whether or not there are duplicates. However the speed of accessing a hash is basically independent of how many duplicates there are in your incoming data structure. (There are subtle variations depending on, for instance, whether you are just before or after a hash split, but let's ignore that.) The speed of doing a merge sort that eliminates duplicates ASAP varies greatly depending on the mix of duplicates in your incoming data structure. Therefore the array solution improves relative to the hash as you increase the number of duplicates. This doesn't make the array solution
In fact in the extreme case where you have a fixed number of distinct lines in your data structure, the array solution improves from O(n log(n)) to O(n). The hash solution does not improve, it is O(n) regardless.
 [reply] 

