in reply to Re^2: mathematical proof
in thread mathematical proof

The result is that building either the hash or the array is O(n).

Ah yes, I see.

If the data structures are big enough that they live on disk

While there are other data structures available to him (such as disk-based structures and tries), I chose to only speak about the ones he mentioned (hashes and arrays) due to time constraints.

Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort.

You'll reduce the size of large passes you never have to do with a hash.

You'll speed up the sort you don't have to do with a hash.

With a hash, you never have to deal with more than $num_duplicates items. With an array, you'll deal with at least $num_duplicates items. I don't understand why you say the array benefits more.

Re^4: mathematical proof
by tilly (Archbishop) on Feb 03, 2009 at 17:27 UTC
    Hashes and arrays can both be implemented as in-memory data structures or as on-disk data structures. Therefore it is perfectly reasonable to talk about what each looks like in memory and on disk.

    In Perl the two options don't even look different for hashes: you just add an appropriate tie. The difference is slightly larger for arrays, because you don't want to use the built-in sort on a large array that lives on disk.
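To illustrate the point about tie, here is a minimal sketch. It assumes the core SDBM_File module as the on-disk backend (any DBM module with a tie interface should work similarly), and `unique_lines` is a hypothetical helper name, not something from this thread. Note that SDBM limits the combined size of a key and its value, so very long lines would need a different backend.

```perl
use strict;
use warnings;
use Fcntl;                      # supplies O_RDWR and O_CREAT
use SDBM_File;                  # core DBM module, assumed here as the disk backend
use File::Temp qw(tempdir);

# Hypothetical helper: return the distinct lines in first-seen order,
# using whatever hash reference is passed in to remember what was seen.
sub unique_lines {
    my ($seen, @lines) = @_;
    return grep { !$seen->{$_}++ } @lines;
}

my @lines = qw(apple pear apple fig pear apple);

# Ordinary in-memory hash.
my %in_memory;
my @from_memory = unique_lines(\%in_memory, @lines);

# Same code, but the hash is tied to a DBM file on disk.
my $dir = tempdir(CLEANUP => 1);
tie my %on_disk, 'SDBM_File', "$dir/seen", O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";
my @from_disk = unique_lines(\%on_disk, @lines);
untie %on_disk;

print "@from_memory\n";         # apple pear fig
print "@from_disk\n";           # apple pear fig
```

The deduplication code is identical in both cases; only the tie line differs.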

    As for my duplicates comment, I am not denying that a hash is better than an array whether or not there are duplicates. However, the speed of accessing a hash is basically independent of how many duplicates there are in your incoming data structure. (There are subtle variations depending on, for instance, whether you are just before or after a hash split, but let's ignore that.) The speed of doing a merge sort that eliminates duplicates ASAP varies greatly depending on the mix of duplicates in your incoming data structure. Therefore the array solution improves relative to the hash as you increase the number of duplicates. This doesn't make the array solution

    In fact, in the extreme case where you have a fixed number of distinct lines in your data structure, the array solution improves from O(n log(n)) to O(n). The hash solution does not improve; it is O(n) regardless.
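The effect can be sketched with a merge sort that drops duplicates during every merge. This is only an illustration under assumptions: `merge_unique`, `sort_unique`, and `count_comparisons` are hypothetical names, the input sizes are arbitrary, and comparisons are counted as a rough stand-in for running time. With k distinct values, no merged list ever grows beyond k items, so each of the n - 1 merges costs at most 2k - 1 comparisons.

```perl
use strict;
use warnings;

my $comparisons = 0;

# Merge two sorted, duplicate-free lists into one sorted, duplicate-free list.
sub merge_unique {
    my ($left, $right) = @_;
    my @out;
    my ($i, $j) = (0, 0);
    while ($i < @$left && $j < @$right) {
        $comparisons++;
        my $cmp = $left->[$i] cmp $right->[$j];
        if    ($cmp < 0) { push @out, $left->[$i++] }
        elsif ($cmp > 0) { push @out, $right->[$j++] }
        else             { push @out, $left->[$i++]; $j++ }  # drop the duplicate
    }
    # At most one of these slices is non-empty.
    push @out, @{$left}[$i .. $#$left], @{$right}[$j .. $#$right];
    return \@out;
}

# Top-down merge sort that eliminates duplicates at every merge.
sub sort_unique {
    my @items = @_;
    return [@items] if @items <= 1;
    my $mid   = int(@items / 2);
    my $left  = sort_unique(@items[0 .. $mid - 1]);
    my $right = sort_unique(@items[$mid .. $#items]);
    return merge_unique($left, $right);
}

# Sort a list and report (number of distinct items, comparisons used).
sub count_comparisons {
    my @items = @_;
    $comparisons = 0;
    my $sorted = sort_unique(@items);
    return (scalar @$sorted, $comparisons);
}

my $n    = 4096;
my @few  = map { $_ % 4 } 0 .. $n - 1;                          # 4 distinct values
my @many = map { sprintf "%04d", ($_ * 37) % $n } 0 .. $n - 1;  # all distinct

my ($few_distinct,  $few_cmp)  = count_comparisons(@few);
my ($many_distinct, $many_cmp) = count_comparisons(@many);

# The hash approach does one lookup per input line in both cases.
my %seen;
my @hash_unique = grep { !$seen{$_}++ } @few;

printf "few distinct:  %d items, %d comparisons\n", $few_distinct,  $few_cmp;
printf "all distinct:  %d items, %d comparisons\n", $many_distinct, $many_cmp;
printf "hash approach: %d items, %d lookups\n", scalar @hash_unique, $n;
```

On the input with few distinct values the comparison count should stay proportional to n, while the all-distinct input pays the usual n log(n) cost. The hash does n lookups either way.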