Re^3: Heap sorting in perl

Erm. No :-)

You have to add all the items from the dataset to the heap before you can remove the N lowest (or highest, depending on the direction your grow your tree).

Yes, you could have the heap as an out-of-memory structure. However, if the dataset is on disk you can just read it in element by element and use the algorithm I proposed. It's still going to be less expensive in time and space than creating a heap.

Unless you are goin to be adding and removing entries from the data set and need to keep it ordered a heap is overkill for the problem as stated.

Comment on Re^3: Heap sorting in perl

Replies are listed 'Best First'.
Re: Re^3: Heap sorting in perl by Anonymous Monk on Apr 05, 2003 at 15:51 UTC
I don't follow your reasoning. If you know when you start how many you will want in the end, what would possibly prevent you from adding and removing from your heap before you had all of the data in? Sure, the first "maximum" removed and thrown away might not be the biggest element in the whole set. But it wasn't in the smallest N, and that is all we care about. Who cares what order we throw away the rest in?? (As long as we throw it away before running out of memory!) As for overkill, using Perl to solve a problem that can be solved with the correct option to the Unix sort utility is overkill. But if you need a general purpose solution in Perl and you have a heap implementation there already, why not use it?	[reply]
Re^5: Heap sorting in perl by adrianh (Chancellor) on Apr 05, 2003 at 16:29 UTC
I don't follow your reasoning. We were talking at cross purposes. I thought that you were proposing to heap sort the entire dataset, not the subset of smallest elements. I admit that my using `sort` to order the smallest subset is a hack, but unless N is large or the comparison function is expensive, it will be faster than a pure-perl heap. As for overkill, using Perl to solve a problem that can be solved with the correct option to the Unix sort utility is overkill. But if you need a general purpose solution in Perl and you have a heap implementation there already, why not use it? I don't see how the Unix sort applies. We don't want to sort the whole dataset - we want the smallest N elements (and we also don't know if the dataset consists of line oriented data.)	[reply] [d/l]
Re: Re^5: Heap sorting in perl by Anonymous Monk on Apr 05, 2003 at 17:02 UTC
I was proposing to solve the problem originally stated. I thought that my description of how to solve it using a heap was fairly clear on that point by telling you when to start removing elements while still adding them. I agree that until N is large (causing, eg, memory problems) or the comparison function is expensive, using sort will be faster. The heap solution outlined is helpful only for large N. For an expensive comparison function, it is helpful once you start pulling off elements to keep track of the largest that you have in your set. Anything which is larger than or equal to that one should not be added. This reduces the number of comparisons significantly. As for the sort utility, I had thought that it had an option to only return the largest N elements. But looking at it more carefully I don't see it. It must have been somewhere else that I saw that. My bad. :-(	[reply]
Re^7: Heap sorting in perl by adrianh (Chancellor) on Apr 05, 2003 at 20:12 UTC


"be consistent"
	PerlMonks