Re: Heap sorting in perl
by Ovid (Cardinal) on Apr 05, 2003 at 15:08 UTC

Can you give some examples where it's having problems? Perhaps we can figure out what's going on. To be fair, I haven't used Heap.pm, but I did notice that it's tests are not terribly comprehensive, so I'm not surprised. I confess a desire to see what's not working and start writing tests for it :)
If you happen to have a copy of "Mastering Algorithms with Perl", by O'Reilly, they have a heap implementation in there along with an excellent discussion. Perhaps rolling your own my help?
Cheers,
Ovid
New address of my CGI Course.
Silence is Evil (feel free to copy and distribute widely  note copyright text)
 [reply] 

Thanks for the book tip... After skimming through the "Mastering Algorithms" section on heaps and reading through the Heap.pm source, I was able to build a small script that seems to work:
#!/usr/bin/perl wT
use strict;
use Heap::Binary;
use Heap::Elem::Num;
my $heap = Heap::Binary>new;
for (25,33,107,5,42,29) {
my $elem = Heap::Elem::Num>new($_);
$heap>add( $elem );
}
my $min = $heap>extract_minimum();
print "Min: " . $min>val . "\n"; # prints 5
Unfortunately this doesn't exactly match the usage explained in the POD. It could be just a documentation bug, but it worries me a bit.
Blake
 [reply] [d/l] 

 [reply] 

I got the same error with Perl 5.8 on Linux. After a quick search turned up mixed reviews for the module, I decided to post here before making my next move. At the very least, I'll email the author of Heap with a bug report.
Blake
 [reply] 
Re: Heap sorting in perl
by blakem (Monsignor) on Apr 05, 2003 at 19:16 UTC

The responses so far are very helpful, but I probably should have given a bit more context.....
The resource I am optimizing for here is memory. I have a mod_perl process that slurps up a large dataset from a database, sorts it, then displays the first N items. This happens to balloon the memory consumption of the apache process to an unacceptable level.
One approach to this, of course, is to sort the data in the database, but in this case I need the ability to define the sort function as arbritrary perl code.
Heap sorting is attractive because selecting the smallest M elements from a dataset of size N (N being much larger than M) requires storing only M elements in memory at any one time. This would be a huge win for me.
A heap is a datastructure that makes two operations very efficient: inserting a new element, and selecting (or removing) the largest element. The pseudocode for my algorithm would look something like this:
# select the smallest $M elements;
while ($new_element = $dbh>fetchrow) {
if ($heap>size < $M) { # build a heap of size M
$heap>insert($new_element);
} elsif ($element < $heap>largest) { # maintain a heap containing
# the smallest M elements
# seen so far
$heap>remove_largest;
$heap>insert($element);
}
}
At the end of this loop we will have a $heap containing the M smallest elements, and never have to store more than M elements in memory at once.
Therefore, I'm looking for a decent heap implementation and my first impression of Heap.pm was somewhat discouraging. I have several paths I can take from here, but before I go hacking up Heap.pm I was wondering if there might be a better place to start.
Blake
 [reply] [d/l] 

I have a mod_perl process that slurps up a large dataset from a database, sorts it, then displays the first N items. This happens to balloon the memory consumption of the apache process to an unacceptable level.
An obvious question, perhaps, but have you ruled out getting the database to do the work for you? That'd be a nobrainer if the column you're interested in is indexed. I'm guessing it isn't. Still, having the database do the sort might still be faster than scanning the entire table, pulling the data across a process boundary a row at a time, and then filtering it in the Apache process.
Assuming you have a fairly large table, here's an ugly heuristic you might try: Probe the first 2*M rows of the table, sort, and find the M'th largest value. Then run a query that does a sort on the database side, subsetting the data before sorting by adding a WHERE clause to reject any row whose target column has a value greater than the M'th value you found the first time through. You can tune this by trying 3*M rows, 4*M rows, etc. The idea is to reduce the set that the database needs to sort.
 [reply] 

Letting the datbase do the sort is a good approach, but I was hoping for a more general, memory efficient solution that allows for arbitrary perl comparison routines.
Its a common enough situation that I would like to create a subroutine that returns the rows I want and takes a database handle, a sorting coderef, M, and an optional offset. I'm envisioning something like:
my $sortedrows = db_heap_sort($dbh, $coderef, $M, $offset);
The question is, is there a perl heap implementation that would be suitable for implementing the above sub. Heap.pm seems like the obvious choice, but I have doubts about its current status.
I have emailed the Heap.pm author, and will follow up with any new information I find out.
Blake
 [reply] [d/l] 

If those are rows in a database, I agree with dws, just let database do it.
If that particular column is not indexed, I suggest index it, as you need it.
"order by" is usually not too expansive, especially when it only involves indexed range scan.
Depends on your database, some database directly support syntax that allows to retrieve first N rows.
 [reply] 



Isn't there a solution posted already with working code
which solves exactly this memory problem? It could readily be sped up substantially, and needs more testing though. But consider it proof of concept...
 [reply] 
Re: Heap sorting in perl
by adrianh (Chancellor) on Apr 05, 2003 at 14:49 UTC

No comments on Heap not having ever used it.
However, unless you are dynamically adding and removing items to your dataset wouldn't something like:
sub smallest_n (&\@$) {
my ($cmp, $arrayref, $n) = @_;
return unless $n && @$arrayref;
$n = @$arrayref if @$arrayref < $n;
my @results = sort $cmp @$arrayref[0..$n1];
local ($a, $b);
$a = pop @results;
for (my $i = $n; $i < @$arrayref; $i++) {
$b = $arrayref>[$i];
if ($cmp>() == 1) {
@results = sort $cmp (@results, $b);
$a = pop @results;
};
};
return (@results, $a);
};
use Test::More tests => 1;
use List::Util qw(shuffle);
my @a = shuffle (1..100000);
my @b = smallest_n {$a <=> $b} @a, 5;
is_deeply [@b], [1,2,3,4,5];
be easier? The support for the heap datastructure will add a fair bit of overhead with your large dataset, so I wouldn't use one unless you need it.
Update: totally misready what blakem was proposing. D'oh.  [reply] [d/l] 

 [reply] 

Erm. No :)
You have to add all the items from the dataset to the heap before you can remove the N lowest (or highest, depending on the direction your grow your tree).
Yes, you could have the heap as an outofmemory structure. However, if the dataset is on disk you can just read it in element by element and use the algorithm I proposed. It's still going to be less expensive in time and space than creating a heap.
Unless you are goin to be adding and removing entries from the data set and need to keep it ordered a heap is overkill for the problem as stated.
 [reply] 




Your algorithm runs in Omega (N k log k), where N is the
size of the set, and you are interested in the k smallest
elements. Which is pretty lousy. With k for instance N / 100,
your algorithm is worse than doing a bubble sort of the entire
set, and getting the k smallest from the sorted set.
Abigail
 [reply] 
Re: Heap sorting in perl
by Anonymous Monk on Apr 05, 2003 at 15:29 UTC

Rolling your own can be educational. I did it because I thought that it would be trivial to do. By the time I finished..well I was educated. ;)
Note that I used the quicksort rather than heapsort because I assumed that the issue is memory usage, and quicksort is easier to do.
 [reply] [d/l] 
Re: Heap sorting in perl
by pg (Canon) on Apr 05, 2003 at 15:16 UTC

It seems not neccessary to have a separate sort step.
You can do this:
 First take N elements from the begining of the unsorted array, and form a new array.
 Sort the new array, which would be much fast, assuming N is much smaller than the original array size.
 Go thru the rest array element in the original array, insert a element into the sorted array, if it is smaller than the largest element of the new array, and does not exist in the new array, and also get rid of the largest element in the new array.
 This algorithm goes thru the entire original array, but only for comparing, and most of the time, no extra effort (inserting) other than comparing is needed.
(strike out per adrianh's comment)If you are willing to spend a little bit more effort, then I sugegst this improvement:
In step 3, instead of going thru the entire original array, you can chop the original array into pieces, say each piece contains 10 * N element (tweak with this 10, it could be 5, could be 50...). Sort each piece (we are sorting some smaller arrays), then only take the first N elements of the sorted piece, and go thru them.
 [reply] 

A heap is a specialized datastructure that can be thought of as an "automatically sorted array" given a somewhat scaled down definition of "sorted". Sorted in this case means that the largest element is always easy to find and remove, and inserting new elements is also easy.
Given that, our steps to find the smallest M elements would be:
 Build a heap from our first M items
 Compare next item to largest element in heap
 Replace largest with new if new < largest
 Repeat steps 2 and 3 for all remaining items
As you can see, this is very similar to the strategy you proposed. In fact, you could view heaps as a datastructure designed specifically to implement this algorithm efficiently.
Blake
 [reply] 

 [reply] 

adrianh is obviously right about this.
 [reply] 
Re: Heap sorting in perl
by AbigailII (Bishop) on Apr 06, 2003 at 21:51 UTC

#!/usr/bin/perl
use strict;
use warnings;
my @heap;
sub heapify;
sub heapify {
my $idx = shift;
my $min = $idx;
for my $try (2 * $idx + 1, 2 * $idx + 2) {
$min = $try if $try < @heap && $heap [$try] < $heap [$min]
}
return if $min == $idx;
@heap [$idx, $min] = @heap [$min, $idx];
heapify $min;
}
sub extract_min {
return unless @heap;
my $min = $heap [0];
my $tmp = pop @heap;
if (@heap) {
$heap [0] = $tmp;
heapify 0;
}
return $min;
}
#
# First argument the number of smallest elements to be returned.
# Then the list to find the elements in.
#
sub smallest {
my $k = shift;
@heap = @_;
for (my $i = int (@heap / 2); $i ;) {
heapify $i;
}
my @result;
push @result => extract_min while @heap && $k ;
@result;
}
__END__
The running time of this is O (N + k log N), where N is the
size of the set, and k the number of elements to be returned.
Abigail  [reply] [d/l] 

 [reply] 

 [reply] 





Re: Heap sorting in perl
by pg (Canon) on Apr 06, 2003 at 03:43 UTC

Read that Heap.pm, and its default implementation Heap::Fibonacci. To be frank, don't have a good impression.
The biggest problem I have with it is the storage it used. It used a tree based on hash, instead of a tree based on Nsequential array. A hash uses more memory than nsequential array, and slower in indexing as "Nsequential" is not utilized.
A Nsequential array is physically just a conventional array, but its element are specially positioned within the array, and can be easily viewed/used as a tree (in this case a heap).  [reply] 
Re: Heap sorting in perl
by aquarium (Curate) on Apr 07, 2003 at 10:24 UTC

No magic from me..sorry :)
Typical space vs speed problem. The faster (all in memory) based algorithms will eat up RAM; whilst slower (bubblesort, quicksort: even via temporary files if you need to) consume less memory but are slow.
It's more of a question of striking the right balance of space use and speed attained. If we had any idea as to the sorting required, then perhaps someone could pull a rabbit out of a hat for this onealas without knowing what SORT you need, no efficient sorting algorithm can be devised.
Chris  [reply] 