http://www.perlmonks.org?node_id=248325


in reply to Heap sorting in perl

The responses so far are very helpful, but I probably should have given a bit more context.....

The resource I am optimizing for here is memory. I have a mod_perl process that slurps up a large dataset from a database, sorts it, then displays the first N items. This happens to balloon the memory consumption of the apache process to an unacceptable level.

One approach to this, of course, is to sort the data in the database, but in this case I need the ability to define the sort function as arbritrary perl code.

Heap sorting is attractive because selecting the smallest M elements from a dataset of size N (N being much larger than M) requires storing only M elements in memory at any one time. This would be a huge win for me.

A heap is a datastructure that makes two operations very efficient: inserting a new element, and selecting (or removing) the largest element. The pseudo-code for my algorithm would look something like this:

# select the smallest $M elements; while ($new_element = $dbh->fetchrow) { if ($heap->size < $M) { # build a heap of size M $heap->insert($new_element); } elsif ($element < $heap->largest) { # maintain a heap containing # the smallest M elements # seen so far $heap->remove_largest; $heap->insert($element); } }
At the end of this loop we will have a $heap containing the M smallest elements, and never have to store more than M elements in memory at once.

Therefore, I'm looking for a decent heap implementation and my first impression of Heap.pm was somewhat discouraging. I have several paths I can take from here, but before I go hacking up Heap.pm I was wondering if there might be a better place to start.

-Blake

Replies are listed 'Best First'.
Re: Re: Heap sorting in perl
by dws (Chancellor) on Apr 05, 2003 at 20:58 UTC
    I have a mod_perl process that slurps up a large dataset from a database, sorts it, then displays the first N items. This happens to balloon the memory consumption of the apache process to an unacceptable level.

    An obvious question, perhaps, but have you ruled out getting the database to do the work for you? That'd be a no-brainer if the column you're interested in is indexed. I'm guessing it isn't. Still, having the database do the sort might still be faster than scanning the entire table, pulling the data across a process boundary a row at a time, and then filtering it in the Apache process.

    Assuming you have a fairly large table, here's an ugly heuristic you might try: Probe the first 2*M rows of the table, sort, and find the M'th largest value. Then run a query that does a sort on the database side, subsetting the data before sorting by adding a WHERE clause to reject any row whose target column has a value greater than the M'th value you found the first time through. You can tune this by trying 3*M rows, 4*M rows, etc. The idea is to reduce the set that the database needs to sort.

      Letting the datbase do the sort is a good approach, but I was hoping for a more general, memory efficient solution that allows for arbitrary perl comparison routines.

      Its a common enough situation that I would like to create a subroutine that returns the rows I want and takes a database handle, a sorting coderef, M, and an optional offset. I'm envisioning something like:

      my $sortedrows = db_heap_sort($dbh, $coderef, $M, $offset);
      The question is, is there a perl heap implementation that would be suitable for implementing the above sub. Heap.pm seems like the obvious choice, but I have doubts about its current status.

      I have emailed the Heap.pm author, and will follow up with any new information I find out.

      -Blake

      If those are rows in a database, I agree with dws, just let database do it.

      If that particular column is not indexed, I suggest index it, as you need it.

      "order by" is usually not too expansive, especially when it only involves indexed range scan.

      Depends on your database, some database directly support syntax that allows to retrieve first N rows.
        If that particular column is not indexed, I suggest index it, as you need it.

        If this is a write-mostly, query-infrequently table, indexing could significantly degrade performance. For log/audit files that record large numbers of miscellaneous value, indexing the values is rarely an effective strategy.

Re: Re: Heap sorting in perl
by Anonymous Monk on Apr 06, 2003 at 00:18 UTC
    Isn't there a solution posted already with working code which solves exactly this memory problem? It could readily be sped up substantially, and needs more testing though. But consider it proof of concept...