Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Better mousetrap (getting top N values from list X)

by Limbic~Region (Chancellor)
on Feb 01, 2005 at 19:52 UTC ( #427029=perlquestion: print w/ replies, xml ) Need Help??
Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
In this node, I pointed out that it is often a waste to sort a list if you were after the top value and that a watermark algorithm should be used instead. I am not sure this rule of thumb holds true when N != 1. The size of list X and N seem to be contributing factors.

Here is the code I used in that node:

sub top_x { my ($list, $x) = @_; $x--; my @top; $#top = $x; for my $item ( @$list ) { next if defined $top[ -1 ] && $item <= $top[ -1 ]; for my $id ( 0 .. $#top ) { $top[ $id ] = $item and last if ! defined $top[ $id ]; if ( $item > $top[ $id ] ) { @top[ $id .. $#top ] = ($item, @top[ $id .. $#top - 1] +); #splice(@top, $id, 0, $item), pop @top; last; } } } return @top; }
After writing it, I wondered:
  • At what value of N does one of the 2 methods (splice vs array slice) beat the other?
  • Does the length of the list X and the value of N have to be considered together when deciding?
  • Under what circumstances is sorting better than watermarks?
  • Is there a better way to keep track of the watermarks other than the one I have shown?
  • Is there a better method, other than sorting and watermarks, for determining the top N of a list?

Cheers - L~R

Comment on Better mousetrap (getting top N values from list X)
Download Code
Re: Better mousetrap (getting top N values from list X)
by Limbic~Region (Chancellor) on Feb 01, 2005 at 20:13 UTC

    tye, a very busy man, responded in the CB that Heap::Simple, thanks to a feature patch added by him, was a better alternative. He went on to indicate an unfinished module of his, Data::Heap, would make it even simpler. Since he didn't believe he would get to reply in a timely manner, I am posting here on his behalf since I believe it has value to the discussion. Depending on how you look at Heaps, this is an answer to one of the last two bullets.

    Cheers - L~R

Re: Better mousetrap (getting top N values from list X)
by Aristotle (Chancellor) on Feb 01, 2005 at 20:33 UTC

    In Perl, the fastest watermark algorithm is probably

    sub top_x { my $n = shift; my @top = splice @_, 0, $n; @top = ( sort { $a <=> $b } $_, @top )[ 1 .. $n ] for @_; return @top; }

    With the mergesort used in newer versions of Perl and @top being nearly sorted in all iterations but the first, sort will do its work rapidly. That will almost certainly beat any explicitly spelled out algorithm except for truly large values of $n and even longer lists (like maybe selecting the top 10,000 out of 1,000,000 elements; maybe not even that). Though I'm not sure it even beats a straight sort+slice… achieving that probably requires a list of a few thousand elements.

    I had a similar wakeup call when I tried to use a heap to compete against a splice algorithm a while ago. (I can't be bothered to Super Search it right now.)

    If someone cares to benchmark this, I'd very interested to see how the numbers look in practice.

    It is sometimes frustrating, but clever Perl algorithms can very rarely beat builtins. If you want competitive algorithmic elegance, you'll have to drop back to XS. An XS call has a certain fixed overhead cost though so for small lists you might still lose.

    Makeshifts last the longest.

      See my pad for a benchmark.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.

        Why not post it in the thread? Here's an adapted bench with a version of my code modified slightly to be callable like the others:

        It is interesting to see that my code wins hands down when N/MAX is close to 1. Even when the ratio of N/MAX shrinks, my code loses a lot of ground but keeps beating Limbic's proposition. None of this matters much though since for large MAX, all of the solutions perform very similarly, even if the trends remain clear.

        So out of curiosity I added the following bit to the code:

        baseline => sub { my ( $n, $list ) = @_; return ( sort { $a <=> $b } @$list )[ @$list - $n .. $#$list ] +; },

        Well, I'll just say let's pack the bags and go home folks. Nothing to see here, move along. As I said: clever Perl code vs builtin: clever Perl code loses. Grossly disproportionately, in fact.

        (Spoiler for anyone who doesn't care to run the benchmarks: in all cases the baseline sort version runs hundreds to thousands of times faster than the other solutions.)

        Makeshifts last the longest.

Re: Better mousetrap (getting top N values from list X)
by BrowserUk (Pope) on Feb 01, 2005 at 20:47 UTC

    It will be interesting to see how this fares in a benchmark.

    #! perl -slw use strict; use List::Util qw[ reduce ]; sub topN{ my( $n, $aref ) = @_; my @topN; push @topN, reduce{ $a > $b && ( !@topN || $a < $topN[ -1 ] ) ? $a : ( !@topN || $b < $topN[ -1 ] ) ? $b : $a; } @$aref for 1 .. $n; return @topN; } my @test = 1 .. 100; print join ', ', topN 5, \@test;

    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
Re: Better mousetrap (getting top N values from list X)
by BrowserUk (Pope) on Feb 02, 2005 at 05:27 UTC

    As the benchmarks show, your algorithm, is a good one. With a few tweaks to the implementation it runs faster still:

    sub top_x { my( $n, $aref ) = @_; my @topN = (0)x$n--; for my $val ( @$aref ) { next if $topN[ $n ] > $val; $topN[ $_ ] < $val and splice( @topN, $_, 0, $val ), last for 0 .. $n; } return @topN[ 0 .. $n ]; }

    If this is more than an intellectual exercise, and you can handle Inline::C, then the same algorithm C-ified really flies:

    Update: Correct the Inline::C implementation below to avoid calloc and allow me to free the temporary C array.

    void topN( int n, AV*data ) { int *topN; int len = av_len( data ); int i, j, k; Inline_Stack_Vars; Newz( 1, topN, n + 1, int ); for( i = 0; i <= len; i++ ) { int val = SvIV( *av_fetch( data, i, 0 ) ); for( j = 0; j < n; j++ ) { if( topN[ j ] > val ) continue; if( topN[ j ] < val ) { for( k = n; k > j; k-- ) topN[ k ] = topN[ k-1 ]; topN[ j ] = val; break; } } } Inline_Stack_Reset; for( i = 0; i < n; i++ ) Inline_Stack_Push( sv_2mortal( newSViv( topN[ i ] ) ) ); Safefree( topN ); Inline_Stack_Done; }

    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
      BrowserUk,
      Thanks. I didn't spend any time trying to tweak it to squeeze every last bit of power out of it so I am pleased that it performed well. I do have a confession to make. The reason I asked the question in the first place is so that I could get a free education. Let me explain:

      While I have illusions of grandeur, I realize that many people that frequent this site are both far more knowledgeable and intelligent than I am. I haven't had any formal training on algorithms and I can't seem to force myself to do any reading on my own. I have assimilated lots of little pieces of information which allow me to write fairly efficient code naturally but I couldn't figure out the big O if you held a gun to my head. Additionally, I don't have the ability to see how changing factors, such as the length of each list, effects a given algorithm without just testing it.

      The problem with just using the TIAS approach is that you can get misleading results depending on your sample size. This isn't to say that I don't believe benchmarking isn't valuable. I do benchmark, but I seem to retain more from seeing how others solve the same problem. It is as though until they expand the boundaries of the box I am thinking in, there is only so far I can stretch my imagination.

      So again - thanks - to you, and everyone else here at the Monastery that has given me a high-priced education over the last 2 1/2 years for free.

      Cheers - L~R

        FWIW, I didn't really set out to tweak your algorithm as such. It just kind of appalled me that for many combinations of total set/desired subset sizes, the (emperically) quickest algorithm available to the Perl programmer proved to be sort'n'slice!

        Even with really quite large datasets (10k & 100k), as soon as the required subset moved above around 10%, the overhead of applying even quite a small number of perl opcodes to each value in the total set--unavoidably O(N)--outweighted the costs of performing an O(NlogN) (or whatever) sort algorithm in C.

        So I set out to see how close I could get to the C/sort performance in Perl. Starting with your algorithm was the obvious choice as it outformed everything else offered. It's canny use of short-circuiting puts it head and shoulders above the other algorithms. Then it became a case of seeing how few (and the least expensive) opcodes one could use to fulfill it.

        There may be a little more that could be trimmed from my reworking of your algorithm, but it rapidly became obvious that the only way to beat the sort'n'slice method would be to move to C--and implementing your algorithm was the obvious choice again.

        Once you start comparing like with like (ie. C with C), then the benefits of your, basically O(N), algorithm shine relative to the O(NlogN) of the sort and it wins in most (though not all) cases over the sort as you would expect.

        I'm not yet sure why the sort still wins occasionally with large subsets of large total sets--it probably comes down to the cost of extending the Perl stack to accomodate the return list, combined with the need to splice new high values into the return list as they are discovered. Perhaps someone with more XS experience than I could make it work better.

        The upshot as far as I am concerned, is a comfirmation of something I've voiced here on a few occasions. Big O notation, useful as it is, doesn't tell the whole story when (some parts of) one algorithm are performed at the C level and (some parts of) the other algorithm are performed at the Perl level. And even when both are done in C, it is very difficult to incorporate the costs of the housekeeping of an algorithm (memory allocation etc.) ) into the overall O-notation costs.

        In this case, whilst it ought to be easy to beat sort'n'slice--and is for smallish subsets of smallish total sets--it proved to be a lot harder to achieve that for the general case.

        So, whilst O-notation can give very good insights into the potential comparative costs of algorithms, in the end, a good oldfashioned benchmark of the actual implementations is always required to make the final determination.

        I've no doubt that were it possible to consider all the parts of the implementation of both algorithms in great detail, it would be possible to make the O-notation reflect the reality of them, but in my experience, that takes much longer and is much harder to do than simply code them and test it.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.
Re: Better mousetrap (getting top N values from list X)
by tall_man (Parson) on Feb 02, 2005 at 22:16 UTC
    There is one more improvement for your algorithm, Limbic~Region. Instead of a simple insertion sort on the top list, you could do a binary insertion sort. This starts to pay off as the value N increases. I made a new benchmark with topN (by BrowserUk, with a small fix I added to put back the first short-circuit test), topNbs (based on the same code but with a binary search to find the insert point), aristotle's method, BrowserUk's method, and the original limbic method:

    Here are the results:

    As you can see, the topNbs starts to pay off when we need the top 500 or so. For the top 5, topN is better.

    Update: I have fixed the redundant lines in topNbs that BrowserUk pointed out, and re-run the benchmarks. Now topNbs does as well or better than topN for all cases. Now, the last thing that would be fun to try is a C-coded heap...

      Very nice++

      Ps. The top line, and the calloc() on the second line of topNbs are artifacts from my first attempt at topN(), and are redundant :)


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
Re: Better mousetrap (getting top N values from list X)
by demerphq (Chancellor) on Feb 03, 2005 at 07:32 UTC

    Just thought i should mention that I recall discussion on P5P about optimizing sort when only N values will be used of the return. Assuming that work actually has been done I should think that

    my ($x,$y,$z)=sort @foo;

    Would be the most efficient way to do this. Problem is I really dont remember what the result of that thread was. :-(

    ---
    demerphq

      I don't suppose you recall how they were going to determine how many values to produce?

      Is that information --ie. the number of values required on the right-hand side of a list assignment--generally available to XS code, or was the intention to make a special case for the construct?


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.

        IIRC the idea was to apply much the same type of logic as occurs with split. Ie the logic that implicitly sets the third argument of the split to be N+1 where N is the number of scalar slots on the LHS of the assignment. I hazzily recall discussion on whether using an explicit slice would also do the same. I think if you trawl the p5p archives for sort and optimize youll find it. I think the basic idea was that

        my ($x,$y,$z)=sort @foo; my @top=(sort @foo)[1..$n];

        would be special cased somehow.

        As for your question about XS, I really have no idea right now. Sorry.

        ---
        demerphq

      Yeah, but if you want to extract the top 26, do you really want to write:
      my @top26 = my($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, + $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z) = sort @foo
      or would you prefer:
      my @top26 = top (26, @foo);
      I'm willing to use up to three variables - but if I only want to extract the top3, I could easily write a single pass algorithm, keeping track of the top3.

      And you'd need to write tricky evals if you don't know which top N to take during compile time, but only at run time (for instance, because N is user input).

        Having tracked down (part of) the p5p thread demerphq mentioned, the suggestion was that you would also be able to employ the short-cicuited sort by using a slice:

        @most[ 0 .. 6 ] = sort @ary; # only sort 7 entries

        I'm not familiar with how the RHS hints are derived, but that would seem to address both your concerns.

        That said, I think topNbs() as posted by tall_man could be added to List::Util quite easily and would probably be easier to get accepted because of the lower risk.

        Then again, it doesn't really use a List, so maybe it is time for an Array::Util package.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.
Re: Better mousetrap (getting top N values from list X)
by sleepingsquirrel (Hermit) on Feb 03, 2005 at 21:12 UTC
    Just as an aside, if you had lazy lists (coming in perl6) and the appropriate sort, the following should run in optimal O(X*log(N)) time, straight out of the box.
    @topN = (sort @X)[0..($N-1)];


    -- All code is 100% tested and functional unless otherwise noted.

      How would lazy lists allow that?

      You would still have to sort the whole array before you can perform the slice, because the last value in the array could be the highest.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        lazy sort


        -- All code is 100% tested and functional unless otherwise noted.
        I don't understand. If you were to extract the largest element of the list, you'd make one pass and find it, even if the largest element is at the end. No need to sort the entire array here.

        The same with finding the topN. Instead of keeping track of the largest element so far, you keep track of the largest N elements so far. If you do it in a heap, you can add an element, or remove the smallest element in O(log N) time. Still need one pass through the array. Don't have to sort the array. Note that if N equals the size of the list, you perform a sort in O(N log N) time. Which is optimal.

Re: Better mousetrap (getting top N values from list X)
by ambrus (Abbot) on Feb 05, 2005 at 22:33 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://427029]
Approved by Old_Gray_Bear
Front-paged by tall_man
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (10)
As of 2014-07-10 17:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (214 votes), past polls