Better mousetrap (getting top N values from list X)

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Better mousetrap (getting top N values from list X) by Aristotle (Chancellor) on Feb 01, 2005 at 20:33 UTC
In Perl, the fastest watermark algorithm is probably `sub top_x { my $n = shift; my @top = splice @_, 0, $n; @top = ( sort { $a <=> $b } $_, @top )[ 1 .. $n ] for @_; return @top; }` With the mergesort used in newer versions of Perl and `@top` being nearly sorted in all iterations but the first, sort will do its work rapidly. That will almost certainly beat any explicitly spelled out algorithm except for truly large values of `$n` and even longer lists (like maybe selecting the top 10,000 out of 1,000,000 elements; maybe not even that). Though I'm not sure it even beats a straight sort+slice… achieving that probably requires a list of a few thousand elements. I had a similar wakeup call when I tried to use a heap to compete against a splice algorithm a while ago. (I can't be bothered to Super Search it right now.) If someone cares to benchmark this, I'd be very interested to see how the numbers look in practice. It is sometimes frustrating, but clever Perl algorithms can very rarely beat builtins. If you want competitive algorithmic elegance, you'll have to drop back to XS. An XS call has a certain fixed overhead cost, though, so for small lists you might still lose. Makeshifts last the longest.
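For readers who find the one-liner dense, here is a minimal expanded sketch of the same idea. The name `top_n_watermark` and the sample data are made up for illustration; the final sort is an addition so that the result is ordered even when the list is shorter than `$n`.

```perl
use strict;
use warnings;

# Hypothetical name; an expanded spelling of the watermark one-liner.
sub top_n_watermark {
    my ( $n, @list ) = @_;
    return if $n <= 0;

    # Seed the candidate set with the first $n values.
    my @top = splice @list, 0, $n;

    for my $val (@list) {
        # Sort the $n candidates plus the newcomer, then drop index 0,
        # which holds the smallest of the $n + 1 values.
        @top = ( sort { $a <=> $b } $val, @top )[ 1 .. $n ];
    }

    # Not in the original: guarantees ascending order for short lists too.
    @top = sort { $a <=> $b } @top;
    return @top;
}

my @data = ( 17, 3, 42, 8, 99, 23, 4, 15, 16, 1 );
my @top3 = top_n_watermark( 3, @data );
print "@top3\n";    # 23 42 99
```

Each pass sorts only `$n + 1` values, which is why the approach leans so heavily on the builtin `sort` being fast on nearly sorted input.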
Re^2: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 01, 2005 at 21:45 UTC
See my pad for a benchmark. Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re^3: Better mousetrap (getting top N values from list X) by Aristotle (Chancellor) on Feb 02, 2005 at 00:30 UTC
Why not post it in the thread? Here's an adapted bench with a version of my code modified slightly to be callable like the others: [benchmark code not included] It is interesting to see that my code wins hands down when N/MAX is close to 1. As the ratio N/MAX shrinks, my code loses a lot of ground but still beats Limbic's proposal. None of this matters much, though, since for large MAX all of the solutions perform very similarly, even if the trends remain clear. So out of curiosity I added the following bit to the code: `baseline => sub { my ( $n, $list ) = @_; return ( sort { $a <=> $b } @$list )[ @$list - $n .. $#$list ]; },` Well, I'll just say let's pack the bags and go home, folks. Nothing to see here, move along. As I said: clever Perl code vs builtin: clever Perl code loses. Grossly disproportionately, in fact. (Spoiler for anyone who doesn't care to run the benchmarks: in all cases the baseline sort version runs hundreds to thousands of times faster than the other solutions.) Makeshifts last the longest.
Re^4: Better mousetrap (getting top N values from list X) by tall_man (Parson) on Feb 02, 2005 at 01:00 UTC
Re^5: Better mousetrap (getting top N values from list X) by Aristotle (Chancellor) on Feb 02, 2005 at 06:43 UTC
Re: Better mousetrap (getting top N values from list X) by Limbic~Region (Chancellor) on Feb 01, 2005 at 20:13 UTC
tye, a very busy man, responded in the CB that Heap::Simple, thanks to a feature patch added by him, was a better alternative. He went on to indicate an unfinished module of his, Data::Heap, would make it even simpler. Since he didn't believe he would get to reply in a timely manner, I am posting here on his behalf since I believe it has value to the discussion. Depending on how you look at Heaps, this is an answer to one of the last two bullets. Cheers - L~R
Re: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 02, 2005 at 05:27 UTC
As the benchmarks show, your algorithm is a good one. With a few tweaks to the implementation it runs faster still: `sub top_x { my( $n, $aref ) = @_; my @topN = (0)x$n--; for my $val ( @$aref ) { next if $topN[ $n ] > $val; $topN[ $_ ] < $val and splice( @topN, $_, 0, $val ), last for 0 .. $n; } return @topN[ 0 .. $n ]; }` If this is more than an intellectual exercise, and you can handle Inline::C, then the same algorithm C-ified really flies: Update: Corrected the Inline::C implementation below to avoid calloc and allow me to free the temporary C array. `void topN( int n, AV* data ) { int *topN; int len = av_len( data ); int i, j, k; Inline_Stack_Vars; Newz( 1, topN, n + 1, int ); for( i = 0; i <= len; i++ ) { int val = SvIV( *av_fetch( data, i, 0 ) ); for( j = 0; j < n; j++ ) { if( topN[ j ] > val ) continue; if( topN[ j ] < val ) { for( k = n; k > j; k-- ) topN[ k ] = topN[ k-1 ]; topN[ j ] = val; break; } } } Inline_Stack_Reset; for( i = 0; i < n; i++ ) Inline_Stack_Push( sv_2mortal( newSViv( topN[ i ] ) ) ); Safefree( topN ); Inline_Stack_Done; }` Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re^2: Better mousetrap (getting top N values from list X) by Limbic~Region (Chancellor) on Feb 02, 2005 at 13:49 UTC
BrowserUk, Thanks. I didn't spend any time trying to tweak it to squeeze every last bit of power out of it, so I am pleased that it performed well. I do have a confession to make. The reason I asked the question in the first place is so that I could get a free education. Let me explain: While I have illusions of grandeur, I realize that many people who frequent this site are both far more knowledgeable and intelligent than I am. I haven't had any formal training on algorithms and I can't seem to force myself to do any reading on my own. I have assimilated lots of little pieces of information which allow me to write fairly efficient code naturally, but I couldn't figure out the big O if you held a gun to my head. Additionally, I don't have the ability to see how changing factors, such as the length of each list, affect a given algorithm without just testing it. The problem with just using the TIAS approach is that you can get misleading results depending on your sample size. This isn't to say that I believe benchmarking isn't valuable. I do benchmark, but I seem to retain more from seeing how others solve the same problem. It is as though until they expand the boundaries of the box I am thinking in, there is only so far I can stretch my imagination. So again - thanks - to you, and everyone else here at the Monastery who has given me a high-priced education over the last 2 1/2 years for free. Cheers - L~R
Re^3: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 02, 2005 at 18:04 UTC
FWIW, I didn't really set out to tweak your algorithm as such. It just kind of appalled me that for many combinations of total set/desired subset sizes, the (empirically) quickest algorithm available to the Perl programmer proved to be sort'n'slice! Even with really quite large datasets (10k & 100k), as soon as the required subset moved above around 10%, the overhead of applying even quite a small number of perl opcodes to each value in the total set--unavoidably O(N)--outweighed the costs of performing an O(NlogN) (or whatever) sort algorithm in C. So I set out to see how close I could get to the C/sort performance in Perl. Starting with your algorithm was the obvious choice as it outperformed everything else offered. Its canny use of short-circuiting puts it head and shoulders above the other algorithms. Then it became a case of seeing how few (and the least expensive) opcodes one could use to fulfill it. There may be a little more that could be trimmed from my reworking of your algorithm, but it rapidly became obvious that the only way to beat the sort'n'slice method would be to move to C--and implementing your algorithm was the obvious choice again. Once you start comparing like with like (i.e. C with C), then the benefits of your, basically O(N), algorithm shine relative to the O(NlogN) of the sort, and it wins in most (though not all) cases over the sort, as you would expect. I'm not yet sure why the sort still wins occasionally with large subsets of large total sets--it probably comes down to the cost of extending the Perl stack to accommodate the return list, combined with the need to splice new high values into the return list as they are discovered. Perhaps someone with more XS experience than I could make it work better. The upshot, as far as I am concerned, is a confirmation of something I've voiced here on a few occasions.
Big O notation, useful as it is, doesn't tell the whole story when (some parts of) one algorithm are performed at the C level and (some parts of) the other algorithm are performed at the Perl level. And even when both are done in C, it is very difficult to incorporate the costs of the housekeeping of an algorithm (memory allocation etc.) into the overall O-notation costs. In this case, whilst it ought to be easy to beat sort'n'slice--and is for smallish subsets of smallish total sets--it proved to be a lot harder to achieve that for the general case. So, whilst O-notation can give very good insights into the potential comparative costs of algorithms, in the end, a good old-fashioned benchmark of the actual implementations is always required to make the final determination. I've no doubt that were it possible to consider all the parts of the implementation of both algorithms in great detail, it would be possible to make the O-notation reflect the reality of them, but in my experience, that takes much longer and is much harder to do than simply coding them and testing them. Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
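Since the actual benchmark code from the thread is not reproduced here, the following is only a rough sketch of how such a comparison could be set up with the core Benchmark module. The names `sort_slice`, `single_pass` and `run_bench` are hypothetical, `single_pass` is a generic stand-in rather than any poster's exact code, and both routines assume `$n` does not exceed the list length.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sort everything with the builtin, then slice off the last $n values.
sub sort_slice {
    my ( $n, $list ) = @_;
    my $len = @$list;
    return ( sort { $a <=> $b } @$list )[ $len - $n .. $len - 1 ];
}

# One pass in pure Perl, keeping the best $n so far in descending order.
sub single_pass {
    my ( $n, $list ) = @_;
    my @top;
    for my $val (@$list) {
        # Short-circuit: skip values that cannot enter a full top list.
        next if @top == $n && $val <= $top[-1];
        my $i = 0;
        $i++ while $i < @top && $top[$i] >= $val;
        splice @top, $i, 0, $val;
        pop @top if @top > $n;
    }
    my @ascending = reverse @top;
    return @ascending;
}

# Compare the two on random data; each candidate runs for about 1 CPU second.
sub run_bench {
    my ( $size, $n ) = @_;
    my @data = map { int rand 1_000_000 } 1 .. $size;
    cmpthese( -1, {
        sort_slice  => sub { my @t = sort_slice( $n, \@data ) },
        single_pass => sub { my @t = single_pass( $n, \@data ) },
    } );
}

# Only benchmark when asked to, e.g.: perl bench.pl 100000 10
run_bench(@ARGV) if @ARGV == 2;
```

Varying both the list size and `$n` matters here, since the thread's observation is that the winner changes with the ratio between them.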
Re: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 01, 2005 at 20:47 UTC
It will be interesting to see how this fares in a benchmark. `#! perl -slw use strict; use List::Util qw[ reduce ]; sub topN{ my( $n, $aref ) = @_; my @topN; push @topN, reduce{ $a > $b && ( !@topN || $a < $topN[ -1 ] ) ? $a : ( !@topN || $b < $topN[ -1 ] ) ? $b : $a; } @$aref for 1 .. $n; return @topN; } my @test = 1 .. 100; print join ', ', topN 5, \@test;` Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re: Better mousetrap (getting top N values from list X) by tall_man (Parson) on Feb 02, 2005 at 22:16 UTC
There is one more improvement for your algorithm, Limbic~Region. Instead of a simple insertion sort on the top list, you could do a binary insertion sort. This starts to pay off as the value N increases. I made a new benchmark with topN (by BrowserUk, with a small fix I added to put back the first short-circuit test), topNbs (based on the same code but with a binary search to find the insert point), Aristotle's method, BrowserUk's method, and the original limbic method: [benchmark code not included] Here are the results: [results not included] As you can see, topNbs starts to pay off when we need the top 500 or so. For the top 5, topN is better. Update: I have fixed the redundant lines in topNbs that BrowserUk pointed out, and re-run the benchmarks. Now topNbs does as well as or better than topN for all cases. Now, the last thing that would be fun to try is a C-coded heap...
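The topNbs code itself is not shown above, so the following is only a pure-Perl sketch of the binary insertion idea, not tall_man's implementation. The name `top_n_bsearch` is hypothetical; the result comes back in descending order.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the $n largest values in @top, sorted
# descending, using binary search to locate each insertion point.
sub top_n_bsearch {
    my ( $n, $list ) = @_;
    return if $n <= 0;
    my @top;
    for my $val (@$list) {
        # Short-circuit: the list is full and $val does not beat its minimum.
        next if @top == $n && $val <= $top[-1];

        # Binary search for the first index holding a value below $val.
        my ( $lo, $hi ) = ( 0, scalar @top );
        while ( $lo < $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            if   ( $top[$mid] >= $val ) { $lo = $mid + 1 }
            else                        { $hi = $mid }
        }

        splice @top, $lo, 0, $val;
        pop @top if @top > $n;    # drop the value that fell off the end
    }
    return @top;
}

my @top3 = top_n_bsearch( 3, [ 5, 1, 9, 3, 7, 9, 2 ] );
print "@top3\n";    # 9 9 7
```

The binary search cuts the comparisons per insertion from O(N) to O(log N). The `splice` still has to shift elements, which may be why the gain only shows up once N is fairly large.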
Re^2: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 02, 2005 at 23:28 UTC
Very nice++ P.S. The top line, and the `calloc()` on the second line of topNbs are artifacts from my first attempt at topN(), and are redundant :) Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re: Better mousetrap (getting top N values from list X) by demerphq (Chancellor) on Feb 03, 2005 at 07:32 UTC
Just thought I should mention that I recall discussion on P5P about optimizing sort when only N values of the return will be used. Assuming that work actually has been done, I should think that `my ($x,$y,$z)=sort @foo;` would be the most efficient way to do this. Problem is I really don't remember what the result of that thread was. :-( --- demerphq
Re^2: Better mousetrap (getting top N values from list X) by Anonymous Monk on Feb 03, 2005 at 10:12 UTC
Yeah, but if you want to extract the top 26, do you really want to write: `my @top26 = my($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z) = sort @foo` or would you prefer: `my @top26 = top (26, @foo);` I'm willing to use up to three variables - but if I only want to extract the top 3, I could easily write a single-pass algorithm, keeping track of the top 3. And you'd need to write tricky evals if you don't know which top N to take at compile time, but only at run time (for instance, because N is user input).
Re^3: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 03, 2005 at 10:33 UTC
Having tracked down (part of) the p5p thread demerphq mentioned, the suggestion was that you would also be able to employ the short-circuited sort by using a slice: `@most[ 0 .. 6 ] = sort @ary; # only sort 7 entries` I'm not familiar with how the RHS hints are derived, but that would seem to address both your concerns. That said, I think topNbs() as posted by tall_man could be added to List::Util quite easily and would probably be easier to get accepted because of the lower risk. Then again, it doesn't really use a List, so maybe it is time for an Array::Util package. Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re^4: Better mousetrap (getting top N values from list X) by tall_man (Parson) on Feb 03, 2005 at 18:36 UTC
Re^5: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 03, 2005 at 22:17 UTC
Re^2: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 03, 2005 at 09:18 UTC
I don't suppose you recall how they were going to determine how many values to produce? Is that information--i.e. the number of values required on the right-hand side of a list assignment--generally available to XS code, or was the intention to make a special case for the construct? Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re^3: Better mousetrap (getting top N values from list X) by demerphq (Chancellor) on Feb 03, 2005 at 09:25 UTC
IIRC the idea was to apply much the same type of logic as occurs with split, i.e. the logic that implicitly sets the third argument of split to be N+1, where N is the number of scalar slots on the LHS of the assignment. I hazily recall discussion on whether using an explicit slice would also do the same. I think if you trawl the p5p archives for sort and optimize you'll find it. I think the basic idea was that `my ($x,$y,$z)=sort @foo; my @top=(sort @foo)[1..$n];` would be special cased somehow. As for your question about XS, I really have no idea right now. Sorry. --- demerphq
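The split behaviour referred to here is documented in perlfunc and can be seen directly. The sketch below only illustrates split; it makes no claim that sort has any comparable optimization. The sample string and variable names are made up.

```perl
use strict;
use warnings;

my $line = 'a,b,c,d,e';

# Two scalars on the left, so Perl treats this as split /,/, $line, 3
# and stops splitting after the second comma.
my ( $first, $second ) = split /,/, $line;

# Spelling the limit out shows what the third (normally discarded) field holds.
my @fields = split /,/, $line, 3;

print "$first $second\n";    # a b
print "$fields[2]\n";        # c,d,e
```

The saving comes from never splitting the tail of the string, which is the kind of "only produce what the caller will use" hint being discussed for sort.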
Re^4: Better mousetrap (getting top N values from list X) (want N) by tye (Sage) on Feb 03, 2005 at 20:03 UTC
Re^5: Better mousetrap (getting top N values from list X) (want N) by BrowserUk (Patriarch) on Feb 03, 2005 at 22:48 UTC
Re: Better mousetrap (getting top N values from list X) by sleepingsquirrel (Chaplain) on Feb 03, 2005 at 21:12 UTC
Just as an aside, if you had lazy lists (coming in perl6) and the appropriate `sort`, the following should run in optimal O(X*log(N)) time, straight out of the box. `@topN = (sort @X)[0..($N-1)];` -- All code is 100% tested and functional unless otherwise noted.
Re^2: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 03, 2005 at 22:37 UTC
How would lazy lists allow that? You would still have to sort the whole array before you can perform the slice, because the last value in the array could be the highest. Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.
Re^3: Better mousetrap (getting top N values from list X) by sleepingsquirrel (Chaplain) on Feb 03, 2005 at 23:04 UTC
lazy sort -- All code is 100% tested and functional unless otherwise noted.
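One way to picture a lazy sort in today's Perl is an iterator that partitions quicksort-style, but only as far as needed to produce the next value. This is a minimal sketch under that assumption, not how any Perl 6 implementation works; `lazy_sorted` is a hypothetical name, and using the first element as the pivot means already-sorted input degrades badly.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator yielding the given values in
# ascending numeric order, partitioning only as much as each call requires.
sub lazy_sorted {
    my @pending = ( [@_] );    # stack of unsorted chunks, lowest values on top
    return sub {
        while (@pending) {
            my $chunk = pop @pending;
            next unless @$chunk;
            return $chunk->[0] if @$chunk == 1;

            # Partition around the first element.
            my ( $pivot, @rest ) = @$chunk;
            my ( @lo, @hi );
            for my $v (@rest) {
                if   ( $v < $pivot ) { push @lo, $v }
                else                 { push @hi, $v }
            }

            # Push high chunk first so the low chunk is examined next.
            push @pending, \@hi, [$pivot], \@lo;
        }
        return;    # exhausted
    };
}

my $next   = lazy_sorted( 17, 3, 42, 8, 99, 23, 4 );
my @first3 = map { $next->() } 1 .. 3;
print "@first3\n";    # 3 4 8
```

Chunks holding only larger values stay unpartitioned until someone asks for them, so taking the first few values never fully sorts the rest. Flipping the comparison would yield the largest values first.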
Re^4: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 03, 2005 at 23:11 UTC
Re^5: Better mousetrap (getting top N values from list X) by sleepingsquirrel (Chaplain) on Feb 03, 2005 at 23:20 UTC
Re^5: Better mousetrap (getting top N values from list X) by sleepingsquirrel (Chaplain) on Feb 03, 2005 at 23:39 UTC
Re^3: Better mousetrap (getting top N values from list X) by Anonymous Monk on Feb 04, 2005 at 08:22 UTC
I don't understand. If you were to extract the largest element of the list, you'd make one pass and find it, even if the largest element is at the end. No need to sort the entire array here. The same with finding the topN. Instead of keeping track of the largest element so far, you keep track of the largest N elements so far. If you do it in a heap, you can add an element, or remove the smallest element in O(log N) time. Still need one pass through the array. Don't have to sort the array. Note that if N equals the size of the list, you perform a sort in O(N log N) time. Which is optimal.
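The heap approach described here can be sketched in pure Perl with a hand-rolled binary min-heap. This is an illustration only: it does not use Heap::Simple or any other module's API, and the names `top_n_heap`, `_sift_up` and `_sift_down` are made up.

```perl
use strict;
use warnings;

# Restore the min-heap property downward from index $i.
sub _sift_down {
    my ( $heap, $i ) = @_;
    my $size = @$heap;
    while (1) {
        my $small = $i;
        for my $child ( 2 * $i + 1, 2 * $i + 2 ) {
            $small = $child
                if $child < $size && $heap->[$child] < $heap->[$small];
        }
        last if $small == $i;
        @$heap[ $i, $small ] = @$heap[ $small, $i ];
        $i = $small;
    }
}

# Restore the min-heap property upward from index $i.
sub _sift_up {
    my ( $heap, $i ) = @_;
    while ( $i > 0 ) {
        my $parent = int( ( $i - 1 ) / 2 );
        last if $heap->[$parent] <= $heap->[$i];
        @$heap[ $i, $parent ] = @$heap[ $parent, $i ];
        $i = $parent;
    }
}

# One pass over the list, keeping the $n largest values in a min-heap.
sub top_n_heap {
    my ( $n, $list ) = @_;
    return if $n <= 0;
    my @heap;    # $heap[0] is always the smallest value currently kept
    for my $val (@$list) {
        if ( @heap < $n ) {
            push @heap, $val;
            _sift_up( \@heap, $#heap );
        }
        elsif ( $val > $heap[0] ) {
            $heap[0] = $val;    # evict the current minimum
            _sift_down( \@heap, 0 );
        }
    }
    my @sorted = sort { $b <=> $a } @heap;    # largest first
    return @sorted;
}

my @top3 = top_n_heap( 3, [ 5, 1, 9, 3, 7, 9, 2 ] );
print "@top3\n";    # 9 9 7
```

Each element costs at most one O(log N) heap adjustment, and most elements are rejected by the single comparison against `$heap[0]`. Whether that beats sort'n'slice in practice is exactly the kind of question the earlier posts say only a benchmark can settle.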
Re^4: Better mousetrap (getting top N values from list X) by BrowserUk (Patriarch) on Feb 04, 2005 at 08:47 UTC
Re: Better mousetrap (getting top N values from list X) by ambrus (Abbot) on Feb 05, 2005 at 22:33 UTC
You may want to see these two threads for some more information: finding top 10 largest files, Sorting values of nested hash refs.


	PerlMonks