comment on

If it is just a randomly selected 10 items with the highest value you are after, beware of 'add one and (re)sort and discard' algorithms.

O(logN) sorting algorithms are pretty efficient for large sets of data, but (relatively) pretty lousy at small sets.

Eg. log(100e6) = 18.42 => 0.000018%; but log(10) = 23%.

So, given your ~715e6 records, the 'add 1, resort and discard 1' algorithms do 715 million * O(log 11) sorts, which is ~= O(745e6).

Conversely, if you add 1000 items, sort and discard 1000, you get 715 thousand * O(log 1010) sorts, which is ~= O(10e6).

And (theoretically), if you add 1e6 items, sort, and disarcd 1e6 then you get 715 * O(log 1000010) sorts, which is ~= O(9878).

Of course, these CompSci theoretical complexity estimates tend to ignore the cost of memory allocations, deallocations, etc. and thus never work out that way in reality, but they do give an indication of where to look for savings.

This compares Add 1 and Add many algorithms for various values of many, and adding somewhere between 100 and 1000 items at a time seems to work best for an entirely in-memory, cpu-bound process:

#! perl -slw
use strict;
use List::Util qw[ sum min ];
use Time::HiRes qw[ time ];

sub timeit(&) {
    my $soFar = sum( times() );
    my $start = time;
    $_[0]->();
    return sprintf "Wall:%6.3f secs; cpu:%6.3f secs",
        time() - $start, sum( times() ) - $soFar;
}

our $T //= 10e6;
our $N //= 10;
our $B //= 10;

my @data = map{ sprintf "%4.2f", rand() } 1 .. 10e6;

print 'Add 1   and sort; add 1   and sort ...: ', timeit {
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];

    for( my $i = $N; $i < $T; ++$i ) {
        @top = ( sort{ $b <=> $a } @top, $data[ $i ] )[ 0 .. $N-1 ];
    }
#    print "@top";
};

print "Add $B and sort; add $B and sort ...: ", timeit{
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];

    for( my $i = $N; $i < $T; $i += $B ) {
        @top = sort{ $b <=> $a } @top, @data[ $i .. min( $i+$B, $#data
+ ) ];
        $#top = $N;
    }
#    print "@top";
};

__END__
C:\test>1052498-3 -T=10e6 -B=1e1
Add 1   and sort; add 1   and sort ...: Wall:89.400528 secs; cpu:89.04
+7000 secs
Add 1e1 and sort; add 1e1 and sort ...: Wall:18.483192 secs; cpu:18.39
+1000 secs

C:\test>1052498-3 -T=10e6 -B=1e2
Add 1   and sort; add 1   and sort ...: Wall:87.353784 secs; cpu:86.43
+8000 secs
Add 1e2 and sort; add 1e2 and sort ...: Wall:9.385142 secs; cpu:9.2340
+00 secs

C:\test>1052498-3 -T=10e6 -B=2e2
Add 1   and sort; add 1   and sort ...: Wall:87.380 secs; cpu:87.046 s
+ecs
Add 2e2 and sort; add 2e2 and sort ...: Wall: 9.135 secs; cpu: 9.156 s
+ecs

C:\test>1052498-3 -T=10e6 -B=1e3
Add 1   and sort; add 1   and sort ...: Wall:87.436786 secs; cpu:86.62
+5000 secs
Add 1e3 and sort; add 1e3 and sort ...: Wall:9.377329 secs; cpu:9.3290
+00 secs

C:\test>1052498-3 -T=10e6 -B=1e4
Add 1   and sort; add 1   and sort ...: Wall:87.435 secs; cpu:86.298 s
+ecs
Add 1e4 and sort; add 1e4 and sort ...: Wall:10.077 secs; cpu: 9.843 s
+ecs
[download]

Now, if someone decides to try that where the data is greater than memory and so coming from an external file, they'll probably find that the IO-limited nature means that 1 or 1000 at a time make little or no difference to the overall elapsed time.

But, if you then look at the CPU used, the 1 at a time will be an order of magnitude, or more, higher. And cycles cost money. There is little purpose in burning extra cycles to achieve the same outcome in the same time.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms) by BrowserUk
in thread Limit the size of a hash by Dr Manhattan

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


laziness, impatience, and hubris
	PerlMonks