revdiablo has asked for the wisdom of the Perl Monks concerning the following question:
Hello fair Monks. I am calculating the average of a given range of hash values. For example, given a hash like so:
%hash = (
    100 => '1',
    101 => '23',
    102 => '4',
    ...
    200 => '75',
);
I'd want to, say, calculate the average of the values at keys 100 to 125. I have a routine that does what I want, but it seems to be a bit slow. I can't think of any major optimizations (other than switching to an array, which isn't a good idea, since the hash keys are relatively large numbers). Perhaps someone has some ideas?
Here is my existing code:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $data = {};
$data->{$_} = rand for (10000 .. 30000);

print average($data, 15000, 25000), "\n";
cmpthese(50, { orig => sub { average($data, 15000, 25000) } });

sub average {
    my ($data, $start_point, $end_point) = @_;
    my @keys = sort keys %{ $data };
    my $start_pos;
    my $end_pos;
    for (0 .. $#keys) {
        $start_pos = $_ if $keys[$_] eq $start_point;
        $end_pos   = $_ if $keys[$_] eq $end_point;
    }
    return undef unless defined $start_pos and defined $end_pos;
    my $count = $end_pos - $start_pos + 1;
    my $amount;
    $amount += $data->{$keys[$_]} for ($start_pos .. $end_pos);
    return $amount / $count;
}
Any help or ideas will be greatly appreciated. In case anybody wonders, I am using this routine to determine a moving average. I am running it many times against barely-differing ranges, so it slows down fairly quickly.
Re: calculate average from range of hash values (73%)
by BrowserUk (Patriarch) on Jul 27, 2003 at 01:30 UTC
If you're lucky enough to have access to the XS version of List::Util, this will probably run much faster still.
sub buk {
    use List::Util qw[sum];
    my( $hashref, $start, $end ) = @_;

    # Added sanity check. Slowed it a tad
    return undef unless exists $hashref->{$start}
                    and exists $hashref->{$end};

    my @keys = grep { $_ >= $start && $_ <= $end } keys %{ $hashref };
    return sum( @{ $hashref }{ @keys } ) / @keys;
}
__END__
P:\test>278169
0.503353665487943
0.503353665487943
Rate orig buk
orig 1.28/s -- -42%
buk 2.23/s 73% --
Second attempt (417%)
Once I noticed that your code assumes a contiguous range of keys exists between the start and end values, it could go much quicker.
sub buk2 {
    use List::Util qw[sum];
    return undef unless exists $_[0]->{ $_[1] }
                    and exists $_[0]->{ $_[2] };
    return sum( @{ $_[0] }{ $_[1] .. $_[2] } ) / ( $_[2] - $_[1] + 1 );
}
__END__
P:\test>278169
0.503353665487943
0.503353665487943
0.503353665487943
Rate orig buk buk2
orig 1.30/s -- -41% -81%
buk 2.19/s 69% -- -67%
buk2 6.70/s 417% 206% --
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: calculate average from range of hash values (optimize)
by Zaxo (Archbishop) on Jul 27, 2003 at 01:52 UTC
I don't see the need to sort all the keys in %{$data}. You can use hash slices and List::Util::sum() to good advantage.
use List::Util 'sum';
sub average {
    my ($data, $first, $last) = @_;
    my %data = %$data;
    my @goodkeys = grep { exists $data{$_} } $first .. $last;
    sum( @data{ @goodkeys } ) / @goodkeys;
}
After Compline, Zaxo
Nor do you need to make a copy of the hash.
use List::Util qw(sum);
sub average {
    my ($data, $first, $last) = @_;
    my @goodkeys = grep exists $data->{$_}, $first .. $last;
    sum( @{$data}{@goodkeys} ) / @goodkeys;
}
Finally, that we're doing math on the values without checking their definedness suggests that valid values are never undef, in which case you can skip the double lookup.
use List::Util qw(sum);
sub average {
    my ($data, $first, $last) = @_;
    my @values = grep defined, @{$data}{$first .. $last};
    sum(@values) / @values;
}
That should be about as fast as you can get in Perl.
Makeshifts last the longest.
For the record, your second solution corrupts the hash with keys whose value is undef. While the OP didn't explicitly say this was bad, I believe it was implicit in the comment that the keys could be very large or very small. Thus a search could inadvertently create millions of keys and overflow available RAM.
---
demerphq
<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
Re: calculate average from range of hash values (optimize)
by cleverett (Friar) on Jul 27, 2003 at 01:49 UTC
Hmm, rolling averages. You can convert that from O(N**2) to linear if you can assume you're moving through the numbers in sequence. Converting your hash to an array would also help.
Try something like this:
#!/usr/bin/perl
use strict;
use warnings;

## setup
my ($gap, $elements, $first, $i) = (10000, 20000, 10000, 0);
my @data = map { rand } (1 .. $elements);

## compute initial sum (the window holds $gap + 1 elements)
my $sum = 0;
$sum += $_ for @data[0 .. $gap];

## go through @data serially, computing averages
while (1) {
    printf("average of %d to %d: %f\n",
        $i + $first, $i + $first + $gap, $sum / ($gap + 1));
    $sum -= $data[$i];                # subtract old first element
    last if ++$i + $gap > $#data;     # stop when done
    $sum += $data[$i + $gap];         # add new last element
}
HTH
Thanks for the reply, cleverett, but there are a few reasons it is not exactly applicable to my situation. One, which I specifically mentioned in my post, is that I would like to avoid using an array. The keys are Quite Large Numbers (for example '1059159920'), and trying to add even one array entry with an index that high runs out of memory. Another reason, which I didn't mention or even think about when I first made my post, is that the keys are relatively sparse. I can't simply roll through the range in sequence; I have to make sure the keys exist.
That said, your post does give me some food for thought, and I might just steal some of your ideas. :)
Perhaps keep an index to the original data hash, thusly:
use strict;
## simulate the original %data hash
my %data = map { int(rand(10**10)) + 10**9 => rand } (0 .. 2*10**5);

my ($keys, $values) = ([], []);    ## remember what goes where
foreach (sort { $a <=> $b } keys %data) {
    push @$keys,   $_;
    push @$values, $data{$_};
}
## now obtain rolling averages
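One hedged sketch of how the "now obtain rolling averages" step might continue (this is my guess, not cleverett's actual code): with the keys and values in parallel sorted arrays, any key range can be averaged over the matching index window. A binary search could replace the linear scan for large indexes; the `index_average` name and toy data are invented for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# toy index: sorted keys and their parallel values
my ( $keys, $values ) = ( [ 2, 7, 40, 90 ], [ 1, 3, 5, 7 ] );

sub index_average {
    my ( $keys, $values, $start, $end ) = @_;
    # indexes whose key falls inside the requested range
    my @idx = grep { $keys->[$_] >= $start && $keys->[$_] <= $end }
              0 .. $#$keys;
    return @idx ? sum( @{$values}[@idx] ) / @idx : undef;
}

print index_average( $keys, $values, 5, 50 ), "\n";   # keys 7 and 40: (3 + 5) / 2 = 4
```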
Re: calculate average from range of hash values (optimize)(reply)
by demerphq (Chancellor) on Jul 27, 2003 at 14:58 UTC
Hmm, well this happens to be very similar to a problem that I have been working on. With respect to this node, the primary difference between our problems is that I can't simply return undef if start_pos is not in the hash. In other words, my problem is to find the average of the values whose keys are within a range, regardless of whether the first and last keys actually exist.
Of the solutions posted here, the Zaxo/Aristotle(1)/BrowserUk(1) variant fits. For reasons described in the reply to Aristotle I've ruled his second one out, and I've ruled BrowserUk's second out because my data set isn't contiguous at all. Likewise, I didn't include the two variants that take arrays as arguments, as they don't scale well for scattered data sets. (IMO; they are welcome and encouraged to adjust their code and add it to the benchmark.)
My initial reaction to this node and your code was that it was way more complex than it needed to be. For this reason I put it in the benchmark even though it violates my rule by returning undef if the start pos doesn't exist. One might guess that this requirement would make your code infinitely fast in this case (as it does BrowserUk's variant), but even then it's one of the slowest in my benchmark. Furthermore, as far as I can tell this code is incorrect: the sort statement does a lexicographical sort, not a numerical sort, and thus would include 11 in the average of 1..3 in a data set containing 1..11. (Your test case happens to hide this; a good example of the difficulties in doing a really good comparative benchmark.)
My first solution was dem1. It's a naive single pass through the data set, filtering out keys that are out of range and keeping a count and sum for the final calculation. The second solution, dem2, uses the sum() routine from List::Util. The third solution is a Perl algorithm written to be easily translatable to C: it uses caching to speed itself up, using a binsearch and a linear walk to do its business. The fourth version is the third translated as directly to C as possible. For these versions I would normally use a class (or a tie) to handle the cache, and thereby spoil the cache when an insert occurs; instead, the code currently has a fourth parameter to force a cache reset.
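The binsearch-plus-linear-walk idea can be sketched roughly like this (a hedged reconstruction, not demerphq's actual dem3/demc code, and without the caching layer): sort the keys numerically once, binary-search for the first key >= $start, then walk forward summing until a key exceeds $end. The `range_average` name and sample data are my own.

```perl
use strict;
use warnings;

sub range_average {
    my ( $data, $sorted_keys, $start, $end ) = @_;

    # binary search: lowest index whose key is >= $start
    my ( $lo, $hi ) = ( 0, scalar @$sorted_keys );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $sorted_keys->[$mid] < $start ) { $lo = $mid + 1 }
        else                                 { $hi = $mid }
    }

    # linear walk from there, summing until we pass $end
    my ( $sum, $count ) = ( 0, 0 );
    for my $i ( $lo .. $#$sorted_keys ) {
        last if $sorted_keys->[$i] > $end;
        $sum += $data->{ $sorted_keys->[$i] };
        $count++;
    }
    return $count ? $sum / $count : undef;
}

my %data = ( 1 => 10, 5 => 20, 11 => 30, 200 => 40 );
my @keys = sort { $a <=> $b } keys %data;
print range_average( \%data, \@keys, 2, 100 ), "\n";   # keys 5 and 11: (20 + 30) / 2 = 25
```

Note that neither endpoint (2 or 100) exists as a key, which is exactly the requirement stated above.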
When you benchmark, the data set you test against is critical. Does it represent your actual data? Have you tested how the behaviour changes? For instance, in the results below the cached C/Perl versions massively outperform the others, but this would almost certainly not be true if the $data hash were changing regularly. Again, the behaviour of ari1 over the different test sets is interesting, as it shows how subject it is to the sparseness of the data set involved. Which just goes to show that truly optimizing this will require far more information about the typical data. Whether the data is sparse or dense, and whether it changes often between lookups, will really govern how to optimize this code; it would be nice to know a bit more about the usage pattern. Anyway, I've attempted to allow for a decent range of behaviour testing in my version of the benchmark, although I haven't touched the spoiled-cache aspect.
In order to make this easier to read, I'll post the final result here and put the code and overall results in two replies with readmores. Here's the overall benchmark of the average of the keys (1..100, 10000..30000, 100000..150000) over the range 1..150000.
Testing 1 - 150000
ari1 = 0.500910
orig = 0.500190
demc = 0.500910
ofix = 0.500910
buk = 0.500910
dem1 = 0.500910
demcp = 0.500910
dem2 = 0.500910
Benchmark: running ari1, buk, dem1, dem2, demc, demcp, ofix, orig, each for at least 2 CPU seconds...
     ari1: 2 wc secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 3.70/s (n=8)
      buk: 2 wc secs ( 2.18 usr + 0.00 sys = 2.18 CPU) @ 3.66/s (n=8)
     dem1: 2 wc secs ( 2.03 usr + 0.00 sys = 2.03 CPU) @ 5.41/s (n=11)
     dem2: 3 wc secs ( 2.24 usr + 0.00 sys = 2.24 CPU) @ 3.57/s (n=8)
     demc: 2 wc secs ( 2.03 usr + 0.00 sys = 2.03 CPU) @ 1424.99/s (n=2897)
    demcp: 2 wc secs ( 2.09 usr + 0.00 sys = 2.09 CPU) @ 14.33/s (n=30)
     ofix: 2 wc secs ( 2.20 usr + 0.00 sys = 2.20 CPU) @ 3.18/s (n=7)
     orig: 2 wc secs ( 2.06 usr + 0.00 sys = 2.06 CPU) @ 3.39/s (n=7)
Rate ofix orig dem2 buk ari1 dem1 demcp demc
ofix 3.18/s -- -6% -11% -13% -14% -41% -78% -100%
orig 3.39/s 7% -- -5% -7% -8% -37% -76% -100%
dem2 3.57/s 12% 5% -- -3% -4% -34% -75% -100%
buk 3.66/s 15% 8% 3% -- -1% -32% -74% -100%
ari1 3.70/s 16% 9% 4% 1% -- -32% -74% -100%
dem1 5.41/s 70% 59% 52% 48% 46% -- -62% -100%
demcp 14.3/s 351% 322% 302% 291% 288% 165% -- -99%
demc 1425/s 44746% 41896% 39853% 38784% 38428% 26236% 9842% --
Anyway, this was fun. :-)
---
demerphq
( 0 .. 100, 10000 .. 30000, 100000 .. 150000 )
Testing 0 - 150000
Routine | Average
--------+----------
ari1    | 94862.56
orig    | 114752.28
demc    | 94862.56
ari3    | 94862.56
ofix    | 94862.56
buk     | 94862.56
dem1    | 94862.56
demcp   | 94862.56
dem2    | 94862.56
Benchmark: running ari1, ari3, buk, dem1, dem2, demc, demcp, ofix, orig, each for at least 10 CPU seconds...
Test | Wallclock Secs | Usr | Sys | CUsr | CSys | CPU | N/sec | N |
------|----------------|-------|------|------|------|-------|---------|---
ari1 | 10.00 | 10.01 | 0.00 | 0.00 | 0.00 | 10.01 | 3.30 | 33 |
ari3 | 11.00 | 10.27 | 0.00 | 0.00 | 0.00 | 10.27 | 1.85 | 19 |
buk | 10.00 | 10.09 | 0.00 | 0.00 | 0.00 | 10.09 | 3.47 | 35 |
dem1 | 10.00 | 10.07 | 0.00 | 0.00 | 0.00 | 10.07 | 5.16 | 52 |
dem2 | 10.00 | 10.20 | 0.00 | 0.00 | 0.00 | 10.20 | 2.35 | 24 |
demc | 11.00 | 10.51 | 0.00 | 0.00 | 0.00 | 10.51 | 1432.52 | 15063 |
demcp | 11.00 | 10.52 | 0.00 | 0.00 | 0.00 | 10.52 | 13.03 | 137 |
ofix | 10.00 | 10.10 | 0.00 | 0.00 | 0.00 | 10.10 | 2.08 | 21 |
orig | 10.00 | 10.08 | 0.00 | 0.00 | 0.00 | 10.08 | 3.37 | 34 |
Comparative Results
| Rate | ari3 | ofix | dem2 | ari1 | orig | buk | dem1 | demcp | demc |
ari3 | 1.85/s | -- | -11% | -21% | -44% | -45% | -47% | -64% | -86% | -100% |
ofix | 2.08/s | 12% | -- | -12% | -37% | -38% | -40% | -60% | -84% | -100% |
dem2 | 2.35/s | 27% | 13% | -- | -29% | -30% | -32% | -54% | -82% | -100% |
ari1 | 3.30/s | 78% | 59% | 40% | -- | -2% | -5% | -36% | -75% | -100% |
orig | 3.37/s | 82% | 62% | 43% | 2% | -- | -3% | -35% | -74% | -100% |
buk | 3.47/s | 87% | 67% | 47% | 5% | 3% | -- | -33% | -73% | -100% |
dem1 | 5.16/s | 179% | 148% | 119% | 57% | 53% | 49% | -- | -60% | -100% |
demcp | 13.0/s | 604% | 527% | 454% | 295% | 286% | 276% | 152% | -- | -99% |
demc | 1433/s | 77294% | 68825% | 60806% | 43371% | 42349% | 41218% | 27655% | 10895% | -- |
---
demerphq
The code for the benchmark. I happen to have a hacked Benchmark.pm that includes support for HTML benchmarks (er, I do now anyway; that's why the code took so long to follow, I sort of got distracted :-), so this was written with that in mind. However, as posted it should run just fine on your system (assuming you have a C compiler, etc.).
Updates:
- I forgot to point out that I changed the way the values are set in the test set from rand() to being the same as the key. This makes spotting errors and proving correctness much easier.
- Added Aristotle's new variant.
---
demerphq
Re: calculate average from range of hash values (optimize)
by fglock (Vicar) on Jul 27, 2003 at 03:09 UTC
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $data  = [];
my $start = 10000;
my $end   = 30000;
$data->[$_ - $start] = rand for ($start .. $end);

print average($data, 15000, 25000), "\n";
cmpthese(50, { orig => sub { average($data, 15000, 25000) } });

sub average {
    my ($data, $start_point, $end_point) = @_;
    my $count = $end_point - $start_point + 1;
    my $amount;
    # note: $start (the array's base offset) is a file lexical visible here
    $amount += $data->[$_ - $start] for ($start_point .. $end_point);
    return $amount / $count;
}
Re: calculate average from range of hash values (optimize)
by TomDLux (Vicar) on Jul 27, 2003 at 17:28 UTC
The important aspect of moving averages is to minimize the complexity of the N calculations. Calculate the sum once, then add the new value and subtract the old value as the window moves. Divide by N to obtain the average.
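A minimal sketch of that incremental idea, assuming a dense array and a fixed window size (the variable names and toy data here are mine, not TomDLux's):

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @data = (1 .. 10);
my $n    = 3;                              # window size

my $sum      = sum @data[ 0 .. $n - 1 ];   # one full sum, computed once
my @averages = ( $sum / $n );

for my $i ( $n .. $#data ) {
    $sum += $data[$i] - $data[ $i - $n ];  # add the new value, drop the old
    push @averages, $sum / $n;
}

print "@averages\n";                       # prints "2 3 4 5 6 7 8 9"
```

Each step costs O(1) instead of O(N), so the whole pass over the data is linear.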
--
TTTATCGGTCGTTATATAGATGTTTGCA