My aesthetic sense is somewhat offended by scanning the list twice with grep. I would rather scan the list once and count both the defined and undefined values in a single pass. Not to mention the fact that grep constructs a list that you then throw away after counting its elements.
The code below uses the result of defined to index an array of two elements: the number of undefined values and the number of defined values.
#!/usr/bin/perl -w
use strict;
my %h = (
    a => 1,
    b => undef,
    c => 1,
    d => 1,
    e => undef,
);
my @def;
$def[ defined $_ ? 1 : 0 ]++ for values %h;  # 0: undefined, 1: defined
print "undefined=$def[0], defined=$def[1]\n";
__END__
produces:
undefined=2, defined=3
(A short while later)...
Note that in both broquaint's code and mine, you are still iterating over the hash. There really is no way around that. You could, as the AM says, keep track of what you insert as you go along, but that approach offers you far more ways to introduce bugs and would be far messier in terms of code. Or, if you are brave and insist on this approach, you could try some fancy footwork with tie.
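If you want to see roughly what that tie footwork would involve, here's a minimal sketch. (Untested at scale; the class name and the counts method are invented for illustration.) It keeps running totals as values are stored, so reading the counts never requires a scan, at the price of bookkeeping overhead on every store and delete:
package Tie::CountDefined;
use strict;

sub TIEHASH { bless { data => {}, def => 0, undef => 0 }, shift }

sub STORE {
    my ($self, $k, $v) = @_;
    $self->_forget($k) if exists $self->{data}{$k};  # overwriting? undo old tally
    defined $v ? $self->{def}++ : $self->{undef}++;
    $self->{data}{$k} = $v;
}

sub DELETE {
    my ($self, $k) = @_;
    $self->_forget($k) if exists $self->{data}{$k};
    delete $self->{data}{$k};
}

sub _forget {
    my ($self, $k) = @_;
    defined $self->{data}{$k} ? $self->{def}-- : $self->{undef}--;
}

sub FETCH    { $_[0]{data}{$_[1]} }
sub EXISTS   { exists $_[0]{data}{$_[1]} }
sub CLEAR    { %{ $_[0]{data} } = (); $_[0]{def} = $_[0]{undef} = 0 }
sub FIRSTKEY { my $reset = keys %{ $_[0]{data} }; each %{ $_[0]{data} } }
sub NEXTKEY  { each %{ $_[0]{data} } }

sub counts   { ( $_[0]{undef}, $_[0]{def} ) }        # (undefined, defined)

package main;
tie my %h, 'Tie::CountDefined';
%h = ( a => 1, b => undef, c => 1, d => 1, e => undef );
printf "undefined=%d, defined=%d\n", tied(%h)->counts;
All that machinery just to avoid one loop.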
Just iterate and be done with it.
...(some time later)...
Just for kicks, I rolled a benchmark out of the two approaches:
except I screwed up
...(and then quite some time later)...
Pondering the results of my benchmark, I realised something was flawed. Comparing how much memory was involved with how many iterations were performed implied a bus speed tending towards the exabyte-to-zettabyte-per-second range... which meant that the benchmark wasn't testing what I thought it was testing. Indeed, jsprat arrived at the same conclusion independently.
Here, then, is a corrected benchmark that produces far more reasonable results. I just changed the cmpthese call as follows:
cmpthese( -300, {
    byfor  => sub { by_for( \%h ) },
    bygrep => sub { by_grep( \%h ) },
} );
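For reference, since I only showed the changed call above, the subs under test look roughly like this. (Reconstructed here for illustration, not copied from my original script; the 1-in-5 undef ratio used to build the test hash is likewise a guess, chosen to match the counts below.)
# by_grep scans the values twice; by_for counts both kinds in one pass.
sub by_grep {
    my $h = shift;
    my $undef = grep !defined, values %$h;   # pass 1: count the undefined
    my $def   = grep  defined, values %$h;   # pass 2: count the defined
    ( $undef, $def );
}

sub by_for {
    my $h = shift;
    my @def = ( 0, 0 );
    $def[ defined $_ ? 1 : 0 ]++ for values %$h;
    @def;                                    # (undefined, defined)
}

# The test hash: 12 million keys, roughly one value in five undef.
my %h = map { $_ => ( rand() < 0.2 ? undef : 1 ) } 1 .. 12_000_000;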
(Yes, I had to up this to 300 seconds of CPU time to get good results)... which gives:
by_for: undefined=2398233, defined=9601767
by_grep: undefined=2398233, defined=9601767
Benchmark: running byfor, bygrep for at least 300 CPU seconds...
byfor: 311 wallclock secs (311.09 usr + 0.03 sys = 311.12 CPU) @ 0.06/s (n=20)
bygrep: 313 wallclock secs (312.95 usr + 0.04 sys = 312.99 CPU) @ 0.04/s (n=13)
s/iter bygrep byfor
bygrep 24.1 -- -35%
byfor 15.6 55% --
This is running on a 12 million element hash, which consumes 900Mb on my machine. (I forgot I had compiled the kernel to stop processes from consuming more than a gig, otherwise I could have pushed the size of the hash up further.)
I should also note that this was run on 5.8.0, so I'm getting the benefit of the free scalar grep (illustrated below).
At this size the for approach has started to pull away from grep. Which approach you should use therefore depends on how big the hashes are :)
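By scalar grep I mean using grep for its count rather than for its result list:
my @defined = grep defined, values %h;   # list context: keeps the matching values
my $defined = grep defined, values %h;   # scalar context: just the count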
...(and yet still later, after I read some more in the thread)... I included the difference algorithm (and I'm kicking myself for not having thought of it -- one more reason why this site is such a great place), which counts the defined values with a single grep pass and derives the undefined count by subtracting from the total number of keys. It pulls way ahead. The changes are:
sub by_diff {
    my $h = shift;
    my $defined = grep defined, values %$h;  # one pass, in scalar context
    (
        scalar keys(%$h) - $defined,         # undefined = total keys - defined
        $defined,
    );
}
printf "by_for: undefined=%d, defined=%d\n", by_for( \%h );
printf "by_grep: undefined=%d, defined=%d\n", by_grep( \%h );
printf "by_diff: undefined=%d, defined=%d\n", by_diff( \%h );
cmpthese( -600, {
    byfor  => sub { by_for( \%h ) },
    bygrep => sub { by_grep( \%h ) },
    bydiff => sub { by_diff( \%h ) },
} );
Which gives:
by_for: undefined=2398740, defined=9601260
by_grep: undefined=2398740, defined=9601260
by_diff: undefined=2398740, defined=9601260
Benchmark: running bydiff, byfor, bygrep for at least 600 CPU seconds...
bydiff: 606 wallclock secs (604.93 usr + 0.11 sys = 605.04 CPU) @ 0.08/s (n=51)
byfor: 625 wallclock secs (623.32 usr + 0.09 sys = 623.41 CPU) @ 0.06/s (n=40)
bygrep: 604 wallclock secs (602.22 usr + 0.07 sys = 602.29 CPU) @ 0.04/s (n=25)
s/iter bygrep byfor bydiff
bygrep 24.1 -- -35% -51%
byfor 15.6 55% -- -24%
bydiff 11.9 103% 31% --
So there you have it. At this late stage of the game, if performance is still a worry, you should start thinking about writing in C, or using a database.
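Should it ever come to that, the C route might look something like this with Inline::C -- a sketch only, with no error checking, and not something I benchmarked here:
#!/usr/bin/perl -w
use strict;
use Inline C => <<'END_C';
/* Walk the hash once at the C level, counting defined values. */
int count_defined(SV* href) {
    HV* hv = (HV*)SvRV(href);   /* assumes href really is a hash ref */
    HE* entry;
    int count = 0;
    hv_iterinit(hv);
    while ((entry = hv_iternext(hv)) != NULL) {
        if (SvOK(HeVAL(entry)))
            count++;
    }
    return count;
}
END_C

my %h = ( a => 1, b => undef, c => 1, d => 1, e => undef );
my $defined = count_defined( \%h );
printf "undefined=%d, defined=%d\n", scalar( keys %h ) - $defined, $defined;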
_____________________________________________ Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.