http://www.perlmonks.org?node_id=263412


in reply to Re: Counting keys with defined or undefined elements in a hash
in thread Counting keys with defined or undefined elements in a hash

My aesthetic sense is somewhat offended by scanning the list twice using grep. I would rather scan the list once and count both the defined and the undefined elements in a single pass. Not to mention the fact that grep constructs a list, which you then throw away after having counted its elements.

The code below uses the result of defined to index into an array. The array contains two elements: the number of undefined values and the number of defined values.

#! /usr/bin/perl -w

use strict;

my %h = (
    a => 1,
    b => undef,
    c => 1,
    d => 1,
    e => undef,
);

my @def;
$def[ defined $_ ? 1 : 0 ]++ for values %h;

print "undefined=$def[0], defined=$def[1]\n";

__END__

produces:

undefined=2, defined=3

(A short while later)...

Note that in both broquaint's code and mine, you are still iterating over the hash. There really is no way around that. You could, as the AM says, keep track of what you insert as you go along, but that approach offers far more ways to introduce bugs and would be far messier in terms of code. Or, if you are brave and insist on this approach, you could try some fancy footwork with tie.
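For the curious, that tie footwork might look something like the sketch below. CountingHash, its tally variables and its counts method are all names I invented for illustration; this is a sketch of the idea, not tested production code. The point is that the tallies are kept up to date on every store and delete, so reading them back costs nothing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package CountingHash;
use Tie::Hash;
our @ISA = ('Tie::StdHash');

my (%N_DEF, %N_UNDEF);    # tallies, keyed by stringified object

sub STORE {
    my ($self, $key, $value) = @_;
    if (exists $self->{$key}) {       # replacing a value: retire its old tally
        defined $self->{$key} ? $N_DEF{$self}-- : $N_UNDEF{$self}--;
    }
    defined $value ? $N_DEF{$self}++ : $N_UNDEF{$self}++;
    $self->{$key} = $value;
}

sub DELETE {
    my ($self, $key) = @_;
    if (exists $self->{$key}) {
        defined $self->{$key} ? $N_DEF{$self}-- : $N_UNDEF{$self}--;
    }
    delete $self->{$key};
}

sub CLEAR {
    my $self = shift;
    ($N_DEF{$self}, $N_UNDEF{$self}) = (0, 0);
    %$self = ();
}

sub counts {                          # (undefined, defined), both O(1)
    my $self = shift;
    return ($N_UNDEF{$self} || 0, $N_DEF{$self} || 0);
}

package main;

tie my %h, 'CountingHash';
%h = (a => 1, b => undef, c => 1, d => 1, e => undef);

my ($undef, $def) = (tied %h)->counts;
print "undefined=$undef, defined=$def\n";    # undefined=2, defined=3
```

Every store and delete now pays a small extra cost to keep the tallies current, which is exactly the messiness being traded for the O(1) read.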

Just iterate and be done with it.

...(some time later)...

Just for kicks, I rolled a benchmark of the two approaches (except, as it turns out, I screwed up):

#! /usr/bin/perl -w

use strict;
use Benchmark qw/cmpthese/;

my %h;
my $nr = shift || 10000000;
$h{$nr} = ((rand() > 0.2) ? 1 : undef) while $nr--;

sub by_for {
    my $h = shift;
    my @def;
    $def[ defined $_ ? 1 : 0 ]++ for values %$h;
    @def;
}

sub by_grep {
    my $h = shift;
    (
        scalar (grep !defined, values %$h),
        scalar (grep  defined, values %$h),
    );
}

printf "by_for:  undefined=%d, defined=%d\n", by_for( \%h );
printf "by_grep: undefined=%d, defined=%d\n", by_grep( \%h );

cmpthese( -5, {
    for  => ' by_for( \%h ) ',
    grep => ' by_grep( \%h ) ',
} );

Be careful! As it stands, this code creates a hash with 10 million elements. On my machine it soaks up 800Mb of RAM. The results show that the scalar grep, somewhat to my surprise, consistently outperforms the for approach:

by_for:  undefined=2000300, defined=7999700
by_grep: undefined=2000300, defined=7999700
Benchmark: running for, grep for at least 5 CPU seconds...
       for:  5 wallclock secs ( 5.27 usr +  0.00 sys =  5.27 CPU) @ 163521.04/s (n=862318)
      grep:  6 wallclock secs ( 5.05 usr +  0.00 sys =  5.05 CPU) @ 224222.12/s (n=1131621)
         Rate  for grep
for  163521/s   -- -27%
grep 224222/s  37%   --

Still, iterating over a 10_000_000 element list more than 160 thousand times a second isn't too shabby a performance...

...(and then quite some time later)...

Pondering the results of my benchmark, I realised something was flawed. Comparing how much memory was involved with how many iterations were performed implied a memory bandwidth in the hundreds of terabytes per second, far beyond any real bus... which meant that the benchmark wasn't testing what I thought it was testing. Indeed, jsprat arrived at the same conclusion independently.
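For concreteness, the sanity check is simple arithmetic. This sketch plugs in the ~800Mb figure and the 163,521/s rate reported above, assuming each benchmark pass really walks the whole hash:

```perl
use strict;
use warnings;

# Back-of-the-envelope check of the broken benchmark's numbers,
# assuming the whole ~800Mb hash is touched on every pass.
my $bytes_per_pass = 800 * 1024 * 1024;   # resident size of the hash
my $passes_per_sec = 163_521;             # rate the broken benchmark reported

my $tb_per_sec = $bytes_per_pass * $passes_per_sec / 1024**4;
printf "implied memory bandwidth: ~%.0f TB/s\n", $tb_per_sec;   # ~125 TB/s
```

No 2003-era machine moves memory at anything like that rate, so the benchmark cannot have been touching the data it claimed to.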

Here, then, is a corrected benchmark that produces results that are far more reasonable. I just changed the cmpthese call as follows:

cmpthese( -300, {
    byfor  => sub { by_for( \%h ) },
    bygrep => sub { by_grep( \%h ) },
} );

(Yes, I had to up this to 300 seconds of CPU time to get good results)... which gives:

by_for:  undefined=2398233, defined=9601767
by_grep: undefined=2398233, defined=9601767
Benchmark: running byfor, bygrep for at least 300 CPU seconds...
     byfor: 311 wallclock secs (311.09 usr +  0.03 sys = 311.12 CPU) @  0.06/s (n=20)
    bygrep: 313 wallclock secs (312.95 usr +  0.04 sys = 312.99 CPU) @  0.04/s (n=13)
       s/iter bygrep  byfor
bygrep   24.1     --   -35%
byfor    15.6    55%     --

This is running on a 12 million element hash, which consumes 900Mb on my machine. (I had forgotten that I compiled the kernel to prevent processes from consuming more than a gig, otherwise I would have pushed the size of the hash up further.)

I should also note that this was run on 5.8.0, so I'm getting the benefit of the free scalar grep.

At this size the for approach has started to pull away from grep. Which approach you should use therefore depends on how big the hashes are :)

...(and yet still later, after I read some more in the thread)... I added the difference algorithm (and I'm kicking myself for not having thought of it; one more reason why this site is such a great place), which pulls way ahead. The changes are:

sub by_diff {
    my $h = shift;
    my $defined = grep defined, values %$h;
    (
        scalar keys(%$h) - $defined,
        $defined,
    );
}

printf "by_for:  undefined=%d, defined=%d\n", by_for( \%h );
printf "by_grep: undefined=%d, defined=%d\n", by_grep( \%h );
printf "by_diff: undefined=%d, defined=%d\n", by_diff( \%h );

cmpthese( -600, {
    byfor  => sub { by_for( \%h ) },
    bygrep => sub { by_grep( \%h ) },
    bydiff => sub { by_diff( \%h ) },
} );

Which gives:

by_for:  undefined=2398740, defined=9601260
by_grep: undefined=2398740, defined=9601260
by_diff: undefined=2398740, defined=9601260
Benchmark: running bydiff, byfor, bygrep for at least 600 CPU seconds...
    bydiff: 606 wallclock secs (604.93 usr +  0.11 sys = 605.04 CPU) @  0.08/s (n=51)
     byfor: 625 wallclock secs (623.32 usr +  0.09 sys = 623.41 CPU) @  0.06/s (n=40)
    bygrep: 604 wallclock secs (602.22 usr +  0.07 sys = 602.29 CPU) @  0.04/s (n=25)
       s/iter bygrep  byfor bydiff
bygrep   24.1     --   -35%   -51%
byfor    15.6    55%     --   -24%
bydiff   11.9   103%    31%     --

So there you have it. At this late stage of the game, if performance is still a worry, you should start thinking about writing in C, or using a database.

_____________________________________________
Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

Re: Re:x2 Counting keys with defined or undefined elements in a hash
by jsprat (Curate) on Jun 05, 2003 at 17:32 UTC
    My aesthetic sense is somewhat offended by scanning the list twice using grep.

    Mine too - as well as my common sense (no offense broquaint ;-)

    Here's a quick benchmark of my first thought (&for_values), my second thought (&grep_subtract), your method, and broquaint's double grep.

    I'll just post the summary output from cmpthese (perl 5.6.1):

                    Rate for_array grep  for grep_two
    for_array    82736/s        --  -8% -12%     -44%
    grep         90290/s        9%   --  -4%     -39%
    for          94074/s       14%   4%   --     -37%
    grep_two    148846/s       80%  65%  58%       --

    Using grep is deceptively fast - it looks like using the ternary operator in a single loop is slower than looping twice!

    By far the fastest of these is using keys to find the total number of hash elements and subtract the number of defined elements.

    I wonder how this would perform as the hash grows?

    Update: Moved Benchmark results outside of readmore...

      Using grep is deceptively fast
      Since it performs the iteration internally it is bound to be very fast indeed, and scalar context will also help as it saves on the assignment. Will also do my best not to offend anyone's aesthetic sensibilities in future ;)
      HTH

      _________
      broquaint

        The key difference between the two (in this case, at least) is the conditional expression. A plain for loop will iterate faster than grep - but insert a conditional into the for loop, and grep will win. Side note: in this thread I learned that grep in scalar context doesn't build the list, it just "returns the number of times the expression was true."*

        * ripped directly from perldoc -f grep

        And by the way, if you saw how my apartment was decorated before I got married, you'd never worry about my aesthetic sensibilities again ;)
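That scalar-context counting is easy to see in a few lines; a minimal sketch of the behaviour quoted from perldoc:

```perl
use strict;
use warnings;

my %h = (a => 1, b => undef, c => 1, d => 1, e => undef);

# In scalar context grep just returns the number of matches;
# no list of matching values is built or assigned anywhere.
my $defined   = grep defined,  values %h;
my $undefined = grep !defined, values %h;

print "undefined=$undefined, defined=$defined\n";   # undefined=2, defined=3
```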

Re: Re:x2 Counting keys with defined or undefined elements in a hash
by hardburn (Abbot) on Jun 05, 2003 at 17:25 UTC

    Not to mention that fact that grep constructs a list, that you then throw away after having counted the number of elements.

    In the latest versions of perl (as of 5.8.0, IIRC), grep and map won't build a list in void context.

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

Re: Re:x2 Counting keys with defined or undefined elements in a hash
by sauoq (Abbot) on Jun 05, 2003 at 19:19 UTC
    Still, iterating over a 10_000_000 element list more than 160 thousand times a second isn't too shabby a performance...

    In fact, it's not shabby enough. That sort of speed should have made you take another look at your benchmark.

    Try changing your cmpthese() call to

    cmpthese( -5, {
        for  => sub { by_for(\%h) },
        grep => sub { by_grep(\%h) },
    } );

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Re:x2 Counting keys with defined or undefined elements in a hash
by jsprat (Curate) on Jun 05, 2003 at 18:46 UTC
    I msg'd you, but you weren't around, so I'll post here.

    The benchmark isn't benchmarking the giant hash - the parameter being passed to the sub is not what you think it is (which explains the amazing speed)! Just use %h directly in the sub to see the difference.

    Try using this modified sub by_for to show how the parameters are being passed:

    sub by_for {
        my $h = shift;
        die scalar keys %$h unless scalar keys %$h;
        my @def;
        $def[ defined $_ ? 1 : 0 ]++ for values %$h;
        @def;
    }

    It dies when the benchmark is run.
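A minimal sketch of why the string form goes wrong. Here run_snippet is my own stand-in for what Benchmark does with string snippets (the real module differs in detail): the string is compiled far from the caller's lexicals, so it sees an empty global %h, while a code reference closes over the lexical at the point of definition:

```perl
use strict;
use warnings;

# A stand-in for Benchmark: string snippets are eval'd here,
# in a scope where the caller's lexicals are not visible.
sub run_snippet {
    no strict 'vars';        # snippets historically ran without strict
    no warnings 'once';
    return eval $_[0];       # %h here is the (empty) global %main::h
}

my %h = (a => 1, b => undef, c => 1);   # the caller's lexical hash

my $from_string  = run_snippet('scalar keys %h') || 0;  # 0: can't see the lexical
my $from_coderef = sub { scalar keys %h }->();          # 3: closure captures it

print "string snippet saw $from_string keys\n";
print "code reference saw $from_coderef keys\n";
```

This is why changing the cmpthese entries from quoted strings to sub { ... } makes the benchmark operate on the real hash.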