Counting keys with defined or undefined elements in a hash

by Bukowski (Deacon)
on Jun 05, 2003 at 16:21 UTC
Bukowski has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perlmonks,

If I have a hash whose values are either defined (1) or undef, is there a quick and simple way of counting the number of keys with undef or defined (1) values, without iterating over the entire hash?

I'm probably exposing a deeply ignorant position with my knowledge of hashes here, but it is a speed issue: the hashes represent states of DNA bases in a genome, and the fastest solution is absolutely required.

Many thanks in advance,

Bukowski - aka Dan (dcs@black.hole-in-the.net)
"Coffee for the mind, Pizza for the body, Sushi for the soul" -Userfriendly

Re: Counting keys with defined or undefined elements in a hash
by Anonymous Monk on Jun 05, 2003 at 16:27 UTC
    Well, your hash had to be populated at some point. I suggest that you wrap the hash-population code in a small subroutine that increments a counter variable every time a defined value is deposited.
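    The counter-on-insert idea can be sketched as follows (untested against the OP's data; the sub and variable names here are my own invention, not from the thread):

```perl
use strict;
use warnings;

# A sketch of the counter-on-insert idea. %base and the two counter
# variables are hypothetical names, not from the thread.
my %base;
my $defined_count   = 0;
my $undefined_count = 0;

# Funnel every write through this sub so the tallies stay in sync.
sub store_base {
    my ( $key, $value ) = @_;
    if ( exists $base{$key} ) {    # overwriting: retract the old tally
        defined $base{$key} ? $defined_count-- : $undefined_count--;
    }
    defined $value ? $defined_count++ : $undefined_count++;
    $base{$key} = $value;
}

store_base( a => 1 );
store_base( b => undef );
store_base( c => 1 );
store_base( b => 1 );    # b flips from undef to defined
print "defined=$defined_count, undefined=$undefined_count\n";
```

    This prints "defined=3, undefined=0"; note that overwriting an existing key has to retract its old tally first, which is exactly the sort of place a bug can creep in.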
Re: Counting keys with defined or undefined elements in a hash
by broquaint (Abbot) on Jun 05, 2003 at 16:27 UTC
    This does iterate over the whole hash, but it should be pretty nippy, since the iteration is done internally:
    my %hash = (
        foo   => 1,
        bar   => 1,
        baz   => 1,
        one   => undef,
        two   => undef,
        three => undef,
    );
    print "defined - ",     scalar(grep defined,  values %hash), $/;
    print "not defined - ", scalar(grep !defined, values %hash), $/;
    __output__
    defined - 3
    not defined - 3
    See grep and values for more info.
    HTH

    _________
    broquaint

      Broquaint - really love this solution ++, and many thanks, it's perfect.

      Bukowski - aka Dan (dcs@black.hole-in-the.net)
      "Coffee for the mind, Pizza for the body, Sushi for the soul" -Userfriendly

      My aesthetic sense is somewhat offended by scanning the list twice using grep. I would rather scan the list once and count both the defined and undefined elements in a single pass. Not to mention the fact that grep constructs a list that you then throw away after having counted the number of elements.

      The code below uses the result of defined to index an array. The array contains two elements, the number of undefined and the number of defined elements.

      #!/usr/bin/perl -w
      use strict;

      my %h = (
          a => 1,
          b => undef,
          c => 1,
          d => 1,
          e => undef,
      );

      my @def;
      $def[ defined $_ ? 1 : 0 ]++ for values %h;

      print "undefined=$def[0], defined=$def[1]\n";

      __END__
      produces:
      undefined=2, defined=3

      (A short while later)...

      Note that in both broquaint's code and mine, you are still iterating over the hash. There really is no way around that. You could, as the AM says, keep track of what you insert as you go along, but that approach offers you far more ways to introduce bugs and would be far messier in terms of code. Or, if you are brave and insist on this approach, you could try some fancy footwork with tie.

      Just iterate and be done with it.

      ...(some time later)...

      Just for kicks, I rolled a benchmark out of the two approaches:

      except I screwed up

      ...(and then quite some time later)...

      Pondering the results of my benchmark, I realised something was flawed. Calculating how much memory was involved compared to how many iterations were performed implied a bus speed tending towards the exabyte to zettabyte per second range... which meant that the benchmark wasn't testing what I thought it was testing. Indeed, jsprat arrived at the same conclusion independently.

      Here, then, is a corrected benchmark that produces results that are far more reasonable. I just changed the cmpthese call as follows:

      cmpthese( -300, {
          byfor  => sub { by_for( \%h )  },
          bygrep => sub { by_grep( \%h ) },
      } );

      (Yes, I had to up this to 300 seconds of CPU time to get good results)... which gives:

      by_for:  undefined=2398233, defined=9601767
      by_grep: undefined=2398233, defined=9601767
      Benchmark: running byfor, bygrep for at least 300 CPU seconds...
           byfor: 311 wallclock secs (311.09 usr + 0.03 sys = 311.12 CPU) @ 0.06/s (n=20)
          bygrep: 313 wallclock secs (312.95 usr + 0.04 sys = 312.99 CPU) @ 0.04/s (n=13)
              s/iter bygrep  byfor
      bygrep    24.1     --   -35%
      byfor     15.6    55%     --

      This is running on a 12-million-element hash, which consumes 900MB on my machine. (I forgot I had compiled the kernel to prevent processes from consuming more than a gig, otherwise I could have pushed the size of the hash up further.)

      I should also note that this was run on 5.8.0, so I'm getting the benefit of the free scalar grep.

      At this size the for approach has started to pull away from grep. Which approach you should use therefore depends on how big the hashes are :)

      ...(and yet still later, after I read some more in the thread)... I included the difference algorithm (and I'm kicking myself for not having thought of it; one more reason why this site is such a great place), which pulls way ahead. The changes are:

      sub by_diff {
          my $h = shift;
          my $defined = grep defined, values %$h;
          ( scalar keys(%$h) - $defined, $defined );
      }

      printf "by_for:  undefined=%d, defined=%d\n", by_for(  \%h );
      printf "by_grep: undefined=%d, defined=%d\n", by_grep( \%h );
      printf "by_diff: undefined=%d, defined=%d\n", by_diff( \%h );

      cmpthese( -600, {
          byfor  => sub { by_for( \%h )  },
          bygrep => sub { by_grep( \%h ) },
          bydiff => sub { by_diff( \%h ) },
      } );

      Which gives:

      by_for:  undefined=2398740, defined=9601260
      by_grep: undefined=2398740, defined=9601260
      by_diff: undefined=2398740, defined=9601260
      Benchmark: running bydiff, byfor, bygrep for at least 600 CPU seconds...
          bydiff: 606 wallclock secs (604.93 usr + 0.11 sys = 605.04 CPU) @ 0.08/s (n=51)
           byfor: 625 wallclock secs (623.32 usr + 0.09 sys = 623.41 CPU) @ 0.06/s (n=40)
          bygrep: 604 wallclock secs (602.22 usr + 0.07 sys = 602.29 CPU) @ 0.04/s (n=25)
              s/iter bygrep  byfor bydiff
      bygrep    24.1     --   -35%   -51%
      byfor     15.6    55%     --   -24%
      bydiff    11.9   103%    31%     --

      So there you have it. At this late stage of the game, if performance is still a worry, you should start thinking about writing in C, or using a database.

      _____________________________________________
      Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.

        Not to mention the fact that grep constructs a list that you then throw away after having counted the number of elements.

        In the latest versions of perl (as of 5.8.0, IIRC), grep and map won't build a list in void context.

        ----
        I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
        -- Schemer

        Note: All code is untested, unless otherwise stated

        My aesthetic sense is somewhat offended by scanning the list twice using grep.

        Mine too - as well as my common sense (no offense broquaint ;-)

        Here's a quick benchmark of my first thought (&for_values), my second thought (&grep_subtract), your method, and broquaint's double grep.

        I'll just post the summary output from cmpthese: (perl 5.6.1)

                        Rate for_array   grep    for grep_two
        for_array    82736/s        --    -8%   -12%     -44%
        grep         90290/s        9%     --    -4%     -39%
        for          94074/s       14%     4%     --     -37%
        grep_two    148846/s       80%    65%    58%       --

        Using grep is deceptively fast - it looks like using the ternary operator in a single loop is slower than looping twice!

        By far the fastest of these is using keys to find the total number of hash elements and subtract the number of defined elements.

        I wonder how this would perform as the hash grows?

        Update: Moved Benchmark results outside of readmore...

        I msg'd you, but you weren't around, so I'll post here.

        The benchmark isn't benchmarking the giant hash - the parameter being passed to the sub is not what you think it is (which explains the amazing speed)! Just use %h directly in the sub to see the difference.

        Try using this modified sub by_for to show how the parameters are being passed:

        sub by_for {
            my $h = shift;
            die scalar keys %$h unless scalar keys %$h;
            my @def;
            $def[ defined $_ ? 1 : 0 ]++ for values %$h;
            @def;
        }

        It dies when the benchmark is run.

        Still, iterating over a 10_000_000 element list more than 160 thousand times a second isn't too shabby a performance...

        In fact, it's not shabby enough. That sort of speed should have made you take another look at your benchmark.

        Try changing your cmpthese() call to

        cmpthese( -5, {
            for  => sub { by_for(\%h)  },
            grep => sub { by_grep(\%h) },
        } );

        -sauoq
        "My two cents aren't worth a dime.";
        

      This should be about twice as fast for large hashes...

      my $defined    = grep defined, values %h;
      my $notdefined = keys(%h) - $defined;

      print "defined - $defined\n";
      print "not defined - $notdefined\n";

      -sauoq
      "My two cents aren't worth a dime.";
      
Re: Counting keys with defined or undefined elements in a hash
by hardburn (Abbot) on Jun 05, 2003 at 16:30 UTC

    Dump hash data into a string using Data::Dumper. Take the resulting string and do s/^(.*?)undef(.*?)$//g. Un-serialize string back into a hash.

    It's ugly hackery (two .*? in one regex? Yuk!), may or may not be better than O(n) (depending on how s///g is implemented and on your un-serializing algorithm), and is just generally offensive. Admittedly, I don't know the details, but I'd rather live with an O(n) solution than spend time on something that could be anywhere between O(1) and O(n**2) (or worse).

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

      since Data::Dumper's going to iterate over the hash anyway, i'm guessing this algorithm might be a fair bit slower than using perl's builtin grep function.

      ~Particle *accelerates*

Re: Counting keys with defined or undefined elements in a hash
by shemp (Deacon) on Jun 05, 2003 at 17:23 UTC
    Why not use a tied hash that reserves a key - call it "__UNDEFINED_COUNT__", for instance. The tied handlers will then increment / decrement the count as appropriate. With this, there will be no need to do any extra counting, grepping, etc. You can just say:
    my $undefined_count = $my_dna_hash->{'__UNDEFINED_COUNT__'};

    It would probably be good to write the tie so that the reserved key cannot be directly altered, i.e. give a warning or die if something external tries to modify it.
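    As a rough sketch (untested in the original thread; the package name and the reserved-key policing are my own guesses at what shemp describes), such a tie might look like:

```perl
package Tie::UndefCount;    # hypothetical package name, not on CPAN
use strict;
use warnings;

sub TIEHASH { bless { data => {}, undefs => 0 }, shift }

sub STORE {
    my ( $self, $key, $value ) = @_;
    die "reserved key\n" if $key eq '__UNDEFINED_COUNT__';
    # retract the tally for a value we are about to overwrite
    $self->{undefs}-- if exists $self->{data}{$key}
                     && !defined $self->{data}{$key};
    $self->{undefs}++ unless defined $value;
    $self->{data}{$key} = $value;
}

sub FETCH {
    my ( $self, $key ) = @_;
    return $self->{undefs} if $key eq '__UNDEFINED_COUNT__';
    return $self->{data}{$key};
}

sub DELETE {
    my ( $self, $key ) = @_;
    $self->{undefs}-- if exists $self->{data}{$key}
                     && !defined $self->{data}{$key};
    delete $self->{data}{$key};
}

sub EXISTS   { exists $_[0]{data}{ $_[1] } }
sub CLEAR    { %{ $_[0]{data} } = (); $_[0]{undefs} = 0 }
sub FIRSTKEY { my $d = $_[0]{data}; my $n = keys %$d; each %$d }
sub NEXTKEY  { each %{ $_[0]{data} } }

package main;
tie my %dna, 'Tie::UndefCount';
@dna{ 'a' .. 'c' } = ( 1, undef, 1 );
print "undefined: $dna{'__UNDEFINED_COUNT__'}\n";    # O(1), no iteration
```

    This prints "undefined: 1". The trade-off is exactly the one discussed elsewhere in the thread: every FETCH and STORE now goes through a method call, so all hash accesses pay a small constant penalty in exchange for the O(1) count.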
Re: Counting keys with defined or undefined elements in a hash
by antirice (Priest) on Jun 05, 2003 at 19:36 UTC
    *Considers getting -- on this post...ah..*

    OK. I don't know why this post strikes a nerve. I think it's just the original post. Maybe it's the feeling projected by some of the participants in this thread that this can be a very, very efficient algorithm, one that may approach O(1), when in fact it's O(n) at best. How do I know this? Quite simply: if you are presented with a list, knowing nothing about it beforehand, and nothing can be deduced from any one value in that list about the other values it contains, then EVERY value in the list must be checked. In short, yes, you absolutely MUST iterate over the entire hash. grep does it. A for loop will do it. Maybe you can sort the values? Nuh uh... that iterates over the entire hash too (the values, mind you... but with this option, why aren't you using grep?). Even the setup with the insertion-time counter must do it (although only once, instead of repeatedly). The only benefit there is that you don't need to repeat the count every time you insert values. Of course, the cost is pretty much creating your own mechanism to catch all changes to said hash.

    I don't know. It seems all this week I've been running into programmers who think you can squeeze blood out of a turnip. This is the fifth time I've seen this particular topic within a week. The last time was 15 minutes ago, just before I opened Perl Monks to settle down from having to convert the unbelieving (who's been working on making his item counter faster for the past 3 hours. Three hours!!! !@$%!@#$^. Three hours trying to make something faster than grep, based upon creating an array and sorting!! And he has a deadline coming up Tuesday and he "might get it done by then"). And then I open my browser, check out Seekers of Perl Wisdom and... *sigh*. Sorry to lash out. I just felt that someone had to say it.

    If I hurt anyone's feelings, I'm truly sorry. I'm just trying to vent and be somehow reassured that someone else can see what I see :-/

    antirice    
    The first rule of Perl club is - use Perl
    The ith rule of Perl club is - follow rule i - 1 for i > 1

      Maybe you could direct your colleague to PerlMonks that he might gain wisdom and see the error of his ways...

      -- Mike

      --
      just,my${.02}

      I must agree that many people don't understand that there are limits to algorithmic efficiency. For Perl programmers this could be particularly prevalent, because Perl has so many cool built-ins. Often we forget that something powerful like grep or map is still O(n) when iterating through an array. There are many other subtle (to the uninitiated) things Perl does that are oft overlooked, such as that foreach creates an anonymous array of what is being iterated over. But this particular one is mentioned over and over in most Perl literature.
      Update: My old-school perl is showing; foreach does not create a copy of the list. (Thanks to jsprat for pointing this out to me.)

      In any case, i think that most people learn the portions of a language that are necessary to the task at hand, and later on (maybe) pick up the bits and pieces in between. We're all probably there to some extent; in fact, i have for a while now been trying to figure out how best to find and fill some of the holes in my perl knowledge. (I've been programming Perl since 1994.)

      Anyway, I think this would be a good discussion that may want to be moved to meditations.

      Also, to again plug my tied hash solution to the problem: the tie should be only a few lines of code (50ish?), and once programmed, the run-time penalty for hash modifications should only add a few percent (like 5%) to the overall time of hash operations. I have not benchmarked this tie, as i haven't written it; the estimate is from my experience with other simple ties that ended up with small performance penalties.

      shemp
Re: Counting keys with defined or undefined elements in a hash
by Oberon (Monk) on Jun 07, 2003 at 00:46 UTC

    I always thought that using values created a new array (as opposed to using each, which didn't), but nobody mentioned it here, so I'm guessing I'm wrong. Is this something that used to be true but now isn't (and, if so, does anybody know as of what version), or am I just totally confoozled on this one?

    I like the suggestion of a tied hash, but it could be problematic if it's possible for the values to change from undef to 1 (or vice versa).

      It does create a new list, but aliases the scalars rather than copying them. Observe:
      #!/usr/bin/perl -lw
      use strict;

      my %foo = (0 => 0, 1 => 1, 2 => 2);
      $_++ for values %foo;
      print "$_ => $foo{$_}" for sort keys %foo;

      Makeshifts last the longest.

        > It does create a new list, but aliases the scalars rather than copying them.

        Right, but I thought that using each didn't create a new list. But slapping a quick Benchmark together I see that keys is nearly 3x faster than each, and values is more than twice as fast as keys ... that totally blows my understanding of the process. And here I've been favoring each over keys or values on the grounds that it used less memory and was faster ... man have I been wrong!
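        A quick sketch of such a benchmark (my own construction, not the one run above; the sub names and hash size are arbitrary):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# 100_000 keys, half with defined values -- the size is arbitrary.
my %h = map { ; $_ => ( $_ % 2 ? 1 : undef ) } 1 .. 100_000;

# Three ways to count the defined values, one per iteration style.
sub count_each {
    my $n = 0;
    while ( my ( $k, $v ) = each %h ) { $n++ if defined $v }
    $n;
}
sub count_keys   { my $n = 0; defined $h{$_} and $n++ for keys %h;  $n }
sub count_values { my $n = 0; defined and $n++ for values %h;       $n }

cmpthese( -1, {
    each_iter   => \&count_each,
    keys_iter   => \&count_keys,
    values_iter => \&count_values,
} );
```

        The expected ranking matches the observation above: values avoids both each's per-pair bookkeeping and the extra hash lookup that the keys version does, though the actual numbers will vary with the perl version.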

Node Type: perlquestion [id://263397]
Approved by broquaint
Front-paged by halley