http://www.perlmonks.org?node_id=36140

rdw has asked for the wisdom of the Perl Monks concerning the following question:

Having been caught once again by 'each' not resetting, I decided to do some benchmarks to compare the performance of a while loop with 'each' and a foreach loop with 'keys'.

The results surprised me, so I'm seeking enlightenment - can anybody explain this one?

Here is the code I tried... (obviously I could have used 'values' in the second case, but this code was constructed just for benchmarking)
#! /usr/bin/perl -w use Benchmark qw(cmpthese); use strict; use integer; my %hash; for (my $i = 0; $i < 1000; $i++) { $hash{$i} = $i; } sub e { my $total = 0; keys %hash; while (my($k, $v) = each %hash) { $total += $v; last if $total > 500; } } sub k { my $total = 0; foreach my $k (keys %hash) { my $v = $hash{$k}; $total += $v; last if $total > 500; } } cmpthese(100000, { 'each' => \&e, 'keys' => \&k });

And here are the results...

Benchmark: timing 100000 iterations of each, keys...
      each:  1 wallclock secs ( 0.61 usr +  0.01 sys =  0.62 CPU) @ 161290.32/s (n=100000)
      keys: 93 wallclock secs (90.03 usr +  0.77 sys = 90.80 CPU) @ 1101.32/s (n=100000)
         Rate   keys   each
keys   1101/s     --   -99%
each 161290/s 14545%     --

...but... (and this is the weird bit) if I comment out the line which breaks out of the loop ('last if...') then the results are completely different... (I dropped the number of iterations - hope this hasn't changed too much - I got bored waiting for it)

Benchmark: timing 1000 iterations of each, keys...
      each:  2 wallclock secs ( 2.77 usr +  0.00 sys =  2.77 CPU) @ 361.01/s (n=1000)
      keys:  3 wallclock secs ( 2.36 usr +  0.00 sys =  2.36 CPU) @ 423.73/s (n=1000)
      Rate each keys
each 361/s   -- -15%
keys 424/s  17%   --

So it seems that 'each' is much quicker, but only if you plan on leaving the loop early - which is when the not resetting gotcha kicks in.

Now I understand that 'keys' would have to generate a list which might be expensive I guess, but I'm still a bit puzzled by the results. Perhaps there is something else going on.

Have fun,

rdw

Replies are listed 'Best First'.
Re: each or keys?
by tilly (Archbishop) on Oct 11, 2000 at 02:30 UTC
    keys just loops over calls to each, but does it somewhat more efficiently than you can in Perl. Which is why when you return the whole list, keys is modestly faster.

    But when you leave the loop very early with each, well that is a lot faster than going through all of them! (Hint, you are breaking out after an entry or two.)

    UPDATE
    My wording confused people. A foreach loop has to build a complete list of values before it does any processing. (Except for a hack for the range operator.) A while loop processes entries as it walks through the list. So the keys approach has to build the whole list even though most will be ignored, the each approach does not. But when you go through all of them, the each approach is doing basically what foreach did up front.

    That make sense?

      And foreach over (@array) doesn't build a new array, and foreach over (<FILEHANDLE>) doesn't slurp the whole file either. Not really a hack there. That is what foreach is for. Building a quick list to foreach over is cool, but throwing a huge list together in a hurry is hard, and worse, eats scads of memory.

      The while loop is much better for large hashes like this case.

      --
      $you = new YOU;
      honk() if $you->love(perl)

        Only some of those are true.

        It is true that foreach over (@array) uses the original array as a list. It is also true that foreach over (1..1_000_000) does not build a list of a million entries. But the file construct most assuredly does slurp a file. Try running the following command-lines interactively to show that:

        perl -e '$| = 1; print while <>'
        vs
        perl -e '$| = 1; print foreach <>'
        You will find that the first spits back your lines interactively. The second has to wait to slurp up everything you have to say before it starts printing.

        Should you ever need to write a filter for dealing with a large amount of data, be very, very careful to use while instead of foreach!

        But a random note. If you check the benchmark, you will find that keys ran faster, 423.73 iterations per second vs 356.01. If memory is not a constraint, then for hashes of this size, foreach is faster!

      Doh! You're right of course, how silly not to notice it before - the break out test is completely stupid. (I guess in the back of my mind I must have thought the values were coming out in order - which of course they are not). It breaks out very early and that explains why it's so much faster.

      If I change the test to >400000 (which gets me about 3/4 through the hash) then I get the following results...

      Benchmark: timing 1000 iterations of each, keys...
            each:  2 wallclock secs ( 2.39 usr +  0.00 sys =  2.39 CPU) @ 418.41/s (n=1000)
            keys:  2 wallclock secs ( 2.23 usr +  0.00 sys =  2.23 CPU) @ 448.43/s (n=1000)
            Rate each keys
      each 418/s   --  -7%
      keys 448/s   7%   --
      

      ...so it looks like keys generally wins - because of the 'random' order that things come out, you can't tell how soon a loop test break you out of the loop.

      Thanks tilly

      rdw

Re: each or keys?
by chromatic (Archbishop) on Oct 11, 2000 at 02:44 UTC
    Try it this way:
    #!/usr/bin/perl -w use Benchmark; use strict; use integer; my %top_hash; for (my $i = 0; $i < 1000; $i++) { $top_hash{$i} = $i; } sub e { my $total = 0; my %hash = %top_hash; while (my($k, $v) = each %hash) { $total += $v; last if $total > 500; } } sub k { my $total = 0; my %hash = %top_hash; foreach my $k (keys %hash) { my $v = $hash{$k}; $total += $v; last if $total > 500; } } timethese(500, { 'each' => \&e, 'keys' => \&k });
    Why copy the hash? My guess is that there's an internal optimization going on the first time each is called. If you copy the old hash in every time, that optimization won't take place on the hash you're passing in to both subs.

    Of course, you might print out the first ten elements of each and see if they're coming out in the same order. If the first two values you get out of each are both above 250, you'll jump out right there. If it takes a dozen from keys to get over 500, you have to go through the loop six more times. (I don't think that's likely, but it's good to check these kinds of assumptions before benchmarking).

    My results for the revised code (500 iterations because copying a 1000-element hash is pretty time consuming):

    Benchmark: timing 500 iterations of each, keys... each: 4 wallclock secs ( 3.77 usr + 0.01 sys = 3.78 CPU) keys: 5 wallclock secs ( 4.85 usr + 0.00 sys = 4.85 CPU)
      Copying the hash means that keys and values might come out in different orders, and since you're quitting when the total reaches a certain point, you have a bogus benchmark.

      I suggest setting all the values to be the same and rerunning the benchmarks.

      -- Randal L. Schwartz, Perl hacker

        Good call

        Benchmark: running each, keys, each for at least 10 CPU seconds...
              each: 11 wallclock secs (10.51 usr +  0.00 sys = 10.51 CPU) @ 417.32/s (n=4386)
              keys: 11 wallclock secs (10.53 usr +  0.00 sys = 10.53 CPU) @ 286.89/s (n=3021)
              Rate keys each
        keys 287/s   -- -31%
        each 417/s  45%   --

        This is with $top_hash{$i} = 50; and cmpthese(-10, { 'each' => \&e, 'keys' => \&k }); and otherwise the same as above.

        --
        $you = new YOU;
        honk() if $you->love(perl)

Re: each or keys?
by btrott (Parson) on Oct 11, 2000 at 02:34 UTC
    You wrote:
    > Now I understand that 'keys' would have to > generate a list which might be expensive I > guess, but I'm still a bit puzzled by the results.
    Well, you hit the proverbial nail on the head there, though. So I'm not sure there's much to be puzzled about, unless you're just wondering why keys is so slow at generating the list.

    But it is that very fact--that keys has to generate a list of the keys each time--that makes it so much slower when breaking early out of the loop. Simply put, you need to do *much* less work when you break early out of the loop when using each, because you haven't had to iterate over the entire hash and build a list.

    If, for example, you build the list of keys before you run the benchmark, then in k iterate over that list instead of calling keys anew, you'll find that indeed it is that call that's slowing it down.