Inconsistent Results with Benchmark

by benwills (Sexton)
on Dec 08, 2014 at 03:02 UTC ( #1109513=perlquestion )

benwills has asked for the wisdom of the Perl Monks concerning the following question:

I'm getting consistently inconsistent results from Perl's Benchmark module and wanted to see if there's something I'm missing. I've checked the Benchmark documentation and can't find anything on this (though perhaps I missed it), and a Super Search didn't turn up anything either.

Context: I have a couple dozen regular expressions that will be run millions of times a day, and some billions of times a day, so I'm trying to squeeze every bit of performance out of them. I'm testing them right now on a box that's doing absolutely nothing else besides being connected to the internet and running these tests (no servers running, etc.), so I can't imagine the variation is coming from something else on the machine taking resources at random times.

I began noticing that subroutines which ran earlier in the benchmarking process would often report slower times. I then ran additional benchmarks, swapping which subroutine ran first, and even after swapping the order, whichever ran first would still run slower when, before, it had run faster. I was able to reproduce this, not with 100% consistency, but consistently enough to notice.

In other words, if sub_1 is actually the faster subroutine by 15% (as measured by multiple 60-second benchmarks), then on a short benchmark of five seconds or less, when it runs first, it will often show up as slower than sub_2.

Here are the patterns of inconsistency I've noticed:

  • The first subroutine to run after the perl script starts usually takes a performance hit. That is, if a_1 is "actually" 10% faster than a_2, it will often benchmark as slower simply because it runs first (the tests seem to run in alphabetical order of their names).
  • The more times timethese runs for the same length of time (e.g. 10 seconds, five times in a row), the more consistent the results are. But if I run it for different lengths back to back (e.g. 1 sec, then 2, then 3, then 5, then 10), the comparison percentages become noticeably inconsistent.
  • The shorter the script runs, the less accurate Benchmark seems to be. I think that's generally common knowledge, but the inconsistencies above still hold: even though results are noisier at shorter timeframes, whichever subroutine runs first after the script is called consistently takes a performance hit. (The general shape of the calls I'm making is sketched just after this list.)
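
For reference, the general shape of what I'm running looks roughly like this; the sub names and regexes are placeholders rather than my actual code, and a negative count just tells Benchmark to run each test for at least that many CPU seconds:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $text = 'x' x 500_000;    # placeholder for the text I actually load

    # -5 means "run each test for at least 5 CPU seconds"
    cmpthese( -5, {
        a_1 => sub { my $n = () = $text =~ /x{10}/g },          # placeholder regex variant 1
        a_2 => sub { my $n = () = $text =~ /xxxxxxxxxx/g },     # placeholder regex variant 2
    });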

I realize this is all pretty hand-wavey. But I tested it enough times with enough variations that I feel comfortable bringing it here.

If I need to run a bunch of tests and put together some concrete examples and hard numbers, I can do that. But before I did, I wanted to see if it was common knowledge that this happens, or if this is something I should look into more concretely to understand what's going on.

And finally, if there's a more consistent and precise way to do this kind of testing, what would you suggest? I like Benchmark because I can iterate quickly as I have new ideas. Running NYTProf (which I rely on at other times) simply takes too much time to iterate through a bunch of variations as I think of them.

Perl 5.18.2 running on Ubuntu 14.04.


Edited:

Okay, here's an example.

Benchmark: running a_1, a_2 for at least 5 CPU seconds ...
       a_1:  5 wallclock secs ( 5.00 usr +  0.05 sys =  5.05 CPU) @ 67.33/s (n=340)
       a_2:  6 wallclock secs ( 5.29 usr +  0.00 sys =  5.29 CPU) @ 73.35/s (n=388)
          Rate  a_1  a_2
a_1  67.3/s    --   -8%
a_2  73.3/s    9%   --

Benchmark: running a_1, a_2 for at least 60 CPU seconds ...
       a_1: 60 wallclock secs (60.05 usr +  0.00 sys = 60.05 CPU) @ 74.44/s (n=4470)
       a_2: 63 wallclock secs (63.13 usr +  0.00 sys = 63.13 CPU) @ 73.28/s (n=4626)
          Rate  a_2  a_1
a_2  73.3/s    --   -2%
a_1  74.4/s    2%   --

a_1 is roughly 9% slower when it's the first subroutine tested in the 5-second run, but 2% faster when run for 60 seconds.
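
(For anyone reading those tables: as I understand it, cmpthese derives each percentage from the ratio of the two rates, so the figures follow directly from the reported rates. A rough check of the 5-second table in Perl:)

    printf "%.0f%%\n", 100 * ( 67.33 / 73.35 - 1 );   # a_1 vs a_2: prints -8%
    printf "%.0f%%\n", 100 * ( 73.35 / 67.33 - 1 );   # a_2 vs a_1: prints 9%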

And then the only change I make is to switch the names of a_1 and a_2 so that they run in the opposite order.

And now (below) the original a_1 (now a_2) is 9% faster on the 5-second test, since it's the second one run, and it is now 6% faster on the 60-second run.


Benchmark: running a_1, a_2 for at least 5 CPU seconds ...
       a_1:  5 wallclock secs ( 4.98 usr +  0.08 sys =  5.06 CPU) @ 68.58/s (n=347)
       a_2:  5 wallclock secs ( 5.24 usr +  0.00 sys =  5.24 CPU) @ 74.62/s (n=391)
          Rate  a_1  a_2
a_1  68.6/s    --   -8%
a_2  74.6/s    9%   --

Benchmark: running a_1, a_2 for at least 60 CPU seconds ...
       a_1: 60 wallclock secs (59.26 usr +  0.84 sys = 60.10 CPU) @ 70.72/s (n=4250)
       a_2: 63 wallclock secs (62.98 usr +  0.00 sys = 62.98 CPU) @ 74.63/s (n=4700)
          Rate  a_1  a_2
a_1  70.7/s    --   -5%
a_2  74.6/s    6%   --

This is the pattern I'm seeing fairly often.

Replies are listed 'Best First'.
Re: Inconsistent Results with Benchmark
by BrowserUk (Pope) on Dec 08, 2014 at 05:06 UTC
    the only change I make is to switch the names of a_1 and a_2 so that they run in the opposite order.

    One possibility: if the benchmarked subs/tests cause a fair amount of memory to be allocated, then when the first sub/test runs, it pays the penalty not only of perl allocating that memory from the heap, but also of perl requesting that memory from the OS. When the second sub/test runs, the memory used by the first has been returned to the heap, but not to the OS, so the second sub/test runs more quickly because no further requests to the OS for memory are required.

    Mitigation: Add another subroutine, named to sort lexically earlier than the others, that simply allocates a large(r) amount of memory in small chunks. E.g.

    aaaaaaaaaa => q[ my @a; $a[ $_ ] = [ 1 .. 10 ] for 1 .. 1e6; ],

    If you choose the constants in that correctly, it forces the heap to be expanded in the right way, such that neither of your real tests requires perl to request more memory from the OS, and thus the benchmarking is more accurate.
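
    Putting that together, the whole thing might look something like the following sketch; the a_1/a_2 entries are placeholders for whatever you are really testing, and only the dummy entry is mine:

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $text = 'x' x 500_000;    # stand-in for your real input

        cmpthese( -5, {
            # Sorts lexically before the real tests, so it runs first and
            # pre-grows perl's heap; its own timing is irrelevant.
            aaaaaaaaaa => q[ my @a; $a[ $_ ] = [ 1 .. 10 ] for 1 .. 1e6; ],

            # Placeholders for your real tests (code refs, so they can see $text).
            a_1 => sub { my $n = () = $text =~ /x{10}/g },
            a_2 => sub { my $n = () = $text =~ /xxxxxxxxxx/g },
        });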

    Note: That is just one of the possible causes, there are several others. If you posted particular examples of the code being tested, you might get more relevant possibilities and mitigations.



      That makes sense. I was loading about 500kb of text into a variable, then running the regular expressions on that.

      I just tested it (without your suggestion) with a file of about 2mb and saw a similar trend. Then tested with a smaller file and saw less of the trend.

      Then I used your idea with some slight changes, sub a_ { my @a; $a[$_] = 0 for 1 .. (4 * 1024); }, and tested with a few different file sizes. I'm not entirely sure whether the changes I made keep that sub from doing what you intended. I'm a mediocre programmer and new to Perl, so I don't fully understand how q and square brackets work in your code, even after just looking up some documentation. (I'm sure it'll sink in in a couple of days.)

      After I added that sub, I began consistently getting the same results with a 5 second timer as I do with a 60 second timer.

      Thanks for that. It was a bit discouraging earlier to find out that a few days worth of testing was mostly nullified. But understanding a little more about what's going on and figuring out how to compensate for it definitely helps.

      If you think it would be valuable for any reason, I can clean up my code and post it here. Otherwise, I think I'm good.

      (minor edits for clarification)
        I don't fully understand how q and square brackets work in your code, even after just looking up some documentation.

        Benchmark will accept a string containing a piece of code, where you normally supply a subroutine. From the synopsis:

        # Use Perl code in strings...
        timethese($count, {
            'Name1' => '...code1...',
            'Name2' => '...code2...',
        });

        # ... or use subroutine references.
        timethese($count, {
            'Name1' => sub { ...code1... },
            'Name2' => sub { ...code2... },
        });

        # cmpthese can be used both ways as well
        cmpthese($count, {
            'Name1' => '...code1...',
            'Name2' => '...code2...',
        });

        cmpthese($count, {
            'Name1' => sub { ...code1... },
            'Name2' => sub { ...code2... },
        });

        That's what my example did.

        What actually happens under the covers (greatly simplified) is that the call to the code reference (subroutine) you supply to Benchmark is eval'd into another subroutine, within your package, that wraps that call in a loop:

        my ($subcode, $subref);

        if (ref $c eq 'CODE') {
            $subcode = "sub { for (1 .. $n) { local \$_; package $pack; &\$c; } }";
            $subref  = eval $subcode;
        }
        else {
            $subcode = "sub { for (1 .. $n) { local \$_; package $pack; $c;} }";
            $subref  = _doeval($subcode);
        }

        As you can see, if what you supply is a string rather than a code ref, that string is eval'd into that extra level of subroutine instead.

        From the Benchmark docs:

        CAVEATS

        Comparing eval'd strings with code references will give you inaccurate results: a code reference will show a slightly slower execution time than the equivalent eval'd string.

        So either use code refs or strings, but do not mix the two. (Though in the case of our dummy sub that just forces preallocation of memory, it doesn't matter, as it isn't part of the timing comparison.)
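
        So for the two subs you are actually comparing, keep both entries in the same form, e.g. both as strings. A sketch with placeholder bodies; note the strings are eval'd inside Benchmark, so I keep them self-contained:

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            # Fair comparison: both entries are strings.
            cmpthese( -5, {
                a_1 => q[ my $s = 'x' x 100_000; my $n = () = $s =~ /x{10}/g ],
                a_2 => q[ my $s = 'x' x 100_000; my $n = () = $s =~ /xxxxxxxxxx/g ],
            });

            # Unfair comparison: mixing forms, e.g. a_1 as a string and a_2 as
            # a code ref -- per the caveat above, the code ref pays an extra
            # subroutine call on every iteration and so looks slightly slower.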


        Wouldn't just running each sub once before starting the benchmark do just as well?
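        i.e. something along these lines (a sketch; the sub bodies and $text are just placeholders):

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            my $text = 'x' x 500_000;                              # placeholder input
            sub a_1 { my $n = () = $text =~ /x{10}/g }             # placeholder test 1
            sub a_2 { my $n = () = $text =~ /xxxxxxxxxx/g }        # placeholder test 2

            # Warm-up: call each candidate once so any large allocations
            # happen before timing starts...
            a_1();
            a_2();

            # ...then benchmark as usual.
            cmpthese( -5, { a_1 => \&a_1, a_2 => \&a_2 } );
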
Re: Inconsistent Results with Benchmark
by Anonymous Monk on Dec 08, 2014 at 13:14 UTC
    When benchmarking, you need to collect a large number of "observations" under varying and trying-to-be-realistic conditions, then do summary statistics on the dataset of results. Any one single observation might be skewed.
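
    One low-tech way to do that with the stock Benchmark module is sketched below; the test bodies are placeholders, and it assumes Benchmark's countit function together with the iters and cpu_p accessors on the object it returns:

        use strict;
        use warnings;
        use Benchmark qw(countit);

        my $text  = 'x' x 500_000;                                  # placeholder input
        my %tests = (
            a_1 => sub { my $n = () = $text =~ /x{10}/g },          # placeholder test 1
            a_2 => sub { my $n = () = $text =~ /xxxxxxxxxx/g },     # placeholder test 2
        );

        my $samples = 10;    # independent observations per test

        for my $name (sort keys %tests) {
            my @rates;
            for (1 .. $samples) {
                my $t = countit( 2, $tests{$name} );    # run for at least 2 CPU seconds
                push @rates, $t->iters / $t->cpu_p;     # iterations per CPU second
            }
            my $mean = 0;
            $mean += $_ / @rates for @rates;
            my $var = 0;
            $var += ( $_ - $mean )**2 / ( @rates - 1 ) for @rates;
            printf "%s: mean rate %.2f/s, std dev %.2f over %d samples\n",
                   $name, $mean, sqrt($var), scalar @rates;
        }
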
      Any one single observation might be skewed.

      Which is why Benchmark already runs many iterations to generate its statistics.
