PerlMonks  

Re^6: Best way to store/sum multiple-field records? (carte blanche)

by Laurent_R (Canon)
on Dec 23, 2014 at 19:02 UTC


in reply to Re^5: Best way to store/sum multiple-field records? (carte blanche)
in thread Best way to store/sum multiple-field records?

I thought it would be interesting to run BrowserUk's benchmark after having fixed the two little defects.

This is the modified code:

use strict;
use warnings;
use Benchmark qw( cmpthese );

my @strings = qw(
    USERID1|2215|Jones|
    USERID1|1000|Jones|
    USERID3|1495|Dole|
    USERID2|2500|Francis|
    USERID2|1500|Francis|
);

cmpthese( -1, {
    outside => sub {
        my ( $x, $y, $z );
        for (@strings) {
            ( $x, $y, $z ) = split /\|/;
        }
    },
    outside2 => sub {
        my ( $x, $y, $z );
        for (@strings) {
            ( $x, $y, $z ) = split /\|/, $_, 3;
        }
    },
    inside => sub {
        for (@strings) {
            my ( $x, $y, $z ) = split /\|/;
        }
    },
    inside2 => sub {
        for (@strings) {
            my ( $x, $y, $z ) = split /\|/, $_, 3;
        }
    },
} );
And the benchmark results:
$ perl bench_inside_outside.pl
             Rate  outside outside2   inside  inside2
outside   90269/s       --     -20%     -40%     -51%
outside2 113390/s      26%       --     -25%     -39%
inside   151060/s      67%      33%       --     -19%
inside2  185735/s     106%      64%      23%       --
So (hoping the code is now correct), the results now consistently show (1) the quite strong advantage of declaring the variables inside the loop compared to declaring them before entering the loop (these results are well in line with AnonMonk's reported results), and (2) that choroba's idea to specify a limit also brings a measurable improvement (much less strong than the inside/outside declaration, but I would tend to think that a difference of about 25% is significant, and no longer noise).

That second point is interesting, because I have experienced in the past that specifying a limit brings an improvement when the string being split would yield (without limit) more fields than the limit, presumably because Perl is able to stop processing the string as soon as the limit is reached, but I would have thought that this advantage would to a large extent vanish when the limit is the same as the number of potential fields in the string being split. Good to know. Thank you choroba for this comment.
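A minimal sketch of the case described above, where the string holds more fields than the limit. The record contents and the variable names are made up for illustration. It only shows how many pieces split produces with and without a limit; it does not measure speed.

```perl
use strict;
use warnings;

# Hypothetical record with many more fields than we need.
my $record = join '|', 'USERID1', 2215, 'Jones', ('extra') x 20;

my @all     = split /\|/, $record;       # every field is split out
my @limited = split /\|/, $record, 4;    # at most 4 pieces; the 4th keeps the unsplit remainder

printf "no limit: %d fields\n", scalar @all;
printf "limit 4:  %d fields\n", scalar @limited;
print "first three fields agree\n" if "@all[0..2]" eq "@limited[0..2]";
```

With a limit of 4, the first three fields are the same as without a limit, and everything after the third separator stays in one unsplit piece.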

Replies are listed 'Best First'.
Re^7: Best way to store/sum multiple-field records? ("significant")
by tye (Sage) on Dec 23, 2014 at 19:49 UTC

    Well, if you had repeated the same check that I did above (verify that your different approaches are actually doing the same thing), then you would have found that specifying the limit does indeed change the results (making the benchmark of questionable validity).
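One way to run such a check before benchmarking, as a rough sketch: run each variant over the same data and compare what it returns. The helper names fields_no_limit and fields_limit_3 are hypothetical and only mirror the two split calls from the benchmark.

```perl
use strict;
use warnings;

my @strings = qw( USERID1|2215|Jones| USERID3|1495|Dole| );

# Hypothetical helpers: each returns the three fields one variant produces.
sub fields_no_limit { my ( $x, $y, $z ) = split /\|/, $_[0];    return ( $x, $y, $z ) }
sub fields_limit_3  { my ( $x, $y, $z ) = split /\|/, $_[0], 3; return ( $x, $y, $z ) }

my $mismatches = 0;
for my $s (@strings) {
    my $plain   = join ',', fields_no_limit($s);
    my $limited = join ',', fields_limit_3($s);
    next if $plain eq $limited;
    $mismatches++;
    print "MISMATCH for '$s': [$plain] vs [$limited]\n";
}
print "variants disagree on $mismatches of ", scalar(@strings), " strings\n";
```

On data with a trailing separator, every string is reported as a mismatch, because the limit of 3 leaves the final "|" attached to the third field.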

    Also, split documents that the code without the limit actually infers a limit. That is, these two lines are equivalent:

    ( $x, $y, $z ) = split /\|/;
    ( $x, $y, $z ) = split /\|/, $_, 4;

    But the (implied) limit is 4, not 3, because that limit actually results in not changing the results you get. This also results in no performance improvement.
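A small sketch that illustrates this on one of the thread's sample strings. The array names are illustrative only.

```perl
use strict;
use warnings;

my $line = "USERID1|2215|Jones|";

# Assigning to three scalars: no limit, explicit limit 4, explicit limit 3.
my @implicit = do { my ( $x, $y, $z ) = split /\|/, $line;    ( $x, $y, $z ) };
my @limit4   = do { my ( $x, $y, $z ) = split /\|/, $line, 4; ( $x, $y, $z ) };
my @limit3   = do { my ( $x, $y, $z ) = split /\|/, $line, 3; ( $x, $y, $z ) };

print join( ',', @implicit ), "\n";    # USERID1,2215,Jones
print join( ',', @limit4 ),   "\n";    # USERID1,2215,Jones
print join( ',', @limit3 ),   "\n";    # USERID1,2215,Jones|
```

Only the explicit limit of 3 changes the third field.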

    "I would tend to think that a difference of about 25% is significant, and no longer noise"

    Well, the 25% rises slightly above the level of simply "noise", in that it is likely just enough to be at least moderately consistent (not flipping to a negative value on subsequent runs of the same benchmark). For such micro operations, even 20% can easily be simply noise, IME.

    I actually think that Benchmark.pm should not (by default) try to "subtract overhead" as this (understandable) attempt to increase accuracy almost always just produces misleading or useless information. If the operations being benchmarked are so small that subtracting overhead matters much, then they are also so small that (in Perl) the moderate changes in their performance almost never matter.

    So I certainly disagree that getting Perl to skip stripping just the final "|" is "significant" when considering performance (but certainly could be significant as to whether your results are correct or not). Build a real script that does something useful with the benchmarked code and then add in the limit of 3 (and tolerate getting different results) and I'd be shocked if the performance difference is something anybody would notice. The total performance change might even be so small as to make it difficult to even consistently measure. (Of course, if you drastically change, for example, the data being operated on, then you can drastically change the performance differences and so the results could be completely different -- even in the opposite direction.)

    And I consider "is significant" to be a higher bar than "is noticeable" (which is higher than "can be measured").

    If you want to know if you have actually found a performance improvement that is "significant", then you really need to benchmark operations that (in Perl) take a lot more than 1 microsecond.

    You can do this by operating on more or larger data. Though, this makes your benchmark of questionable validity if you won't be dealing with such data in reality. Or you can iterate many times within each bit of code being benchmarked. This also may threaten the validity of your benchmark if the code in question will never be iterated in such a direct way, of course.
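As a rough sketch of the "more data" option, the same kind of comparison can be run over a larger synthetic data set, so that each timed call does far more than a microsecond of work. The record format and the 10_000 count are assumptions chosen for illustration, not taken from real data.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical, larger data set: 10_000 synthetic records without a trailing separator.
my @strings = map { "USERID$_|" . ( 1000 + $_ ) . "|Name$_" } 1 .. 10_000;

cmpthese( -1, {
    outside => sub {
        my ( $x, $y, $z );
        for (@strings) {
            ( $x, $y, $z ) = split /\|/;
        }
    },
    inside => sub {
        for (@strings) {
            my ( $x, $y, $z ) = split /\|/;
        }
    },
} );
```

The caveat above still applies: if the real program never handles this much data in one loop, the numbers may not carry over.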

    If you want accurate benchmarks, then benchmark code that is an accurate representation of what your real code will be doing. Your real code doesn't get to "subtract overhead" so Benchmark.pm doing that mostly makes the results less realistic (but more "fun").

    - tye        

      Thank you very much, tye, for your very useful and interesting comments.
      "Well, if you had repeated the same check that I did above (verify that your different approaches are actually doing the same thing) ..."
      Sadly enough, I actually did it before running the benchmark, as shown only in part here:
      $ perl -e '$_ = "USERID1|2215|Jones|";
      > my( $x, $y, $z ) = split /\|/;
      > print "( $x, $y, $z )\n";'
      ( USERID1, 2215, Jones )
      $ perl -e '$_ = "USERID1|2215|Jones|";
      > ( $x, $y, $z ) = split /\|/, 3;
      > print "( $x, $y, $z )\n";
      > '
      ( 3, , )
      $ perl -e '$_ = "USERID1|2215|Jones|";
      > ( $x, $y, $z ) = split /\|/, $_, 3;
      > print "( $x, $y, $z )\n";
      > '
      ( USERID1, 2215, Jones| )
      but I looked at the results too quickly and failed to see the difference (i.e. "Jones" versus "Jones|"). And this difference is quite significant.

      So, I decided to run the test again, not changing the code, but rather changing the data to:

      my @strings = qw(
          USERID1|2215|Jones
          USERID1|1000|Jones
          USERID3|1495|Dole
          USERID2|2500|Francis
          USERID2|1500|Francis
      );
      just because this is more in line with the type of data that I have to deal with most frequently (no separator at line end). Here is the result:
      $ perl bench_inside_outside.pl
                   Rate  outside outside2   inside  inside2
      outside  110902/s       --      -2%     -39%     -40%
      outside2 113390/s       2%       --     -38%     -39%
      inside   181595/s      64%      60%       --      -2%
      inside2  186121/s      68%      64%       2%       --
      Now, clearly, a 2% difference is not significant. This shows that my original untested opinion (that putting a limit on the split does not really matter if the number of available fields is equal to the limit) was correct, and that my subsequent opposite opinion, based on a faulty test, was wrong. Thank you for your enlightenment on this. Just in case someone worries, I am not concluding from this that I should believe my untested opinion rather than my test results, but clearly I should be more cautious about the significance of my tests.

      Without getting into the details of your very interesting post, I would say that, sometimes, I really need to know whether one way of doing things is significantly faster than another (say, for example, s/// versus tr///, or m// versus index(), etc.). The Benchmark module is quite useful for pruning the tree of possible courses of action early. But in the end, only real tests with real data really matter.

      I am dealing with a 35M customer base, with about a billion billing services, and tens of billions of usages (phone calls, SMS, Internet connections, video downloads, etc.) per month. Performance matters for me.

      Benchmarks provided by the benchmark module give quite interesting information about the best way to do things, but the really interesting data comes from actual testing.
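A rough sketch of what such "actual testing" could look like: time the whole processing loop with Time::HiRes instead of an isolated micro-operation. The synthetic records, the per-user summing and all variable names are assumptions for illustration. In practice the lines would come from real input files.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Hypothetical stand-in for real input: 100_000 synthetic "id|amount|name" records.
my @lines = map { sprintf 'USERID%d|%d|Name%d', $_ % 1000, 100 + $_ % 50, $_ % 1000 } 1 .. 100_000;

my %sum;
my $start = time;
for my $line (@lines) {
    my ( $id, $amount ) = split /\|/, $line;
    $sum{$id} += $amount;    # sum the amounts per user ID
}
my $elapsed = time - $start;

printf "%d records, %d users, %.3f s wall clock\n",
    scalar @lines, scalar keys %sum, $elapsed;
```

Running each candidate version of the loop this way, on the same input, measures the cost the real job would actually pay, overhead included.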
