Re^7: Best way to store/sum multiple-field records? ("significant")

Well, if you had repeated the same check that I did above (verify that your different approaches are actually doing the same thing), then you would have found that specifying the limit does indeed change the results (making the benchmark of questionable validity).

Also, the documentation for split says that code without an explicit limit actually infers one when assigning to a list. That is, these two lines are equivalent:

( $x, $y, $z ) = split /\|/;
( $x, $y, $z ) = split /\|/, $_, 4;

But the (implied) limit is 4, not 3, because 4 is a limit that does not change the results you get. Specifying it explicitly also results in no performance improvement.
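To make the point concrete, here is a minimal sketch of that kind of sanity check. The sub names and the sample record are made up for illustration; the record just has a trailing "|" so that the limit makes a visible difference.

```perl
use strict;
use warnings;

# Hypothetical sample record with a trailing "|".
my $record = 'USERID1|2215|Jones|';

# No explicit limit: Perl infers a limit of 4 for three variables.
sub no_limit {
    my ($line) = @_;
    my ( $x, $y, $z ) = split /\|/, $line;
    return ( $x, $y, $z );
}

# Explicit limit of 4: same fields as the implied limit.
sub limit_four {
    my ($line) = @_;
    my ( $x, $y, $z ) = split /\|/, $line, 4;
    return ( $x, $y, $z );
}

# Explicit limit of 3: the third field keeps the rest of the string.
sub limit_three {
    my ($line) = @_;
    my ( $x, $y, $z ) = split /\|/, $line, 3;
    return ( $x, $y, $z );
}

for my $case (
    [ 'no limit', \&no_limit ],
    [ 'limit 4',  \&limit_four ],
    [ 'limit 3',  \&limit_three ],
  )
{
    my ( $name, $code ) = @$case;
    printf "%-8s => ( %s )\n", $name, join ', ', $code->($record);
}
```

With this record the first two cases should print the same three fields, while the third should leave the trailing "|" attached to the last field.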

"I would tend to think that a difference of about 25% is significant, and no longer noise"

Well, the 25% rises slightly above the level of simply "noise", in that it is likely just enough to be at least moderately consistent (not flipping to a negative value on subsequent runs of the same benchmark). For such micro operations, even 20% can easily be simply noise, IME.

I actually think that Benchmark.pm should not (by default) try to "subtract overhead" as this (understandable) attempt to increase accuracy almost always just produces misleading or useless information. If the operations being benchmarked are so small that subtracting overhead matters much, then they are also so small that (in Perl) the moderate changes in their performance almost never matter.

So I certainly disagree that getting Perl to skip stripping just the final "|" is "significant" when considering performance (but certainly could be significant as to whether your results are correct or not). Build a real script that does something useful with the benchmarked code and then add in the limit of 3 (and tolerate getting different results) and I'd be shocked if the performance difference is something anybody would notice. The total performance change might even be so small as to make it difficult to even consistently measure. (Of course, if you drastically change, for example, the data being operated on, then you can drastically change the performance differences and so the results could be completely different -- even in the opposite direction.)

And I consider "is significant" to be a higher bar than "is noticeable" (which is higher than "can be measured").

If you want to know whether you have actually found a performance improvement that is "significant", then you really need to benchmark operations that (in Perl) take a lot more than 1 microsecond.

You can do this by operating on more or larger data. Though, this makes your benchmark of questionable validity if you won't be dealing with such data in reality. Or you can iterate many times within each bit of code being benchmarked. This also may threaten the validity of your benchmark if the code in question will never be iterated in such a direct way, of course.
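A rough sketch of the "iterate many times within each bit of code" option, using Benchmark's cmpthese. The sub names, the sample records, and the inner repeat count are assumptions for illustration, not the code from the earlier posts. Note that it first checks that both versions produce the same total before timing anything.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data, modeled on the records used in this thread.
my @strings = qw(
  USERID1|2215|Jones
  USERID1|1000|Jones
  USERID3|1495|Dole
  USERID2|2500|Francis
  USERID2|1500|Francis
);

# Assumed inner repeat count, so each timed call takes far more than 1 microsecond.
my $REPEAT = 1_000;

# Sum the second field of every record, $REPEAT times, with no explicit limit.
sub sum_no_limit {
    my $total = 0;
    for ( 1 .. $REPEAT ) {
        for my $line (@strings) {
            my ( $id, $amount, $name ) = split /\|/, $line;
            $total += $amount;
        }
    }
    return $total;
}

# Same work, but with an explicit limit of 3.
sub sum_limit_3 {
    my $total = 0;
    for ( 1 .. $REPEAT ) {
        for my $line (@strings) {
            my ( $id, $amount, $name ) = split /\|/, $line, 3;
            $total += $amount;
        }
    }
    return $total;
}

# Verify that both approaches are doing the same thing before timing them.
die "results differ\n" unless sum_no_limit() == sum_limit_3();

# Run each case for at least 1 CPU second and print a comparison table.
cmpthese(
    -1,
    {
        no_limit => \&sum_no_limit,
        limit_3  => \&sum_limit_3,
    }
);
```

The caveat above still applies: the inner loop is there only to make each timed call big enough to measure, and it may not resemble how the real code is called.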

If you want accurate benchmarks, then benchmark code that is an accurate representation of what your real code will be doing. Your real code doesn't get to "subtract overhead" so Benchmark.pm doing that mostly makes the results less realistic (but more "fun").
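One way to get closer to that, sketched under assumptions: time a whole, realistic unit of work with a plain wall clock (Time::HiRes) and no overhead subtraction. The sum_by_user sub and the repeated input are invented stand-ins for whatever the real script does with real data.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical stand-in for the real work: total the amount per user ID.
sub sum_by_user {
    my ($lines) = @_;
    my %total;
    for my $line (@$lines) {
        my ( $id, $amount ) = split /\|/, $line;
        $total{$id} += $amount;
    }
    return \%total;
}

# Made-up input: five sample records repeated to get a non-trivial volume.
my @records = (
    qw(
      USERID1|2215|Jones
      USERID1|1000|Jones
      USERID3|1495|Dole
      USERID2|2500|Francis
      USERID2|1500|Francis
      )
) x 20_000;

# Wall-clock timing of the whole task, loop overhead included.
my $start   = time();
my $totals  = sum_by_user( \@records );
my $elapsed = time() - $start;

printf "%d records in %.3f seconds\n", scalar @records, $elapsed;
printf "%s => %d\n", $_, $totals->{$_} for sort keys %$totals;
```

If a change to the split call makes no visible difference to the elapsed time here, it is unlikely to matter in the real script either.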

- tye


Re^8: Best way to store/sum multiple-field records? ("significant")
by Laurent_R (Canon) on Dec 24, 2014 at 00:22 UTC

tye

Well, if you had repeated the same check that I did above (verify that your different approaches are actually doing the same thing) ...

$ perl -e '$_ = "USERID1|2215|Jones|";
> my( $x, $y, $z ) = split /\|/;
> print "( $x, $y, $z )\n";'
( USERID1, 2215, Jones )

$ perl -e '$_ = "USERID1|2215|Jones|";
> ( $x, $y, $z ) = split /\|/, 3;
> print "( $x, $y, $z )\n";
> '
( 3, ,  )

$ perl -e '$_ = "USERID1|2215|Jones|";
> ( $x, $y, $z ) = split /\|/, $_, 3;
> print "( $x, $y, $z )\n";
> '
( USERID1, 2215, Jones| )

So I decided to run the test again, not changing the code but changing the data so that the records no longer end with a trailing "|":

my @strings = qw(
  USERID1|2215|Jones
  USERID1|1000|Jones
  USERID3|1495|Dole
  USERID2|2500|Francis
  USERID2|1500|Francis
);

$ perl bench_inside_outside.pl
             Rate  outside outside2   inside  inside2
outside  110902/s       --      -2%     -39%     -40%
outside2 113390/s       2%       --     -38%     -39%
inside   181595/s      64%      60%       --      -2%
inside2  186121/s      68%      64%       2%       --

Without getting into the details of your very interesting post, I would say that, sometimes, I really need to know whether one way of doing things is significantly faster than another (say, for example, s/// versus tr///, or m// versus index(), etc.). The Benchmark module is quite useful for pruning the tree of possible courses of action early. But in the end, only tests with real data really matter.

I am dealing with a customer base of 35 million, with about a billion billing services and tens of billions of usage records (phone calls, SMS, Internet connections, video downloads, etc.) per month. Performance matters for me.

Benchmarks produced with the Benchmark module give quite interesting information about the best way to do things, but the really interesting data comes from actual testing.
