Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Micro optimisations can pay off, and needn't be a maintenance problem

by BrowserUk (Pope)
on Nov 12, 2002 at 06:42 UTC ( #212211=perlmeditation: print w/ replies, xml ) Need Help??

I understand the call for not falling into the trap of premature optimisation, but there are times when performance is critical. The example I will use is in a graphics app like rbc's recent [untitled node, ID 211630] code. (Nice code BTW rbc)

When he first posted this (as a SoPW) I downloaded it and took a look. I've written several 3D graphics libraries down the years, though I've usually opted for hidden line removal rather than zbuffer, which has it limitations, so I was interested. Unfortunately, trying to help debug the code was hampered by it's rather poor performance. So I profiled in in the hope of reducing the edit debug cycle.

One of the major causes of the slow performance is his (correct to OO principles) use of accessor methods within the Vector2D and Vector3D classes. Within these he is using the 'right way', shifting or list assigning his parameters into named vars. This means that you end up with lots of small subs doing

sub setx { my ($self, $value) = @_; $self->{_x}=$value; }

As there are some 250,000 calls to routines of this nature in a single draw cycle, modifying these to do

sub setx { $_[0]->{_x} = $_[1] }

has a substantial effect on the performance of the code. Reducing the draw time of the cube, from around 2 minutes (on my box) to around 20 secs. Still not fast enough to allow the dynamic rotation of the cube (though the major bottleneck is now the Tk; module rather than the vector code itself), but still substantial in this case to be worth having. Of course, the downside is that readability suffers greatly with $_[0] representing a vector in one place a constant in the next and something completely different elsewhere, which doesn't make for readable code.

However, I thought about this and think there is a simple solution (that I haven't seen elsewhere, but may have been done before) to using speed gains of the direct parameter access, without loosing the clarity of named parameters. The clue is in the word I used above. Constants.

Instead of using either of the versions above, I think that I will adopt this

use constant SELF => 0; use constant VALUE=> 1; sub setx{ $_[SELF]->{_x} = $_[VALUE]; }

Which I believe gives the best of both worlds of clarity and self documentation, and speed.

The use of constants appears to have negligable impact on the performance of the direct access, as one would expect, but on simple accessor functions renders around a 25% performance boost over named parameters using my. Possibly not worth the effort for non-performance critical apps, but worth having where it is needed.

I actually don't see any real reasons (yet) for not using this in place of shifting to my variables where the side effects are potentially beneficial, or at least not detrimental.

Any thoughts?

My benchmark for the interested.

#! perl -sw use strict; package Things1; sub new { my $class = shift; my $self = { _thing=>0 }; return bless $self, $class; } sub get_it { my $self = shift; return $self->{_thing}; } sub set_it { my $self = shift; my $value = shift; return $self->{_thing} = $value; } 1; package Things2; sub new { my ($class) = @_; my $self = { _thing=>0 }; return bless $self, $class; } sub get_it { my ($self) = @_; return $self->{_thing}; } sub set_it { my ($self, $value) = @_; return $self->{_thing} = $value; } 1; package Things3; sub new { return bless { _thing=>0 }, $_[0]; } sub get_it { return $_[0]->{_thing}; } sub set_it { return $_[0]->{_thing} = $_[1]; } 1; package Things4; use constant CLASS => 0; use constant SELF => 0; use constant VALUE => 1; sub new { return bless { _thing=>0 }, $_[CLASS]; } sub get_it { return $_[SELF]->{_thing}; } sub set_it { return $_[SELF]->{_thing} = $_[VALUE]; } 1; package main; use strict; use vars qw/$COUNT $RUNS/; use Benchmark qw/cmpthese/; $\=$/; print $COUNT,,$RUNS; my $thing1 = new Things1; print $thing1; print "Value:", $thing1->get_it(); $thing1->set_it(99999); print "Value:", $thing1->get_it(); my $thing2 = new Things2; print $thing2; print "Value:", $thing2->get_it(); $thing2->set_it(99999); print "Value:", $thing2->get_it(); my $thing3 = new Things3; print $thing3; print "Value:", $thing3->get_it(); $thing3->set_it(99999); print "Value:", $thing3->get_it(); my $thing4 = new Things4; print $thing4; print "Value:", $thing4->get_it(); $thing4->set_it(99999); print "Value:", $thing4->get_it(); cmpthese( $::RUNS, { shift => sub{ my $thing = new Things1; $thing->set_it( 1 + $th +ing->get_it() ) for 1 .. $::COUNT; }, assign => sub{ my $thing = new Things2; $thing->set_it( 1 + $th +ing->get_it() ) for 1 .. $::COUNT; }, direct => sub{ my $thing = new Things3; $thing->set_it( 1 + $th +ing->get_it() ) for 1 .. $::COUNT; }, constant=> sub{ my $thing = new Things4; $thing->set_it( 1 + $thin +g->get_it() ) for 1 .. $::COUNT; }, } ); __END__ c:\test>212101 -COUNT=10000 -RUNS=10 1000010 Things1=HASH(0x1bdf190) Value:0 Value:99999 Things2=HASH(0x1bdf1b4) Value:0 Value:99999 Things3=HASH(0x1bd51c8) Value:0 Value:99999 Things4=HASH(0x1bd51f8) Value:0 Value:99999 Benchmark: timing 10 iterations of assign, constant, direct, shift ... assign: 4 wallclock secs ( 3.64 usr + 0.00 sys = 3.64 CPU) @ 2 +.75/s (n=10) constant: 3 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @ 3 +.15/s (n=10) direct: 3 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @ 3 +.14/s (n=10) shift: 4 wallclock secs ( 3.96 usr + 0.00 sys = 3.96 CPU) @ 2 +.53/s (n=10) Rate shift assign direct constant shift 2.53/s -- -8% -20% -20% assign 2.75/s 9% -- -12% -13% direct 3.14/s 24% 14% -- -0% constant 3.15/s 25% 14% 0% -- c:\test>

Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for cyclone or a hurricane, in which case 16 hour shifts are mandatory.
Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

SELF

Comment on Micro optimisations can pay off, and needn't be a maintenance problem
Select or Download Code
Re: Micro optimisations can pay off, and needn't be a maintenance problem
by Abigail-II (Bishop) on Nov 12, 2002 at 16:24 UTC
    I don't understand. You optimize one thing, attribute access, by 25%, but you get an overall speed up of a factor of 6. How's that possible?

    Abigail

      As I read it, 25% optimization of a single item, said item being called a quarter of a million times for the timed operation. If it was a 100 second improvement in performance overall, that would amount to a .0004-second improvement in attribute access (multiplied over accessing them 250K times). Or do I oversimplify?
        Yeah, and? It doesn't explain at all how a 25% optimization of a single item results in a 6 time speed up overall.

        Abigail

      You optimize one thing, attribute access, by 25%, but you get an overall speed up of a factor of 6. How's that possible?

      The benchmark test cases include overhead for object creation. That might skew the test results a bit.

      Getters and setters aren't benchmarked individually. Instead, the benchmark tests call getters and setters an equal number of times. But in the Z Buffer Demo getters seem to predominate (based on a quick read the code). Intuitively, getters are going to be faster than setters, perhaps enough so to explain a factor of 6 improvement.

        Well, my analysis in this case should be taken with a salt lick, but breaking it out into the original (overall) benchmark followed by one each create/set and create/get shows that the "get" case is roughly 33% (up to 40%) faster than the "set" case. It also shows that the "set" gets (2%-3%) more of an improvement out of this optimization than the "get" does.

        Breaking it out further, and accessing the previously-set $thing1 through $thing4 rather than creating a new object each iteration, I see that the individual "get" and "set" are actually about 37% faster.

        So object creation does factor in, and getters are faster. It's still "only" a 37% improvement, rather than sixfold, but it's not a unilateral 37%.

      I had been hoping to eventually see a response to explain this impossibility. This thread just now came up in the chatterbox and I'm glad that BrowserUk took some time to help me understand his perspective on the issue.

      The title of the node seems to clearly state that micro optimizations can "pay off". The node clearly indicates that about a 6-fold speed-up was obtained via micro optimization (and BrowserUk confirms in the chatterbox that this was the case). Some people who responded in this thread seem to think that it is possible for a 25% speed-up via a micro optimization to cause a 6-fold speed-up. Even more people have indicated that they think this node made exactly that claim (but BrowserUk explains that this was not his intent -- more on this in a bit).

      I can just see this thread being a huge boost to the problematic micro optimization craze. So I think it is important to strongly counter the impression this thread has been giving some people.

      First, let's address the part that BrowserUk did not intend to convey.

      [...], modifying these to do [...] has a substantial effect on the performance of the code. Reducing the draw time of the cube, from around 2 minutes (on my box) to around 20 secs.
      To me, this very clearly implies that this one simple micro optimization results in about a 6-fold speed-up. BrowserUk explains that just a few lines up, the sentence:
      One of the major causes of the slow performance is his [...] use of [...]
      was meant to indicate that several micro optimizations combined to give the 6-fold speed-up.

      So we do not have an example of a single, less-than-100% speed-up producing a total speed-up of over 500%. That would be impossible.

      If you have a micro-optimization that Benchmark.pm is capable of discerning a 50% speed-up in, you will be unlikely to see even a 25% total speed-up. Benchmark.pm goes to extreme lengths to try to isolate the item being timed. Even if your code does nearly nothing but that same bit over and over, there will be overhead that Benchmark.pm tries to eliminate from the calculations. But this overhead cannot be eliminated from the running of your script and will not be sped up by your optimization and so it will dillute the total speed up.

      In practice, a 50% speed-up via a micro optimization is pretty rare and is likely to only have a marginal impact on the total run-time of your script since most scripts don't spend even 50% of their time doing just one tiny thing. The smaller the operation you optimize, the smaller the part it is likely to play in the total run time of your script. It is a very unusual script that has even close to 50% of its total run time tied up in a "micro" operation.

      If your script spends 20% of its time doing the operation, then a 40% speed-up in that operation can't give you more than a 8% total speed-up. Measuring these percentages is certainly tricky, but the math is not tricky.

      Let's make the math really simple and even give the optimizers a big advantage. We'll interpret a "40% speed-up" as meaning that the new run time is 40% less than the original run time (new run time = 60% of original run time). Benchmark.pm would actually report this as "-40%" (the original being 40% slower) and "+71%" (the new being 71% faster) so the person doing the optimization would surely claim a "71% speed-up".

      So the simplified math would be: 1 - ( 0.80 + 0.20*(1-0.40) ) = 1 - 0.92 = 0.8 = 8%
      but in reality that represents a 71% operation speed-up producing a total speed-up of 7.4% (when the operation is 20% of the total run time).

      And having seen lots of micro optimization attempts, I feel confident in saying that micro optimization are most likely to produce less than a 20% speed-up and consist of less than 10% of the total run-time and so are very unlikely to give you even a 2% total speed-up.

      "So, if you have 100 such optimizations, you could expect a 200% speed-up"

      No.

      If you have 100 optimizations, then they can't total more than 100% of the run-time. If you have 100 micro optimizations, then I'd bet they don't total more than 50% of the run-time (because Perl has quite a bit of overhead). Even if all 100 of those micro optimizations made each of their operations 10,000-times faster, the total speed up would be less than a 50% reduction in the total run time.

      So I don't even believe BrowserUk's claim that several micro optimizations together resulted in a 6-fold speed-up.

      To get a 5-fold speed-up, you need to eliminate 4/5th (80%) of the run time. So, for example, if you amazingly found several micro operations that accounted for 90% of the run time, you'd have to speed each of those operations up 9-fold in order to get a 5-fold total speed-up:

      0.20 = 0.10 + 0.90 * x 0.10 / 0.90 = x 1/9 = x
      Well, BrowserUk chose a "major cause" of the slowness and managed to speed it up 25% (not 900%)! It just doesn't add up.

      I'm sure BrowserUk believes that micro optimizations made the code 6 times as fast. But I can't believe that conclusion at all based on the evidence present.

      I can only guess that a non-micro optimization made a more-than-6-fold improvement of a huge percentage of the total run time, leading to the over-all 6-fold speed-up and BrowserUk did not realize/recognize this.

      So I don't believe "micro optimizations can pay off" with a 6-fold speed-up.

                      - tye

        Okay Tye, as requested. I started with the version of the code as currently posted at [untitled node, ID 211630] which is a somewhat different version to the original I believe but is close enough for the differences made by rbc not to be an issue I think. I looked at trying to reverse all the non-trivial modifications I made to the original after I posted Micro optimisations can pay off, and needn't be a maintenance problem, but this proved beyond my memories capability after all this time.

        The two sets of (crude) comparative timings below show that on my machine, the original code takes 2 minutes and 5 seconds. This was measured by installing print scalar localtime; at the top and bottom of the sub main::Doit Crude, but more accurate than the "watch the clock" method I was using when I did this first time around.

        My optimised version of the code below shows that without making any changes to structure of either the code or the data, this timing has been reduced to under 4 seconds. I tried to stop when I acheived the original "around 20 seconds", but couldn't work out how.

        Now the debate will centre on whether the changes I have made constitute micro-optimisations or not. Another reason for avoiding the debate in the first place. The changes I have made include.

        • Constantising parameter access.
        • Removal of temporary variable usage.
        • Common sub-expression elimination.
        • Re-coding of if, for(;;), while statements into their modifier forms
        • Merging of consecutive loops and/or nested loops into single loops or grep or map forms.
        • Moving invarient expressions, sub-expressions and statements from inside loops to outside.
        • In two cases only, manually in-lining two trivial functions where they are called in only one place and (IMO) served no purpose by way of clarification of the code and were therefore, pure overhead.

          The two functions are compDZ() and setPixel

        I would characterise all these changes under the heading "re-factoring" as no functionality has been added, changed, or removed, nor been moved. I believe all these types of changes, have at various times by various people, been described as "micro-optimisations", and in that way been frowned upon. I did not set out to disprove this advice. I set out to assist a fellow monk with his question, and discovered that the edit/run cycle took so long that it discouraged me (and others if you look at the original thread). So, as I had an interest in seeing how the Z-buffer algorithm had been implemented, I set about trying to reduce the cycle time using simple changes. During this process, I found many places in the code where a single line would call several accessor subs. Each of these subs was shifting the argument list into named vars. Often these named vars were then passed as parameters to nested calls where they again were being shifted into another set of named vars. I originally recoded these accesses using the $_[n] nomenclature but that rapidly became confusing especially in subs where there were 4, 5 or 6 parameters.

        I then hit upon the idea of using constant subs to give meaningful names to the parameters. I had recently read somewhere that constant subs are inlined by the compiler and therefore have no more overhead than coding the constant itself and when I benchmarked a small testcase, this proved to be the case. I felt that as I hadn't seen this "discovery" described anywhere else, it was worth sharing with the community and I made my post.

        The controversy, such as it is, centers on the (obviously mathematically impossible) idea that a 25% reduction in attribute access could in some way be responsible for a "speed up factor of 6". (That phrase is in quotes because I never made that claim.) The obvious answer is "It cannot". And the reason I left the original question unanswered is that the answer is so obvious that I assumed it to be rhetorical.

        In hindsight, the phrase "Reducing the draw time of the cube, from around 2 minutes (on my box) to around 20 secs." from my original post should have read "Contributing to a reduction in the draw time of the cube, from around 2 minutes (on my box) to around 20 secs.".

        Had that read "from exactly 125.2 seconds to exactly 20.7", then I could have seen the source of the misunderstanding and the need for this belated re-examination. As it stands, if anyone feels that they where fooled by an obvious mathematical impossibility or that the idea of 'Constantising parameter access' was unworthy of a wider audience then I apologise.

        Likewise, if anyone believes that understanding the costs and benefits of TIMTOWTDI, and using them to achieve a reduction in runtime of one, non-trivial, real-world application from 125 seconds to under 4 is not worthy of consideration, I again apologise.

        Finally, if any monk feels that I profited unduly (in the oft-described, 'meaningless', XP stakes) from my original post (150 currently), then I invite them to downvote this post in recompense. Or if sufficient numbers believe that to be the case, I am quite sure that it is possible to have that number or any abitrary "fine" of XP deducted from my tally. I will not argue the point regardless.

        Please find attached my crude timings of the original and modified programs along with the modified code. I did set out to diff the two and highlight each of the changes and give my justification for why each change was, in and of itself, a micro-optimisation, but I got bored with the whole laborious process and gave up. Sorry.

        It might be worth noting that I did go on to make several more, decidedly non-trivial modifications when I was doing this originally. This included such things as

        • Using arrays and constants instead of hashes and named elements as the basis of the low-level classes.
        • Re-writing all the methods of those classes to use direct access to the instance data internally instead of the accessor functions.
        • Re-structuring the Z-Buffer itself to be a single linear array instead of an array of arrays.

        Each of these steps had a further benefit on performance, but the main blockage had moved to the Tk code and that was beyond my (self-assigned) breif to modify. I think that doing the drawing into a bitmap and blitting this to the canvas would allow the performance to be reduced by (possibly) an order of magnitude, as the overhead inherent in drawing using discrete, hit-testable lines does nothing for the application as it stands.

        Original code - timing and profile

Re: Micro optimisations can pay off, and needn't be a maintenance problem
by perrin (Chancellor) on Nov 12, 2002 at 17:44 UTC
    There are two things I'd like to point out about this. The first is that you have found an extremely rare beast: a CPU-bound Perl program! In practical applications of Perl, nearly all programs are limited by IO (database, file, network) rather than CPU. This is why no one would write a 3D rendering tool in Perl except as a learning exercise like this one.

    The second is that most of us don't actually object to micro-optimization once you've found that a particular routine is the bottleneck. What I object to is the tendency many people have to say "I haven't done any profiling or measuring of any kind to see what the problem is, but I'm going to change all of this section to use some ugly code because 'map' is 2% faster than 'foreach'", or some such.

    The biggest gains always come from changing large things, like removing an unnecessary IO operation or changing your algorithm or data structure, but if you find that a particular sub is a problem then it does make sense to try and fix it. If your object access really is the thing that's killing performance here, you could also try using arrays instead of hashes for the objects. Here's an example with your accessor optimization:

    package Things5; use constant CLASS => 0; use constant SELF => 0; use constant VALUE => 1; use constant THING => 0; sub new { my $object = []; $object->[THING] = 0; return bless $object, $_[CLASS]; } sub get_it { return $_[SELF]->[THING]; } sub set_it { return $_[SELF]->[THING] = $_[VALUE]; } 1;
      'map' is 2% faster than 'foreach'

      Wow, I never knew that! I'll go and run a search and replace over all my scripts now.

      :-)

Re: Micro optimisations can pay off, and needn't be a maintenance problem
by John M. Dlugosz (Monsignor) on Nov 12, 2002 at 21:35 UTC
    I like it, and I'll certainly remember it.

    In your setx example though, I don't think that constants will be local. Even if performed in a block, won't that still put the generated inlinable function into the enclosing package?

    Another idea is to have some other code generate the unreadable code. I've seen systems that generate accessor functions, in particular. Might as well generate the fastest code.

      Indeed, positioning the use constant... calls inside the function body in the short example is misleading. Sorry.

      As constants constructed this way become constant subs, they end up having package scope just as any sub would. It probably makes sense to have them grouped at the top of the package as shown in the benchmark code, or grouped above the sub which they are related to if the package is bigger, but in anycase, in a position where their scope is (more) obvious from their position. I'll change that. Thanks.



      Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
      Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
      Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
      Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

      Another idea is to have some other code generate the unreadable code. I've seen systems that generate accessor functions, in particular. Might as well generate the fastest code.

      I would certainly go this way. The Makefile for the package can certainly include a small script that would turn sub setx { my ($self, $value) = @_; $self->{_x}=$value; } into sub setx { $_[0]->{_x} = $_[1] }

      This is actually what I do for XML::Twig: the code I maintain is in Twig.pm.slow, and the Makefile.PL includes the following line:

      'depend' => { 'Twig.pm' => "Twig.pm.slow\n\t\$(PERL) speedup Twig.pm.slow > Twig.pm"}

      Note the use of \$(PERL) which finds the proper perl even if the command is /usr/local/bin/weird-perl Makefile.PL instead of the perl that's in the path.

      speedup does things like:

      $SET_FIELD="first_child|next_sibling|pcdata|cdata|ent|data|target|comm +ent|flushed"; s{(\$[a-z_]+(?:\[\d\])?) # $1: the object ->set_($SET_FIELD) # $2: the field to set \(([^)]*)\)} # $3: the value (must not include + ()) {$1\->\{$2\}= $3}gx;

      This replaces proper accessors by direct access to the hash.

        Hmm, I think I would prefer to keep it all in one file, perhaps putting the "original" in a =tobespeeded/=cut section.

        But, I like the idea. You are basically inlining the functions that the compiler can't.

        However, it also loses any polymorphism. That is, it becomes non-virtual.

Re: Micro optimisations can pay off, and needn't be a maintenance problem
by Courage (Parson) on Nov 13, 2002 at 09:02 UTC
    Did you considered using use Inline C => '/*c code here*/'; for speeding up even more your critical parts of program?
    I think this module has a perfect fit to a situation and yet it is very convenient to use.

    Courage, the Cowardly Dog

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://212211]
Approved by dws
Front-paged by JaWi
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (5)
As of 2014-12-21 19:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (107 votes), past polls