PerlMonks  

Re^2: Micro optimisations can pay off, and needn't be a maintenance problem (I don't believe it)

by tye (Sage)
on Dec 27, 2002 at 23:59 UTC


in reply to Re: Micro optimisations can pay off, and needn't be a maintenance problem
in thread Micro optimisations can pay off, and needn't be a maintenance problem

I had been hoping to eventually see a response to explain this impossibility. This thread just now came up in the chatterbox and I'm glad that BrowserUk took some time to help me understand his perspective on the issue.

The title of the node seems to clearly state that micro optimizations can "pay off". The node clearly indicates that about a 6-fold speed-up was obtained via micro optimization (and BrowserUk confirms in the chatterbox that this was the case). Some people who responded in this thread seem to think that it is possible for a 25% speed-up via a micro optimization to cause a 6-fold speed-up. Even more people have indicated that they think this node made exactly that claim (but BrowserUk explains that this was not his intent -- more on this in a bit).

I can just see this thread being a huge boost to the problematic micro optimization craze. So I think it is important to strongly counter the impression this thread has been giving some people.

First, let's address the part that BrowserUk did not intend to convey.

[...], modifying these to do [...] has a substantial effect on the performance of the code. Reducing the draw time of the cube, from around 2 minutes (on my box) to around 20 secs.
To me, this very clearly implies that this one simple micro optimization results in about a 6-fold speed-up. BrowserUk explains that just a few lines up, the sentence:
One of the major causes of the slow performance is his [...] use of [...]
was meant to indicate that several micro optimizations combined to give the 6-fold speed-up.

So we do not have an example of a single, less-than-100% speed-up producing a total speed-up of over 500%. That would be impossible.

If you have a micro-optimization in which Benchmark.pm can discern a 50% speed-up, you will be unlikely to see even a 25% total speed-up. Benchmark.pm goes to extreme lengths to try to isolate the item being timed. Even if your code does nearly nothing but that same bit over and over, there will be overhead that Benchmark.pm tries to eliminate from the calculations. But this overhead cannot be eliminated from the running of your script and will not be sped up by your optimization, and so it will dilute the total speed-up.
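
To make the point concrete, here is a minimal sketch of the kind of isolated comparison Benchmark.pm encourages. The accessor subs are hypothetical stand-ins, not code from the thread; the point is that the reported difference says nothing about how much of a real program's time these operations occupy.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Hypothetical stand-ins for a tiny operation and its micro-optimised form:
    # an accessor that copies its argument into a named variable versus one
    # that reads @_ directly.
    sub get_x_named  { my ($self) = @_; return $self->{x} }
    sub get_x_direct { return $_[0]{x} }

    my $obj = { x => 42 };

    # cmpthese isolates and compares just these two snippets.
    cmpthese( -1, {
        named  => sub { my $x; $x = get_x_named($obj)  for 1 .. 100 },
        direct => sub { my $x; $x = get_x_direct($obj) for 1 .. 100 },
    } );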

In practice, a 50% speed-up via a micro optimization is pretty rare and is likely to only have a marginal impact on the total run-time of your script since most scripts don't spend even 50% of their time doing just one tiny thing. The smaller the operation you optimize, the smaller the part it is likely to play in the total run time of your script. It is a very unusual script that has even close to 50% of its total run time tied up in a "micro" operation.

If your script spends 20% of its time doing the operation, then a 40% speed-up in that operation can't give you more than an 8% total speed-up. Measuring these percentages is certainly tricky, but the math is not tricky.

Let's make the math really simple and even give the optimizers a big advantage. We'll interpret a "40% speed-up" as meaning that the new run time is 40% less than the original run time (new run time = 60% of original run time). Benchmark.pm would actually report this as "-40%" (the original being 40% slower) and "+67%" (the new being 67% faster), so the person doing the optimization would surely claim a "67% speed-up".

So the simplified math would be: 1 - ( 0.80 + 0.20*(1 - 0.40) ) = 1 - 0.92 = 0.08 = 8%. But in reality that is a change the optimizer would report as a 67% operation speed-up producing a total speed-up of only about 8% (when the operation is 20% of the total run time).
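
As a worked illustration of that arithmetic (the helper below is hypothetical, not anything from the original code):

    # Fraction of total run time saved when a fraction $f of the run time
    # is made $s times faster; the rest of the script is untouched.
    sub total_saving {
        my ( $f, $s ) = @_;
        my $new_time = ( 1 - $f ) + $f / $s;
        return 1 - $new_time;
    }

    # 20% of the run time made 1/0.60 (about 1.67x) faster,
    # i.e. the "40% less time" change discussed above:
    printf "total saving: %.1f%%\n", 100 * total_saving( 0.20, 1 / 0.60 );   # prints 8.0%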

And having seen lots of micro optimization attempts, I feel confident in saying that micro optimizations are most likely to produce less than a 20% speed-up in operations that account for less than 10% of the total run-time, and so are very unlikely to give you even a 2% total speed-up.

"So, if you have 100 such optimizations, you could expect a 200% speed-up"

No.

If you have 100 optimizations, then they can't total more than 100% of the run-time. If you have 100 micro optimizations, then I'd bet they don't total more than 50% of the run-time (because Perl has quite a bit of overhead). Even if all 100 of those micro optimizations made each of their operations 10,000-times faster, the total speed up would be less than a 50% reduction in the total run time.
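
A quick sketch of that ceiling, using the (hypothetical) numbers from the paragraph above:

    # All 100 micro-optimised operations together cover 50% of the run time,
    # and each is made 10,000 times faster.
    my $covered = 0.50;
    my $speedup = 10_000;

    my $new_time = ( 1 - $covered ) + $covered / $speedup;
    printf "new run time: %.5f of the original (%.4fx overall)\n",
        $new_time, 1 / $new_time;    # stays just above 0.5, i.e. just under 2x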

So I don't even believe BrowserUk's claim that several micro optimizations together resulted in a 6-fold speed-up.

To get a 5-fold speed-up, you need to eliminate 4/5th (80%) of the run time. So, for example, if you amazingly found several micro operations that accounted for 90% of the run time, you'd have to speed each of those operations up 9-fold in order to get a 5-fold total speed-up:

    0.20 = 0.10 + 0.90 * x
    x = 0.10 / 0.90
    x = 1/9
Well, BrowserUk chose a "major cause" of the slowness and managed to speed it up 25% (not 9-fold)! It just doesn't add up.

I'm sure BrowserUk believes that micro optimizations made the code 6 times as fast. But I can't believe that conclusion at all based on the evidence present.

I can only guess that a non-micro optimization made a more-than-6-fold improvement of a huge percentage of the total run time, leading to the over-all 6-fold speed-up and BrowserUk did not realize/recognize this.

So I don't believe "micro optimizations can pay off" with a 6-fold speed-up.

                - tye

Replies are listed 'Best First'.
Re: Re^2: Micro optimisations can pay off, and needn't be a maintenance problem (I don't believe it)
by BrowserUk (Patriarch) on Dec 29, 2002 at 16:42 UTC

    Okay Tye, as requested. I started with the version of the code as currently posted at Z Buffer Demo, which I believe is a somewhat different version from the original, but close enough that the differences made by rbc are not an issue. I looked at trying to reverse all the non-trivial modifications I made to the original after I posted Micro optimisations can pay off, and needn't be a maintenance problem, but this proved beyond my memory's capability after all this time.

    The two sets of (crude) comparative timings below show that, on my machine, the original code takes 2 minutes and 5 seconds. This was measured by installing print scalar localtime; at the top and bottom of the sub main::Doit. Crude, but more accurate than the "watch the clock" method I was using when I did this the first time around.

    My optimised version of the code below shows that, without making any changes to the structure of either the code or the data, this timing has been reduced to under 4 seconds. I tried to stop when I achieved the original "around 20 seconds", but couldn't work out how.

    Now the debate will centre on whether the changes I have made constitute micro-optimisations or not. Another reason for avoiding the debate in the first place. The changes I have made include:

    • Constantising parameter access.
    • Removal of temporary variable usage.
    • Common sub-expression elimination.
    • Re-coding of if, for(;;), while statements into their modifier forms
    • Merging of consecutive loops and/or nested loops into single loops or grep or map forms.
    • Moving invariant expressions, sub-expressions and statements from inside loops to outside (a sketch of this kind of change follows this list).
    • In two cases only, manually in-lining two trivial functions where they are called in only one place and (IMO) served no purpose by way of clarification of the code and were therefore pure overhead.

      The two functions are compDZ() and setPixel
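
    As a rough illustration of two of the kinds of change listed above (hoisting an invariant out of a loop and recoding the loop in modifier form), here is a hypothetical before/after sketch; the names are made up and it is not code from the Z Buffer Demo itself.

        use strict;
        use warnings;

        my ( $scale, $zoom ) = ( 2.0, 1.5 );
        my @points = ( { x => 1, visible => 1 }, { x => 3, visible => 0 } );

        # Before: $scale * $zoom is recomputed on every iteration.
        for ( my $i = 0 ; $i < @points ; $i++ ) {
            if ( $points[$i]{visible} ) {
                $points[$i]{sx} = $points[$i]{x} * $scale * $zoom;
            }
        }

        # After: the invariant is hoisted, and the loop uses grep and a modifier form.
        my $factor = $scale * $zoom;
        $_->{sx} = $_->{x} * $factor for grep { $_->{visible} } @points;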

    I would characterise all these changes under the heading "re-factoring", as no functionality has been added, changed, removed, or moved. I believe all these types of changes have, at various times and by various people, been described as "micro-optimisations", and in that way been frowned upon. I did not set out to disprove this advice. I set out to assist a fellow monk with his question, and discovered that the edit/run cycle took so long that it discouraged me (and others, if you look at the original thread). So, as I had an interest in seeing how the Z-buffer algorithm had been implemented, I set about trying to reduce the cycle time using simple changes.

    During this process, I found many places in the code where a single line would call several accessor subs. Each of these subs was shifting the argument list into named vars. Often these named vars were then passed as parameters to nested calls, where they again were being shifted into another set of named vars. I originally recoded these accesses using the $_[n] nomenclature, but that rapidly became confusing, especially in subs where there were 4, 5 or 6 parameters.

    I then hit upon the idea of using constant subs to give meaningful names to the parameters. I had recently read somewhere that constant subs are inlined by the compiler and therefore have no more overhead than coding the constant itself, and when I benchmarked a small test case, this proved to be the case. I felt that, as I hadn't seen this "discovery" described anywhere else, it was worth sharing with the community, and I made my post.
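
    For reference, a minimal sketch of that technique (the sub and constant names here are hypothetical, not from the original code):

        use strict;
        use warnings;
        use constant { X => 0, Y => 1, Z => 2 };    # constant subs, inlined by the compiler

        # The parameters keep meaningful names without being copied into
        # lexicals on every call: $_[X] compiles down to plain $_[0].
        sub scale_point {
            return ( $_[X] * 2, $_[Y] * 2, $_[Z] * 2 );
        }

        my @scaled = scale_point( 1, 2, 3 );    # (2, 4, 6)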

    The controversy, such as it is, centers on the (obviously mathematically impossible) idea that a 25% reduction in attribute access could in some way be responsible for a "speed up factor of 6". (That phrase is in quotes because I never made that claim.) The obvious answer is "It cannot". And the reason I left the original question unanswered is that the answer is so obvious that I assumed it to be rhetorical.

    In hindsight, the phrase "Reducing the draw time of the cube, from around 2 minutes (on my box) to around 20 secs." from my original post should have read "Contributing to a reduction in the draw time of the cube, from around 2 minutes (on my box) to around 20 secs.".

    Had that read "from exactly 125.2 seconds to exactly 20.7", then I could have seen the source of the misunderstanding and the need for this belated re-examination. As it stands, if anyone feels that they were fooled by an obvious mathematical impossibility, or that the idea of 'Constantising parameter access' was unworthy of a wider audience, then I apologise.

    Likewise, if anyone believes that understanding the costs and benefits of TIMTOWTDI, and using them to achieve a reduction in runtime of one, non-trivial, real-world application from 125 seconds to under 4 is not worthy of consideration, I again apologise.

    Finally, if any monk feels that I profited unduly (in the oft-described, 'meaningless', XP stakes) from my original post (150 currently), then I invite them to downvote this post in recompense. Or, if sufficient numbers believe that to be the case, I am quite sure that it is possible to have that number or any arbitrary "fine" of XP deducted from my tally. I will not argue the point regardless.

    Please find attached my crude timings of the original and modified programs along with the modified code. I did set out to diff the two and highlight each of the changes and give my justification for why each change was, in and of itself, a micro-optimisation, but I got bored with the whole laborious process and gave up. Sorry.

    It might be worth noting that I did go on to make several more, decidedly non-trivial modifications when I was doing this originally. These included such things as:

    • Using arrays and constants instead of hashes and named elements as the basis of the low-level classes.
    • Re-writing all the methods of those classes to use direct access to the instance data internally instead of the accessor functions.
    • Re-structuring the Z-Buffer itself to be a single linear array instead of an array of arrays (a sketch of the indexing follows below).
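
    The Z-Buffer flattening, for example, amounts to the following kind of indexing change (a hypothetical sketch, not the actual code):

        use strict;
        use warnings;

        # One linear array indexed as y * width + x instead of an array of arrays.
        my ( $width, $height ) = ( 320, 240 );
        my @zbuf = (1e38) x ( $width * $height );    # initialise every cell to "far away"

        sub depth_at {
            my ( $x, $y ) = @_;
            return $zbuf[ $y * $width + $x ];
        }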

    Each of these steps had a further benefit on performance, but the main blockage had moved to the Tk code, and that was beyond my (self-assigned) brief to modify. I think that doing the drawing into a bitmap and blitting this to the canvas would allow the run time to be reduced by (possibly) an order of magnitude, as the overhead inherent in drawing using discrete, hit-testable lines does nothing for the application as it stands.

    Original code - timing and profile

      Thank you. That clears things up quite nicely to my mind.

      A "micro optimization" is where you take a small operation (hence "micro") and make it run a bit faster (in hopes of making your program run faster). This is nearly doomed to be quite disappointing (as far as that hope is concerned).

      So let's look at your optimizations (not in quite the same order as you listed them).

      • Constantising parameter access.
      • Removal of temporary variable usage.
      • Re-coding of if, for(;;), while statements into their modifier forms
      • [...] manually in-lining two trivial functions

      Yes, these are micro optimizations.

      • Moving invariant expressions, sub-expressions and statements from inside loops to outside.

      No, this is not a micro optimization. Instead of making a tiny operation a bit faster, you are reducing how many times you run that tiny operation. So this still has the disadvantage of the operations being tiny (so the percentage of the total run time being affected is likely quite small), but it has the advantage that, instead of reducing this run time by something like 20%, you are cutting it by some large fraction. If the loop was being iterated 100 times, then you've cut that operation's run time nearly 100-fold.

      The effect on the total run time is still likely to be relatively small, simply because the tiny operation is unlikely to constitute a large fraction of the total run time. If it, for example, was 10% of the total run time, then you might speed up the run by nearly a full 10%.

      • Common sub-expression elimination.

      I'm not quite sure what you mean by this. It might be a micro optimization or it might be similar to the previous item.

      • Merging of consecutive loops and/or nested loops into single loops or grep or map forms.

      (Emphasis mine.) Here we have a mix of micro optimizations and very non-micro optimizations. Changing the nesting of loops is an important optimization technique. It is the type of thing where you can cause the entire body of a loop to be run a small fraction of the number of times that it used to be run.

      The bodies of the loops could account for a huge percentage of the original total run time and you can reduce how many times they get run by a huge fraction. This is the kind of thing that can make programs run 10s or 100s of times faster.
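
      A rough, hypothetical sketch of the kind of restructuring meant here (all the names below are made up for illustration): the loops are rearranged so the expensive body runs a small fraction of the number of times it used to.

          use strict;
          use warnings;

          my @lines  = 1 .. 100;
          my @points = map { +{ line => 1 + int rand 100, val => $_ } } 1 .. 5_000;
          sub draw { my ( $line, $p ) = @_; }    # stand-in for the real per-point work

          # Before: the body is entered scalar(@lines) * scalar(@points) times.
          for my $line (@lines) {
              for my $p (@points) {
                  draw( $line, $p ) if $p->{line} == $line;
              }
          }

          # After: bucket the points by line once; the body now runs only for
          # the points that actually belong to each line.
          my %by_line;
          push @{ $by_line{ $_->{line} } }, $_ for @points;
          for my $line (@lines) {
              draw( $line, $_ ) for @{ $by_line{$line} || [] };
          }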

      I'm sorry that I can't take the time right now to analyze the changes in detail. For now I'll assume that this explains things. If others take time to study this in more detail, I will be happy to see it and interested in their findings.

      Thank you very much, BrowserUk, for going through the effort to present this information.

                      - tye

      Just wanted to comment on the individual changes:

      Update: tye uses a precise definition of "micro optimization" in his post below and arrives at better points. Read that post first, it obsoletes most of mine.

      Makeshifts last the longest.

Re: Re^2: Micro optimisations can pay off, and needn't be a maintenance problem (I don't believe it)
by BrowserUk (Patriarch) on Dec 28, 2002 at 00:57 UTC

      I'm sorry you took this personally and so hard. I was making a strong attack on the impression that I felt the node was giving people. This was certainly not meant as an attack on you. But you took it that way, so I'm sorry for giving you that impression.

      I hope you still find the strength to post the before/after code (like you mentioned in the chatterbox that you would) so that the community can look at the changes and decide where the huge speed gain came from.

                      - tye

        Not personal, nor hard. I simply responded in-kind as always.

        As has been the subject of a recent thread, there are ways and ways of making a point.

        With nine references to my handle, this does smack of personal to me, but I will overlook that. Your decision to abandon the chatterbox debate in mid-flow and publicly promote your point in this way ... also.

        As for "finding the strength", hmm. I wonder what you are trying to imply there?

        The work involved in recreating the original scenario is considerable, inasmuch as I went on to make large-scale changes to my copy of the original code. I either have to back them out, or start with that original code and try to re-create the position I arrived at when I "discovered" the ...

        use constant ARG1 => 0;
        ...
        sub something {
            $something = $_[ARG1];
        }

        ... mechanism for using subroutine parameters directly, without giving up the benefit of meaningful names, which was the entire reason for, and the only claim I intended to make in, the original post.

        I have started to try and re-create the original situation at which I took my rough timings, and if it doesn't prove to be too frustrating a task, I may complete it, and I may post it so that the community may indeed decide for themselves if the micro optimisations involved can pay off under some circumstances.


        Examine what is said, not who speaks.
