Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
There is a prevalence hereabouts to say, "Don't optimise!". Throw bigger hardware at it. Use a database. Use a binary library. Code in C.

I don't dispute that some people say that. But what the most experienced people say is, "Don't prematurely optimize!". The key word being "prematurely".

Usual estimates are that most programs spend about 80% of their time in 20% of the code, and about 50% of the time in just 4% or so. This suggests that improving a very small fraction of the program after the fact can yield dramatic performance improvements. Those estimates actually date back to studies of FORTRAN programs decades ago. I've never seen anything to suggest that they aren't decent guidelines Perl programs today. (And they are similar to, for instance, what time and motion studies found about speeding up factory and office work before computers.)

With that relationship, you want to leave optimization for later. First of all because in practice later never needs to come. Secondly if it does come then, as you demonstrated, there often are big wins available for relatively little work. However the wins are not going to always be in easily predicted places. Therefore the advice to leave optimizing until you both know how much optimizing you need to do and can determine where it needs to be optimized.

You didn't need profiling tools in this case because the program was small and simple. In a larger one, though, you need to figure out what good places to look at are. After all if 50% of your time is really spent in 4% of your code, there is no point in starting off by putting a lot of energy into a section that you don't know affects that 4%. (Note that when you find your hotspots, sometimes you'll want to look at the hotspot, and sometimes you'll want to look at whether you need to call it where it is being called. People often miss the second.)

Also improvements are often platform specific. For instance you got huge improvements by using sysread rather than read. However in past discussions people have found that which one wins depends on what operating system you are using. So for someone else, that optimization may be a pessimization instead. Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.

Now some corrections. You made frequent references in your post to theories that you have about when GC was likely to run. These theories are wrong because Perl is not a garbage collected language.

Also your slams against databases are unfair to databases. First of all the reasons to use databases often have little or nothing to do with performance. Instead it has to do with things like managing concurrent access and consistency of data from multiple applications on multiple machines. If that is your need, you probably want a database even if it costs huge performance overhead. Rolling your own you'll probably make mistakes. Even if you don't, by the time you've managed to get all of the details sorted out, you'll have recreated a database, only not as well.

But even on the performance front you're unfair. Sure, databases would not help with this problem. It is also true that most of the time where databases are a performance win, they win because they hand the application only the greatly reduced subset of the raw data that it really needs. But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the databases query engine, which tends to optimize better than most programmers know how to. (Indeed an amusing recurring performance problem with good databases is that programmers think that an index is "obviously better" and go out of their way to force the database to use one when they are better off with a full table scan.)

In reply to Re: Optimising processing for large data files. by tilly
in thread Optimising processing for large data files. by BrowserUk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others cooling their heels in the Monastery: (3)
    As of 2020-02-23 02:31 GMT
    Find Nodes?
      Voting Booth?
      What numbers are you going to focus on primarily in 2020?

      Results (102 votes). Check out past polls.