 
PerlMonks  

But what the most experienced people say is, "Don't prematurely optimize!".
I wasn't addressing those who understand Tony Hoare's assertion, just those that don't.
If you wanted to address those that don't understand Tony Hoare's assertion, why not make sure that they understand the key part of it - don't optimize until you need to?

Also improvements are often platform specific.
Of course. Should anyone with access to a.n.other OS and a couple of hours to spare care to run my tests on their OS, I'd be interested to see how much they differ.
But it is good to point this out. Far too often people fail to understand that their optimizations do not always optimize for someone else.
For instance you got huge improvements by using sysread rather than read. However in past discussions...
Going by the date of the past discussions, I could well imagine that they were before the addition of the extra IO layers that caused the majority of the slowdown post 5.6.x
That seems likely.

Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.
Anyone who does that hasn't read my post--at least not properly. But I have to say that I have more faith in people than you seem to have. I'm just an average programmer and if I can work these things out, most others can too.
What is relevant is not how average you are or aren't, but rather what you have learned. If you want others to learn what you have learned, then it is good to constantly repeat the basic points that you have learned. Such as the value of benchmarking, and the knowledge that what is an optimization in one case might not be in another.
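To make the benchmarking point concrete, here is a minimal sketch of how one might compare read and sysread on one's own machine before adopting either as a habit. The scratch file, its size, and the 64 KB chunk size are assumptions chosen for illustration, not figures from the original tests. The numbers it prints will differ by Perl version, IO layers, and OS, which is the point.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Hypothetical test data: a 4 MB scratch file. The size is arbitrary.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} 'x' x 4096 for 1 .. 1024;
close $fh or die "close: $!";

my $chunk = 64 * 1024;    # assumed buffer size; vary it when you measure

# Count the bytes in the file using buffered read().
sub slurp_with_read {
    open my $in, '<', $path or die "open: $!";
    binmode $in;
    my ($buf, $total) = ('', 0);
    while (my $n = read($in, $buf, $chunk)) { $total += $n }
    close $in;
    return $total;
}

# The same loop using unbuffered sysread().
sub slurp_with_sysread {
    open my $in, '<', $path or die "open: $!";
    binmode $in;
    my ($buf, $total) = ('', 0);
    while (my $n = sysread($in, $buf, $chunk)) { $total += $n }
    close $in;
    return $total;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    read    => \&slurp_with_read,
    sysread => \&slurp_with_sysread,
});
```

Run it on the platform and Perl you actually deploy on, and rerun it after an upgrade. A result from one setup says little about another.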
In the worst case, that someone will use what I describe wrongly and their program will run more slowly. If this happens, they will:

  1. Either notice and correct the problem and be more careful with the application of random optimisations that they read about in the future.

    I'd consider that a long term win.

  2. They won't notice.

    In which case their program probably didn't need to be optimised in the first place, and there is no harm done.

The time wasted and extra obscurity in their programs counts as harm in my books. So does the time that people like me will take correcting their later public assertions about the value of their optimizations.

Now some corrections.
You only make one correction, and it appears to be a matter of terminology rather than substance: whether you class perl's reference counting/variable destruction as garbage collection or not. I have demonstrated in the past that, under Win32 at least, when large volumes of lexical variables are created and destroyed, some memory is frequently returned to the OS.
I made 2 corrections. One on when databases are a performance improvement, and one on the presence of a garbage collector in Perl.

Note that the return of memory to the OS has nothing to do with whether you are using true garbage collection. True garbage collection (to me) refers to having a garbage collector that is able to detect any kind of unused memory, including circular references, and free it for reuse. The use of such a collector is independent of the decision to try to return once used but now free memory to the OS.
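The distinction can be seen in a few lines. This is only a sketch: the Node class and the $destroyed counter are made up for illustration. It counts how many objects have been freed at each step, showing that plain reference counting frees an ordinary object at scope exit but leaves a two-node cycle alone until one link is weakened with Scalar::Util's weaken.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

my $destroyed = 0;    # how many Node objects have been freed so far

{
    package Node;     # hypothetical class, just enough to observe destruction
    sub new     { my ($class, $name) = @_; return bless { name => $name }, $class }
    sub DESTROY { $destroyed++ }
}

# Case 1: no cycle. The object is freed as soon as its scope ends.
{
    my $solo = Node->new('solo');
}
my $after_plain = $destroyed;

# Case 2: two nodes that point at each other. Each keeps the other's
# reference count above zero, so neither is freed at scope exit.
{
    my $left  = Node->new('left');
    my $right = Node->new('right');
    $left->{peer}  = $right;
    $right->{peer} = $left;
}
my $after_cycle = $destroyed;

# Case 3: the same shape, but one link is weakened, so it no longer
# counts as a reference and both nodes are freed at scope exit.
{
    my $left  = Node->new('left');
    my $right = Node->new('right');
    $left->{peer}  = $right;
    $right->{peer} = $left;
    weaken($right->{peer});
}
my $after_weak = $destroyed;

print "plain: $after_plain, cycle: $after_cycle, weakened: $after_weak\n";
# prints "plain: 1, cycle: 1, weakened: 3"
```

The two nodes from case 2 stay allocated until the interpreter exits. Whether freed memory is ever handed back to the OS is a separate question that this sketch does not touch.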

Also your slams against databases are unfair to databases.
I didn't make any "slams" against databases. Only against advice that they be used for the wrong purpose, and in the wrong ways.
I think we can simplify this specific disagreement to my just saying, I disagree with you on what are the wrong ways to use a database. We have yet to determine that you'll agree with my disagreement when you understand it.
But even on the performance front you're unfair. Sure, databases would not help with this problem.
I was only discussing this problem. By extension, I guess I was also discussing other problems of this nature, which, by definition, is that class of large volume data processing problems that would not benefit from the use of a database.
No, you were not only discussing this problem. You also included a false assertion about when databases can give a performance improvement. That assertion is what I'm trying to correct.

But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the database's query engine.
So what you're saying is that instead of moving the data to the program, you move the program to the data. I wholeheartedly agree that is the ideal way to utilise databases--for that class of programs that can be expressed in terms of SQL. But this is just an extreme form of subsetting the data. Potentially to just the results of the processing.
You're correct in the first sentence. This is a case where necessary work is being pushed to the database, and it only works if the work that you need can be conveniently expressed in SQL. But no, this is not just an extreme form of subsetting the data, and the fact that it isn't is why it contradicts your claim that "Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program."

Consider the case where you have a very large table, and several smaller tables. You need a dataset that is the result of a join between most of the large table, and the other tables. The raw information that you need is the large table and the smaller tables. This is smaller than the result set that you will actually get. But if you have a good query plan, decent indexes, and a properly tuned database, you will often get through your data more quickly by having a Perl program work through the streamed result set and insert the data into another database rather than trying to get the raw data and perform the equivalent of a join in code while balancing all of the data.

This is not an uncommon type of operation in certain situations. For instance you see it when importing data into a data warehouse. And it is one where pushing the work into the database gives you a significant performance increase over manipulating raw data in Perl, even though the database operations increase the amount of data that you are handed.
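To illustrate the data-volume side of this, here is a rough sketch of the "equivalent of a join in code" on toy data. The customer and order tables, their columns, and the join_orders helper are all hypothetical. It only shows that the joined rows carry more fields in total than the raw tables did; it says nothing about timing, which needs a real database and real volumes to measure.

```perl
use strict;
use warnings;

# Hypothetical raw data: one small lookup table and one "large" table.
my %customer = (
    1 => { name => 'Alice', region => 'North' },
    2 => { name => 'Bob',   region => 'South' },
);
my @orders = (
    # [ order_id, customer_id, amount ]
    [ 101, 1, 25.00 ],
    [ 102, 2, 40.00 ],
    [ 103, 1, 15.50 ],
    [ 104, 1,  9.99 ],
);

# A hash join done by hand: look up each order's customer and emit one
# widened row per order. This is the work a database join would do for us.
sub join_orders {
    my ($orders, $customers) = @_;
    my @rows;
    for my $order (@$orders) {
        my ($order_id, $cust_id, $amount) = @$order;
        my $cust = $customers->{$cust_id} or next;    # inner join: skip orphans
        push @rows, [ $order_id, $cust_id, $amount, $cust->{name}, $cust->{region} ];
    }
    return \@rows;
}

my $joined = join_orders(\@orders, \%customer);

# Compare field counts: 3 per order plus 3 per customer (id, name, region)
# for the raw tables, against 5 per joined row.
my $raw_fields    = 3 * scalar(@orders) + 3 * scalar(keys %customer);
my $joined_fields = 5 * scalar(@$joined);

printf "raw fields: %d, joined fields: %d\n", $raw_fields, $joined_fields;
```

On toy data the hand-written version is trivial. The argument above concerns the case where the tables are large enough that holding the lookup side in memory and doing the matching yourself becomes the expensive part.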

I see no conflict between that and what I said.
Hopefully you can see the conflict now that I've tried to clarify by making it more explicit when you might win despite receiving more data from the database than just the raw data set.

In reply to Re: Re: Re: Optimising processing for large data files. by tilly
in thread Optimising processing for large data files. by BrowserUk
