PerlMonks  

Re: Re: Re: Optimising processing for large data files.

by tilly (Archbishop)
on Apr 11, 2004 at 02:40 UTC (#344192)


in reply to Re: Re: Optimising processing for large data files.
in thread Optimising processing for large data files.

But what the most experienced people say is, "Don't prematurely optimize!".
I wasn't addressing those who understand Tony Hoare's assertion, just those that don't.
If you wanted to address those that don't understand Tony Hoare's assertion, why not make sure that they understand the key part of it - don't optimize until you need to?

Also improvements are often platform specific.
Of course. Should anyone with access to a.n.other OS and a couple of hours to spare care to run my tests on their OS, I'd be interested to see how much they differ.
But it is good to point this out. Far too often people fail to understand that their optimizations do not always optimize for someone else.
For instance you got huge improvements by using sysread rather than read. However in past discussions...
Going by the date of the past discussions, I could well imagine that they were before the addition of the extra IO layers that caused the majority of the slowdown post 5.6.x
That seems likely.

Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.
Anyone who does that hasn't read my post--at least not properly. But I have to say that I have more faith in people than you seem to have. I'm just an average programmer and if I can work these things out, most others can too.
What is relevant is not how average you are or aren't, but rather what you have learned. If you want others to learn what you have learned, then it is good to constantly repeat the basic points that you have learned. Such as the value of benchmarking, and the knowledge that what is an optimization in one case might not be in another.
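As a rough illustration of what "benchmark it on your own platform" means in practice, here is a minimal sketch. It is written in Python rather than Perl, and it is only an analogue of the read versus sysread comparison discussed above: `read_buffered` goes through the language's buffered I/O layer, `read_unbuffered` calls the OS-level read directly. All function names and the chunk size are assumptions made for the example. Which variant wins, and by how much, depends on the OS, the interpreter version, and the chunk size, which is exactly the point.

```python
import os
import tempfile
import time

CHUNK = 64 * 1024  # assumed chunk size; worth varying when benchmarking


def make_test_file(size_bytes):
    """Write a throwaway file of the given size and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"x" * size_bytes)
    return path


def read_buffered(path):
    """Read the whole file through the buffered file object."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(CHUNK)
            if not block:
                break
            total += len(block)
    return total


def read_unbuffered(path):
    """Read the whole file with raw OS-level reads, bypassing the buffer."""
    total = 0
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)  # O_BINARY exists only on Windows
    fd = os.open(path, flags)
    try:
        while True:
            block = os.read(fd, CHUNK)
            if not block:
                break
            total += len(block)
    finally:
        os.close(fd)
    return total


def best_time(func, path, repeats=3):
    """Return the fastest of several timed runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(path)
        best = min(best, time.perf_counter() - start)
    return best
```

Calling `best_time(read_buffered, path)` and `best_time(read_unbuffered, path)` on the same file gives two numbers that are only meaningful for the machine they were measured on.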
In the worst case, that someone will use what I describe wrongly and their program will run more slowly. If this happens, they will:

  1. Either notice and correct the problem and be more careful with the application of random optimisations that they read about in the future.

    I'd consider that a long term win.

  2. They won't notice.

    In which case their program probably didn't need to be optimised in the first place, and there is no harm done.

The time wasted and extra obscurity in their programs counts as harm in my books. So does the time that people like me will take correcting their later public assertions about the value of their optimizations.

Now some corrections.
You only make one correction, and it appears to be a matter of terminology rather than substance: whether you class perl's reference counting/variable destruction as garbage collection or not. I have demonstrated in the past that, under Win32 at least, when large volumes of lexical variables are created and destroyed, some memory is frequently returned to the OS.
I made 2 corrections. One on when databases are a performance improvement, and one on the presence of a garbage collector in Perl.

Note that the return of memory to the OS has nothing to do with whether you are using true garbage collection. True garbage collection (to me) refers to having a garbage collector that is able to detect any kind of unused memory, including circular references, and free it for reuse. The use of such a collector is independent of the decision to try to return once used but now free memory to the OS.
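The distinction can be made concrete with a small sketch. It uses Python rather than Perl, because CPython happens to have both mechanisms: reference counting, plus a separate cycle-detecting collector that can be switched off. With the collector disabled, the behaviour resembles a purely reference-counted system: ordinary objects are freed the moment the last reference goes away, but a circular structure is not. The `Node` class and the helper name are made up for the example, and the behaviour shown is specific to CPython.

```python
import gc
import weakref


class Node:
    """A tiny object that can point at another Node."""

    def __init__(self):
        self.other = None


def freed_by_refcount_alone(make_cycle):
    """Return True if the objects are freed without the cycle collector."""
    gc.collect()   # start from a clean state
    gc.disable()   # leave only reference counting active
    try:
        a, b = Node(), Node()
        a.other = b
        if make_cycle:
            b.other = a  # a -> b -> a: neither count can drop to zero
        probe = weakref.ref(a)  # observes a without keeping it alive
        del a, b
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # the cycle collector reclaims the circular pair
```

`freed_by_refcount_alone(False)` returns `True` and `freed_by_refcount_alone(True)` returns `False`. Neither result says anything about whether the freed memory is handed back to the OS, which is a separate decision.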

Also your slams against databases are unfair to databases.
I didn't make any "slams" against databases. Only against advice that they be used for the wrong purpose, and in the wrong ways.
I think we can simplify this specific disagreement to my just saying, I disagree with you on what are the wrong ways to use a database. We have yet to determine that you'll agree with my disagreement when you understand it.
But even on the performance front you're unfair. Sure, databases would not help with this problem.
I was only discussing this problem. By extension, I guess I was also discussing other problems of this nature, which by definition, is that class of large volume data processing problems that would not benefit from the use of a database.
No, you were not only discussing this problem. You also included a false assertion about when databases can give a performance improvement. That assertion is what I'm trying to correct.

But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the database's query engine.
So what you're saying is that instead of moving the data to the program, you move the program to the data. I wholeheartedly agree that is the ideal way to utilise databases--for that class of programs that can be expressed in terms of SQL. But this is just an extreme form of subsetting the data. Potentially to just the results of the processing.
You're correct in the first sentence. This is a case where necessary work is being pushed to the database, and it only works if the work that you need can be conveniently expressed in SQL. But no, this is not just an extreme form of subsetting the data, and the fact that it isn't is why it contradicts your claim that, "Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program."

Consider the case where you have a very large table, and several smaller tables. You need a dataset that is the result of a join between most of the large table, and the other tables. The raw information that you need is the large table and the smaller tables. This is smaller than the result set that you will actually get. But if you have a good query plan, decent indexes, and a properly tuned database, you will often get through your data more quickly by having a Perl program work through the streamed result set and insert the data into another database rather than trying to get the raw data and perform the equivalent of a join in code while balancing all of the data.

This is not an uncommon type of operation in certain situations. For instance you see it when importing data into a data warehouse. And it is one where pushing the work into the database gives you a significant performance increase over manipulating raw data in Perl, even though the database operations increase the amount of data that you are handed.
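A minimal sketch of that shape of job, assuming made-up table names (`orders`, `customers`, `products`, `order_facts`) and using Python with SQLite purely so that it is self-contained. The program never builds the join itself: it iterates over the joined result set as the database produces it and inserts each row into a second database. Every joined row repeats the customer and product details, so the program is handed more data than the raw tables contain, yet the join work stays in the query engine. It makes no claim about timings, which depend on the query plan, the indexes and the tuning mentioned above.

```python
import sqlite3


def build_source_db():
    """Create a throwaway source database: one large table, two small ones."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE products  (id INTEGER PRIMARY KEY, title TEXT, category TEXT);
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            product_id  INTEGER REFERENCES products(id),
            qty         INTEGER
        );
        """
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                     [(1, "Ann", "EU"), (2, "Bob", "US")])
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                     [(10, "Widget", "tools"), (11, "Gadget", "toys")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                     [(i, 1 + i % 2, 10 + i % 2, i) for i in range(1, 1001)])
    return conn


def stream_joined_rows(source):
    """Let the database do the join and yield result rows one at a time."""
    cursor = source.execute(
        """
        SELECT o.id, c.name, c.region, p.title, p.category, o.qty
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        JOIN products  AS p ON p.id = o.product_id
        """
    )
    yield from cursor


def load_warehouse(source, warehouse):
    """Insert the streamed join result into a second database; return the row count."""
    warehouse.execute(
        "CREATE TABLE order_facts (order_id INTEGER, customer TEXT, region TEXT,"
        " product TEXT, category TEXT, qty INTEGER)"
    )
    warehouse.executemany("INSERT INTO order_facts VALUES (?, ?, ?, ?, ?, ?)",
                          stream_joined_rows(source))
    warehouse.commit()
    return warehouse.execute("SELECT COUNT(*) FROM order_facts").fetchone()[0]
```

The alternative being argued against would fetch all three tables raw and reproduce the join in application code, holding the lookup data in memory while walking the large table.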

I see no conflict between that and what I said.
Hopefully you can see the conflict now that I've tried to clarify by making it more explicit when you might win despite receiving more data from the database than just the raw data set.


Re: Re: Re: Re: Optimising processing for large data files.
by BrowserUk (Pope) on Apr 11, 2004 at 06:48 UTC

    Example 1.

    ...nothing to do with whether you are using true garbage collection.

    I never used the phrase "true garbage collection".

    Example 2.

    You also included a false assertion about when databases can give a performance improvement.

    Wrong. To quote you: "Sure, databases would not help with this problem."

    Example 3.

    Consider the case where you have a very large table,...

    No, I will not consider that case. That case has no relevance to this discussion, nor to any assertions I made.

    My assertion, in the context of the post (re-read the title!) was:

    If you have a large volume of data in a flat file, and you need to process that data in its entirety, then moving that data into a database will never allow you to process it faster.

    That is the assertion I made. That is the only assertion I made with regard to databases.

    Unless you can use some (fairly simple, so that it can be encapsulated into an SQL query) criteria to reduce the volume of the data that the application needs to process, moving the data into a DB will not help.

    No matter how you cut it, switch it around and mix it up. For any given volume of data that an application needs to process, reading that volume of data from a flat file will always be quicker than retrieving it from a DB. Full stop.

    No amount of what-if scenarios will change that nor correct any misassertion I didn't make.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      This thread is going nowhere, fast. I'll respond to this and let you enjoy the privilege of the last response after that.

      Example 1.
      ...nothing to do with whether you are using true garbage collection.
      I never used the phrase "true garbage collection".
      True. But you did say, "The process consumed less than 2MB of memory total. There was no memory growth and the GC never had to run." In a subthread you gave an example whose behaviour suggested to you that Perl has a garbage collector that can stop the system and run GC. That implies strongly that you thought that Perl had a GC system similar to, say, Java.

      Example 2.
      You also included a false assertion about when databases can give a performance improvement.
      Wrong. To quote you: "Sure, databases would not help with this problem."
      The fact that you gave a correct statement does not change the fact that another statement was wrong. The wrong statement was, "Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program." Furthermore you've defended this statement. Repeatedly.

      Example 3.
      Consider the case where you have a very large table,...
      No, I will not consider that case. That case has no relevance to this discussion, nor to any assertions I made.
      It has relevance to your statement, "Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program." But since you refuse to consider the case, you won't see the relevance, and continuing to point it out has become a waste of energy.

      My assertion, in the context of the post (re-read the title!) was:

      If you have a large volume of data in a flat file, and you need to process that data in its entirety, then moving that data into a database will never allow you to process it faster.

      And that assertion is wrong. If the nature of the processing is that you need to correlate the data with existing datasets in a manner that can conveniently be done with a join (the existing dataset can even be included at the beginning of the flatfile as a set of different blocks), then moving the join to a database can indeed improve speed. This goes double if the amount of data to be juggled is large enough that you get into memory management issues with Perl.

      Another example which comes to mind is having to sort a very large dataset. (As in several GB of data.) A lot of research has gone into efficient sorting algorithms, and a lot of that research has gone into database design. Again, moving data into the database can win.

      That is the assertion I made. That is the only assertion I made with regard to databases.

      Unless you can use some (fairly simple, so that it can be encapsulated into an SQL query) criteria to reduce the volume of the data that the application needs to process, moving the data into a DB will not help.

      If the nature of the processing that you need to do closely matches how a database is designed to work, then you can save. Exactly because the database has been built and tuned to perform the operation that you need.

      I've given an example where it happens, and I've pointed you at an area of work where people customarily run into this issue.

      No matter how you cut it, switch it around and mix it up. For any given volume of data that an application needs to process, reading that volume of data from a flat file will always be quicker than retrieving it from a DB. Full stop.
      This is obviously true, but does not logically imply your assertion. My assertion here is that there are certain kinds of operations that databases have been designed to do well (in addition to trying to fetch data), and you are not going to be able to code those operations in Perl to run more efficiently than they already do in a database.

      Obviously unless the database is a good match for what you are going to do, and Perl is not, you would be insane to add that overhead to your process.

      But if it is a match, then the database can win. Sometimes by a lot. Despite its obvious overhead.

      No amount of what-if scenarios will change that nor correct any misassertion I didn't make.
      I see that you're confident in your view of reality. I won't try to convince you any further at this point.

      Now you can end the thread however you wish to.
