|Problems? Is your data what you think it is?|
Re: Re: Re: Optimising processing for large data files.by tilly (Archbishop)
|on Apr 11, 2004 at 02:40 UTC||Need Help??|
If you wanted to address those that don't understand Tony Hoare's assertion, why not make sure that they understand the key part of it - don't optimize until you need to?But what the most experienced people say is, "Don't prematurely optimize!".I wasn't addressing those understand Tony Hoare's assertion, just those that don't.
But it is good to point this out. Far too often people faily to understand that their optimizations do not always optimize for someone else.Also improvements are often platform specific.Of course. Should anyone with access to a.n.other OS and a couple of hours to spare care to run my tests on their OS, I'd be interested to see how much they differ.
That seems likely.For instance you got huge improvements by using sysread rather than read. However in past discussions...Going by the date of the past discussions, I could well imagine that they were before the addition of the extra IO layers that caused the majority of the slowdown post 5.6.x
What is relevant is not how average you are or aren't, but rather what you have learned. If you want others to learn what you have learned, then it is good to constantly repeat the basic points that you have learned. Such as the value of benchmarking, and the knowledge that what is an optimization in one case might not be in another.Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.Anyone who does that hasn't read my post--at least not properly. But I have to say that I have more faith in people than you seem to have. I'm just an average programmer and if I can work these things out out, most others can too.
In the worst case, that someone will use what I describe wrongly and their program will run more slowly. If this happens, they will:The time wasted and extra obscurity in their programs counts as harm in my books. So does the time that people like me will take correcting their later public assertions about the value of their optimizations.
I made 2 corrections. One on when databases are a performance improvement, and one on the presence of a garbage collector in Perl.Now some corrections.You only make one correction, and it appears to be a matter of terminology rather than substance. Whether you class perl's reference counting/variable destruction as garbage collection or not. I have demonstrated in the past that, under Win32 at least, when large volumes of lexical variables are created and destroyed, some memory is frequently returned to the OS.
Note that the return of memory to the OS has nothing to do with whether you are using true garbage collection. True garbage collection (to me) refers to having a garbage collector that is able to detect any kind of unused memory, including circular references, and free it for reuse. The use of such a collector is independent of the decision to try to return once used but now free memory to the OS.
I think we can simplify this specific disagreement to my just saying, I disagree with you on what are the wrong ways to use a database. We have yet to determine that you'll agree with my disagreement when you understand it.Also your slams against databases are unfair to databases.I didn't make any "slams" against databases. Only against advice that they be used for the wrong purpose, and in the wrong ways.
No, you were not only discussing this problem. You also included a false assertion about when databases can give a performance improvement. That assertion is what I'm trying to correct.But even on the performance front you're unfair. Sure, databases would not help with this problem.I was only discussing this problem. By extension, I guess I was also discussing other problems of this nature, which by definition, is that class of large volume data processing problems that would not benefit from the use of a database.
You're correct in the first sentence. This is a case where necessary work is being pushed to the database, and it only works if the work that you need can be conveniently expressed in SQL. But no, this is not just an extreme form of subsetting the data, and the fact that it isn't is why it contradicts your claim that, Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program.But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the databases query engine.So what your saying is that instead of moving the data to the program, you move the program to the data. I whole heartedly agree that is the ideal way to utilise databases--for that class of programs that can be expressed in terms of SQL. But this is just an extreme form of subsetting the data. Potentially to just the results of the processing.
Consider the case where you have a very large table, and several smaller tables. You need a dataset that is the result of a join between most of the large table, and the other tables. The raw information that you need is the large table and the smaller tables. This is smaller than the result set that you will actually get. But if you have a good query plan, decent indexes, and a properly tuned database, you will often get through your data more quickly by having a Perl program work through the streamed result set and insert the data into another database rather than trying to get the raw data and perform the equivalent of a join in code while balancing all of the data.
This is not an uncommon type of operation in certain situations. For instance you see it when importing data into a data warehouse. And it is one where pushing the work into the database gives you a significant performance increase over manipulating raw data in Perl, even though the database operations increase the amount of data that you are handed.
I see no conflict between that and what I said.Hopefully you can see the conflict now that I've tried to clarify by making it more explicit when you might win despite receiving more data from the database than just the raw data set.