PerlMonks  

Re: Optimising processing for large data files.

by tilly (Archbishop)
on Apr 10, 2004 at 13:24 UTC ( #344121 )


in reply to Optimising processing for large data files.

There is a prevalence hereabouts to say, "Don't optimise!". Throw bigger hardware at it. Use a database. Use a binary library. Code in C.

I don't dispute that some people say that. But what the most experienced people say is, "Don't prematurely optimize!". The key word being "prematurely".

Usual estimates are that most programs spend about 80% of their time in 20% of the code, and about 50% of the time in just 4% or so. This suggests that improving a very small fraction of the program after the fact can yield dramatic performance improvements. Those estimates actually date back to studies of FORTRAN programs decades ago. I've never seen anything to suggest that they aren't decent guidelines for Perl programs today. (And they are similar to, for instance, what time and motion studies found about speeding up factory and office work before computers.)

With that relationship, you want to leave optimization for later. First of all because, in practice, later often never needs to come. Secondly, if it does come, then, as you demonstrated, there are often big wins available for relatively little work. However, the wins are not always going to be in easily predicted places. Hence the advice to leave optimizing until you know both how much optimizing you need to do and where it needs to be done.

You didn't need profiling tools in this case because the program was small and simple. In a larger one, though, you need them to figure out where the good places to look are. After all, if 50% of your time is really spent in 4% of your code, there is no point in starting off by putting a lot of energy into a section that you don't know is part of that 4%. (Note that when you find your hotspots, sometimes you'll want to look at the hotspot itself, and sometimes you'll want to look at whether you need to call it where it is being called. People often miss the second.)
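In a larger program, the core Time::HiRes module gives a crude way to compare candidate sections before reaching for a full profiler (such as Devel::DProf, the standard one at the time). A sketch; the two sections below are invented purely for illustration:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Crude hotspot hunting: wall-clock two candidate sections separately.
# A real profiler does this per subroutine, across the whole program.
my $t0 = [gettimeofday];
my $sum = 0;
$sum += $_ for 1 .. 100_000;            # candidate section A
my $time_a = tv_interval($t0);

$t0 = [gettimeofday];
my $joined = join ',', 1 .. 100_000;    # candidate section B
my $time_b = tv_interval($t0);

printf "A took %.6fs, B took %.6fs\n", $time_a, $time_b;
```

Whichever section dominates is where optimization effort should go first.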

Also improvements are often platform specific. For instance you got huge improvements by using sysread rather than read. However in past discussions people have found that which one wins depends on what operating system you are using. So for someone else, that optimization may be a pessimization instead. Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.
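Rather than adopting either habit, one can measure on one's own platform with the core Benchmark and File::Temp modules. A sketch, with an arbitrary file size and buffer size:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use File::Temp qw(tempfile);

# Build a 1 MB throwaway file to read back.
my ($out, $file) = tempfile(UNLINK => 1);
print $out 'x' x (1024 * 1024);
close $out;

# Compare buffered read() against unbuffered sysread() on this
# platform; a negative count runs each sub for at least 1 CPU second.
my $results = timethese(-1, {
    'read' => sub {
        open my $in, '<', $file or die $!;
        my $buf;
        1 while read $in, $buf, 64 * 1024;
        close $in;
    },
    'sysread' => sub {
        open my $in, '<', $file or die $!;
        my $buf;
        1 while sysread $in, $buf, 64 * 1024;
        close $in;
    },
});
```

Which of the two wins can differ by operating system and Perl build, which is exactly the point above.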

Now some corrections. You made frequent references in your post to theories that you have about when GC was likely to run. These theories are wrong because Perl is not a garbage collected language.

Also, your slams against databases are unfair to databases. First of all, the reasons to use databases often have little or nothing to do with performance. Instead they have to do with things like managing concurrent access and keeping data consistent across multiple applications on multiple machines. If that is your need, you probably want a database even if it costs huge performance overhead. Rolling your own, you'll probably make mistakes. Even if you don't, by the time you've managed to get all of the details sorted out, you'll have recreated a database, only not as well.

But even on the performance front you're unfair. Sure, databases would not help with this problem. It is also true that most of the time where databases are a performance win, they win because they hand the application only the greatly reduced subset of the raw data that it really needs. But databases are often a performance win even when they don't reduce how much data you need to fetch, because they move processing into the database's query engine, which tends to optimize better than most programmers know how to. (Indeed, an amusing recurring performance problem with good databases is that programmers think that an index is "obviously better" and go out of their way to force the database to use one when they are better off with a full table scan.)


Re: Re: Optimising processing for large data files.
by kvale (Monsignor) on Apr 10, 2004 at 15:11 UTC
    All excellent points, except that perl does use garbage collection for memory management, so that Perl is effectively a garbage-collected language. perl uses a refcount-based scheme; perlobj talks about the two-phase garbage collection scheme.

    I am guessing that you know this already. Perhaps it doesn't fit your definition of GC?

    -Mark

      I knew this, and no, I don't consider it true GC. (Circular references cause you to leak memory at runtime.)

      It certainly wasn't a GC algorithm in the sense that BrowserUK was referring to, something that would randomly halt your program while it went through a garbage collection phase.
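      The practical consequence of reference counting can be demonstrated directly; the small Node class below is invented for illustration, and Scalar::Util::weaken (core in 5.8) is one way to break such a cycle:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

package Node;
my $live = 0;                       # count of undestroyed objects
sub new     { $live++; return bless {}, shift }
sub DESTROY { $live-- }
sub live    { $live }

package main;

{
    my $x = Node->new;
    my $y = Node->new;
    $x->{peer} = $y;                # each object holds a strong
    $y->{peer} = $x;                # reference to the other
}
# The cycle kept both refcounts above zero: nothing was reclaimed.
print 'leaked: ', Node::live(), "\n";   # leaked: 2

{
    my $x = Node->new;
    my $y = Node->new;
    $x->{peer} = $y;
    $y->{peer} = $x;
    weaken( $y->{peer} );           # weak link breaks the cycle
}
# Only the first, unweakened pair is still leaked.
print 'leaked: ', Node::live(), "\n";   # leaked: 2
```

A tracing collector would eventually reclaim the first pair too; Perl's reference counting never will.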

        Okay. If anyone with perl v5.8.2 (AS 808) running under XP (or similar configuration) is following this discussion, could they please run the following code under these conditions.

        1. Download the code below and save as "buk.pl".
        2. Create a datafile of say 30MB. It doesn't matter what it contains.
        3. Start the task manager and configure it with:
          1. Click the "Performance" tab and note how much Available Physical Memory your system has.
          2. Click the "Processes" tab.
          3. View->select columns...

            Ensure that "Memory usage", "Memory Usage Delta" & "Virtual Memory Size" columns are all checked.

          4. Ensure that all 3 columns are visible (preferably next to each other, by temporarily unchecking any intermediate ones).
          5. Check View->Update speed->High.
          6. Check Options->Always on top.
          7. Adjust the task manager window to a convenient size and position so that you can monitor it whilst running the code.
          8. Click the "CPU" column header a couple of times to ensure that the display is sorted by cpu usage in descending order.
        4. Switch to a command line and run the program.

          buk datafile

        Watch the 3 memory columns for perl.exe (should become the top item if you followed the above directions and don't have any other cpu intensive processes running) as the program runs.

        Watch carefully, and note how the "Mem Usage" figure steadily rises for a short period before suddenly dropping back.

        The "Mem Delta" figure will become negative (the value displayed in parentheses) each time the "Mem Usage" figure falls back.

        Note that the "VM Size" value tracks the "Mem Usage" closely whilst being slightly larger, and grows steadily for a short period before falling back in step with "Mem Usage".

        Note that each time it falls back it doesn't fall as far as it grew, resulting in an overall steady increase in the memory usage.

        Note that the fallbacks seem to grow ever larger, and more frequent, with time.

        Once you have seen enough, ^C the program.

        Don't allow the "Mem Usage" value to approach the "Available Physical Memory" figure, as by then you will have moved into swapping and the picture becomes confused: the OS starts swapping memory from other processes to disk and all the Mem Delta figures start showing (negative) decreases.

        I'd be really grateful if at least one other person could confirm that they too see the behaviour described.

        #! perl -slw
        use strict;

        my @cache;

        open( FH, '< :raw', $ARGV[ 0 ] ) or die $!;

        while( <FH> ) {
            push @cache, split '', $_;
            my $pair = shift( @cache ) . $cache[ 499 ] for 0 .. $#cache - 500;
        }

        close FH;

        Assuming that this behaviour isn't a figment of my imagination and is confirmed by other(s), then if anyone has a better explanation of the (temporary, but often substantial) reductions in perl.exe's memory usage, other than Perl periodically freeing heap memory back to the OS as part of some "garbage collection like" process, I'm ready to eat my hat and apologise for misleading the monks.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: Re: Optimising processing for large data files.
by BrowserUk (Pope) on Apr 10, 2004 at 17:11 UTC
    But what the most experienced people say is, "Don't prematurely optimize!".

    I wasn't addressing those who understand Tony Hoare's assertion, just those that don't.

    Also improvements are often platform specific.

    Of course. Should anyone with access to a.n.other OS and a couple of hours to spare care to run my tests on their OS, I'd be interested to see how much they differ.

    For instance you got huge improvements by using sysread rather than read. However in past discussions...

    Going by the date of the past discussions, I could well imagine that they were before the addition of the extra IO layers that caused the majority of the slowdown post-5.6.x.

    Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.

    Anyone who does that hasn't read my post--at least not properly. But I have to say that I have more faith in people than you seem to have. I'm just an average programmer and if I can work these things out, most others can too.

    In the worst case, that someone will use what I describe wrongly and their program will run more slowly. If this happens, they will:

    1. Either notice and correct the problem and be more careful with the application of random optimisations that they read about in the future.

      I'd consider that a long term win.

    2. They won't notice.

      In which case their program probably didn't need to be optimised in the first place, and there is no harm done.

    Now some corrections.

    You only make one correction, and it appears to be a matter of terminology rather than substance: whether you class perl's reference counting/variable destruction as garbage collection or not. I have demonstrated in the past that, under Win32 at least, when large volumes of lexical variables are created and destroyed, some memory is frequently returned to the OS.

    Also your slams against databases are unfair to databases.

    I didn't make any "slams" against databases. Only against advice that they be used for the wrong purpose, and in the wrong ways.

    But even on the performance front you're unfair. Sure, databases would not help with this problem.

    I was only discussing this problem. By extension, I guess I was also discussing other problems of this nature, which by definition, is that class of large volume data processing problems that would not benefit from the use of a database.

    But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the databases query engine.

    So what you're saying is that instead of moving the data to the program, you move the program to the data. I wholeheartedly agree that is the ideal way to utilise databases--for that class of programs that can be expressed in terms of SQL. But this is just an extreme form of subsetting the data. Potentially to just the results of the processing.

    I see no conflict between that and what I said.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      But what the most experienced people say is, "Don't prematurely optimize!".
      I wasn't addressing those who understand Tony Hoare's assertion, just those that don't.
      If you wanted to address those that don't understand Tony Hoare's assertion, why not make sure that they understand the key part of it - don't optimize until you need to?

      Also improvements are often platform specific.
      Of course. Should anyone with access to a.n.other OS and a couple of hours to spare care to run my tests on their OS, I'd be interested to see how much they differ.
      But it is good to point this out. Far too often people fail to understand that their optimizations do not always optimize for someone else.
      For instance you got huge improvements by using sysread rather than read. However in past discussions...
      Going by the date of the past discussions, I could well imagine that they were before the addition of the extra IO layers that caused the majority of the slowdown post-5.6.x.
      That seems likely.

      Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.
      Anyone who does that hasn't read my post--at least not properly. But I have to say that I have more faith in people than you seem to have. I'm just an average programmer and if I can work these things out, most others can too.
      What is relevant is not how average you are or aren't, but rather what you have learned. If you want others to learn what you have learned, then it is good to constantly repeat the basic points that you have learned. Such as the value of benchmarking, and the knowledge that what is an optimization in one case might not be in another.
      In the worst case, that someone will use what I describe wrongly and their program will run more slowly. If this happens, they will:

      1. Either notice and correct the problem and be more careful with the application of random optimisations that they read about in the future.

        I'd consider that a long term win.

      2. They won't notice.

        In which case their program probably didn't need to be optimised in the first place, and there is no harm done.

      The time wasted and extra obscurity in their programs counts as harm in my books. So does the time that people like me will take correcting their later public assertions about the value of their optimizations.

      Now some corrections.
      You only make one correction, and it appears to be a matter of terminology rather than substance: whether you class perl's reference counting/variable destruction as garbage collection or not. I have demonstrated in the past that, under Win32 at least, when large volumes of lexical variables are created and destroyed, some memory is frequently returned to the OS.
      I made 2 corrections. One on when databases are a performance improvement, and one on the presence of a garbage collector in Perl.

      Note that the return of memory to the OS has nothing to do with whether you are using true garbage collection. True garbage collection (to me) refers to having a garbage collector that is able to detect any kind of unused memory, including circular references, and free it for reuse. The use of such a collector is independent of the decision to try to return once used but now free memory to the OS.

      Also your slams against databases are unfair to databases.
      I didn't make any "slams" against databases. Only against advice that they be used for the wrong purpose, and in the wrong ways.
      I think we can simplify this specific disagreement to my just saying, I disagree with you on what are the wrong ways to use a database. We have yet to determine that you'll agree with my disagreement when you understand it.
      But even on the performance front you're unfair. Sure, databases would not help with this problem.
      I was only discussing this problem. By extension, I guess I was also discussing other problems of this nature, which by definition, is that class of large volume data processing problems that would not benefit from the use of a database.
      No, you were not only discussing this problem. You also included a false assertion about when databases can give a performance improvement. That assertion is what I'm trying to correct.

      But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the databases query engine.
      So what you're saying is that instead of moving the data to the program, you move the program to the data. I wholeheartedly agree that is the ideal way to utilise databases--for that class of programs that can be expressed in terms of SQL. But this is just an extreme form of subsetting the data. Potentially to just the results of the processing.
      You're correct in the first sentence. This is a case where necessary work is being pushed to the database, and it only works if the work that you need can be conveniently expressed in SQL. But no, this is not just an extreme form of subsetting the data, and the fact that it isn't is why it contradicts your claim that "Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program".

      Consider the case where you have a very large table, and several smaller tables. You need a dataset that is the result of a join between most of the large table, and the other tables. The raw information that you need is the large table and the smaller tables. This is smaller than the result set that you will actually get. But if you have a good query plan, decent indexes, and a properly tuned database, you will often get through your data more quickly by having a Perl program work through the streamed result set and insert the data into another database rather than trying to get the raw data and perform the equivalent of a join in code while balancing all of the data.

      This is not an uncommon type of operation in certain situations. For instance you see it when importing data into a data warehouse. And it is one where pushing the work into the database gives you a significant performance increase over manipulating raw data in Perl, even though the database operations increase the amount of data that you are handed.
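      As a sketch of that pattern, assuming DBD::SQLite is installed (the orders/customers schema here is made up): the join runs inside the query engine, and the Perl side merely streams the result set, even though each joined row is wider than the raw rows it came from:

```perl
use strict;
use warnings;
use DBI;

# Assumes DBD::SQLite is available; schema and data are invented.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)');
$dbh->do('CREATE TABLE orders (id INTEGER PRIMARY KEY,
                               cust_id INTEGER, amount REAL)');

$dbh->do('INSERT INTO customers VALUES (?, ?)', undef, $_, "region_$_")
    for 1 .. 5;
my $ins = $dbh->prepare('INSERT INTO orders VALUES (?, ?, ?)');
$ins->execute($_, 1 + $_ % 5, $_ * 1.5) for 1 .. 1_000;

# The join happens inside the query engine; the program only walks
# the streamed result set rather than joining the raw tables in code.
my $sth = $dbh->prepare(q{
    SELECT o.id, o.amount, c.region
    FROM   orders o JOIN customers c ON o.cust_id = c.id
});
$sth->execute;
my $rows = 0;
$rows++ while $sth->fetchrow_arrayref;
print "joined rows: $rows\n";   # joined rows: 1000
```

With realistic table sizes, a tuned query plan usually beats reimplementing the same join with Perl hashes, which is the point being made above.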

      I see no conflict between that and what I said.
      Hopefully you can see the conflict now that I've tried to clarify by making it more explicit when you might win despite receiving more data from the database than just the raw data set.

        Example 1.

        ...nothing to do with whether you are using true garbage collection.

        I never used the phrase "true garbage collection".

        Example 2.

        You also included a false assertion about when databases can give a performance improvement.

        Wrong. To quote you: "Sure, databases would not help with this problem."

        Example 3.

        Consider the case where you have a very large table,...

        No, I will not consider that case. That case has no relevance to this discussion, nor to any assertions I made.

        My assertion, in the context of the post (re-read the title!) was:

        If you have a large volume of data in a flat file, and you need to process that data in its entirety, then moving that data into a database will never allow you to process it faster.

        That is the assertion I made. That is the only assertion I made with regard to databases.

        Unless you can use some (fairly simple, so that it can be encapsulated into an SQL query) criteria to reduce the volume of the data that the application needs to process, moving the data into a DB will not help.

        No matter how you cut it, switch it around and mix it up. For any given volume of data that an application needs to process, reading that volume of data from a flat file will always be quicker than retrieving it from a DB. Full stop.

        No amount of what-if scenarios will change that nor correct any misassertion I didn't make.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
