|Perl Monk, Perl Meditation|
Optimising processing for large data files.by BrowserUk (Pope)
|on Apr 10, 2004 at 06:29 UTC||Need Help??|
There have been several posts recently by people looking to process large data files more efficiently than they can achieved using the default methods and standard paradigms perl provides.
There is a prevalence hereabouts to say, "Don't optimise!". Throw bigger hardware at it. Use a database. Use a binary library. Code in C.
Your other option is to look at making best--eg. most efficient--use of the facilities that Perl provides in order to speed things up and/or use less memory.
As an example of the sorts of techniques that perl provides for doing this, I'll use a recent post as an example. Not withstanding the fact that some people recognised the form of the data to be processed and pointed out that there may well be an existing wheel tailor-made for that particular case, the question as asked has generic application beyond that particular specialisation. It is also simply stated and makes for a good example to work from.
The task is to process every byte of a very large file in a reasonable amount of time. The example code supplied used fairly naive coding and as a result, for the stated data volume (3GB) I project that it would take ~12 hrs 40 minutes to run on my hardware (2.66 GHz P4). It's possible that using a top-of-the-line PC that this might be reduced by half through shear (single) processor power and otherwise faster hardware. You might reduce it to a third by multiprocessing, or bigger iron--maybe. But the cost in complexity of the former and the simple cost of the latter would be prohibitive for many situations. So what can we do to speed this up with out resort to that expense.
The first thing to notice is that since 5.6.2?, perl has the smarts built-in to handle unicode data in files. This is a good thing if you need to process unicode data, but it extracts a penalty for those who do not. However, this penalty, along with some penalties imposed by the underlying C-runtime for handling formatted ASCII data can be quickly and easily side-stepped by specifying that you want to process the file as simple bytes. The mechanism is to specify ':raw' when opening the file.
This simple change results in a reduction of the (projected) runtime from 12 1/2 hours to something under 6 (~5:45)! Not bad for the sake of typing 4 characters.
The next thing I looked at was getch(). Part of the stdio (emulation?), this has overhead built in. For example, to allow the eponymous ungetch().
What happens if we bypass this by using sysread(), to get one character at a time?
The result is a projected runtime of 1 hour 23 minutes!
Not bad! down to 10% of the time with around 20 characters changed from the original program. That's a 900+% improvement of you want to look at it that way.
Of course, the original problem required that it was possible to look at each pair of bytes separated by 500 bytes. and so far, I haven't tackled this. This will obviously need a buffer of some kind, so rather than reading a single char at a time, I'll read 4096 bytes (a good choice on my OS/file system) into a scalar and then access the bytes. But how to access them? An often used method of splitting a scalar into its bytes is split, so I try that.
Ouch! That costs dear. As well as the cost of actually splitting the string, there is also the overhead of building a 4000 element array (which in perl is a linked list), storing one byte per element. This is rapidly consumed and then the array is discarded and a new one built. There is also an intermediate list built by split. This consumes substantial amounts of memory, and on a file this large, causes the GC collector to run with increasing frequency as the processing progresses. The upshot of this is that we're back to 4 1/2 hours of processing time, and I haven't added code to retain the last 500 chars for the comparison required. So, don't do that:)
Another popular method of separating a scalar into it chars is @chars = $scalar =~ m[(.)]g; but this suffers the same problems as split.
Let's try accessing the bytes using substr.
Time taken 31 minutes 40 seconds. Nice. Close to a third of the time we took when processing the file byte-by-byte, despite the fact that the OS is undoubtedly doing some buffering for us under the covers. Making 98% (1/4096) less system calls has benefits, even if the OS is caching the same size buffer of data. We've also avoided the need to construct and garbage collect 3/4 of a million, 4000-element linked lists by not splitting the string.
But, we still haven't retained the 500 character buffer required. Several suggestions where made for how to do this is the thread, mostly to do with pushing and shifting to an array. We've already seen that splitting the string to an array imposes a substantial penalty. You will have to take my word for it that maintaining a buffer of 500 chars by pushing to one end of an array and shifting off the other takes a loooooong time. I can't give figures as even using the 30MB file that I used as the basis for projecting most of the times in this post, I gave up waiting for it after over 3 hours. The original code that I project would have processed a 3GB file in 12 1/2 hours, took less than 8 minutes for 30MB.
Another suggestion muted was to use a sliding buffer. This basically involves filling a buffer, processing as far as you can, then copying the required residual bytes from the end of the buffer to the beginning, re-filling the remainder of the buffer and processing again. Repeat until done. It's harder to describe than code.
That's the entire program. The variable $pair iteratively takes on the value of each pair of bytes, separated by 500 chars, throughout the entire input file. And the time taken to process 3GB?
A second or two under 31 minutes (*). That's less than 4% of the original runtime or if you prefer, 2500+% improvement. And this is retaining the 500 char cache that the original program didn't.
And note. The whole 3GB file has been processed through a single, 4096 byte buffer. The process consumed less than 2MB of memory total. There was no memory growth and the GC never had to run.
So, what price optimisation? You may be able to get close to these results using the original program if you spent $50k on new hardware?
You certainly wouldn't achieve it by putting the data into a DB.
You might even improve the performance by writing the final program in C, but I doubt that a brute force conversion of the original into C would be as quick. Even then, you are then stuck with coding the rest of the program--whatever needs to be done with those pairs of chars--in C also. Devoid of all the nice things perl gives us.
It could be pointed out that the original program was overly naive, but my counter to that is that without knowledge of where the inherent costs arise in a program, ALL programs are naive. And one way to become familiar with the perpetual trade-offs between ease and efficiency is to experiment. Or to put it another way, to optimise.
Besides, optimising is fun!
* Actual time to process a 3GB file: 30:57 minutes. This is within 1% of the value projected from a run of the same code on the 30MB file used elsewhere. Several runs of each version of the code were timed using the 30MB file, and projections confirmed by a single run on a 300MB file for all but the first two which would take longer than I was prepared to wait:)
Examine what is said, not who speaks."Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Edited by BazB: enclosed more of the node in readmore tags.