http://www.perlmonks.org?node_id=344959


in reply to Optimising processing for large data files.

This is an update on my investigations regarding using ':raw' and sysread(). It is also an apology to most of the monks. The question I have been pursuing for 3 days is:

Why did I consistently see such dramatic performance improvements with each of the steps outlined in my root node, when (apparently) the one or two others who tried this out failed to see similar improvements?

I pursued various avenues, including memory allocation/deallocation and mismatched C-runtime libraries/DLLs, completely blew away various non-standard compilers and other tools, and finally re-installed perl 5.8.2.

Despite all this, I still see the same dramatic performance increases on my system from each of the steps outlined.

Finally, I know why!

The reason is newlines--or rather, the absence of them.

When I generated my test data, the 3GB file, I did it using a one-liner that generated random-length strings (max. 10,000 characters, as mentioned in the original post upon which I based the idea) of random sequences of A, C, G & T, and kept printing new random lines until the accumulated lengths totalled 3GB+.

Hardly efficient, but it was a one-off (all the smaller test files were simply head -c 10485760 3GB.dat > 10MB.dat etc.), easy to type, and I was going to watch a movie while it ran anyway.

However, it appears that I omitted one thing: my customary -l. That meant that none of my test files contained a single newline. And that explains everything.
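The original one-liner is long gone, but it was roughly along the lines of the sketch below (the exact structure and sizes are a reconstruction, not the original command). The only detail that matters here is the -l: with it, every print gets a trailing newline; without it, the whole 3GB file is one enormous newline-free string.

    perl -le'
        ## Keep printing random-length lines of A/C/G/T until we pass 3GB.
        ## Dropping the -l is what produced a file with no newlines at all.
        while ( $total < 3 * 2**30 ) {
            my $line = join "", map { (qw(A C G T))[rand 4] } 1 .. 1 + int rand 10_000;
            $total += length( $line ) + 1;    ## +1 for the newline -l appends
            print $line;
        }
    ' > 3GB.dat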

And so, whilst using ':raw' and sysread (correctly) does indeed provide some fairly beneficial performance improvements, the level of those improvements is far less dramatic than my original post showed if the file does contain newlines.
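For reference, the comparison I have been making is essentially between the two reading styles sketched below. This is only a minimal illustration of the idea, not the benchmark code itself, and the file name and 1MB buffer size are arbitrary choices:

    #! perl -w
    use strict;

    my $file = shift || '3GB.dat';

    ## Style 1: conventional buffered, line-oriented reads through the IO layers.
    open my $fh, '<', $file or die "open '$file': $!";
    my $lineChars = 0;
    while ( my $line = <$fh> ) {
        $lineChars += length $line;
    }
    close $fh;

    ## Style 2: the ':raw' layer plus sysread() in big fixed-size chunks,
    ## bypassing line-splitting (and, on Win32, the crlf translation).
    open my $raw, '<:raw', $file or die "open '$file': $!";
    my( $buf, $rawBytes ) = ( '', 0 );
    while ( my $got = sysread $raw, $buf, 1024 * 1024 ) {
        $rawBytes += $got;
    }
    close $raw;

    print "line reads: $lineChars chars; sysread: $rawBytes bytes\n";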

And so, I apologise to the community of perlmonks for this misinformation.

My sincerest apologies, BrowserUk.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail