PerlMonks  

Optimising processing for large data files.

by BrowserUk (Pope)
on Apr 10, 2004 at 06:29 UTC ( #344087=perlmeditation )

There have been several posts recently by people looking to process large data files more efficiently than they can achieve using the default methods and standard paradigms perl provides.

There is a tendency hereabouts to say, "Don't optimise!". Throw bigger hardware at it. Use a database. Use a binary library. Code in C.

  • But if you're already using top-end hardware and it's still too slow, what then?
  • Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program.
  • If there is a binary library available to perform your task, with a perlish interface, that doesn't have a huge learning curve or require you to structure your entire program around its often Byzantine and/or non-intuitive data structures and control mechanisms, then you may be in luck.
  • If you have a C compiler. Know how to code in C. Don't mind spending days (or weeks?) re-inventing complex data structures or finding pre-existing ones and trying to make them work together. Oh, and be sure to take a crash (sic) course in using a symbolic debugger. Not that it will do you that much good because almost all efficient C code makes extensive use of inlining through the use of the macro preprocessor, and by the time you get into the symbolic debugger what you see there bears little or no resemblance to the source code anyway.
  • Of course, you could learn C++...but life's short and a rusty knife across your own throat is more fun:)

Your other option is to look at making the best--i.e. most efficient--use of the facilities that Perl provides in order to speed things up and/or use less memory.

As an example of the sorts of techniques that perl provides for doing this, I'll use a recent post. Notwithstanding the fact that some people recognised the form of the data to be processed and pointed out that there may well be an existing wheel tailor-made for that particular case, the question as asked has generic application beyond that particular specialisation. It is also simply stated and makes for a good example to work from.

    #! perl -slw
    use strict;

    my $ch;
    open (FH, '<', $ARGV[ 0 ]) or die $!;
    until (eof(FH)) {
        $ch = getc(FH);
    }
    close FH;

The task is to process every byte of a very large file in a reasonable amount of time. The example code supplied used fairly naive coding and as a result, for the stated data volume (3GB), I project that it would take ~12 hrs 40 minutes to run on my hardware (2.66 GHz P4). It's possible that a top-of-the-line PC might halve this through sheer (single) processor power and otherwise faster hardware. You might reduce it to a third by multiprocessing, or bigger iron--maybe. But the cost in complexity of the former and the simple cost of the latter would be prohibitive for many situations. So what can we do to speed this up without resorting to that expense?

The first thing to notice is that since 5.6.2?, perl has the smarts built-in to handle Unicode data in files. This is a good thing if you need to process Unicode data, but it exacts a penalty from those who do not. However, this penalty, along with some penalties imposed by the underlying C-runtime for handling formatted ASCII data, can be quickly and easily side-stepped by specifying that you want to process the file as simple bytes. The mechanism is to specify ':raw' when opening the file.

    #! perl -slw
    use strict;

    my $ch;
    open (FH, '< :raw', $ARGV[ 0 ]) or die $!;
    until (eof(FH)) {
        $ch = getc(FH);
    }
    close FH;

This simple change results in a reduction of the (projected) runtime from 12 1/2 hours to something under 6 (~5:45)! Not bad for the sake of typing 4 characters.
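For a handle that is already open, binmode() with no layer argument has the same effect as ':raw'. A minimal sketch, assuming a lexical filehandle and a helper name, count_bytes, that is mine for illustration and not from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count the bytes in a
# file by reading it one getc() at a time through a raw handle.
sub count_bytes {
    my( $path ) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    binmode( $fh );    # no layer argument: same effect as opening with '<:raw'
    my $count = 0;
    $count++ while defined( getc( $fh ) );
    close $fh;
    return $count;
}
```

Because the handle is raw, a CRLF pair is seen as two bytes and is not collapsed to a single newline.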

The next thing I looked at was getc(). Part of the stdio (emulation?), this has overhead built in, for example to allow the eponymous ungetc().

What happens if we bypass this by using sysread(), to get one character at a time?

    #! perl -slw
    use strict;

    my $ch;
    open (FH, '< :raw', $ARGV[ 0 ]) or die $!;
    while ( sysread( FH, $ch, 1 ) ) {
    }
    close FH;

The result is a projected runtime of 1 hour 23 minutes!

Not bad! Down to around 11% of the time with around 20 characters changed from the original program. That's roughly a 9-fold speed-up if you want to look at it that way.

Of course, the original problem required that it was possible to look at each pair of bytes separated by 500 bytes, and so far I haven't tackled this. This will obviously need a buffer of some kind, so rather than reading a single char at a time, I'll read 4096 bytes (a good choice on my OS/file system) into a scalar and then access the bytes. But how to access them? An often used method of splitting a scalar into its bytes is split, so I try that.

    #! perl -slw
    use strict;

    my( $buf, $ch );
    open (FH, '< :raw', $ARGV[ 0 ]) or die $!;
    # Read a 4096-byte block at a time, then split it into single bytes
    while ( sysread( FH, $buf, 4096 ) ) {
        for $ch ( split //, $buf ) {
        }
    }
    close FH;

Ouch! That costs dear. As well as the cost of actually splitting the string, there is also the overhead of building a 4096-element list, storing one byte per element. (A Perl array is a contiguous block of pointers rather than a linked list, but each of those one-byte elements is still a separately allocated scalar.) This is rapidly consumed and then the list is discarded and a new one built. This consumes substantial amounts of memory, and on a file this large, causes the GC collector to run with increasing frequency as the processing progresses. The upshot of this is that we're back to 4 1/2 hours of processing time, and I haven't added code to retain the last 500 chars for the comparison required. So, don't do that:)

Another popular method of separating a scalar into its chars is @chars = $scalar =~ m[(.)]g; but this suffers the same problems as split.

Let's try accessing the bytes using substr.

    #! perl -slw
    use strict;

    my( $read, $buf, $ch );
    open (FH, '< :raw', $ARGV[ 0 ]) or die $!;
    while ( $read = sysread( FH, $buf, 4096 ) ) {
        # Valid offsets run from 0 to $read - 1
        for my $p ( 0 .. $read - 1 ) {
            $ch = substr $buf, $p, 1;
        }
    }
    close FH;

Time taken 31 minutes 40 seconds. Nice. Close to a third of the time we took when processing the file byte-by-byte, despite the fact that the OS is undoubtedly doing some buffering for us under the covers. Making 99.98% fewer system calls (one in place of every 4096) has benefits, even if the OS is caching the same size buffer of data. We've also avoided the need to construct and discard 3/4 of a million 4096-element lists by not splitting the string.
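The relative cost of the three ways of getting at the bytes can be checked on a single buffer with the core Benchmark module. This is only a sketch: the sub names are mine, the buffer is random data standing in for one sysread(), and the figures it prints will differ by perl version and platform. Note that the regex needs the /s modifier here, otherwise '.' skips newline bytes.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# A 4096-byte buffer of pseudo-random bytes standing in for one sysread().
my $buf = join '', map { chr( int( rand( 256 ) ) ) } 1 .. 4096;

# Each sub visits every byte and sums the ordinals, so all three do the
# same work and their results can be compared for equality.
sub via_split  { my $n = 0; $n += ord( $_ ) for split //, $buf; return $n }
sub via_regex  { my $n = 0; $n += ord( $_ ) for $buf =~ m[(.)]gs; return $n }
sub via_substr {
    my $n = 0;
    $n += ord( substr( $buf, $_, 1 ) ) for 0 .. length( $buf ) - 1;
    return $n;
}

# Run each variant 1000 times and print a comparison table.
cmpthese( 1000, {
    split  => \&via_split,
    regex  => \&via_regex,
    substr => \&via_substr,
} );
```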

But, we still haven't retained the 500 character buffer required. Several suggestions were made for how to do this in the thread, mostly to do with pushing and shifting to an array. We've already seen that splitting the string to an array imposes a substantial penalty. You will have to take my word for it that maintaining a buffer of 500 chars by pushing to one end of an array and shifting off the other takes a loooooong time. I can't give figures as even using the 30MB file that I used as the basis for projecting most of the times in this post, I gave up waiting for it after over 3 hours. The original code that I project would have processed a 3GB file in 12 1/2 hours, took less than 8 minutes for 30MB.

Another suggestion mooted was to use a sliding buffer. This basically involves filling a buffer, processing as far as you can, then copying the required residual bytes from the end of the buffer to the beginning, re-filling the remainder of the buffer and processing again. Repeat until done. It's harder to describe than to code.

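    #! perl -slw
    use strict;

    my( $read, $buf, $ch, $pair );
    my $offset = 0;
    open (FH, '< :raw', $ARGV[ 0 ]) or die $!;
    # Each read lands after the bytes carried over from the previous pass
    while ( $read = sysread( FH, $buf, 4096 - $offset, $offset ) ) {
        my $len = $offset + $read;    # bytes now in the buffer
        for my $p ( 500 .. $len - 1 ) {
            $pair = substr( $buf, $p - 500, 1 ) . substr( $buf, $p, 1 );
        }
        # Slide: keep only the last 500 bytes (fewer if the file is tiny)
        $offset = $len < 500 ? $len : 500;
        $buf = substr( $buf, $len - $offset, $offset );
    }
    close FH;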

That's the entire program. The variable $pair iteratively takes on the value of each pair of bytes, separated by 500 chars, throughout the entire input file. And the time taken to process 3GB?
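The same sliding-buffer idea can be wrapped in a sub so that the gap and buffer size are parameters and the result can be checked against a brute-force pass over a small file. A minimal sketch; the name each_pair and its callback interface are assumptions of mine, not anything from the original thread, and the per-pair sub call will cost speed compared with the inline loop:

```perl
use strict;
use warnings;

# Hypothetical helper: call $cb->( $first, $second ) for every pair of
# bytes $gap apart in $path, never holding more than $bufsize bytes.
sub each_pair {
    my( $path, $gap, $bufsize, $cb ) = @_;
    die "bufsize must exceed gap\n" unless $bufsize > $gap;
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    my( $buf, $offset ) = ( '', 0 );
    # New data is appended after the $offset bytes carried over.
    while( my $read = sysread( $fh, $buf, $bufsize - $offset, $offset ) ) {
        my $len = $offset + $read;    # bytes now in the buffer
        for my $p ( $gap .. $len - 1 ) {
            $cb->( substr( $buf, $p - $gap, 1 ), substr( $buf, $p, 1 ) );
        }
        # Carry the trailing $gap bytes (or fewer) over to the next pass.
        $offset = $len < $gap ? $len : $gap;
        $buf = substr( $buf, $len - $offset, $offset );
    }
    close $fh;
    return;
}
```

Under these assumptions, the program above corresponds to a call such as each_pair( $ARGV[ 0 ], 500, 4096, sub { my $pair = $_[ 0 ] . $_[ 1 ] } ).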

A second or two under 31 minutes (*). That's about 4% of the original runtime or, if you prefer, roughly a 24-fold speed-up. And this is retaining the 500 char cache that the original program didn't.

And note. The whole 3GB file has been processed through a single, 4096 byte buffer. The process consumed less than 2MB of memory total. There was no memory growth and the GC never had to run.

So, what price optimisation? You may be able to get close to these results using the original program if you spent $50k on new hardware?

You certainly wouldn't achieve it by putting the data into a DB.

You might even improve the performance by writing the final program in C, but I doubt that a brute force conversion of the original into C would be as quick. Even then, you are then stuck with coding the rest of the program--whatever needs to be done with those pairs of chars--in C also. Devoid of all the nice things perl gives us.

It could be pointed out that the original program was overly naive, but my counter to that is that without knowledge of where the inherent costs arise in a program, ALL programs are naive. And one way to become familiar with the perpetual trade-offs between ease and efficiency is to experiment. Or to put it another way, to optimise.

Besides, optimising is fun!

* Actual time to process a 3GB file: 30:57 minutes. This is within 1% of the value projected from a run of the same code on the 30MB file used elsewhere. Several runs of each version of the code were timed using the 30MB file, and projections confirmed by a single run on a 300MB file for all but the first two which would take longer than I was prepared to wait:)
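The projections are simple linear scaling from the time measured on a small file. A sketch of the arithmetic, assuming 3GB means 3072MB and using an illustrative sample time of 445 seconds for the 30MB file (the post only says "less than 8 minutes"); the sub names are mine:

```perl
use strict;
use warnings;

# Scale a measured time linearly from a sample size to a target size.
# Assumes runtime grows in proportion to file size.
sub project_seconds {
    my( $sample_seconds, $sample_bytes, $target_bytes ) = @_;
    return $sample_seconds * $target_bytes / $sample_bytes;
}

# Format a number of seconds as h:mm:ss.
sub hms {
    my $s = int( $_[ 0 ] + 0.5 );
    return sprintf '%d:%02d:%02d',
        int( $s / 3600 ), int( ( $s % 3600 ) / 60 ), $s % 60;
}

# 445 seconds for 30MB is an assumed figure, chosen for illustration only.
print hms( project_seconds( 445, 30, 3072 ) ), "\n";    # prints 12:39:28
```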


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Edited by BazB: enclosed more of the node in readmore tags.

Re: Optimising processing for large data files.
by Anonymous Monk on Apr 10, 2004 at 10:13 UTC

    BrowserUK++. Just one typo within snippet #4. I assume you meant to sysread() in 4096 bytes before split()ing it into characters.

      Indeed, that was a c&p error. Corrected. Thanks.


      One more small typo: GC collector in the paragraph talking about split. It's a pet peeve of mine, right up there with ATM machine and NIC card (although NI card might sound a little silly).

      Great thread BTW, I'm impressed by the level of detail to which you have pursued your problem.

Re: Optimising processing for large data files.
by tachyon (Chancellor) on Apr 10, 2004 at 13:10 UTC

    Good optimisation. In C you can of course avoid the copy and move overhead of the substr buffer you use and just flip pointers between a pair of buffers to get the sliding window.

    Runtime on a 1GHz laptop was 10 minutes on a 3GB test file. So the benefits of doing it in C are real but perhaps hardly worth the effort unless saving 20 minutes runtime for adding X minutes coding time makes sense.

    $ cat file.c
    #include <stdio.h>

    #define FILENAME "c:\\test.txt"
    #define CHUNK 500

    int main() {
        FILE *f;
        char buf1[CHUNK],buf2[CHUNK],pair[3],*fbuf,*bbuf,*swap;
        int r, i;

        f=fopen(FILENAME,"r");
        if (!f) return 1;

        fbuf=buf1;
        bbuf=buf2;
        r=(int)fread( fbuf, sizeof(char), CHUNK, f );
        if ( !r || r<CHUNK ) return 1;
        pair[2]=0;

        while ( (r=(int)fread( bbuf, sizeof(char), CHUNK, f )) ) {
            for( i=0;i<r;i++ ) {
                pair[0]=fbuf[i];
                pair[1]=bbuf[i];
                /* printf("%s\n",pair);*/
            }
            /* Move old back buffer pointer to front buffer ptr
             * And vice versa. Net effect is to slide buffer->R
             * As we will refil the back buffer with fresh data.
             * Thus we simply pour data from disk to memory with
             * no wasted copying effort.
             */
            swap=fbuf;
            fbuf=bbuf;
            bbuf=swap;
        }
        fclose(f);
        return 0;
    }

    cheers

    tachyon

Re: Optimising processing for large data files.
by tilly (Archbishop) on Apr 10, 2004 at 13:24 UTC
    There is a prevalence hereabouts to say, "Don't optimise!". Throw bigger hardware at it. Use a database. Use a binary library. Code in C.

    I don't dispute that some people say that. But what the most experienced people say is, "Don't prematurely optimize!". The key word being "prematurely".

    Usual estimates are that most programs spend about 80% of their time in 20% of the code, and about 50% of the time in just 4% or so. This suggests that improving a very small fraction of the program after the fact can yield dramatic performance improvements. Those estimates actually date back to studies of FORTRAN programs decades ago. I've never seen anything to suggest that they aren't decent guidelines for Perl programs today. (And they are similar to, for instance, what time and motion studies found about speeding up factory and office work before computers.)

    With that relationship, you want to leave optimization for later. First of all because in practice later never needs to come. Secondly if it does come then, as you demonstrated, there often are big wins available for relatively little work. However the wins are not going to always be in easily predicted places. Therefore the advice to leave optimizing until you both know how much optimizing you need to do and can determine where it needs to be optimized.

    You didn't need profiling tools in this case because the program was small and simple. In a larger one, though, you need to figure out what good places to look at are. After all if 50% of your time is really spent in 4% of your code, there is no point in starting off by putting a lot of energy into a section that you don't know affects that 4%. (Note that when you find your hotspots, sometimes you'll want to look at the hotspot, and sometimes you'll want to look at whether you need to call it where it is being called. People often miss the second.)

    Also improvements are often platform specific. For instance you got huge improvements by using sysread rather than read. However in past discussions people have found that which one wins depends on what operating system you are using. So for someone else, that optimization may be a pessimization instead. Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.
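Whether sysread beats read on a given system is easy to measure rather than assume. A sketch using only core modules; the scratch file, its size, and the sub names are all illustrative assumptions, and the numbers it prints mean nothing beyond the machine it runs on:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );
use File::Temp qw( tempfile );

# Build a scratch file of about 4MB of newline-terminated data.
my( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out;
my $line = ( 'ACGT' x 63 ) . "\n";    # 253 bytes per line
print {$out} $line x 16384;
close $out or die "close: $!";

# Read the whole file in 4096-byte chunks using the supplied reader and
# return the number of bytes seen.
sub slurp_with {
    my( $reader ) = @_;
    open( my $fh, '<:raw', $path ) or die "open: $!";
    my( $buf, $total ) = ( '', 0 );
    while( my $n = $reader->( $fh, $buf ) ) { $total += $n }
    close $fh;
    return $total;
}

sub with_read    { slurp_with( sub { read(    $_[ 0 ], $_[ 1 ], 4096 ) } ) }
sub with_sysread { slurp_with( sub { sysread( $_[ 0 ], $_[ 1 ], 4096 ) } ) }

# Compare the two on this machine; results on another OS may differ.
cmpthese( 20, { read => \&with_read, sysread => \&with_sysread } );
```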

    Now some corrections. You made frequent references in your post to theories that you have about when GC was likely to run. These theories are wrong because Perl is not a garbage collected language.

    Also your slams against databases are unfair to databases. First of all the reasons to use databases often have little or nothing to do with performance. Instead it has to do with things like managing concurrent access and consistency of data from multiple applications on multiple machines. If that is your need, you probably want a database even if it costs huge performance overhead. Rolling your own you'll probably make mistakes. Even if you don't, by the time you've managed to get all of the details sorted out, you'll have recreated a database, only not as well.

    But even on the performance front you're unfair. Sure, databases would not help with this problem. It is also true that most of the time where databases are a performance win, they win because they hand the application only the greatly reduced subset of the raw data that it really needs. But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the database's query engine, which tends to optimize better than most programmers know how to. (Indeed an amusing recurring performance problem with good databases is that programmers think that an index is "obviously better" and go out of their way to force the database to use one when they are better off with a full table scan.)

      All excellent points, except that perl does use garbage collection for memory management, so that Perl is effectively a garbage-collected language. perl uses a refcount-based scheme; perlobj talks about the two-phase garbage collection scheme.

      I am guessing that you know this already. Perhaps it doesn't fit your definition of GC?

      -Mark

        I knew this, and no, I don't consider it true GC. (Circular references cause you to leak memory at runtime.)

        It certainly wasn't a GC algorithm in the sense that BrowserUK was referring to, something that would randomly halt your program while it went through a garbage collection phase.
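The difference being argued over can be seen directly. The sketch below is illustrative only (the Node class and its field names are made up, and the package block syntax needs perl 5.14 or later): a reference-counted object is destroyed the moment its last reference goes, a cycle is not freed when its scope exits, and weakening one link with Scalar::Util restores the normal behaviour.

```perl
use strict;
use warnings;
use Scalar::Util qw( weaken );

my @log;    # names of objects, in the order they are destroyed

package Node {
    sub new     { my( $class, $name ) = @_; return bless { name => $name }, $class }
    sub DESTROY { push @log, $_[ 0 ]{ name } }
}

# 1. Plain reference counting: freed as soon as the last reference goes.
{ my $n = Node->new( 'plain' ); }

# 2. A cycle keeps both counts above zero, so neither is freed here.
{
    my $x = Node->new( 'cycle-x' );
    my $y = Node->new( 'cycle-y' );
    $x->{ peer } = $y;
    $y->{ peer } = $x;
}

# 3. Weakening one link breaks the cycle, so both are freed at scope exit.
{
    my $x = Node->new( 'weak-x' );
    my $y = Node->new( 'weak-y' );
    $x->{ peer } = $y;
    $y->{ peer } = $x;
    weaken( $y->{ peer } );
}

print "destroyed so far: @log\n";
```

The two 'cycle' objects do not appear in the list at this point; they are only cleaned up when the interpreter exits.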

      But what the most experienced people say is, "Don't prematurely optimize!".

      I wasn't addressing those who understand Tony Hoare's assertion, just those that don't.

      Also improvements are often platform specific.

      Of course. Should anyone with access to a.n.other OS and a couple of hours to spare care to run my tests on their OS, I'd be interested to see how much they differ.

      For instance you got huge improvements by using sysread rather than read. However in past discussions...

      Going by the date of the past discussions, I could well imagine that they were before the addition of the extra IO layers that caused the majority of the slowdown post 5.6.x

      Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.

      Anyone who does that hasn't read my post--at least not properly. But I have to say that I have more faith in people than you seem to have. I'm just an average programmer and if I can work these things out, most others can too.

      In the worst case, that someone will use what I describe wrongly and their program will run more slowly. If this happens, they will:

      1. Either notice and correct the problem and be more careful with the application of random optimisations that they read about in the future.

        I'd consider that a long term win.

      2. They won't notice.

        In which case their program probably didn't need to be optimised in the first place, and there is no harm done.

      Now some corrections.

      You only make one correction, and it appears to be a matter of terminology rather than substance. Whether you class perl's reference counting/variable destruction as garbage collection or not. I have demonstrated in the past that, under Win32 at least, when large volumes of lexical variables are created and destroyed, some memory is frequently returned to the OS.

      Also your slams against databases are unfair to databases.

      I didn't make any "slams" against databases. Only against advice that they be used for the wrong purpose, and in the wrong ways.

      But even on the performance front you're unfair. Sure, databases would not help with this problem.

      I was only discussing this problem. By extension, I guess I was also discussing other problems of this nature, which by definition, is that class of large volume data processing problems that would not benefit from the use of a database.

      But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the database's query engine.

      So what you're saying is that instead of moving the data to the program, you move the program to the data. I wholeheartedly agree that is the ideal way to utilise databases--for that class of programs that can be expressed in terms of SQL. But this is just an extreme form of subsetting the data. Potentially to just the results of the processing.

      I see no conflict between that and what I said.


        But what the most experienced people say is, "Don't prematurely optimize!".
        I wasn't addressing those who understand Tony Hoare's assertion, just those that don't.
        If you wanted to address those that don't understand Tony Hoare's assertion, why not make sure that they understand the key part of it - don't optimize until you need to?

        Also improvements are often platform specific.
        Of course. Should anyone with access to a.n.other OS and a couple of hours to spare care to run my tests on their OS, I'd be interested to see how much they differ.
        But it is good to point this out. Far too often people fail to understand that their optimizations do not always optimize for someone else.
        For instance you got huge improvements by using sysread rather than read. However in past discussions...
        Going by the date of the past discussions, I could well imagine that they were before the addition of the extra IO layers that caused the majority of the slowdown post 5.6.x
        That seems likely.

        Someone who reads your post and begins a habit of always using sysread has taken away the wrong lesson.
        Anyone who does that hasn't read my post--at least not properly. But I have to say that I have more faith in people than you seem to have. I'm just an average programmer and if I can work these things out, most others can too.
        What is relevant is not how average you are or aren't, but rather what you have learned. If you want others to learn what you have learned, then it is good to constantly repeat the basic points that you have learned. Such as the value of benchmarking, and the knowledge that what is an optimization in one case might not be in another.
        In the worst case, that someone will use what I describe wrongly and their program will run more slowly. If this happens, they will:

        1. Either notice and correct the problem and be more careful with the application of random optimisations that they read about in the future.

          I'd consider that a long term win.

        2. They won't notice.

          In which case their program probably didn't need to be optimised in the first place, and there is no harm done.

        The time wasted and extra obscurity in their programs counts as harm in my books. So does the time that people like me will take correcting their later public assertions about the value of their optimizations.

        Now some corrections.
        You only make one correction, and it appears to be a matter of terminology rather than substance. Whether you class perl's reference counting/variable destruction as garbage collection or not. I have demonstrated in the past that, under Win32 at least, when large volumes of lexical variables are created and destroyed, some memory is frequently returned to the OS.
        I made 2 corrections. One on when databases are a performance improvement, and one on the presence of a garbage collector in Perl.

        Note that the return of memory to the OS has nothing to do with whether you are using true garbage collection. True garbage collection (to me) refers to having a garbage collector that is able to detect any kind of unused memory, including circular references, and free it for reuse. The use of such a collector is independent of the decision to try to return once used but now free memory to the OS.

        Also your slams against databases are unfair to databases.
        I didn't make any "slams" against databases. Only against advice that they be used for the wrong purpose, and in the wrong ways.
        I think we can simplify this specific disagreement to my just saying, I disagree with you on what are the wrong ways to use a database. We have yet to determine that you'll agree with my disagreement when you understand it.
        But even on the performance front you're unfair. Sure, databases would not help with this problem.
        I was only discussing this problem. By extension, I guess I was also discussing other problems of this nature, which by definition, is that class of large volume data processing problems that would not benefit from the use of a database.
        No, you were not only discussing this problem. You also included a false assertion about when databases can give a performance improvement. That assertion is what I'm trying to correct.

        But databases are often a performance win when they don't reduce how much data you need to fetch because they move processing into the database's query engine.
        So what you're saying is that instead of moving the data to the program, you move the program to the data. I wholeheartedly agree that is the ideal way to utilise databases--for that class of programs that can be expressed in terms of SQL. But this is just an extreme form of subsetting the data. Potentially to just the results of the processing.
        You're correct in the first sentence. This is a case where necessary work is being pushed to the database, and it only works if the work that you need can be conveniently expressed in SQL. But no, this is not just an extreme form of subsetting the data, and the fact that it isn't is why it contradicts your claim that, Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program.

        Consider the case where you have a very large table, and several smaller tables. You need a dataset that is the result of a join between most of the large table, and the other tables. The raw information that you need is the large table and the smaller tables. This is smaller than the result set that you will actually get. But if you have a good query plan, decent indexes, and a properly tuned database, you will often get through your data more quickly by having a Perl program work through the streamed result set and insert the data into another database rather than trying to get the raw data and perform the equivalent of a join in code while balancing all of the data.

        This is not an uncommon type of operation in certain situations. For instance you see it when importing data into a data warehouse. And it is one where pushing the work into the database gives you a significant performance increase over manipulating raw data in Perl, even though the database operations increase the amount of data that you are handed.

        I see no conflict between that and what I said.
        Hopefully you can see the conflict now that I've tried to clarify by making it more explicit when you might win despite receiving more data from the database than just the raw data set.
Re: Optimising processing for large data files.
by Vautrin (Hermit) on Apr 10, 2004 at 16:35 UTC
    Databases are never quicker unless you can use some fairly simplistic criteria to make wholesale reductions in the volume of the data that you need to process within your application program.

    I recommend using databases to people not because of any kind of performance gain you get from the "database magic bullet", but because there's a lot of programming work you can cut out of the picture.

    For instance, if you need a data structure that can persist outside of your program, and be accessed and modified via multiple programs, all you have to do is create a table to represent the data structure, and all of the work is basically done for you. SQL is all you need, and it's done very easily for someone who is familiar with it.

    However, if you're rolling your own, it's going to take a lot of time, you're going to have to take care of a lot of details which are just provided for you using a database, and (if you're on a virtual host without root access) it may be the only thing you have write access to.


Re: Optimising processing for large data files.
by water (Chaplain) on Apr 10, 2004 at 17:03 UTC
    So I wanted to see how a sliding buffer compared to a simple push/shift. Maybe I implemented the sliding buffer in a slow way (suggestions?), but for my (likely poor) implementation, the simple push/shift is blazingly faster than the sliding buffer.

    Did I blow the buffer implementation, or is native push/shift damn efficient? (probably both)

    use strict;
    use Test::More 'no_plan';
    use constant SIZE => 500;
    use Benchmark qw(:all);

    ################################################
    # ROLL1: sliding buffer
    ################################################
    {
        my ( $last, @x );

        sub init1 {
            $last = SIZE - 1;
            @x = (undef) x SIZE;
        }

        sub roll1 {
            my ($val) = @_;
            $last = ( $last + 1 ) % SIZE;
            $x[$last] = $val;
            return \@x[&order];
        }

        sub order {
            my $first = ( $last + 1 ) % SIZE;
            return ( $first .. SIZE - 1, 0 .. $last );
        }

        sub dump1 {
            return join ( '-', @x[&order] );
        }
    }

    ################################################
    # ROLL2: simple push and shift
    ################################################
    {
        my @x;

        sub init2 {
            @x = (undef) x SIZE;
        }

        sub roll2 {
            my ($val) = @_;
            push ( @x, $val );
            shift @x;
            return \@x;
        }

        sub dump2 {
            return join ( '-', @x );
        }
    }

    ################################################
    # ensure both return the same results
    ################################################
    for my $roll ( 5, 19, 786 ) {
        &init1;
        &init2;
        for ( my $i = 0 ; $i < $roll ; $i++ ) {
            my $val = rand;
            roll1($val);
            roll2($val);
        }
        is( dump1, dump2, "same results for $roll rolls" );
    }

    ################################################
    # benchmark them
    ################################################
    timethese(100, {
        'roll1' => sub { init1; roll1($_) for (1..10000);},
        'roll2' => sub { init2; roll2($_) for (1..10000);},
    });

      You're comparing apples and eggs--and dropping the egg basket:)

      I'll try to come up with a better explanation and post it tomorrow.


        Yah, I thought so. My "fast" algorithm is, what, 1000x slower? So obviously I missed something important, and missed it pretty badly. <g>

        I'd welcome any advice -- not that this matters for any pressing real project, but just to improve my skills.

        Thanks, browserUK, look forward to your post....

Re: Optimising processing for large data files. (Apology and explanation)
by BrowserUk (Pope) on Apr 14, 2004 at 09:01 UTC

    This is an update on my investigations regarding using ':raw' and sysread(). It is also an apology to most of the monks. The question I have been pursuing for 3 days is:

    Why did I consistently see such dramatic performance improvements with each of the steps outlined in my root node when (apparently) one or two others that tried this out failed to see similar improvements?

    I have pursued various avenues, including memory allocation/deallocation and mismatched C-runtime libraries/.DLLs, have completely blown away various non-standard compilers and other tools, and finally re-installed perl 5.8.2.

    Despite all this, I still see the same dramatic performance increases on my system from each of the steps outlined.

    Finally, I know why!

    The reason is newlines--or rather, the absence of them.

    When I generated my test data, the 3GB file, I did it using a one-liner that generated random length strings (max. 10,000, as mentioned in the original post upon which I based the idea) of random sequences of A, C, G & T, and kept printing new random lines until the accumulated lengths totalled 3GB+.

    Hardly efficient, but it was a one-off (all smaller test files were simply head -c 10485760 3GB.dat > 10MB.dat etc.), easy to type and I was going to watch a movie while it ran anyway.

    However, it appears that I omitted one thing, my customary -l. Which meant that none of my test files contained a single newline. And that explains everything.

    And so, whilst using ':raw' and sysread do indeed provide some fairly beneficial performance improvements (used correctly), the level of those improvements is far less dramatic than my original post showed, if the file does contain newlines.
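A generator along the following lines would produce test data of the kind described, with the newlines either present or absent so that both cases can be timed. This is a sketch under assumptions: the original one-liner is not shown in the thread, so the sub name and its parameters are mine.

```perl
use strict;
use warnings;

# Hypothetical generator (not the original one-liner): write random
# A/C/G/T lines of 1 .. $max_len characters to $path until at least
# $target bytes have been written. Returns the number of bytes written.
sub make_test_file {
    my( $path, $target, $max_len, $newlines ) = @_;
    open( my $fh, '>:raw', $path ) or die "Cannot open $path: $!";
    my @bases = qw( A C G T );
    my $written = 0;
    while( $written < $target ) {
        my $count = 1 + int( rand( $max_len ) );
        my $line  = join '', map { $bases[ int( rand( @bases ) ) ] } 1 .. $count;
        $line .= "\n" if $newlines;    # the step that -l would have supplied
        print {$fh} $line;
        $written += length $line;
    }
    close $fh or die "Cannot close $path: $!";
    return $written;
}
```

Timing each version of the reader against both kinds of file is one way to see how much of the reported gain depends on the data containing no newlines.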

    And so, I apologise to the community of perlmonks for this misinformation.

    My sincerest apologies, BrowserUk.


