
Fastest I/O possible?

by Anonymous Monk
on Aug 23, 2002 at 00:58 UTC ( #192229=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Sometimes I find myself using perl to parse and frob flatfiles that end up getting loaded via bulk loaders (sqlldr, etc) into databases. These files get BIG - like one-line records of up to 3K, and up to 14 million records across a bunch of files.

What's the best way to get the best possible I/O performance out of perl? Up till now I've been doing it the obtuse way...

    foreach my $file (@files) {
        open(FILE, $file) or die "Nya, nya: $!\n";
        while (my $line = <FILE>) {
            # We often use | delimiters, which need escaping in the pattern...
            my @fields = split(/\|/, $line);
            # Do something nifty with the fields...
            print OUTPUT join("|", @fields);
        }
    }

This is one of those situations where if I could save a minuscule amount of time per record, it could potentially shave half an hour off the run time of these monster processing jobs.

What's going on behind the scenes when you read a file one line at a time? Would it be better to read big buffers, (say 100K at a shot) and then go line by line from the buffer until it's exhausted? Is there a module that already does this? How can I optimize the performance of split() in this situation?

I guess this is a classic optimization question - I've got a loop, and it's going to be run millions upon millions of times. Any suggestions on how to make the loop run faster would be greatly appreciated.

Replies are listed 'Best First'.
Re: Fastest I/O possible?
by Aristotle (Chancellor) on Aug 23, 2002 at 02:39 UTC
    You can shave a bit of processing off by slurping a big chunk of data and then splitting at all the included newlines on your own.
    while (read $fh, my $buffer, 128 * 2**10) {
        $buffer .= <$fh>;   # since the last line probably crosses the buffer border
        for (split /\n/, $buffer) {
            # ...
        }
    }
    You can save more processing time by limiting the record split, so that it knows to stop looking even when it hasn't reached the end of the input string:

        my @field = split /\|/, $_, 10;   # each record has 10 fields

    Potentially much more efficient is a fairly complicated approach. Caveat: your input data must not contain any format errors, or it will run completely wild.
    # we assume 10 fields per record again
    while (read $fh, my $buffer, 128 * 2**10) {
        $buffer .= <$fh>;
        # we dump all records' fields in a big pile, in which
        # every 10th element (index 9) contains the last field of one record,
        # plus a newline, plus the first field of the next record
        my @in_field_heap = split /\|/, $buffer;
        while (@in_field_heap) {
            # pull the two glued fields apart
            $in_field_heap[9] =~ /^([^\n]*)\n(.*)/;
            # pull out the current record's fields, incl. the glued one,
            # and reinject the second half of the double field
            my @field = splice @in_field_heap, 0, 10, $2;
            # replace the glued double field by its first half
            $field[9] = $1;
            # ...
        }
    }
    A similar approach comes to mind for your output, but my intuition is entirely undecided on whether it'll run faster or slower.
        # ...
        push @out_field_heap, @field, "\n";
    }
    # single join over the whole batch
    my $out_buffer = join "|", @out_field_heap;
    # but that means we surrounded the newlines with pipes, so fix em
    $out_buffer =~ s/\|\n\|/\n/g;
    print OUTPUT $out_buffer;

    Obviously, optimization for speed can decrease your code's legibility and maintainability fast. Be wary of whether you really need it.

    Disclaimer: I benchmarked none of these. The read $fh, $buffer, $len; $buffer .= <$fh>; idiom is known to be the fastest block-slurping approach, however.

    If that output acceleration idea works, it might well be applicable to the split acceleration as well.

    # 10 fields..
    while (read $fh, my $buffer, 128 * 2**10) {
        $buffer .= <$fh>;
        $buffer =~ s/\n/|/g;    # turn record separators into field separators
        my @in_field_heap = split /\|/, $buffer;
        while (my @field = splice @in_field_heap, 0, 10) {
            # ...
        }
    }
    YMMV. Benchmark thoroughly.
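    A minimal harness for that benchmarking might look like this (a sketch using the core Benchmark module; the sample data and the 10-field assumption are made up for illustration):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # hypothetical in-memory sample: 1000 records of 10 fields each;
        # a real run should read from one of the actual flatfiles
        my $data = join '', map { join('|', map { "f$_" } 1 .. 10) . "\n" } 1 .. 1000;

        cmpthese(-1, {
            plain_split => sub {
                for my $line (split /\n/, $data) {
                    my @field = split /\|/, $line;
                }
            },
            limit_split => sub {
                for my $line (split /\n/, $data) {
                    my @field = split /\|/, $line, 10;   # stop after 10 fields
                }
            },
        });

    cmpthese prints a rates table, so you can see directly whether the split limit pays off on your data shape.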

    Makeshifts last the longest.

Re: Fastest I/O possible?
by dws (Chancellor) on Aug 23, 2002 at 04:19 UTC
    This is one of those situations where if I could save a miniscule amount of time per record, it could potentially shave a half an hour off of the run time of these monster processing jobs.

    If you're running this off of Win32, you can save noticeable time by periodically defragging your drives.

    Regardless of the OS, you can save substantial time if the OUTPUT file you're writing is on a different physical drive than the datafiles you're reading. It takes a lot of time (relatively speaking) to move disk heads across the disk to do "read a bit here, write a bit there" operations. If you can rig things so that drive heads move relatively small amounts (e.g., from track to track) while reading or writing, you can win big.

    If you have to run everything off of one drive, then consider buffering your writes to OUTPUT. Perl's buffering will wait until a disk block is full before writing, but you can increase the effective buffer size by doing something like the following in your loop.

    push @buffer, join("|", @fields) . "\n";
    if (--$fuse == 0) {
        print OUTPUT @buffer;
        @buffer = ();
        $fuse = $LINES_TO_BUFFER;
    }
    Set $LINES_TO_BUFFER to something pretty big (10000 might be a good starting point), and be sure to empty the buffer at the end of the loop.
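    A self-contained sketch of that loop with the final flush included (the in-memory filehandles and the round-trip check are my additions so the example runs on its own; real code would open actual files):

        use strict;
        use warnings;

        # hypothetical stand-ins for the real FILE and OUTPUT handles
        my $in  = join '', map { "a|b|c$_\n" } 1 .. 25;
        my $out = '';
        open my $FILE,   '<', \$in  or die $!;
        open my $OUTPUT, '>', \$out or die $!;

        my $LINES_TO_BUFFER = 10;    # tune this; 10_000 is a saner real-world value
        my $fuse   = $LINES_TO_BUFFER;
        my @buffer = ();

        while (my $line = <$FILE>) {
            my @fields = split /\|/, $line;
            push @buffer, join('|', @fields);
            if (--$fuse == 0) {
                print $OUTPUT @buffer;
                @buffer = ();
                $fuse   = $LINES_TO_BUFFER;
            }
        }
        print $OUTPUT @buffer if @buffer;   # flush the leftovers at the end
        close $OUTPUT;

        print $out eq $in ? "round-trip ok\n" : "mismatch\n";

    The final print outside the loop is the "empty the buffer at the end" step; without it the last partial batch of lines would be silently dropped.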

Re: Fastest I/O possible?
by broquaint (Abbot) on Aug 23, 2002 at 01:38 UTC
    If you are going to be processing megabytes of data or more, then it is probably a good idea to buffer chunks of it into RAM, as that is *much* faster than processing it line by line through I/O. If you plan on buffering, then I recommend something like this:
    {
        # NOTE: code is untested
        my @chunks = ();
        local $/ = \102400;     # read in ~100K blocks
        while (<$fh>) {
            my $chunk   = $_;
            my $last_rs = rindex($chunk, "\n");
            push @chunks, substr($chunk, 0, $last_rs);
            # rewind past the trailing partial line (whence 1 == SEEK_CUR)
            seek($fh, -(length($chunk) - $last_rs - 1), 1);
        }
    }
    This should read chunks of up to 100k at a time while also making sure each chunk ends on a newline. As far as I'm aware there are no buffering modules like you describe (at least a quick CPAN search doesn't turn anything up), so perhaps it's time to write one?


      I don't think this will help that much. There is buffering going on under the surface with stdio anyway and it should pick a good blocksize based on the device it is reading from.

      The optimum blocksize is returned by stat, as Zaxo pointed out in a recent post. You could use an approach similar to this one (although the seeks are a really bad idea) and choose a multiple of that blocksize if your lines are usually bigger than it is, but you would incur the overhead of breaking the buffer up into lines. You are probably better off leaving that to the lower-level routines.
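      For reference, the blocksize in question is element 11 of Perl's stat list, st_blksize, the OS's preferred I/O block size for that filehandle (the 8K fallback here is my own arbitrary choice, since some filesystems don't report one):

          use strict;
          use warnings;

          open my $fh, '<', $0 or die "open: $!";
          my $blksize = (stat $fh)[11] || 8192;   # st_blksize, with a fallback
          print "preferred block size: $blksize\n";
          close $fh;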

      "My two cents aren't worth a dime.";
Re: Fastest I/O possible?
by BrowserUk (Pope) on Aug 23, 2002 at 05:20 UTC

    I first thought about this idea when I was thinking about John_M._Dlugosz's related question a couple of days ago. I made some attempts to verify its benefits, but ran into another of the 'features' of the OS on my box. This time, it's that copy file /b con refuses to do anything if the file is over 32k! Not in and of itself a bad idea if you've ever subjected your office colleagues to half an hour of random morse code by typing copy *.exe con by mistake on a machine that doesn't have a volume control, but providing no way to override this is unforgivable.

    Anyway, the idea. On most OSes, the system utilities should be pretty well tuned for handling file I/O buffering, choosing read sizes, etc. So rather than complicating your Perl scripts by re-inventing the wheel on buffering over and over, why not let the OS utilities take care of it for you? Something like:

    ...
    local $/ = \nn;     # big chunks, small chunks, whatever
    open FH, 'copy bigbinary con |' or die $!;
    binmode(FH);        # set binmode once, before reading
    while (<FH>) {
        # do whatever.
    }

    In this way, the system utility's knowledge of appropriate buffer sizes etc. is used to handle the I/O efficiently. Additionally, if the action inherently requires large amounts of memory, that memory is returned to the OS when the child process terminates.

    There may also be some performance benefits from having the pipe further buffer the data, especially if the Perl program needs to process the input in small chunks.

    In John_M._Dlugosz's case, he could use (forgive my not knowing the correct syntax) something like open FH, "grep -bcs -f '$delimiter' <bigbinary |" or die $!; to find the offsets of his records and then use seek on the file to go get his data?

    This would be especially useful if the OS in question does something sensible with file sharing for processes requesting read-only access. He could hold his big file open read-only, whilst spawning separate processes to do the searching.

    What's this about a "crooked mitre"? I'm good at woodwork!
(tye)Re: Fastest I/O possible?
by tye (Sage) on Aug 23, 2002 at 17:53 UTC

    Rather than reply to just one of the two replies that suggest doing your own buffering in Perl, I'll reply to that concept here.

    Perl does some unusual buffer management that can actually make Perl's I/O faster than I/O written in C using <stdio.h>. Unfortunately, this buffer management gets into the guts of stdio.h and so can only be done if your stdio.h data structures are pretty "normal".

    The command "perl -V:d_stdstdio" is supposed to report "d_stdstdio='undef'" if Configure couldn't verify that it was safe for Perl to do these optimizations, in which case Perl's I/O is not "fast".

    If Perl's I/O is "fast", then the underlying C code is already doing the "read a big buffer and split it up" and doing it more than twice as fast as you could do it yourself in your Perl script. But if Perl's I/O isn't "fast", then you might be able to get nearly a two-fold speed-up by buffering and splitting yourself.

    I'd previously been told that Linux fails this verification and so Perl's I/O is more than twice as slow as it would be otherwise (nearly four times as slow?). Checking v5.6.1 on Linux, I find d_stdstdio='define', so I'm not sure if Perl has improved on this or something else is going on.

    I'd also heard that Win32 doesn't have fast Perl I/O but it also reports d_stdstdio='define'. So perhaps that test isn't very useful. /:

    So it certainly may be worth trying the buffer-and-split in Perl to see whether it makes your I/O faster or slower. It is rather sad that Perl w/o "fast" I/O doesn't manage to do buffer-and-split in C as fast as can be done in a Perl script. I think that case was considered "rare" and so didn't get much attention.

            - tye (but my friends call me "Tye")
Re: Fastest I/O possible?
by sauoq (Abbot) on Aug 23, 2002 at 01:41 UTC

    Unfortunately, you probably won't squeeze much out of the code you showed us.

    If there is a real opportunity for optimization, it is likely to be in the code that you represent with your comment:

    # Do something nifty with the fields...
    "My two cents aren't worth a dime.";
Re: Fastest I/O possible?
by mordibity (Novice) on Aug 23, 2002 at 14:38 UTC
    Well, this is pretty minuscule, but you did ask -- you could initialize vars outside of your loop (especially for @fields) to avoid destroying/creating a new variable each time; also, I think handing split() a fixed string instead of a regex might be optimized further:
    my ($file, $line, @fields);
    foreach $file (@files) {
        open(FILE, $file) or die "Nya, nya: $!\n";
        while ($line = <FILE>) {
            @fields = split('\|', $line);   # the | needs escaping even here
            print OUTPUT join("|", @fields);
        }
    }
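    One caveat worth spelling out here (an editorial addition, not the poster's): split's first argument is always compiled as a pattern, even when written as a string, so an unescaped '|' is an empty alternation that matches between every character:

        use strict;
        use warnings;

        my @wrong = split '|',  'a|b|c';   # ('a', '|', 'b', '|', 'c')
        my @right = split '\|', 'a|b|c';   # ('a', 'b', 'c')
        print scalar @wrong, " vs ", scalar @right, "\n";   # prints "5 vs 3"

    So the string form buys no special fixed-string handling; the delimiter must be escaped either way.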
Re: Fastest I/O possible?
by fglock (Vicar) on Aug 23, 2002 at 14:29 UTC

    This will save you some fractions of a microsecond :)

    $, = "|";
    print @{[1,2,3]};   # prints 1|2|3

    instead of print join(...);

Node Type: perlquestion [id://192229]
Approved by broquaint