
You can shave a bit of processing off by slurping a big chunk of data and then splitting at all the included newlines on your own.

while(read $fh, my $buffer, 128*2**10) {
    $buffer .= <$fh>; # since last line probably crosses buffer border
    for(split /\n/, $buffer) {
        # ...
    }
}
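
For anyone who wants to try the chunked read without a big file at hand, here is a minimal runnable sketch of the loop above. The helper name each_line_chunked and the tiny chunk size are my own choices for illustration, and the in-memory filehandle just stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): calls $callback once per line,
# reading the handle in chunks of $chunk_size bytes.
sub each_line_chunked {
    my ($fh, $chunk_size, $callback) = @_;
    while (read $fh, my $buffer, $chunk_size) {
        # finish the line that the chunk boundary most likely cut in half;
        # at end of file there is nothing left to append
        my $rest = <$fh>;
        $buffer .= $rest if defined $rest;
        $callback->($_) for split /\n/, $buffer;
    }
    return;
}

# demo on an in-memory filehandle with a deliberately tiny chunk size
my $data = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @lines;
each_line_chunked($fh, 8, sub { push @lines, $_[0] });
close $fh;
print scalar(@lines), " lines, last is '$lines[-1]'\n";
```

One difference from a plain while(<$fh>) loop: split /\n/ drops trailing empty fields, so blank lines at the very end of a chunk silently vanish. If your data can contain blank lines that matter, pass -1 as the limit and discard the final empty element yourself.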

You can save more processing time by limiting the record split, so that it knows to stop looking even when it hasn't reached the end of the input string.

my @field = split /\|/, $_, 10; # each record has 10 fields

Potentially much more efficient is a fairly complicated approach. Caveat: your input data must not contain any format errors, or it'll run completely wild.

# we assume 10 fields per record again
while(read $fh, my $buffer, 128*2**10) {
    $buffer .= <$fh>;

    # we dump all records' fields in a big pile, in which
    # every 9th element contains the last field of one record,
    # plus a newline, plus the first field of the next record
    my @in_field_heap = split /\|/, $buffer;

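    # stop once less than a full record is left; otherwise the empty
    # remainder reinjected after the last record would loop forever
    while(@in_field_heap >= 10) {
        # pull the two glued fields apart
        my ($last, $next) = $in_field_heap[9] =~ /^([^\n]*)\n(.*)/;

        # pull out fields of current record incl the glued one,
        # and reinject the second half of the double field
        my @field = splice @in_field_heap, 0, 10, $next;

        # replace the glued double field by its first half
        $field[9] = $last;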

        # ...
    }
}
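
Since this one runs wild on bad input, a self-contained sketch with a cheap sanity check may be useful for experimenting. The helper name records_from_buffer is hypothetical, and I use 3 fields per record in the demo only to keep it short. The check is partial: it only insists that the element expected to be glued contains exactly one newline, so it catches some malformed records, not all of them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): turns a buffer of complete,
# newline-terminated lines into records (array refs) of $n fields each.
sub records_from_buffer {
    my ($buffer, $n) = @_;
    my @heap = split /\|/, $buffer;
    my @records;
    while (@heap >= $n) {
        # element $n-1 should hold "last field\nfirst field of next record"
        my ($last, $next) = $heap[$n - 1] =~ /^([^\n]*)\n([^\n]*)\z/
            or die "malformed record near field '$heap[0]'\n";
        my @field = splice @heap, 0, $n;
        $field[-1] = $last;
        # after the final record the second half is empty: nothing to reinject
        unshift @heap, $next if length $next;
        push @records, \@field;
    }
    die "leftover fields after last record\n" if @heap;
    return @records;
}

my @records = records_from_buffer("a|b|c\nd|e|f\n", 3);
print join(',', @$_), "\n" for @records;
```

Whether the extra regex anchoring and the die cost more than they are worth is something only a benchmark on your data can tell.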

A similar approach comes to mind for your output, but my intuition is entirely undecided on whether it'll run faster or slower.

        # ...
        push @out_field_heap, @field, "\n";

    }
    # single join over the whole batch
    my $out_buffer = join "|", @out_field_heap;


    # but that means we surrounded the newlines with pipes
    # (the very last one only has a pipe in front), so fix em
    $out_buffer =~ s/\|\n\|?/\n/g;

    print OUTPUT $out_buffer;
}
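
To make the output idea testable in isolation, here is a minimal sketch under the assumption that no field contains a newline. The helper name buffer_from_records is made up for the example. Note the optional trailing pipe in the substitution: the last newline of a batch has a pipe in front of it but none behind it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): serialises a batch of records
# (array refs) into pipe-delimited lines with one join and one substitution.
sub buffer_from_records {
    # every record contributes its fields plus a newline marker
    my @heap = map { (@$_, "\n") } @_;
    my $out = join '|', @heap;
    # join put a pipe before every newline and after all but the last one;
    # strip both variants in a single pass
    $out =~ s/\|\n\|?/\n/g;
    return $out;
}

print buffer_from_records([qw(a b c)], [qw(d e f)]);
```

Empty fields survive this, because the substitution only ever removes the single pipe on each side of a newline.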

Obviously, optimization for speed can decrease your code's legibility and maintainability fast. Be wary of whether you really need it.

Disclaimer: I benchmarked none of these. The read $fh, $buffer, $len; $buffer .= <$fh>; idiom, however, is known to be the fastest block slurping approach.

If that output acceleration idea works, it might well be applicable to the split acceleration as well.

# 10 fields..
while(read $fh, my $buffer, 128*2**10) {
   $buffer .= <$fh>;
   $buffer =~ tr/\n/|/; # same effect as s/\n/\|/g, without the regex engine
   my @in_field_heap = split /\|/, $buffer;

   while(my @field = splice @in_field_heap, 0, 10) {
       # ...
   }
}
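
One thing to watch with the flattened variant: split drops trailing empty fields, so if the last record of a chunk ends in empty fields, the final splice comes up short. A sketch that guards against this, assuming every record has exactly $n fields (records_flattened is a name I made up):

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): flattens newlines into pipes and
# cuts the resulting field list into records of $n fields each.
sub records_flattened {
    my ($buffer, $n) = @_;
    (my $flat = $buffer) =~ tr/\n/|/;
    # limit -1 keeps trailing empty fields
    my @heap = split /\|/, $flat, -1;
    # a final newline became a final pipe, which leaves one extra empty
    # element at the end of the list; drop it
    pop @heap if substr($buffer, -1) eq "\n";
    die "field count is not a multiple of $n\n" if @heap % $n;
    my @records;
    push @records, [ splice @heap, 0, $n ] while @heap;
    return @records;
}

my @records = records_flattened("a|b|\nc||\n", 3);
print scalar(@records), " records of ", scalar(@{ $records[0] }), " fields\n";
```

The modulus check only tells you that something is off somewhere in the chunk, not where, so this is still no substitute for clean input.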

YMMV. Benchmark thoroughly.
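
A starting point for that benchmarking, using the core Benchmark module. Everything about the test data is an assumption (10 fields per line, 20,000 identical lines, held in memory), so treat the numbers it prints as a sanity check of the harness, not as a verdict. Real files on real disks will behave differently.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# synthetic input: 10 pipe-separated fields per line (assumed layout)
my $line = join('|', map { "field$_" } 1 .. 10) . "\n";
my $data = $line x 20_000;

# plain line-by-line reading, with the split limited to 10 fields
sub count_fields_by_line {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $n = 0;
    while (<$fh>) {
        chomp;
        my @field = split /\|/, $_, 10;
        $n += @field;
    }
    return $n;
}

# chunked reading as in the first snippet of this post
sub count_fields_by_chunk {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $n = 0;
    while (read $fh, my $buffer, 128 * 2**10) {
        my $rest = <$fh>;
        $buffer .= $rest if defined $rest;
        for (split /\n/, $buffer) {
            my @field = split /\|/, $_, 10;
            $n += @field;
        }
    }
    return $n;
}

# both variants must agree before their timings mean anything
die "variants disagree\n"
    unless count_fields_by_line() == count_fields_by_chunk();

# run each variant for at least one CPU second and print a comparison table
cmpthese(-1, {
    by_line  => \&count_fields_by_line,
    by_chunk => \&count_fields_by_chunk,
});
```

Swap the in-memory handle for your actual file and add the other variants as further entries in the hash to compare them on equal terms.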

Makeshifts last the longest.

In reply to Re: Fastest I/O possible? by Aristotle
in thread Fastest I/O possible? by Anonymous Monk



	PerlMonks