
When you originally asked this question at speed up one-line "sort|uniq -c" perl code, you said that you only wanted the 10th field from an unspecified maximum number. In that case, using a regex to isolate that field alone, rather than splitting them all out and then discarding all but one, was an obvious way to save some cycles. Using the sliding buffer saved some more, for an overall speed-up of about 4x in my tests.
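To illustrate that first point, here is a minimal sketch of pulling out only the 10th field with a regex versus splitting the whole record. The sample record and the variable names are made up for the illustration; they are not from your data or from the earlier thread.

```perl
#! perl -w
use strict;

# Hypothetical sample record: 12 pipe-delimited fields, f0 .. f11.
my $line = join( '|', map { "f$_" } 0 .. 11 ) . "\n";

# Skip the first nine fields, then capture the tenth (index 9).
my $re_tenth = qr/^(?:[^|]*\|){9}([^|\n]*)/;

my ($by_regex) = $line =~ $re_tenth;

# The split approach builds every field before keeping just one.
my $by_split = ( split /\|/, $line )[9];

print "regex: $by_regex  split: $by_split\n";
```

Both give the same field; the regex simply avoids creating a list of fields that is thrown away immediately. How much that saves will depend on the record length and on how far into the record the wanted field sits.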

You now appear to be wanting fields (0,3,4,9,17,18,31), which means that the benefits of using a regex over split are considerably lessened--though there is still some saving. Using this in conjunction with the sliding buffer (two variations on the theme, with sysread consistently giving the better results) and a buffer size of 64k seems to achieve the best results on my machine, with the main benefit seemingly coming from bypassing stdio.
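For anyone unfamiliar with the sliding buffer idea, this is a stripped-down sketch of it, separate from the benchmark. The temp file, the five short records and the deliberately tiny 16-byte chunk size are all assumptions chosen so that records straddle reads; a real run would use something like the 64k mentioned above.

```perl
#! perl -w
use strict;
use File::Temp qw[tempfile];

# Hypothetical test data: five short pipe-delimited records in a temp file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "a$_|b$_|c$_\n" for 1 .. 5;
close $tmp or die "Couldn't close $path: $!";

open my $in, '<', $path or die "Couldn't open $path: $!";
binmode $in;

my $chunk  = 16;    # Deliberately tiny so that records straddle reads.
my $buffer = '';
my @records;

# Each sysread appends to whatever partial record was left over last time.
while (sysread($in, $buffer, $chunk, length $buffer)) {
    my $end = rindex($buffer, "\n");
    next if $end < 0;                       # No complete record yet.

    # Process every complete record currently in the buffer.
    push @records, split /\n/, substr($buffer, 0, $end);

    # Slide: keep only the incomplete tail for the next read.
    $buffer = substr($buffer, $end + 1);
}
close $in;

print scalar(@records), " records, last: $records[-1]\n";
```

The fourth argument to sysread is what makes this work: new data lands after the leftover tail, so a record cut in half by one read is completed by the next.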

The overall saving on my machine comes out at around 50%. Whether this will get you close to your target of 2 minutes, you will have to see once you actually do something meaningful with the fields inside the loop. If not, I think you may need quicker hardware.

The file used in the tests below is 75MB: 500_000 records x 31 pipe-delimited fields of randomly generated data.

C:\test>215578 -BUFN=16 pipes.dat
1 trial of sysread (160.090s total)
1 trial of sysread2 (182.623s total)
1 trial of stdio (324.950s total)

sysread:20000 sysread2:20000 stdio:20000

Whilst I've tried various buffer sizes, the tests are hardly definitive, and you may well get better results with a different size (bigger or smaller) on your machine. Good luck.

Code

#! perl -slw
use strict;
use Benchmark::Timer;
use vars qw[$BUFN];

$BUFN ||= 1; # Buffer size in steps of 4096 bytes.

my $t=Benchmark::Timer->new();
my $re_fields = qr[
    ([^|]+)\|             # Capture field 0
    (?:[^|]+\|){2}        # Skip 1 .. 2
    ([^|]+)\|            # Capture 3 .. 4
    ([^|]+)\|
    (?:[^|]+\|){4}        # Skip 5..8
    ([^|]+)\|            # Capture 9
    (?:[^|]+\|){7}        # Skip 10..16
    ([^|]+)\|            # Capture 17..18
    ([^|]+)\|
    (?:[^|]+\|){11}       # Skip 19..29
    ([^|\n]+)             # Capture 30 (the last of the 31 fields)
    \n                    # Discard the newline
]ox;

my $buffer = "";

$t->start('sysread');
open my $in, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";
my %h1;

while (sysread($in, $buffer, 4096*$BUFN, length $buffer)) {
    while($buffer =~ m[$re_fields]mog ) { # << Regex.
        $h1{$_}++ for ($1,$2,$3,$4,$5,$6,$7);
    }
    $buffer = substr($buffer, 1+rindex($buffer, "\n"));
}
close $in;

$t->stop('sysread');

$t->start('sysread2');
open $in, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";
my %h2;

while (sysread($in, $buffer, 4096*$BUFN, length $buffer)) {
    my ($p1, $p2) = (0) x 2;
    while ($p2 = 1 + index($buffer, "\n", $p1)) {
        $h2{$_}++ for substr($buffer, $p1, $p2 - $p1) =~ $re_fields; # << Regex.
        $p1 = $p2;
    }
    $buffer = substr($buffer, 1+rindex($buffer, "\n"));
}
close $in;
$t->stop('sysread2');

my %h3;

$t->start('stdio');    # Tag must match the $t->stop('stdio') below.
open $in, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";

while(<$in>){
    my @fields = split'[\|\n]';
    $h3{$_}++ for @fields[0,3,4,9,17,18,30];
}
close $in;
$t->stop('stdio');

$t->report;

print 'sysread:', scalar keys %h1, ' sysread2:', scalar keys %h2, ' stdio:', scalar keys %h3;

$h1{$_} ne $h3{$_} and print "$h1{$_} ne $h3{$_}" for keys %h3;
$h2{$_} ne $h3{$_} and print "$h2{$_} ne $h3{$_}" for keys %h3;
__END__
C:\test>215578 -BUFN=16 pipes.dat
1 trial of sysread (160.090s total)
1 trial of sysread2 (182.623s total)
1 trial of stdio (324.950s total)

sysread:20000 sysread2:20000 stdio:20000

Examine what is said, not who speaks.

1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.

In reply to Re: split and sysread() by BrowserUk
in thread split and sysread() by relaxed137
