I hope this reply isn't too long, but I'm a fan of disclosing source code to back up results.
The short of it is that I get about 50 MB/sec when processing files line by line in Perl.
Perl file performance is near and dear to my heart, since I routinely work on multi-gigabyte files. I wrote a benchmark program a little while ago to help me to stay with Perl, because performance was tempting me to go to C++ or to bypass Perl's buffering and do it myself (with large sysread calls).
I ran this on a file that was exactly 100 MB long, with lots of small lines, so it's something of a worst case for a naive line-at-a-time approach. This is a UTF-8 file, and I was particularly interested in figuring out why my Unicode file reading was so pitifully slow on a Windows machine.
So my fix was to start specifying ":raw:perlio:utf8" on my file handles, and I got a 6x improvement in speed.
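For anyone who hasn't specified layers explicitly before, here is a minimal sketch of what that looks like in an open call. The helper name count_lines and its default argument are mine, made up for illustration; only the layer strings come from the benchmark. With :utf8 in the stack, length() counts characters rather than bytes, which is why the in-memory size can come out smaller than the on-disk size.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the benchmark): read a file line by line
# through an explicit PerlIO layer stack and count lines and length units.
sub count_lines {
    my ($path, $layers) = @_;
    $layers = ':raw:perlio:utf8' unless defined $layers;

    # The layer string is appended directly to the open mode.
    open(my $fh, "<$layers", $path) or die "Cannot open $path: $!";

    my ($lines, $units) = (0, 0);
    while (my $line = <$fh>) {
        $lines++;
        $units += length($line);   # characters with :utf8, bytes without
    }
    close($fh);
    return ($lines, $units);
}

# Usage: perl count_lines.pl somefile.txt
if (@ARGV) {
    my ($lines, $units) = count_lines($ARGV[0]);
    print "$lines lines, $units characters\n";
}
```

Passing ':raw:perlio' as the second argument gives the byte-oriented variant.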
Line-at-a-time, default layers
100.0 MB in 12.012 sec, 8.3 MB/sec
Line-at-a-time, :raw:perlio
100.0 MB in 1.837 sec, 54.4 MB/sec
Line-at-a-time, :raw:perlio:utf8
100.0 MB in 2.021 sec, 49.5 MB/sec
Line-at-a-time, :win32:perlio
100.0 MB in 1.805 sec, 55.4 MB/sec
Slurp-into-scalar, default layers
100.0 MB in 0.182 sec, 550.1 MB/sec
Slurp-into-scalar, :raw:perlio
100.0 MB in 0.065 sec, 1548.0 MB/sec
Slurp-into-scalar, :raw:perlio:utf8
100000000 on disk, 99999476 in memory
100.0 MB in 0.129 sec, 778.1 MB/sec
Slurp into scalar with sysopen/sysread (single read)
100.0 MB in 0.034 sec, 2976.2 MB/sec
Here's the code. Yes, pretty crude, but it was enough to tell me what I was doing wrong - PerlIO is the win.
The ridiculously large numbers are because the file gets into the Win32 file cache and stays there. That's actually a plus for my benchmark because it shows me where my bottlenecks are. The large sysread numbers are because no postprocessing is being done, e.g. breaking the file up into lines. Since 55 MB/sec is enough for me at the moment, I'm not looking at writing my own buffering/line processing code just yet.
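To make "my own buffering/line processing code" concrete, here is a rough sketch of what it could look like: large sysread calls into a buffer, with lines split off by hand. I have not benchmarked this, so whether it would actually beat PerlIO is an open question. The name each_line_sysread and the 1 MiB chunk size are assumptions for illustration, and the lines it hands back are raw bytes, not decoded UTF-8.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);

# Hypothetical helper: read a file in large sysread chunks and pass each
# line (raw bytes, trailing newline included) to a callback.
sub each_line_sysread {
    my ($path, $callback, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;    # 1 MiB per read; an assumed, untuned value

    sysopen(my $fh, $path, O_RDONLY) or die "Cannot open $path: $!";
    binmode($fh);               # matters on Windows, harmless elsewhere

    my $buf = '';
    while (1) {
        # Append the next chunk after any partial line left in the buffer.
        my $got = sysread($fh, $buf, $chunk_size, length($buf));
        die "read error on $path: $!" unless defined $got;
        last if $got == 0;      # end of file

        # Emit every complete line currently in the buffer.
        my $pos = 0;
        while ((my $nl = index($buf, "\n", $pos)) >= 0) {
            $callback->(substr($buf, $pos, $nl - $pos + 1));
            $pos = $nl + 1;
        }
        substr($buf, 0, $pos, '') if $pos;   # keep only the unfinished tail
    }
    $callback->($buf) if length $buf;        # last line without a newline
    close($fh);
    return;
}
```

Usage would be something like each_line_sysread($file, sub { $bytes += length($_[0]) }), which mirrors the byte counting done in the line-at-a-time benchmarks.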
But it also shows that perlio is imposing a tax compared to pure sysread. So maybe someday I'll look at the PerlIO code and see if there are some useful optimizations that won't pessimize something else.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Fcntl qw();
use Time::HiRes qw();
my $testfile = shift or die "Specify a test file";
die "$testfile doesn't exist" unless -f $testfile;
my @benchmarks = (
\&bench1_native,
# \&bench1_raw,
# \&bench1_mmap,
\&bench1_raw_perlio,
\&bench1_raw_perlio_utf8,
\&bench1_win32,
\&bench2_native,
\&bench2_raw_perlio,
\&bench2_raw_perlio_utf8,
\&bench3
);
foreach my $bench (@benchmarks)
{
my ($secs, $bytes, $lines) = $bench->($testfile);
my $mb = $bytes / 1_000_000;
print sprintf(" %.1f MB in %.3f sec, %.1f MB/sec\n", $mb, $secs, $mb / $secs);
print sprintf(" %1.fK lines, %.2f KL/sec\n", $lines / 1_000, ($lines / 1_000) / $secs) if defined($lines);
}
# ------------------------------------------------------------------
# Read a line at a time with <fh>
sub bench1_native { return bench1_common(@_, "Line-at-a-time, default layers", "<"); }
sub bench1_raw { return bench1_common(@_, "Line-at-a-time, :raw", "<:raw"); }
sub bench1_mmap { return bench1_common(@_, "Line-at-a-time, :raw:mmap", "<:raw:mmap"); }
sub bench1_raw_perlio { return bench1_common(@_, "Line-at-a-time, :raw:perlio", "<:raw:perlio"); }
sub bench1_raw_perlio_utf8 { return bench1_common(@_, "Line-at-a-time, :raw:perlio:utf8", "<:raw:perlio:utf8"); }
sub bench1_win32 { return bench1_common(@_, "Line-at-a-time, :win32:perlio", "<:win32:perlio"); }
sub bench1_common
{
my ($file, $prompt, $discipline) = @_;
print "\n$prompt\n";
open(my $fh, $discipline, $file) or die "Can't open $file: $!";
my $size = -s $fh;
# my $lines = 0;
my $bytes = 0;
my $start_time = Time::HiRes::time();
while (<$fh>)
{
use bytes;
# $lines += 1;
$bytes += length($_);
}
my $end_time = Time::HiRes::time();
close($fh);
print " $size on disk, $bytes in memory\n" if $bytes != $size;
my $secs = $end_time - $start_time;
# return ($secs, $size, $lines);
return ($secs, $size);
}
# ------------------------------------------------------------------
sub bench2_native { return bench2_common(@_, "Slurp-into-scalar, default layers", "<"); }
sub bench2_raw_perlio { return bench2_common(@_, "Slurp-into-scalar, :raw:perlio", "<:raw:perlio"); }
sub bench2_raw_perlio_utf8 { return bench2_common(@_, "Slurp-into-scalar, :raw:perlio:utf8", "<:raw:perlio:utf8"); }
# Read whole file with <fh>
sub bench2_common
{
my ($file, $prompt, $discipline) = @_;
print "\n$prompt\n";
open(my $fh, $discipline, $file) or die "Can't open $file: $!";
my $size = -s $fh;
local $/ = undef;
my $buf = "";
vec($buf, $size, 8) = 0;
my $start_time = Time::HiRes::time();
$buf = <$fh>;
my $end_time = Time::HiRes::time();
close($fh);
my $bufsize = length($buf);
# die "file is $size but got $bufsize" unless $bufsize == $size;
print " $size on disk, $bufsize in memory\n" if $bufsize != $size;
my $secs = $end_time - $start_time;
return ($secs, $size);
}
# ------------------------------------------------------------------
# Read whole file with sysopen/sysread
sub bench3
{
my ($file) = @_;
print "\n";
print "Slurp into scalar with sysopen/sysread (single read)\n";
sysopen(my $fh, $file, Fcntl::O_RDONLY | Fcntl::O_BINARY) or die "Can't open $file: $!";
my $size = -s $fh;
local $/ = undef;
my $buf = "";
vec($buf, $size, 8) = 0;
my $start_time = Time::HiRes::time();
my $count = sysread($fh, $buf, $size);
my $end_time = Time::HiRes::time();
die "read error: $!" unless defined($count) && $count == $size;
close($fh);
my $bufsize = length($buf);
die "file is $size but got $bufsize" unless $bufsize == $size;
my $secs = $end_time - $start_time;
return ($secs, $size);
}