does the time consumption of perl's seek is related with file size, or it's a constant time operation?
If the file fits in the OS cache and you have already read it on the preprocessing stage, seek is going to be a very cheap operation as all the information will be available from fast RAM.
When the file doesn't fit on the cache, randomly accessing it would be several orders of magnitude slower (per operation). In those cases, it is usually better to revert to an algorithm that uses a linear access pattern to the data even if in paper it is worse .
A classical example is removing duplicates from some data using a hash to detect duplicates (O(N)) or using sort to make duplicates contiguous and trivial to detect (O(NlogN)). When the data becomes too big to fit in the available RAM, the random data access pattern on the algorithm using the hash degrades its performance to the point of making it useless.
Update: Some code (Linux only, because of the non-portable disk cache cleaning stuff):
#!/usr/bin/perl;
use strict;
use warnings;
use autodie;
use Benchmark qw(timethis);
use List::Util qw(shuffle);
my $block_size = 8 * 1024; # 8 KB
my $size = 128 * 1024 * 1024; # 128 MB
my $blocks = $size / $block_size;
my $fn = '/tmp/seekme.pl';
warn "file size: $size, block size: $block_size, blocks: $blocks\n";
warn "writing file...\n";
my $kb = '1' x $block_size;
open my $fh, ">", $fn;
print $fh $kb for (1..$blocks);
close $fh;
warn "file written\n";
sub clean_cache {
warn "cleaning disk cache\n";
system "sync ; sudo echo 3 > /proc/sys/vm/drop_caches";
warn "disk cache clean\n"
}
my @order = shuffle 0..($blocks -1);
my $buf;
clean_cache();
warn "timing preload and random seek\n";
timethis 1 => sub {
open my $fh, '<', $fn;
for (@order) {
read $fh, $buf, $block_size;
}
for (@order) {
seek($fh, $_ * $block_size, 0);
read $fh, $buf, $block_size;
}
};
warn "timing random seek with preloaded disk cache\n";
timethis 1 => sub {
open my $fh, '<', $fn;
for (@order) {
seek($fh, $_ * $block_size, 0);
read $fh, $buf, $block_size;
}
};
clean_cache();
warn "timing random seek with empty disk cache\n";
timethis 1 => sub {
open my $fh, '<', $fn;
for (@order) {
seek($fh, $_ * $block_size, 0);
read $fh, $buf, $block_size;
}
};
That on my computer...
file size: 134217728, block size: 8192, blocks: 16384
writing file...
file written
cleaning disk cache
disk cache clean
timing preload and random seek
timethis 1: 6 wallclock secs ( 1.34 usr + 0.36 sys = 1.70 CPU) @ 0
+.59/s (n=1)
timing random seek with preloaded disk cache
timethis 1: 1 wallclock secs ( 0.88 usr + 0.06 sys = 0.94 CPU) @ 1
+.06/s (n=1)
cleaning disk cache
disk cache clean
timing random seek with empty disk cache
timethis 1: 93 wallclock secs ( 1.29 usr + 1.15 sys = 2.44 CPU) @ 0
+.41/s (n=1)