Re: Efficiency of seek

by salva (Canon)
on Aug 31, 2011 at 07:52 UTC


in reply to Efficiency of seek

is the time consumption of perl's seek related to file size, or is it a constant-time operation?
If the file fits in the OS cache and you have already read it during the preprocessing stage, seek is going to be a very cheap operation, as all the information will be available from fast RAM.

When the file doesn't fit in the cache, randomly accessing it will be several orders of magnitude slower (per operation). In those cases, it is usually better to switch to an algorithm that uses a linear access pattern over the data, even if on paper it is worse.

A classical example is removing duplicates from some data: either use a hash to detect duplicates (O(N)), or use sort to make duplicates contiguous and trivial to detect (O(N log N)). Once the data becomes too big to fit in the available RAM, the random access pattern of the hash-based algorithm degrades its performance to the point of making it useless, while the sort-based one keeps streaming the data sequentially.
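As a rough sketch of the two approaches (the sub names and the delegation to the system sort(1) for the external, sequential-access variant are just illustrative choices, not from the original discussion):

#!/usr/bin/perl
use strict;
use warnings;

# Hash-based dedup: O(N) lookups, but the hash's random memory access
# pattern starts thrashing once the key set no longer fits in RAM.
sub dedup_hash {
    my ($in, $out) = @_;
    open my $ifh, '<', $in  or die "open $in: $!";
    open my $ofh, '>', $out or die "open $out: $!";
    my %seen;
    while (my $line = <$ifh>) {
        print $ofh $line unless $seen{$line}++;
    }
}

# Sort-based dedup: O(N log N) on paper, but an external sort reads and
# writes the data sequentially, so it keeps working when the data is much
# bigger than RAM (here simply delegated to the system sort(1) with -u).
sub dedup_sort {
    my ($in, $out) = @_;
    system("sort -u '$in' > '$out'") == 0 or die "sort failed";
}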

Update: Some code (Linux only, because of the non-portable disk cache cleaning stuff):

#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Benchmark qw(timethis);
use List::Util qw(shuffle);

my $block_size = 8 * 1024;          # 8 KB
my $size = 128 * 1024 * 1024;       # 128 MB
my $blocks = $size / $block_size;
my $fn = '/tmp/seekme.pl';

warn "file size: $size, block size: $block_size, blocks: $blocks\n";

warn "writing file...\n";
my $kb = '1' x $block_size;         # one block of data (8 KB)
open my $fh, ">", $fn;
print $fh $kb for (1..$blocks);
close $fh;
warn "file written\n";

sub clean_cache {
    warn "cleaning disk cache\n";
    # the redirection to drop_caches must run with root privileges,
    # so it is wrapped in a root shell rather than prefixed with sudo
    system "sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'";
    warn "disk cache clean\n";
}

my @order = shuffle 0..($blocks - 1);
my $buf;

clean_cache();
warn "timing preload and random seek\n";
timethis 1 => sub {
    open my $fh, '<', $fn;
    for (@order) {
        read $fh, $buf, $block_size;
    }
    for (@order) {
        seek($fh, $_ * $block_size, 0);
        read $fh, $buf, $block_size;
    }
};

warn "timing random seek with preloaded disk cache\n";
timethis 1 => sub {
    open my $fh, '<', $fn;
    for (@order) {
        seek($fh, $_ * $block_size, 0);
        read $fh, $buf, $block_size;
    }
};

clean_cache();
warn "timing random seek with empty disk cache\n";
timethis 1 => sub {
    open my $fh, '<', $fn;
    for (@order) {
        seek($fh, $_ * $block_size, 0);
        read $fh, $buf, $block_size;
    }
};
That, on my computer, produces...
file size: 134217728, block size: 8192, blocks: 16384
writing file...
file written
cleaning disk cache
disk cache clean
timing preload and random seek
timethis 1:  6 wallclock secs ( 1.34 usr +  0.36 sys =  1.70 CPU) @  0.59/s (n=1)
timing random seek with preloaded disk cache
timethis 1:  1 wallclock secs ( 0.88 usr +  0.06 sys =  0.94 CPU) @  1.06/s (n=1)
cleaning disk cache
disk cache clean
timing random seek with empty disk cache
timethis 1: 93 wallclock secs ( 1.29 usr +  1.15 sys =  2.44 CPU) @  0.41/s (n=1)
