http://www.perlmonks.org?node_id=162034

I spent far more time on this than I really cared to, but since I went to a lot of trouble, I figure it's worth sharing the results with you.

I needed to read the last N lines of a file, and wanted a perlish solution, i.e. anything would be better than my @lines = `/usr/bin/tail -$count $file`. I buzzed the chatterbox, received a few suggestions, prodded the search box, and found out some interesting things.

My very first action was to download File::Tail, which certainly has the right name, but as it turns out it is an implementation of tail -f, that is, reading the most recently added lines to an ever-growing file, forever. (Indeed, people wonder how to make it stop).
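
For contrast, here is a minimal sketch of the kind of job File::Tail is actually designed for (the filename is just an example): its read() blocks until another line is appended to the file, so the loop below follows the file forever instead of stopping after the last existing line.

use File::Tail;

# Follow a growing log, tail -f style. read() blocks until a new line
# arrives, so this loop has no natural end.
my $tail = File::Tail->new( name => '/var/adm/messages' );
while ( defined( my $line = $tail->read ) ) {
    print $line;
}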

So...

I wrote my own version of tail to get a feel for the problem (even though the plan in the first place was to avoid having to do that). I believed that a push-and-shift approach, maintaining an array of the last N lines read, would be too inefficient: too much time would be spent juggling things around. So I adopted the approach of having two arrays and alternating between them, throwing away the intermediate lines and, at the end, fetching the last N records. It came out looking like this:

my $limit = 100;
my $file  = 'big.file';

sub tail {
    my $lim = $limit;
    my @alpha;
    my @beta;
    my $current = \@alpha;
    open( IN, $file ) or die "Cannot open $file for input: $!\n";
    while( <IN> ) {
        push @$current, $_;
        if( scalar @$current > $limit ) {
            if( $current == \@alpha ) {
                @beta = ();
                $current = \@beta;
            }
            else {
                @alpha = ();
                $current = \@alpha;
            }
        }
    }
    close IN;
    return @{$current == \@alpha ? \@beta : \@alpha}[-($limit - scalar @$current)..-1],
        @$current;
}

This works reasonably well for small files, but it scales miserably. It also has no error checking to deal with the file containing fewer lines than were asked for -- but that's a minor issue :).

About this time people started to chime in with suggestions, the best one being File::ReadBackwards. I also found the snippet Last N lines from file (tail), but it has comparatively woeful performance, although someone mentioned how to force File::Tail not to block, thereby offering another approach. (update: broquaint suggests Tie::File as another avenue to explore. Maybe later :)
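
For the record, the Tie::File avenue would presumably look something like the sketch below. It's untested and unbenchmarked here, and it assumes the file has at least $limit lines; Tie::File presents the file as an array of lines, so the tail is just a slice off the end.

use Tie::File;

my @lines;
tie @lines, 'Tie::File', $file
    or die "Cannot tie $file: $!\n";

# Each element is one line of the file (minus its newline), so the
# last $limit lines are simply a negative slice.
my @tail = @lines[ -$limit .. -1 ];

untie @lines;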

File::ReadBackwards, like the name says, reads a file starting from the last line and working back to the first. So to get my lines in order, I can either push to an array and then reverse it at the end, or just unshift to an array. The module offers an object-oriented interface and a tie interface, which gives me four approaches to play with, as well as File::Tail, the two from the lastn snippet, and my own.

For small files, the OO File::Backwards approach wins, closely tailed (heh) by the tie interface. My approach is not too shabby, and the File::Tail approach is more than twice as slow as File::Backwards. The lastn snippet approach is an order of magnitude or two slower.

The difference in the File::ReadBackwards tests between using push/reverse or unshift is lost in the noise.

For big files, my approach rapidly hits a brick wall. It starts paying dearly for the cost of lifting the entire file off disk. The rest of the approaches seek to near the end of the file, which relieves the process of a large (and wasted) amount of I/O.

Conclusion

If you want to perform a tail -100 /var/adm/messages in Perl, use a File::ReadBackwards object and save the lines in an array.
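
Boiled down to just the snippet, that amounts to the following (essentially the f_rb_obj_u routine from the benchmark code further down; the filename and count are just examples):

use File::ReadBackwards;

my $file  = '/var/adm/messages';
my $limit = 100;

my $bw = File::ReadBackwards->new( $file )
    or die "can't read $file: $!\n";

# Read backwards from the end, unshifting so the lines end up in
# file order; stop once we have $limit of them.
my @lines;
while ( defined( my $line = $bw->readline ) ) {
    unshift @lines, $line;
    last if @lines >= $limit;
}

print @lines;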

Benchmarks

I performed two different tests, one fetching the last 10 lines of a file containing 100 lines, and a second fetching the last 100 lines of a file containing 721,994 lines (which is what I consider a real-world example). Note in the code that I have added a compile-time constant to actually print out the results of the various approaches. That showed up a problem with File::Tail: it seems to have an off-by-one error. The last line is not read, so the lines are all shifted up by one with respect to the file. That could also be due to the fact that the purpose of File::Tail is being somewhat subverted here, so I didn't really pursue the matter. Also note that the File::Tail example in a response to the lastn snippet is broken, although it was a good enough basis to figure out what to do.

update: oh yeah, there's other weirdness about File::Tail I forgot to mention, and was reminded of on re-reading the code. If I uncomment the $fh->close statement in file_tail, the array @lines contains no lines. It has lines before the statement, but after... none. I really boggled at that one, but couldn't see any obvious errors in my code, so I just dismissed it. I don't have a good explanation.

On a small file

% perl tailbm -30 10 100.lines
Benchmark: running f_rb_obj, f_rb_obj_u, f_rb_tie, f_rb_tie_u, file_tail, grinder, lastn, lastn_getc, each for at least 30 CPU seconds...
  f_rb_obj:  75 wallclock secs (28.61 usr +  2.71 sys = 31.32 CPU) @ 807.15/s (n=25280)
f_rb_obj_u:  70 wallclock secs (28.59 usr +  2.75 sys = 31.34 CPU) @ 806.32/s (n=25270)
  f_rb_tie:  70 wallclock secs (29.45 usr +  2.37 sys = 31.82 CPU) @ 719.96/s (n=22909)
f_rb_tie_u:  74 wallclock secs (29.05 usr +  2.44 sys = 31.49 CPU) @ 729.76/s (n=22980)
 file_tail: 112 wallclock secs (29.71 usr +  2.34 sys = 32.05 CPU) @ 274.26/s (n=8790)
   grinder:  74 wallclock secs (28.38 usr +  2.77 sys = 31.15 CPU) @ 566.52/s (n=17647)
     lastn:  69 wallclock secs (19.78 usr + 11.34 sys = 31.12 CPU) @  22.88/s (n=712)
lastn_getc:  77 wallclock secs (18.88 usr + 13.17 sys = 32.05 CPU) @  19.78/s (n=634)

On a big file

% perl tailbm -30 100 721994.lines
Benchmark: running f_rb_obj, f_rb_obj_u, f_rb_tie, f_rb_tie_u, file_tail, grinder, each for at least 30 CPU seconds...
  f_rb_obj:  54 wallclock secs (30.52 usr +  0.93 sys = 31.45 CPU) @ 147.31/s (n=4633)
f_rb_obj_u:  70 wallclock secs (30.24 usr +  0.83 sys = 31.07 CPU) @ 144.19/s (n=4480)
  f_rb_tie:  69 wallclock secs (30.70 usr +  0.60 sys = 31.30 CPU) @ 121.31/s (n=3797)
f_rb_tie_u:  68 wallclock secs (29.96 usr +  0.53 sys = 30.49 CPU) @ 118.99/s (n=3628)
 file_tail:  76 wallclock secs (31.02 usr +  0.83 sys = 31.85 CPU) @  60.60/s (n=1930)
   grinder: 121 wallclock secs (33.17 usr +  2.10 sys = 35.27 CPU) @   0.09/s (n=3)
            (warning: too few iterations for a reliable count)
     lastn:  40 wallclock secs (20.15 usr + 10.80 sys = 30.95 CPU) @  19.48/s (n=603)
lastn_getc:  42 wallclock secs (18.05 usr + 13.81 sys = 31.86 CPU) @  17.70/s (n=564)

Code

#! /usr/bin/perl -w

use strict;
use Benchmark;
use File::ReadBackwards;
use File::Tail;

use constant VERIFY => 0;

my $count = shift or die "Benchmark count not specified, try 1000 (iters) or -10 (CPU secs).\n";
my $limit = shift or die "Number of lines not specified, try 100.\n";
my $file  = shift or die "No filename specified on command-line.\n";

my @result;

sub f_rb_obj {
    my $lim = $limit;
    my $bw = File::ReadBackwards->new( $file ) or die "can't read $file: $!\n";
    my $line;
    my @lines;
    while( defined( my $line = $bw->readline ) ) {
        push @lines, $line;
        last if --$lim <= 0;
    }
    reverse @lines;
}

sub f_rb_tie {
    my $lim = $limit;
    tie *BW, 'File::ReadBackwards', $file or die "can't read $file: $!\n";
    my @lines;
    while( <BW> ) {
        push @lines, $_;
        last if --$lim <= 0;
    }
    reverse @lines;
}

sub f_rb_obj_u {
    my $lim = $limit;
    my $bw = File::ReadBackwards->new( $file ) or die "can't read $file: $!\n";
    my $line;
    my @lines;
    while( defined( my $line = $bw->readline ) ) {
        unshift @lines, $line;
        last if --$lim <= 0;
    }
    @lines;
}

sub f_rb_tie_u {
    my $lim = $limit;
    tie *BW, 'File::ReadBackwards', $file or die "can't read $file: $!\n";
    my @lines;
    while( <BW> ) {
        unshift @lines, $_;
        last if --$lim <= 0;
    }
    @lines;
}

sub grinder {
    my $lim = $limit;
    my @alpha;
    my @beta;
    my $current = \@alpha;
    open( IN, $file ) or die "Cannot open $file for input: $!\n";
    while( <IN> ) {
        push @$current, $_;
        if( scalar @$current > $limit ) {
            if( $current == \@alpha ) {
                @beta = ();
                $current = \@beta;
            }
            else {
                @alpha = ();
                $current = \@alpha;
            }
        }
    }
    close IN;
    return @{$current == \@alpha ? \@beta : \@alpha}[-($limit - scalar @$current)..-1],
        @$current;
}

sub lastn {
    my $lines = $limit;
    my $fh;
    if( !open($fh, $file) ) {
        print "Can't open $file: $!";
        return;
    }
    binmode($fh);
    sysseek($fh, 0, 2);    # Seek to end
    my $nlcount = 0;
    while( $nlcount < $lines ) {
        last unless sysseek($fh, -1, 1);
        sysread($fh, $_, 1, 0) || die;
        $nlcount++ if $_ eq "\n";
        last if $nlcount == $lines;
        last unless sysseek($fh, -1, 1);
    }
    seek($fh, sysseek($fh, 0, 1), 0) || warn;
    my @lines = <$fh>;
    close $fh;
    @lines;
}

sub lastn_getc {
    my $lines = $limit;
    my $fh;
    if( !open($fh, $file) ) {
        print "Can't open $file: $!";
        return;
    }
    binmode($fh);
    seek($fh, 0, 2);    # Seek to end
    my $nlcount = 0;
    while( $nlcount < $lines ) {
        last unless seek($fh, -1, 1);
        $_ = getc($fh);
        die unless defined $_;
        $nlcount++ if $_ eq "\n";
        last if $nlcount == $lines;
        last unless seek($fh, -1, 1);
    }
    my @lines = <$fh>;
    close $fh;
    @lines;
}

sub file_tail {
    my $fh = File::Tail->new( name => $file, tail => $limit );
    if( !defined $fh ) {
        die "Could not create File::Tail object on $file: $!\n";
    }
    $fh->nowait(1);
    my @lines;
    local $" = "";
    while( defined( my $line = $fh->read() ) ) {
        last unless $line;
        push @lines, $line;
    }
    # $fh->close;
    @lines;
}

if( VERIFY ) {
    local $" = "";
    for my $test ( qw/f_rb_obj f_rb_tie grinder lastn lastn_getc file_tail/ ) {
        warn "$test\n", eval( "$test()" ), "\n";
    }
    exit;
}

timethese( $count, {
    'f_rb_obj'   => \&f_rb_obj,
    'f_rb_tie'   => \&f_rb_tie,
    'f_rb_obj_u' => \&f_rb_obj_u,
    'f_rb_tie_u' => \&f_rb_tie_u,
    'grinder'    => \&grinder,
    'file_tail'  => \&file_tail,
    'lastn'      => \&lastn,
    'lastn_getc' => \&lastn_getc,
});

__END__

Hmm, it always happens. Just before I was about to stumbit, I noticed that I'm not closing the File::Backwards objects. Well, I'm not going to rerun the benchmarks all over again, and anyway, the file handles are silently destroyed at the end of the routine, so all is well.

update: it's File::ReadBackwards, not File::Backwards. The code was right, the prose was wrong. Just another case of when the code and the documentation differ, trust the code.


print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'