PerlMonks  

Performing a tail(1) in Perl (reading the last N lines of a file)

by grinder (Bishop)
on Apr 25, 2002 at 16:56 UTC

I spent far more time on this than I really cared to, but since I went to a lot of trouble, I figure it's worth sharing the results with you.

I needed to read the last N lines of a file, and wanted a perlish solution, i.e. anything would be better than my @lines = `/usr/bin/tail -$count $file`. I buzzed the chatterbox, received a few suggestions, prodded the search box, and found out some interesting things.

My very first action was to download File::Tail, which certainly has the right name, but as it turns out it is an implementation of tail -f, that is, reading the most recently added lines to an ever-growing file, forever. (Indeed, people wonder how to make it stop).

So...

I wrote my own version of tail, to get a feel for the problem (even though the plan in the first place was to avoid having to do that)... I believed that a push and shift approach of maintaining an array of the last N lines read would be too inefficient: too much time would be spent juggling things around. So I adopted the approach of having two arrays and alternating between them, throwing away the intermediate lines and then at the end fetching the last N records. It came out looking like this:

my $limit = 100;
my $file  = 'big.file';

sub tail {
    my $lim = $limit;
    my @alpha;
    my @beta;
    my $current = \@alpha;
    open( IN, $file ) or die "Cannot open $file for input: $!\n";
    while( <IN> ) {
        push @$current, $_;
        if( scalar @$current > $limit ) {
            if( $current == \@alpha ) {
                @beta = ();
                $current = \@beta;
            }
            else {
                @alpha = ();
                $current = \@alpha;
            }
        }
    }
    close IN;
    return @{$current == \@alpha ? \@beta : \@alpha}[-($limit - scalar @$current)..-1], @$current;
}

This works reasonably well for small files, but it scales miserably. It also contains no error checking to deal with the file containing fewer lines than the number asked for -- but that's a minor issue :).
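For comparison, here is a minimal sketch of a different forward-reading technique: a fixed-size circular buffer indexed with a modulo counter. It is not the two-array code above, and the name `tail_ring` is made up for illustration. It still reads the whole file, so it shares the same scaling problem, but it does cope with a file that has fewer lines than were asked for.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $n lines of $path by reading forwards
# and overwriting the slots of a fixed-size circular buffer.
sub tail_ring {
    my ($path, $n) = @_;
    return () if $n <= 0;
    open my $in, '<', $path or die "Cannot open $path for input: $!\n";
    my @ring;
    my $seen = 0;                        # total number of lines read so far
    while ( my $line = <$in> ) {
        $ring[ $seen++ % $n ] = $line;   # overwrite the oldest slot
    }
    close $in;
    return @ring if $seen <= $n;         # short file: slots are already in order
    my $start = $seen % $n;              # index of the oldest surviving line
    return @ring[ $start .. $n - 1, 0 .. $start - 1 ];
}
```

Calling `tail_ring('big.file', 100)` returns at most 100 lines in file order, with no shifting of array elements along the way.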

About this time people started to chime in with suggestions, the best one being File::ReadBackwards. I also found the snippet Last N lines from file (tail), but it has comparatively woeful performance, although someone mentioned how to force File::Tail not to block, thereby offering another approach. (update: broquaint suggests Tie::File as another avenue to explore. Maybe later :)

File::ReadBackwards, like the name says, reads files starting from the last line, and then working back to the first. So to get my lines in order, I could either push to an array and then reverse it at the end, or just unshift to an array. The module offers an object-oriented interface, or a tie interface. That gives me four approaches to play with, as well as File::Tail, two from the lastn snippet and my own.

For small files, the OO File::ReadBackwards approach wins, closely tailed (heh) by the tie interface. My approach is not too shabby, and the File::Tail approach is more than twice as slow as File::ReadBackwards. The lastn snippet approach is an order of magnitude or two slower.

The difference in the File::ReadBackwards tests between using push/reverse or unshift is lost in the noise.
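To make the two orderings concrete, here is a small sketch that uses only the core Benchmark module. The list `@newest_first` is a stand-in I made up for lines arriving newest-first from a backwards reader; no file or File::ReadBackwards object is involved, so the timings say nothing about I/O.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Stand-in for lines arriving newest-first, as a backwards reader delivers them.
my @newest_first = map { "line $_\n" } reverse 1 .. 100;

# Collect with push, then reverse once at the end.
sub via_push_reverse {
    my @lines;
    push @lines, $_ for @newest_first;
    return reverse @lines;
}

# Collect with unshift, so the array is always in file order.
sub via_unshift {
    my @lines;
    unshift @lines, $_ for @newest_first;
    return @lines;
}

# A positive count means iterations; a negative count means CPU seconds.
timethese( 2000, {
    push_reverse => \&via_push_reverse,
    unshift      => \&via_unshift,
} );
```

Both subs return the same list in list context, so the choice between them comes down to taste unless a benchmark on your own data says otherwise.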

For big files, my approach rapidly hits a brick wall. It starts paying dearly for the cost of lifting the entire file off disk. The rest of the approaches seek to near the end of the file, which relieves the process of a large (and wasted) amount of I/O.

Conclusion

If you want to perform a tail -100 /var/adm/messages in Perl, use a File::ReadBackwards object and save the lines in an array.

Benchmarks

I performed two different tests, one fetching the last 10 lines of a file containing 100 lines, and a second one fetching the last 100 lines of a file containing 721 994 lines (which is what I consider a real-world example). Note in the code that I have added a compile-time constant to actually print out the results of the various approaches. That showed up a problem with File::Tail, in that it seems to have an off-by-one error. The last line is not read, and so the lines are all shifted up by one with regard to the file. That could also be due to the fact that the purpose of File::Tail is being somewhat subverted, so I didn't really pursue the matter. Also note that the File::Tail example in a response to the lastn snippet is broken, although it was a good enough basis to figure out what to do.

update: oh yeah, another weirdness about File::Tail that I forgot to mention, and was reminded of when re-reading the code. If I uncomment the $fh->close statement in file_tail, the array @lines contains no lines. It has them before the statement, but after... none. I really boggled at that one, but couldn't see any obvious errors in my code, so I just dismissed it. But I don't have a good explanation.

On a small file

% perl tailbm -30 10 100.lines
Benchmark: running f_rb_obj, f_rb_obj_u, f_rb_tie, f_rb_tie_u, file_tail, grinder, lastn, lastn_getc, each for at least 30 CPU seconds...
  f_rb_obj:  75 wallclock secs (28.61 usr +  2.71 sys = 31.32 CPU) @ 807.15/s (n=25280)
f_rb_obj_u:  70 wallclock secs (28.59 usr +  2.75 sys = 31.34 CPU) @ 806.32/s (n=25270)
  f_rb_tie:  70 wallclock secs (29.45 usr +  2.37 sys = 31.82 CPU) @ 719.96/s (n=22909)
f_rb_tie_u:  74 wallclock secs (29.05 usr +  2.44 sys = 31.49 CPU) @ 729.76/s (n=22980)
 file_tail: 112 wallclock secs (29.71 usr +  2.34 sys = 32.05 CPU) @ 274.26/s (n=8790)
   grinder:  74 wallclock secs (28.38 usr +  2.77 sys = 31.15 CPU) @ 566.52/s (n=17647)
     lastn:  69 wallclock secs (19.78 usr + 11.34 sys = 31.12 CPU) @ 22.88/s (n=712)
lastn_getc:  77 wallclock secs (18.88 usr + 13.17 sys = 32.05 CPU) @ 19.78/s (n=634)

On a big file

% perl tailbm -30 100 721994.lines
Benchmark: running f_rb_obj, f_rb_obj_u, f_rb_tie, f_rb_tie_u, file_tail, grinder, each for at least 30 CPU seconds...
  f_rb_obj:  54 wallclock secs (30.52 usr +  0.93 sys = 31.45 CPU) @ 147.31/s (n=4633)
f_rb_obj_u:  70 wallclock secs (30.24 usr +  0.83 sys = 31.07 CPU) @ 144.19/s (n=4480)
  f_rb_tie:  69 wallclock secs (30.70 usr +  0.60 sys = 31.30 CPU) @ 121.31/s (n=3797)
f_rb_tie_u:  68 wallclock secs (29.96 usr +  0.53 sys = 30.49 CPU) @ 118.99/s (n=3628)
 file_tail:  76 wallclock secs (31.02 usr +  0.83 sys = 31.85 CPU) @ 60.60/s (n=1930)
   grinder: 121 wallclock secs (33.17 usr +  2.10 sys = 35.27 CPU) @ 0.09/s (n=3)
            (warning: too few iterations for a reliable count)
     lastn:  40 wallclock secs (20.15 usr + 10.80 sys = 30.95 CPU) @ 19.48/s (n=603)
lastn_getc:  42 wallclock secs (18.05 usr + 13.81 sys = 31.86 CPU) @ 17.70/s (n=564)

Code

#! /usr/bin/perl -w

use strict;
use Benchmark;
use File::ReadBackwards;
use File::Tail;

use constant VERIFY => 0;

my $count = shift or die "Benchmark count not specified, try 1000 (iters) or -10 (CPU secs).\n";
my $limit = shift or die "Number of lines not specified, try 100.\n";
my $file  = shift or die "No filename specified on command-line.\n";

my @result;

sub f_rb_obj {
    my $lim = $limit;
    my $bw = File::ReadBackwards->new( $file ) or die "can't read $file: $!\n" ;
    my $line;
    my @lines;
    while( defined( my $line = $bw->readline ) ) {
        push @lines, $line;
        last if --$lim <= 0;
    }
    reverse @lines;
}

sub f_rb_tie {
    my $lim = $limit;
    tie *BW, 'File::ReadBackwards', $file or die "can't read $file: $!\n" ;
    my @lines;
    while( <BW> ) {
        push @lines, $_;
        last if --$lim <= 0;
    }
    reverse @lines;
}

sub f_rb_obj_u {
    my $lim = $limit;
    my $bw = File::ReadBackwards->new( $file ) or die "can't read $file: $!\n" ;
    my $line;
    my @lines;
    while( defined( my $line = $bw->readline ) ) {
        unshift @lines, $line;
        last if --$lim <= 0;
    }
    @lines;
}

sub f_rb_tie_u {
    my $lim = $limit;
    tie *BW, 'File::ReadBackwards', $file or die "can't read $file: $!\n" ;
    my @lines;
    while( <BW> ) {
        unshift @lines, $_;
        last if --$lim <= 0;
    }
    @lines;
}

sub grinder {
    my $lim = $limit;
    my @alpha;
    my @beta;
    my $current = \@alpha;
    open( IN, $file ) or die "Cannot open $file for input: $!\n";
    while( <IN> ) {
        push @$current, $_;
        if( scalar @$current > $limit ) {
            if( $current == \@alpha ) {
                @beta = ();
                $current = \@beta;
            }
            else {
                @alpha = ();
                $current = \@alpha;
            }
        }
    }
    close IN;
    return @{$current == \@alpha ? \@beta : \@alpha}[-($limit - scalar @$current)..-1], @$current;
}

sub lastn {
    my $lines = $limit;
    my $fh;

    if (! open($fh, $file) ) {
        print "Can't open $file: $!";
        return;
    }
    binmode($fh);
    sysseek($fh, 0, 2); # Seek to end

    my $nlcount=0;
    while($nlcount<$lines) {
        last unless sysseek($fh, -1, 1);
        sysread($fh, $_, 1, 0) || die;
        $nlcount++ if ( $_ eq "\n");
        last if $nlcount==$lines;
        last unless (sysseek($fh, -1, 1));
    }
    seek($fh, sysseek($fh, 0, 1), 0) || warn;
    my @lines = <$fh>;
    close $fh;
    @lines;
}

sub lastn_getc {
    my $lines = $limit;
    my $fh;

    if (! open($fh, $file) ) {
        print "Can't open $file: $!";
        return;
    }
    binmode($fh);
    seek($fh, 0, 2); # Seek to end

    my $nlcount=0;
    while($nlcount<$lines) {
        last unless seek($fh, -1, 1);
        $_=getc($fh);
        die unless defined $_;
        $nlcount++ if ( $_ eq "\n");
        last if $nlcount==$lines;
        last unless (seek($fh, -1, 1));
    }
    my @lines = <$fh>;
    close $fh;
    @lines;
}

sub file_tail {
    my $fh = File::Tail->new(name=>$file,tail=>$limit);
    if( !defined $fh ) {
        die "Could not create File::Tail object on $file: $!\n";
    }
    $fh->nowait(1);
    my @lines;
    local $" = "";
    while( defined( my $line = $fh->read() )) {
        last unless $line;
        push @lines, $line;
    }
    # $fh->close;
    @lines;
}

if( VERIFY ) {
    local $" = "";
    for my $test( qw/f_rb_obj f_rb_tie grinder lastn lastn_getc file_tail/ ) {
        warn "$test\n", eval( "$test()" ), "\n";
    }
    exit;
}

timethese( $count, {
    'f_rb_obj'   => \&f_rb_obj,
    'f_rb_tie'   => \&f_rb_tie,
    'f_rb_obj_u' => \&f_rb_obj_u,
    'f_rb_tie_u' => \&f_rb_tie_u,
    'grinder'    => \&grinder,
    'file_tail'  => \&file_tail,
    'lastn'      => \&lastn,
    'lastn_getc' => \&lastn_getc,
});

__END__

Hmm, it always happens. Just before I was about to stumbit, I notice that I'm not closing the File::ReadBackwards objects. Well, I'm not going to rerun the benchmarks all over again, and anyway, the file handles are being silently destroyed at the end of the routine, so all is well.

update: the module is File::ReadBackwards, not File::Backwards as I first wrote. The code was right, the prose was wrong. Just another case of when the code and the documentation differ, trust the code.


print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'

Replies are listed 'Best First'.
Re: Performing a tail(1) in Perl (reading the last N lines of a file)
by Chmrr (Vicar) on Apr 25, 2002 at 17:23 UTC

    I'd seen Tie::File touted recently, so I decided to see how it stood up. I added the following subroutine:

    # needs "use Tie::File;" at the top of the benchmark script
    sub tie_file {
        my @lines;
        tie @lines, 'Tie::File', $file or die "$!";
        map {"$_\n"} @lines[-$limit..-1];
    }

    Pretty much as simple as it gets. Unfortunately, it's not quite so hot on the performance:

    [chmrr@supox chmrr]$ wc -l /var/log/messages
       4442 /var/log/messages
    [chmrr@supox chmrr]$ perl -w tailbm.pl -10 100 /var/log/messages
    Benchmark: running f_rb_obj, f_rb_obj_u, f_rb_tie, f_rb_tie_u, file_tail, grinder, lastn, lastn_getc, tie_file, each for at least 10 CPU seconds...
      f_rb_obj: 11 wallclock secs (10.35 usr +  0.22 sys = 10.57 CPU) @ 662.54/s (n=7003)
    f_rb_obj_u: 11 wallclock secs (10.54 usr +  0.12 sys = 10.66 CPU) @ 650.19/s (n=6931)
      f_rb_tie: 11 wallclock secs (10.48 usr +  0.20 sys = 10.68 CPU) @ 545.13/s (n=5822)
    f_rb_tie_u: 12 wallclock secs (10.39 usr +  0.16 sys = 10.55 CPU) @ 537.25/s (n=5668)
     file_tail: 12 wallclock secs (10.60 usr +  0.14 sys = 10.74 CPU) @ 333.71/s (n=3584)
       grinder: 10 wallclock secs (10.49 usr +  0.07 sys = 10.56 CPU) @ 26.23/s (n=277)
         lastn: 12 wallclock secs ( 7.79 usr +  2.63 sys = 10.42 CPU) @ 26.97/s (n=281)
    lastn_getc: 11 wallclock secs ( 8.83 usr +  1.46 sys = 10.29 CPU) @ 28.38/s (n=292)
      tie_file: 11 wallclock secs (10.15 usr +  0.12 sys = 10.27 CPU) @ 3.89/s (n=40)

    Well, it's good to know, at least.

    Update: Heh -- I started doing this test a'fore broquaint posted; great minds think alike, eh?

    Update 2: On a whim, I decided to test:

    sub backticks { split /$/m, `tail -$limit $file`; }

    ..as well. Even given the overhead of the shell, it's still darn fast:

    [chmrr@supox chmrr]$ perl -w tailbm.pl 500 100 /var/log/messages
    Benchmark: timing 500 iterations of backticks, f_rb_obj, f_rb_obj_u, f_rb_tie, f_rb_tie_u, file_tail, grinder, lastn, lastn_getc, tie_file...
     backticks:   1 wallclock secs ( 0.43 usr  0.16 sys +  0.61 cusr  0.30 csys =  1.50 CPU) @ 847.46/s (n=500)
      f_rb_obj:   1 wallclock secs ( 0.71 usr +  0.04 sys =  0.75 CPU) @ 666.67/s (n=500)
    f_rb_obj_u:   1 wallclock secs ( 0.77 usr +  0.00 sys =  0.77 CPU) @ 649.35/s (n=500)
      f_rb_tie:   1 wallclock secs ( 0.89 usr +  0.02 sys =  0.91 CPU) @ 549.45/s (n=500)
    f_rb_tie_u:   1 wallclock secs ( 0.91 usr +  0.02 sys =  0.93 CPU) @ 537.63/s (n=500)
     file_tail:   1 wallclock secs ( 1.47 usr +  0.01 sys =  1.48 CPU) @ 337.84/s (n=500)
       grinder:  19 wallclock secs (18.93 usr +  0.22 sys = 19.15 CPU) @ 26.11/s (n=500)
         lastn:  19 wallclock secs (14.20 usr +  4.34 sys = 18.54 CPU) @ 26.97/s (n=500)
    lastn_getc:  20 wallclock secs (15.09 usr +  2.64 sys = 17.73 CPU) @ 28.20/s (n=500)
      tie_file: 143 wallclock secs (126.62 usr +  1.81 sys = 128.43 CPU) @ 3.89/s (n=500)

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

      Says Chmrr:
      Tie::File ... Pretty much as simple as it gets. Unfortunately, it's not quite so hot on the performance:
      Not too surprising, because unlike most of the methods you benchmarked, Tie::File does not read the file backwards. It reads it forwards, starting at the beginning, and constructs some data structures along the way. For a long file like /var/adm/messages, this will take some time, because it grovels over the entire file.

      The payoff would come if you then asked it to tell you what was on line 12,345, which it would do instantly---or if you asked it to modify line 12,345, which the other modules won't do at all.
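The random-access payoff described here can be sketched with two small helpers. The names `line_at` and `replace_line` are hypothetical, invented for this example. Only the documented Tie::File array interface and its `mode` option are used, and the module still has to scan forward through the file to find the requested record the first time it is asked for.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;

# Hypothetical helper: fetch one line by 1-based number without slurping the file.
sub line_at {
    my ($path, $number) = @_;
    tie my @rows, 'Tie::File', $path, mode => O_RDONLY
        or die "Cannot tie $path: $!\n";
    my $line = $rows[ $number - 1 ];    # records are chomped by default
    untie @rows;
    return $line;
}

# Hypothetical helper: rewrite one line in place; Tie::File updates the file.
sub replace_line {
    my ($path, $number, $text) = @_;
    tie my @rows, 'Tie::File', $path
        or die "Cannot tie $path: $!\n";
    $rows[ $number - 1 ] = $text;       # the record separator is added for us
    untie @rows;
    return;
}
```

Something like `replace_line('/var/adm/messages', 12_345, 'redacted')` is the kind of edit a backwards reader cannot make at all.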

      --
      Mark Dominus
      Perl Paraphernalia

        Have you considered modifying Tie::File to defer reading the file such that if the only subscripts ever given to the tied array are negative, it reads the file backwards? Not asking for the feature; just throwing out ideas.
Re: Performing a tail(1) in Perl (reading the last N lines of a file)
by KM (Priest) on Apr 25, 2002 at 17:03 UTC
      I'm aware of the PPT project, but to my mind they are shell tools, and do not have a modul(e|ar) interface so you can't "embed" them in your script, you have to backtick them, just like the original tail(1). I could be wrong about that though.

      (A few moments later) hmmm no, it still looks like you can only run it as a child process, which is what I wanted to avoid. There is a sub named print_tail, which, according to its comments "Prints the tail of a file according to the options passed on the command line". I don't see a function that returns an array of scalars, which is what I was after.


      print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
        But, it does what you want (plus more). You sort of took a spoke of a wheel, and rewrote the spoke :) Your code isn't any more modular than the one from PPT. Point being that there is good code written to do what you wanted, but instead of using it (embedding what you needed), or modularizing it you redid it. I'm not griping, but when you show a lot of benchmarks you obviously took time to do it multiple ways to see which is fastest (fast ne best) as opposed to taking the time to make a current wheel more useful. But, like I said before, seems like you had a good exercise :)

        Cheers,
        KM

      I'm not able to find this thread - the Perl Power Tools version of tail(1). Could someone please repost it? I want to find how to print the last n lines in the most efficient way. I should not have to read the entire file to print the last n lines.

        CPAN has it: tail in the ppt distribution. Also, a simple Google search for perl tail will also bring you to CPAN. Do some work yourself.

Re: Performing a tail(1) in Perl (reading the last N lines of a file)
by Fletch (Bishop) on Apr 25, 2002 at 17:25 UTC

    Your lastn is on the right track, but you're probably causing too much overhead by reading a byte at a time. Consider something like this (python-y pseudocode; I had a real implementation once at $job[-2] but I don't remember what I did with it):

    BLOCKSIZE = 1024 (or so)
    lines = 0 ; lines_wanted = 10
    seek to end of file
    while lines < lines_wanted
        seek back BLOCKSIZE bytes
        read block into buffer
        pos = end of buffer
        while pos > 0
            use rindex to find \n starting at pos
            if found
                lines++
                pos = index of \n
            else
                last

    Update: You of course need error checking on seeks to make sure that you don't try and go back past the beginning of the file. And you would want to remember where the buffer starts so that you can use that (plus pos) to determine where to seek to when you start reading forwards.
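One way the block-at-a-time idea might look in real Perl is sketched below. It is an illustration under my own assumptions, not the lost implementation mentioned above: the name `tail_blocks` is made up, it counts newlines per block with `tr` instead of stepping through the buffer with `rindex`, and it keeps every block it has read in memory rather than remembering an offset and reading forwards again.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $wanted lines of $path, reading
# fixed-size blocks backwards from the end of the file.
sub tail_blocks {
    my ($path, $wanted, $blocksize) = @_;
    $blocksize ||= 1024;
    return () if $wanted <= 0;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    my $pos      = -s $fh;               # start at the end of the file
    my $buffer   = '';
    my $newlines = 0;
    # One newline more than $wanted guarantees the kept lines are complete;
    # $pos never goes below zero, so we cannot seek past the start.
    while ( $pos > 0 && $newlines <= $wanted ) {
        my $size = $pos < $blocksize ? $pos : $blocksize;
        $pos -= $size;
        seek $fh, $pos, 0 or die "Cannot seek in $path: $!\n";
        read( $fh, my $block, $size ) == $size or die "Short read on $path\n";
        $newlines += ( $block =~ tr/\n// );   # count newlines in this block
        $buffer = $block . $buffer;
    }
    close $fh;
    my @lines = split /^/, $buffer;      # split into lines, keeping newlines
    splice @lines, 0, @lines - $wanted if @lines > $wanted;
    return @lines;
}
```

With the default block size, asking for ten lines of a typical log file should need only one or two reads, however large the file is.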

Re: Performing a tail(1) in Perl (reading the last N lines of a file)
by broquaint (Abbot) on Apr 25, 2002 at 17:18 UTC
    I'm not sure how this benchmarks, but it is a little more succinct ;-)
    use Tie::File;

    my $length = shift @ARGV;
    my $file   = shift @ARGV;

    tie(my @file, "Tie::File", $file, autochomp => 0) or die("ack - $!");
    print @file[$#file - $length + 1 .. $#file];
    _________
    broquaint

    Update: fixed slice code as it was off by one (as noted by grinder)

      Says broquaint:
      I'm not sure how this benchmarks, but it is a little more succinct ;-)
      use Tie::File; ... print @file[$#file - $length + 1 .. $#file];
      It should benchmark OK, although not as well as a special-purpose backward-file-reader. But I agree with you about the succinctness. I mostly followed up to point out that you can do even better:
      print @file[-$length .. -1];
      Hope this helps.
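A small sketch of the negative-slice idiom with one guard added: with an ordinary Perl array, asking for more trailing elements than exist pads the result with undefined values, so the length is clamped first. The helper name `last_n` is hypothetical, and it takes an array reference so that the array itself is not copied.

```perl
use strict;
use warnings;

# Hypothetical helper: the last $length elements of the array behind $items,
# clamped to the array's size.
sub last_n {
    my ($length, $items) = @_;
    my $size = @$items;
    $length = $size if $length > $size;  # avoid undef padding on short input
    return () if $length <= 0;
    return @{$items}[ -$length .. -1 ];
}
```

For example, `print last_n( 10, \@file );` would print the final ten elements of `@file`, or all of them if there are fewer than ten.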

      --
      Mark Dominus
      Perl Paraphernalia

Re: Performing a tail(1) in Perl (reading the last N lines of a file)
by massa (Hermit) on Jul 18, 2008 at 00:21 UTC
    I would benchmark your "juggling around" solution. It seems to me that the built-in PerlIO buffering could make it really efficient compared to the other solutions.
    my ($file, $limit) = ('/var/log/messages', 100);
    open my $f, '<', $file or die;
    my @lines;
    while( <$f> ) {
        shift @lines if @lines >= $limit;   # drop the oldest line once the window is full
        push @lines, $_
    }
    print @lines
    []s, HTH, Massa
      Here's a subroutine that does what you want, and optionally trims the file as well. It assumes a line length of 72 bytes and goes back number of lines * 72. If it doesn't get enough lines, it goes back another block. This is quite efficient, especially when you only want a few lines at the end of a file, even for huge files. It's only 'slow' when you want thousands of lines.
      ###########################################################
      sub read_log {
          my $logfile      = shift;
          my $lines_wanted = shift;
          my $TRIM         = shift;
          my $lines_found  = 0;
          my $BLOCK   = $lines_wanted * 72;   # Assume 72 chars (bytes) per line
          my @lines   = ();
          my $GO_BACK = $BLOCK;
          my $SEEK_RESULT = 1;
          if (open(LOG, $logfile)) {
              ## Stop if we have enough lines, or go back past the start
              while ($lines_found < $lines_wanted and $SEEK_RESULT) {
                  $SEEK_RESULT = seek LOG, -$GO_BACK, 2;   # Go back approx requested lines
                  if ($SEEK_RESULT) {
                      <LOG>;                  # Chuck the first line remnant
                  }
                  else {
                      seek LOG, 0, 0;         # Went past the start: read the whole file
                  }
                  @lines = <LOG>;             # Get the rest of the lines
                  $lines_found = scalar(@lines);   # Count the lines
                  $GO_BACK += $BLOCK;
              }
              close LOG;
              # If too many lines, just splice the array
              my $diff = $lines_found - $lines_wanted;
              splice @lines, 0, $diff if $diff > 0;
              trim_file($logfile, \@lines) if $TRIM;
          }
          else {
              print "Couldn't open $logfile: $!<BR>";
          }
          return @lines;                      # Hand the lines back to the caller
      }
      ###########################################################
      # Takes a file name and array of lines, backs up the existing file
      # and trims it.
      sub trim_file {
          my $file  = shift || die "No filename passed to trim_file()\n";
          my $lines = shift || die "No lines passed to trim_file()\n";
          my $backup = "$file.bak";
          unlink $backup if -e $backup;
          rename $file, $backup;
          open (OUT, ">$file") or die "Couldn't create $file:$!\n";
          print OUT foreach @$lines;
          close OUT;
      }
Re: Performing a tail(1) in Perl (reading the last N lines of a file)
by Caio (Acolyte) on Aug 29, 2011 at 20:26 UTC
    Hello all, I was myself having some trouble getting information from the last lines of a file without having to go through the whole thing to get them, and stumbled upon this topic, and while reading I had the following idea:
    open (FILE, $file) or die "Cannot open $file: $!\n";
    my @file = <FILE>;
    close FILE;
    for (my $count = ( (scalar @file) -1); $count >= 0; $count--){
        #my $line = $file[$count];  optional
        # do what you gotta do
        # to end the loop abruptly, just do: $count = -1;
        # This can be done from an else (if you are searching
        # patterns with if's).
        # Or you can use another, increasing counter, so you can
        # choose the number of lines to look at, say 12, then:
        # if ($count_2 > 12) {$count = -1}
    }
    I'd like advice and criticism on the code above, what say you Monks? ps: I'm still a newbie, and haven't tested the code for time or memory use (don't know how).

Node Type: perlmeditation [id://162034]
Approved by broquaint
Front-paged by Dominus