Looking for a better way to get the number of lines in a file...

by coec (Chaplain)
on Dec 10, 2001 at 17:44 UTC

coec has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. The following subroutine (function) returns the number of lines in a file. I can't help thinking there must be a better way to do this. Any ideas?
sub File_Length {
        $DEBUG && warn "File_Length\n";
        $DEBUG && warn "FILE = '" . $_ . "'\n";
        $FILE = chomp($_);
        $DEBUG && warn "FILE = " . $FILE . "\n";
        $COUNT = 0;
        while (<$FILE>) {
                $DEBUG && warn "COUNT = " . $COUNT . "\n";
                $COUNT++;
        }
        $DEBUG || print "File Length is " . $COUNT . "\n";
        $COUNT;
}

Replies are listed 'Best First'.
(RhetTbull) Re: Looking for a better way to get the number of lines in a file...
by RhetTbull (Curate) on Dec 10, 2001 at 18:06 UTC
    The special perl variable $. tells you the current line of the file you're reading. Hence, at the end of the file, it gives you the total number of lines:
    while (<FILE>){}; print "length = $.";
    As an aside, it's conventional in Perl to reserve all-uppercase names for filehandles and the like, not for ordinary variables as you do here. IMHO, following that convention makes your code easier to read.
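    For completeness, a self-contained version of the $. approach might look like this (a minimal sketch; 'test.dat' is just a placeholder filename):

    open(my $fh, '<', 'test.dat') or die "open: $!";
    1 while <$fh>;    # read and discard each line
    my $lines = $.;   # $. holds the line number of the last line read
    close($fh);
    print "length = $lines\n";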

    Update: In light of gbarr's post below, I did some benchmarking. gbarr is correct that snarfing the file in chunks is much faster. I tested quite a few large files (>1 million lines) and found that in most cases, reading 100K chunks and counting linefeeds is several times faster than reading line by line. In a few cases the two approaches were close. So, I definitely recommend gbarr's solution. Benchmark code follows:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Benchmark;
    timethese(100, {
        'line_by_line' => q{
            open(INFILE,'test.dat') or die "open: $!";
            while(<INFILE>){};
            close(INFILE);
        },
        'chunk' => q{
            $count = 0;
            open(INFILE,'test.dat') or die "open: $!";
            local $/=\1024000;
            while(<INFILE>) { $count += tr/\n// }
            close(INFILE);
        }
    });
    On some sample data, this produces the following:
    Benchmark: timing 100 iterations of chunk, line_by_line...
           chunk:  3 wallclock secs ( 1.58 usr + 0.70 sys =  2.28 CPU) @ 43.80/s (n=100)
    line_by_line: 12 wallclock secs ( 9.54 usr + 0.72 sys = 10.27 CPU) @  9.74/s (n=100)

    Update #2: After further playing around and looking at other nodes on the same topic, it seems that sysread is faster still. Re-running the benchmark with the following code

    open INFILE,'test.dat' or die "open: $!";
    my $count = 0;
    while (sysread(INFILE,$_,102400)) {
        $count += tr/\n//;
    }
    yields
    /home/rhet/misc> ./testfcount.pl
    Benchmark: timing 100 iterations of chunk, line_by_line, sysread...
           chunk:  12 wallclock secs (  8.00 usr + 3.52 sys =  11.53 CPU) @  8.68/s (n=100)
    line_by_line: 154 wallclock secs (149.20 usr + 3.29 sys = 152.50 CPU) @  0.66/s (n=100)
         sysread:   6 wallclock secs (  4.09 usr + 1.62 sys =   5.71 CPU) @ 17.52/s (n=100)
    I tried a variety of files and was able to find certain files that made one or the other faster, but in general sysread was usually the fastest, and quite often by a wide margin. On files with really short lines the line_by_line method was VERY slow, but on files with much longer lines it was often faster than the other two. In general, though, it looks like sysread is your best bet. You could probably make further optimizations by changing the size of the block you read with sysread, but the best size will likely depend on the particular platform and configuration.
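    Wrapped up as a reusable sub, the sysread approach might look something like this (a sketch; the sub name and default block size are my choices, not part of the benchmark above):

    sub count_lines_sysread {
        my ($file, $blocksize) = @_;
        $blocksize ||= 102400;                    # default to 100K blocks
        open(my $fh, '<', $file) or die "open $file: $!";
        my ($count, $buf) = (0, '');
        while (sysread($fh, $buf, $blocksize)) {
            $count += ($buf =~ tr/\n//);          # count newlines in this block
        }
        close($fh);
        return $count;
    }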
      Reading the file line-by-line is not going to be the most efficient way, especially if the file is large. Instead, you could read the file in blocks and count the newline characters:

      local $/=\102400;
      while(<FILE>) { $count += tr/\n// }

      This example reads in chunks of 100K.
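      A self-contained version of that snippet might look like this (a sketch; the filename is a placeholder, and scoping the local in a block restores $/ afterwards):

      my $count = 0;
      {
          local $/ = \102400;    # read 100K blocks instead of lines
          open(my $fh, '<', 'test.dat') or die "open: $!";
          $count += tr/\n// while <$fh>;    # tr/// counts the newlines in $_
          close($fh);
      }
      print "lines: $count\n";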

Re: Looking for a better way to get ...
by stefan k (Curate) on Dec 10, 2001 at 17:58 UTC
Re (tilly) 1: Looking for a better way to get the number of lines in a file...
by tilly (Archbishop) on Dec 10, 2001 at 18:36 UTC
    Several things about programming style spring to mind.

    The first is that you are using global variables. A brush with strict.pm can help you avoid using them, which will lessen the chances that subroutines will interfere with each other.

    The second is that Perl has interpolation, so you don't need to concatenate strings. It is simpler to write "FILE: $FILE\n", and most people will find it easier to read.

    Thirdly, comprehension studies show that people understand code significantly better when the indent is in the range of 2-4 spaces. You are using 8 (a full tab-stop), and that is going to make any complex logic substantially harder to follow.

    And last but not least, ALL CAPS IS SHOUTING! Anyone who has been around on the net for a while will be used to that convention, and it makes your code very hard to read for any length of time. Yes, this is an "important irrelevancy", something that is important to be consistent on, but which choice you make is fundamentally irrelevant. However the connection between capitals and shouting is rather widely accepted.

    Oh, and a final book recommendation. Code Complete by Steve McConnell is a good place to learn a lot about what makes up good (procedural) programming style. It is a classic. Read it.
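    Putting those suggestions together, the original sub might end up looking something like this (a sketch of the restyled code, not tilly's own; the names are mine):

    use strict;
    use warnings;

    my $debug = 0;    # a lexical flag instead of a global

    sub file_length {
        my ($file) = @_;    # take the filename as an argument
        warn "file_length: counting lines in '$file'\n" if $debug;
        open(my $fh, '<', $file) or die "can't open $file: $!";
        my $count = 0;
        $count++ while <$fh>;
        close($fh);
        return $count;
    }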

Re: Looking for a better way to get the number of lines in a file...
by Biker (Priest) on Dec 10, 2001 at 18:26 UTC

    Well, what's a line?

    The first line starts by the beginning of the file and ends at the first LF (or CRLF depending upon your operating system).
    Most of the lines will start just after the preceding line termination and continue to the next line termination (LF or CRLF).
    Finally, one line may continue until end of file. This is not really clean, since for a 'true text file', the last byte(s) in the file should be a line terminator.
    All of the above for a file that contains at least four lines.

    Now, the problem has become to count the number of line terminators in the file and potentially add one to that. ("Until end of file.")

    To count occurrences of something in a (non-indexed) file you will have to read it from the beginning to the end.
    You could read the file byte by byte and check each byte to see whether it's a CR or an LF. If it's a CR, will the next byte be an LF? (Assuming DOS/Windows as the OS.)
    Why not let the file read figure out what a line terminator is for you? Then read one line at a time, count it, and throw it away.
    Using $. is fine if you make some assumptions. $. is reset when the file handle is closed. From perlvar: "Because <> never does an explicit close, line numbers increase across ARGV files." Take care.

    There are smarter ways of doing it. But then you'll have to start making assumptions on how a line ends. And that the last line actually is correctly terminated. (Well, it should. But...)
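    A sketch of that "count the terminators, then maybe add one" idea, assuming Unix LF line endings (the sub name and block size here are mine, for illustration):

    sub count_lines_exact {
        my ($file) = @_;
        open(my $fh, '<', $file) or die "open $file: $!";
        my ($count, $last, $buf) = (0, '', '');
        while (read($fh, $buf, 102400)) {
            $count += ($buf =~ tr/\n//);    # count the LFs in this block
            $last = substr($buf, -1);       # remember the final byte seen
        }
        close($fh);
        $count++ if $last ne '' && $last ne "\n";    # unterminated last line
        return $count;
    }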

    f--k the world!!!!
    /dev/world has reached maximal mount count, check forced.

Re: Looking for a better way to get the number of lines in a file...
by strat (Canon) on Dec 10, 2001 at 19:19 UTC
    Under Unix the following may be rather fast, although it uses the external program wc:
    my $length = `wc -l $filename`;
    chomp($length);

    Best regards,
    perl -e "print a|r,p|d=>b|p=>chr 3**2 .7=>t and t"

      I have noticed that if you use wc on a file, it also returns the name of the file. You need to change the code as follows:
      my $string = `wc -l $filename`;
      my ($file, $length) = split " ", $filename;
      chomp($length);


      ...We will have peace, when you and all your works have perished -- and the works of your dark master to whom you would deliver us. You are a liar, Saruman, and a corrupter of men's hearts. -- Theoden in The Two Towers --
        That looks a bit off to me. First of all, $string contains the information, but you are splitting $filename. Secondly, every wc implementation that I've ever seen outputs the linecount before the filename, but you have them reversed....

        If I were going to rewrite this snippet and was forced to use wc, I might do something like:

        #!/usr/bin/perl -wT
        use strict;
        %ENV = (PATH => '/bin:/usr/bin');    # give us a happy environment
        my $filename = '/etc/services';      # filename to be checked
        my $length = do {
            my $string = `wc -l $filename`;  # get output of wc
            die "wc error $?" if $?;         # die if wc chokes for some reason
            no warnings 'numeric';           # turn off a pesky warning ;-)
            $string + 0;                     # numerify it with +0
        };
        print "length = $length\n";          # "length = 331" on my machine
        Update:
        *sigh* /g still gives me trouble sometimes.... The numerify line above could also have been:
        ($string =~ /\d+/g)[0];
        which would eliminate the need to turn off the warning. Therefore, the above can all be squeezed down into:
        my $length = (`wc -l $filename` =~ /\d+/g)[0]; # get output of wc
        die "wc error $?" if $?;    # die if wc chokes for some reason

        -Blake

Re: Looking for a better way to get the number of lines in a file...
by simul (Novice) on Mar 13, 2011 at 12:06 UTC
    I posted Perl code on my blog that solves this problem using windowed sampling and averaging. If anyone arrives at this thread with files that have millions of lines... it's useful. Reposting the code here for anyone to review:
      Code on the blog has been updated. Always accurate for predictive/repeated files. Does an accurate linear model estimate with gzipped files. Current source: documentroot.com/alc
Re: Looking for a better way to get the number of lines in a file...
by Kage (Scribe) on Dec 10, 2001 at 21:54 UTC
    Simple..
    $file="filename.ext"; open(COUNT,"$file") || err("Oops.. $!"); @lines = <COUNT>; close(COUNT); $count="0"; foreach $line (@lines) { $count++; } print "Content-type: text/html\n\n"; print "$file's length is $count";
    10 lines if you don't count the line breaks
    "I am loved by few, hated by many, and wanted by plenty." -Kage (Alex)
    SkarySkriptz
      This is an adaptation of... um... I think recipe 8.6 in Cookbook, "Picking a random line from a file", though I may have seen it elsewhere. (Corrections? Credit?)
      #!/usr/bin/perl
      open( FH, 'data' ) or die( "data: $!\n" );
      while (<FH>) {
          if ( eof(FH) ) { print $. }
      }
      What I like about this is that you aren't storing anything in memory - so this should wade through a 2GB file (as mentioned fearfully later in this thread) fairly quickly.

      blyman setenv EXINIT 'set noai ts=2'

      Reading in a full file into memory scares me. One day, someone will feed a 2GB file...

      Having said that, @lines in your example already gives the number of elements when evaluated in scalar context, and each element holds one line. There's no need to count them in a loop.
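      For instance (a two-line sketch reusing the variables from the code above):

      my $count = @lines;    # scalar context yields the element count, i.e. lines
      print "$file's length is $count";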

      f--k the world!!!!
      /dev/world has reached maximal mount count, check forced.
