(RhetTbull) Re: Looking for a better way to get the number of lines in a file...

by RhetTbull (Curate)
on Dec 10, 2001 at 18:06 UTC (#130658)


in reply to Looking for a better way to get the number of lines in a file...

The special Perl variable $. holds the current line number of the filehandle you last read from. Hence, once you've read to the end of the file, it gives you the total number of lines:

while (<FILE>){}; print "length = $.";
As an aside, it's typical in Perl to reserve all-uppercase names for filehandles and the like, rather than using them for ordinary variables as you do. IMHO it makes your code easier to read.
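For anyone who wants the $. approach wrapped up for reuse, here is a minimal sketch. The sub name count_lines_by_reading and the use of a lexical filehandle are my own choices for illustration, not something from the original question. The one detail that matters is that $. has to be saved before close(), because closing the handle resets it.

```perl
use strict;
use warnings;

# Hypothetical helper: read to EOF and return the final value of $.
sub count_lines_by_reading {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";
    while (defined(my $line = <$fh>)) { }   # read and discard every line
    my $lines = $.;                         # save before close(), which resets $.
    close($fh);
    return $lines;
}

# Optional command-line use: perl count.pl somefile
print count_lines_by_reading($ARGV[0]), "\n" if @ARGV;
```

Note that this counts a final line even when it has no trailing newline, since readline still returns it as a line.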

Update: In light of gbarr's post below I did some benchmarking. gbarr is correct that snarfing the file in chunks is much faster. I tested quite a few large files (>1 million lines) and found that in most cases, reading 100K chunks and counting linefeeds is several times faster than reading line by line. In a few cases the two approaches were close. So, I definitely recommend gbarr's solution. Benchmark code follows:

#!/usr/bin/perl
use warnings;
use strict;
use Benchmark;

timethese(100, {
    'line_by_line' => q{
        open(INFILE,'test.dat') or die "open: $!";
        while(<INFILE>){};
        close(INFILE);
    },
    'chunk' => q{
        $count = 0;
        open(INFILE,'test.dat') or die "open: $!";
        local $/=\1024000;
        while(<INFILE>) { $count += tr/\n// }
        close(INFILE);
    }
});
On some sample data, this produces the following:
Benchmark: timing 100 iterations of chunk, line_by_line...
       chunk:  3 wallclock secs ( 1.58 usr +  0.70 sys =  2.28 CPU) @ 43.80/s (n=100)
line_by_line: 12 wallclock secs ( 9.54 usr +  0.72 sys = 10.27 CPU) @  9.74/s (n=100)
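If you want to repeat this comparison on your own data, the following is a rough sketch of the same two approaches written as subs and timed through code references instead of strings. The sub names, the lexical filehandles, and the 100 KB default record size are assumptions of mine, not the exact code timed above. It also checks that both methods agree on the count before timing anything.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Count lines by reading one line at a time.
sub count_line_by_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";
    my $count = 0;
    while (defined(my $line = <$fh>)) { $count++ }
    close($fh);
    return $count;
}

# Count newline characters in fixed-size records.
# $size is a tunable assumption; it defaults to 100 KB.
sub count_chunked {
    my ($path, $size) = @_;
    $size ||= 102_400;
    open(my $fh, '<', $path) or die "open $path: $!";
    local $/ = \$size;    # reference to an integer: read records, not lines
    my $count = 0;
    while (defined(my $chunk = <$fh>)) {
        $count += ($chunk =~ tr/\n//);
    }
    close($fh);
    return $count;
}

# Only run the benchmark when a file name is given on the command line.
if (@ARGV) {
    my $file     = $ARGV[0];
    my $by_line  = count_line_by_line($file);
    my $by_chunk = count_chunked($file);
    warn "counts differ: $by_line vs $by_chunk\n" if $by_line != $by_chunk;
    timethese(100, {
        line_by_line => sub { count_line_by_line($file) },
        chunk        => sub { count_chunked($file) },
    });
}
```

The two counts will differ by one on a file whose last line has no trailing newline, since the chunked version counts newline characters and not lines.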

Update #2: After further playing around and looking at other nodes on the same topic, it seems that sysread is faster still. Running new benchmarks with the following code

open INFILE,'test.dat' or die "open: $!";
my $count = 0;
while (sysread(INFILE,$_,102400)) {
    $count += tr/\n//;
}
yields
/home/rhet/misc> ./testfcount.pl
Benchmark: timing 100 iterations of chunk, line_by_line, sysread...
       chunk:  12 wallclock secs (  8.00 usr +  3.52 sys =  11.53 CPU) @  8.68/s (n=100)
line_by_line: 154 wallclock secs (149.20 usr +  3.29 sys = 152.50 CPU) @  0.66/s (n=100)
     sysread:   6 wallclock secs (  4.09 usr +  1.62 sys =   5.71 CPU) @ 17.52/s (n=100)
I tried a variety of files and found some that favored one method or another, but sysread was usually the fastest, often by a wide margin. On files with really short lines the line_by_line method was VERY slow, but on files with much longer lines it was often faster than the other two. In general, though, it looks like sysread is your best bet. You could probably squeeze out a bit more by tuning the block size you pass to sysread, but the best value would likely depend on the particular platform and its configuration.
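Since the block size is the obvious knob to turn, here is a hedged sketch of the sysread approach as a sub that takes the block size as a parameter. The name count_newlines_sysread, the binmode call, and the explicit error check are my additions for illustration and were not part of the benchmarked code. sysread returns undef on error and 0 at end of file, so the loop distinguishes the two.

```perl
use strict;
use warnings;

# Hypothetical helper: count "\n" bytes using unbuffered sysread.
# $block_size is a tunable assumption; it defaults to 100 KB.
sub count_newlines_sysread {
    my ($path, $block_size) = @_;
    $block_size ||= 102_400;
    open(my $fh, '<', $path) or die "open $path: $!";
    binmode($fh);                      # read raw bytes
    my ($count, $buf) = (0, '');
    while (1) {
        my $read = sysread($fh, $buf, $block_size);
        die "sysread $path: $!" unless defined $read;   # undef signals an error
        last if $read == 0;                             # 0 signals end of file
        $count += ($buf =~ tr/\n//);
    }
    close($fh);
    return $count;
}

# Optional command-line use: perl count.pl somefile [block_size]
print count_newlines_sysread(@ARGV), "\n" if @ARGV;
```

Like the chunked version, this counts newline characters, so a final line without a trailing newline is not included, unlike the $. approach.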


Re: Re: Looking for a better way to get the number of lines in a file...
by gbarr (Monk) on Dec 10, 2001 at 22:20 UTC
    Reading the file line by line is not going to be the most efficient way, esp. if the file is large. Instead, you could read the file in blocks and count the newline characters:

    local $/=\102400; while(<FILE>) { $count += tr/\n// }

    This example reads in chunks of 100K.
