Looking for a better way to get the number of lines in a file...

by coec (Chaplain)
on Dec 10, 2001 at 17:44 UTC

coec has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. The following subroutine (function) returns the number of lines in a file. I can't help thinking there must be a better way to do this. Any ideas?
sub File_Length {
        $DEBUG && warn "File_Length\n";
        $DEBUG && warn "FILE = '" . $_ . "'\n";
        $FILE = chomp($_);
        $DEBUG && warn "FILE = " . $FILE . "\n";
        $COUNT = 0;
        while (<$FILE>) {
                $DEBUG && warn "COUNT = " . $COUNT . "\n";
                $COUNT++;
        }
        $DEBUG || print "File Length is " . $COUNT . "\n";
        $COUNT;
}

Replies are listed 'Best First'.
(RhetTbull) Re: Looking for a better way to get the number of lines in a file...
by RhetTbull (Curate) on Dec 10, 2001 at 18:06 UTC
    The special perl variable $. tells you the current line of the file you're reading. Hence, at the end of the file, it gives you the total number of lines:
    while (<FILE>){}; print "length = $.";
    As an aside, it's conventional in Perl to reserve all-uppercase names for filehandles and the like, not for ordinary variables as you do here. IMHO, following that convention makes your code easier to read.
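    For completeness, a self-contained version of the $. approach might look like this (a minimal sketch; 'test.dat' is just a placeholder filename):

    open(my $fh, '<', 'test.dat') or die "open: $!";
    1 while <$fh>;    # read and discard each line
    my $lines = $.;   # $. holds the line number of the last line read
    close($fh);
    print "length = $lines\n";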

    Update: In light of gbarr's post below, I did some benchmarking. gbarr is correct that snarfing the file in chunks is much faster. I tested quite a few large files (>1 million lines) and found that in most cases, reading 100K chunks and counting linefeeds is several times faster than reading line by line. In a few cases the two approaches were close. So, I definitely recommend gbarr's solution. Benchmark code follows:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Benchmark;
    timethese(100, {
        'line_by_line' => q{
            open(INFILE,'test.dat') or die "open: $!";
            while(<INFILE>){};
            close(INFILE);
        },
        'chunk' => q{
            $count = 0;
            open(INFILE,'test.dat') or die "open: $!";
            local $/=\1024000;
            while(<INFILE>) { $count += tr/\n// }
            close(INFILE);
        }
    });
    On some sample data, this produces the following:
    Benchmark: timing 100 iterations of chunk, line_by_line...
           chunk:  3 wallclock secs ( 1.58 usr + 0.70 sys =  2.28 CPU) @ 43.80/s (n=100)
    line_by_line: 12 wallclock secs ( 9.54 usr + 0.72 sys = 10.27 CPU) @  9.74/s (n=100)

    Update #2: After further playing around and looking at other nodes on the same topic, it seems that sysread is faster still. Re-running the benchmark with the following code

    open INFILE,'test.dat' or die "open: $!";
    my $count = 0;
    while (sysread(INFILE,$_,102400)) {
        $count += tr/\n//;
    }
    yields
    /home/rhet/misc> ./testfcount.pl
    Benchmark: timing 100 iterations of chunk, line_by_line, sysread...
           chunk:  12 wallclock secs (  8.00 usr + 3.52 sys =  11.53 CPU) @  8.68/s (n=100)
    line_by_line: 154 wallclock secs (149.20 usr + 3.29 sys = 152.50 CPU) @  0.66/s (n=100)
         sysread:   6 wallclock secs (  4.09 usr + 1.62 sys =   5.71 CPU) @ 17.52/s (n=100)
    I tried a variety of files and was able to find certain files that made one or the other faster, but in general sysread was usually the fastest, and quite often by a wide margin. On files with really short lines the line_by_line method was VERY slow, but on files with much longer lines it was often faster than the other two. In general, though, it looks like sysread is your best bet. You could probably make further optimizations by changing the size of the block you read with sysread, but the best size will likely depend on the particular platform and configuration.
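    Wrapped up as a reusable sub, the sysread approach might look something like this (a sketch; the sub name and default block size are my choices, not part of the benchmark above):

    sub count_lines_sysread {
        my ($file, $blocksize) = @_;
        $blocksize ||= 102400;                    # default to 100K blocks
        open(my $fh, '<', $file) or die "open $file: $!";
        my ($count, $buf) = (0, '');
        while (sysread($fh, $buf, $blocksize)) {
            $count += ($buf =~ tr/\n//);          # count newlines in this block
        }
        close($fh);
        return $count;
    }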
      Reading the file line-by-line is not going to be the most efficient way, especially if the file is large. Instead, you could read the file in blocks and count the newline characters:

      local $/=\102400;
      while(<FILE>) { $count += tr/\n// }

      This example reads in chunks of 100K.
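      A self-contained version of that snippet might look like this (a sketch; the filename is a placeholder, and scoping the local in a block restores $/ afterwards):

      my $count = 0;
      {
          local $/ = \102400;    # read 100K blocks instead of lines
          open(my $fh, '<', 'test.dat') or die "open: $!";
          $count += tr/\n// while <$fh>;    # tr/// counts the newlines in $_
          close($fh);
      }
      print "lines: $count\n";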

Re: Looking for a better way to get ...
by stefan k (Curate) on Dec 10, 2001 at 17:58 UTC
Re (tilly) 1: Looking for a better way to get the number of lines in a file...
by tilly (Archbishop) on Dec 10, 2001 at 18:36 UTC
    Several things about programming style spring to mind.

    The first is that you are using global variables. A brush with strict.pm can help you avoid using them, which will lessen the chances that subroutines will interfere with each other.

    The second is that Perl has interpolation, so you don't need to concatenate strings. It is simpler to write "FILE: $FILE\n", and most people will find it easier to read.

    Thirdly, comprehension studies show that people understand code significantly better when the indent is in the range of 2-4 spaces. You are using 8 (a full tab-stop), and that is going to make any complex logic substantially harder to follow.

    And last but not least, ALL CAPS IS SHOUTING! Anyone who has been around on the net for a while will be used to that convention, and it makes your code very hard to read for any length of time. Yes, this is an "important irrelevancy", something that is important to be consistent on, but which choice you make is fundamentally irrelevant. However the connection between capitals and shouting is rather widely accepted.

    Oh, and a final book recommendation. Code Complete by Steve McConnell is a good place to learn a lot about what makes up good (procedural) programming style. It is a classic. Read it.
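    Putting those suggestions together, the original sub might end up looking something like this (a sketch of the restyled code, not tilly's own; the names are mine):

    use strict;
    use warnings;

    my $debug = 0;    # a lexical flag instead of a global

    sub file_length {
        my ($file) = @_;    # take the filename as an argument
        warn "file_length: counting lines in '$file'\n" if $debug;
        open(my $fh, '<', $file) or die "can't open $file: $!";
        my $count = 0;
        $count++ while <$fh>;
        close($fh);
        return $count;
    }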

Re: Looking for a better way to get the number of lines in a file...
by Biker (Priest) on Dec 10, 2001 at 18:26 UTC

    Well, what's a line?

    The first line starts by the beginning of the file and ends at the first LF (or CRLF depending upon your operating system).
    Most of the lines will start just after the preceding line termination and continue to the next line termination (LF or CRLF).
    Finally, one line may continue until end of file. This is not really clean, since for a 'true text file', the last byte(s) in the file should be a line terminator.
    All of the above for a file that contains at least four lines.

    Now, the problem has become to count the number of line terminators in the file and potentially add one to that. ("Until end of file.")

    To count occurrences of something in a (non-indexed) file you will have to read it from the beginning to the end.
    You could read the file byte by byte and check each byte to see whether it's a CR or an LF. If it's a CR, will the next byte be an LF? (Assuming DOS/Windows as the OS.)
    Why not let the file read figure out what a line terminator is for you? Then read one line at a time, count it, and throw it away.
    Using $. is fine if you make some assumptions. $. is reset when the file handle is closed. From perlvar: "Because <> never does an explicit close, line numbers increase across ARGV files." Take care.

    There are smarter ways of doing it. But then you'll have to start making assumptions on how a line ends. And that the last line actually is correctly terminated. (Well, it should. But...)
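    A sketch of that "count the terminators, then maybe add one" idea, assuming Unix LF line endings (the sub name and block size here are mine, for illustration):

    sub count_lines_exact {
        my ($file) = @_;
        open(my $fh, '<', $file) or die "open $file: $!";
        my ($count, $last, $buf) = (0, '', '');
        while (read($fh, $buf, 102400)) {
            $count += ($buf =~ tr/\n//);    # count the LFs in this block
            $last = substr($buf, -1);       # remember the final byte seen
        }
        close($fh);
        $count++ if $last ne '' && $last ne "\n";    # unterminated last line
        return $count;
    }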

    f--k the world!!!!
    /dev/world has reached maximal mount count, check forced.

Re: Looking for a better way to get the number of lines in a file...
by strat (Canon) on Dec 10, 2001 at 19:19 UTC
    Under Unix the following may be rather fast, although it uses the external program wc:
    my $length = `wc -l $filename`;
    chomp($length);

    Best regards,
    perl -e "print a|r,p|d=>b|p=>chr 3**2 .7=>t and t"

      I have noticed that if you use wc on a file, it also returns the name of the file. You need to change the code as follows:
      my $string = `wc -l $filename`;
      my ($file, $length) = split " ", $filename;
      chomp($length);


      ...We will have peace, when you and all your works have perished -- and the works of your dark master to whom you would deliver us. You are a liar, Saruman, and a corrupter of men's hearts. -- Theoden in The Two Towers --
        That looks a bit off to me. First of all, $string contains the information, but you are splitting $filename. Secondly, every wc implementation that I've ever seen outputs the linecount before the filename, but you have them reversed....

        If I were going to rewrite this snippet and was forced to use wc, I might do something like:

        #!/usr/bin/perl -wT
        use strict;
        %ENV = (PATH => '/bin:/usr/bin');    # give us a happy environment
        my $filename = '/etc/services';      # filename to be checked
        my $length = do {
            my $string = `wc -l $filename`;  # get output of wc
            die "wc error $?" if $?;         # die if wc chokes for some reason
            no warnings 'numeric';           # turn off a pesky warning ;-)
            $string + 0;                     # numerify it with +0
        };
        print "length = $length\n";          # "length = 331" on my machine
        Update:
        *sigh* /g still gives me trouble sometimes.... The numerify line above could also have been:
        ($string =~ /\d+/g)[0];
        which would eliminate the need to turn off the warning. Therefore, the above can all be squeezed down into:
        my $length = (`wc -l $filename` =~ /\d+/g)[0]; # get output of wc
        die "wc error $?" if $?;    # die if wc chokes for some reason

        -Blake

Re: Looking for a better way to get the number of lines in a file...
by simul (Novice) on Mar 13, 2011 at 12:06 UTC
    I posted Perl code on my blog that solves this problem using windowed sampling and averaging. If anyone arrives at this thread with files that have millions of lines... it's useful. Reposting the code here for anyone to review:
      Code on the blog has been updated. Always accurate for predictive/repeated files. Does an accurate linear model estimate with gzipped files. Current source: documentroot.com/alc
Re: Looking for a better way to get the number of lines in a file...
by Kage (Scribe) on Dec 10, 2001 at 21:54 UTC
    Simple..
    $file="filename.ext"; open(COUNT,"$file") || err("Oops.. $!"); @lines = <COUNT>; close(COUNT); $count="0"; foreach $line (@lines) { $count++; } print "Content-type: text/html\n\n"; print "$file's length is $count";
    10 lines if you don't count the line breaks
    "I am loved by few, hated by many, and wanted by plenty." -Kage (Alex)
    SkarySkriptz
      This is an adaptation of... um... I think recipe 8.6 in Cookbook, "Picking a random line from a file", though I may have seen it elsewhere. (Corrections? Credit?)
      #!/usr/bin/perl
      open( FH, 'data' ) or die( "data: $!\n" );
      while (<FH>) {
          if ( eof(FH) ) { print $. }
      }
      What I like about this is that you aren't storing anything in memory - so this should wade through a 2GB file (as mentioned fearfully later in this thread) fairly quickly.

      blyman setenv EXINIT 'set noai ts=2'

      Reading in a full file into memory scares me. One day, someone will feed a 2GB file...

      Having said that, @lines in your example already gives the number of elements when evaluated in scalar context, and each element holds one line. There's no need to count them in a loop.
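      For instance (a two-line sketch reusing the variables from the code above):

      my $count = @lines;    # scalar context yields the element count, i.e. lines
      print "$file's length is $count";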

      f--k the world!!!!
      /dev/world has reached maximal mount count, check forced.
