
Estimate line count in text file

by xaprb (Scribe)
on Jul 18, 2008 at 12:43 UTC (#698594)
xaprb has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that may need to estimate time to completion when processing large files. Any thoughts on the best way to quickly estimate the line count of a very large text file? My idea is to get the file size and, if it's under 100 MB, just use wc -l. Otherwise, take 100 aligned 4 KiB samples by seeking to pre-calculated offsets in the file and reading 4096 bytes at each, count the bytes between newlines in each sample, and take the average as the line length. The number of lines is then $filesize / ($avg_line_len + length("\n")).

Update: replaced "seeking through" with "seeking to pre-calculated offsets in"
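
The sampling idea above could be sketched roughly as follows. This is only an illustration, not the code from the original program: the name estimate_line_count and its parameters are made up here, and instead of measuring the bytes between newlines it counts newlines per sampled byte. That gives essentially the same estimate as $filesize / ($avg_line_len + length("\n")), but avoids special handling for lines cut off at the edges of a sample.

```perl
use strict;
use warnings;

# Hypothetical helper: estimate the number of lines in $path by reading
# $samples blocks of $block bytes at evenly spaced, block-aligned offsets.
sub estimate_line_count {
    my ($path, $samples, $block) = @_;
    $samples ||= 100;     # number of samples, as in the post
    $block   ||= 4096;    # 4 KiB per sample, as in the post

    my $size = -s $path;
    return 0 unless $size;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    # Never take more samples than there are blocks in the file.
    my $total_blocks = int(($size + $block - 1) / $block);
    $samples = $total_blocks if $samples > $total_blocks;

    my ($bytes_read, $newlines) = (0, 0);
    for my $i (0 .. $samples - 1) {
        # Pre-calculated, block-aligned offset for sample number $i.
        my $offset = int($i * $total_blocks / $samples) * $block;
        seek $fh, $offset, 0 or die "Cannot seek in $path: $!";
        my $got = read $fh, my $buf, $block;
        die "Cannot read $path: $!" unless defined $got;
        $bytes_read += $got;
        $newlines   += ($buf =~ tr/\n//);   # count newline bytes in the sample
    }
    close $fh;

    return 0 unless $bytes_read;
    # Scale the sampled newline density up to the whole file.
    return int($size * $newlines / $bytes_read + 0.5);
}
```

Calling estimate_line_count($path) uses the defaults of 100 samples of 4096 bytes. For a file small enough that the samples cover every block, the result is simply the exact newline count.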

Replies are listed 'Best First'.
Re: Estimate line count in text file
by GrandFather (Sage) on Jul 18, 2008 at 13:12 UTC

    Why not use -s to find the file size then use tell from time to time to determine how far through you are and make a time remaining estimate from that?
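
    A minimal sketch of that suggestion, assuming the file is read sequentially through an ordinary filehandle. The sub name progress and the reporting interval are invented for illustration; only -s, tell and time are taken as given.

```perl
use strict;
use warnings;

# Hypothetical helper: return (fraction done, estimated seconds remaining)
# for a handle being read sequentially. $size comes from -s and
# $start_time is the value of time() when reading began.
sub progress {
    my ($fh, $size, $start_time) = @_;
    my $pos = tell $fh;
    return (0, undef) if !$size || $pos <= 0;

    my $fraction = $pos / $size;
    $fraction = 1 if $fraction > 1;

    # Assume the rest of the file takes time in proportion to what is left.
    my $elapsed   = time - $start_time;
    my $remaining = $elapsed * (1 - $fraction) / $fraction;
    return ($fraction, $remaining);
}

# Example run: perl progress.pl bigfile.txt
if (my $path = shift @ARGV) {
    my $size = -s $path;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $start = time;
    while (my $line = <$fh>) {
        # ... process $line here ...
        next if $. % 100_000;   # report every 100,000 lines
        my ($fraction, $remaining) = progress($fh, $size, $start);
        printf STDERR "%.1f%% done, about %d s left\n",
            100 * $fraction, $remaining // 0;
    }
    close $fh;
}
```

    No line count is needed at all with this approach, since progress is measured in bytes rather than lines.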

    Perl is environmentally friendly - it saves trees
      That's a great idea!
Re: Estimate line count in text file
by marto (Bishop) on Jul 18, 2008 at 12:51 UTC
      Sure. I saw all of these. (Though I do not see any reply by davorg). They are all exact, not estimated. The key here is "estimated because the file is Very, Very Large." Reading the whole file may be unacceptable.

        My mistake, I was referring to davidrw's reply. I would suggest (if you have not already done so) benchmarking their Tie::File solution with some 'large' files, since people's definition of what constitutes a large file differs :)
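
        Such a benchmark might look like the sketch below. davidrw's reply is not shown in this thread, so this assumes the Tie::File approach amounts to tying the file read-only and taking the length of the tied array; the sub names and the chunked-read baseline are illustrative only.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use Fcntl qw(O_RDONLY);
use Tie::File;

# Exact count via Tie::File: the tied array's length is the record count.
sub count_with_tie_file {
    my ($path) = @_;
    tie my @lines, 'Tie::File', $path, mode => O_RDONLY
        or die "Cannot tie $path: $!";
    my $count = scalar @lines;
    untie @lines;
    return $count;
}

# Baseline: read fixed-size chunks and count newline bytes.
sub count_with_chunks {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (read $fh, my $buf, 65536) {
        $count += ($buf =~ tr/\n//);
    }
    close $fh;
    return $count;
}

# Example run: perl bench.pl bigfile.txt
if (my $path = shift @ARGV) {
    timethese(3, {
        tie_file => sub { count_with_tie_file($path) },
        chunks   => sub { count_with_chunks($path) },
    });
}
```

        Both subs read the whole file, so they give exact counts. The two can differ by one when the last line has no trailing newline, because Tie::File still counts it as a record.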

