
Estimate line count in text file

by xaprb (Scribe)
on Jul 18, 2008 at 12:43 UTC (#698594=perlquestion)
xaprb has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that might want to estimate its completion time on large files. Any thoughts on the best way to quickly estimate the line count of a very large text file? My idea was to get the file size, and if it's less than 100MB, just use wc -l. Otherwise, take 100 aligned 4 KiB samples by seeking to pre-calculated offsets in the file and reading 4096 bytes at each one, then count the bytes between successive newlines and average them to get the line length. The number of lines is then $filesize / ($avg_line_len + length("\n")).
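
The sampling idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: estimate_line_count and its samples/block options are hypothetical names, and the sketch counts newlines per sampled byte instead of measuring individual line lengths. That gives essentially the same ratio as $filesize / ($avg_line_len + length("\n")) without having to deal with lines cut off at block edges. For small files it counts newlines in Perl rather than shelling out to wc -l, and the cutoff used here is "no more blocks than samples", not the 100MB mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: estimate the number of lines in $path by sampling.
# Options (both assumptions, taken from the numbers in the post):
#   samples => how many blocks to read (default 100)
#   block   => block size in bytes     (default 4096)
sub estimate_line_count {
    my ($path, %opt) = @_;
    my $samples = $opt{samples} || 100;
    my $block   = $opt{block}   || 4096;

    my $size = -s $path;
    die "Cannot stat $path: $!" unless defined $size;
    return 0 if $size == 0;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my $blocks_in_file = int($size / $block);

    # Small file: sampling would cover it anyway, so count exactly.
    if ($blocks_in_file <= $samples) {
        my $count = 0;
        while (read($fh, my $buf, 65536)) {
            $count += ($buf =~ tr/\n//);
        }
        close $fh;
        return $count;
    }

    # Large file: read evenly spaced, block-aligned samples.
    my ($bytes, $newlines) = (0, 0);
    for my $i (0 .. $samples - 1) {
        my $offset = int($i * $blocks_in_file / $samples) * $block;
        seek($fh, $offset, 0) or die "Cannot seek in $path: $!";
        my $got = read($fh, my $buf, $block);
        die "Cannot read $path: $!" unless defined $got;
        $bytes    += $got;
        $newlines += ($buf =~ tr/\n//);
    }
    close $fh;

    return 0 unless $bytes;

    # newlines per sampled byte, scaled up to the whole file
    return int($size * $newlines / $bytes + 0.5);
}

print estimate_line_count($ARGV[0]), "\n" if @ARGV;
```

The estimate is only as good as the assumption that line lengths in the sampled blocks are representative of the whole file. A file whose line lengths change sharply from one region to another may need more samples.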

Update: replaced "seeking through" with "seeking to pre-calculated offsets in"

Replies are listed 'Best First'.
Re: Estimate line count in text file
by GrandFather (Sage) on Jul 18, 2008 at 13:12 UTC

    Why not use -s to find the file size, then call tell from time to time to see how far through the file you are, and make a time-remaining estimate from that?
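
A minimal sketch of that approach, assuming the file is read line by line. The names seconds_remaining and process_with_progress, and the reporting interval of 100,000 lines, are hypothetical choices for illustration and not something from this thread. The estimate assumes the processing rate in bytes per second stays roughly constant.

```perl
use strict;
use warnings;

# Hypothetical helper: given bytes processed so far, total bytes, and
# elapsed seconds, return the estimated seconds remaining.
# Returns undef when no estimate is possible yet.
sub seconds_remaining {
    my ($done, $total, $elapsed) = @_;
    return if $done <= 0 || $total <= 0;
    return $elapsed * ($total - $done) / $done;
}

# Hypothetical driver: read $path line by line and report progress to
# STDERR every $every lines. Returns the number of lines read.
sub process_with_progress {
    my ($path, $every) = @_;
    $every ||= 100_000;

    my $total = -s $path;
    die "Cannot stat $path: $!" unless defined $total;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $start = time;
    my $lines = 0;

    while (my $line = <$fh>) {
        $lines++;
        # ... the real per-line work would go here ...

        next if $lines % $every;

        my $done = tell($fh);    # byte offset reached so far
        my $left = seconds_remaining($done, $total, time - $start);
        next unless defined $left;
        printf STDERR "%5.1f%% done, about %d s remaining\n",
            100 * $done / $total, $left;
    }
    close $fh;
    return $lines;
}
```

Calling process_with_progress('big.log') would then print a progress line every 100,000 lines without ever needing to know the line count, which sidesteps the estimation problem entirely.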

    Perl is environmentally friendly - it saves trees
      That's a great idea!
Re: Estimate line count in text file
by marto (Bishop) on Jul 18, 2008 at 12:51 UTC
      Sure. I saw all of these. (Though I do not see any reply by davorg). They are all exact, not estimated. The key here is "estimated because the file is Very, Very Large." Reading the whole file may be unacceptable.

        My mistake, I was referring to davidrw's reply. I would suggest (if you have not already done so) benchmarking their Tie::File solution with some 'large' files, since people's definitions of what constitutes a large file differ :)
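
One way such a benchmark might look is sketched below. davidrw's actual code is not shown in this thread, so count_with_tie_file is only an assumption about what a Tie::File-based count could look like, and count_by_reading is a plain read loop added for comparison. Both sub names are hypothetical. Tie::File, Fcntl and Benchmark are core modules.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Fcntl qw(O_RDONLY);
use Tie::File;

# Assumed shape of a Tie::File-based exact count (not davidrw's code).
sub count_with_tie_file {
    my ($path) = @_;
    tie(my @lines, 'Tie::File', $path, mode => O_RDONLY)
        or die "Cannot tie $path: $!";
    my $count = scalar @lines;    # makes Tie::File scan the whole file
    untie @lines;
    return $count;
}

# Baseline for comparison: read every line and count.
sub count_by_reading {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $count = 0;
    $count++ while <$fh>;
    close $fh;
    return $count;
}

if (@ARGV) {
    my $file = shift @ARGV;
    # 3 iterations each; keep the count low for very large files.
    cmpthese(3, {
        tie_file => sub { count_with_tie_file($file) },
        reading  => sub { count_by_reading($file) },
    });
}
```

Running the script with a file name as its only argument prints a comparison table. Trying it on files of several sizes should show where exact counting stops being acceptable for a given definition of "large".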

