
Re: Displaying/buffering huge text files

by sfink (Deacon)
on Feb 24, 2005 at 05:40 UTC (#433957)

in reply to Displaying/buffering huge text files

If you're sure that indexing the start offset of every line will fit in memory, then go for it. You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.
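A minimal, self-contained sketch of that offset-per-line index (an in-memory string stands in for the real file here; with a disk file the logic is identical, and the names `$data` and `line_at` are just illustrative):

```perl
use strict;
use warnings;

# Build an index of the byte offset where every line starts,
# then seek straight to any line without rereading the file.
my $data = "alpha\nbeta\ngamma\ndelta\n";
open my $fh, '<', \$data or die "open: $!";

my @offset;                 # $offset[$n] = byte offset where line $n starts
my $pos = 0;
while (my $line = <$fh>) {
    push @offset, $pos;
    $pos += length $line;
}

sub line_at {
    my ($fh, $offset, $n) = @_;
    seek $fh, $offset->[$n], 0 or die "seek: $!";
    return scalar readline $fh;
}

print line_at($fh, \@offset, 2);   # prints "gamma\n"
```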

For that case, and the general case of even larger files, consider a variant of the index-every-nth line idea: index based on disk block (or more likely, some multiple of the disk block). Say you use a block size of 8KB. Then keep the line number of the first complete line starting within each block. When seeking to a given line number, you do a binary search in your index to find the block number that contains the largest line number less than or equal to the line you're looking for. Then you read the block in, scanning linearly through the text for the line you want.
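The block-index scheme above might be sketched like this (again self-contained: an in-memory "file" and a toy 8-byte block size keep the example small, and `fetch_line` is a hypothetical name; each index entry records the number and offset of the first line starting in that block):

```perl
use strict;
use warnings;

my $data  = "one\ntwo\nthree\nfour\nfive\nsix\n";
my $BLOCK = 8;                       # toy block size; use 8 KB in practice

open my $fh, '<', \$data or die "open: $!";

# Build: $idx[$b] = [line number, byte offset] of the first line
# starting within block $b.
my @idx;
my ($line_no, $pos) = (0, 0);
while (my $line = <$fh>) {
    my $block = int($pos / $BLOCK);
    $idx[$block] //= [$line_no, $pos];
    $line_no++;
    $pos += length $line;
}
# Blocks spanned entirely by one long line get the previous block's entry.
$idx[$_] //= $idx[$_ - 1] for 1 .. $#idx;

# Seek: binary-search for the last block whose first line number is
# <= the target, seek there, then scan forward line by line.
sub fetch_line {
    my ($fh, $idx, $want) = @_;
    my ($lo, $hi) = (0, $#$idx);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($idx->[$mid][0] <= $want) { $lo = $mid } else { $hi = $mid - 1 }
    }
    my ($first, $off) = @{ $idx->[$lo] };
    seek $fh, $off, 0 or die "seek: $!";
    readline $fh for $first .. $want - 1;   # discard intervening lines
    return scalar readline $fh;
}

print fetch_line($fh, \@idx, 3);   # prints "four\n"
```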

This approach deals with the problematic cases more gracefully -- if you have a huge number of newlines, you'll still only read the block containing the line you want. (Well, you might have to read the following block too, to get the whole line.) Or, if you have enormous single lines, you'll never have the problem of your index giving you a starting position way before the line you want, as might happen if you were indexing every 25th line.

Generally speaking, your worst-case performance is defined in terms of the time to process some number of bytes, not lines, so you'll be better off if your index thinks in terms of bytes.

All that said, I suspect this is overkill for your particular application, and you'd be better off with an offset-per-line index. It's simpler and good enough. And if the index gets too big, you can always store it in a gdbm file.
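One way to move that index out of memory and into a DBM file might look like the following. The node suggests gdbm; this sketch uses SDBM_File only because it ships with core Perl -- swap in GDBM_File if you have it installed:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Tie the line-number -> byte-offset index to an on-disk DBM file so it
# doesn't have to fit in memory.
my $dir = tempdir(CLEANUP => 1);
tie my %offset, 'SDBM_File', "$dir/lineidx", O_RDWR | O_CREAT, 0644
    or die "tie: $!";

my $data = "alpha\nbeta\ngamma\n";       # stands in for the real file
open my $fh, '<', \$data or die "open: $!";

my ($n, $pos) = (0, 0);
while (my $line = <$fh>) {
    $offset{$n++} = $pos;                # key: line number, value: offset
    $pos += length $line;
}

seek $fh, $offset{2}, 0 or die "seek: $!";
print scalar <$fh>;                      # prints "gamma\n"
```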


Re^2: Displaying/buffering huge text files
by scooper (Novice) on Feb 24, 2005 at 23:43 UTC
    sfink: You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.

    A file containing *only* newlines is not a nasty case and requires no indexing at all. If you can determine before reading the file that it contains only newlines, change your "figure out the seek offset" subroutine so that to get to line 120000 it seeks to byte 120000. It doesn't get any easier!
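    scooper's point as code: when every line is exactly one byte (a bare newline), the line number *is* the byte offset (0-based here), so the seek needs no index at all:

```perl
use strict;
use warnings;

my $data = "\n" x 10;                   # a "file" of nothing but newlines
open my $fh, '<', \$data or die "open: $!";

my $want = 7;                           # 0-based line number
seek $fh, $want, 0 or die "seek: $!";   # offset == line number
my $line = <$fh>;
print length($line), "\n";              # prints 1: just the newline
```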
