http://www.perlmonks.org?node_id=433957


in reply to Displaying/buffering huge text files

If you're sure that indexing the start offset of every line will fit in memory, then go for it. You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.
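For what it's worth, a minimal sketch of that offset-per-line index in Perl might look like the following. The file name and the fetch_line helper are just placeholders for illustration; line numbers are 0-based.

    use strict;
    use warnings;

    my $file = 'huge.txt';      # hypothetical input file
    open my $fh, '<', $file or die "Can't open $file: $!";

    # One byte offset per line: $line_offset[$n] is where line $n starts.
    my @line_offset = (0);
    while (<$fh>) {
        push @line_offset, tell($fh);   # offset where the *next* line starts
    }
    pop @line_offset;                   # final entry just points at EOF

    # Jump straight to any line without rereading the file.
    sub fetch_line {
        my ($n) = @_;
        return undef if $n > $#line_offset;
        seek $fh, $line_offset[$n], 0 or die "seek failed: $!";
        return scalar <$fh>;
    }

    print fetch_line(1000) // "no such line\n";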

For that case, and the general case of even larger files, consider a variant of the index-every-nth line idea: index based on disk block (or more likely, some multiple of the disk block). Say you use a block size of 8KB. Then keep the line number of the first complete line starting within each block. When seeking to a given line number, you do a binary search in your index to find the block whose recorded first line number is the largest one less than or equal to the line you're looking for. Then you read that block in, scanning linearly through the text for the line you want.
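Here's a rough sketch of how that might look in Perl. Assumptions on my part: an 8KB block size, 0-based line numbers, and (as a small convenience beyond what's described above) storing the byte offset of each block's first complete line next to its line number, so the lookup can seek straight to that line instead of rescanning from the block boundary.

    use strict;
    use warnings;

    my $BLOCK = 8 * 1024;          # assumed block size
    my $file  = 'huge.txt';        # hypothetical input file
    open my $fh, '<', $file or die "Can't open $file: $!";

    # One entry per block: [ number of the first complete line starting
    # in this block, byte offset of that line ].
    my @index;
    my ($line, $offset) = (0, 0);
    while (<$fh>) {
        my $block = int( $offset / $BLOCK );
        $index[$block] //= [ $line, $offset ];
        $offset = tell($fh);
        $line++;
    }
    # A block swallowed whole by one enormous line has no line start of
    # its own; let it inherit the previous block's entry so the binary
    # search stays simple.
    for my $b (1 .. $#index) {
        $index[$b] //= $index[$b - 1];
    }

    # Binary search for the last block whose first line number is <= $want,
    # then scan forward from that line.
    sub find_line {
        my ($want) = @_;
        return undef unless @index;
        my ($lo, $hi) = (0, $#index);
        while ($lo < $hi) {
            my $mid = int( ($lo + $hi + 1) / 2 );
            if ($index[$mid][0] <= $want) { $lo = $mid } else { $hi = $mid - 1 }
        }
        my ($n, $pos) = @{ $index[$lo] };
        seek $fh, $pos, 0 or die "seek failed: $!";
        while (defined(my $text = <$fh>)) {
            return $text if $n++ == $want;
        }
        return undef;                   # past end of file
    }

    print find_line(123_456) // "no such line\n";

The memory win is that you keep one small entry per 8KB block rather than one per line, so the index size is bounded by the file size in bytes, not by how many newlines it happens to contain.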

This approach deals with the problematic cases more gracefully -- if you have a huge number of newlines, you'll still only read the block containing the line you want. (Well, you might have to read the following block too, to get the whole line.) Or, if you have enormous single lines, you'll never have the problem of your index giving you a starting position way before the line you want, as might happen if you were indexing every 25th line.

Generally speaking, your worst-case performance is defined in terms of the time to process some number of bytes, not lines, so you'll be better off if your index thinks in terms of bytes.

All that being said, I would guess this is overkill for your particular application, and you'd be better off with the offset-per-line index. It's simpler and good enough. And if that index gets too big, you can always store it in a GDBM file.
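Something along these lines with GDBM_File, say -- the file names are just examples, and the index persists across runs, so you only have to build it once:

    use strict;
    use warnings;
    use GDBM_File;

    # Build the per-line offset index on disk instead of in memory.
    tie my %offset, 'GDBM_File', 'huge.txt.idx', &GDBM_WRCREAT, 0640
        or die "Can't tie index: $!";

    open my $fh, '<', 'huge.txt' or die "Can't open huge.txt: $!";
    my $line = 0;
    $offset{0} = 0;
    while (<$fh>) {
        $offset{ ++$line } = tell($fh);
    }
    delete $offset{$line};          # final entry just points at EOF

    # Later, even in a separate run, jump straight to (0-based) line 5000:
    my $pos = $offset{5000} // die "no such line\n";
    seek $fh, $pos, 0 or die "seek failed: $!";
    print scalar <$fh>;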