in reply to Read Some lines in Tera byte file
An approximate index (e.g., the byte position of every thousandth line of data) is probably a very reasonable approach to use here. (SQLite is amazingly useful for such things.) You really only have to get the computer "into the general neighborhood," because when it does the disk-seek it's going to bring in several sectors' worth of data anyway.
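As a sketch of that idea (the table name, stride, and file layout here are just illustrative assumptions): record the byte offset of every Nth line in a small SQLite table, then to fetch an arbitrary line, seek to the nearest indexed offset at or before it and scan forward.

```python
import sqlite3

def build_sparse_index(path, db_path, stride=1000):
    """Record the byte offset of every `stride`-th line in a SQLite table.
    One sequential pass over the big file; the index stays tiny."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS line_index "
        "(lineno INTEGER PRIMARY KEY, offset INTEGER)"
    )
    with open(path, "rb") as f:
        offset = 0
        for lineno, line in enumerate(f):
            if lineno % stride == 0:
                con.execute(
                    "INSERT OR REPLACE INTO line_index VALUES (?, ?)",
                    (lineno, offset),
                )
            offset += len(line)
    con.commit()
    return con

def read_line(path, con, target, stride=1000):
    """Seek to the indexed line at or before `target`, then scan forward.
    At most `stride - 1` lines are read past the seek point."""
    base = target - (target % stride)
    (offset,) = con.execute(
        "SELECT offset FROM line_index WHERE lineno = ?", (base,)
    ).fetchone()
    with open(path, "rb") as f:
        f.seek(offset)
        for _ in range(target - base):
            f.readline()  # skip lines between the index point and the target
        return f.readline()
```

With a stride of 1000, the index for a terabyte-scale file is a thousandth the size of the data, and any line is reachable with one seek plus a short forward scan.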
Another very useful technique, if you can manage it, is to first sort your update (or search) keys into the same order as the file itself. Now you can move through the data just once, sequentially. Whatever updates or changes you need to make to any particular region of the file, you will be able to do "all at once, and then move on."
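A minimal sketch of that merge-style pass, assuming a file whose lines are sorted by key in a hypothetical `key<TAB>value` format: sort the lookup keys first, then walk the file and the key list forward together, so the file is read once from front to back instead of seeking around for each key.

```python
def batch_lookup_sorted(path, keys):
    """Look up many keys in ONE sequential pass over a file whose lines
    are sorted by key (format: 'key<TAB>value'). Returns {key: value}
    for the keys that were found."""
    results = {}
    pending = sorted(keys)  # put the keys in the same order as the file
    i = 0
    with open(path) as f:
        for line in f:
            if i >= len(pending):
                break  # every requested key has been passed; stop early
            key, _, value = line.rstrip("\n").partition("\t")
            # skip requested keys that sort before this line (not in file)
            while i < len(pending) and pending[i] < key:
                i += 1
            # collect matches for the current line's key
            while i < len(pending) and pending[i] == key:
                results[pending[i]] = value
                i += 1
    return results
```

The same shape works for updates: buffer the changes for one region, apply them together, and move on; the disk head (or the tape, back in the day) only ever travels forward.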
These strategies were, of course, absolutely necessary when the only "mass" storage devices we possessed were digital reel-to-reel tapes that stored a few hundred bytes per inch, but they are still surprisingly apropos to this day. Although we now have high-density disks that rotate at thousands of RPMs, many of our "ruling constraints" when dealing with large data sets are still physical ones: "seek time" and "rotational latency."
Or, in this case ... network time and bandwidth! Is it possible, for instance, to do this work on the server computer directly? When dealing with a huge network-based file, you really, really want to do that ... because otherwise, every one of those trillions of bytes is going to be transmitted down the pipe between the two computers. Z-z-z-z-z-z-z....