Re: Read Some lines in Tera byte file

by sundialsvc4 (Abbot)
on Oct 13, 2010 at 11:40 UTC (#865061)

in reply to Read Some lines in Tera byte file

An approximate index (e.g. the position of every thousandth line of data) is probably a very reasonable approach to use here. (SQLite is amazingly useful for such things.) You really only have to get the computer “into the general neighborhood,” because when it does the disk seek it’s going to bring in several sectors’ worth of data anyway.
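Here is a minimal sketch of that approximate index, in Python rather than Perl and purely for illustration. The function names `build_sparse_index` and `read_line` are made up for this example, line numbers are assumed to be 0-based, and the offsets are kept in a plain list (you could just as well store them in an SQLite table). Building the index still costs one full pass over the file, but after that each lookup is one seek plus a short scan of at most `every - 1` lines.

```python
def build_sparse_index(path, every=1000):
    """Record the byte offset of every `every`-th line (0-based line numbers)."""
    offsets = []
    pos = 0
    with open(path, "rb") as fh:          # binary mode so offsets are true byte positions
        for lineno, line in enumerate(fh):
            if lineno % every == 0:
                offsets.append(pos)
            pos += len(line)
    return offsets


def read_line(path, offsets, lineno, every=1000):
    """Seek to the nearest indexed line at or before `lineno`, then scan forward.

    Returns the line as bytes, or None if the file has no such line.
    `every` must match the value used to build `offsets`.
    """
    slot = lineno // every
    if slot >= len(offsets):
        return None
    with open(path, "rb") as fh:
        fh.seek(offsets[slot])            # land "in the general neighborhood"
        for _ in range(lineno - slot * every):
            if not fh.readline():         # hit end of file before reaching the line
                return None
        line = fh.readline()
        return line if line else None
```

Something like `read_line(path, build_sparse_index(path), 123456)` would then seek to the stored offset of line 123000 and read forward from there.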

Another very useful technique, if you can manage it, is to first sort your update (or search) keys into the same order as the file itself. Now you can move through the data one time, front to back, without ever seeking backwards. Whatever updates or changes you need to make to any particular region of the file, you will be able to do “all at once, and then move on.”
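A rough sketch of the sorted-keys idea, again in Python and only as an illustration. It assumes the "keys" are 0-based, non-negative line numbers (the original question may well use something else), and `fetch_lines` is a hypothetical name. The requests are sorted once, and then the file is read front to back exactly one time, stopping as soon as the last requested line has been seen.

```python
def fetch_lines(path, wanted):
    """Return {line_number: line_bytes} for the requested 0-based line numbers,
    using a single forward pass over the file."""
    targets = sorted(set(wanted))         # same order as the file, duplicates dropped
    found = {}
    if not targets:
        return found
    t = 0                                 # index of the next target we are waiting for
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh):
            if lineno == targets[t]:
                found[lineno] = line
                t += 1
                if t == len(targets):
                    break                 # nothing left to look for; stop reading
    return found
```

Requested lines that lie past the end of the file are simply missing from the result. The same merge-style walk works if the file is sorted by some key field instead of by line number: compare the current record's key against the next wanted key and advance whichever one is behind.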

These strategies were, of course, absolutely necessary when the only “mass” storage devices we possessed were digital reel-to-reel tapes that stored a few hundred bytes per inch, but they are still surprisingly apropos to this day. Although we have high-density disks that rotate at thousands of RPM, many of our “ruling constraints” when dealing with large data sets are still physical ones: “seek time” and “rotational latency.”

Or, in this case ... network time and bandwidth! Is it possible, for instance, to do this work on the server computer directly? When dealing with a huge network-based file, you really, really want to do that, because otherwise every one of those trillions of bytes is going to be transmitted down the pipe between the two computers. Z-z-z-z-z-z-z....
