in reply to Re^2: Help performing "random" access on very very large file
in thread Help performing "random" access on very very large file
One very effective optimisation for reducing the size of the index, is to store just lengths rather than full absolute offsets for most of the lines.
Say for example, that you know that all the lines in your 500GB file are less than 255 characters long, and for calculation purposes, average 100 characters. Then if you build a full, absolute index using the pack 'd' template (required for anything over 4GB), then your index itself is going to be 500GB / 100 * 8 ~= 40 GB!
However, if you store the absolute index of every (say) 1000th line, and 999 x 1-byte lengths for the intervening lines, the index size becomes: 500 GB / (100*1000) * 1007 ~= 5GB.
Each record within your index is then a fixed 1007 bytes, and the process of locating any individual line becomes:
By using the 'checksum' unpack template (%n-bits) to perform the summing of the intermediate lengths, the calculation of the composite offset is really very fast. This puts any line of your huge file just two reads away.
Note: There is nothing special about the number 1000 I've used in this example except that it is convenient for calculations. As we are calculating a 32-bit offset (%32C), and representing length with 8-bit unsigned chars, the theoretical maximum, absolute offset spacing is 2**32 / 2**8 == 2**24 == 16 MB.
A quick test suggests that it takes Perl ~ 60 milliseconds to calculate the sum of 16 MB chars. That's far less time than having to read even 1 extra data record.
However, there is little extra saving to be had by further decreasing the frequency of the absolute offsets.
The checksum time for 999 bytes: 112 microseconds.
The checksum time for 64k bytes: 320 microseconds.
The checksum time for 16MB bytes: 60 milliseconds.
So, you see there is little extra index space saving by decreasing the frequency of absolute offsets, but there is a considerable calculation penalty in doing so. You also have to factor in the extra time taken to read the larger index records from the index file.