http://www.perlmonks.org?node_id=377194


in reply to Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)
in thread Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

I may have missed something here and therefore the following approach might be oversimplified.

I'd write the data to a flat file in the first pass, with one key-value pair per line. The key would represent the "filename" and the value one of the OP's "4-byte" values. Make sure to start a new file before the current one reaches the maximum file size the OS or the FS allows.

If the "filename" is too long, I would create a separate file mapping each "filename" to a shorter key. Obviously each key will occur as many times as there are values for it, each time on a separate line. The order of the values (should it matter) is preserved by the order of the lines.
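A minimal sketch of such a first pass, in Python purely for illustration. The class name `FirstPassWriter`, the `pass1.NNNN.txt` and `keymap.txt` file names, the tab separator and the `max_bytes` parameter are all my assumptions, not anything given by the OP:

```python
import os

class FirstPassWriter:
    """Append "short_key<TAB>value" lines, starting a new file before a size limit."""

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes   # assumed limit, chosen below the OS/FS maximum
        self.key_ids = {}            # long "filename" -> short integer key
        self.part = 0                # number of the file currently being written
        self.written = 0             # bytes written to the current file
        self.out = self._open_part()

    def _open_part(self):
        path = os.path.join(self.directory, f"pass1.{self.part:04d}.txt")
        return open(path, "w", encoding="ascii", newline="\n")

    def append(self, filename, value):
        # Hand out short keys in order of first appearance: 0, 1, 2, ...
        if filename not in self.key_ids:
            self.key_ids[filename] = len(self.key_ids)
        line = f"{self.key_ids[filename]}\t{value}\n"
        # Roll over to a new file before this line would exceed the limit.
        if self.written and self.written + len(line) > self.max_bytes:
            self.out.close()
            self.part += 1
            self.written = 0
            self.out = self._open_part()
        self.out.write(line)
        self.written += len(line)

    def close(self):
        self.out.close()
        # The separate mapping file: short key -> original "filename".
        keymap = os.path.join(self.directory, "keymap.txt")
        with open(keymap, "w", encoding="utf-8", newline="\n") as f:
            for filename, key in self.key_ids.items():
                f.write(f"{key}\t{filename}\n")
```

Values arrive in any order and are only ever added at the end of the current file, so nothing is reserved in advance for keys that turn out to have few values.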

In the second pass, once all the values have been written to the file(set), scan it once for each key and write all of the values for that key into a single arbitrary-length record of a new target file(set). In the third pass, create the index on the target file(set).
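The second and third passes could be sketched like this, again in Python for illustration only. I am assuming first-pass lines of the form integer key, tab, value; that the values fit unsigned 4-byte integers; and that keys are the integers 0..n-1 processed in order. The record header layout, the index row layout and all function names are hypothetical:

```python
import struct

HEADER = struct.Struct("<II")     # assumed record header: (key, number of values)
INDEX_ROW = struct.Struct("<IQ")  # assumed index row: (key, byte offset of record)

def values_for_key(paths, key):
    """Scan the whole first-pass file set once, yielding this key's values in order."""
    wanted = str(key)
    for path in paths:
        with open(path, encoding="ascii") as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t")
                if k == wanted:
                    yield int(v)

def build_target(paths, keys, target_path):
    """Second pass: one arbitrary-length record per key, one full scan per key."""
    with open(target_path, "wb") as out:
        for key in keys:
            values = list(values_for_key(paths, key))
            out.write(HEADER.pack(key, len(values)))
            out.write(struct.pack(f"<{len(values)}I", *values))

def build_index(target_path, index_path):
    """Third pass: walk the record headers and note where each record starts."""
    with open(target_path, "rb") as tgt, open(index_path, "wb") as idx:
        while True:
            offset = tgt.tell()
            header = tgt.read(HEADER.size)
            if not header:
                break
            key, count = HEADER.unpack(header)
            idx.write(INDEX_ROW.pack(key, offset))
            tgt.seek(4 * count, 1)  # skip the values, jump to the next header

def read_record(target_path, index_path, key):
    """Look up one key through the index (relies on keys being 0..n-1 in order)."""
    with open(index_path, "rb") as idx:
        idx.seek(key * INDEX_ROW.size)
        _, offset = INDEX_ROW.unpack(idx.read(INDEX_ROW.size))
    with open(target_path, "rb") as tgt:
        tgt.seek(offset)
        _, count = HEADER.unpack(tgt.read(HEADER.size))
        return list(struct.unpack(f"<{count}I", tgt.read(4 * count)))
```

Only one key's values are held in memory at a time, at the price of rereading the entire first-pass file set for every key.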

In this way the first pass file(set) will accept values for keys in any order, appending them to the end of the file, and will not waste space on large records that won't be needed most of the time. The second pass will take a whole lot of time, since it scans the whole file(set) once for each key, but as I understand it time is not the issue here.

Generally, if space is a major consideration, a DBMS is the last thing I would look at. There's just too much overhead in making it work with all kinds of data structures.

Update: Corrected spelling mistake. Added Comment on DB.
