http://www.perlmonks.org?node_id=377089


in reply to Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)
in thread Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

Literally, you cannot INSERT (in other words, grow a file by adding something to the middle). You can only append to files, or overwrite what's in the middle. Disk operating systems don't grow files from the middle. So the commonly used solution is to read the file one line at a time, writing out to a new file one line at a time... when you get to the part where you want to insert, write out the new data, and then continue writing the remainder of the old data to the new file. When finished, replace the old file with the new one. This process is slow for big files with lots of 'inserts'. This is where databases make sense.

Dave

  • Comment on Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

Replies are listed 'Best First'.
Re^4: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)
by rjahrman (Scribe) on Jul 24, 2004 at 06:01 UTC
    As I think about it, the best option might be to insert everything in a database (with columns fileID and intValue), then--after it's all been added to the database--loop through each fileID and add its values to the mega-file (but delete rows as they're inserted to save space).

    FFR, the speed of this means nothing . . . it's all about disk space conservation and the speed at which it can be read back in.

      If speed meant NOTHING you'd do it by hand.

      The point is that inserting into a flat file means rewriting the entire file each time, unless you store updates and do them in groups. Imagine, if your 2mb flat file grows by 4 bytes each iteration, you're moving around 2mb of data each time you try to add 4 bytes. That's a 2,000,000 bytes of data reading and rewriting for each 4 byte insertion. Speed has got to mean something.

      I think your solution may be a good one. If building up the dataset is a one-time deal, do it in a database, and then transfer the completed product to a flat file where it can be read quickly.


      Dave

        Sorry if I came off as a bit ignorant. :)

        I agree that moving that data around like that is not an option. I just meant that if building this file take twice as long, but yields a file 10% smaller, it's worth it in this particular case. The file will only be built occasionally (like every other month).