
Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

by davido (Cardinal)
on Jul 24, 2004 at 05:13 UTC (#377083)

in reply to Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

I would use a database like SQLite instead of a multi-megabyte flat file punished with random access.

You cannot insert data into the middle of a flat file. You can allocate a gigantic file and pre-subdivide it into fixed-length records of sufficient size that you'll never fill one up completely, but that's tricky and not scalable. You could use Tie::File to treat the file as an array, but doing massive amounts of mid-array inserts is very slow with a tied array, because again, it's really just working on a flat file behind the scenes.
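To make the fixed-length-record idea concrete, here is a minimal sketch in Python. The slot size, slot count, file name, and the helpers `write_record` and `read_record` are all assumptions made for illustration; nothing here comes from the original question.

```python
import os
import tempfile

RECORD_SIZE = 64   # assumed slot size in bytes; every record must fit in it
SLOT_COUNT = 100   # assumed number of pre-allocated slots

def write_record(f, index, payload):
    """Overwrite slot `index` in place, padding the payload with NUL bytes."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload does not fit in a fixed-length slot")
    f.seek(index * RECORD_SIZE)
    f.write(payload.ljust(RECORD_SIZE, b"\0"))

def read_record(f, index):
    """Read slot `index` and strip the NUL padding.

    Note: this also strips trailing NUL bytes that belong to the payload,
    so the sketch only suits data that never ends in NUL.
    """
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\0")

path = os.path.join(tempfile.mkdtemp(), "slots.dat")

# Pre-allocate the whole file up front with empty (all-NUL) slots.
with open(path, "wb") as f:
    f.write(b"\0" * (RECORD_SIZE * SLOT_COUNT))

# Random-access updates never change the file size; they only overwrite.
with open(path, "r+b") as f:
    write_record(f, 42, b"hello")
    write_record(f, 7, b"world")
    slot42 = read_record(f, 42)
    slot7 = read_record(f, 7)
    empty = read_record(f, 0)

file_size = os.path.getsize(path)
```

The file stays at `RECORD_SIZE * SLOT_COUNT` bytes no matter how little is stored, which is the space cost that makes this approach hard to scale.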

This really is a problem best dealt with via a database. I'm not positive SQLite is the best one for the job, but it is pretty easy to install, self-contained, and stores all of its data in one file.
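As a rough illustration of what the database buys you, here is a small sketch using Python's standard `sqlite3` module. The table name `records` and its columns are made up for the example. A row whose key falls "in the middle" is added without any file being rewritten by hand.

```python
import sqlite3

# ":memory:" keeps the demo self-contained; pass a filename instead to get
# the single on-disk database file mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")

# Load two rows first...
conn.executemany(
    "INSERT INTO records (id, body) VALUES (?, ?)",
    [(10, "first"), (30, "third")],
)

# ...then add one that sorts between them. The database decides where the
# bytes physically go; the caller never shuffles data around.
conn.execute("INSERT INTO records (id, body) VALUES (?, ?)", (20, "second"))
conn.commit()

rows = conn.execute("SELECT id, body FROM records ORDER BY id").fetchall()
```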



Replies are listed 'Best First'.
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)
by rjahrman (Scribe) on Jul 24, 2004 at 05:42 UTC
    "You cannot insert data into the middle of a flat file."

    Is this an actual limitation, or are you saying that this is a bad idea?

    "I would use a database like SQLite"

    My concern is how the database would do this. Wouldn't it be doing the exact same thing?

    Also, since the only way to append to a BLOB that I've seen is to do an "update . . . set this_blob = concat(this_blob,new_int)", wouldn't that be even less efficient?

      Literally, you cannot INSERT (in other words, grow a file by adding something to the middle). You can only append to files, or overwrite what's in the middle. Disk operating systems don't grow files from the middle. So the commonly used solution is to read the file one line at a time, writing out to a new file one line at a time... when you get to the part where you want to insert, write out the new data, and then continue writing the remainder of the old data to the new file. When finished, replace the old file with the new one. This process is slow for big files with lots of 'inserts'. This is where databases make sense.
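The copy-and-replace routine described above could be sketched like this in Python. The function name `insert_line` and the demo file are hypothetical, and the sketch assumes a line-oriented text file whose last line ends in a newline.

```python
import os
import tempfile

def insert_line(path, line_no, new_line):
    """Insert new_line before the 0-based line_no by rewriting the file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write the temp file next to the original so the final rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w", newline="") as out, \
                open(path, newline="") as src:
            inserted = False
            for i, line in enumerate(src):
                if i == line_no:
                    out.write(new_line + "\n")   # the "insert"
                    inserted = True
                out.write(line)                  # copy the old data through
            if not inserted:
                out.write(new_line + "\n")       # line_no past the end: append
        os.replace(tmp_path, path)               # swap the new file in
    except BaseException:
        # Clean up the temp file on failure, then re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Demo on a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w", newline="") as f:
    f.write("a\nb\nd\n")

insert_line(path, 2, "c")

with open(path, newline="") as f:
    result = f.read()
```

Every call copies the whole file, so N inserts into a large file cost N full rewrites. That is the slowness referred to above.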


        As I think about it, the best option might be to insert everything in a database (with columns fileID and intValue), then--after it's all been added to the database--loop through each fileID and add its values to the mega-file (but delete rows as they're inserted to save space).

        For future reference, the speed of building this means nothing . . . it's all about disk space conservation and the speed at which it can be read back in.
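A hedged sketch of that staging plan, in Python with SQLite standing in for "a database". The column names `fileID` and `intValue` come from the post above; everything else is an assumption made for illustration: the table name `staging`, the 32-bit little-endian integer encoding, the in-memory offset index, and the helper `read_values`.

```python
import os
import sqlite3
import struct
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (fileID INTEGER, intValue INTEGER)")

# Values arrive interleaved, in whatever order they are produced.
conn.executemany(
    "INSERT INTO staging VALUES (?, ?)",
    [(2, 200), (1, 10), (2, 201), (1, 11), (1, 12)],
)
conn.commit()

mega_path = os.path.join(tempfile.mkdtemp(), "mega.bin")
index = {}  # fileID -> (byte offset, number of values); assumed side table

with open(mega_path, "wb") as mega:
    file_ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT fileID FROM staging ORDER BY fileID")]
    for file_id in file_ids:
        # rowid order approximates insertion order for this simple table.
        values = [row[0] for row in conn.execute(
            "SELECT intValue FROM staging WHERE fileID = ? ORDER BY rowid",
            (file_id,))]
        index[file_id] = (mega.tell(), len(values))
        # Pack as contiguous 4-byte signed little-endian integers.
        mega.write(struct.pack(f"<{len(values)}i", *values))
        # Drop the rows once they are safely in the mega-file.
        conn.execute("DELETE FROM staging WHERE fileID = ?", (file_id,))
        conn.commit()

def read_values(file_id):
    """Read one fileID's values back with a single seek and read."""
    offset, count = index[file_id]
    with open(mega_path, "rb") as f:
        f.seek(offset)
        return list(struct.unpack(f"<{count}i", f.read(4 * count)))

values_1 = read_values(1)
values_2 = read_values(2)
remaining = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
mega_size = os.path.getsize(mega_path)
```

One caveat on the space-saving goal: with an on-disk SQLite database, deleting rows frees pages for reuse but does not shrink the file until a `VACUUM` is run.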

      If you want to know how a database could tackle a problem like this of mapping IDs to arbitrary information, read this article on BTrees. Then do as perrin said and use BerkeleyDB. That solves this problem in a highly optimized way, in C.

      If the dataset is large enough that it won't fit in RAM, then you probably want to ask it to build you a BTree rather than a hash. A hash is better if the data all fits in RAM.
