Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

by rjahrman (Scribe)
on Jul 24, 2004 at 05:42 UTC ( #377087 )


in reply to Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)
in thread Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

"You cannot insert data into the middle of a flat file."

Is this an actual limitation, or are you saying that this is a bad idea?

"I would use a database like SQLite"

My concern is how the database would do this. Wouldn't it be doing the exact same thing?

Also, since the only way to append to a BLOB that I've seen is to do an "update . . . set this_blob = concat(this_blob,new_int)", wouldn't that be even less efficient?

Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)
by davido (Cardinal) on Jul 24, 2004 at 05:48 UTC
    Literally, you cannot INSERT (in other words, grow a file by adding something to the middle). You can only append to files, or overwrite what's in the middle. Disk operating systems don't grow files from the middle. So the commonly used solution is to read the file one line at a time, writing out to a new file one line at a time... when you get to the part where you want to insert, write out the new data, and then continue writing the remainder of the old data to the new file. When finished, replace the old file with the new one. This process is slow for big files with lots of 'inserts'. This is where databases make sense.
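    A minimal sketch of that copy-and-insert approach (the file names, insertion point, and new record here are made-up placeholders, not anything from the thread):

        use strict;
        use warnings;

        # Copy the old file to a new one, slipping the new record in at the
        # right spot, then swap the new file into place.
        my ($old, $new)   = ('data.txt', 'data.tmp');   # hypothetical names
        my $insert_before = 42;                         # line to insert at
        my $new_record    = "new data\n";

        open my $in,  '<', $old or die "Can't read $old: $!";
        open my $out, '>', $new or die "Can't write $new: $!";

        while (my $line = <$in>) {
            print $out $new_record if $. == $insert_before;  # write the insertion first
            print $out $line;                                # then keep copying old data
        }
        close $in;
        close $out or die "Can't finish $new: $!";

        rename $new, $old or die "Can't replace $old: $!";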

    Dave

      As I think about it, the best option might be to insert everything in a database (with columns fileID and intValue), then--after it's all been added to the database--loop through each fileID and add its values to the mega-file (but delete rows as they're inserted to save space).

      FFR, the speed of this means nothing . . . it's all about disk space conservation and the speed at which it can be read back in.
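      One way that staging pass could look with DBI and SQLite (a sketch only; the table layout, file names, and pack format are assumptions):

          use strict;
          use warnings;
          use DBI;

          # Hypothetical staging table: staging(file_id INTEGER, int_value INTEGER),
          # already populated during the build phase.
          my $dbh = DBI->connect('dbi:SQLite:dbname=staging.db', '', '',
                                 { RaiseError => 1 });

          open my $mega, '>>', 'mega.dat' or die "Can't append to mega.dat: $!";
          binmode $mega;

          my $ids = $dbh->selectcol_arrayref(
              'SELECT DISTINCT file_id FROM staging ORDER BY file_id');
          my $get = $dbh->prepare('SELECT int_value FROM staging WHERE file_id = ?');
          my $del = $dbh->prepare('DELETE FROM staging WHERE file_id = ?');

          for my $id (@$ids) {
              my $vals = $dbh->selectcol_arrayref($get, undef, $id);  # ints for this ID
              print {$mega} pack('l*', @$vals);   # append as packed 32-bit ints
              $del->execute($id);                 # drop the rows once written out
          }
          $dbh->disconnect;
          close $mega;

      (Note that SQLite won't shrink the database file on DELETE unless you VACUUM, but the freed pages do get reused.)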

        If speed meant NOTHING you'd do it by hand.

        The point is that inserting into a flat file means rewriting the entire file each time, unless you store updates and do them in groups. Imagine: if your 2MB flat file grows by 4 bytes each iteration, you're moving around 2MB of data each time you try to add 4 bytes. That's 2,000,000 bytes of data read and rewritten for each 4-byte insertion. Speed has got to mean something.

        I think your solution may be a good one. If building up the dataset is a one-time deal, do it in a database, and then transfer the completed product to a flat file where it can be read quickly.


        Dave

Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)
by tilly (Archbishop) on Jul 24, 2004 at 16:10 UTC
    If you want to know how a database could tackle a problem like this of mapping IDs to arbitrary information, read this article on BTrees. Then do as perrin said and use BerkeleyDB. That solves this problem in a highly optimized way, in C.

    If the dataset is large enough that it won't fit in RAM, then you probably want to ask it to build you a BTree rather than a hash. A hash is better if the data all fits in RAM.
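    For reference, a bare-bones BTree store through the stock DB_File interface to Berkeley DB looks something like this (the file name and keys are made up; the BerkeleyDB module gives you the same thing with more knobs):

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        # Tie a hash to an on-disk BTree; keys come back in sorted order,
        # and the data lives on disk rather than in RAM.
        tie my %store, 'DB_File', 'ints.btree', O_RDWR | O_CREAT, 0666, $DB_BTREE
            or die "Can't open ints.btree: $!";

        $store{"file_$_"} = pack 'l*', 1 .. 5 for 1 .. 3;   # packed ints per file ID
        my @back = unpack 'l*', $store{file_1};             # read one list back

        untie %store;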
