
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)

by tilly (Archbishop)
on Jul 24, 2004 at 16:32 UTC


in reply to Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
in thread Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

My expectation is that most databases would use a well-known data structure (such as a BTree) to store this kind of data, which avoids a million directory entries and also allows for variable-length data. I admit that an RDBMS might do this wrong, but I'd expect most of them to get it right on the first try. Certainly BerkeleyDB will.

As for the "file with big holes" approach, only some filesystems implement that. Furthermore, depending on how Perl was compiled and what OS you're on, you may have a fixed 2 GB limit on file sizes. With real data, that is a barrier you're probably not going to hit. With your approach, the file's size will always be the worst case. (And if your assumption about the size of a record is violated, you'll be in trouble: you've recreated the problem of the second situation that you complained about in point 1.)
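For reference, here is a minimal sketch of the fixed-slot, file-with-holes layout being described; the file name, slot size, and subfile count are illustrative assumptions, and the caller is assumed to track how many numbers each subfile already holds:

    use strict;
    use warnings;
    use Fcntl qw(O_RDWR O_CREAT SEEK_SET);

    # 1,000,000 subfiles, each reserved a worst-case slot of 512 packed
    # 32-bit numbers; the file therefore spans roughly 2 GB even while it
    # is mostly holes.
    my $SLOT_BYTES = 512 * 4;

    sysopen my $fh, 'big.dat', O_RDWR | O_CREAT, 0644 or die "big.dat: $!";

    # Append one 32-bit number to subfile $fileno, which currently holds
    # $count numbers (tracking the per-subfile counts is left to the caller).
    sub append_number {
        my ($fh, $fileno, $count, $value) = @_;
        die "slot $fileno is full\n" if $count >= 512;
        sysseek $fh, $fileno * $SLOT_BYTES + $count * 4, SEEK_SET
            or die "seek: $!";
        syswrite $fh, pack('N', $value) or die "write: $!";
    }

    append_number($fh, 123_456, 0, 42);   # first number for subfile 123456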

I'd also be curious to see the relative performance with real data between, say, BerkeleyDB and "big file with holes". I could see it coming out either way. However, I'd prefer BerkeleyDB because I'm more confident that it will work on any platform, because it is more flexible (you aren't limited to numerical offsets), and because it doesn't have the record-size limitation.
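As a minimal sketch of that alternative, using the BerkeleyDB module's BTree interface (the file name and key format are illustrative; the DB_File snippet further down the thread is equivalent in spirit):

    use strict;
    use warnings;
    use BerkeleyDB;

    # Tie a hash to an on-disk BTree; records grow as needed, so there is
    # no worst-case pre-allocation and no fixed record-size limit.
    tie my %subfile, 'BerkeleyDB::Btree',
        -Filename => 'subfiles.db',
        -Flags    => DB_CREATE
        or die "Cannot open subfiles.db: $BerkeleyDB::Error";

    # Append a packed 32-bit number to the record for one subfile.
    my ($fileno, $value) = (123_456, 42);
    $subfile{"file:$fileno"} .= pack 'N', $value;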


Replies are listed 'Best First'.
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
by bgreenlee (Friar) on Jul 24, 2004 at 17:36 UTC

    A 2 GB file-size limit is definitely a problem with the big-file approach. Two possible ways to avoid this if you still want to go this way:
    - the obvious: split the big file up into n files. This would also make the "growing" operation less expensive.
    - if some subfiles aren't growing very much at all, you could actually decrease the size allocated to them at the same time you do the grow operation.

    Actually, if you wanted to get really spiffy, you could have it automatically split the big file in half when it hits some threshold...then split any sub-big files as they hit the threshold, etc...
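    A back-of-the-envelope sketch of the simpler fixed split (the first idea above), with an illustrative chunk count and slot size; each chunk stays far below the 2 GB limit:

        use strict;
        use warnings;

        my $CHUNKS     = 16;        # number of chunk files (illustrative)
        my $SLOT_BYTES = 512 * 4;   # worst-case slot size per subfile

        # Map a subfile number to (chunk file, byte offset within it).
        # 1,000,000 subfiles / 16 chunks * 2048 bytes is about 128 MB per chunk.
        sub locate {
            my ($fileno) = @_;
            my $chunk  = $fileno % $CHUNKS;
            my $offset = int($fileno / $CHUNKS) * $SLOT_BYTES;
            return ("big.$chunk.dat", $offset);
        }

        my ($path, $offset) = locate(123_456);   # -> ("big.0.dat", 15_802_368)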

    BerkeleyDB is definitely sounding easier...but I still think this would be a lot of fun to write! (Might be a good Meditation topic...there are times when you might want to just DIY because it would be fun and/or a good learning experience.)

    Brad

Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
by BrowserUk (Patriarch) on Jul 25, 2004 at 11:22 UTC
    My expectation is that most databases would use a well-known data structure (such as a BTree) to store this kind of data, which avoids a million directory entries and also allows for variable-length data. I admit that an RDBMS might do this wrong, but I'd expect most of them to get it right on the first try. Certainly BerkeleyDB will.

    Using DB_File:

    1. 512,000,000 numbers, each appended to a randomly chosen one of 1,000,000 records, indexed by pack 'N', $fileno

      Actual data stored (1000000 * 512 * 4) : 1.90 GB

      Total filesize on disk : 4.70 GB

      Total runtime (projected based on 1%) : 47 hours

    2. 512,000,000 numbers written one per record, indexed by pack 'NN', $fileno, $position (fileno 0 .. 999,999; position 0 .. ~512 on average).

      Actual data stored (1000000 * 512 * 4) : 1.90 GB

      Total filesize on disk : 17.00 GB (Estimate)

      Total runtime (projected based on 1%) : 80 hours* (default settings)

      Total runtime (projected based on 1%) : 36 hours* ( cachesize => 100_000_000 )

      (*) Projections based on 1% probably grossly underestimate the total runtime, as it was observed that even at these low levels of fill, each new 0.1% took longer than the previous one.

      Further, I left the latter test running while I slept. It had reached 29.1% before I left it; 5 hours later it had reached 31.7%. I suspect that it might never complete.

    Essentially, this bears out exactly what I predicted at Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help).
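    (Not the actual test script, but a scaled-down sketch of the second layout: one packed number per record, keyed by pack 'NN', $fileno, $position, with the enlarged cache used for the faster projection. The file name and loop count are illustrative.)

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        # Enlarge the BTree cache before tying (cf. cachesize => 100_000_000 above).
        $DB_BTREE->{cachesize} = 100_000_000;

        tie my %data, 'DB_File', 'test2.db', O_RDWR | O_CREAT, 0640, $DB_BTREE
            or die "test2.db: $!";

        my @next_pos;                      # next free position per subfile
        for (1 .. 1_000_000) {             # scaled down from 512,000,000
            my $fileno = int rand 1_000_000;
            my $value  = int rand 2**32;
            my $pos    = $next_pos[$fileno] || 0;
            $next_pos[$fileno] = $pos + 1;
            $data{pack 'NN', $fileno, $pos} = pack 'N', $value;
        }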


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

      That's great. You may have already tried this and it might be moot, but does presizing the "array" to 512_000_000 elements help with performance?

        Firstly, the performance wasn't really the issue at hand. The question was related to disk storage rather than performance. The reason for highlighting the time taken was to excuse my having based my conclusions upon a miserly 1% of a complete test rather than having sat around for 2 days x N tests :)

        I'm not really sure what you mean by 'array' in the context of using DB_File?

        No array of 512_000_000 elements is ever generated. It's doubtful whether most 32-bit machines could access that much memory.

        The test program just looped 512_000_000 times (or would have, if I had let it), generating a random fileno and data value at each iteration. These were then used to fill in the values of a tied hash backed by a disk-based BTree DB file.
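        (For illustration only, not the original test program: a minimal sketch of that kind of loop, scaled down and using the appending layout of the first test; the file name is illustrative.)

            use strict;
            use warnings;
            use DB_File;
            use Fcntl;

            # Tie a hash to a disk-based BTree DB file.
            tie my %data, 'DB_File', 'test1.db', O_RDWR | O_CREAT, 0640, $DB_BTREE
                or die "test1.db: $!";

            for (1 .. 1_000_000) {            # scaled down from 512_000_000
                my $fileno = int rand 1_000_000;
                my $value  = int rand 2**32;
                # Append the packed value to the record for this fileno.
                $data{pack 'N', $fileno} .= pack 'N', $value;
            }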


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
by BrowserUk (Patriarch) on Jul 24, 2004 at 17:35 UTC

    Care to offer some code for comparison?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      Care to offer some code for comparison?

      I thought that the documentation was pretty clear. But here is a snippet using the older DB_File (because it is what I happen to have installed here):

      use DB_File;

      my $db = tie (my %data, 'DB_File', $file, O_RDWR|O_CREAT, 0640, $DB_BTREE)
          or die $!;

      # Now use %data directly, albeit with tie overhead.
      # Or use the OO interface (put/get) on $db.
