Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)

by BrowserUk (Pope)
on Jul 24, 2004 at 15:47 UTC


in reply to Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

Using a database (whether RDBMS or other) won't help you either save disk space or improve performance.

  1. If you write your binary data as BLOBs of some type, where each BLOB represents one file:

    In most DBs, each BLOB will be stored either as a separate file within the host filing system. A million files, a million clustering round-ups. No savings.

    Or as a fixed-size chunk (the maximum size for that BLOB type) within a larger file, thus effectively making the cluster size whatever the maximum size is for the largest file you expect to store.

  2. If you store your numbers as individual rows, one table per file, you will have a million tables, which as often as not translates to a million files in the host filing system.

    Worse, retrieving those numbers by position will require a second field in each row to record the position within the file, at least doubling the space requirement. More if you actually make that position field an index to speed access.

Building your own index is equally unlikely to help. It takes at least a 4-byte integer to index a 4-byte integer, plus some way of indicating which file each belongs to; with a million files, that's at least 20 bits per entry. And you still have to store the data itself.
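
To put rough numbers on that (a back-of-envelope sketch only; the 1,000,000 files and the 512-integers-per-file average are the same assumptions used later in this thread):

    use strict;
    use warnings;

    # Back-of-envelope storage estimates. Assumptions: 1,000,000 'files',
    # an average of 512 x 4-byte integers per file.
    my $files    = 1_000_000;
    my $avg_ints = 512;

    my $raw   = $files * $avg_ints * 4;               # the data itself
    my $rows  = $files * $avg_ints * ( 4 + 4 + 4 );   # value + fileno + position per row
    my $index = $files * $avg_ints * ( 4 + 3 )        # 4-byte offset + 20-bit fileno
              + $raw;                                 #   (rounded up to 3 bytes), plus the data

    printf "raw data           : %.2f GB\n", $raw   / 2**30;
    printf "one row per number : %.2f GB\n", $rows  / 2**30;
    printf "hand-rolled index  : %.2f GB\n", $index / 2**30;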


I would use a single file with a fixed-size chunk allocated to each file, and store it on a compressed filesystem (or a sparse-file-capable filesystem if you have one available).

I just wrote 1,000,000 x 4,096-byte records, each containing a random number (0..1023) of random integers. The notionally 3.81 GB (4,096,000,000-byte) file actually occupies 2.42 GB of disc space, so even though, on average, half of every 'file' is empty, the compression compensates.

Both the initial creation (I preallocated contiguous space) and random access run somewhat more slowly than with an uncompressed file, but not by much, thanks to filesystem buffering. In any case, it will be considerably quicker than access via an RDBMS.

Even if your files can vary widely in used size, nulling the whole file before you start will allow the compression mechanism to reduce the 'wasted' space to a minimum. A 10 GB file containing only nulls requires less than 40 MB to store.
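
One way to do that pre-nulling, as a minimal sketch (the filename and the 4 KB chunk size are illustrative only):

    use strict;
    use warnings;

    # Pre-null the single data file: 1,000,000 x 4 KB chunks of zero bytes.
    # On a compressing or sparse filesystem the nulls cost almost nothing.
    my $CHUNK  = 4096;
    my $NFILES = 1_000_000;

    open my $fh, '>', 'bigfile.bin' or die "open: $!";
    binmode $fh;
    print {$fh} "\0" x $CHUNK for 1 .. $NFILES;
    close $fh or die "close: $!";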

The best bit is that using a single file saves a million directory entries in the filesystem, and avoids juggling a million filehandles with their associated system buffers and data structures in RAM. A nice saving. You will have to remember the 'append point' for each of the files, but that is just a million 4- or 8-byte numbers: a single file of 4/8 MB.
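
A minimal sketch of the access scheme just described (the filenames, the 4 KB chunk size and the 4-byte append counters are assumptions for illustration, not tested code):

    use strict;
    use warnings;
    use Fcntl qw( SEEK_SET );

    my $CHUNK = 4096;     # fixed space reserved per logical 'file'

    open my $fh,  '+<', 'bigfile.bin' or die "data file: $!";    # the pre-nulled data file
    open my $idx, '+<', 'append.idx'  or die "index file: $!";   # one 4-byte append count per 'file'
    binmode $_ for $fh, $idx;

    # Append one 4-byte integer to logical file $fileno.
    sub append_int {
        my( $fileno, $int ) = @_;
        seek $idx, $fileno * 4, SEEK_SET;
        read $idx, my $buf, 4;
        my $count = unpack 'N', $buf;
        die "chunk $fileno is full" if ( $count + 1 ) * 4 > $CHUNK;
        seek $fh, $fileno * $CHUNK + $count * 4, SEEK_SET;
        print {$fh} pack 'N', $int;
        seek $idx, $fileno * 4, SEEK_SET;
        print {$idx} pack 'N', $count + 1;
    }

    # Return all the integers currently stored in logical file $fileno.
    sub read_ints {
        my( $fileno ) = @_;
        seek $idx, $fileno * 4, SEEK_SET;
        read $idx, my $buf, 4;
        my $count = unpack 'N', $buf;
        seek $fh, $fileno * $CHUNK, SEEK_SET;
        read $fh, my $data, $count * 4;
        return unpack "N$count", $data;
    }

The append-point index file would be pre-nulled to 4 MB (a million 4-byte counters) in the same way as the data file.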


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
by tilly (Archbishop) on Jul 24, 2004 at 16:32 UTC
    My expectation is that most databases would use a well-known data structure (such as a BTree) to store this kind of data, which avoids a million directory entries and also allows for variable-length data. I admit that an RDBMS might do this wrong, but I'd expect most of them to get it right first try. Certainly BerkeleyDB will.

    As for the "file with big holes" approach, only some filesystems implement that. Furthermore, depending on how Perl was compiled and what OS you're on, you may have a fixed 2 GB limit on file sizes. With real data, that is a barrier you're probably not going to hit; with your approach, the file's size will always be the worst case. (And if your assumption about the size of a record is violated, you'll be in trouble: you've recreated the problem of the second situation that you complained about in point 1.)

    I'd also be curious to see the relative performance with real data between, say, BerkeleyDB and "big file with holes". I could see it coming out either way. However I'd prefer BerkeleyDB because I'm more confident that it will work on any platform, because it is more flexible (you aren't limited to numerical offsets) and because it doesn't have the record-size limitation.

      A 2GB filesize limit is definitely a problem with the big file approach. Two possible ways to avoid this if you still want to go this way:
      - the obvious: split the big file up into n files. This would also make the "growing" operation less expensive
      - if some subfiles aren't growing very much at all, you could actually decrease the size allocated to them at the same time you do the grow operation.

      Actually, if you wanted to get really spiffy, you could have it automatically split the big file in half when it hits some threshold...then split any sub-big files as they hit the threshold, etc...
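
      A sketch of the bookkeeping that splitting implies, i.e. mapping a logical file number to a (sub-file, offset) pair (the names and sizes are made up for illustration):

          use strict;
          use warnings;

          my $CHUNK   = 4096;         # bytes reserved per logical file
          my $NFILES  = 1_000_000;    # logical files overall
          my $SPLITS  = 4;            # physical pieces the big file was split into
          my $PERFILE = $NFILES / $SPLITS;

          # Map a logical file number to the physical sub-file and byte offset.
          sub locate {
              my( $fileno ) = @_;
              my $subfile = int( $fileno / $PERFILE );
              my $offset  = ( $fileno % $PERFILE ) * $CHUNK;
              return ( "bigfile.$subfile", $offset );
          }

          my( $path, $offset ) = locate( 123_456 );
          print "$path, byte offset $offset\n";    # bigfile.0, byte offset 505675776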

      BerkeleyDB is definitely sounding easier...but I still think this would be a lot of fun to write! (Might be a good Meditation topic...there are times when you might want to just DIY because it would be fun and/or a good learning experience.)

      Brad

      My expectation is that most databases would use a well-known data structure (such as a BTree) to store this kind of data, which avoids a million directory entries and also allows for variable-length data. I admit that an RDBMS might do this wrong, but I'd expect most of them to get it right first try. Certainly BerkeleyDB will.

      Using DB_File:

      1. 512,000,000 numbers appended randomly to one of 1,000,000 records indexed by pack 'N', $fileno

        Actual data stored (1000000 * 512 * 4) : 1.90 GB

        Total filesize on disk : 4.70 GB

        Total runtime (projected based on 1%) : 47 hours

      2. 512,000,000 numbers written one per record, indexed by pack 'NN', $fileno, $position (0..999,999 / 0 .. 512 (ave)).

        Actual data stored (1000000 * 512 * 4) : 1.90 GB

        Total filesize on disk : 17.00 GB (Estimate)

        Total runtime (projected based on 1%) : 80 hours* (default settings)

        Total runtime (projected based on 1%) : 36 hours* ( cachesize => 100_000_000 )

        (*) Projections based on 1% probably grossly under-estimate total runtime, as it was observed that even at these low levels of fill, each new 0.1% took longer than the previous.

        Further, I left the latter test running while I slept. It had reached 29.1% prior to leaving it. 5 hours later it had reached 31.7%. I suspect that it might never complete.

      Essentially, this bears out exactly what I predicted at Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help).
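
      For reference, scheme 1 above amounts to something like the following (a sketch only, not the code that produced the figures above; the filename is made up):

          use strict;
          use warnings;
          use Fcntl;
          use DB_File;

          # One BTree record per logical 'file', keyed by pack( 'N', $fileno ),
          # with the 4-byte integers packed end-to-end in the value.
          tie my %db, 'DB_File', 'files.db', O_RDWR|O_CREAT, 0640, $DB_BTREE
              or die "tie: $!";

          # Append one integer: read the record, append, write it back.
          # This read-modify-write on every append is where the time goes.
          sub append_int {
              my( $fileno, $int ) = @_;
              my $key = pack 'N', $fileno;
              my $old = defined $db{ $key } ? $db{ $key } : '';
              $db{ $key } = $old . pack 'N', $int;
          }

          # Return all integers stored for logical file $fileno.
          sub read_ints {
              my( $fileno ) = @_;
              my $packed = $db{ pack 'N', $fileno };
              return defined $packed ? unpack 'N*', $packed : ();
          }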


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

        That's great. You may have already tried this and it might be moot, but does presizing the "array" to 512_000_000 elements help with performance?

      Care to offer some code for comparison?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
        Care to offer some code for comparison?

        I thought that the documentation was pretty clear. But here is a snippet using the older DB_File (because it is what I happen to have installed here):

        use Fcntl;      # for the O_RDWR and O_CREAT flags
        use DB_File;

        my $db = tie( my %data, 'DB_File', $file, O_RDWR|O_CREAT, 0640, $DB_BTREE )
            or die $!;

        # Now use %data directly, albeit with tie overhead.
        # Or use the OO interface (put/get) on $db.
Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
by demerphq (Chancellor) on Jul 25, 2004 at 23:01 UTC

    I'm confused: why wouldn't you just use a single table, with file_num, item_num and num_val as the data? Presuming that we can use four bytes per field, we have 12 bytes per record. Thus 1 million records is ~12 MB; assuming 100 records per file, we are looking at 120 MB, no?

    My point here is that, unless I'm missing something (which I suspect I am), neither of the ways you describe is how I would solve this problem with an RDBMS engine. BLOBs are a bad idea, as they almost always allocate a full page (one cluster, IIRC) regardless of how big the BLOB is. And using millions of tables just seems bizarre, as the overheads of managing the tables would be ridiculous. I suspect, but don't know for sure, that Sybase would be very unhappy with a DB with a million tables in it, but I know for sure that it is quite happy to have tables with 120 million records in them.
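
    For what it's worth, a sketch of the single-table layout I have in mind (the DSN, login and table name are placeholders; any DBI driver would do):

        use strict;
        use warnings;
        use DBI;

        # Placeholder connection; substitute your own DSN and credentials.
        my $dbh = DBI->connect( 'dbi:Sybase:server=SOMESERVER', 'user', 'pass',
                                { RaiseError => 1 } );

        # One row per stored number, keyed by ( file_num, item_num ).
        $dbh->do( q{
            CREATE TABLE file_ints (
                file_num int NOT NULL,
                item_num int NOT NULL,
                num_val  int NOT NULL,
                PRIMARY KEY ( file_num, item_num )
            )
        } );

        my $ins = $dbh->prepare(
            'INSERT INTO file_ints ( file_num, item_num, num_val ) VALUES ( ?, ?, ? )' );
        $ins->execute( 42, 0, 123_456 );    # first number of 'file' 42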


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


      As described by the OP, there are 1,000,000(+) binary files, each containing a variable number of 4-byte integers, often totalling less than 1 KB and usually less than 4 KB per file. Assuming an average of 2 KB (512 integers) per file, that gives 2,048 * 1,000,000 = 1.9 GB. The aim was to save the 'wasted' disc space due to cluster-size round-up.

      Any DB scheme that uses a single table and 2x 4-byte integer indices per number will require (minimum) 12 * 512 * 1,000,000 = 5.7 GB.

      The extra space is required because the two indices, fileno & itemno (position), are implicit in the original scheme, but must be explicit in the 'one table / one number per tuple' scheme.

      The other alternative I posed was to store each file (1..1024 4-byte integers) from the filesystem scheme as a LONGBLOB, thereby packing one file per tuple in a single table. Often BLOBs are stored as fixed-length records, each occupying the maximum record size allowed, regardless of the length actually stored.

      Even when they are stored as LONGVARBINARY (a 4-byte length plus the data bytes), they are not stored in the main table file but in a separate file, with a 4-byte placeholder/pointer into that ancillary file. That's at least 12 bytes per file (fileno, pointer, length) * 1,000,000 extra bytes that need to be stored on disc somewhere. Any savings made by avoiding cluster round-up through packing the variable-length records into a single file are mostly lost here and in the main table file.

      In addition, as the OP pointed out, this scheme requires that each 'file' record be queried, appended to, and then re-written for every number added: a costly process relative to appending to the end of a named file.

      It's often forgotten that data stored in a database ultimately ends up in the filesystem (in most cases). Of course, in a corporate environment that disc space may belong to someone else's budget and is therefore not a concern :) But if the aim is to save disc space (which may or may not be a legitimate concern; we don't know the OP's situation. Embedded systems?), then a DB won't help.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      Sybase could handle a million tables, but, as you say, the overhead (in syscolumns and sysobjects) would be tremendous.

      BLOBS would be a bad idea from the space management perspective, and would probably be a bit slow as well due to being stored on a different page chain.

      If you are using Sybase 12.5 or later and you know that the binary data will be less than a set amount (say 4k or so) then you could use a 4k or 8k page size on the server, and use a VARBINARY(4000) (for example) to store the binary data. This would be quite fast as it is stored on the main page for the row, and wouldn't waste any space.
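
      Something along these lines, as a sketch only (server, login and table names are placeholders):

          use strict;
          use warnings;
          use DBI;

          # Placeholder connection; substitute your own server and credentials.
          my $dbh = DBI->connect( 'dbi:Sybase:server=SOMESERVER', 'user', 'pass',
                                  { RaiseError => 1 } );

          # On a 4K or 8K page-size server, a varbinary(4000) column stays on the
          # row's own page, so there is no separate text/image page chain to chase.
          $dbh->do( q{
              CREATE TABLE file_blobs (
                  file_num int             NOT NULL PRIMARY KEY,
                  ints     varbinary(4000) NULL
              )
          } );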

      Michael
