PerlMonks

Re^2: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)

by demerphq (Chancellor)
on Jul 25, 2004 at 23:01 UTC ( [id://377316] )


in reply to Re: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
in thread Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)

I'm confused: why wouldn't you just use a single table, with file_num, item_num and num_val as the data? Presuming we can use four bytes per field, that's 12 bytes per record, so 1 million records is ~12 MB; assuming 100 records per file across the million files, that's 100 million records, or roughly 1.2 GB, no?
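
Roughly what I have in mind, as a sketch only (the DSN, table name and column names are just placeholders, not anything from the OP):

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details -- adjust for your own server.
    my $dbh = DBI->connect( 'dbi:Sybase:server=MYSERVER', 'user', 'password',
                            { RaiseError => 1 } );

    # One table holds every value from every 'file'.
    $dbh->do( q{
        CREATE TABLE file_items (
            file_num INT NOT NULL,
            item_num INT NOT NULL,
            num_val  INT NOT NULL,
            PRIMARY KEY ( file_num, item_num )
        )
    } );

    # 'Appending' a value to a given file is then just an INSERT.
    my $ins = $dbh->prepare(
        'INSERT INTO file_items ( file_num, item_num, num_val ) VALUES ( ?, ?, ? )'
    );
    $ins->execute( 42, 0, 123_456 );    # file 42, item 0, value 123456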

My point here is that, unless I'm missing something (which I suspect I am), neither of the ways you describe is how I would solve this problem with an RDBMS engine. BLOBs are a bad idea, as they almost always allocate a full page (one cluster, IIRC) regardless of how big the BLOB is. And using millions of tables just seems bizarre, as the overhead of managing the tables would be ridiculous. I suspect, but don't know for sure, that Sybase would be very unhappy with a DB containing a million tables, but I know for sure that it is quite happy to have tables with 120 million records in them.


---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi


Replies are listed 'Best First'.
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
by BrowserUk (Patriarch) on Jul 26, 2004 at 00:22 UTC

    As described by the OP, there are 1,000,000(+) binary files, each containing a variable number of 4-byte integers; the files are often less than 1 KB and usually less than 4 KB. Assuming an average of 2 KB (512 integers) per file, that gives 2,048 * 1,000,000 bytes = ~1.9 GB. The aim was to save 'wasted disc space' due to cluster-size round-up.
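
    For reference, the filesystem scheme being compared against amounts to nothing more than appending one packed integer to a named file. A sketch (the directory layout is my assumption, not the OP's):

        use strict;
        use warnings;

        # Append one 4-byte integer to the binary file for $fileno.
        sub append_value {
            my( $dir, $fileno, $value ) = @_;
            open my $fh, '>>:raw', "$dir/$fileno.dat"
                or die "Cannot open $dir/$fileno.dat: $!";
            print {$fh} pack 'N', $value;    # 4-byte big-endian integer
            close $fh or die "Cannot close: $!";
        }

        append_value( '/data/items', 42, 123_456 );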

    Any DB scheme that uses a single table and 2x 4-byte integer indices per number will require (minimum) 12 * 512 * 1,000,000 = 5.7 GB.

    The extra space is required because the two indices, fileno & itemno (position), are implicit in the original scheme, but must be explicit in the 'one table/one number per tuple' scheme.

    The other alternative I posed was to store each file (1..1024 4-byte integers) from the filesystem scheme as a LONGBLOB, thereby packing one file per tuple in the single table. Often BLOBs are stored as fixed-length records, each occupying the maximum record size allowed regardless of the length actually stored.
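
    In other words, each 'file' becomes a single packed value in one row. A sketch of that layout (the table and column names are only for illustration, and $dbh is a connected DBI handle):

        # Pack an entire 'file' worth of integers into one BLOB column.
        my @values = ( 1, 2, 3, 4 );         # the file's integers
        my $blob   = pack 'N*', @values;     # 4 bytes per integer

        $dbh->do( 'INSERT INTO files ( fileno, data ) VALUES ( ?, ? )',
                  undef, 42, $blob );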

    Even when they are stored as LONGVARBINARY (a 4-byte length plus the data bytes), they are not stored in the main table file, but in a separate file, with a 4-byte placeholder/pointer into that ancillary file. That's at least 12 bytes/file (fileno, pointer, length) * 1,000,000 extra bytes that need to be stored on disc somewhere. Any savings made by packing the variable-length records into a single file to avoid cluster round-up are mostly lost here and in the main table file.

    In addition, as the OP pointed out, this scheme requires that each 'file' record be queried, appended to, and then re-written for each number added--a costly process relative to appending to the end of a named file.
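
    A sketch of that read-modify-write cycle (again with assumed table/column names; $dbh is a connected DBI handle as before):

        # Appending one value when each 'file' is a BLOB row means
        # fetching, modifying and re-writing the whole record.
        my $sel = $dbh->prepare( 'SELECT data FROM files WHERE fileno = ?' );
        my $upd = $dbh->prepare( 'UPDATE files SET data = ? WHERE fileno = ?' );

        $sel->execute( 42 );
        my( $blob ) = $sel->fetchrow_array;
        $blob .= pack 'N', 123_456;          # append the new 4-byte integer
        $upd->execute( $blob, 42 );          # re-write the entire record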

    It's often forgotten that, ultimately, data stored in a database ends up in the filesystem (in most cases). Of course, in a corporate environment, that disc space may belong to someone else's budget and is therefore not a concern :) But if the aim is to save disc space (which may or may not be a legitimate concern--we don't know the OP's situation. Embedded systems?), then a DB won't help.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re^3: Combining Ultra-Dynamic Files to Avoid Clustering (Ideas?)( A DB won't help)
by mpeppler (Vicar) on Jul 26, 2004 at 10:58 UTC
    Sybase could handle a million tables, but, as you say, the overhead (in syscolumns and sysobjects) would be tremendous.

    BLOBS would be a bad idea from the space management perspective, and would probably be a bit slow as well due to being stored on a different page chain.

    If you are using Sybase 12.5 or later and you know that the binary data will be less than a set amount (say 4k or so) then you could use a 4k or 8k page size on the server, and use a VARBINARY(4000) (for example) to store the binary data. This would be quite fast as it is stored on the main page for the row, and wouldn't waste any space.
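
    Something along these lines, as a sketch ($dbh being a connected DBI handle; the table and column names are just examples):

        # On a 4k-page server, keep the packed data in-row.
        $dbh->do( q{
            CREATE TABLE file_data (
                fileno INT             NOT NULL PRIMARY KEY,
                data   VARBINARY(4000) NULL
            )
        } );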

    Michael
