(Real) Database Design

by Anonymous Monk
on Jul 14, 2004 at 07:42 UTC

Greetings and Salutations,

I'm currently on a grand quest to reinvent a wheel. Well, not really a wheel, a database. This quest exists for the purpose of discovering how to effectively structure data for simplicity of programming while attempting to maintain somewhat reasonable performance. I'm leaning towards a relational, table-based model because that's all I really have an understanding of at this point :).

My humble questions are as follows:

  1. How would one persistently store the data? All I'm doing right now is using a hierarchy of directories that basically goes rootdb/db/table.file. The table files consist of null-terminated rows. Is there a better way of structuring this?
  2. I currently have no idea how to store metadata such as column types and sizes, or how to make use of it. I'm planning on writing this only for Linux & BSD systems, if that makes a significant difference. Any ideas?

I've taken a moderately in-depth look around and found very little on the subject. All the books I've found seem to focus on data normalization and SQL. Any advice or pointers to any Books or FMs that I can Read would be most excellent. Many thanks for your time and assistance (:

Replies are listed 'Best First'.
Re: (Real) Database Design
by dragonchild (Archbishop) on Jul 14, 2004 at 12:37 UTC
    The issues in optimizing performance for data storage are primarily hardware/operating-system driven. A partial list of the issues/thoughts/concerns would include:
    • A hard-drive generally has seek-times in the millisecond range. RAM has seek-times in the nanosecond range. That's a million-to-one difference.
    • Most datastores, like MySQL, are optimized for data retrieval. This is because the number of retrievals vs. the number of updates is generally in the 100k-to-one to million-to-one category. (That's based on experience and anecdotal evidence. Various applications have various needs. To compare - I work with two apps right now. One is almost completely retrieval and the other is about 100k-to-1.)
    • Although the SQL standard doesn't mandate data types (and SQLite, for instance, doesn't use them), most datastores will use them because that information can reduce the amount of space taken by a given row. By reducing space, you reduce the seek time for finding records. For example, if you know a column is INT(7), its values fit below 2^24, so it takes 3 bytes stored as an integer instead of 7 as a string (up to 8, if you store it as a null-terminated string). Over a million rows, that saves 4 megabytes. In addition, if you use an integer there, you can justify using fixed-length rows, which trade space (you potentially use more per row) for speed (you know exactly how wide each row is, so one seek finds any row, vs. potentially thousands of reads in a naive implementation); there's a sketch of this after the list.
    • A really neat fact about datastores is that optimizing for space actually ends up optimizing for speed, in the general case. This is because datastores are almost completely I/O-bound. The less you have to use the hard-drive, the less time you take. Also - CPUs are improving much, much faster than platter-based hard-drives. So, using some form of data compression on a per-row basis may make sense. (And some datastores, like Oracle and MySQL, do offer this option for retrieve-only tables, but that system can easily be improved upon.)
    • A concern you have to take into account is the block size. This is the smallest chunk of physical space the operating system will allocate for storage. In most OSes, this is either 512 bytes or 1024 bytes. In keeping with the desire to reduce seek time, you should be aware of this number.
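
    To make the fixed-length-row point concrete, here's a minimal Perl sketch; the column layout (a 3-byte unsigned id plus a 20-byte space-padded name, 23 bytes per row) is invented purely for illustration:

        use strict;
        use warnings;

        # Invented layout: 3-byte unsigned id (values < 2**24)
        # followed by a 20-byte space-padded name; 23 bytes/row.
        sub pack_row {
            my ($id, $name) = @_;
            die "id too big for 3 bytes" if $id >= 2**24;
            # pack a 4-byte big-endian int, then drop the high byte
            return substr(pack('N', $id), 1) . pack('A20', $name);
        }

        sub unpack_row {
            my ($raw) = @_;
            my $id   = unpack('N', "\0" . substr($raw, 0, 3));
            my $name = unpack('A20', substr($raw, 3)); # 'A' strips pad
            return ($id, $name);
        }

        my $row = pack_row(9_999_999, 'a_monk');
        print length($row), " bytes; the id cost 3, not 8\n";

    The fixed width is what makes the single-seek lookup below possible.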

    Your questions are actually the same question, from different directions. There is a better way of structuring the data, and that's by keeping the metadata around. Basically, the most important piece of metadata you want is the rowsize. That tells you where the data is that you're looking for. You also want to keep some set of indices which allow you to quickly determine which row numbers have what data in what column. As for storing this metadata ... most datastores keep a separate storage area for this. Oracle keeps it as part of the data file and MySQL keeps it as a separate file (at least for MyISAM tables).
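
    For example, once the metadata says every row is 23 bytes (reusing the invented layout sketched above; the file name is the one from the original question), fetching row $n is a single seek and read, with no scanning:

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        my $ROW_SIZE = 23;    # would be read from the metadata

        sub read_row {
            my ($fh, $n) = @_;
            seek($fh, $n * $ROW_SIZE, SEEK_SET) or die "seek: $!";
            my $got = read($fh, my $raw, $ROW_SIZE);
            die "short read" unless defined $got && $got == $ROW_SIZE;
            return $raw;
        }

        open my $fh, '<:raw', 'rootdb/db/table.file' or die "open: $!";
        my $raw = read_row($fh, 41);   # the 42nd row, zero-based;
        # unpack it with a template built from the column metadata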

    Column types, sizes, referential integrity ... that's all used to aid the developer. SQLite is a good example of a datastore that doesn't make any use of that stuff. (Well, very, very little use.) Column types and sizes can also help the datastore in some optimizations when calculating rowsize. (q.v. above as to why this is important.)

    Oh - if you want any sort of decent performance, you will end up rewriting this in C.

    This won't help in the actual implementation, but look for stuff written by Fabian Pascal on some theory behind efficient data storage, especially with the relational model. He has a lot of ... good ... articles on the web.

    Updates: Wording changes on the sidenote about compression on a per-row basis. Added an example of savings when using data types.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      A truly knowledgeable reply! Many thanks :)

      I had a feeling that the operating system and how it views/interprets files would play a major role in the design. As for the hardware, I'll just be using x86 for now (until the xbox 2 comes out ;).

      As for the issue of storing metadata, having it in the same file as the table appeals to me. It would appear to my (rather limited) sight that having to access a separate location to retrieve the metadata would unnecessarily complicate things. The issue then becomes the exact format of metadata storage. Is a fixed-length header at the start of the file the best approach? As more columns are added, this would seem to collapse. How about non-data-type metadata, such as the names of columns? If anyone knows of an example implementation or can offer advice on this, it would greatly assist me.

      Now for my sillier question. The data files would have to be binary files, right? All that I'll end up having is a stream of 1s and 0s that the OS attaches a label to? (As you can see, I still have a bit of a conceptual leap to make ;). The Linux kernel doesn't interpret a file as anything more than a stream of bytes, correct? So when I have a data column containing 01101011, the dbm will choose to interpret it as a character or as an integer based on the metadata, right? If this is all correct, can anyone provide pointers to standards about common representations of data types (especially floating-point numbers)?

      As for the language of choice, C would be the highest level language that I'd end up writing this in. I'd like to understand the core theory first though before I start writing my spaghetti.

      The points about block size and loading as much as possible into RAM are well taken. They do make me think of one other question - how would I have to change (if at all) the structure of the db in RAM as compared to on disk? Could I just load it all into a struct that's generated based on the file's metadata?

      Many thanks again for the excellent replies so far :)

        When I was speaking of hardware, I was speaking of the performance limitations of the hard-drive as opposed to RAM and CPU. The actual architecture is unimportant because the operating system deals with that. The important thing is the ratio between retrieving data from a hard-drive and retrieving the same data from RAM.

        You actually do want to have the metadata in a separate file from the actual data itself. In fact, you want it on a separate hard-drive, if possible. That way, your machine can read both files at the same time - taking advantage of multiple and/or dual-core CPUs. This also means that you can modify the metadata without having to re-write the data file. Metadata is usually a very small amount of data compared to the actual data itself. Remember - a database is usually only useful once you get beyond 3-8 tables and/or 10-20k rows. Metadata usually doesn't get beyond 10-20 kilobytes, while data is usually in the megabytes. You don't want to relocate 20M just to add another constraint to your 3K of metadata. This also allows you to (initially) keep your metadata in some human-readable form, like YAML or XML. (Or even Apache's configuration format, readable with Config::ApacheFormat.) That can make initial development a little easier. Or not.
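
        For instance, a minimal sketch of the human-readable route (the file path and field names here are invented; YAML.pm's LoadFile is just one of several ways to read such a file):

            use strict;
            use warnings;
            use YAML qw(LoadFile);

            # Invented path and fields; one small YAML file per
            # table, kept apart from the data file it describes:
            #
            #   table: users
            #   rowsize: 23
            #   columns:
            #     - { name: id,   type: int,  bytes: 3 }
            #     - { name: name, type: char, bytes: 20 }
            #
            my $meta = LoadFile('rootdb/db/table.meta');
            print "rows of $meta->{table} are $meta->{rowsize} bytes\n";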

        To answer your sillier question - every single file is a binary file, from everything's perspective. The only thing that makes a file text vs. binary is how a program has been told to interpret it.

        For example, FTP makes a distinction between the two only so it can translate \n on Unix to \r\n on Windows to \r on the Mac. That's it. (Well, I'm sure VMS and OS/360 have their own linefeed characters, but who uses those? *grins*)

        You will want to pack your data instead of naively storing it as characters. This reduces the space used, which reduces the time it takes to find a given record. Packing data according to various schemes is the topic of several graduate theses, and well beyond the limited discussion here. (I would suggest googling and having several cases of Jolt available. Oh - and don't buy Doom3 when it comes out on August 6th if you actually want to get this project done.) The important thing is that most of the packing and unpacking work has already been done for you. You just have to know where to find it.
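
        In Perl, most of it lives in pack and unpack. A tiny sketch, which also answers the earlier floating-point question: the 'd' format writes the platform's native double, which on Linux/x86 is IEEE 754 double precision (native byte order, though, so files written this way aren't portable across architectures):

            use strict;
            use warnings;

            # 'N' = 32-bit big-endian unsigned int; 'd' = native
            # double (IEEE 754 double precision on Linux/x86, but
            # native byte order, so not portable across CPUs).
            my $raw = pack('N d', 42, 3.14159);
            my ($int, $float) = unpack('N d', $raw);
            printf "%d bytes; got back %d and %.5f\n",
                length($raw), $int, $float;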

        Your last question about data structure representation is misguided. You will not be working with the data as it is stored on disk. You will always be loading it from disk into RAM, then manipulating it there. Most metadata exists to reduce the amount of stuff that is retrieved from disk and converted into some useful data structure that lives in RAM.

        Remember - a database isn't some magical program that is somehow different from every other program in how it interacts with the world. It's just a complex application that, at its heart, does the following:

        • Stores data on disk in a compact form
        • Retrieves selected data from disk as fast as possible

        The reason this technology is so important is that retrieving selected data from disk as fast as possible is the heart of modern business. The key word is selected. Every single application in the world can be viewed as a report on data stored in some datastore, transformed in some way, with some giving the ability to modify that data and write it back to the datastore. The trick is only pulling back the data that the user wants to retrieve. Oh - and doing it as fast as possible, preferably beneath the human threshold (about a tenth of a second).

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: (Real) Database Design
by periapt (Hermit) on Jul 14, 2004 at 13:12 UTC
    Actually, the directory structure approach isn't a bad solution. If you can still find some literature, take a look at discussions of the internals of the old dBase II or FoxPro (DOS version). I think IBM's DB2 also uses a hierarchical directory structure but I'm not sure about that. I did a fair amount of that kind of work in the eighties.

    Although I'm a little hazy on the details now, I had a layout similar to what you described. Under a parent "db" directory, I had subdirectories for each database, which contained fixed-length text files for the tables. I also had batch files and other programs for data extraction and rudimentary reporting. I didn't know about SQL back then. I used fixed-length records so that I could index each row based on its position in the file. Back when PCs ran at less than 4 MHz, reading through a file line by line could be a lot slower than a single seek to a specific byte position. I believe I used a single "metadata" directory (I didn't call it that) to store info about the tables, with a subdir for each database holding files of the same names as the tables, containing the column structure and the like. For a while, before I discovered dBase II, I used an ISAM implementation for indexing (in BASIC, no less). I believe I got it out of PC Magazine.
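
    A sketch of that scheme in Perl (the record length, file name, and key layout are all invented here): fixed-length records addressed by position, with a one-pass, ISAM-style key-to-record-number index.

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        my $RECLEN = 64;    # fixed record length, newline included

        # One pass builds a key -> record-number index; a real
        # ISAM setup would persist the index in its own file.
        open my $fh, '<', 'db/customers.txt' or die "open: $!";
        my %index;
        my $recno = 0;
        while (my $rec = <$fh>) {
            $index{ substr($rec, 0, 8) } = $recno++;  # 8-byte key
        }

        # Later: one seek straight to any record by key.
        my $n = $index{'CUST0042'};
        defined $n or die "no such key";
        seek($fh, $n * $RECLEN, SEEK_SET) or die "seek: $!";
        read($fh, my $rec, $RECLEN) or die "read: $!";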

    Now that I think of it, PC Mag had a lot of stuff like that back then. Perhaps you could try their archives from, say, 1983-1987. I'm sorry that I can't give you too many specifics. My implementations were pretty primitive, I thought, but I do remember being pleased when my layout matched the dBase II implementation.

    Good luck

    PJ
    unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)
Re: (Real) Database Design
by davorg (Chancellor) on Jul 14, 2004 at 08:12 UTC

    I'm probably missing the point here, but if you want to store data in "a relational, table-based model" then why aren't you using SQLite or PostgreSQL?

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Because I'm attempting to fully understand how such systems work. I wouldn't hesitate to use a tried and tested system if I were only looking for something to get the job done. My purpose here is 99% learning and 1% attempting to create a decent end result :)

      I've also downloaded the source code to many dbms but they're all far too complex for me to get a good idea of what's going on just by reading the source.

        I've also downloaded the source code to many dbms but they're all far too complex for me to get a good idea of what's going on just by reading the source.

        Even DBD::SQLite? I think you really should consider studying existing database systems. If they seem overly complex that's probably because building a database, even a pretty modest one, is a lot of work. If you ignore the many fine examples of the art you'll just have to discover them the hard way!

        -sam

Re: (Real) Database Design
by diotalevi (Canon) on Jul 14, 2004 at 15:45 UTC
    autarch has thought of extending Alzabo so that it is not only an OO-RDBMS wrapper plus transactions but also has the data store as well. So that is near to being a database as-is.

      I wouldn't say it's near it at all. My idea is a pipe dream at the moment.

      I will say that all this discussion of the actual storage layout is somewhat missing the point. Ideally, a DBMS should be able to store data in many different ways, allowing the DBA to optimize various tables/indexes as needed. Even better, the DBMS should figure out the best way to store data based on usage patterns.
