Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
The issues in optimizing performance for data storage are primarily hardware/operating-system driven. A partial list of the issues/thoughts/concerns would include:
  • A hard-drive generally has seek-times in the milli-second range. RAM has seek-times in the nano-second range. That's a million-to-one difference.
  • Most datastores, like MySQL, are optimized for data retrieval. This is because the number of retrievals vs. the number of updates is generally in the 100k-to-one to million-to-one category. (That's based on experience and anecdotal evidence. Various applications have various needs. To compare - I work with two apps right now. One is almost completely retrieval and the other is about 100k-to-1.)
  • Although the SQL standard doesn't mandate data types (and SQLite, for instance, doesn't use them), most datastores will use them because that information can reduce the amount of space taken for a given row. By reducing space, you reduce the seek time for finding records. For example, if you know a column is INT(7), that means it's less than 2^24. That's 3 bytes storing it as an integer, instead of 7 as a string. (And up to 8, if you store it as a null-terminated string.) Over a million rows, that can save 4 megabytes. In addition, if you use an integer there, you can justify using fixed-length rows, which trades space (you potentially use more per row) for speed (you know exactly how wide each row is, so you only have one seek vs. potentially thousands in a naive implementation.)
  • A really neat fact about datastores is that optimizing for space actually ends up optimizing for speed, in the general case. This is because datastores are almost completely I/O-bound. The less you have to use the hard-drive, the less time you take. Also - CPU's are improving much much faster than hard-drives. (At least, over platter-based hard-drives.) So, using some form of data compression on a per-row may make sense. (And, some datastores like Oracle and MySQL do offer this option for retrieve-only tables, but that system can be easily improved upon.)
  • A concern you have to take into account is the block size. This is the smallest chunk of physical space the operating system will allocate for storage. In most OS'es, this is either 512bytes or 1024bytes. In keeping with the desire to reduce the amount of seek time, you should be aware of this number.

Your questions are actually the same question, from different directions. There is a better way of structuring the data, and that's by keeping the metadata around. Basically, the most important piece of metadata you want is the rowsize. That tells you where the data is that you're looking for. You also want to keep some set of indices which allow you to quickly determine which row numbers have what data in what column. As for storing this metadata ... most datastores keep a separate storage area for this. Oracle keeps it as part of the data file and MySQL keeps it as a separate file (at least for MyISAM tables).

Column types, sizes, referential integrity ... that's all used to aid the developer. SQLite is a good example of a datastore that doesn't make any use of that stuff. (Well, very, very little use.) Column types and sizes can also help the datastore in some optimizations when calculating rowsize. (q.v. above as to why this is important.)

Oh - if you want any sort of decent performance, you will end up rewriting this in C.

This won't help in the actual implementation, but look for stuff written by Fabian Pascal on some theory behind efficient data storage, especially with the relational model. He has a lot of ... good ... articles on the web.

Updates:Wording changes on the sidenote about compression on a per-row basis. Added an example of savings when using data types.

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

I shouldn't have to say this, but any code, unless otherwise stated, is untested


In reply to Re: (Real) Database Design by dragonchild
in thread (Real) Database Design by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (5)
As of 2024-04-24 08:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found