Where is the bottleneck in flat file databases?

by Wysardry (Pilgrim)
on Jan 22, 2003 at 23:19 UTC [id://229158]

Wysardry has asked for the wisdom of the Perl Monks concerning the following question: (database programming)

I've seen it mentioned time and time again that flat text databases become very inefficient as the size of the file approaches 1MB.

Does anybody know why that is and, more importantly, any way to lessen that effect?

It can't really all be down to the file system, as an RDBMS uses files on the same system in many cases. For instance, I have Perl and MySQL installed on the same drive.

Is it the indexing, or is it just down to the queries being controlled by a specialised compiled program instead of Perl? Is that really enough to allow MySQL to handle tables 5000 times the size available for a flat text version?

Has anyone experimented with indexing flat file systems in Perl and using seek to query smaller blocks at a time?

Originally posted as a Categorized Question.


Replies are listed 'Best First'.
Re: Where is the bottleneck in flat file databases?
by MarkM (Curate) on Jan 23, 2003 at 04:45 UTC

    The two primary problems with flat file (text) databases are:

    1. Fields are not usually indexed.
    2. If the flat file uses variable-length records, any update that changes a record's length forces a rewrite of every record that follows it.

    Fields in a flat file database can be indexed. The index can be stored in the flat file itself, or in an adjacent file in a known location. However, this is not usually done, as the effort of writing routines to index a flat file database is likely better spent on an existing binary database format or toolset.
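
    As a minimal sketch of such an index (the file name records.txt, the '|' delimiter and the first-field key are assumptions for illustration, nothing from the original post):

        use strict;
        use warnings;

        # Build an in-memory index of key => byte offset for a
        # line-oriented flat file whose first '|'-delimited field is the key.
        my %index;
        open my $fh, '<', 'records.txt' or die "open: $!";
        until (eof $fh) {
            my $offset = tell $fh;            # position before this record
            my $line   = <$fh>;
            my ($key)  = split /\|/, $line, 2;
            $index{$key} = $offset;
        }

        # Later: jump straight to one record instead of scanning the file.
        seek $fh, $index{'some_key'}, 0;      # 0 = SEEK_SET
        my $record = <$fh>;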

    An additional issue with having to rewrite much of the database on every update is contention. What if multiple processes need to update the database? If process A needs to update the database, it must lock the database, rewrite a large section of it, and unlock it. Process B must then lock the database, re-read it to ensure that it incorporates process A's change, rewrite a large section including its own change, and so on.
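
    A rough sketch of that lock/re-read/rewrite cycle using flock (the file name flatfile.db is invented for the example):

        use strict;
        use warnings;
        use Fcntl qw(:flock :seek);

        # Every writer takes an exclusive lock, then re-reads the file so
        # it sees changes made by whichever process held the lock before it.
        open my $fh, '+<', 'flatfile.db' or die "open: $!";
        flock $fh, LOCK_EX or die "flock: $!";
        seek $fh, 0, SEEK_SET;
        my @records = <$fh>;                  # re-read under the lock
        # ... modify @records here ...
        seek $fh, 0, SEEK_SET;
        truncate $fh, 0 or die "truncate: $!";
        print {$fh} @records;
        close $fh;                            # close releases the lock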

    In general, flat file databases should only be used if they are very small (I would say less than 100KB, not less than 1MB), and if they only need to be updated by one process and read by others. The 100KB figure is a question of fitting the database content into cache memory, as even having to fall back to RAM can be quite expensive.

    I have had a great deal of success with one process writing common data to a text file, and many (500+) clients reading from the text file. The trick here is to ensure that the operation is atomic from the perspective of the client. The client must always see a complete file at the configured location. Also, the client should not rely on the data being fully up-to-date. From a network perspective, for a small amount of data, it is far more efficient to use a network protocol such as NFS to serve a static 'flat file database' to clients than to serve the data from a generalized database server such as MySQL.
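
    One common way to get that client-side atomicity is to write a fresh copy and rename() it into place, since rename is atomic on POSIX filesystems (the file names and stand-in data here are made up):

        use strict;
        use warnings;

        my @records = ("alpha|1\n", "beta|2\n");   # stand-in data

        # Readers opening 'data.txt' always see either the old or the
        # new file in full, never a half-written one.
        open my $tmp, '>', 'data.txt.new' or die "open: $!";
        print {$tmp} @records;
        close $tmp or die "close: $!";
        rename 'data.txt.new', 'data.txt' or die "rename: $!";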

      I think that one of the reasons so few programmers bother to try and index flat file databases is that they view them as a temporary measure until they can implement a "real" DBMS.

      Your points are well taken, but I have specific reasons for wanting to use flat text files, and it would be beneficial to make the system as efficient as is practical, given the limitations you describe. Some users may well expand on the basic program, so I wish to create a resilient core.

      In this particular case, the database itself would only ever be modified by a single administrator. Modifying existing records would likely be a case of reading the existing file one record at a time and writing each record to a new (modified) file. Once that process is complete, the original file would be renamed to become a backup, and the new file renamed to replace the original file.

      This solves the problem of having to fit the entire database in memory, and also creates a backup. File locking would still be a good idea, of course.
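
      A sketch of that cycle, assuming one record per line (the file names and the modification step are placeholders):

          use strict;
          use warnings;

          # Stream record by record so the whole database never has to
          # fit in memory, then swap files, keeping the old copy as backup.
          open my $in,  '<', 'site.db'     or die "open: $!";
          open my $out, '>', 'site.db.new' or die "open: $!";
          while (my $record = <$in>) {
              # ... apply the administrator's edit to $record here ...
              print {$out} $record;
          }
          close $in;
          close $out or die "close: $!";

          rename 'site.db',     'site.db.bak' or die "backup: $!";
          rename 'site.db.new', 'site.db'     or die "swap: $!";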

      Your idea of using MySQL to create flat text files is intriguing (and one I may consider for future projects), but unfortunately the program is intended for users who do not have access to an external RDBMS.

      __________
      "Every program has at least one bug and can be shortened by at least one instruction -- from which, by induction, one can deduce that every program can be reduced to one instruction which doesn't work." -- (Author Unknown)

Re: Where is the bottleneck in flat file databases?
by Ryszard (Priest) on Jan 23, 2003 at 16:27 UTC
    Scalability, maintenance and integrity.

    Scalability Unless you've got a very tightly scoped project, the inherent creep in the data model will invariably lead to your data set growing. If you want to follow any type of normalisation of your data, it means having multiple text files. This type of design is not going to scale very well.

    Maintenance Adding records, removing records and maintaining the integrity of your record structure can be difficult. You've got two clients requesting access to the same record, both writing; which should go first? We're talking about locking. How do you manage backups? Restore a complete file? What if it's been updated since? We're talking transaction logs.

    Integrity What character do you use as a delimiter, and what if a user wishes to use it for a legitimate purpose? What about a data model which says only the values 'a', 'b' and 'c' are valid, and a user enters 'd'? These kinds of things can be handled very easily using foreign key constraints in an RDBMS.

    Any serious designer will tell you an RDBMS is the best way to go in terms of building data models (assuming your data model is relational). You've got all the access control, the transactions, the ability to properly normalise your data; it's all there, and it's all available freely.

    Now that all the negative stuff is out of the way, there are instances where a flat file may be handy. Read-only use, as MarkM pointed out, is probably the best use of flat files.

      I agree with your points against using flat file databases, but in this case those drawbacks will have little effect.

      In this case I'm hoping to improve on an existing program that is widely used to create web sites. As such, the schema is unlikely to change radically once defined.

      The database will only ever be modified by one person, so there shouldn't be any problems with multiple users writing to the same file. Backups will be made each time a record is modified as described in my reply to MarkM.

      Which character to use as a delimiter can be defined in a separate configuration file. As the data will be used to create web pages, if the user legitimately needs the same character in the text of their page they can use an HTML entity instead.

      I agree that a dedicated RDBMS is usually the way to go, but not everyone has the luxury of having access to one (many free hosts still don't offer MySQL).

      Text files can be an advantage to this system if one or more of the following applies:-

      • The user is on a budget
      • A suitable RDBMS is not available
      • The user has little experience with using databases
      • The data is only ever changed by a single person
      • The user can only execute the script via a hosting service

      Storing the data in plain text files is a big advantage in the last case, as the user has the option to modify the database using a standard text editor.

      __________
      "Every program has at least one bug and can be shortened by at least one instruction -- from which, by induction, one can deduce that every program can be reduced to one instruction which doesn't work." -- (Author Unknown)

Re: Where is the bottleneck in flat file databases?
by cavac (Parson) on Jul 15, 2012 at 20:28 UTC

    Additional info:

    "Classic" databases do their own data managment within a reserved space (usually called a "tablespace") on disk. Internally, this might not differ very much from how a filesystem works, where the data and directory listings (e.g. table rows and indices) are organized in some sort of tree. When you insert, update or delete data, only a small part of it needs to be rewritten.

    Classic databases are usually developed over a number of years; they get tweaked, tuned and optimized, with critical loops rewritten as low-level, high-performance code.

    Last, but not least, many databases are also optimized to cache the most critical and most used data and indices in RAM.

    Unless there are very specific project requirements (compatibility with other programs, customer requirements), you should look into using a "real" database.

    While there are many "big" databases out there, like PostgreSQL, MySQL, Oracle, Microsoft SQL Server, there are alternatives that you might find useful, too, depending on many factors.

    For example, if you want a light, portable system where you can just copy the data file to another computer, DBD::SQLite may be an alternative.
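
    A minimal DBD::SQLite session, just to show the shape of it (the table and file names are invented for the example):

        use strict;
        use warnings;
        use DBI;

        # The whole database lives in one file (site.db), so moving it to
        # another machine is a plain file copy.
        my $dbh = DBI->connect('dbi:SQLite:dbname=site.db', '', '',
                               { RaiseError => 1 });
        $dbh->do('CREATE TABLE IF NOT EXISTS pages (name TEXT PRIMARY KEY, body TEXT)');
        $dbh->do('INSERT INTO pages (name, body) VALUES (?, ?)',
                 undef, 'index', 'Hello, world');
        my ($body) = $dbh->selectrow_array(
            'SELECT body FROM pages WHERE name = ?', undef, 'index');
        $dbh->disconnect;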

    If you are doing mostly key/value stores (where the "values" can be complex datasets as well), you could also take a look into NoSQL databases like CouchDB.

Re: Where is the bottleneck in flat file databases?
by erichansen1836 (Sexton) on Jun 29, 2017 at 12:48 UTC
    See discussion Flat File Database, and dbm.

I have used flat files for enormous "text" data storage in fixed-length records, with the indexing of those records done using binary SDBM files of key/value pairs tied to Perl hashes. The "value" is the record offset to seek to for random access. I normally keep about 5 million records in each flat file and in each SDBM file, where the flat file is just under 4GB and the SDBM file about 1GB in size.
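
A sketch of that arrangement with SDBM_File (the record length, file names and key are illustrative, not the actual production values):

    use strict;
    use warnings;
    use Fcntl qw(O_RDONLY :seek);
    use SDBM_File;

    my $RECLEN = 256;    # fixed record length, chosen for the example

    # The tied SDBM hash maps key => byte offset into the flat file.
    tie my %offset, 'SDBM_File', 'records.idx', O_RDONLY, 0666
        or die "tie: $!";

    open my $fh, '<', 'records.dat' or die "open: $!";
    seek $fh, $offset{'some_key'}, SEEK_SET;   # random access via the index
    read $fh, my $record, $RECLEN;             # read exactly one record
    untie %offset;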
