Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Large chunks of text - database or filesystem?

by TedPride (Priest)
on Mar 19, 2005 at 08:40 UTC ( #440889=perlquestion: print w/replies, xml ) Need Help??

TedPride has asked for the wisdom of the Perl Monks concerning the following question:

This is probably a stupid question, but what's the best way to store text from forum posts? Assume that there's a large number of posts, that text size may vary from a few lines to as much as 10K or so, that the maximum number of files allowed inside a folder is 1024, that efficient searching is a consideration. I realize that with a double-layered folder system and 16-bit hashing, I can store probably around 33 million files without seriously affecting lookup time (256 * 256 * 512 - I'm allowing a 50% margin), but backing up or transferring these files would be a major pain, and I'd have to have at least one merged file for searching purposes anyhow. But what about database searches? Seems to me that there I'd also have to have a marged copy of the text there in a processed (punctuation and extra spaces removed) format, and just searching the text as it resides in the database won't be much help. So it's basically a matter of storage efficiency. Will storing the posts in the database instead of as individual files take up a lot more space, if so how much, and which is more efficient for lookup / edit? Which would you pick and why?

I know the following node asks pretty much the same question, but there isn't enough information, specifically on the issue of storage space for database vs filesystem, for me to make a decision.
http://www.perlmonks.com/?node_id=433891

Incidently, what if I stored all the post text in a single file outside of the database, using fixed size sectors and just storing the sector start and number of sectors used for each post in the database? Given a reasonably well-designed sector size (factor or multiple of disk sector?) I could reuse sectors from "decayed" posts (new post needs 3 sectors, look for decayed post with 3 sectors) without wasting large amounts of disk space or significantly increasing lookup times. Is this worth the trouble, or does mySQL internally do the same sort of thing for large chunks of text? In other words, if the max size of the text is 64K, and the post is actually only 2K, is the extra 62K wasted? Would it be better to use, say, 8K sectors and only waste 6K?

  • Comment on Large chunks of text - database or filesystem?

Replies are listed 'Best First'.
Re: Large chunks of text - database or filesystem?
by bart (Canon) on Mar 19, 2005 at 10:43 UTC

    I can store probably around 33 million files without seriously affecting lookup time (256 * 256 * 512 - I'm allowing a 50% margin), but backing up or transferring these files would be a major pain

    Gee, I don't know, it would seem that making incremental backups would be easier than with a normal database.

    But what about database searches? Seems to me that there I'd also have to have a marged copy of the text there in a processed (punctuation and extra spaces removed) format, and just searching the text as it resides in the database won't be much help.
    I doubt if it'd help much. grep is a very efficient way to search inside a file, and its regex syntax seems flexible enough to ignore punctuation and whitespace.

    But you definitely should be thinking about about indexed searching, where you make a reversed index of important keywords and a list of every post where it can be found.

    Will storing the posts in the database instead of as individual files take up a lot more space

    As a general rule, databases are wasteful with space. They all seem to be designed as if disk space was free: if your disk isn't large enough to hold the data, you can just throw more hard disks at it. So no, I don't expect databases to be space efficient, or even recycle wasted disk space by itself. Whatever you do, your home grown solution will likely be a lot more compact.

    But be prepared to put a lot of effort in it. To make a decent database-like system from scratch is a lot of hard work, requiring a lot of nifty home-grown solutions, many of those which are simply part of a standard database system.

Re: Large chunks of text - database or filesystem?
by cyocum (Curate) on Mar 19, 2005 at 10:09 UTC

    I would use a database for your problems. The reason? Ease of transfering the information from one machine to another. All the information would be stored in a few files and can be transfered to a new database much more easily. Hard drive space is releatively cheap these days and having millions of files all over the place is ineffiecant in my opinion (as always, I could be terribly wrong).

    For effecient searching, I would use Plucene. Internally, MyISAM tables do have the ability to compress but only for read-only tables. I would run Plucene as a cron or scheduled task or have it triggered by an insertion. Do not worry Plucene (at least as far as I know for Lucene from which Plucene came) should store partial indexes in memory before flushing to disk so it should be pretty quick about additions to the index.

    About the storage requirements for text, check the MySQL documentation here. That should let you know how much disk space you are wasting with each post.

    I hope this helps.

Re: Large chunks of text - database or filesystem?
by jhourcle (Prior) on Mar 19, 2005 at 15:34 UTC

    You are still missing too much information to make a decision

    1. What is the maximum posts per minute you expect?
    2. What is the maximum reads per minute?
    3. How many overall posts per day? (estimated growth rate)
    4. How large do you expect the messages to be?
    5. Is it a write once system, or will there be re-editing of messages?
    6. What is your hardware budget for the project, or is there fixed hardware?
    7. What is the required uptime?
    8. Are you going to have an internal search engine?
    9. If so, what sort of information are you going to search on? (metadata, or the message itself?)
    10. What are your disaster recovery requirements?
    11. Do you need to support transactional concurency?
    12. What are your time constraints?
    13. Do you already have a database to use for this purpose?
    14. Do you already have experience with databases?

    Moving lots of files around is not a problem. Tar and rsync are your friends. The only problem with files comes when you're trying to work with more files at the same time than your OS supports. Databases for file storage are basically just ways of getting around those problems, and keeping extra metadata catalogs on hand to find the required information in a more efficient manner.

    Depending on just what the requirements are, I might go with the file system for the message bodies, and a database for the metadata (posting time, who posted, thread tracking, etc). I might also go with a heirarchical database, rather than a relational database, if that fit well with the anticipated characteristics. I might also look at repurposing an NNTP server, rather than starting from scratch.

    Personally, I wouldn't optimize for storage space, unless you're not expecting anyone to read the posts. I'd optimize for reads/writes. Depending on the nature of the forum, I might have an aging system, that moves the entries from a read or write optimized system to an alternate storage mechanism for long-term storage.

Re: Large chunks of text - database or filesystem?
by TedPride (Priest) on Mar 19, 2005 at 13:05 UTC
    I don't have a lot of leeway in what I can do, though. This would be for a remote hosting account without access to cron (though I can fake that) or huge amounts of disk space (probably around 300-500 MB). Assuming the worst-case scenario of 300 MB and around 65K per post (64K for blob + necessary fields), we're talking a max number of around 4600 posts, which isn't in my opinion acceptable.

    Does a blob require the max amount of space regardless of the size of its contents?
    If yes, is there a way to specify a fixed length field of more like 10K in size?

      Assuming MySQL, BLOB types are variable length. The length you store + 1, 2, 3 or 4 bytes to store the length.

      Column type Storage required CHAR(M) M bytes, 1 <= M <= 255 VARCHAR(M) L+1 bytes, where L <= M and 1 <= M <= 255 TINYBLOB, TINYTEXT L+1 bytes, where L < 2^8 BLOB, TEXT L+2 bytes, where L < 2^16 MEDIUMBLOB, MEDIUMTEXT L+3 bytes, where L < 2^24 LONGBLOB, LONGTEXT L+4 bytes, where L < 2^32 ENUM('value1','value2',...)1 or 2 bytes, depending on the number of en +umeration values (65535 values maximum) SET('value1','value2',...) 1, 2, 3, 4 or 8 bytes, depending on the num +ber of set members (64 members maximum)

      Which should ease your concerns a little. You could improve things a bit by zipping the data on the way in, and unzipping it on retrieval, but that comes at the cost of being able to use the (somewhat slow and primitive) SQL search facilities.

      Then again, you could index (using perl rather than the DB), the words in each post when it is created, which would allow you to locate the posts via keywords very quickly. You could store the index in the DB either as Storable hash or as a table, though the latter might be costly in terms of space and time.

      And, once you have built an index of the words in a post, why store the words? If you fully invert your posts, you could probably get away with storing a 24-bit number to represent the "words" in the posts", which would probably represent an overall compression as you would be discarding the spaces.

      The same file/table/frozen hash that you use to relate index entries to words, could also store post numbers where the words appear. You now have the basis of fairly well compressed, highly indexed datastore the allows very fast search and retreival.

      Just musing :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?
      BLOBs are variable length (at least in MySQL, blob storage requirements). There are also MEDIUMBLOBs which hold up to 16MB and LONGBLOBS which hold up to 4GB, all as variable length fields requiring at most 4 bytes more than the data to store the length of the data.

      If you happen to be fortunate to have a hosting account with MySQL v4.1.1 or greater you might be able to use the (de)compress() function to (de)compress the data on the fly in MySQL. Or for that matter, you could just compress it using perl before inserting it in the table.

      For searching, plucene is a really cool product but I've only used it in relatively small settings (i.e. indexes of under 30 MB). I've heard differing opinions on plucene's speed so it my not be an acceptable option for your case. MySQL also offers full-text searching functionality since v3.23.23 but that would prevent you from doing compression of the data.

      As another alternative, I recently heard of another effort (still in alpha) that was inspired by Plucene but with the goal of overcoming Plucene's performance problems. It's called Kinosearch. I have no actual experience with it but it appears to be promising (I'd be interested in hearing from anyone else that's actually used it).
Re: Large chunks of text - database or filesystem?
by TedPride (Priest) on Mar 19, 2005 at 23:47 UTC
    1. What is the maximum posts per minute you expect?

    I honestly don't know. The forum isn't currently that busy, but it's an old modified WWWBoard system and probably isn't getting many uses for that reason. The entire site can get like 1400+ visits daily, and I'd be hoping to improve that with a restructure of the site.

    2. What is the maximum reads per minute?
    3. How many overall posts per day? (estimated growth rate)

    Again, there's no way to estimate at this point how many reads or writes there will be.

    4. How large do you expect the messages to be?

    Most under 2K, some as large as 5K. I doubt we'll need more than that, though the capability should be there.

    5. Is it a write once system, or will there be re-editing of messages?

    Re-editing.

    6. What is your hardware budget for the project, or is there fixed hardware?

    Remote hosting account, with fixed hardware. Though if we had to, we could upgrade to us having the entire server to ourselves. That would only be if the site in general got a good deal more popular than it is now, however.

    7. What is the required uptime?

    I wouldn't want it down for more than 5 minutes a day, at most.

    8. Are you going to have an internal search engine?

    Yes.

    9. If so, what sort of information are you going to search on? (metadata, or the message itself?)

    Message itself. Ideally, the messages would be preprocessed to lowercase everything and remove unnecessary punctuation, and there'd be a small index for the most popular keywords. The search data doesn't have to be real-time - it can be generated every day or two, if necessary.

    10. What are your disaster recovery requirements?

    The site has regular daily backups as part of the hosting service, and we can get a restore if we destroy something by mistake.

    11. Do you need to support transactional concurency?

    I'm assuming there will only ever be one person editing or deleting a specific post. We might want to generate the threads as web pages, however (from accumulated post data), and these would need some form of locking / unlocking so that two people posting to a thread or editing a thread wouldn't conflict.

    12. What are your time constraints?

    At this point, none. I'm willing to spend a lot of time if necessary to get an efficient system going that will last a long time.

    13. Do you already have a database to use for this purpose?

    We have a mySQL database. I don't know what version of mySQL, however.

    14. Do you already have experience with databases?

    I've used mySQL a fair amount, though not much with Perl and never to store large amounts of text.

      No one has asked the obvious question: Why are you doing this? I once tried to write some customized forum software and quickly realized that it'd be more trouble than it was worth. Why not go with one of the many available solutions?
        Not all of the shelf software integrates well or has good internals. If this is an integrated piece of software, then doing it by hand isn't hard. If it isn't, it's usually a hack to get them seperated. Heck, forums aren't hard to write. User accounts, posts, replies and a listing. Depending on what inefficencies you run into, you adress them, like caching certain info.

      Well, I guess it's making sense why you're trying to optimize for space if you're running from a hosting account... but I'd still have to ask why that's your main consideration. I'm guessing that if you're trying to get more people to use your site, you're going to need to add stuff for the users, not for you. (and your users probably aren't going to be discussing the benefits of data storage techniques ... well, the might be, after all, I'm doing it right now).

      Given what you've mentioned, I'd probably store the metadata in a database, just because I would feel comfortable coding with that backend. However, I would try to make sure that I/O is encapsulated, so that I could change the storage system at some later time, if I wished. (or I could recycle it to be used elsewhere)

      But I'm going to have to agree with your decision to stop using WWWBoard. I mean, it's 19105, and you're running a program which tells you to chmod a directory to 0777. It was 'ast modified' in 1995, and is still in 'alpha'. (although, the copyright is marked 2002... I guess he knows what the important things are to update.) Hmm...okay, I should probably stop flaming Matt's Script Archive, or this post will go on for pages. -- I will give him credit that it was very nice for him to want to share with the public... and that's about the only nice thing I can say about it.

Re: Large chunks of text - database or filesystem?
by TedPride (Priest) on Mar 20, 2005 at 04:47 UTC
    Using a generic forum has the following pros and cons:

    Pros:
    It's easy to set up
    It's more or less bug-free
    It has a vast number of features

    Cons:
    It's not easy to add new features
    It's not nearly as efficient as it could be
    Everyone else is using it, so you don't look original

    It'll probably be easier in the long run to roll my own. And more educational.

    EDIT: I would use a generic forum for a small site, but a larger site like ours is going to get regular spam attacks and a heck of a lot more posts, and I need to be able to get more hands-on with the inner workings. I can't afford to take a week to fix an immediate problem.

    EDIT: In Matt's defense, he did update WWWBoard to fix security holes, though the script still only works for single boards and still looks ugly. I had to rewrite like half the code to add multi-board support, a better design, more elaborate error messages, and checking against lists of banned words and IPs. It was my first introduction to Perl.

      Don't you wish you had known about the Albert Fong Device? ;) I guess not, because then you would never have learned Perl :p
Re: Large chunks of text - database or filesystem?
by mattr (Curate) on Mar 22, 2005 at 16:10 UTC
    My 2 cents having written forum software for a popular website and seen the wwwforum etc. (though I don't remember it too well right now).

    If you are doing it from scratch, keep it simple, powerful and flexible. If doing it all over again I'd do it in Mysql (which I know). It is easy to back up, it's easy to add information about parents/children for threading, you can do a backup with a single command, you can use LIKE to search it. I've used filesystem-based things in the past and when (well this is a while ago) they didn't blow up the box or fail to port to another OS, etc., it was hard to guarantee integrity (who will miss a few files that got deleted by accident) and conceptually difficult to access. Sure you can lift weights but a database is easy.

    Recently I used Class::DBI and mysql to build an appointment tracking system with messaging sort of like forums. Only possible with a DB really. If you need to add something to a textfile hierarchy it gets kludgy.

    Storage efficiency isn't such a big deal I don't think, because of variable field lengths in a database (VARCHAR or VARCHAR BINARY or BLOB, read the online manual at mysql.com).

    The speed overhead might definitely be an issue with say Class::DBI (probably okay for a simple forum).

    Be sure to spend much time on your admin interface so you can take things off that are inappropriate and I also had to search for "badwords". Even ban users and log ips. Think about using HTML::Template and modularize so you aren't writing the same display code 4 different times. Don't put HTML in heredocs in the code. I'm saying this because I started with someone else's code and tore most of it apart. It was badly done and I even had to rewrite the file saving/locking. Do yourself a favor and start clean and secure.

    To save work I made the admin log in to get an admin cookie set to let him edit the live threads, it worked.

    Also I think that over time you will need to retire posts or maybe even archive by month or season. Say you have 10,000 posts, you don't want to tell people you have 200 pages to surf and show that many navigational links right? With a DB it is easier to sort by date because it's orthogonal. Even paging is easier (a Pager for clasists). With text files you are going to be building all kinds of hash index files and it will be a mess in your head all the time. Go with a DB and stop thinking about all this nitty gritty. You need to focus on the admin and real world issues that come up which unless you are really experienced with this will likely hit you like a sack of bricks. (If you have run a busy IRC channel then maybe you already know a bunch of this).

    Finally you probably want to hide email addresses but keep them in the DB. For example we had the star of the site, a sort of Martha Stewart of Japan (well now she is, wasn't a few years ago), email people back if she wanted to do so. Also if I remember right our board allowed a user to store a "save password" that s)he could use to delete his post (well maybe I tore that out). Of course the hierarchy would have to be mended. So clients might ask you for this kind of thing one day.

    Also do a lot of user testing before the launch. Please. All kinds of little thing will bite you.

Re: Large chunks of text - database or filesystem?
by TedPride (Priest) on Mar 20, 2005 at 23:44 UTC
    I don't think it was available when I set up the board system originally. There were versions of WWWBoard with expanded features, such as multi-board support, but all the good ones cost money and I'd have had to modify them further anyway.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://440889]
Approved by Corion
Front-paged by friedo
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (4)
As of 2022-01-28 19:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    In 2022, my preferred method to securely store passwords is:












    Results (73 votes). Check out past polls.

    Notices?