http://www.perlmonks.org?node_id=433891

BrentDax has asked for the wisdom of the Perl Monks concerning the following question:

Sorry if this is somewhat off-topic, but I'm not sure where else to ask this.

I'm working on a site that involves users posting formatted text documents. They'll actually be using POD with a different L<> syntax for the formatting--it's simple, safe (with =begin/=end/=for disabled), and does what's needed. The structure of the site is something like an Everything Engine Lite--not everything's a node, but it's extremely database-driven.

At this point, I don't know how heavy the load will be, but it could potentially be very popular, so I'm trying to make sure it can handle significant loads. To that end, I'm parsing the POD only once (when a document is posted) and storing the HTML version for later.

My question is, will it be more efficient to store the HTML versions of these POD documents in the database or in a separate file? (Realize that if it's on disk, I can read it in line-by-line, whereas I believe keeping it in-database would essentially mean slurping everything into memory.) These documents will probably range from 2KB to 100KB; I can probably cap them at 65K (the size of a MySQL TEXT column).
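For concreteness, here's roughly what the parse-once step looks like. This is just a sketch: Pod::Simple::HTML stands in for the real converter with the modified L<> syntax, and the documents table and credentials are placeholders.

    use strict;
    use warnings;
    use DBI;
    use Pod::Simple::HTML;

    # Render the posted POD to HTML once, at submission time.
    sub pod_to_html {
        my ($pod) = @_;
        my $parser = Pod::Simple::HTML->new;
        my $html   = '';
        $parser->output_string(\$html);
        $parser->parse_string_document($pod);
        return $html;
    }

    # Store the rendered HTML in a MySQL TEXT column (65,535-byte cap).
    sub store_document {
        my ($dbh, $title, $pod) = @_;
        my $html = pod_to_html($pod);
        die "rendered HTML exceeds TEXT column limit\n"
            if length($html) > 65_535;
        $dbh->do('INSERT INTO documents (title, html) VALUES (?, ?)',
                 undef, $title, $html);
    }

    my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'pass',
                           { RaiseError => 1 });
    my $pod = do { local $/; <STDIN> };    # slurp the submitted document
    store_document($dbh, 'Example title', $pod);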

Thanks in advance for any advice.

=cut
--Brent Dax
There is no sig.


Re: Large chunks of text: in-database or in-filesystem?
by Tanktalus (Canon) on Feb 24, 2005 at 00:42 UTC

    Once I'd gone part-way to a database, I would go all the way. There's really no point in carrying the overhead of two separate data interfaces.

    Yes, a database is usually slower than the filesystem. The difference is that a database is built to scale: it handles concurrency and scaling for you automatically. (Well, I'm used to commercial DBs... I'm not sure how MySQL or PostgreSQL stack up here - not a flame, I honestly don't know.)

    Let's say your website got really, really popular and you want to handle the load better. Upgrading hardware is one option - it means backing up and restoring onto the new machine, and you also have to make sure you get all your extra files across. Instead, you may just want to add a second machine to the fray and use IP round-robin (or any other method) to spread the load. I'm also presuming you spring for gigabit ethernet to connect the boxes to each other on a private network - your regular internet connection shouldn't see any traffic between your machines.

    DB & filesystem

    You're looking at a number of options. I'm going to deal with the filesystem first, since the DB side is covered in the DB-only section below.

    1. Replication of files from one node to another. This could be done via rsync (see the example after this list), but it means some files won't be visible on one node until the next rsync run. Risky.
    2. Share via NFS. NFS isn't exactly the most reliable software out there, but then again, we're serving all of this over HTTP anyway. However, the NFS server is going to get hit hard: every read and every write goes over NFS to the server. At that point its speed is likely comparable to a database server's.
    3. Share via NAS or SAN. Both machines access the files directly. I'm not entirely sure how locking works on these ... presumably it works the same as if it were local. At this point, you'd put your database on the NAS or SAN, too. Expensive, though.
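    For option 1, the transfer itself is simple enough - something like the following, run from cron on a short interval (the host and paths are placeholders) - but anything posted between runs stays invisible on the other node until the next pass:

        rsync -az --delete /var/www/docs/ web2:/var/www/docs/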

    Upgrading again, you might make one machine both the NFS server and the DB server, with two machines acting as web/CGI servers. That may mean more moving around, and you're now running two services (NFS and DB) on one box. It's not really sounding compelling to me.

    DB-only

    Put everything in the DB. One machine acts as the DB server, the other as a client, and both as web servers. You can scale this as far as you want: create a cluster of DB servers that acts as a single server, and a cluster of web servers that talk to the DB cluster. Nearly unlimited scalability here. Put your DB on big iron if you want or need to. Secure the whole thing by closing down unneeded ports, including NFS.
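    The web-tier code doesn't even care where the DB lives - only the DSN changes as you move or scale it. A minimal sketch (hostname, table, and credentials are placeholders):

        use strict;
        use warnings;
        use DBI;

        # Any web node reaches the DB box (or cluster) by hostname over
        # the private network; scaling the DB tier only changes the DSN.
        my $dbh = DBI->connect('dbi:mysql:database=site;host=db1.internal',
                               'user', 'pass', { RaiseError => 1 });

        my $doc_id = shift @ARGV;
        my ($html) = $dbh->selectrow_array(
            'SELECT html FROM documents WHERE id = ?', undef, $doc_id);
        print $html;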

    You can manage the web servers completely independently of each other, and the DB server(s) completely independently of the web servers - even when they're on the same machine.

    To me, the scalability of the database makes it the clear winner; short-term efficiency isn't even the contest that matters.

    Disclaimer: I don't do this for a living :-)

Re: Large chunks of text: in-database or in-filesystem?
by ikegami (Patriarch) on Feb 23, 2005 at 23:55 UTC

    In terms of cons (off the top of my inexperienced head):

    If you use files, you have twice as many things to manage. Where you had database-handling code, you now need file-handling code as well. If you delete a record, you also have to delete a file. You have to back up both the database and the files. If you add load sharing, you have to share both the database and the files. In other words, twice as many things can go wrong. Also, a database adds (limited) searchability.
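    To make that concrete, a delete in the split design has two steps that can fail independently (a sketch; the table and path scheme are made up):

        use strict;
        use warnings;

        # Split design: metadata in the DB, rendered HTML on disk.
        # Two independent steps means two independent failure modes
        # that you must keep consistent by hand.
        sub delete_document {
            my ($dbh, $id) = @_;
            $dbh->do('DELETE FROM documents WHERE id = ?', undef, $id);
            my $path = "/var/www/docs/$id.html";    # hypothetical layout
            unlink $path
                or warn "row deleted, but $path remains: $!\n";
        }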

    The cons of using the database might include increased database workload and data size limitations.

Re: Large chunks of text: in-database or in-filesystem?
by mr_jpeg (Beadle) on Feb 23, 2005 at 23:26 UTC
    What are you going to need?

    On one hand, serving files straight from the filesystem will generally be faster: you avoid the overhead of the database layer.

    On the other hand, keeping the file in the db buys you the ability to search quickly. That's pretty dang useful.
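    For instance, with MySQL you can put a FULLTEXT index on the column (MyISAM tables support this on TEXT) and search with a single query. A sketch - table, column, and credentials are placeholders, and note that searching stored HTML will match markup too, so you might index the raw POD instead:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'pass',
                               { RaiseError => 1 });

        # Assumes: ALTER TABLE documents ADD FULLTEXT (html);
        my $query = shift @ARGV;
        my $rows  = $dbh->selectall_arrayref(
            'SELECT id, title FROM documents WHERE MATCH(html) AGAINST (?)',
            undef, $query);
        print "$_->[0]: $_->[1]\n" for @$rows;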

    As far as accessing the text data, it doesn't sound like either method is a dealbreaker. 65K isn't /that/ much.

    --
    jpg