
Re: Design flat files database

by jpl (Monk)
on Jul 15, 2011 at 10:55 UTC ( #914545 )

in reply to Design flat files database

If you are using spinning disks (rather than solid-state storage) to store your files and directories, a useful rule of thumb is that a 7200 rpm disk spins 120 times a second, so each byte rolls by every 0.0083 seconds. On average, you can do no better than about 4 milliseconds to fetch the byte(s) you are after in a random access. (You can get a lot of associated bytes with the same read, so data density helps with transfer time, but not at all with disk latency.) With processor cycle times in the neighborhood of a nanosecond, you can execute a lot of instructions in 4 milliseconds: you could search a few thousand bytes of directory data, even using linear search, in far less time than it would take to access that data. So don't make your directory hierarchy too deep; each subdirectory is going to cost you at least 4 milliseconds to read. A few levels may get cached, but that's equally true of shallow hierarchies. Keep disk accesses in mind when you design your system.
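The arithmetic above can be sketched as a back-of-envelope calculation (the 1 ns per instruction is an assumed round figure, as in the text):

```perl
use strict;
use warnings;

my $rpm           = 7200;
my $rotation_s    = 60 / $rpm;        # one full rotation: ~0.0083 s
my $avg_latency_s = $rotation_s / 2;  # average wait is half a rotation: ~4.2 ms
my $instr_s       = 1e-9;             # assumed ~1 ns per instruction

# How many instructions fit inside one average rotational delay?
my $instructions_per_access = $avg_latency_s / $instr_s;

printf "rotation: %.4f s, average latency: %.4f s\n", $rotation_s, $avg_latency_s;
printf "instructions per random access: ~%.0f\n", $instructions_per_access;
```

That works out to roughly four million instructions per random disk access, which is why a linear scan of a few thousand directory bytes is negligible next to the seek that fetched them.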

Re^2: Design flat files database
by BrowserUk (Pope) on Jul 15, 2011 at 15:47 UTC
      You could search a few thousand bytes of directory data, even using linear search, in far less time than it would take you to access that data.
      Sorry, whilst I'm no expert on *nix filesystems, I think you are wrong.
      Well, I did say a few thousand bytes of directory data, which isn't going to apply to 100,000 files, 1 million files or 100 million files. I was mostly trying to move the OP away from the directory-per-digit option. If the IDs are used to cross-reference messages (as they are for messages here in the monastery), then a database, rather than flat files, is even more compelling. I don't know how messages are stored in the monastery, but I strongly suspect it is via a database, not in flat files. You wouldn't want to run monastic searches against unindexed flat files, but that would be relatively easy to implement (efficiently) in most databases.

      I have recently been trying to nudge the OP in the direction of databases, and that's a nudge I see reflected in many of the responses.

        I have recently been trying to nudge the OP in the direction of databases, and that's a nudge I see reflected in many of the responses.

        Indeed. I asked a similar question.

Why are you settled upon a "flat file database" rather than one of the other options (RDBMS, Hadoop, NoSQL, etc.)?

        That said, RDBMSs are pretty shite at handling hierarchical datasets, whereas file-systems are explicitly designed and tuned for exactly that. It would be an interesting exercise to compare the response times of the two using identical, threaded datasets. But then again, neither scales well.

        Facebook apparently use hundreds of sharded MySQL instances ensconced behind thousands of memcache instances, with more (PHP!?!) caching in front of that. They seem to make it work, but it sounds like a disaster waiting to happen to me. But we can probably assume that the OP isn't likely to require that scale of things anytime soon.

        One nice thing about using the file-system is that it is relatively easy to scale out across multiple boxes, by partitioning the ID space to pretty much whatever level is required. RAIDed disks in each box take care of your hardware redundancy, and each box trickles updates in the background to remote off-line storage. Far easier to partition and manage than a distributed RDBMS, and no coherency problems.
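The ID-space partitioning idea can be sketched as follows (the box count, box names, and maximum ID here are hypothetical, chosen just to illustrate the range split):

```perl
use strict;
use warnings;

# Map a numeric message ID to one of N storage boxes by splitting
# the ID space into contiguous ranges (hypothetical 4-box setup).
my $boxes    = 4;
my $id_space = 1_000_000;          # assumed maximum ID
my $per_box  = $id_space / $boxes; # 250_000 IDs per box

sub box_for_id {
    my ($id) = @_;
    my $box = int( $id / $per_box );
    $box = $boxes - 1 if $box >= $boxes;  # clamp IDs at the top of the range
    return "box$box";
}

print box_for_id(123),     "\n";   # box0
print box_for_id(750_000), "\n";   # box3
```

Contiguous ranges keep related IDs on the same box and make it trivial to add a box by re-slicing the range, at the cost of uneven load if recent IDs are the hottest; a hash of the ID would spread load more evenly but scatter related records.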

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Design flat files database
by AlfaProject (Beadle) on Jul 15, 2011 at 11:22 UTC
    That's the point: I'm searching for the sweet spot of how deep I need to go.
    Also, another thing I was thinking about: each time a user folder is accessed, the filesystem has to walk the whole chain, e.g. /user_db/1/2/3/123.
    What if I put a link to the DB in the root directory for faster access? What do you think about that? Thanks

      If your system is busy, the top level directory is likely to end up in buffer cache, no matter where it is rooted. (If it's not busy, a few extra milliseconds won't matter). Take a look at directory sizes for 1-digit/2-digit/3-digit prefixes. A "sweet spot" would be directories that (just) fit into whatever block size your filesystem uses, often 4KB. I'm guessing that will be 2- or 3-digit prefixes, depending on how dense the prefixes are and how the file system structures directories. You might let the top level directory get a bit larger, on the grounds that frequent access will keep it pinned in buffer cache.

      Don't go nuts with premature optimization. Make it easy to alter the structure of the directories, like having a routine to return an array of components corresponding to the directory entries. Then measure performance with a few alternatives.
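The suggested routine that returns an array of directory components might look like this (a minimal sketch; the function name, the depth/width parameters, and the assumption that the ID fits in depth × width digits are all mine):

```perl
use strict;
use warnings;
use File::Spec;

# Return the directory components for a message ID, so the hierarchy's
# depth and width can be changed in one place while measuring alternatives.
sub id_components {
    my ( $id, $depth, $width ) = @_;
    $depth //= 2;    # number of subdirectory levels
    $width //= 1;    # digits per level

    # Zero-pad so short IDs still produce a full set of components
    # (assumes the ID has at most $depth * $width digits).
    my $padded = sprintf '%0*d', $depth * $width, $id;
    my @parts  = unpack "(A$width)*", $padded;
    return ( @parts, $id );    # leaf entry keeps the full ID
}

my @parts = id_components( 123, 3, 1 );
print File::Spec->catdir( 'user_db', @parts ), "\n";  # user_db/1/2/3/123 on Unix
```

Because every caller goes through this one routine, switching from, say, three 1-digit levels to two 2-digit levels is a one-line change, which makes the "measure a few alternatives" advice above cheap to follow.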

        Nice! Thanks! What about opening a file?
        Does it really matter whether I store all replies for a post in one file, or a file for each reply?
        In most cases they will be grabbed together from the database, but sometimes they will be edited or removed by users.
        I mean, is opening a file slow like directory access, or is it fast?
        And a final question :) Is it really better to make user folders named by ID and store the username inside the data file, or just to use the username as the directory name?
        Thanks a lot! This forum rocks :)
