
Re^3: Design flat files database

by jpl (Monk)
on Jul 15, 2011 at 16:43 UTC (#914649)

in reply to Re^2: Design flat files database
in thread Design flat files database

You could search a few thousand bytes of directory data, even using linear search, in far less time than it would take you to access that data.
Sorry, whilst I'm no expert on *nix filesystems, I think you are wrong.
Well, I did say a few thousand bytes of directory data, which isn't going to apply to 100,000 files, 1 million files or 100 million files. I was mostly trying to move the OP away from the directory-per-digit option. If the IDs are used to cross-reference messages (as they are for messages here in the monastery), then a database, rather than flat files, is even more compelling. I don't know how messages are stored in the monastery, but I strongly suspect it is via a database, not in flat files. You wouldn't want to run monastic searches against unindexed flat files, but such searches would be relatively easy to implement efficiently in most databases.
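
As a rough illustration of the cross-referencing point, here is a minimal sketch using Python's bundled sqlite3 module. The table layout, the column names (id, parent_id, body) and the helper names are assumptions made up for this example; nothing here reflects how the monastery actually stores its nodes.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical message table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES messages(id),"
        " body TEXT NOT NULL)"
    )
    # The index lets "find all replies to message N" avoid a full scan.
    conn.execute("CREATE INDEX idx_messages_parent ON messages(parent_id)")
    conn.executemany(
        "INSERT INTO messages (id, parent_id, body) VALUES (?, ?, ?)",
        [
            (1, None, "root question"),
            (2, 1, "first reply"),
            (3, 1, "second reply"),
            (4, 2, "reply to the first reply"),
        ],
    )
    return conn


def replies_to(conn, message_id):
    """Return the IDs of the direct replies to message_id, in ID order."""
    rows = conn.execute(
        "SELECT id FROM messages WHERE parent_id = ? ORDER BY id",
        (message_id,),
    )
    return [row[0] for row in rows]
```

With flat files, answering the same question would mean either opening every file or maintaining a separate cross-reference by hand, which amounts to building an index anyway.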

I have recently been trying to nudge the OP in the direction of databases, and that's a nudge I see reflected in many of the responses.

Re^4: Design flat files database
by BrowserUk (Pope) on Jul 15, 2011 at 17:14 UTC
    I have recently been trying to nudge the OP in the direction of databases, and that's a nudge I see reflected in many of the responses.

    Indeed. I asked a similar question.

    Why are you settled upon a "flat file database" rather than one of the other options? (RDBMS, Hadoop, NoSQL etc.)

    That said, RDBMSs are pretty shite at handling hierarchical datasets, whereas file-systems are explicitly designed and tuned for exactly that. It would be an interesting exercise to compare the response times for the two using identical, threaded datasets. But then again, neither scales well.

    Facebook apparently use hundreds of sharded MySQL instances ensconced behind 1000s of memcache instances with more (PHP!?!) caching in front of that. They seem to make it work, but it sounds like a disaster waiting to happen to me. But we can probably assume that the OP isn't likely to be requiring that scale of things anytime soon.

    One nice thing about using the file-system is that it is relatively easy to scale it out across multiple boxes, by partitioning the ID space to pretty much whatever level is required. RAIDed disks in each box take care of your hardware redundancy, and each box trickles off updates in the background to remote off-line storage. Far easier to partition and manage than distributed RDBMSs, and no coherency problems.
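
One way the ID-space partitioning described above might look, as a minimal sketch in Python. The host names, the /data root, the .msg suffix and the 1000-files-per-directory fanout are all assumptions invented for illustration; the post does not specify any of them.

```python
from pathlib import PurePosixPath

# Hypothetical storage boxes; a real deployment would load these from config.
HOSTS = ["box0", "box1", "box2", "box3"]
# Assumed cap on files per directory, so each directory stays small.
FILES_PER_DIR = 1000


def locate(message_id, hosts=HOSTS, files_per_dir=FILES_PER_DIR):
    """Map a numeric message ID to a (host, path) pair.

    Consecutive IDs share a directory bucket, and buckets are dealt out
    to the hosts round-robin.
    """
    if message_id < 0:
        raise ValueError("message_id must be non-negative")
    bucket = message_id // files_per_dir
    host = hosts[bucket % len(hosts)]
    path = PurePosixPath("/data") / f"{bucket:06d}" / f"{message_id}.msg"
    return host, str(path)
```

Under these assumptions, locate(914649) returns ("box2", "/data/000914/914649.msg"). Note that with this simple modulo scheme, changing the number of hosts moves most buckets to a different box, so a real setup would want a more stable mapping.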

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
