Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
You could search a few thousand bytes of directory data, even using linear search, in far less time than it would take you to access that data.

Sorry, whilst I'm no expert on *nix filesystems, I think you are wrong. At least as far as ext2 goes; and ext3 used in its default manner.

The problem is that in the former, and by default in the latter, the filename to inode mappings stored in the directory files are a single linked list that must be searched from the top each time.

Directories

Each directory is a list of directory entries. Each directory entry associates one file name with one inode number, and consists of the inode number, the length of the file name, and the actual text of the file name. To find a file, the directory is searched front-to-back for the associated filename. For reasonable directory sizes, this is fine. But for huge large directories this is inefficient, and ext3 offers a second way of storing directories that is more efficient than just a list of filenames.

So, not only does that mean that in a 1 million file directory, that you need to inspect 500,000 files on average each time, it also means that the VFS is unlikely to be able to retain the entire directory file in cache, which means frequent re-reads.

Ext3 has a mechanism (Htree) for improving this:Creating 100,000 files in a single directory took 38 minutes without directory indexing... and 11 seconds with the directory indexing turned on.. The trouble is, very few people use it.

A few years ago, (from memory, on BSD), moving ~100 million files from a single directory to a three level hierarchy improved the time taken to locate and read small (a few kb) files from whole seconds to 10 of milliseconds. Try it yourself to see the difference it makes.

By reducing the size of the directories to 100 or 256, the entire directory at any level can fit into a single block. The root level effectively gets locked in cache making the first reduction happen in microseconds. And for an application like the OPs where most accesses will be for the latest message IDs, the one or two second level directories that will be most accessed will also tend to remain cache resident. So in use, most accesses will not need to hit the disk at all until it comes to reading or writing the top level file.

The benefits are far less when there is no locality of reference -- ie. the files are accessed randomly -- but for the OPs application, they should be tangilble and very worthwhile.

I also agree that going too deep negates the benefits.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^2: Design flat files database by BrowserUk
in thread Design flat files database by AlfaProject

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (7)
As of 2024-04-23 19:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found