Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

Perl solution for storage of large number of small files

by isync (Hermit)
on Apr 30, 2007 at 07:32 UTC ( #612707=perlquestion: print w/replies, xml ) Need Help??
isync has asked for the wisdom of the Perl Monks concerning the following question:

I need to store a LOT of small files around 40K each.

In order to circumvent filesystem limits regarding the number of files, and keeping it all manageable and backup-able I decided to use a solution with a set of tied DB_File hashes as a kind of pseudo-filesystem, each file holding thousands of small recordsets/the files.
Currently I use 16 hashes, round-robin storing files into these buckets. But with about 100,000 files stored into the set of 16 50MB files, I get speed issues. (DB_File seems practical only for about 5,000 records after all, getting slower data insert times on growing files, under good conditions 0.04 seconds each and up).

Does anyone know about a fast perl solution, a module or a producedure to manage 1,000,000+ files in a relatively small set of data-buckets? Or is the large quantity of files no problem when I sort them away in enough subdirectories (am I fighting demons here..)?
I would use a solid file structure, like a .gz, but I need to be able to delete single files from the archive and re-allocate free'd space...

Another thought:
Tests show: 0.04 seconds per insert seems to be the barrier imposed by the filesystem for writing to disk. Same as writing a ~40K file. Larger files seem to write quicker (~200K in 0.02sec). Does this ring a bell for someone and point to a solution?
  • Comment on Perl solution for storage of large number of small files

Replies are listed 'Best First'.
Re: Perl solution for storage of large number of small files
by jbert (Priest) on Apr 30, 2007 at 09:15 UTC
    My first suggestion would be to go with the filesystem, opening up levels of subdirectory as needed. (My second note would be that if this is an email storage system, there is a lot of prior code on this, check out some of the free IMAP servers).

    Note that not all filesystems are created equal. They each make different tradeoffs and some will have much better performance for this load than others.

    A journalling filesystem should perform better for a write-heavy load than a non-journalling fs. Disks are fastest when you are writing (or reading) lots of sequential data. File creation and updates will go to the (hopefully sequential) journal and should help a lot. If you simultaneously have a heavy read load, you'll lose a lot in performance - due to seeking - unless you can satisfy most reads from cache (which is, for example, the case in an MTA email queue under most sensible queueing strategies).

    Your measurements showing that writing larger files is quicker is surprising. Can you try a 'strace' on the two cases in question to see if the application code is doing anything different in the two cases?

    Can you tell any more about the application and expected access patterns? It sounds interesting.

      Actually I am thinking about completely switching over to the filesystem-only approach and stop toying around with this data-buckets idea. BTW: What is the maximum number of files on my ext3?

      I've got a client-server architecture of scripts here - no emails, no imap server... It's a research project. (But challenges seem similar - thanks for the hint!)
      A data-generation script gathers measuring data and produces the 40K-120K pakets of output, while a second script takes this output and makes sense of it thus enriching the meta-data index. Both scripts are controlled by a single handler which keeps the meta-data index and stores the data-pakets (enabling us to run everything in clusters). And that handler is where the bottleneck is. So I am thinking about taking off the storage part from the handler and let the data gatherer write to disk directly via NFS.

      NFS was also the solution in the "larger files is quicker" paradox. My development machine tied a hash via NFS which resulted in this. Now, actually running the script on the server told me that the tie is always fast. The insert is fast most of the time (although every few cycles, when DB_File expands or so, it slows down..). But the untie takes forever on growing files...

      The expected access pattern is mostly plain storage (gatherer), followed by one read on every stored paket (sense-maker). Then every few days an update/rewrite on all pakets involving possible resizing(gatherer again).

      The "new toy" idea is now to use a set of disks tied together via NFS(distributed) or LVM(locally), mounted on subdirs building a tree of storage space leaves (replacing my few-files approach).
        The maximum number of files on a filesystem is limited by the number of inodes allocated when you create it (see 'mke2fs' and the output of 'df -i'). You can also tweak various settings on ext2/ext3 with tune2fs.

        As you probably already know, written data is buffered in many places between your application and the disk. Firstly, the perlio layer (and/or stdio in the C library) may buffer data - this is controlled by $| or the equivalent methods in the more modern I/O packages.

        Flushing will ensure the data is written to the kernel, but it won't ensure the kernel writes it to disk. You need the 'fsync' system call for this (and/or the 'sync' system call). You can get access to these via the File::Sync module.

        Note that closing a filehandle only *flushes* it (write userland buffers), it does not *sync* it (write kernel buffers).

        (If you're paranoid and/or writing email software, you may also want to note that syncing only guarantees that the kernel has successfully written the data to the disk. Most/all disks these days have a write buffer - there isn't a guarantee that data in that write buffer makes it onto persistent storage in the event of a power failure. You can get around this in various ways, but I'm drifting just a bit out of scope here...)

        The above is to suggest an explanation for 'untie' taking a long time (flushing lots of buffered data on close), and it's also something anyone doing performance-related work on disk systems should know about. In particular, it may suggest why sqlite seemed slow on your workload. For robustness, sqlite calls 'fsync' (resulting in real disk I/O) at appropriate times (i.e. when it tells you that an insert has completed).

        (Looking at one of your other replies...) If you are writing a lot of data to sqlite, you'll probably want to investigate the use of transactions and/or the 'async' mode. By default, sqlite is slow-but-safe. By default, writing data to a bunch of files is quick-but-unsafe. (But both systems can operate in both modes, you just need to make the right system calls or config options).

        If you're going to be doing speed comparisons between storage approaches, you need to be sure of your needs for data integrity and then put each storage system into the mode that suits your needs before comparing. (You may well be doing all this already - apologies for the length response if so).

Re: Perl solution for storage of large number of small files
by merlyn (Sage) on Apr 30, 2007 at 10:49 UTC
      That was my thought as well. File systems aren't optimized to handle 1e6+ files in constant use, and that's exactly what databases ARE optimized for. Why does the problem definition require files, or is that negotiable?

        I had a stint on this project with sqlite, and although I like databases, I was (so far) under the impression storing binary data into a database is not the best way to go. For me the blob columns always felt like an add-on...

        Second, the performance with sqlite (async off) was poor, and it was only munging light text data. This is why I never thought of going completely db with this. And I didn't want to install a full mysql server on the machine, just to find out that it would have the same problems.

        I think using the fs is ok for ~20-30 mio. files. So, do you opt for using a db when this number grows?
Re: Perl solution for storage of large number of small files
by BrowserUk (Pope) on Apr 30, 2007 at 10:13 UTC

    How are the 'files' keyed?

    Is lookup always by name or are there other factors involved?

    What are the minimum & maximum sizes? Are these predictable or are you likely to get occasional files that break those limits?

    Do you need single or multiple concurrent access to them?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      32 chars names, lookup is always by name, full reads no seeks inside.

      Files can be everything from 5-500K of data...

      Concurrent writes to the storage, but never on the same file (as everything is passed through an index, serializing events, see above reply)

      Sounds like you have an interesting solution in mind...
      (This thread is becoming a fun-discussion, not only solving my problem but thinking about data-mining in general)
Re: Perl solution for storage of large number of small files
by rhesa (Vicar) on Apr 30, 2007 at 15:22 UTC
    I generate an md5 hash from the filename, and use that to create a file hierarchy. I split the hash into chunks, which I then use to create subdirectories. This keeps the number of entries per level within filesystem limits, and lookup stays relatively fast. The distributive properties of MD5 garantuee a good spread among buckets.

    For example, a filename of "foo12345.txt" might map to "/4d9/a43/9b6/eb3/f21/600/448/a03/b78/5a3/17". Each subdirectory has a maximum of 4096 subdirectories, which is well within the limit of ext3. By varying where you place the slashes, you control the breadth and depth of your directory tree.

    I personally use a mapping of "/11/222/333/rest_of_hash" to archive (currently) over 10 million files. That scheme allows for 256*4096*4096 buckets, which is quite enough for my purposes. Speed and server load are excellent, even under highly concurrent usage.

      Seems like exactly what I do now (after switching back to filesystem)!

      And the buckets? Are you talking about DB_Files you store away in this subdir structure (thousands of small dbs effectively, where you put foo12345.txt)? Or are you just spreading the files by the hash and then you store /hash/path/to/file/foo12345.txt?

        I concur with rhesa's method. I've used this before with considerable success.

        I just untarred a test structure containing a million files distributed this way using 3 levels of subdirectory to give an average of ~250 file per directory. I then ran a quick test of opening and reading 10,000 files at random and got an average time to locate, open, read and close each file of 12ms.

        #! perl -slw use strict; use Math::Random::MT qw[ rand ]; use Digest::MD5 qw[ md5_hex ]; use Benchmark::Timer; our $SAMPLE ||= 1000; my $T = new Benchmark::Timer; for my $i ( 1 .. $SAMPLE ) { $T->start( 'encode/open/read/close' ); my $md5 = md5_hex( int( rand 1e6 ) ); my( $a, $b, $c ) = unpack 'AAA', $md5 ; $T->start( 'open' ); open my $fh, '<', "fs/$a/$b/$c/$md5.dat" or warn "fs/$a/$b/$c/$md5 + : $!"; $T->stop( 'open' ); $T->start( 'read' ); my $data = do{ local $/; <$fh> }; $T->stop( 'read' ); $T->start( 'close' ); close $fh; $T->stop( 'close' ); $T->stop( 'encode/open/read/close' ); } $T->report; __END__ c:\test>612729-r -SAMPLE=10000 10000 trials of encode/open/read/close (112.397s total), 11.240ms/tria +l 10000 trials of open (110.562s total), 11.056ms/trial 10000 trials of read (158.554ms total), 15us/trial 10000 trials of close (365.520ms total), 36us/trial

        The files in this case are all 4k, but that doesn't affect your seek time. If you envisage needing to deal with much more than 1 million files, moving to 4 levels of hierarchy woudl distribute the million files at just 15 per directory.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Ah, sorry about that: my use of the word "bucket" was a bit sloppy. I just store the file (I deal with images, mostly).

      Be aware of where your filesystem slows down. There is often times a limit on the number of files that you can store in a directory before access starts getting slow. Adding another level of directories can be faster than adding more directories to the current one.

      While old, this explains my ramblings from above.


      I do *almost* the same exact thing, for a similarly large (and growing) archive I do:
      /123/456/789/123456789...32 or /123/456/789/entire-hash
      agreed on the notion that "The distributive properties of MD5 garantuee a good spread among buckets.". You should never do this with numbers, but if you must, you can reverse the numbers to get a decent distribution ( ie - 12345 -> 54321 ) another difference that i do is the following-- i name the file the md5, and store the original file name in postgres. that gives me a speedup on fileserving. its relatively simple to create rewrite rules under apache & nginx to directly serve these files too.

      I forgot to add...

      The 'best' way I discovered to store info is to use a base32 encoded digest of the file.

      Why? well, most filesystems start to have issues at some point after 1000 items per directory. using a base32 representation, you can hit the sweet spot.

      with a base32 formula and 2chars per bin, you'll have 1024 bins per depth ( 32*32 ). using the standard hex based md5 representation, your options are either 256 buckets per depth ( 16*16 ) or 4096 ( 16*16*16 ).

      in my personal use, i haven't seen all that much of a difference between 1024 and 4096 buckets -- though i've seen a slight difference. its not as drastic as the performance between either and 10k though.

      since i'm lazy and I don't have high performace constraints, i just go with 2 levels deep of 4096 hashing. but if i had more time, I'd definitely go with 3 levels of 1024 hashing.

      ( the time / lazyness is a factor because every language supports md5 as base16 or base64 by default -- no one has a base32 default , which is a PITA if you're managing an architecture accessed by multiple languages at once ).

Re: Perl solution for storage of large number of small files
by diotalevi (Canon) on Apr 30, 2007 at 14:08 UTC

    It sounds like you didn't read the BerkeleyDB documentation. If your record is larger than your page size then the record is stored as an overflow record. I don't know how overflow records are accessed. In your case, perhaps you should try using a 64K page size and trying your benchmark again. The default size is 8K which far below your typical record size.

    First, the page size implicitly sets the size of an overflow record. Overflow records are key or data items that are too large to fit on a normal database page because of their size, and are therefore stored in overflow pages. Overflow pages are pages that exist outside of the normal database structure. For this reason, there is often a significant performance penalty associated with retrieving or modifying overflow records. Selecting a page size that is too small, and which forces the creation of large numbers of overflow pages, can seriously impact the performance of an application.

    Berkeley DB Reference Guide: Selecting a page size

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Perl solution for storage of large number of small files
by BrowserUk (Pope) on Apr 30, 2007 at 14:55 UTC

    Be very wary of the DB hammer wielders treating your problem as a nail. In most cases, RDBMSs have fairly small limits on the size (eg.8kb) of individual tuples that can be stored in-line. When they exceed this size, they are moved into out-of-line storage with a pointer substituted into the main table.

    Out-of-line storage often as not translates into "the file system" in some fashion or other. Given the size of your entities, it means that putting them into an RDBMS will essentially just add the overhead of the RDBMS to your accesses and you will still carry the overhead of the filesystem. The big difference is that you will not have any control over where and how your data is stored within the filesystem and so will lose the opportunities for optimisations based upon your knowledge of that data and it's location.

    For a little more on this see the description of PostGreSQL's TOAST.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Perl solution for storage of large number of small files
by mattr (Curate) on Apr 30, 2007 at 13:06 UTC
    I'm also curious about why you don't want to use a database. Is it because you want to be absolutely sure of when data is safely stored on the disk, or because it is faster this way? IIRC Oracle was designed at least originally to take advantage of physical layout on the disk, though perhaps not so important these days. Did you try dumping this into Mysql or PostgreSQL and dislike the solution for some reason? Have you tried sleepycat's BDB?

    Incidentally InnoDB performance tuning tips notes:

    Wrap several modifications into one transaction. InnoDB must flush the log to disk at each transaction commit if that transaction made modifications to the database. The rotation speed of a disk is typically at most 167 revolutions/second, which constrains the number of commits to the same 167th of a second if the disk does not fool the operating system.

    So one disk rotation is 6msec minimum right there. Are you spreading your tied files across several disks? Do you require every write to be saved to disk physically instantaneously, or can you wait a second or so?

    Oh, the other thing is if you have disk to burn you could increase your inode size, on XFS, or mirror your disks for speed. But regardless, it seems that moving to a database implementation now rather than waiting for things to explode might be a good idea. I don't suppose your system could do locking to handle multiple writers, could it? Perhaps more info about what you are actually trying to do would be useful.

    Also, I was thinking about a presentation at YAPC::Asia I think it was, about how a large service was built on Perl. Livedoor or Mixy. Anyway they split their indices and tables across different servers (using the first characters of user names IIRC). They built a system capable of easily repartitioning this layout as users increase.

      DB_File is the old API for BerkeleyDB, the Sleepycat database. It's the same thing.

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      InnoDB must flush the log to disk at each transaction commit if that transaction made modifications to the database. ... constrains the number of commits to the same 167th of a second

      unless you set innodb_flush_log_at_trx_commit to 0, which switches it to flush once a second.

      HTH, andye

        Thanks, that is the page I was looking at, why I mentioned "1 second or so". It seemed he was unwilling to wait that long but buffering it would it seems increase efficiency.
Re: Perl solution for storage of large number of small files
by salva (Abbot) on Apr 30, 2007 at 09:50 UTC
    you can try using another database backend, for instance DBI + DBD::SQlite.
      Been there, done that. Actually for the meta-data index of the heavy-load storage...
      The first incarnation was a DBM:mldb. The second version sqlite, with which I ran into a heavy disk IO overhead inserting/updating meta-data, now the index is in-memory as plain data structure...

      So, do you actually recommend sqlite as storage for binary data?
        So, do you actually recommend sqlite as storage for binary data?

        Well, I don't recommend neither disrecommend it. I was only suggesting you should try another backend!

        Which database is the best for a given problem, does not depend exclusively on the data structures but also on the usage pattern.

        Anyway, if you need to access 2GB of data randomly, there is probably nothing you can do to stop disk trashing other than adding more RAM to your machine, so that all the disk sectors used for the database remain cached.

Re: Perl solution for storage of large number of small files
by Moron (Curate) on May 01, 2007 at 10:03 UTC
    Archive::Tar appears to meet the whole checklist. It also enables you to write a trivial wrapper that manages compressed archives with all the required list/add/create/remove file operations.

    ^M Free your mind!

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://612707]
Approved by Mutant
Front-paged by TStanley
[erix]: they've stuck nuclear missiles in Kaliningrad - they can just reach Berlin :)
[LanX]: naaah! Putin wouldn't nuke all the Russians in Berlin
[Your Mother]: Putin is a bad guy but heís not that kind of bad guy. (Who would launch.)
[erix]: fortunately, Putin seems to be ill -- or at least, they're building a hospital for him and his cronies in Moscow
[erix]: nobody knows what kind it takes :)
[erix]: I say his seems a face that launches a thousand missiles :)
[Your Mother]: I say I do. :P
[Your Mother]: Missiles would be an idiotís play. Putin is many things but he is not an idiot.
[erix]: I like the idea of the trumps troup of imbeciles growing slightly nervous now their incompetence will be tested
[erix]: you seem forget there is another idiot coming :)

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (8)
As of 2017-01-20 00:56 GMT
Find Nodes?
    Voting Booth?
    Do you watch meteor showers?

    Results (173 votes). Check out past polls.