in reply to Re: Re: Re: Re: writing to the top of a file
in thread writing to the top of a file

using the file system as a DBMS for tracking thousands of itty-bitty chunks of information is a ridiculous misuse of file system resources

So what are we supposed to use it for? Tracking ice cream? The point of a filesystem is to store my data. If it can't do what I want, then I just need a better filesystem.

When you create many thousands of one-line text files, you pay a severe penalty in OS and hardware

I don't believe that the penalty is too bad. If you post figures, I'd like to have a look - sounds interesting.

Not to mention that you'd need to complicate things a bit more to make sure file names don't collide -- this can get hairy when multiple threads or processes are abusing the file system this way.

Multiple threads are abusing the filesystem by writing to it? What kind of computer are you using? Filesystems are there so that multiple processes can read and write files without stepping on each other's toes. Seriously. What were you thinking when you wrote this? Do you store your files on reel-to-reel tape?

POSIX provides the tmpfile() call, which guarantees a unique temporary file. Perl has its own temporary-file module, File::Temp, which uses the same mechanism where available. There is no 'hairiness' at all.
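For the record, collision-free file creation is a one-liner with File::Temp. A minimal sketch (the template name and directory here are illustrative, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# tempfile() hands back an already-opened handle and a name that is
# guaranteed not to collide, even with many concurrent writers in the
# same directory -- the X's are replaced with random characters.
my ( $fh, $filename ) = tempfile( "chunk_XXXXXX", DIR => "/tmp" );
print $fh "one itty-bitty chunk of information\n";
close $fh or die $!;
print "wrote $filename\n";
```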

Demonstrating this lack of knowledge makes your other assertions look suspect.

____________________
Jeremy
I didn't believe in evil until I dated it.


Re: Re: Re: Re: Re: Re: writing to the top of a file
by graff (Chancellor) on May 21, 2004 at 19:38 UTC
    I don't believe that the penalty is too bad. If you post figures, I'd like to have a look - sounds interesting.

    If you're interested, you can generate the figures yourself (experience is the best teacher).

    Pick some log file (or generate some text stream, e.g. using "find / -ls") of, say, 100,000 or so lines -- the more the "better" (for demonstration purposes) -- and run it through a script like the following:

    #!/usr/bin/perl
    while (<>) {
        open( OUT, ">line.$." ) or die $!;
        print OUT;
        close OUT;
    }
    Time the script to see how long it takes to read the input and create that many files. Then time how long it takes just to read the input (e.g. perl -ne '$s+=length(); END {print $s,$/}').

    Then see how long it takes to read all the little files back (e.g. to reconstruct a copy of the original text).
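A sketch of that read-back step (it assumes the splitter above numbered the files line.1 upward, and that you pass the line count on the command line):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reconstruct a copy of the original text from the one-line files.
# Default count of 100_000 matches the example above; adjust to taste.
my $n = shift || 100_000;
open( my $out, '>', 'rebuilt.txt' ) or die $!;
for my $i ( 1 .. $n ) {
    open( my $in, '<', "line.$i" ) or die "line.$i: $!";
    print {$out} $_ while <$in>;
    close $in;
}
close $out or die $!;
```

Running this under time(1) alongside a plain sequential read of the original file makes the per-file open/close overhead visible directly.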

    While you're at it, if you happen to be doing this on a disk that gets backed up on a regular schedule, and can get information about the timing and performance of the backup, compare how that works with and without the directory full of 100,000 or so little files.

    This is not an issue of simply saying "using little files is bad, don't do that" -- it's a matter of understanding that this sort of approach does not scale well to large quantities (and in any case there are usually better ways to get the same functional result). That's why people have written things like database systems to optimize the storage and retrieval of data from disks.

    Of course you won't see a problem when the numbers are in the hundreds or even the few thousands, but at some point it becomes quite expensive, and you'll wish that you were making more efficient use of the filesystem.

    In a similar vein, I didn't say that having multiple processes/threads writing to a directory is "abuse" -- of course there is nothing wrong with this. But having any number (one or more) that mimic the behavior of the sample script above is foolish, and by multiplying the processes that do this, you multiply the foolishness, potentially to the point of being abusive.

      Yep, this is a suspect test. The files will still be hot in memory, confounding your attempts to measure them.

      Plus, any professional strength backup solution will cope with small files just fine.

      That's why people have written things like database systems to optimize the storage and retrieval of data from disks.

      People wrote database systems so they could efficiently store and search relational data. Programmers abused databases because they were faster than the shitty filesystem drivers most vendors shipped. This is no longer the case, but I am continuously deluged by sweaty little morlocks who tell me that the solution to all data storage problems is a relational database.

      In general, they just use it as a hash table, which is something that filesystems are much better at today.

      Your objections are based on using consumer grade hardware. Run your own example, but look at the disk I/O meter. You will see that your benchmark is I/O bound - something that can only be improved by better hardware.

      If you are just running a site for a few friends, my solution will be great. If you are running a big site with lots of hits, you'll have I/O channels that work at bus speeds, and my solution will still be great.

      All your post says is that you have shitty hardware, and you're generalising that to 'the world has shitty hardware'. And the votes are going the way they are because there are more readers who are deluded that their P4 3GHz is the best computer in the world, and have no idea that a $100k 1GHz Sun server will beat the pants off it in every test that counts... like handling small files.

      ____________________
      Jeremy
      I didn't believe in evil until I dated it.

        Jeremy,

        I feel compelled to step in here with a mild, but no less heartfelt, rebuke.

        One of the things that makes Perl Monks unique among internet discussion groups is our collegiality. Even when we disagree, we accord each other a certain degree of respect. In this sanctuary of civility, questions and their solutions can be debated in a well-mannered fashion, without snide remarks or personal attacks. You said,

        "And the votes are going the way thay are because there are more readers who are deluded that their P4-3Ghz is the best computer in the world, and have no idea that a $100k 1Ghz Sun server will beat the pants off it in every test that counts... like handling small files."
        I'd like to propose an alternate explanation. Many of your technical points may well have some validity. But I suspect that the majority of any negative votes you've accumulated in this thread are due to the arrogant, dismissive attitude that comes through in your posts. None of us, neither saint nor acolyte, is God's gift to programming. We're all here to learn from each other. And statements like, "You're telling me how to suck eggs.", or "So what are we supposed to use it for? Tracking ice cream?", are not constructive contributions to the discussion. It's obvious you have a unique perspective and a lot of great ideas to contribute. But more people will pay attention if you demonstrate a little more regard for your intended audience.

        Thanks for listening,
        Phil

        OK fine, if you have a few extra $100K's to throw around and cover up a sub-optimal process design, go ahead and spend your money that way. Keep those wheels of commerce rolling!

        I work at a university (i.e. with a non-commercial academic budget -- in linguistics, mind you), so I've had to put up with lower-grade Sun Enterprise hardware (including the multi-terabyte tape robot for backup) for the last several years. And by golly, when one of my programmers installed a directory tree with a few million little files on one of the raids, everybody felt the pain, including the poor sysadmin trying to work around the level-0 backups that either took a week to finish or simply failed. I had to tell my programmer not to do things that way, because I can't just go out and buy bigger iron.

        Mercifully, we are finally migrating to FreeBSD (and running it on a variety of hardware from the more "lean and hungry" vendors). Things are looking up.

        Got any more flame bait? (as if I had to ask)

        (updated to chill on the hyperbole. Thnx, jepri.)