
Re: Efficient processing of large directory (n-tier directories and the Grubby Pages Effect)

by grinder (Bishop)
on Oct 02, 2003 at 17:16 UTC (#295976)

in reply to Efficient processing of large directory

By definition, one cannot process a large directory efficiently :)

Find a way to key the file names so that you can move to a multi-level directory structure. The key might be the first character or two of the filename, or the last two characters before the extension:

filename     key   result
dabhds.txt   d     d/dabhds.txt
xyzzy.txt    x     x/xyzzy.txt
43816.txt    16    16/43816.txt
73813.txt    13    13/73813.txt

The main point to remember is that given the filename, you can derive the directory it should be in. And if it gets moved to the wrong directory, you can check for it programmatically.
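As a rough illustration of that idea, here is a minimal Python sketch. The function names (`key_for`, `expected_path`, `is_misplaced`) and the rule "last two digits for numeric names, first character otherwise" are assumptions chosen to match the examples above, not a prescribed scheme.

```python
import os

def key_for(filename, width=2):
    """Derive the directory key from a file name (hypothetical scheme)."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    if stem.isdigit():
        # Numeric names: key on the trailing digits.
        return stem[-width:]
    # Any other name: key on the first character.
    return stem[:1]

def expected_path(filename):
    """Return the relative path the file should live at."""
    name = os.path.basename(filename)
    return os.path.join(key_for(name), name)

def is_misplaced(path):
    """True if the file's parent directory does not match its derived key."""
    parent = os.path.basename(os.path.dirname(path))
    return parent != key_for(path)
```

Because the key is a pure function of the name, a periodic sweep can call `is_misplaced` on every file and move any stragglers to `expected_path`.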

I worked on a web site like yours once. When I was called in, there were more than 600 000 files sitting in one directory. This was on a Linux 2.0 kernel on an ext2 filesystem. The directory entry itself was over 3 megabytes. Needless to say performance suffered...

We managed to get it into a three-level directory structure (7/78/78123.txt), but it took hours, most of them spent in the kernel, because the directory traversal was so slow.
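A migration of that kind might look roughly like the following Python sketch. It is not the script we used; `nested_path` and `migrate` are hypothetical names, the layout is inferred from the 7/78/78123.txt example, and it assumes source and destination are on the same filesystem so that `os.rename` works.

```python
import os

def nested_path(filename):
    """Map '78123.txt' to '7/78/78123.txt' (layout assumed from the example)."""
    name = os.path.basename(filename)
    stem = os.path.splitext(name)[0]
    return os.path.join(stem[:1], stem[:2], name)

def migrate(flat_dir, dest_root):
    """Move every regular file in flat_dir into the nested layout under dest_root."""
    # Snapshot the names first so the directory is not modified while it is being read.
    with os.scandir(flat_dir) as entries:
        names = [e.name for e in entries if e.is_file(follow_symlinks=False)]
    moved = 0
    for name in names:
        target = os.path.join(dest_root, nested_path(name))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # Assumption: same filesystem, so rename is a cheap metadata operation.
        os.rename(os.path.join(flat_dir, name), target)
        moved += 1
    return moved
```

Expect the initial scan of the flat directory to dominate the run time; once the files are spread out, later lookups only touch small directories.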

Bite the bullet and reorganise your files, before it's too late!

Note that if the filenames are numeric (1232.txt, etc.), you want to key on the last digits, not the first digits. Otherwise you'll skew the number of files per directory towards the low-numbered ones because of the Grubby Pages Effect (better known as Benford's law).
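The skew is easy to see with a small Python sketch. It assumes, purely for illustration, that the files are numbered sequentially from 1 to 2000 and compares keying on the first digit with keying on the last digit; `bucket_counts` is a hypothetical helper.

```python
from collections import Counter

def bucket_counts(ids, key):
    """Count how many ids land in each bucket for a given key function."""
    return Counter(key(str(i)) for i in ids)

ids = range(1, 2001)  # assumed: sequential numeric file names 1..2000

by_first = bucket_counts(ids, lambda s: s[0])   # key on leading digit
by_last = bucket_counts(ids, lambda s: s[-1])   # key on trailing digit

print(sorted(by_first.items()))  # '1' holds 1111 files, '2' holds 112, '3'..'9' hold 111 each
print(sorted(by_last.items()))   # every digit '0'..'9' holds exactly 200 files
```

With leading digits, more than half the files pile up in directory 1; with trailing digits, every directory gets an equal share.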
