
Re^3: Advice on Efficient Large-scale Web Crawling

by matija (Priest)
on Dec 19, 2005 at 14:57 UTC ( #517740=note )

in reply to Re^2: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

With a single hex digit per directory level you get an average of 15,625 files per directory (4,000,000 files over 16 × 16 = 256 directories), which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but I can't remember offhand which file systems do and which don't have that.

I suggest you simply change that to two hex digits per directory name.

That should reduce the average number of files per directory to a much more reasonable 60 and change: the same 4,000,000 files spread over 256 × 256 = 65,536 directories works out to about 61 per directory.
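The code sample in the original note did not survive the page extraction. As a minimal sketch of the two-hex-digit layout, assuming (hypothetically) that each spooled file is named by the MD5 hex digest of its URL:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Map a URL to a two-level spool path: the first two hex digits of
# the digest pick the top-level directory, the next two pick the
# subdirectory, giving 256 * 256 = 65,536 buckets in total.
sub spool_path {
    my ($url) = @_;
    my $digest = md5_hex($url);
    my ($d1, $d2) = $digest =~ /^([0-9a-f]{2})([0-9a-f]{2})/;
    return "$d1/$d2/$digest";
}

print spool_path('http://example.com/'), "\n";
```

Because MD5 output is effectively uniform, files spread evenly across the buckets without any explicit balancing.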

And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problem.
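As a sketch of what such a benchmark might look like (the file names, counts, and temporary directories here are made up for illustration), the core Benchmark module's cmpthese can compare creating files in one flat directory against the bucketed layout:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Two scratch trees, removed automatically at program exit.
my $flat     = tempdir(CLEANUP => 1);
my $bucketed = tempdir(CLEANUP => 1);
my $n = 0;

cmpthese(1000, {
    flat => sub {
        # One big directory: every file lands in $flat directly.
        my $name = sprintf '%032x', $n++;
        open my $fh, '>', "$flat/$name" or die $!;
        close $fh;
    },
    bucketed => sub {
        # Two-hex-digit buckets: d1/d2/ subdirectories first.
        my $name = sprintf '%032x', $n++;
        my ($d1, $d2) = (substr($name, 0, 2), substr($name, 2, 2));
        make_path("$bucketed/$d1/$d2");
        open my $fh, '>', "$bucketed/$d1/$d2/$name" or die $!;
        close $fh;
    },
});
```

The interesting comparison in practice is not creation speed at small counts but lookup and creation cost once each directory holds thousands of entries, so a real benchmark would pre-populate the trees first.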
