PerlMonks  

Re: Advice on Efficient Large-scale Web Crawling

by matija (Priest)
on Dec 19, 2005 at 13:45 UTC


in reply to Advice on Efficient Large-scale Web Crawling

Personally, I think you're engaging in premature optimization here: when fetching 4M URLs, the DNS traffic is unlikely to be your biggest concern.

Having said that, the cheapest/cleanest method would be to install a caching-only DNS server on your localhost, and let it handle the DNS caching.

Some reasons why your current solution might be slow:

  • Are all those 4M pages each in a flat file, and all the flat files in one directory? You'd be better off distributing them over a tree of directories.
  • Do you have enough bandwidth to download all those pages? The line might be saturated with that much data. If you are connected through an asymmetric line (like ADSL), your downloads could be choked by the lack of upstream bandwidth for the ACK traffic.
  • Do you have enough memory for all the processes you've started? If your processes are being swapped out, they will not only be running more slowly as different processes are getting swapped in and out, but they'll probably compete for disk bandwidth with the files you're writing out.


Re^2: Advice on Efficient Large-scale Web Crawling
by Anonymous Monk on Dec 19, 2005 at 14:11 UTC

    Yeah, I'm leaning towards a local DNS cache as well. Thanks.

    Currently the pool is a hierarchy of directories like this:

    pool/
    pool/todo
    pool/doing
    pool/done

    A sample file path is

    pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69

    This way readdir() doesn't struggle too much when enumerating a directory's contents, and it is trivial to select a random batch of jobs: just generate two random hex digits (each from 0 to f), then read the resulting directory. I also get metadata for free (from the filesystem), can easily keep track of which jobs are in which state, and can recover from errors.
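
    The batch selection described above could look roughly like the sketch below. The name random_batch and its $max parameter are made up for illustration; only the pool/todo/<digit>/<digit> layout comes from the post.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not from the original post): pick one random leaf
# directory under "$pool/todo" and return up to $max job file paths from it.
sub random_batch {
    my ($pool, $max) = @_;
    my @hex = (0 .. 9, 'a' .. 'f');

    # Two independent random hex digits, each one of 16 possible values.
    my $dir = File::Spec->catdir($pool, 'todo', $hex[rand @hex], $hex[rand @hex]);

    opendir(my $dh, $dir) or return;    # the directory may not exist yet
    my @jobs = grep { !/^\./ } readdir $dh;    # skip "." and ".."
    closedir $dh;

    splice(@jobs, $max) if @jobs > $max;       # keep at most $max entries
    return map { File::Spec->catfile($dir, $_) } @jobs;
}
```

    Calling random_batch('pool', 100) returns an empty list when the chosen directory is missing or empty, so the caller can simply try again.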

    I have quite a lot of symmetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?

    I'm monitoring the memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12-15 concurrent processes, which is one of the reasons why I want to improve things.

    Does this sound somewhat sensible? :-)

      To achieve this with the current architecture I'm limited to about 12-15 concurrent processes

      That limit seems too low for the task you want to accomplish, especially if you have a good internet connection. Have you actually tried raising it to 30 or even 50? Forking is not that expensive on modern Unix/Linux systems with copy-on-write (COW) support.

      update: actually, much of the overhead of the forked processes can come from perl cleaning everything up at exit. On Unix this cleanup is mostly useless, and you can skip it by calling

      exec $ok ? '/bin/true' : '/bin/false';
      instead of exit($ok ? 0 : 1) to finish child processes. Just remember to first close any files you have written to.

        That's what POSIX::_exit is for. Exit the process without giving Perl (or anything else) a chance to clean up.
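
        A minimal sketch of how that might fit into a forking crawler, assuming each job runs in its own child and the number of live children is capped. The name run_jobs, the $max_kids parameter and the $work callback are hypothetical; only POSIX::_exit itself comes from the post.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical driver (not from the original post): run $work->($job) for
# each job in its own child, with at most $max_kids children alive at once.
# Returns the number of children that exited with status 0.
sub run_jobs {
    my ($max_kids, $work, @jobs) = @_;
    my ($running, $ok_count) = (0, 0);

    # Wait for one child and record whether it succeeded.
    my $reap = sub {
        my $pid = wait();
        if ($pid == -1) { $running = 0; return; }    # no children left
        $running--;
        $ok_count++ if $? == 0;
    };

    for my $job (@jobs) {
        $reap->() while $running >= $max_kids;

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            my $ok = eval { $work->($job) };
            # _exit skips END blocks, destructors and stdio flushing,
            # so $work must close any file it wrote before returning.
            POSIX::_exit($ok ? 0 : 1);
        }
        $running++;
    }
    $reap->() while $running > 0;
    return $ok_count;
}
```

        Something like run_jobs(30, \&fetch_url, @urls) would then keep 30 fetchers busy, where fetch_url stands for whatever routine does the actual download and returns true on success.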

      With a single hex digit per directory level you get an average of 15625 files per directory (4M files spread over 16 x 16 = 256 directories), which is still too many (IMHO). It might work if the filesystem has hashed directory lookups, but I can't remember offhand which filesystems do and which don't.

      I suggest you simply change that to two hex digits per directory name, e.g.

      pool/todo/a6/86/a6869c08bcaa2bb6f878de99491efec4f16d0d69
      
      
      That should reduce the average number of files per directory to a much more reasonable 60 and change.
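
      A rough sketch of the two-digit layout and the arithmetic behind it. The helper names job_path and files_per_dir are invented here, and treating the 40-character file name as the SHA-1 of the URL is an assumption; the thread does not say how the names are generated.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical helper: map a URL to its job file under "$pool/$state",
# using the first two pairs of hex digits of the name as directory levels.
# ASSUMPTION: the file name is the SHA-1 hex digest of the URL.
sub job_path {
    my ($pool, $state, $url) = @_;
    my $name = sha1_hex($url);               # 40 hex characters
    my ($d1, $d2) = $name =~ /^(..)(..)/;    # e.g. "a6" and "86"
    return join '/', $pool, $state, $d1, $d2, $name;
}

# Average number of files per leaf directory for a given layout:
# $levels directory levels, each named with $digits hex digits.
sub files_per_dir {
    my ($total, $levels, $digits) = @_;
    return $total / 16 ** ($levels * $digits);
}

printf "%s\n", job_path('pool', 'todo', 'http://example.com/');
printf "1 digit per level:  %.0f files per directory\n", files_per_dir(4_000_000, 2, 1);
printf "2 digits per level: %.0f files per directory\n", files_per_dir(4_000_000, 2, 2);
```

      For 4M files this gives 15625 files per directory with one digit per level and about 61 with two, at the cost of 65536 leaf directories instead of 256.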

      And yes, benchmarking (lots and lots of benchmarking) and tweaking seem to be the best way to tackle this kind of problem.
