Advice on Efficient Large-scale Web Crawling
by Anonymous Monk on Dec 19, 2005 at 12:45 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I'm writing a system which needs to fetch well over four million URLs (and this number will grow), save their contents, and record their hosts' IP address(es). (Note: I'm only doing GET $url -- I'm not spidering the sites.)
I've written a Pool:: and Pool::Job module which let me do something like this:
(Yeah, I'm sure there's a better way to do this just waiting on CPAN, but... ;-))
The reason for this complexity is that I need to process these jobs in parallel, will need to retry some jobs, and need the crawler to resume work as soon as possible when it is restarted after a crash. This architecture allows that.
My current crawler uses LWP::UserAgent and Parallel::ForkManager. It writes the page content to individual flat files. This works, but is slow, appears to just die occasionally, and doesn't allow me to get the IP address(es) efficiently.
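To make the "N processes at a time, retry the failures" idea concrete, here is a minimal sketch using only core Perl (fork and waitpid). It is a stand-in for what Parallel::ForkManager provides, not that module's API; run_pool, the worker code ref, and the job values are hypothetical names for illustration.

```perl
use strict;
use warnings;
use POSIX ();

# Run each job in its own child process, never more than $max at a time.
# $worker is a code ref called in the child with one job; a true return
# value means success. Returns the jobs whose child failed, so the caller
# can queue them for a retry.
sub run_pool {
    my ($max, $worker, @jobs) = @_;
    my (%running, @failed);

    my $reap = sub {
        my $pid = waitpid(-1, 0);            # block until any child exits
        if ($pid <= 0) {                     # no children left to wait for
            push @failed, values %running;   # treat the untracked ones as failed
            %running = ();
            return;
        }
        my $job = delete $running{$pid};
        push @failed, $job if defined $job && $? != 0;
    };

    for my $job (@jobs) {
        $reap->() while keys(%running) >= $max;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                     # child process
            my $ok = eval { $worker->($job) };
            POSIX::_exit($ok ? 0 : 1);       # skip END blocks inherited from the parent
        }
        $running{$pid} = $job;
    }
    $reap->() while %running;
    return @failed;
}
```

A worker that fetches one URL and writes it to a flat file would return true on success; whatever run_pool returns can be fed back in for another pass. A child that dies outright shows up as a failed job rather than taking the parent down with it.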
Regarding the IPs, I could simply use URI to extract the hostname and then call gethostbyname() (formatting the result with inet_ntoa()), but this would require performing a DNS lookup twice per URL: once inside LWP and once in my own gethostbyname() call. Ideally, I'd have LWP return the IP it resolved, but I can't see an easy way to do this. Of course, even then, for hosts which have multiple IP addresses I'd still be resolving them more than once. I imagine that the answer is: use a cache, but where? Use a Memoize-type wrapper around gethostbyname()? Use a DNS cache at the OS level (something like djbdns)? Alternatively, I can just accept that this inefficiency is OK, and bulk resolve the hostnames independently (which will make removing duplicates trivial) using something like http://he.fi/slookup/ . I would like to do as much in Perl as possible, however, so POE::Component::Client::DNS looks useful if I do look them up in bulk...
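As a rough illustration of the in-process cache option, here is a sketch using only core modules. The names resolve_ipv4, make_cached_resolver and host_of are made up for this example; the regex in host_of is a simplification (the URI module's host method is more robust), and the cache ignores DNS TTLs and handles IPv4 only.

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# Resolve a hostname to all of its IPv4 addresses as dotted-quad strings.
# gethostbyname() does the actual lookup; inet_ntoa() only formats the
# packed addresses it returns.
sub resolve_ipv4 {
    my ($host) = @_;
    my (undef, undef, undef, undef, @addrs) = gethostbyname($host);
    return map { inet_ntoa($_) } @addrs;
}

# Wrap any resolver in a per-process cache keyed by lowercased hostname.
# Failed lookups (empty lists) are cached too, so a dead host is only
# tried once per run.
sub make_cached_resolver {
    my ($resolver) = @_;
    my %cache;
    return sub {
        my $host = lc shift;
        $cache{$host} ||= [ $resolver->($host) ];
        return @{ $cache{$host} };
    };
}

# Pull the hostname out of an http:// or https:// URL (simplified).
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://(?:[^/@]*@)?([^/:?#]+)}i ? lc $1 : undef;
}
```

Usage would be along the lines of `my $lookup = make_cached_resolver(\&resolve_ipv4); my @ips = $lookup->(host_of($url));`. Note that this cache lives in one process, so with forked workers each child would build its own; an OS-level cache or a bulk pre-resolution pass avoids that.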
If I do fetch the URLs and resolve the IPs in the same process, there appear to be two main options: use a hacked version of HTTP::Lite which returns the IP address (a trivial change) along with Parallel::ForkManager, or use POE::Component::Client::HTTP and figure out a way to have it return the IP.
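For what the "return the IP along with the page" idea might look like at the socket level, here is a bare-bones sketch with core IO::Socket::INET. It is not HTTP::Lite's code or API; fetch_with_ip is a hypothetical helper, it speaks plain HTTP/1.0 with no redirects, chunking or SSL, and the Timeout option only covers the connect, not the reads.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Minimal HTTP/1.0 GET over a raw socket. Returns (peer IP, raw response),
# or an empty list if the connection fails. The hostname is resolved once,
# by the connect, and peerhost() reports the address actually used.
sub fetch_with_ip {
    my ($host, $port, $path, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port || 80,
        Proto    => 'tcp',
        Timeout  => $timeout || 30,
    ) or return;

    my $ip = $sock->peerhost;                     # dotted-quad of the remote end
    $path = '/' unless defined $path && length $path;
    print {$sock} "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";

    # With HTTP/1.0 the server closes the connection when it is done,
    # so reading to EOF yields status line, headers and body.
    my $response = do { local $/; <$sock> };
    close $sock;
    return ($ip, $response);
}
```

This only reports the one address the connection used; hosts with several A records would still need a separate lookup to list them all.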
Regarding POE::Component::Client::HTTP, I'm a bit confused as to how it works under the hood... What limits the number of concurrent processes it runs, or is this a stupid question? With most parallel modules that I've used you can configure a maximum number of processes to run concurrently.
Anyway, that's a brief overview of the problem. Any thoughts on the most efficient way to approach this? Better modules to use? :-)