I'm writing a system which needs to fetch well over four million URLs (and this number will grow), save their contents, and record their hosts' IP address(es). (Note: I'm only doing GET $url -- I'm not spidering the sites.)

I've written Pool and Pool::Job modules which let me do something like this:

    my $pool = Pool->new;
    while (my $job = $pool->get_next_job) {
        # do something with $job->url
        if ($error) {
            $job->skip;
        }
        else {
            $job->done;
        }
    }

(Yeah, I'm sure there's a better way to do this just waiting on CPAN, but... ;-))

The reason for this complexity is that I need to process these jobs in parallel, will need to retry some jobs, and need the crawler to resume work as soon as possible when it is restarted after a crash. This architecture allows that.
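To make the interface concrete, here is a minimal in-memory sketch of what such a pool could look like. It is not the actual Pool or Pool::Job code: the `urls` and `max_tries` arguments, the retry limit, and all internals are assumptions made for illustration, and a real version would keep its state somewhere persistent so that it survives a crash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for Pool::Job: one URL plus a retry counter.
package Pool::Job;

sub new {
    my ($class, %args) = @_;
    return bless { url => $args{url}, pool => $args{pool}, tries => 0 }, $class;
}

sub url { $_[0]{url} }

# Mark the job finished; it is never handed out again.
sub done {
    my ($self) = @_;
    $self->{pool}{done}{ $self->{url} } = 1;
    return;
}

# Put the job back on the queue unless it has used up its attempts.
sub skip {
    my ($self) = @_;
    my $pool = $self->{pool};
    if (++$self->{tries} < $pool->{max_tries}) {
        push @{ $pool->{queue} }, $self;
    }
    else {
        $pool->{failed}{ $self->{url} } = 1;
    }
    return;
}

# Hypothetical stand-in for Pool: a simple in-memory job queue.
package Pool;

sub new {
    my ($class, %args) = @_;
    my $self = bless {
        queue     => [],
        done      => {},
        failed    => {},
        max_tries => $args{max_tries} || 3,   # assumed default
    }, $class;
    push @{ $self->{queue} }, Pool::Job->new(url => $_, pool => $self)
        for @{ $args{urls} || [] };
    return $self;
}

# Returns the next job, or undef once the queue is empty.
sub get_next_job { shift @{ $_[0]{queue} } }

package main;

my $pool = Pool->new(
    urls      => [ 'http://a.example/', 'http://b.example/' ],
    max_tries => 2,
);

my %attempts;
while (my $job = $pool->get_next_job) {
    $attempts{ $job->url }++;
    # Pretend b.example always fails, to exercise the retry path.
    if ($job->url =~ /b\.example/) { $job->skip } else { $job->done }
}

printf "%s attempted %d time(s)\n", $_, $attempts{$_} for sort keys %attempts;
```

The loop in `main` has the same shape as the snippet above; only the simulated failure is new.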

My current crawler uses LWP::UserAgent and Parallel::ForkManager. It writes the page content to individual flat files. This works, but is slow, appears to just die occasionally, and doesn't allow me to get the IP address(es) efficiently.

Regarding the IPs, I could simply use URI to extract the hostname and then use gethostbyname() and inet_ntoa(), but this would require performing a DNS lookup twice per URL: once for LWP and once for gethostbyname. Ideally, I'd have LWP return the IP it resolved, but I can't see an easy way to do this. Of course, even then, for hosts which have multiple IP addresses I'd be resolving them multiple times.

I imagine that the answer is: use a cache, but where? Use a Memoize-type wrapper for inet_ntoa/gethostbyname? Use a DNS cache at the OS level (something like djbdns)?

Alternatively, I can just accept that this inefficiency is OK, and bulk resolve the hostnames independently (which will make removing duplicates trivial) using something like http://he.fi/slookup/ . I would like to do as much in Perl as possible, however, so POE::Component::Client::DNS looks useful if I do look them up in bulk...
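As an illustration of the in-process cache option, here is a minimal sketch of a hand-rolled cache around gethostbyname (a closure, not the Memoize module itself). The name `make_cached_resolver` and the optional `$lookup` hook are assumptions introduced for this example; the hook exists only so the demo can run offline with a stub.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# Build a resolver that remembers every hostname it has already looked up.
# $lookup is an optional code ref (hypothetical hook, handy for testing);
# by default it asks the system resolver via gethostbyname().
sub make_cached_resolver {
    my ($lookup) = @_;
    $lookup ||= sub {
        my ($host) = @_;
        # In list context gethostbyname returns
        # (name, aliases, addrtype, length, @packed_addresses).
        my (undef, undef, undef, undef, @addrs) = gethostbyname($host);
        return map { inet_ntoa($_) } @addrs;
    };

    my %cache;
    return sub {
        my ($host) = @_;
        $host = lc $host;                       # hostnames are case-insensitive
        # An array ref is always true, so failed lookups are cached too.
        $cache{$host} ||= [ $lookup->($host) ];
        return @{ $cache{$host} };
    };
}

# Demo with a stub lookup so the example runs without a network;
# call make_cached_resolver() with no argument to use the real resolver.
my $calls   = 0;
my $resolve = make_cached_resolver(sub {
    $calls++;
    return ('192.0.2.10', '192.0.2.11');
});

my @first  = $resolve->('www.example.com');
my @second = $resolve->('WWW.example.com');     # served from the cache

print "@first (lookups performed: $calls)\n";
```

The hostname passed in would come from something like `URI->new($url)->host`. Note that a cache like this lives in one process only: with Parallel::ForkManager, each child gets its own copy after the fork, so it saves repeated lookups within a child but not across children.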

If I do fetch the URLs and resolve the IPs in the same process, there appear to be two main options: use a hacked version of HTTP::Lite which returns the IP address (a trivial change) along with Parallel::ForkManager, or use POE::Component::Client::HTTP and figure out a way to have it return the IP.

Regarding POE::Component::Client::HTTP, I'm a bit confused as to how it works under the hood... What limits the number of concurrent processes it runs, or is this a stupid question? With most parallel modules that I've used you can configure a maximum number of processes to run concurrently.

Anyway, that's a brief overview of the problem. Any thoughts on the most efficient way to approach this? Better modules to use? :-)


In reply to Advice on Efficient Large-scale Web Crawling by Anonymous Monk
