PerlMonks  

Async DNS with LWP

by jc (Acolyte)
on Oct 05, 2010 at 00:08 UTC
jc has asked for the wisdom of the Perl Monks concerning the following question:

Hi, has anybody ever managed to get LWP working with an asynchronous resolver?

Re: Async DNS with LWP
by Anonymous Monk on Oct 05, 2010 at 06:43 UTC
    No, I have never even heard of an asynchronous resolver, but LWP does not do any resolving itself; it relies on the operating system.
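To make "relies on the operating system" concrete, here is a minimal sketch of a blocking lookup through the OS resolver using the core Socket module. The helper name `resolve_ipv4` is made up for illustration and is not something LWP exposes; the point is only that a call like this does not return until the resolver answers.

```perl
use strict;
use warnings;
use Socket qw(:addrinfo AF_INET SOCK_STREAM);

# Hypothetical helper: blocking IPv4 lookup via the OS resolver.
# Returns a list of dotted-decimal address strings, or dies on failure.
sub resolve_ipv4 {
    my ($host) = @_;
    my ($err, @results) = getaddrinfo($host, "", {
        family   => AF_INET,
        socktype => SOCK_STREAM,
    });
    die "Cannot resolve $host: $err\n" if $err;

    my @addresses;
    for my $ai (@results) {
        # Turn the packed socket address back into a numeric string.
        my ($e, $ip) = getnameinfo($ai->{addr}, NI_NUMERICHOST, NIx_NOSERV);
        push @addresses, $ip unless $e;
    }
    return @addresses;
}
```

Calling `resolve_ipv4('www.example.com')` blocks the whole process for the duration of the lookup, which is exactly what an asynchronous resolver is meant to avoid.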
Re: Async DNS with LWP
by Corion (Pope) on Oct 05, 2010 at 07:11 UTC

    AnyEvent::DNS is an asynchronous resolver. I guess you can resolve the IPs using AnyEvent::DNS and then use LWP, but if you're using AnyEvent(::DNS) already, I would stay asynchronous and use AnyEvent::HTTP to do the HTTP requests.

      Hi Corion, thanks for your advice. I had thought about using AnyEvent::DNS but there don't seem to be any obvious ways of getting LWP to use its results rather than doing its own synchronous resolution (via the OS). Now AnyEvent::HTTP uses AnyEvent::DNS out of the box and using AnyEvent::HTTP sounds like good advice. However, I'm wondering if this is now going to create more problems than it solves. Use of AnyEvent::HTTP implies implementing explicit logic to make the browser stateful and handle cookies and referers correctly. It also implies complications with generating output in a form that HTTP::LinkExtor can parse to extract links. Has anybody ever got a stateful web crawler based on AnyEvent::HTTP working?

        I think it shouldn't be too hard to push the results into a WWW::Mechanize object when they are available. WWW::Mechanize will then do the cookie extraction etc. and if you're using raw LWP, you're extracting the cookies yourself anyway. You then need to override/capture the request that WWW::Mechanize (or LWP) generates when you ->get or follow a link. This request is then again handed off to AnyEvent::HTTP.

        I'm not sure that it makes much sense to rewrite WWW::Mechanize to be based on AnyEvent::HTTP, because you will need asynchronicity all over the place anyway.

        You could look into spawning threads or simply spawning external processes to handle your requests, but if you're already looking into asynchronous resolvers, you're either prematurely optimizing the wrong end of the task or the overhead from launching threads or processes will eat into your time/latency budget.
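For the "spawning external processes" option, a rough sketch using only core Perl might look like this. The name `run_in_children` and the `$job` callback are assumptions for illustration; in a crawler the callback would perform the LWP request and return the body, and each child pays the fork overhead mentioned above.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $job->($item) in one child process per item and
# collect whatever each child writes to its pipe. Returns { item => output }.
sub run_in_children {
    my ($job, @items) = @_;
    my %child;                            # pid => [item, read handle]

    for my $item (@items) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {                  # child: do the work, report, leave
            close $reader;
            print {$writer} $job->($item);
            close $writer;                # flushes the result to the parent
            POSIX::_exit(0);              # skip END blocks inherited from parent
        }

        close $writer;                    # parent keeps only the read end
        $child{$pid} = [$item, $reader];
    }

    my %result;
    for my $pid (keys %child) {
        my ($item, $reader) = @{ $child{$pid} };
        local $/;                         # slurp the child's whole output
        $result{$item} = <$reader>;
        close $reader;
        waitpid($pid, 0);                 # reap the child
    }
    return \%result;
}
```

Usage would be along the lines of `run_in_children(sub { fetch($_[0]) }, @urls)`, where `fetch` is a placeholder for whatever does the request. Both the resolution and the download then block only the child.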

        "I had thought about using AnyEvent::DNS but there don't seem to be any obvious ways of getting LWP to use its results rather than doing its own synchronous resolution (via the OS)."

        Surely, if you resolve the domain name yourself (asynchronously or not), and then supply the resolved dotted-decimal address as part of the URL you give to LWP, it won't have to, or be able to, do the resolution again itself?

        (I know; you don't like being called Shirley:)
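A minimal sketch of that idea, assuming plain http:// URLs without user:password parts. The helper name `pin_url_to_ip` is invented for illustration. It swaps the host for an already-resolved address and hands back the original name, since name-based virtual hosts still need it in the Host header.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the host in an http:// URL with an
# already-resolved IP address. Returns the rewritten URL and the
# original host name (for use in the Host header).
sub pin_url_to_ip {
    my ($url, $ip) = @_;
    my ($scheme, $host, $rest) = $url =~ m{^(http://)([^/:?#]+)(.*)$}si
        or die "Unsupported URL: $url\n";
    return ("$scheme$ip$rest", $host);
}
```

With LWP this would be used roughly as `my ($pinned, $host) = pin_url_to_ip($url, $ip); $ua->get($pinned, Host => $host);`. I believe LWP keeps an explicitly supplied Host header rather than overwriting it, but check that against your version. HTTPS is a different story, because the certificate name will not match a bare IP address.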


Re: Async DNS with LWP
by Proclus (Beadle) on Oct 05, 2010 at 11:09 UTC
    Since I use POE for all my async projects, I prefer POE::Component::Client::DNS. POE also has a drop-in LWP replacement, but I'm not sure if it has a DNS resolver.
Re: Async DNS with LWP
by ikegami (Pope) on Oct 05, 2010 at 16:08 UTC
    I have a module that replaces LWP's HTTP and HTTPS backend with AnyEvent::HTTP allowing you to easily do parallel requests using Coro threads. I'll try to publish it tonight or tomorrow night.
    my @threads;
    for (...) {
        push @threads, async { ... do LWP stuff here ... };
    }
    $_->join() for @threads;

    Update: Oh yeah, Coro already provides some kind of support for HTTP through LWP, but it's hackish and it doesn't work with HTTPS.

      That sounds great. AnyEvent::HTTP with Coro is just about the conclusion I've arrived at and I'm making some progress with it. So now I'm wondering if your changes can be ported to WWW::Mechanize... That would certainly make developing stateful crawling a lot easier.
        WWW::Mechanize doesn't actually do any socket work. It lets LWP do it, so nothing needs to be done. Keep in mind that Coro is cooperative multitasking, so your sockets can't receive anything if your crawler is spending a lot of time not waiting for data.
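To illustrate the cooperative multitasking caveat without pulling in Coro, here is a toy round-robin scheduler in plain Perl. This is not Coro's API; `run_cooperatively` and the task names are made up. Each task is a list of steps, and control passes to the next task only when a step returns, so a long-running step holds up everything else, including whatever would be reading from the sockets.

```perl
use strict;
use warnings;

# Toy cooperative scheduler (an illustration, not Coro): each task is a list
# of steps (code refs). A task "yields" only when its current step returns.
sub run_cooperatively {
    my (%tasks) = @_;                     # name => [ step, step, ... ]
    my @log;
    my @queue = sort keys %tasks;         # fixed starting order

    while (@queue) {
        my $name = shift @queue;
        my $step = shift @{ $tasks{$name} } or next;
        push @log, "$name: " . $step->(); # nothing else runs during this call
        push @queue, $name if @{ $tasks{$name} };
    }
    return @log;
}
```

If the "crawler" steps spend seconds parsing pages, the "socket" steps simply wait their turn, which is the behaviour to keep in mind when mixing heavy processing with Coro threads.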
