
Re^13: Async DNS with LWP

by BrowserUk (Patriarch)
on Oct 08, 2010 at 11:10 UTC ( [id://864182] )


in reply to Re^12: Async DNS with LWP
in thread Async DNS with LWP

I'm sorry, but worrying about async DNS at this point is ... well, pointless. Let's do some math.

With your current setup, that's 90e6 sites at, say, an average of 100k per home page(*). To download that lot in your 100 hour allocation, you'd need to be fetching constantly at a rate of 25 Mbytes/s, which would (conservatively) require a 250Mbps connection. To do it in your target 3 hours, you'd need an 8Gbps connection.
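
For anyone who wants to check that arithmetic, a quick back-of-envelope sketch (my own numbers, using the same 100k average):

    #!/usr/bin/perl
    use strict; use warnings;

    my $sites = 90e6;            # 90 million home pages
    my $avg   = 100 * 1024;      # ~100k average per page
    my $bytes = $sites * $avg;   # ~9.2e12 bytes in total

    for my $hours ( 100, 3 ) {
        my $Bps  = $bytes / ( $hours * 3600 );  # sustained bytes/sec needed
        my $Mbps = $Bps * 8 / 1e6;              # raw line rate, no overhead
        printf "%3d hours: %5.1f Mbytes/s, i.e. a %4.0f Mbps link\n",
            $hours, $Bps / 1e6, $Mbps;
    }

That prints ~205 Mbps and ~6827 Mbps respectively; allow for TCP/HTTP overhead and the 250Mbps and 8Gbps figures above are about right.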

Now, I'm not sure what data rates you can achieve with GSM (GPRS, EGPRS) in the US, but I'm pretty sure they'll be measured in tens of Kbps, not Mbps, much less Gbps.

Even once you've moved to your hoster, if you could sustain their 100Mbps burst rate indefinitely, 90e6 * 100k would take ~250 hours to download. And they'd cut you off long before that.
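
Same sketch for the hoster's link, at a flat 100Mbps:

    # ~9.2e12 bytes * 8 bits, at 1e8 bits/sec
    my $secs = ( 90e6 * 100 * 1024 * 8 ) / 100e6;
    printf "%.0f hours\n", $secs / 3600;   # prints "205" -- before any
                                           # protocol overhead, hence ~250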

Worrying about shaving a few milliseconds here and there using asynchronous DNS is just a drop in the ocean.

(*) They seem to range from the minimalist Google at 8k, up to the commercial bloat of sky.com at 250k; but 100k is a good average of the few I looked at.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^14: Async DNS with LWP
by jc (Acolyte) on Oct 08, 2010 at 15:48 UTC
    You raise some good points. What kind of throughput did you manage to achieve with your solution? What bandwidth was available? I'm guessing you never saturated it, right?

      We had an early 4-CPU (real CPUs, not cores) SMP box with a (shared) 1Gbps link direct to the (a) backbone. We easily saturated that with 32 threads running bog-standard LWP & Digest::MD5--provided we didn't store the data to disk. Even with RAID (level 5, I think) disks, the bottleneck was storing what we could read. That was circa 7 years ago.
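
      A minimal sketch of that shape of setup, for anyone curious (this isn't our production code; the URL source, timeout, and output handling are all placeholders):

        #!/usr/bin/perl
        use strict; use warnings;
        use threads;
        use Thread::Queue;
        use LWP::UserAgent;
        use Digest::MD5 qw( md5_hex );

        my $Q = Thread::Queue->new;

        sub worker {
            my $ua = LWP::UserAgent->new( timeout => 30 );
            while ( defined( my $url = $Q->dequeue ) ) {
                my $res = $ua->get( $url );
                next unless $res->is_success;
                # hashing is cheap; *storing* the content is what kills the disks
                printf "%s %s\n", md5_hex( $res->content ), $url;
            }
        }

        my @workers = map { threads->create( \&worker ) } 1 .. 32;

        while ( my $url = <> ) {    # URLs, one per line, on STDIN
            chomp $url;
            $Q->enqueue( $url );
        }
        $Q->enqueue( ( undef ) x 32 );   # one poison pill per worker
        $_->join for @workers;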

      To do the job properly at the scale you are talking about, you'd need to run a distributed crawler, each node with dedicated, high-speed, RAIDed local drives--or hugely expensive SSD arrays--and a distributed queueing mechanism.
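
      As a sketch of the pull side of that, assuming a hypothetical central queue server that hands out batches of URLs over HTTP (the host name, /next-batch endpoint, and one-URL-per-line format are all invented for illustration):

        use strict; use warnings;
        use LWP::UserAgent;

        my $ua    = LWP::UserAgent->new( timeout => 30 );
        my $queue = 'http://queue.internal:8080/next-batch';   # hypothetical

        while ( 1 ) {
            my $res = $ua->get( $queue );
            last unless $res->is_success and length $res->content;
            my @urls = split /\n/, $res->content;
            # hand the batch to a local worker pool like the one above,
            # which writes results to this node's own RAIDed drives
            fetch_and_store( @urls );   # placeholder for that pool
        }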


        With 32 synchronous LWP sessions you saturated 1Gbps of bandwidth? Am I understanding you right here? A 1Gbps connection was saturated by downloading only 32 web pages at a time? What tools did you use to monitor bandwidth consumption?
