Re^14: Async DNS with LWP

by jc (Acolyte)
on Oct 08, 2010 at 15:48 UTC ( [id://864219] )


in reply to Re^13: Async DNS with LWP
in thread Async DNS with LWP

You raise some good points. What kind of throughput have you managed to achieve with your solution? What bandwidth was available? I'm guessing you never saturated it, right?

Re^15: Async DNS with LWP
by BrowserUk (Patriarch) on Oct 08, 2010 at 16:14 UTC

    We had an early 4-cpu (real cpus, not cores) SMP box with a (shared) 1Gbps link direct to the (a) backbone. We easily saturated that with 32 threads running bog-standard LWP & Digest::MD5--provided we didn't store the data to disk. Even with RAIDed (RAID 5, I think) disks, the bottleneck was storing what we could read. That was circa 7 years ago.

    To do the job properly at the scale you are talking about, you'd need to run a distributed crawler, each node with dedicated, high-speed, RAIDed local drives (or hugely expensive SSD arrays), and a distributed queueing mechanism.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      With 32 synchronous LWP sessions you saturated 1Gbps bandwidth? Am I understanding you right here? A 1 Gbps connection was saturated by downloading only 32 web pages at a time? What tools did you use to monitor bandwidth consumption?
        A 1 Gbps connection was saturated by downloading only 32 web pages at a time?

        Instantaneously, yes. And for a substantial proportion of the time, assuming the servers we were connected to, and their connections, were able to supply their data at the required rate.

        Obviously, if the mix of servers at any given point in time were all 386-class machines in people's back bedrooms, connected via 14.4k modems--not so uncommon back then--then throughput falls off. But usually you had a random mix of good and bad servers, and it was sufficient to max out the bandwidth available.

        Remember I mentioned the 1Gbps was shared. If I remember correctly, by 20 other hosts. Mostly they seemed to be using very little of the bandwidth. Probably low-volume websites running "Mum&Pop's Potpourri Emporium Inc." or "HaKzOr23's CRucIal SeCurITy SiGht". We weren't party to what they were, or what bandwidth they were using, but the hoster's ControlPanel app showed us our usage, for which we were billed.

        By way of a convincer: the following two trivial scripts run two servers and two clients on my 4-cpu machine. I set the affinities so that 2 cores run the two server threads and 2 run the two client threads. All they do is connect to each other and shovel large lumps of data from server to client as fast as they can:

        Server:

        #! perl -slw
        use strict;
        use threads;
        use threads::shared;
        use IO::Socket;
        $|++;

        my $status1 :shared = 0;
        my $status2 :shared = 0;

        my $server1 = async{
            my $lsn = new IO::Socket::INET(
                Listen => 5, LocalPort => '12345'
            ) or die "Failed to open listening port: $!\n";

            my $data = 'x' x 1024**2;
            while( my $c = $lsn->accept ) {
                while( 1 ) {
                    print $c $data;
                    ++$status1;
                }
                print "client disconnected";
            }
        };

        my $server2 = async{
            my $lsn = new IO::Socket::INET(
                Listen => 5, LocalPort => '12346'
            ) or die "Failed to open listening port: $!\n";

            my $data = 'x' x 1024**2;
            while( my $c = $lsn->accept ) {
                while( 1 ) {
                    print $c $data;
                    ++$status2;
                }
                print "client disconnected";
            }
        };

        while( Win32::Sleep 100 ) {
            printf "\r$status1 : $status2";
        }

        Clients:

        #! perl -slw
        use strict;
        use threads;
        use threads::shared;
        use IO::Socket;
        $|++;

        my $bytes1 :shared = 0;
        my $bytes2 :shared = 0;

        my $client1 = async{
            my $tid = threads->tid;
            my $svr = new IO::Socket::INET( 'localhost:12345' )
                or die "Failed to connect to port: $!\n";

            while( 1 ) {
                my $buffer = <$svr>;
                $bytes1 += length( $buffer );
            }
        };

        my $client2 = async{
            my $svr = new IO::Socket::INET( 'localhost:12346' )
                or die "Failed to connect to port: $!\n";

            while( 1 ) {
                my $buffer = <$svr>;
                $bytes2 += length( $buffer );
            }
        };

        my( $last1, $last2 ) = (0,0);
        while( sleep 1 ) {
            my( $latest1, $latest2 ) = ( $bytes1 , $bytes2 );
            printf "\rc1:%5d (%.3f Megabtes/second) c2:%5d (%.3f Megabtes/second)",
                $latest1, ( $latest1 - $last1 ) / 1024**2,
                $latest2, ( $latest2 - $last2 ) / 1024**2;
            ( $last1, $last2 ) = ( $latest1, $latest2 );
        }

        That data doesn't go via the internet (my broadband connection is 300kbps at best), but it does go via the TCP stack and is therefore subject to all the handshaking, coalescing and buffering that a proper IP connection goes through.

        The main thread in the clients script monitors the throughput on a per-second basis. Here's a typical snapshot of that:

        C:\test>junk62
        c1:12559855306 (54.000 Megabtes/second) c2:12407811641 (53.000 Megabtes/second)

        The cpu usage whilst all that data is flying about is about 12% each for the servers, and 5% each for the clients. The throughput varies up and down a bit, between say 50MBytes/s and 58MBytes/s, but 53/54 is the norm.

        Remember, for 32 threads to sustain a combined throughput of 1Gbps (roughly 100MBytes/s), each thread has only to achieve about 3MBytes/s.
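        The arithmetic behind that figure can be checked with a few lines of Perl. This is only an illustrative sketch: the helper name per_thread_mbytes is made up for the example, the 100MBytes/s figure is the round number used in the post, and 125MBytes/s is the nominal 1Gbps divided by 8 bits per byte.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: the rate (MBytes/s) each of $threads workers must
# sustain for their combined throughput to reach $link_mbytes.
sub per_thread_mbytes {
    my( $link_mbytes, $threads ) = @_;
    return $link_mbytes / $threads;
}

# Round figure used in the post: ~100 MBytes/s shared by 32 threads.
printf "%.3f MBytes/s per thread (100 MBytes/s link)\n",
    per_thread_mbytes( 100, 32 );

# Nominal 1Gbps = 1e9 bits/s / 8 = 125 MBytes/s (decimal megabytes).
printf "%.3f MBytes/s per thread (125 MBytes/s link)\n",
    per_thread_mbytes( 1e9 / 8 / 1e6, 32 );
```

        Either way, the per-thread requirement is well below the 50+ MBytes/s that each local client thread sustained in the snapshot above.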

        Obviously, overall throughput at any given point will depend upon the mix of large and small files; good and bad servers; general network load; number of hops; and myriad other factors. But throwing large numbers of extra threads at the problem has rapidly diminishing returns. 4 threads per CPU seemed optimal on that system at that time. 8 per cpu sometimes improved overall throughput, but that was mostly negated by the effects of thrashing the disks harder by writing to twice as many files concurrently.
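        A fixed-size pool of the kind described could be sketched roughly as below. This is not the original crawler, which isn't shown in the thread; crawl_pool and lwp_fetch are hypothetical names, the queue-with-undef-sentinels layout is just one common way to arrange it, and the pool size is left as a parameter because 4 threads per CPU was only what suited that system at that time. The fetcher is passed in so the pool can be tried without touching the network.

```perl
#! perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use Digest::MD5 qw( md5_hex );

# Hypothetical helper: fetch every URL in @$urls using a fixed pool of
# $nthreads workers. $fetch is a code ref that takes a URL and returns the
# body, or undef on failure. Returns { url => md5 hex of body, or undef }.
sub crawl_pool {
    my( $urls, $nthreads, $fetch ) = @_;
    my $Q = Thread::Queue->new;
    my %digest :shared;

    my @workers = map {
        threads->create( sub {
            # An undef item is the signal for this worker to stop.
            while( defined( my $url = $Q->dequeue ) ) {
                my $body = $fetch->( $url );
                my $md5  = defined $body ? md5_hex( $body ) : undef;
                lock( %digest );
                $digest{ $url } = $md5;
            }
        } );
    } 1 .. $nthreads;

    # Queue the work, then one undef sentinel per worker.
    $Q->enqueue( @$urls, ( undef ) x $nthreads );
    $_->join for @workers;
    return { %digest };
}

# Hypothetical fetcher using plain LWP; loaded only if actually called.
sub lwp_fetch {
    my( $url ) = @_;
    require LWP::UserAgent;
    my $res = LWP::UserAgent->new( timeout => 30 )->get( $url );
    return $res->is_success ? $res->content : undef;
}

# Example call (not run here): 4 workers per CPU on a 4-cpu box.
# my $digests = crawl_pool( \@urls, 4 * 4, \&lwp_fetch );
```

        Nothing is written to disk in this sketch, which matches the case where the link, rather than the drives, was the limit.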

        That's why I say that you have to consider the complete system. And also why async DNS doesn't make much difference.


