Re^4: Async DNS with LWP

Replies are listed 'Best First'.
Re^5: Async DNS with LWP by BrowserUk (Patriarch) on Oct 05, 2010 at 22:34 UTC
Not sure I see what you mean by your point about Coro. Coro isn't threaded! (Despite the blatant lies in the documentation!). It is cooperative task-switching--like Windows 3.1--which means that if one of your coro instances is busy, none of the others will do anything at all until it either: finishes; goes into a wait for IO; or yields. It also means that regardless of how many cores you have, it will only ever use one of them. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply]
Re^6: Async DNS with LWP by jc (Acolyte) on Oct 06, 2010 at 21:39 UTC
OK, so at this point I'm now thinking: * LWP and Mechanize are nice toys to make a quick proof of concept of a real web crawler but in practise not useful for anything more than low bandwidth automated tasks. * With AnyEvent::HTTP and Coro you can make a proof of concept which performs better but you're still not quite there * In order to build a real performing parallel web crawler that makes the best use of network resources performing parallel asynchronous DNS and parallel HTTP requests then I either need to use Perl's bloated thread model and directly use Perl's UDP and TCP interface or I need to give up on Perl and go ahead and build this in C It really seems a shame that there are so many Perl modules dedicated to crawling tasks and yet none of them really have proved up to the job of being the back end of a high performance crawler that makes best use of network resources. The fact that people have dedicated so much time to making such modules would seem to suggest that many Perl users have an interest in web crawling. I'm wondering (new to PerlMonks, please help me out here) if there's anything we can do to set up a team of Perl developers that can improve the situation and develop easy to use Perl modules that are up to the job?	[reply]
Re^7: Async DNS with LWP by BrowserUk (Patriarch) on Oct 07, 2010 at 06:59 UTC
I either need to use Perl's bloated thread model and directly use Perl's UDP and TCP interface or If you can afford to pay for sufficient bandwidth to allow for serious web-crawling, then affording a box with sufficient memory to start enough threads to saturate that bandwidth, will be the least of your concerns. I have 4GB of ram and I can run hundreds, even thousands of threads without getting anywhere near to running out of memory. So the "bloat" of the ithreads model is neither here nor there. Personally, I'd forget about asynchronous DNS. I'd stick an LWP::Parallel::UserAgent instance in one thread per core, and and watch them totally saturate my bandwidth. No Matter how fat a pipe I can afford. This trivial demo is running 100 threads, each with a parallel user agent, on this box as I type, in just 1/2 GB of memory: `#! perl -slw use strict; use threads ( stack_size => 4096 ); use Thread::Queue; use LWP::Parallel; sub worker { my $tid = threads->tid; my( $Qin, $Qout ) = @_; my $ua = LWP::Parallel::UserAgent->new; print "Thread: $tid ready to go"; while( defined( my $url = $Qin->dequeue ) ) { print $url; } } our $T //= 4; my( $Qin, $Qout ) = map Thread::Queue->new(), 1 ..2; my @workers = map async( \&worker, $Qin, $Qout ), 1 .. $T; sleep 100; ## Read your urls and feed the Q here.` [download] Not that running that many threads on my 4 cores, would be an effective strategy, but even if you're running on one of IBMs $250,000, 256-core, 1024 thread monsters, affording 5GB of memory so that you can run one parallel useragent on each core, is the least of your worries, but the 25 lines of code above will scale to it. AS-IS. And that's what you get with threads. Simplicity and scalability. But that bit is easy. The complicated part of a high throughput webcrawler is not saturating the bandwidth. The complicated parts are: respecting robots.txts; an efficient url extractor that can deal with not just html hyperlinks, but all the other kinds of absolute and relative links you need to discover and follow. scheduling urls so that you don't hit up any particular server with thousands of (concurrent or serial) requests in an effective DoS attack. having enough disk bandwidth to allow you to write the stuff out without holding everything up. having an efficient indexing/digesting mechanism to stop you chasing your tail in loops of cross-referenced pages. And that means indexing (digesting) the content, not just the urls, because the same content can hide behind many different urls. And the indexing/digesting mechanism will have to be disk-based--for both persistance and size-- but must not impact the to-disk throughput of your workers. Yes. Saturating your bandwidth is trivial, it is the rest that is hard. Worrying about asynchronous DNS at this point is premature and pointless. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l]
Re^8: Async DNS with LWP by jc (Acolyte) on Oct 07, 2010 at 08:49 UTC
Re^9: Async DNS with LWP by Corion (Patriarch) on Oct 07, 2010 at 08:53 UTC
Re^9: Async DNS with LWP by BrowserUk (Patriarch) on Oct 07, 2010 at 09:56 UTC
Some notes below your chosen depth have not been shown here
Re^8: Async DNS with LWP by ikegami (Patriarch) on Oct 07, 2010 at 20:32 UTC
Re^9: Async DNS with LWP by BrowserUk (Patriarch) on Oct 08, 2010 at 12:18 UTC
Re^9: Async DNS with LWP by BrowserUk (Patriarch) on Oct 07, 2010 at 23:33 UTC
Some notes below your chosen depth have not been shown here
Re^7: Async DNS with LWP by rcaputo (Chaplain) on Oct 07, 2010 at 04:41 UTC
POE::Component::Client::HTTP and LWP::UserAgent::POE will also perform better than a serial crawler, but they're still single-core solutions. `-- Rocco Caputo - http://poe.perl.org/ - irc://irc.perl.org/poe`	[reply]


Problems? Is your data what you think it is?
	PerlMonks