
Re^6: Async DNS with LWP

by jc (Acolyte)
on Oct 06, 2010 at 21:39 UTC


in reply to Re^5: Async DNS with LWP
in thread Async DNS with LWP

OK,

so at this point I'm now thinking:

* LWP and Mechanize are nice toys for a quick proof of concept of a real web crawler, but in practice they're not useful for anything more than low-bandwidth automated tasks.

* With AnyEvent::HTTP and Coro you can make a proof of concept that performs better, but you're still not quite there.

* To build a real, performant parallel web crawler that makes the best use of network resources by performing parallel asynchronous DNS and parallel HTTP requests, I either need to use Perl's bloated thread model and drive Perl's UDP and TCP interfaces directly, or I need to give up on Perl and build this in C.

It really seems a shame that there are so many Perl modules dedicated to crawling tasks and yet none of them has proved up to the job of being the back end of a high-performance crawler that makes the best use of network resources. The fact that people have dedicated so much time to such modules suggests that many Perl users have an interest in web crawling. I'm wondering (I'm new to PerlMonks, please help me out here) whether there's anything we can do to set up a team of Perl developers to improve the situation and develop easy-to-use Perl modules that are up to the job.


Re^7: Async DNS with LWP
by rcaputo (Chaplain) on Oct 07, 2010 at 04:41 UTC
Re^7: Async DNS with LWP
by BrowserUk (Pope) on Oct 07, 2010 at 06:59 UTC
    I either need to use Perl's bloated thread model and directly use Perl's UDP and TCP interface or

    If you can afford to pay for sufficient bandwidth to allow for serious web crawling, then affording a box with sufficient memory to start enough threads to saturate that bandwidth will be the least of your concerns.

    I have 4GB of RAM and I can run hundreds, even thousands, of threads without getting anywhere near running out of memory. So the "bloat" of the ithreads model is neither here nor there.

    Personally, I'd forget about asynchronous DNS. I'd stick an LWP::Parallel::UserAgent instance in one thread per core and watch them totally saturate my bandwidth, no matter how fat a pipe I can afford.

    This trivial demo is running 100 threads, each with a parallel user agent, on this box as I type, in just 1/2 GB of memory:

    #! perl -slw
    use strict;
    use threads ( stack_size => 4096 );
    use Thread::Queue;
    use LWP::Parallel::UserAgent;

    sub worker {
        my $tid = threads->tid;
        my( $Qin, $Qout ) = @_;
        my $ua = LWP::Parallel::UserAgent->new;
        print "Thread: $tid ready to go";
        while( defined( my $url = $Qin->dequeue ) ) {
            print $url;
        }
    }

    our $T //= 4;

    my( $Qin, $Qout ) = map Thread::Queue->new(), 1 .. 2;
    my @workers = map async( \&worker, $Qin, $Qout ), 1 .. $T;

    sleep 100; ## Read your urls and feed the Q here.
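    For illustration only (this is not part of the demo above): once the workers actually fetch, the dequeue loop might register a batch of requests with the parallel agent and wait on them. The batch size, timeout and result format below are arbitrary assumptions, not anyone's production code:

    # Hedged sketch of a fetching worker loop; batch size (10), timeout (30s)
    # and the result pushed onto $Qout are assumptions.
    use HTTP::Request;

    while ( 1 ) {
        my @batch = $Qin->dequeue( 10 );              # block for the next 10 URLs
        last if grep !defined, @batch;                # undef = shutdown signal

        # a fresh agent per batch sidesteps reuse-after-wait() questions
        my $pua = LWP::Parallel::UserAgent->new;
        $pua->register( HTTP::Request->new( GET => $_ ) ) for @batch;

        my $entries = $pua->wait( 30 );               # run the whole batch in parallel
        for my $key ( keys %$entries ) {
            my $res = $entries->{$key}->response;
            $Qout->enqueue( join "\t", $res->request->url, $res->code );
        }
    }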

    Not that running that many threads on my 4 cores would be an effective strategy, but even if you're running on one of IBM's $250,000, 256-core, 1024-thread monsters, affording 5GB of memory so that you can run one parallel user agent on each core is the least of your worries--and the 25 lines of code above will scale to it, as-is.

    And that's what you get with threads. Simplicity and scalability.

    But that bit is easy.

    The complicated part of a high-throughput web crawler is not saturating the bandwidth. The complicated parts are:

    1. respecting robots.txt (a minimal sketch follows this list);
    2. an efficient URL extractor that can deal with not just HTML hyperlinks, but all the other kinds of absolute and relative links you need to discover and follow;
    3. scheduling URLs so that you don't hit any particular server with thousands of (concurrent or serial) requests in an effective DoS attack;
    4. having enough disk bandwidth to write the stuff out without holding everything up;
    5. having an efficient indexing/digesting mechanism to stop you chasing your tail in loops of cross-referenced pages.

      And that means indexing (digesting) the content, not just the URLs, because the same content can hide behind many different URLs.

    6. And the indexing/digesting mechanism will have to be disk-based--for both persistence and size--but it must not impact the to-disk throughput of your workers.
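    A minimal sketch of point 1 using the stock WWW::RobotRules module; the agent name and URLs are placeholders, and a real crawler would cache the parsed rules per host:

    use WWW::RobotRules;
    use LWP::Simple qw(get);

    my $rules = WWW::RobotRules->new('MyCrawler/0.1');    # placeholder agent name

    my $robots_url = 'http://example.com/robots.txt';
    my $robots_txt = get($robots_url);                    # fetch the robots.txt text
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    print "allowed\n" if $rules->allowed('http://example.com/index.html');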

    Yes. Saturating your bandwidth is trivial; it is the rest that is hard. Worrying about asynchronous DNS at this point is premature and pointless.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi,

      actually I'd already solved most of these problems, which is why I was hammering on about saturating bandwidth. I have a list of 90,000,000 registered .com domains, and my breadth-first policy ensures that I'm not hitting the same server repeatedly. In fact, I'm crawling so many domains that I don't think even the most rigorous analysis of server logs would flag my browsing as an intensive crawl.

      I use the URI module to resolve all URIs to absolute URIs, so this is not a problem. No duplicate absolute URIs are added to my to-do list.
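      (For reference, the URI call that does that resolution looks like the following; the URLs are placeholders:)

      use URI;

      my $base = 'http://example.com/dir/page.html';
      my $abs  = URI->new_abs('../other.html', $base);   # http://example.com/other.html
      print $abs->canonical, "\n";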

      My disk I/O policy is also pretty stable. I cache results in memory in a hash table. When that hash table has grown to its user-defined memory limit, the in-memory hash is used to update values in an on-disk version of the hash table using MLDBM. The theory behind this strategy is that the I/O bottleneck is reduced by lumping the writing into one big set of writes.
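      A rough sketch of that flush-to-disk step, assuming a DB_File back end via MLDBM; the file name and merge policy are placeholders, not the actual code:

      use MLDBM qw(DB_File Storable);       # DB_File back end, Storable serialiser
      use Fcntl qw(O_CREAT O_RDWR);

      my %cache;                            # in-memory results, grows to the limit
      tie my %disk, 'MLDBM', 'crawl.db', O_CREAT | O_RDWR, 0644
          or die "Cannot tie crawl.db: $!";

      sub flush_cache {
          while ( my ( $url, $record ) = each %cache ) {
              $disk{$url} = $record;        # one lump of writes instead of many small ones
          }
          %cache = ();
      }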

      As it stands, I'm trying to build a list of links from the front pages of all registered .com domains. I don't want to have to wait 3 years to do this (my current estimate for a serial LWP-based version). If I could reduce that to 3 days, or 3 hours, I would be a very happy chappy.

      Now, I've taken a quick look at your example code, but I notice that you are not actually doing anything with LWP. You've consumed 1/2 GB with only 100 threads that are not performing any TCP communication. The moment you start doing TCP, the TCP/IP stack of whatever OS you are using will start consuming even more memory. (Note that the small webbook I am developing on has about 1/2 GB to work with.)

      Now, using an AnyEvent::HTTP event-driven model, I've managed to get an average of about 50 concurrent TCP connections working. (This is on Windows 7; I'm still not sure about the internal limitations of Windows 7's TCP stack or how to tweak them--the model has changed from XP and the old registry keys are no longer valid.) Note that I am consuming nowhere near the amount of memory you quoted for your 100 threads.
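      (Roughly, the event-driven pattern with a fixed cap on in-flight connections looks like this; the limit, URL list and handler body below are placeholders:)

      use strict;
      use warnings;
      use AnyEvent;
      use AnyEvent::HTTP;

      $AnyEvent::HTTP::MAX_PER_HOST = 1;                  # be polite: one connection per host

      my @todo  = map "http://example$_.com/", 1 .. 1000; # placeholder URL list
      my $limit = 50;                                     # in-flight connection cap
      my $busy  = 0;
      my $done  = AnyEvent->condvar;

      sub pump {
          while ( $busy < $limit and @todo ) {
              my $url = shift @todo;
              $busy++;
              http_get $url, timeout => 30, sub {
                  my ( $body, $hdr ) = @_;
                  # ... hand $body off for link extraction here ...
                  $busy--;
                  if    ( @todo )  { pump() }
                  elsif ( !$busy ) { $done->send }
              };
          }
      }

      pump();
      $done->recv;                                        # run the event loop to completion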

      Now, I'm willing to accept that the bloat in your example probably comes from loading fat LWP instances and not from the process-like thread model used by Perl. So I'll experiment with threads and sockets before giving up on Perl for this job.

      In conclusion, I'm wondering at this point how difficult it would be to reimplement WWW::Mechanize as a thread-safe wrapper around optimized C code. My current experiments suggest that this is what it would take to put Perl in the running for a simple interface with a high-performance back end. As it stands, LWP is fat and Mechanize is even fatter; the current trend with Perl web-crawling modules seems to be that the easier the interface, the fatter and slower the resulting crawler.

        Reimplementing the WWW::Mechanize API is not hard. I've done so with WWW::Mechanize::Firefox. Maybe you'd be better off looking at the curl modules, which provide a different but supposedly highly optimized API.
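        For instance, the WWW::Curl::Easy interface (one of those curl bindings) looks roughly like this; the URL is a placeholder and error handling is minimal. The Multi interface then layers parallel transfers on top of handles like this one:

        use strict;
        use warnings;
        use WWW::Curl::Easy;

        my $curl = WWW::Curl::Easy->new;
        $curl->setopt( CURLOPT_URL, 'http://example.com/' );    # placeholder URL
        $curl->setopt( CURLOPT_FOLLOWLOCATION, 1 );

        my $body = '';
        open my $fh, '>', \$body or die $!;                     # collect the response in memory
        $curl->setopt( CURLOPT_WRITEDATA, $fh );

        my $rc = $curl->perform;                                # 0 on success
        if ( $rc == 0 ) {
            printf "HTTP %d, %d bytes\n",
                   $curl->getinfo( CURLINFO_HTTP_CODE ), length $body;
        }
        else {
            warn "curl error $rc: " . $curl->strerror( $rc ) . "\n";
        }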

        Now, I've taken a quick look at your example code, but I notice that you are not actually doing anything with LWP. You've consumed 1/2 GB with only 100 threads that are not performing any TCP communication. The moment you start doing TCP, the TCP/IP stack of whatever OS you are using will start consuming even more memory. (Note that the small webbook I am developing on has about 1/2 GB to work with.)

        Yes. But the point is, I wouldn't use anything like 100 threads.

        Not unless I had 100 cores anyway, and not even then because I would reserve at least 50% of my cores for digesting and link extraction. I would not be doing that within my crawler. Why? Because--from experience of writing a high-throughput crawler--it doesn't make sense to go through the process of extracting links from a page until you've digested it so that you can check whether it has already been processed by another leg of the crawler. It just wastes cycles.

        It also doesn't make sense to cache URLs in memory. When crashes happen--and they always do, especially if you are running on remote, hosted boxes with arbitrarily enforced management limits(*)--you will inevitably lose work. And that costs time and money.

        (*) We had many crashes because the hoster had management software that would terminate processes if they variously exceeded memory limits, disk I/O limits, bandwidth limits or runtime limits. They deemed all of these to be likely "runaway processes" and terminated them with prejudice. Retaining flow information in memory means losing work and time.

        And using urls (alone) as the basis of your duplication elimination also doesn't work. Take PerlMonks for instance. There are (to my knowledge) at least 3 domain names for this place, and all pages are accessible via all the domains. Add the underlying IP and that makes 4 copies of every page that you'd fetch, parse and store unless you do something to eliminate the duplicates. And then 4 copies of every link on each of those pages; and then 4 copies of the links on each of those...

        You see where that is going.

        I don't know how much (per box) bandwidth your ISP is capable of providing you with, but throwing more than low multiples of threads per core at the problem is not the solution. Far better to use 1 thread per core (that you allocate to crawling) and run a parallel user agent in each thread, fetching (say) 10 URLs concurrently on each thread. That will easily max your bandwidth without totally sucking up either memory or CPU.

        The crawler process digests (say, MD5) the content and writes it to disk under the digest. It also writes the digest to a DB fetched-queue table. Another process reads that queue, extracts links from the content and adds them to a to-fetch-queue table. This is where the crawler gets its URLs from.

        At each stage, the work is committed to the DB, so if the box goes down, it can pick up right from where it left off when it comes back up. By separating out the concerns of fetching, processing and de-duplication, you avoid doing make-work. And to balance the system you can adjust the number of threads in the crawler, the number of concurrent fetches in each of those threads, and the number of link-extractor processes that you run. With a little ingenuity, you can even automate that process by having a management process that monitors the size of the inbound and outbound DB queue tables and starts or kills link-extractor processes to compensate.
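        Purely as an illustration of that flow (SQLite via DBI is an arbitrary choice here; as the note further down says, it needn't be an RDBMS at all), the crawler-side step might look like:

        use DBI;
        use Digest::MD5 qw(md5_hex);

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=crawl.db', '', '',
                                { RaiseError => 1, AutoCommit => 1 } );

        $dbh->do( 'CREATE TABLE IF NOT EXISTS to_fetch (url    TEXT PRIMARY KEY)' );
        $dbh->do( 'CREATE TABLE IF NOT EXISTS fetched  (digest TEXT PRIMARY KEY, url TEXT)' );

        # crawler side: store the page under its digest, then queue the digest so a
        # separate link-extractor process can pick it up from the fetched table.
        # That extractor reads 'fetched', parses out links, and INSERTs new URLs
        # into 'to_fetch', which is where the crawler threads dequeue from.
        sub store_page {
            my ( $url, $content ) = @_;
            my $digest = md5_hex( $content );

            # same content already seen under another URL? then there is nothing to do
            return if $dbh->selectrow_array(
                'SELECT 1 FROM fetched WHERE digest = ?', undef, $digest );

            open my $out, '>', "pages/$digest" or die $!;
            print {$out} $content;
            close $out;

            $dbh->do( 'INSERT INTO fetched (digest, url) VALUES (?, ?)',
                      undef, $digest, $url );
        }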

        For a serious-scale crawler, you'd need to be looking at multiple boxes, each with its own direct link to the network backbone--to avoid all the boxes being limited to the throughput of some upstream choke point.

        But if you're looking for a single-box threaded solution, it still makes considerable sense to separate the concerns of fetching and link extraction, and to ensure that the ongoing work-flow state is committed to disk on the fly rather than at periodic checkpoints, which will cost you time and work if a process or the whole box fails. Note: that doesn't necessarily mean an RDBMS; they have their own concurrency limitations unless your pocket stretches to a distributed setup.


      LWP::Parallel::UserAgent doesn't work with newer versions of LWP, or did that change?

        That's a shame. It's been quite a while since I've had occasion to use it. Does it not work with LWP::Parallel?

        However, if I did need to use it, I'd just revert to the last version of LWP with which it did work.



        FWIW: If you comment out line 1479 of Parallel::UserAgent.pm, it works fine.

        As best I can tell, all you'll be missing is a few obscure fields in the protocol object that nobody uses anyway. And there seems to be some debate at the W3C as to whether servers should have to provide them at all.

        For my purposes, it is a complete no-brainer to just ignore the problem.


