
Re^9: Async DNS with LWP

by BrowserUk (Pope)
on Oct 07, 2010 at 09:56 UTC (#863969)

in reply to Re^8: Async DNS with LWP
in thread Async DNS with LWP

Now, I've taken a quick look at your example code but notice that you are not actually doing anything with LWP. You've consumed 1/2 GB with only 100 threads that are not performing any TCP communication. The moment you start doing TCP, the TCP/IP stack of whatever OS you are using will start consuming even more memory. (Note that the small netbook I am developing on has about 1/2 GB to work with.)

Yes. But the point is, I wouldn't use anything like 100 threads.

Not unless I had 100 cores anyway, and not even then because I would reserve at least 50% of my cores for digesting and link extraction. I would not be doing that within my crawler. Why? Because--from experience of writing a high-throughput crawler--it doesn't make sense to go through the process of extracting links from a page until you've digested it so that you can check whether it has already been processed by another leg of the crawler. It just wastes cycles.

It also doesn't make sense to cache URLs in memory. When crashes happen--and they always do, especially if you are running on remote, hosted boxes with arbitrarily enforced management limits(*)--then you will inevitably lose work. And that costs time and money.

(*) We had many crashes because the hoster had management software that would terminate processes if they variously: exceeded memory limits; exceeded disk-IO limits; exceeded bandwidth limits; or exceeded runtime limits. They deemed all of these likely to be "runaway processes" and terminated them with prejudice. Retaining flow information in memory means losing work and time.

And using URLs (alone) as the basis of your duplicate elimination also doesn't work. Take PerlMonks for instance. There are (to my knowledge) at least 3 domain names for this place, and all pages are accessible via all the domains. Add the underlying IP and that makes 4 copies of every page that you'd fetch, parse and store unless you do something to eliminate the duplicates. And then 4 copies of every link on each of those pages; and then 4 copies of the links on each of those...

You see where that is going.
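The host-aliasing problem above can be sketched as a small canonicalisation step applied before a URL is queued. The alias map below is purely illustrative (the hostnames and IP are assumptions, not a real inventory of PerlMonks mirrors); the point is that de-duplication then reduces to a seen-hash keyed on the canonical form.

```perl
use strict;
use warnings;

# Hypothetical alias map: every known mirror hostname (or raw IP)
# is folded to one canonical host before the URL is queued.
my %canonical_host = (
    'www.perlmonks.org' => 'perlmonks.org',
    'perlmonks.com'     => 'perlmonks.org',
    'perlmonks.net'     => 'perlmonks.org',
    '192.0.2.1'         => 'perlmonks.org',   # underlying IP (illustrative)
);

sub canonicalise {
    my ($url) = @_;
    my ($scheme, $host, $rest) = $url =~ m{^(https?)://([^/]+)(/.*)?$}
        or return $url;                 # leave non-HTTP URLs untouched
    $host = lc $host;
    $host = $canonical_host{$host} // $host;
    $rest = '/' unless defined $rest && length $rest;
    return "$scheme://$host$rest";
}

# De-duplication keyed on the canonical form: the second URL is skipped.
my %seen;
for my $url ('http://perlmonks.com/?node_id=863969',
             'http://www.perlmonks.org/?node_id=863969') {
    my $canon = canonicalise($url);
    next if $seen{$canon}++;
    # ... queue $canon for fetching ...
}
```

A real crawler would also need to normalise case, default ports, and trailing-slash variants, but the same table-lookup shape applies.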

I don't know how much (per box) bandwidth your ISP is capable of providing you with, but throwing more than low multiples of threads per core at the problem is not the solution. Far better to use one thread per core (of those you allocate to crawling) and run a parallel user agent in each thread, fetching (say) 10 URLs concurrently per thread. That will easily max out your bandwidth without swamping either memory or CPU.
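The thread-per-core, batch-per-thread shape described above can be sketched with core threads and Thread::Queue. The fetch_batch sub is a stand-in (a real crawler would hand each batch to a parallel user agent such as HTTP::Async or LWP::Parallel::UserAgent); the thread and batch counts are the assumptions from the text, not tuned values.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $CRAWL_THREADS    = 4;    # ~ one per core reserved for crawling (assumption)
my $FETCHES_PER_PASS = 10;   # concurrent requests per thread, per the text

my $todo = Thread::Queue->new;
my $done = Thread::Queue->new;

# Stand-in for a parallel user agent: returns one result per URL.
# In real code this is where 10 concurrent HTTP fetches would happen.
sub fetch_batch {
    my @urls = @_;
    return map { { url => $_, content => "<html>...</html>" } } @urls;
}

sub crawler {
    while (defined(my $batch = $todo->dequeue)) {
        $done->enqueue($_) for fetch_batch(@$batch);
    }
}

my @workers = map { threads->create(\&crawler) } 1 .. $CRAWL_THREADS;

# Feed URLs in batches of $FETCHES_PER_PASS.
my @urls = map { "http://example.com/page$_" } 1 .. 40;
while (my @batch = splice @urls, 0, $FETCHES_PER_PASS) {
    $todo->enqueue(\@batch);
}
$todo->enqueue(undef) for @workers;    # one terminator per thread
$_->join for @workers;
```

The queues decouple feeding from fetching, so the number of worker threads and the batch size can be tuned independently, which is exactly the balancing knob described below.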

The crawler process digests the content (say, with MD5) and writes it to disk under the digest. It also writes the digest to a fetched-queue table in the DB. Another process reads that queue, extracts links from the content, and adds them to a to-fetch-queue table. This is where the crawler gets its URLs from.
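The "write it to disk under the digest" step amounts to a content-addressed store: identical content fetched via different URLs lands in one file, so the duplicate is detected for free. A minimal sketch (the store root and two-level fan-out are assumptions; the DB enqueue is left as a comment):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path  qw(make_path);

# Store a page body under its own MD5 digest. Returns the digest,
# which a real crawler would then enqueue to the fetched-queue table.
sub store_content {
    my ($store_root, $content) = @_;
    my $digest = md5_hex($content);

    # Two-level fan-out keeps any one directory from growing huge.
    my $dir = "$store_root/" . substr($digest, 0, 2);
    make_path($dir);

    my $path = "$dir/$digest";
    unless (-e $path) {                # duplicate content: nothing to write
        open my $fh, '>', $path or die "open $path: $!";
        print {$fh} $content;
        close $fh or die "close $path: $!";
    }
    return $digest;
}
```

Storing the same content twice is idempotent, which is what lets the de-duplication check live in the link-extractor process rather than the crawler.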

At each stage, the work is committed to the DB, so if the box goes down, it can pick up right from where it left off when it comes back up. By separating out the concerns of fetching, processing, and de-duplication, you avoid doing make-work. And to balance the system, you can adjust the number of threads in the crawler; the number of concurrent fetches in each of those threads; and the number of link-extractor processes that you run. With a little ingenuity, you can even automate that balancing by having a management process that monitors the sizes of the inbound and outbound DB queue tables and starts or kills link-extractor processes to compensate.

For a serious-scale crawler, you'd need to be looking at multiple boxes, each with its own direct link to the network backbone--to avoid all the boxes being limited by the throughput of some upstream choke point.

But if you're looking for a single-box threaded solution, it still makes considerable sense to separate the concerns of fetching and link extraction, and to ensure that the ongoing work-flow state is committed to disk on the fly, rather than at periodic points, which will cost you time and work if a process or the whole box fails. Note: that doesn't necessarily mean an RDBMS; they have their own concurrency limitations unless your pocket stretches to a distributed setup.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^10: Async DNS with LWP
by jc (Acolyte) on Oct 07, 2010 at 20:19 UTC
    As it stands, I'm developing this on a single core. The computer crashing hasn't been an issue for me, and in order to minimise repetition of work, the state of the crawl is saved to disk each time memory fills, so proportionally not that much work would actually be repeated should my little box ever decide to crash. I do take your point, though, and will experiment further with writing to disk on the fly. I'm not sure what sort of optimisations you would propose to make writing quicker. As far as I know, in general, the only way to make writing to disk quicker is to write as much data as possible in one go, and to make those writes to consecutive space (not really possible for a hash table).

    Anyway, I'm not interested in duplicate content because I don't even process the content. The goal is to create a map of links on the internet. Whether there are a number of different roads that lead to the same location does not concern me at this point; what concerns me is to exhaustively map those roads. So that brings us back to what my real present problem is: making the best use of the bandwidth available.
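The "write as much data in one go" idea above can be sketched as an append-only link log: edges accumulate in a memory buffer and are flushed to disk in large sequential appends, rather than updating a hash table in place. This is a sketch with an arbitrary illustrative threshold, not the poster's actual code.

```perl
use strict;
use warnings;
use IO::Handle;    # for $fh->flush

package LinkLog;

# Buffered append-only log of (from, to) link edges.
sub new {
    my ($class, $path, $threshold) = @_;
    open my $fh, '>>', $path or die "open $path: $!";
    return bless {
        fh        => $fh,
        buf       => '',
        threshold => $threshold // 65536,   # flush every ~64 KB (illustrative)
    }, $class;
}

sub add {
    my ($self, $from, $to) = @_;
    $self->{buf} .= "$from\t$to\n";
    $self->flush if length($self->{buf}) >= $self->{threshold};
}

sub flush {
    my ($self) = @_;
    return unless length $self->{buf};
    print { $self->{fh} } $self->{buf};     # one large sequential append
    $self->{fh}->flush;
    $self->{buf} = '';
}

package main;
```

Because the log only ever appends, writes stay sequential regardless of how the in-memory URL structures are organised, and a crash loses at most one unflushed buffer rather than the whole crawl state.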

      Are you intending to move this to a beefier box at some point in the future? If so, what spec of box and what bandwidth will it have?

      If not, what bandwidth do you have on the current box?

        Yes! Definitely! Ideally I would want as fat a pipe as possible, with lots and lots of memory. I'm negotiating with my University, but that is unlikely to get anywhere fast, so it looks like I'm going to have to fork out for a dedicated server myself. Seeing that I am concentrating on .com domains at the moment, the geographical location and connection factors of the box could be important.

        These guys also seem to be the only ones I can find with decent OpenBSD dedicated servers. With 1,500 GB/month transfer and a maximum burstable speed of 100 Mbps at no additional charge, it looks like as good a deal as I'm likely to find, and they seem to be pretty well connected as well, especially for US-based traffic. I've also used these guys before and their service was pretty good: they fix problems fast and don't charge for the service. However, as it stands, I'm connected to the internet through a USB modem via a mobile phone operator that offers variable bandwidth (depending on where you are) and is limited to only 100 hours per month.

        In any case, the amount of time I crawl the net in is always going to be ultimately limited by bandwidth and how best I use it. Asynchronous DNS and HTTP seem to be the fundamental issues. I'm almost at the point where I'm thinking: why didn't I just code this in C from the very beginning? I once made (years ago, and I no longer have the code) an asynchronous DNS resolver in C and, yes, it did take much longer to implement, but as far as I can remember it was as fast as the 100 Mbps burstable speed could take. In fact, it ran so well that we had to redirect it to a better DNS server (a cluster of load-balanced DNS servers) so that they could keep up with the requests. I've even thought that maybe it is a good idea to write my own recursive resolver to see if there are ways this process can be optimised.

        Maybe this is overkill, but I want this to work in hours (maybe days), certainly not in years.
