in reply to
Advice on Efficient Large-scale Web Crawling
Personaly, I think you're engaging in premature optimization here: when fetching 4M urls, the DNS traffic is unlikely to be your biggest concern.
Having said that, the cheapest/cleanest method would be to install a caching-only DNS server on your localhost, and let it handle the DNS caching.
Some reasons why your current solution might be slow:
- are all those 4 pages each in a flat file, and all the flat files in one directory? You'd be better off distributing them over a tree of directories.
- Do you have enough bandwidth to download all those pages? The line might be saturated with that much data. If you are connected through some asymetric line (like ADSL), your downloads could be chocked by the lack of bandwidth for the ACK traffic.
- Do you have enough memory for all the processes you've started? If your processes are being swapped out, they will not only be running more slowly as different processes are getting swapped in and out, but they'll probably compete for disk bandwidth with the files you're writing out.