PerlMonks  

Fastest way to download many web pages in one go?

by smls (Friar)
on Oct 11, 2013 at 19:37 UTC ( #1057944=perlquestion )
smls has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow Monks!

A Perl script I just started to write should download a bunch of web pages from two different domains (up to 30 pages from each domain), parse each page and collect certain information from it, and then at the very end (after all pages have been fetched and all info extracted), a "report" of sorts should be printed.

Assumptions:

  • All downloads are simple HTTP GET requests - no need for cookies etc.
  • The order in which downloads are finished is irrelevant - processing of each page can happen independently.
  • Linux only - no need to be cross-platform.
  • Will run on systems with multi-core CPUs.

Problem: Performance

The script will be triggered manually, and each second it takes to complete is a second the user will spend twiddling their thumbs. Thus the desire to complete quickly.

Parsing will undoubtedly be very fast compared to downloading, so it is the latter where I'd really like to see some performance boost compared to simply doing sequential LWP::UserAgent requests.
My (limited) experience with this kind of stuff suggests that one or more of the following might really help:

  • DNS lookup caching
  • persistent HTTP connections
  • parallel downloads

(Please tell me if I'm missing something entirely.)
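
For reference, persistent connections alone can be tried with a small change to the sequential LWP::UserAgent baseline. The following is only a minimal sketch: it assumes the URLs are passed on the command line (nothing in the question says where they come from), and the keep_alive value of 2 (one cached connection per domain) and the 10 second timeout are arbitrary choices.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# keep_alive => 2 gives the agent a connection cache holding up to two
# idle connections, so repeated requests to the same host can reuse a socket.
my $ua = LWP::UserAgent->new(keep_alive => 2, timeout => 10);

# Fetch each URL in turn; returns { url => body, or undef on failure }.
sub fetch_sequentially {
    my ($agent, @urls) = @_;
    my %body_for;
    for my $url (@urls) {
        my $res = $agent->get($url);
        $body_for{$url} = $res->is_success ? $res->decoded_content : undef;
    }
    return \%body_for;
}

# Assumption: URLs come from the command line.
my $pages = fetch_sequentially($ua, @ARGV);
for my $url (sort keys %$pages) {
    my $body = $pages->{$url};
    printf "%s: %s\n", $url, defined $body ? length($body) . " chars" : "failed";
}
```

Timing this against the same loop without keep_alive would show how much of the delay is connection setup, before any parallelism is added.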

Solution: CPAN! Problem: What module to use?

Searching CPAN reveals many modules that seem to be able to help Perl developers with some of the above download acceleration techniques, including...

...which is kind of overwhelming.

If there are any Monks out there who have experience with this kind of problem, would you mind sharing some of it with your fellow acolyte? :)

To the point:
Which CPAN module, or combination of CPAN modules, or other solution, is known to provide the best performance and reliability for doing a whole bunch of GET requests against two different domains?

Re: Fastest way to download many web pages in one go?
by BrowserUk (Pope) on Oct 11, 2013 at 20:43 UTC

    Try something like Re: Perl crashing with Parallel::ForkManager and WWW::Mechanize. Adjust $T to be 3 or 4 times the number of cores available for best throughput.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
        Is there a significant difference (for this particular usecase) between that, and modules such as Thread::Queue and Parallel::ForkManager?

        For this use case, the need to accumulate information from all the downloads in order to produce the final report excludes (for me) forking solutions, because of the complication of passing the extracted information back from child to parent.

        That means either:

        • effectively serialising the forks in order to retrieve it via pipes; or
        • adding the complexity of a multiplexing server to the parent process to allow deserialised retrieval.

        That's more work than I wish to do; and puts a bottleneck at the end of the parallelisation.

        Thread::Queue on its own is not a solution to parallelisation, though it can form the basis of a thread pool solution.

        My choice of a new thread per download rather than a thread pool solution is based on the fact that you need to parse the retrieved pages.

        Thread pools work best when the work being done for each item is very small -- ie. takes less time than spawning a new thread. Once each item has to wait on network latency for the fetch and then parse the retrieved data, the time to spawn a new thread becomes insignificant, so spawning a new thread for each of your 60 pages becomes cost effective.

        The extracted data can easily be returned via the normal return statement from the threadproc and gathered in the parent via the threads::join() mechanism.

        Thus for each page the thread processing is a simple, linear flow of: fetch url; extract information; return extracts and end.

        For the main thread it is a simple loop over the urls spawning threads; mediated by a single shared variable to limit resource -- memory or bandwidth; whichever proves to be the limiting factor on your system -- and then a second loop over the thread handles retrieving the extracted data and pulling it together into a report.

        No scope for deadlocks, livelocks or priority inversions; no need for the complexities of multiplexing servers; no need for non-blocking, asynchronous reads; no user written buffering.

        In short: simplicity.
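
The flow described above could be sketched roughly as follows. This is not the code from the linked node, only an illustration under stated assumptions: the URLs come from the command line, the "extraction" is a stand-in that grabs the page title, and the names $MAX_RUNNING, extract_title, worker and run_all are hypothetical. It needs a perl built with thread support.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use LWP::Simple qw(get);

my $MAX_RUNNING = 8;          # hypothetical cap on concurrent threads
my $running :shared = 0;      # the single shared variable limiting resources

# Stand-in for the real parsing: pull out the page title, if any.
sub extract_title {
    my ($html) = @_;
    return undef unless defined $html;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : undef;
}

# Thread body: fetch url; extract information; return extracts and end.
sub worker {
    my ($url, $fetch) = @_;
    my $title = eval { extract_title($fetch->($url)) };
    { lock($running); $running--; cond_signal($running); }
    return { url => $url, title => $title };
}

# First loop spawns one thread per URL; second loop joins and collects.
sub run_all {
    my ($fetch, @urls) = @_;
    my @threads;
    for my $url (@urls) {
        {
            lock($running);
            cond_wait($running) while $running >= $MAX_RUNNING;
            $running++;
        }
        push @threads, threads->create(\&worker, $url, $fetch);
    }
    return map { $_->join } @threads;
}

# Assumption: URLs come from the command line.
my @results = run_all(\&get, @ARGV);
printf "%s => %s\n", $_->{url}, $_->{title} // '(no title)' for @results;
```

The fetch routine is passed in as a code reference only so that the sketch can be exercised with a fake fetcher; in a real script the worker could call get() directly.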


Re: Fastest way to download many web pages in one go?
by Tanktalus (Canon) on Oct 11, 2013 at 22:12 UTC

    "parallel downloads" would scream for parallel processing ... with the only caveat that you want to communicate back to a single point (parent process) to produce a report. Now you need a way to communicate between processes.

    There are many approaches there, too. The most basic is for each process to place its information into a file, and have the parent process read the files when everyone has exited and produce the report from there. Personally, I think temporary files are a kind of code smell, but could be convinced on a case-by-case basis.

    There are similar methods - you could store your intermediary information in a database, for example. This smells less because there are many valid uses for that database - including job engine support, and then putting multiple reporters on top of it, e.g., one producing an email, another being a web page (producing HTML), another a command line output, whatever. We have a number of such job engines in our product at work, which changes the database from a minor smell to a core feature.

    Or you can pipe all the intermediary information back to the parent process. This removes the temporary files, but introduces IPC, which is similar to reading from a file without being exactly the same. It is generally close enough as long as the amount of data flowing from each subprocess is very small. The risk is that as your application grows, the data may grow, and you may hit the case where it doesn't all fit in a pipe buffer: reading from one child is delayed while three other subprocesses wait for you to clear their buffers ... and trying to figure all that out may get tricky. Fortunately there are modules that can help with this.
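
As an illustration of the pipe approach without any helper module, here is a minimal sketch using only fork, pipe and JSON::PP for serialisation. The names fork_fetch and extract_title are hypothetical, the title extraction is a stand-in for real parsing, and the URLs are assumed to come from the command line. The parent drains the pipes one after another, so a child with a lot of data simply blocks until its turn; that is acceptable for small records but is exactly the delay described above.

```perl
use strict;
use warnings;
use POSIX ();
use JSON::PP qw(encode_json decode_json);
use LWP::Simple qw(get);

# Stand-in for the real parsing: pull out the page title, if any.
sub extract_title {
    my ($html) = @_;
    return undef unless defined $html;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : undef;
}

# Fork one child per URL; each child writes one JSON record to its pipe.
sub fork_fetch {
    my ($fetch, @urls) = @_;
    my @kids;
    for my $url (@urls) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                       # child
            close $reader;
            my $title = eval { extract_title($fetch->($url)) };
            print {$writer} encode_json({ url => $url, title => $title });
            close $writer;                     # flushes the record
            POSIX::_exit(0);                   # skip END blocks and destructors
        }
        close $writer;                         # parent keeps only the read end
        push @kids, { pid => $pid, reader => $reader };
    }

    # Parent: read each pipe to EOF, reap the child, decode its record.
    my @results;
    for my $kid (@kids) {
        my $json = do { local $/; readline $kid->{reader} };
        close $kid->{reader};
        waitpid $kid->{pid}, 0;
        push @results, decode_json($json) if defined $json && length $json;
    }
    return @results;
}

# Assumption: URLs come from the command line.
my @results = fork_fetch(\&get, @ARGV);
printf "%s => %s\n", $_->{url}, $_->{title} // '(no title)' for @results;
```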

    One such way is threading. However, due to the way threading is done in perl, this comes with its own set of gotchas and learning curve. Not impossible, but not entirely free, either.

    Then there are the event-based approaches. You listed POE. I've been indoctrinated more into AnyEvent, but the general idea is the same: read your data with non-blocking calls, let the event handler worry about which handles have data on them, and wait. If processing all 30 files won't chew up much CPU, this turns out to be, IMO, an excellent choice. It works around all of the gotchas of threading, both the ones that are generic to threading (having to mutex write access to variables) and the ones that are specific to perl (sharing variables across threads), but its major downside is that everything happens on a single CPU - which is why there's the "if" there. If processing doesn't chew up much CPU and you're sitting there waiting on I/O (both network and disk), this can be a great way to go. The fun bit here is that if the processing takes a long time in wall-clock terms, not just CPU time, because you're blocking while waiting for something, e.g., calling system, then you have to write all that to be non-blocking. Note that Coro can also help here in making your code a bit easier to read (IMO).

    Personally, I'd probably start with AnyEvent::HTTP. Download all the files at the same time, and then process them, and put the final data into a hash or whatever. When everything is done, produce the report from that hash. If processing starts to take too much CPU time, then I would look at AnyEvent::Fork::RPC - the subprocess could do the fetching (possibly via LWP or via AnyEvent::HTTP) and process it, returning the results over RPC to the parent process.
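
A minimal sketch of that starting point, assuming the URLs come from the command line and using a stand-in title extraction in place of the real parsing (fetch_all and extract_title are hypothetical names, and the per-host limit of 8 and the 10 second timeout are arbitrary values):

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

# Raise the module's per-host connection limit (8 is an arbitrary value);
# with only two domains the default would otherwise throttle the downloads.
$AnyEvent::HTTP::MAX_PER_HOST = 8;

# Stand-in for the real parsing: pull out the page title, if any.
sub extract_title {
    my ($html) = @_;
    return undef unless defined $html;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : undef;
}

# Start all requests at once and collect the results into one hash.
sub fetch_all {
    my (@urls) = @_;
    my %report;
    my $cv = AnyEvent->condvar;
    $cv->begin;                                # guard so an empty list returns
    for my $url (@urls) {
        $cv->begin;
        http_get $url, timeout => 10, sub {
            my ($body, $hdr) = @_;
            my $ok = $hdr->{Status} =~ /^2/;
            $report{$url} = {
                status => $hdr->{Status},
                title  => $ok ? extract_title($body) : undef,
            };
            $cv->end;
        };
    }
    $cv->end;
    $cv->recv;                                 # run the event loop until done
    return \%report;
}

# Assumption: URLs come from the command line.
my $report = fetch_all(@ARGV);
for my $url (sort keys %$report) {
    printf "%s [%s] %s\n", $url, $report->{$url}{status},
        $report->{$url}{title} // '(no title)';
}
```

Because every callback runs in the same process, the report hash needs no locking and no IPC; the price is that all parsing happens on one CPU.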

    Hope that helps.

Re: Fastest way to download many web pages in one go?
by ig (Vicar) on Oct 12, 2013 at 09:56 UTC
    Parsing will undoubtedly be very fast compared to downloading...

    If parsing takes negligible time compared with DNS lookup and getting the documents, I would explore and accept or rule out asynchronous DNS and HTTP solutions before pursuing parallel processing, but you don't say much about the complexity of the parsing or where your bottlenecks are.

    Years ago I customized a very nice C library for parallel DNS queries and interfaced it to Perl, but I was doing millions of lookups per job. It may have been ADNS, accessible via Net::ADNS but I don't recall with any certainty. If I understand correctly, you are only dealing with two domain names. Unless the name to address resolution is liable to change, you might do best to hard code the IP addresses and dispense with DNS and its delays altogether.

    This only leaves the HTTP requests to execute in parallel. Depending on the volume you are downloading, asynchronous requests might suffice. Unless you have multiple NICs to go with your multiple CPUs, it's not clear that parallel processes would be much benefit to the download time. Will it take longer to process a packet than it takes to receive it? I haven't used anything for asynchronous HTTP requests recently, but a quick search reveals HTTP::Async which looks like it might be worth a try, in addition to LWP::Parallel::UserAgent which you already found.

    If all your HTTP requests are done in parallel, persistent connections would be irrelevant: each connection would handle a single request. On the other hand, if you are also downloading linked resources (you don't say) then persistent connections might help.

    If, after all, parsing time is not negligible, then you will have the challenges of parallel processing and IPC, but this can be dealt with independent of the download issue.

    I would start with simple solutions and experiment with more complex options only if the simple ones prove to be inadequate, at which point I would have more specific problems to deal with.

Re: Fastest way to download many web pages in one go?
by zentara (Archbishop) on Oct 12, 2013 at 11:03 UTC
    If you want speed AND separate processes which can work in parallel, a simple solution would be to fork off separate wget requests. wget is a powerful, feature-rich, command-line web retriever, and will work well in parallel forks, as long as you have enough bandwidth for them to share.
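
One way this might look, as a rough sketch only: it assumes wget is on the PATH, that the URLs come from the command line, and that saving to page-N.html is acceptable (the file naming and the helper names wget_command and run_parallel are made up for illustration). Parsing would then read the saved files once every child has exited.

```perl
use strict;
use warnings;
use POSIX ();

# Build the wget invocation for one URL: -q is quiet, -O names the output file.
sub wget_command {
    my ($url, $file) = @_;
    return ('wget', '-q', '-O', $file, $url);
}

# Run every command in its own child process; return exit codes in input order.
sub run_parallel {
    my (@commands) = @_;                       # each one is [program, args...]
    my %slot_of;
    for my $i (0 .. $#commands) {
        my $cmd = $commands[$i];
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: replace this process with the command, bypassing the shell.
            exec { $cmd->[0] } @$cmd or POSIX::_exit(127);
        }
        $slot_of{$pid} = $i;
    }
    my @exit_codes;
    while ((my $pid = wait()) != -1) {         # reap children as they finish
        $exit_codes[ $slot_of{$pid} ] = $? >> 8 if exists $slot_of{$pid};
    }
    return @exit_codes;
}

# Assumption: URLs come from the command line.
my @urls       = @ARGV;
my @commands   = map { [ wget_command($urls[$_], "page-$_.html") ] } 0 .. $#urls;
my @exit_codes = run_parallel(@commands);
print "$urls[$_]: exit $exit_codes[$_]\n" for 0 .. $#urls;
```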

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: Fastest way to download many web pages in one go?
by Lennotoecom (Monk) on Oct 12, 2013 at 19:19 UTC
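    here is a small example i wrote for you
    it parses all hrefs from the original page
    and downloads them into the directory pages,
    forking at every link:

        use strict;
        use warnings;
        use LWP::Simple qw(get);

        my $html = get("http://perlmonks.org/index.pl?")
            or die "could not fetch the index page";
        mkdir "pages" unless -d "pages";

        # fork one child per absolute http link; only the child downloads
        while ($html =~ /a href="(http:\/\/[^"]+)"/g) {
            my $url = $1;
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            next if $pid;                      # parent: go on to the next link
            my ($host) = $url =~ m{^http://([\w.-]+)};
            open my $out, '>:encoding(UTF-8)', "pages/$$-$host" or die $!;
            print {$out} get($url) // '';      # child: save the page, then exit
            close $out;
            exit 0;
        }
        1 while wait() != -1;                  # parent: reap all children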
    please correct mistakes if you notice anything
    i'll accept your corrections with all gratitude
    and humility
    thank you

      What happens when you run it?


      Dave

        well, it gets the first pointed http page
        and then uncontrollably forks at every "a href" link on that page
        getting and saving it to the local dir,
        naming files $pid-domainname.domain
Re: Fastest way to download many web pages in one go?
by nonsequitur (Beadle) on Oct 13, 2013 at 01:53 UTC
Re: Fastest way to download many web pages in one go?
by sundialsvc4 (Monsignor) on Oct 15, 2013 at 14:09 UTC

    As an aside, at one “shop” where I was working, they had a variation of the Unix execargs command which supported pooling. It was just an -n number_of_children parameter (or something like that ...), but it sure was useful. The command worked in the usual way ... read lines from STDIN and execute a command with that line in its argument-string ... but it supported n children doing the commands simultaneously. Each child ran, did its thing, and then died. Maybe this is a standard feature ... I don't know ... but it cropped up everywhere in the stuff that they were doing, as a useful generalization. Here, you’d feed it a file containing a list of URLs and use it to drive a command that took one URL as a command-line parameter, retrieved and processed it. Since each process would actually spend most of its time waiting for some host to respond, you could run a very large number of ’em.
