PerlMonks  

Re: Fastest way to download many web pages in one go?

by Tanktalus (Canon)
on Oct 11, 2013 at 22:12 UTC ( [id://1057955]=note )


in reply to Fastest way to download many web pages in one go?

"parallel downloads" would scream for parallel processing ... with the only caveat that you want to communicate back to a single point (parent process) to produce a report. Now you need a way to communicate between processes.

There are many approaches there, too. The most basic is for each process to place its information into a file, and have the parent process read the files when everyone has exited and produce the report from there. Personally, I think temporary files are a kind of code smell, but could be convinced on a case-by-case basis.
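For what it's worth, a minimal sketch of that file-per-child idea might look like the following. It uses only core modules. The job names and the "name, tab, length" record format are made up for illustration; the real child would download and process a page instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical job list; in a real script each entry would be a URL.
my @jobs = qw(alpha beta gamma);

# One scratch directory, removed when the parent exits.
my $dir = tempdir(CLEANUP => 1);

my %file_for;    # child pid => file that child was told to write
for my $i (0 .. $#jobs) {
    my $file = "$dir/job-$i.txt";
    my $pid  = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: stand-in for "download and process one page".
        open my $out, '>', $file or POSIX::_exit(1);
        print {$out} "$jobs[$i]\t", length($jobs[$i]), "\n";
        close $out or POSIX::_exit(1);
        POSIX::_exit(0);    # skip END blocks so the child never touches $dir
    }
    $file_for{$pid} = $file;
}

# Parent: reap each child, then read the file it left behind.
my %report;
while ((my $pid = wait()) != -1) {
    my $file = delete $file_for{$pid} or next;
    if ($? != 0) { warn "child $pid failed\n"; next; }
    open my $in, '<', $file or die "cannot read $file: $!";
    chomp(my $line = <$in>);
    my ($name, $size) = split /\t/, $line;
    $report{$name} = $size;
}

printf "%-6s %d\n", $_, $report{$_} for sort keys %report;
```

The parent never reads a file until the child that owns it has exited, so there is no locking to think about. That simplicity is the main thing this approach has going for it.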

There are similar methods - you could store your intermediary information in a database, for example. This smells less because there are many valid uses for that database - including job engine support, and then putting multiple reporters on top of it, e.g., one producing an email, another being a web page (producing HTML), another a command line output, whatever. We have a number of such job engines in our product at work, which changes the database from a minor smell to a core feature.

Or you can pipe all the intermediary information back to the parent process. This removes the temporary files but introduces IPC. Reading from a pipe is not terribly different from reading from a file, but it is not exactly the same either; it is generally close enough as long as the amount of data flowing from each subprocess is very small. The risk is that as your application grows, the data may grow with it, until a child's output no longer fits in the pipe's buffer and that child blocks until you read from it. Meanwhile you have three other subprocesses returning data, all waiting for you to clear their buffers ... and figuring all that out can get tricky. Fortunately there are modules that can help with this.
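As a rough sketch of the pipe approach, here is one way to drain several children at once with the core IO::Select module, so no child sits blocked on a full pipe while the parent is busy reading another. The job names and the oversized dummy payload are assumptions for the demo, chosen so that each child writes more than a typical pipe buffer holds.

```perl
use strict;
use warnings;
use IO::Select;
use POSIX ();

# Hypothetical job list; in a real script each entry would be a URL.
my @jobs = qw(alpha beta gamma);

my $select = IO::Select->new;
my (%name_for, %buffer);    # both keyed by the stringified read handle

for my $job (@jobs) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        close $reader;
        # Stand-in for real work: a payload far larger than a pipe buffer.
        print {$writer} uc($job) x 50_000;
        close $writer;
        POSIX::_exit(0);
    }
    close $writer;          # the parent keeps only the read end
    $select->add($reader);
    $name_for{$reader} = $job;
    $buffer{$reader}   = '';
}

# Read from whichever child has data ready until all pipes hit EOF.
while ($select->count) {
    for my $fh ($select->can_read) {
        my $n = sysread($fh, $buffer{$fh}, 65536, length $buffer{$fh});
        die "read failed: $!" unless defined $n;
        next if $n;             # got data; more may follow
        $select->remove($fh);   # zero bytes: the child closed its end
        close $fh;
    }
}
1 while wait() != -1;           # reap the children

my %report = map { $name_for{$_} => length $buffer{$_} } keys %buffer;
printf "%-6s %d bytes\n", $_, $report{$_} for sort keys %report;
```

If the parent instead read each pipe to EOF in turn, the other children would stall as soon as their buffers filled. That is the "tricky" case described above, and it is what the select loop avoids.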

Another way is threading. However, due to the way threading is done in perl, this comes with its own set of gotchas and its own learning curve. Not impossible, but not entirely free, either.

Then there are the event-based approaches. You listed POE. I've been indoctrinated more into AnyEvent, but the general idea is the same: read your data with non-blocking calls, let the event handler worry about which handles have data on them, and wait. If processing all 30 files won't chew up much CPU, this turns out to be, IMO, an excellent choice. It works around all of the gotchas of threading, both the ones that are generic to threading (having to mutex write access to variables) and the ones that are specific to perl (sharing variables across threads). Its major downside is that everything happens on a single CPU, which is why there's the "if" there. If processing doesn't chew up much CPU and you're mostly sitting there waiting on I/O (both network and disk), this can be a great way to go. The catch is that if the processing takes a long time in wall-clock terms rather than CPU time, because you're blocking while waiting for something (calling system, for example), then you have to rewrite all of that to be non-blocking as well. Note that Coro can also help here in making your code a bit easier to read (IMO).

Personally, I'd probably start with AnyEvent::HTTP. Download all the files at the same time, then process them, and put the final data into a hash or whatever. When everything is done, produce the report from that hash. If processing starts to take too much CPU time, then I would look at AnyEvent::Fork::RPC: the subprocess could do the fetching (possibly via LWP or via AnyEvent::HTTP) and the processing, returning the results over RPC to the parent process.
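A minimal sketch of that starting point, assuming AnyEvent and AnyEvent::HTTP are installed from CPAN. The URLs come from the command line, and the summarize helper and the 30-second timeout are illustrative choices; the real per-page processing would go where summarize is.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

# URLs to fetch, taken from the command line (assumption for this sketch).
my @urls = @ARGV;

my %report;                         # url => summary of what came back
my $done = AnyEvent->condvar;

$done->begin;                       # guard so recv returns even with no URLs
for my $url (@urls) {
    $done->begin;
    http_get $url, timeout => 30, sub {
        my ($body, $headers) = @_;
        $report{$url} = summarize($body, $headers);
        $done->end;
    };
}
$done->end;
$done->recv;                        # run the event loop until every request ends

# Everything is in the hash now; produce the report from it.
for my $url (sort keys %report) {
    my $r = $report{$url};
    printf "%s  %s %s  %d bytes\n", $url, $r->{status}, $r->{reason}, $r->{bytes};
}

# Hypothetical processing step: reduce one response to a few fields.
# AnyEvent::HTTP reports failures as 59x pseudo-status codes in Status.
sub summarize {
    my ($body, $headers) = @_;
    return {
        status => $headers->{Status},
        reason => $headers->{Reason},
        bytes  => defined $body ? length $body : 0,
    };
}
```

All the requests are in flight at once, and each callback runs as its response arrives. As long as summarize stays cheap, the single process spends nearly all of its time waiting on the network, which is the case where this design pays off.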

Hope that helps.
