PerlMonks  

Re: Speeding up/parallelizing hundreds of HEAD requests

by hacker (Priest)
on Sep 17, 2007 at 20:30 UTC (#639482)


in reply to Speeding up/parallelizing hundreds of HEAD requests

I'm taking another approach to this problem... based on the comments from theorbtwo. The current code looks like this:

use LWP::UserAgent;
use HTTP::Request;

sub gimme_guten_tables {
    my ($decoded, $maximum) = @_;

    # Strip the list/feed markup so that each book ends up on its own line
    $decoded =~ s,<li>\n(.*?)\n</li>,$1,g;
    $decoded =~ s,(.*?)<br><description>.*?</description>,$1,g;
    $decoded =~ s,<ul>(.*?)</ul>,$1,g;
    $decoded =~ s,<li>(.*?)</li>,$1,g;
    $decoded =~ s,<\/?ol>,,g;
    $decoded =~ s,<html xmlns:rss="http://purl.org/rss/1.0/"><body><ul>,,;
    $decoded =~ s,</ul></body></html>\n.*,,;
    $decoded =~ s,^\n<a,<a,g;

    my @gutenbooks = ($decoded =~ /([^\r\n]+)(?:[\r\n]{1,2}|$)/sg);
    my $guten_tables;
    my ($link_status, $plkr_type, $html_type, $text_type);
    my $count = 1;

    for my $line (@gutenbooks[0 .. $maximum-1]) {
        # $1 = etext id, $2 = title, $3 = optional count in parentheses
        if ($line && $line =~ m/href=".+\/(\d+)">(.*?)(?: \((\d+)\))?<\/a>/) {
            my $splitguten = join('/', split(/ */, $1));
            my $clipguten  = substr($splitguten, -2, 2, '');
            my $readmarks  = $3 ? $3 : $1;
            my $title      = $2;
            $title =~ s,by (.*?)</a>,</a> by $1,g;

            # One candidate URL per format for this etext
            my %gutentypes = (
                plucker => {
                    'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
                    'content-type' => 'application/prs.plucker',
                    'string'       => 'Plucker',
                    'format'       => 'pdb'
                },
                html => {
                    'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
                    'content-type' => 'text/html',
                    'string'       => 'Marked-up HTML',
                    'format'       => 'html'
                },
                text => {
                    'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
                    'content-type' => 'text/plain',
                    'string'       => 'Plain text',
                    'format'       => 'txt'
                },
            );

            # Serial HEAD check: one blocking request per format, per book
            for my $types ( sort keys %gutentypes ) {
                my ($status, $type) = test_head($gutentypes{$types}{mirror});
                if ($status == 200) {
                    $gutentypes{$types}{link} =
                        qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
                } else {
                    $gutentypes{$types}{link} =
                        qq{<s>$gutentypes{$types}{format}</s>};
                }
            }

            $guten_tables .= qq{<tr>
                <td width="40" align="center">$count</td>
                <td width="40" align="right">$readmarks</td>
                <td width="500">
                <a href="http://www.gutenberg.org/etext/$1">$title</a>
                </td>
                <td align="center">$gutentypes{plucker}{link}</td>
                <td align="center">$gutentypes{html}{link}</td>
                <td align="center">$gutentypes{text}{link}</td>
                </tr>\n};
            $count++;
        }
    }

    $guten_tables =~ s,\&,\&amp;,g;
    $guten_tables =~ s,>\n\s+<,><,g;
    return $guten_tables;
}

# Issue a single HEAD request; return the numeric status and the Content-Type
sub test_head {
    my $url = shift;

    my $ua = LWP::UserAgent->new();
    $ua->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1;) Firefox/2.0.0.6');

    my $request  = HTTP::Request->new(HEAD => $url);
    my $response = $ua->request($request);
    my $status   = $response->status_line;
    my $type     = $response->header('Content-Type');
    my $content  = $response->content;

    $status =~ m/(\d+)/;
    return ($1, $type);
}

In this code, I'm taking an array, @gutenbooks, splitting out the etext id ($1) and the etext title ($2), and creating a hash of the 3 different formats of that work (pdb, html, txt).

For each link I create, I pass it through test_head(), and check to see if it returns a '200' status or not. If the link is a '200' (i.e. exists, and is valid), I create a clickable link to it. If the link is NOT '200', then I don't link to it (i.e. I don't create a link that the user can click, to get a 404 or missing document).

What I'd like to implement is a way to take all of the links at once, pass them into some sub, parallelize the HEAD checks across them, and return answers based on those checks.

But here is where I'm stuck...

  1. How do I take the individual URLs coming out of my match function and build a hash of them?
  2. How do I then pass that hash to "something" that can check their validity (in some random order)?
  3. How do I keep track of the responses returned from that check, maintaining integrity, so I can link/unlink the entry in the table I'm outputting?
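
To make the three questions concrete, here is a minimal sketch of the bookkeeping they describe, assuming the URLs are first collected per etext id and per format. The names index_urls and fold_results and the example.org URLs are hypothetical, not part of the code above. The idea is to key a lookup table by URL, so that whatever order the checks finish in, each status can be traced back to its table row.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten { id => { format => url } } into a lookup
# keyed by URL, so a result for any URL can be traced back to its row.
sub index_urls {
    my ($books) = @_;
    my %where;
    for my $id (keys %$books) {
        for my $type (keys %{ $books->{$id} }) {
            $where{ $books->{$id}{$type} } = [ $id, $type ];
        }
    }
    return \%where;
}

# Hypothetical helper: fold { url => status } back into
# { id => { format => 1 or 0 } }, where 1 means the HEAD check returned 200.
sub fold_results {
    my ($where, $status) = @_;
    my %ok;
    for my $url (keys %$status) {
        next unless $where->{$url};    # ignore URLs we never asked about
        my ($id, $type) = @{ $where->{$url} };
        $ok{$id}{$type} = (($status->{$url} || 0) == 200) ? 1 : 0;
    }
    return \%ok;
}

# Example data (made-up URLs); the status hash could arrive in any order.
my $books = {
    1342 => {
        html => 'http://example.org/1342-h.htm',
        text => 'http://example.org/1342.txt',
    },
};
my $where = index_urls($books);
my $ok    = fold_results($where, {
    'http://example.org/1342.txt'   => 404,
    'http://example.org/1342-h.htm' => 200,
});
```

With something like this in place, the table-building loop would only need to consult $ok->{$id}{$type} when deciding between the link and the struck-out format name. How the status hash gets filled is a separate question.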

I have no experience with LWP::Parallel, LWP::ParallelUA, LWP::Parallel::ForkManager and the like (passing references, callbacks, etc.)

Can some monk give me a strong nudge in the right direction?

The docs for these modules assume I am just statically defining the URLs I want to check... and I can't do that; everything will be coming out of a dynamic, ever-changing array.

Thanks.


Re^2: Speeding up/parallelizing hundreds of HEAD requests
by BrowserUk (Pope) on Sep 17, 2007 at 22:11 UTC

    You could add two lines to your code above to achieve your goal.

    ... async{ ... } ...

    Of course, a complete solution would add a few more lines in order to terminate slow or absent mirrors. And a couple (2 or 3) more to share the results of the asynchronous calls with the main thread of the code.
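
    For readers unfamiliar with the idiom, the following is a minimal sketch of the general shape being described. It is an illustration only, not code from this thread. It assumes a perl built with ithreads, and the name check_all and the stubbed checker are hypothetical. Each URL is checked inside an async block, and the results are shared with the main thread through a threads::shared hash.

```perl
use strict;
use warnings;
use threads;            # assumption: perl was built with ithreads
use threads::shared;

# Hypothetical helper: run $check->($url) for every URL in its own thread
# and collect the returned status codes in a shared hash keyed by URL.
sub check_all {
    my ($check, @urls) = @_;
    my %status :shared;

    my @workers = map {
        my $url = $_;
        async {
            my $code = $check->($url);
            lock(%status);              # serialize writes to the shared hash
            $status{$url} = $code;
        };
    } @urls;

    $_->join for @workers;              # wait for every check to finish
    return { %status };                 # plain (unshared) copy for the caller
}

# Stub checker so the sketch runs without network access; a real one
# would issue the HEAD request and return the numeric status.
my $results = check_all(
    sub { $_[0] =~ /missing/ ? 404 : 200 },
    'http://example.org/a', 'http://example.org/missing',
);
```

    In real use the checker could wrap the existing sub, for example sub { (test_head($_[0]))[0] }, and calling LWP::UserAgent's timeout method inside test_head would keep a slow or absent mirror from stalling its thread indefinitely. One thread per URL is the simplest form. With hundreds of URLs it would probably be wiser to cap the number of threads running at once.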

    The total absence of the word "threads" from your question and responses suggests that you will not consider such a solution...and I've gotten out of the habit of expending time producing and testing solutions that will likely simply be ignored. But for the problem you are trying to solve, threads is the simplest, fastest, easiest to understand solution.

    It is also the case that I am not currently in a position to offer a tested solution, and unfortunate that even those here that do not dismiss threads as a viable solution, rarely seem to offer code.

    C'est la vie.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
