http://www.perlmonks.org?node_id=639262

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on a replacement website for Plucker, and one of its features is a live pull of some of the Project Gutenberg etexts from their today.rss feed and their Top 100 list of electronic texts.

So far, this works great. I've even worked out a simple caching mechanism that only queries the upstream data when it has changed.

From this data, I build an HTML table that links to several versions of the etext, for our users. That data looks like this:

Place | Etext # | Book Title | Download as...
1  | 22617 | Chambers's Edinburgh Journal, No. 454 by Various | pdb html txt
2  | 22621 | The New England Magazine, Volume 1, No. 1, January 1886 by Various | pdb html txt
3  | 22610 | Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various | pdb html txt
4  | 22612 | Punch, or the London Charivari, Vol. 150, January 26, 1916 by Various | pdb html txt
5  | 22611 | The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous | pdb html txt
6  | 22609 | The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell | pdb html txt
7  | 22619 | International Copyright by George Haven Putnam | pdb html txt
8  | 22614 | A Pavorosa Illusão by Manuel Maria Barbosa du Bocage | pdb html txt
9  | 22616 | Salve, Rei! by Camilo Castelo Branco | pdb html txt
10 | 22604 | Children and Their Books by James Hosmer Penniman | pdb html txt

In the above table, you can see that some elements are struck out. This is done with the following snippet of code:

# $1 (the etext number) and $splitguten (its directory path) come from an
# enclosing regex match over each RSS/Top-100 entry.
my %gutentypes = (
    plucker => {
        'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
        'content-type' => 'application/prs.plucker',
        'string'       => 'Plucker',
        'format'       => 'pdb'
    },
    html => {
        'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
        'content-type' => 'text/html',
        'string'       => 'Marked-up HTML',
        'format'       => 'html'
    },
    text => {
        'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
        'content-type' => 'text/plain',
        'string'       => 'Plain text',
        'format'       => 'txt'
    },
);

# HEAD each mirror; link the format if it answers 200, strike it out otherwise.
for my $types ( sort keys %gutentypes ) {
    my ($status, $type) = test_head($gutentypes{$types}{mirror});
    if ($status == 200) {
        $gutentypes{$types}{link} =
            qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
    }
    else {
        $gutentypes{$types}{link} = qq{<strike>$gutentypes{$types}{format}</strike>};
    }
}

sub test_head {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent('pps Plucker Perl Spider, v0.1.83 [rss]');

    my $request  = HTTP::Request->new(HEAD => $url);
    my $response = $ua->request($request);
    my $status   = $response->status_line;    # e.g. "200 OK"
    my $type     = $response->header('Content-Type');

    $status =~ m/(\d+)/;                       # pull out the numeric code
    return ($1, $type);
}

The number of items shown in the list is controlled by a $maximum scalar and an array slice: for my $line (@gutenbooks[0 .. $maximum-1]) {...}.

The more books I want to show, the longer the page takes to render, because I'm doing a HEAD request on every title 3 times (plucker, html, text), and linking or striking out each format accordingly.

If I display 15 titles, that's at least 45 HEAD requests I have to make. It happens in 2-5 seconds, depending on the latency to the mirror servers I'm pointing to, but it is still a delay. If one of those mirrors is not responding, the page load could hang indefinitely (or until the remote end or the user's browser times out).

I looked into using HTTP::Lite, HTTP::GHTTP, LWP::Simple and others to try to speed it up, but straight LWP::UserAgent was far-and-away the fastest (by about 3x), so I'm back to the drawing board.

I also looked into using LWP::Parallel::UserAgent and/or LWP::Parallel::ForkManager, but they're a bit more complex than I'd hoped (registering the links, then passing through a callback, etc.)

This was briefly discussed in the CB yesterday, and bart (I think; forgive me if I have the wrong monk) suggested that I just check HEAD every hour/day or at some interval, unrelated to the user's request of the same page, store the results in a database, and have my script always query the database instead of hitting the remote URLs directly every time my page is requested. He's right, up to a point... 45 or 100 or 200 database queries are MUCH faster than issuing three new HEAD requests for each title displayed.
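A rough sketch of that approach, assuming a small SQLite table and a standalone checker script run from cron (the script name, database file, and table layout here are all made up for illustration):

#!/usr/bin/perl
# check_links.pl (hypothetical): HEAD each mirror URL and record its status,
# so the page-rendering script only ever reads the local database.
use strict;
use warnings;
use DBI;
use LWP::UserAgent;

my $dbh = DBI->connect('dbi:SQLite:dbname=linkstatus.db', '', '',
                       { RaiseError => 1 });
$dbh->do(q{CREATE TABLE IF NOT EXISTS link_status
           (url TEXT PRIMARY KEY, status INTEGER, checked INTEGER)});

my $ua  = LWP::UserAgent->new( timeout => 10 );
my $upd = $dbh->prepare(
    q{INSERT OR REPLACE INTO link_status (url, status, checked) VALUES (?, ?, ?)});

# @ARGV stands in for the list of mirror URLs produced by the RSS/Top-100 parsing.
for my $url (@ARGV) {
    my $code = $ua->head($url)->code;    # just the numeric status: 200, 404, ...
    $upd->execute($url, $code, time);
}

The page script would then do one SELECT per title (or a single SELECT for the whole batch) instead of three HEAD requests.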

After thinking about this, it presents a few possible problems:

  1. If I check the links at 1am in a cron(1) job on the server-side, and a user visits the page at 7pm that night, the links may be down/invalid/changed/redirected.
  2. Coupling my script to system processes (i.e. a cron job) doesn't make it as clean and portable as I'd like, if I have to move it from system to system (it also doesn't easily allow me to move it to an upstream hosting provider where I may not have access to cron).

Another suggestion was that I use some AJAX glue, and let the end-user's browser figure out which links were dead or not, ONLY when they decide to click upon them.

This too, presents some problems:

  1. It limits the feature to those with a browser supporting JavaScript and having it enabled (a shrinking minority, from what I understand)
  2. It does not work in text-mode browsers or for web spiders
  3. Visually, there is no indication of which titles are or are not available in a given format.

Is there an easier way to do this, so the end-user experience is not so hampered?

Replies are listed 'Best First'.
Re: Speeding up/parallelizing hundreds of HEAD requests
by perrin (Chancellor) on Sep 16, 2007 at 18:15 UTC

    The easiest way to do the caching is to use Cache::FastMmap. It's very fast and you can set the timeout to be whatever you like.

    Caching isn't very useful if the requests vary widely, i.e. if people don't tend to request the same page again within your timeout period. In that case, you really would want to run these requests in parallel. Parallel::ForkManager probably is the easiest way to do this, but forking a hundred processes may hurt a bit. The non-blocking I/O approach is more complicated but easier on your system. That's what LWP::Parallel does. There are other implementations, like HTTP::Async. All of them are more complicated than vanilla LWP, and I think that's unavoidable.
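    For what it's worth, a minimal HTTP::Async sketch of the non-blocking approach (the URL list and the 10-second timeout are placeholders): it queues every HEAD request up front and collects the status codes as responses arrive.

    use strict;
    use warnings;
    use HTTP::Async;
    use HTTP::Request;

    my @urls  = @ARGV;                          # the mirror URLs to test
    my $async = HTTP::Async->new( timeout => 10 );

    my %url_for;                                # request id => URL, so results map back
    for my $url (@urls) {
        my $id = $async->add( HTTP::Request->new( HEAD => $url ) );
        $url_for{$id} = $url;
    }

    my %status;
    while ( $async->not_empty ) {
        my ( $response, $id ) = $async->wait_for_next_response;
        $status{ $url_for{$id} } = $response->code;
    }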

      I must have missed something in your reply... how exactly is a local shared-memory cache on an mmap'd file going to help me speed up external HEAD requests to dozens/hundreds of separate resources over HTTP?

      For access to my templates and local files, sure, I can see how this would help (but so do Memoize, HTML::Template::Compiled, and so on), but I'm not sure where this helps speed up remote requests.

        You mentioned caching the requests instead of actually doing them every time. This is a good way to do that. It's more efficient than a bunch of database queries, and makes it very easy to control timeouts on the cache, in case you only want to check every 2 hours or so.
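        A rough sketch of that, assuming a two-hour expiry (the cache file path and the test_head_cached() wrapper name are made up; it wraps the existing test_head()):

        use Cache::FastMmap;

        # One shared, mmap'd cache file; entries expire after two hours.
        my $cache = Cache::FastMmap->new(
            share_file  => '/tmp/guten-head-cache',
            expire_time => '2h',
        );

        # Hypothetical wrapper: only hit the mirror when the cached entry has expired.
        sub test_head_cached {
            my $url    = shift;
            my $cached = $cache->get($url);
            return @$cached if $cached;

            my @result = test_head($url);        # the existing LWP HEAD check
            $cache->set($url, \@result);
            return @result;
        }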
Re: Speeding up/parallelizing hundreds of HEAD requests
by NetWallah (Canon) on Sep 16, 2007 at 17:10 UTC
    This situation could be viewed as a caching issue, and addressed that way, without delving into the complexities of parallel processing.

    If the source data is relatively static, and the requests are "browsing" style, then you are potentially retrieving the same data periodically, and caching would help.

    The simplest would be to use the Memoize module. Minimally more work, but more accurate, would be to use Memoize::Expire.
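    A minimal sketch of the Memoize::Expire route, assuming a one-hour lifetime and caching only the status code (the head_status() wrapper is made up; memoizing the list-returning test_head() directly would need LIST_CACHE instead):

    use Memoize;
    use Memoize::Expire;

    # Hypothetical scalar wrapper around the existing test_head(), so the
    # plain SCALAR_CACHE is sufficient.
    sub head_status {
        my ($url) = @_;
        my ($status) = test_head($url);
        return $status;
    }

    tie my %head_cache => 'Memoize::Expire', LIFETIME => 3600;   # seconds
    memoize( 'head_status', SCALAR_CACHE => [ HASH => \%head_cache ] );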

         "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

Re: Speeding up/parallelizing hundreds of HEAD requests
by shmem (Chancellor) on Sep 16, 2007 at 16:51 UTC
    I must admit I haven't read all your node, but from the title only - have a look at POE.

    It contains examples of how to do this kind of task.
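    A rough, untested sketch of that shape with POE::Component::Client::HTTP (the URL list, alias, and timeout are placeholders): every HEAD request is posted at once, and the answers come back as events.

    use strict;
    use warnings;
    use POE qw(Component::Client::HTTP);
    use HTTP::Request;

    my @urls    = @ARGV;                   # mirror URLs to check
    my $pending = @urls;
    my %status;

    # One non-blocking HTTP client component shared by all requests.
    POE::Component::Client::HTTP->spawn( Alias => 'ua', Timeout => 10 );

    POE::Session->create(
        inline_states => {
            _start => sub {
                $_[KERNEL]->post( ua => request => 'got_response',
                                  HTTP::Request->new( HEAD => $_ ) ) for @urls;
            },
            got_response => sub {
                my ( $kernel, $request_packet, $response_packet ) =
                    @_[ KERNEL, ARG0, ARG1 ];
                my $request  = $request_packet->[0];
                my $response = $response_packet->[0];
                $status{ $request->uri } = $response->code;
                $kernel->post( ua => 'shutdown' ) unless --$pending;
            },
        },
    );

    POE::Kernel->run();    # returns once every response is in and the client shuts down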

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Speeding up/parallelizing hundreds of HEAD requests
by aquarium (Curate) on Sep 17, 2007 at 04:56 UTC
    A fairly easy-to-implement cache that does not rely on extra code is to use Squid: set up Squid properly and configure your LWP requests to use Squid as the proxy.
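    For reference, pointing LWP at a local Squid instance is a one-line change in the user agent setup (this assumes Squid on its default port 3128):

    my $ua = LWP::UserAgent->new;
    $ua->proxy( 'http', 'http://localhost:3128/' );   # route HTTP requests through Squid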
    Also, if the pdb format is always available, you could provide the other formats with your own software, if it doesn't crush the server with too many requests.
    Another idea: write the file links as dynamic JavaScript that does a HEAD request and crosses out unavailable formats. This shifts the connections to the client, so if their search returned lots of links, it will be up to their machine to resolve the availability of the file formats. This also makes certain that the links are truly available from the client, not just from your server.
    the hardest line to type correctly is: stty erase ^H

      Unfortunately, the latest versions of Squid are not SMP-aware (as stated by their core developers), and running it in front of Apache 2 yields a significant performance decrease.

      I did a lot of thorough testing on this exact point. I've run Squid in front of Apache 1.3.x for years, and found roughly a 400% improvement in request response times on a uniprocessor machine.

      When I moved to Apache 2 on a dual-core SMP machine, I tested Squid in front of Apache 2.x and found that my request response performance dropped 75% compared to Apache 2.x running natively on port 80. Apache can spread its work across multiple cores, but Squid cannot.

      I do, however, have an internal Squid server running on my BSD machine, through which ALL outbound traffic on port 80 is transparently redirected (by some iptables rules at the router), so my HEAD requests are already going through it. I don't see any significant increase or decrease in performance when enabling or disabling that capability.

      It is an interesting idea, but I don't think it applies to this specific problem.

Re: Speeding up/parallelizing hundreds of HEAD requests
by eric256 (Parson) on Sep 17, 2007 at 20:17 UTC

    Instead of caching on a cron basis, cache on a request basis. The first person's visit will be slow, but for the next X hours everyone will be fast; then one user is slow again. If the links aren't always the same, then you might even get to distribute that load over multiple users.

    So then the flow is: check the database for a link; if it isn't there, request it now and store the status of the links. If it is there, check its expiration; if it's expired, fetch it now. Then use the data in the database to render your page. (The database can be anything persistent between different connections to the web server.)
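    A rough sketch of that flow, assuming a DBI handle $dbh and a link_status table with url/status/checked columns (those names, and head_status_cached(), are made up; INSERT OR REPLACE is SQLite syntax):

    # Return a possibly-cached HTTP status for $url, refreshing it on demand.
    sub head_status_cached {
        my ( $dbh, $url, $max_age ) = @_;
        $max_age ||= 3600;                               # expire entries after an hour

        my ( $status, $checked ) = $dbh->selectrow_array(
            'SELECT status, checked FROM link_status WHERE url = ?', undef, $url );

        if ( !defined $status || time - $checked > $max_age ) {
            ($status) = test_head($url);                 # the existing HEAD check
            $dbh->do( 'INSERT OR REPLACE INTO link_status (url, status, checked)
                       VALUES (?, ?, ?)', undef, $url, $status, time );
        }
        return $status;
    }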


    ___________
    Eric Hodges
Re: Speeding up/parallelizing hundreds of HEAD requests
by hacker (Priest) on Sep 17, 2007 at 20:30 UTC

    I'm taking another approach to this problem... based on the comments from theorbtwo. The current code looks like this:

    sub gimme_guten_tables {
        my ($decoded, $maximum) = @_;

        # Flatten the RSS markup down to one <a href="...">Title</a> line per book.
        $decoded =~ s,<li>\n(.*?)\n</li>,$1,g;
        $decoded =~ s,(.*?)<br><description>.*?</description>,$1,g;
        $decoded =~ s,<ul>(.*?)</ul>,$1,g;
        $decoded =~ s,<li>(.*?)</li>,$1,g;
        $decoded =~ s,<\/?ol>,,g;
        $decoded =~ s,<html xmlns:rss="http://purl.org/rss/1.0/"><body><ul>,,;
        $decoded =~ s,</ul></body></html>\n.*,,;
        $decoded =~ s,^\n<a,<a,g;

        my @gutenbooks = ($decoded =~ /([^\r\n]+)(?:[\r\n]{1,2}|$)/sg);

        my $guten_tables;
        my $count = 1;

        for my $line (@gutenbooks[0 .. $maximum-1]) {
            # $1 = etext number, $2 = title, $3 = optional count in parentheses
            if ($line && $line =~ m/href=".+\/(\d+)">(.*?)(?: \((\d+)\))?<\/a>/) {
                my $splitguten = join('/', split(/ */, $1));
                my $clipguten  = substr($splitguten, -2, 2, '');
                my $readmarks  = $3 ? $3 : $1;
                my $title      = $2;
                $title =~ s,by (.*?)</a>,</a> by $1,g;

                my %gutentypes = (
                    plucker => {
                        'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
                        'content-type' => 'application/prs.plucker',
                        'string'       => 'Plucker',
                        'format'       => 'pdb'
                    },
                    html => {
                        'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
                        'content-type' => 'text/html',
                        'string'       => 'Marked-up HTML',
                        'format'       => 'html'
                    },
                    text => {
                        'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
                        'content-type' => 'text/plain',
                        'string'       => 'Plain text',
                        'format'       => 'txt'
                    },
                );

                # One HEAD request per format: link it on 200, strike it out otherwise.
                for my $types ( sort keys %gutentypes ) {
                    my ($status, $type) = test_head($gutentypes{$types}{mirror});
                    if ($status == 200) {
                        $gutentypes{$types}{link} =
                            qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
                    }
                    else {
                        $gutentypes{$types}{link} = qq{<s>$gutentypes{$types}{format}</s>};
                    }
                }

                $guten_tables .= qq{<tr>
                    <td width="40" align="center">$count</td>
                    <td width="40" align="right">$readmarks</td>
                    <td width="500">
                        <a href="http://www.gutenberg.org/etext/$1">$title</a>
                    </td>
                    <td align="center">$gutentypes{plucker}{link}</td>
                    <td align="center">$gutentypes{html}{link}</td>
                    <td align="center">$gutentypes{text}{link}</td>
                </tr>\n};

                $count++;
            }
        }

        $guten_tables =~ s,\&,\&amp;,g;
        $guten_tables =~ s,>\n\s+<,><,g;

        return $guten_tables;
    }

    sub test_head {
        my $url = shift;

        my $ua = LWP::UserAgent->new();
        $ua->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1;) Firefox/2.0.0.6');

        my $request  = HTTP::Request->new(HEAD => $url);
        my $response = $ua->request($request);
        my $status   = $response->status_line;    # e.g. "200 OK"
        my $type     = $response->header('Content-Type');

        $status =~ m/(\d+)/;                       # pull out the numeric code
        return ($1, $type);
    }

    In this code, I'm taking an array, @gutenbooks, splitting out the etext id ($1) and the etext title ($2), and creating a hash of the 3 different formats of that work (pdb, html, txt).

    For each link I create, I pass it through test_head(), and check to see if it returns a '200' status or not. If the link is a '200' (i.e. exists, and is valid), I create a clickable link to it. If the link is NOT '200', then I don't link to it (i.e. I don't create a link that the user can click, to get a 404 or missing document).

    What I'd like to implement is a way to take all of the links at once, pass them into some sub, parallelize the HEAD checks across them, and return answers based on those checks.

    But here is where I'm stuck...

    1. How do I take the individual URLs coming out of my match, and build a hash of them?
    2. How do I then pass that hash to "something" which can check their validity (in some arbitrary order)?
    3. How do I keep track of the responses returned from that check, maintaining integrity, so I can link or strike out the corresponding entry in the table I'm outputting?

    I have no experience with LWP::Parallel, LWP::ParallelUA, LWP::Parallel::ForkManager and the like (passing references, callbacks, etc.)

    Can some monk give me a strong nudge in the right direction?

    The docs for these modules assume I am just statically defining the URLs I want to check... and I can't do that; everything will be coming out of a dynamic, ever-changing array.
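    One way to do this without per-link callbacks is to collect every mirror URL first, check them all in one parallel pass, and only then build the table from the results. A rough sketch with Parallel::ForkManager (check_heads_parallel() is a made-up name; it reuses the existing test_head()):

    use Parallel::ForkManager;

    # Check all mirror URLs in parallel; returns a hash of url => 1 for every
    # link that did NOT answer 200, so the table code can strike it out.
    sub check_heads_parallel {
        my @urls = @_;
        my %dead;

        my $pm = Parallel::ForkManager->new(10);         # at most 10 children at once
        $pm->run_on_finish( sub {
            my ( $pid, $exit_code, $url ) = @_;          # $url was passed to start()
            $dead{$url} = 1 if $exit_code;               # non-zero exit means "not 200"
        } );

        for my $url (@urls) {
            $pm->start($url) and next;                   # parent: queue the next URL
            my ($status) = test_head($url);              # child: do the HEAD check
            $pm->finish( ( $status && $status == 200 ) ? 0 : 1 );
        }
        $pm->wait_all_children;

        return %dead;
    }

    In the table loop, each test_head() call would then become a lookup such as $dead{ $gutentypes{$types}{mirror} }.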

    Thanks.

      You could add two lines to your code above to achieve your goal.

      ... async{ ... } ...

      Of course, a complete solution would add a few more lines to time out slow or absent mirrors, and a couple (2 or 3) more to share the results of the asynchronous calls with the main thread of the code.
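      For what it's worth, a rough, untested sketch of that shape (it reuses the existing test_head() and a shared hash to get the results back to the main thread; it does not yet time out slow mirrors):

      use threads;
      use threads::shared;

      my @urls = @ARGV;              # mirror URLs to check
      my %status :shared;            # url => HTTP status code, filled in by the threads

      # One thread per URL; each does its own HEAD check and records the result.
      my @workers = map {
          my $url = $_;
          async {
              my ($code) = test_head($url);
              lock %status;
              $status{$url} = $code;
          };
      } @urls;

      $_->join for @workers;         # wait for all of them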

      The total absence of the word "threads" from your question and the responses suggests that you will not consider such a solution... and I've gotten out of the habit of expending time producing and testing solutions that will likely simply be ignored. But for the problem you are trying to solve, threads is the simplest, fastest, easiest-to-understand solution.

      It is also the case that I am not currently in a position to offer a tested solution, and unfortunate that even those here who do not dismiss threads as a viable solution rarely seem to offer code.

      C'est la vie.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.