hacker has asked for the wisdom of the Perl Monks concerning the following question:
I've been working on a replacement website for Plucker, and one of its features is a live pull of some of the Project Gutenberg etexts from their today.rss feed and their Top 100 list of electronic texts.
So far, this works great. I've even worked out a lightweight caching mechanism that only queries the upstream data when it has changed.
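A caching mechanism like that can be as simple as a conditional fetch. Here is a minimal sketch of one way to do it (the feed URL and cache path are my assumptions, not the actual code): LWP's mirror() sends an If-Modified-Since header based on the local file's mtime, so the feed is only re-downloaded when the server says it has changed.

```perl
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Assumed feed URL and cache path -- illustrative only.
my $res = $ua->mirror('http://www.gutenberg.org/feeds/today.rss',
                      '/var/cache/pps/today.rss');

if ($res->code == 304) {
    # 304 Not Modified: upstream unchanged, reuse the cached copy.
}
```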
From this data, I build an HTML table that links to several versions of the etext, for our users. That data looks like this:
| Place | Etext # | Book Title | Download as... | | |
|---|---|---|---|---|---|
| 1 | 22617 | Chambers's Edinburgh Journal, No. 454 by Various | pdb | html | ~~txt~~ |
| 2 | 22621 | The New England Magazine, Volume 1, No. 1, January 1886 by Various | pdb | html | ~~txt~~ |
| 3 | 22610 | Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various | pdb | html | ~~txt~~ |
| 4 | 22612 | Punch, or the London Charivari, Vol. 150, January 26, 1916 by Various | pdb | html | ~~txt~~ |
| 5 | 22611 | The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous | pdb | html | ~~txt~~ |
| 6 | 22609 | The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell | pdb | html | ~~txt~~ |
| 7 | 22619 | International Copyright by George Haven Putnam | pdb | html | ~~txt~~ |
| 8 | 22614 | A Pavorosa Illusão by Manuel Maria Barbosa du Bocage | pdb | ~~html~~ | ~~txt~~ |
| 9 | 22616 | Salve, Rei! by Camilo Castelo Branco | pdb | ~~html~~ | ~~txt~~ |
| 10 | 22604 | Children and Their Books by James Hosmer Penniman | pdb | html | ~~txt~~ |
In the above table, you can see that some elements are struck out. This is done with the following snippet of code:
```perl
use LWP::UserAgent;
use HTTP::Request;

# $1 holds the etext number captured from the feed earlier in the
# script, and $splitguten is the "split" directory path derived from it.
my %gutentypes = (
    plucker => {
        'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
        'content-type' => 'application/prs.plucker',
        'string'       => 'Plucker',
        'format'       => 'pdb',
    },
    html => {
        'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
        'content-type' => 'text/html',
        'string'       => 'Marked-up HTML',
        'format'       => 'html',
    },
    text => {
        'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
        'content-type' => 'text/plain',
        'string'       => 'Plain text',
        'format'       => 'txt',
    },
);

# HEAD each mirror: live formats become links, dead ones get struck out.
for my $types (sort keys %gutentypes) {
    my ($status, $type) = test_head($gutentypes{$types}{mirror});
    if ($status == 200) {
        $gutentypes{$types}{link} =
            qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
    }
    else {
        $gutentypes{$types}{link} =
            qq{<strike>$gutentypes{$types}{format}</strike>};
    }
}

# Issue a HEAD request and return the numeric status plus content type.
sub test_head {
    my $url = shift;
    my $ua  = LWP::UserAgent->new;
    $ua->agent('pps Plucker Perl Spider, v0.1.83 [rss]');
    my $request  = HTTP::Request->new(HEAD => $url);
    my $response = $ua->request($request);
    my $status   = $response->status_line;
    my $type     = $response->header('Content-Type');
    $status =~ m/(\d+)/;
    return ($1, $type);
}
```
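As an aside, HTTP::Response exposes the numeric status directly via its code() method, so the regex over status_line (and the risk of reusing a stale $1 if it ever fails to match) can be avoided:

```perl
# Equivalent to the regex capture, minus the unchecked match:
return ($response->code, $type);
```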
The number of items shown in the list is controlled by a scalar I set as the maximum, plus an array slice: for my $line (@gutenbooks[0 .. $maximum-1]) {...}.
The more books I want to show, the longer the page takes to draw, because I'm making three HEAD requests per title (plucker, html, text) and linking or striking out each format accordingly.
If I display 15 titles, that's at least 45 HEAD requests I have to make. It finishes in 2-5 seconds, depending on the latency to the mirror servers I'm pointing at, but it is still a delay. And if one of those mirrors is not responding, the page load could hang indefinitely (or until the remote end or the user's browser times out).
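Independent of any redesign, that worst case can at least be bounded by putting a hard timeout on the user agent; a one-line tweak to the code above (the 5-second budget is an arbitrary figure of mine):

```perl
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(5);   # give up on any single request after 5s instead of hanging
```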
I looked into using HTTP::Lite, HTTP::GHTTP, LWP::Simple and others to try to speed things up, but straight LWP::UserAgent was far-and-away the fastest (by about 3x), so I'm back to the drawing board.
I also looked into using LWP::Parallel::UserAgent and/or Parallel::ForkManager, but they're a bit more complex than I'd hoped (registering the links, then passing through a callback, etc.).
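For anyone curious, the registration dance in LWP::Parallel::UserAgent looks roughly like this (a sketch adapted from the module's synopsis; @mirror_urls is a stand-in for the real list, and no per-request callback is required if the responses are simply harvested after wait()):

```perl
use LWP::Parallel::UserAgent;
use HTTP::Request;

# Hypothetical list: one mirror URL per title and format.
my @mirror_urls = ('http://www.gutenberg.org/cache/plucker/22617/22617');

my $pua = LWP::Parallel::UserAgent->new;
$pua->timeout(5);       # per-connection timeout (inherited from LWP::UserAgent)
$pua->max_hosts(10);    # how many hosts to talk to in parallel

# Register one HEAD request per URL; register() returns an error
# response if the request could not be queued.
for my $url (@mirror_urls) {
    if (my $err = $pua->register(HTTP::Request->new(HEAD => $url))) {
        warn $err->error_as_HTML;
    }
}

# Block until everything completes (or times out), then read the results.
my $entries = $pua->wait;
for my $entry (values %$entries) {
    my $res = $entry->response;
    printf "%s => %s\n", $res->request->url, $res->code;
}
```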
This was briefly discussed in the CB yesterday, and bart (I think; forgive me if I have the wrong monk) suggested that I check HEAD every hour or day, or at some other interval unrelated to the user's request of the page, store the results in a database, and have my script always query the database instead of hitting the remote URLs directly every time the page is requested. He's right, to a point... 45 or 100 or 200 database queries are MUCH faster than issuing three fresh HEAD requests for each title displayed.
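If the database route wins, the cache can be a single table keyed on URL. A sketch with DBI and DBD::SQLite (the table name and schema are mine; any database would do):

```perl
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=linkcache.db', '', '',
                       { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS link_status (
        url     TEXT PRIMARY KEY,
        status  INTEGER,            -- last HTTP status seen
        checked INTEGER             -- epoch seconds of the last probe
    )
});

# The periodic checker records each probe...
my $upsert = $dbh->prepare(
    q{INSERT OR REPLACE INTO link_status (url, status, checked)
      VALUES (?, ?, ?)});
for my $url (@mirror_urls) {          # hypothetical URL list, as above
    my ($status) = test_head($url);   # test_head() from the snippet above
    $upsert->execute($url, $status, time);
}

# ...and the page script only ever reads, which is cheap.
my ($cached) = $dbh->selectrow_array(
    q{SELECT status FROM link_status WHERE url = ?}, undef, $url);
```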
Thinking it through, this approach presents a few possible problems:
- If I check the links at 1am in a cron(1) job on the server side and a user visits the page at 7pm that night, the links may have gone down, or become invalid/changed/redirected, in the meantime.
- Coupling my script to system processes (i.e. a cron job) doesn't make it as clean and portable as I'd like if I have to move it from system to system (it also doesn't easily let me move it to an upstream hosting provider where I may not have access to cron). One possible compromise is sketched just after this list.
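That compromise, and it's only a thought of mine, not something from the CB: keep the database, but let the page script itself re-probe any row older than some TTL. That bounds staleness without any cron dependency at all, at the cost of an occasional slower request. A sketch, reusing the hypothetical link_status table from above:

```perl
my $ttl = 6 * 3600;   # assumed: consider a probe stale after 6 hours

my ($status, $checked) = $dbh->selectrow_array(
    q{SELECT status, checked FROM link_status WHERE url = ?}, undef, $url);

# Re-probe in-band only when the cached answer has gone stale.
if (!defined $checked or time - $checked > $ttl) {
    ($status) = test_head($url);
    $upsert->execute($url, $status, time);
}
```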
Another suggestion was to use some AJAX glue and let the end-user's browser figure out which links were dead, but only when the user actually clicks on them.
This, too, presents some problems:
- It limits the feature to those whose browsers support JavaScript and have it enabled (users without it are, from what I understand, a shrinking minority)
- It does not work in text-mode browsers or for web spiders
- Visually, there is no up-front indication of which formats are available for a given title.
Is there an easier way to do this, so the end-user experience is not so hampered?