perlquestion
hacker
<p>I've been working on a replacement website for <a href="http://www.plkr.org/">Plucker</a>, and one of its features is a live pull of some Project Gutenberg etexts from their <a href="http://www.gutenberg.org/feeds/today.rss">today.rss</a> feed and their <a href="http://www.gutenberg.org/browse/scores/top">Top 100</a> list of electronic texts.
<p>So far, this works great. I've even worked out a slight caching mechanism to only query the upstream data when it has changed.
<p>From this data, I build an HTML table that links to several versions of the etext, for our users. That data looks like this:
<p><table border="1" style="border-collapse: collapse;
border-color: #ccc;"> <tr align="center"><th class="guten">Place</th><th class="guten">Etext #</th><th class="guten">Book Title</th><th colspan="3" class="guten">Download as...</th></tr> <tr><td width="40" align="center" class="guten">1</td> <td width="40" align="right" class="guten">22617</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22617" title="Chambers's Edinburgh Journal, No. 454 by Various from Project Gutenberg">Chambers's Edinburgh Journal, No. 454 by Various</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22617/22617" title="Download a Plucker version of Chambers's Edinburgh Journal, No. 454 by Various">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/1/22617/22617-h/22617-h.htm" title="Download a Marked-up HTML version of Chambers's Edinburgh Journal, No. 454 by Various">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">2</td> <td width="40" align="right" class="guten">22621</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22621" title="The New England Magazine, Volume 1, No. 1, January 1886 by Various from Project Gutenberg">The New England Magazine, Volume 1, No. 1, January 1886 by Various</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22621/22621" title="Download a Plucker version of The New England Magazine, Volume 1, No. 1, January 1886 by Various">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/2/22621/22621-h/22621-h.htm" title="Download a Marked-up HTML version of The New England Magazine, Volume 1, No. 
1, January 1886 by Various">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">3</td> <td width="40" align="right" class="guten">22610</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22610" title="Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various from Project Gutenberg">Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22610/22610" title="Download a Plucker version of Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/1/22610/22610-h/22610-h.htm" title="Download a Marked-up HTML version of Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">4</td> <td width="40" align="right" class="guten">22612</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22612" title="Punch, or the London Charivari, Vol. 150, January 26, 1916 by Various from Project Gutenberg">Punch, or the London Charivari, Vol. 150, January 26, 1916 by Various</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22612/22612" title="Download a Plucker version of Punch, or the London Charivari, Vol. 150, January 26, 1916 by Various">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/1/22612/22612-h/22612-h.htm" title="Download a Marked-up HTML version of Punch, or the London Charivari, Vol. 
150, January 26, 1916 by Various">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">5</td> <td width="40" align="right" class="guten">22611</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22611" title="The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous from Project Gutenberg">The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22611/22611" title="Download a Plucker version of The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/1/22611/22611-h/22611-h.htm" title="Download a Marked-up HTML version of The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">6</td> <td width="40" align="right" class="guten">22609</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22609" title="The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell from Project Gutenberg">The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22609/22609" title="Download a Plucker version of The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/0/22609/22609-h/22609-h.htm" title="Download a Marked-up HTML version of The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td 
width="40" align="center" class="guten">7</td> <td width="40" align="right" class="guten">22619</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22619" title="International Copyright by George Haven Putnam from Project Gutenberg">International Copyright by George Haven Putnam</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22619/22619" title="Download a Plucker version of International Copyright by George Haven Putnam">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/1/22619/22619-h/22619-h.htm" title="Download a Marked-up HTML version of International Copyright by George Haven Putnam">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">8</td> <td width="40" align="right" class="guten">22614</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22614" title="A Pavorosa Illusão by Manuel Maria Barbosa du Bocage">A Pavorosa Illusão by Manuel Maria Barbosa du Bocage</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22614/22614" title="Download a Plucker version of A Pavorosa Illusão by Manuel Maria Barbosa du Bocage">pdb</a> </td><td class="guten" align="center"><strike>html</strike></td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">9</td> <td width="40" align="right" class="guten">22616</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22616" title="Salve, Rei! by Camilo Castelo Branco from Project Gutenberg">Salve, Rei! by Camilo Castelo Branco</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22616/22616" title="Download a Plucker version of Salve, Rei! 
by Camilo Castelo Branco">pdb</a> </td><td class="guten" align="center"><strike>html</strike></td><td class="guten" align="center"><strike>txt</strike></td></tr> <tr><td width="40" align="center" class="guten">10</td> <td width="40" align="right" class="guten">22604</td><td width="500" class="guten"><a href="http://www.gutenberg.org/etext/22604" title="Children and Their Books by James Hosmer Penniman from Project Gutenberg">Children and Their Books by James Hosmer Penniman</a></td><td align="center" class="guten"><a href="http://www.gutenberg.org/cache/plucker/22604/22604" title="Download a Plucker version of Children and Their Books by James Hosmer Penniman">pdb</a> </td><td class="guten" align="center"><a href="http://www.gutenberg.org/dirs/2/2/6/0/22604/22604-h/22604-h.htm" title="Download a Marked-up HTML version of Children and Their Books by James Hosmer Penniman">html</a> </td><td class="guten" align="center"><strike>txt</strike></td></tr> </table>
<p>In the above table, you can see that some elements are <strike>struck out</strike>. This is done with the following snippet of code:
<code>
use LWP::UserAgent;

# $1 (the etext number) and $splitguten (a directory path built from its
# digits) are set by earlier code that is not shown here.
my %gutentypes = (
    plucker => {
        'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
        'content-type' => 'application/prs.plucker',
        'string'       => 'Plucker',
        'format'       => 'pdb'
    },
    html => {
        'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
        'content-type' => 'text/html',
        'string'       => 'Marked-up HTML',
        'format'       => 'html'
    },
    text => {
        'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
        'content-type' => 'text/plain',
        'string'       => 'Plain text',
        'format'       => 'txt'
    },
);

# Probe each format's URL: link it if the mirror answers 200, strike it out otherwise.
for my $types ( sort keys %gutentypes ) {
    my ($status, $type) = test_head($gutentypes{$types}{mirror});
    if ($status == 200) {
        $gutentypes{$types}{link} = qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
    } else {
        $gutentypes{$types}{link} = qq{<strike>$gutentypes{$types}{format}</strike>};
    }
}

# Issue a HEAD request and return the numeric status code and the Content-Type.
sub test_head {
    my $url = shift;
    my $ua  = LWP::UserAgent->new;
    $ua->agent('pps Plucker Perl Spider, v0.1.83 [rss]');
    my $response = $ua->head($url);
    # code() gives the numeric status directly; no need to parse status_line
    return ($response->code, $response->header('Content-Type'));
}</code>
<p>The number of items shown in the list is controlled by a scalar I set for the maximum and an array slice: <code>for my $line (@gutenbooks[0 .. $maximum-1]) {...}</code>.
<p>The more books I want to show, the longer it takes for the page to draw, because I'm doing a HEAD request on every title 3 times (plucker, html, text), and linking/striking-out accordingly.
<p>If I display 15 titles, that's at least 45 HEAD requests I have to make. This takes roughly 2-5 seconds, depending on the latency to the mirror servers I'm pointing to, but it is still a delay. If one of those mirrors is not responding, the page load could take forever (or until the remote end or the user's browser times out).
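<p>A minimal sketch of one way to put an upper bound on that worst case, assuming a single shared [cpan://LWP::UserAgent] object and a 5-second limit (the limit and the helper name <code>head_status</code> are placeholders picked for illustration, not part of the script above). LWP's <code>timeout</code> applies to periods of inactivity on the connection rather than to the whole request, so this caps the delay per stalled mirror, not the total page time:
<code>
use strict;
use warnings;
use LWP::UserAgent;

# One shared agent, so the timeout and agent string are set in one place.
# The 5-second value is an arbitrary choice for illustration.
my $ua = LWP::UserAgent->new(
    agent   => 'pps Plucker Perl Spider, v0.1.83 [rss]',
    timeout => 5,
);

# Hypothetical helper: returns the numeric status code of a HEAD request.
# When LWP cannot connect or the timeout fires, it hands back an
# internally generated 500 response, so the caller still sees a non-200 code.
sub head_status {
    my ($url) = @_;
    return $ua->head($url)->code;
}
</code>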
<p>I looked into using [cpan://HTTP::Lite], [cpan://HTTP::GHTTP], [cpan://LWP::Simple] and others to try to speed it up, but straight [cpan://LWP::UserAgent] was far and away the fastest (by about 3x), so I'm back to the drawing board.
<p>I also looked into using [cpan://LWP::Parallel::UserAgent] and/or [cpan://Parallel::ForkManager], but they're a bit more complex than I'd hoped (registering the links, then passing through a callback, etc.).
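<p>For comparison, here is a rough sketch of what the forking approach might look like with [cpan://Parallel::ForkManager], assuming each child only needs to report "200 or not" through its exit code. The name <code>check_parallel</code>, the default of 10 children and the pluggable <code>$checker</code> callback are all placeholders for illustration:
<code>
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical helper: run $checker->($url) for every URL in its own child
# process and return a hash of url => 1 (status 200) or 0 (anything else).
sub check_parallel {
    my ($urls, $checker, $max_procs) = @_;
    my %ok;
    my $pm = Parallel::ForkManager->new($max_procs || 10);

    # Runs in the parent each time a child exits; $ident is the URL that
    # was passed to start(), $exit_code is what the child gave to finish().
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident) = @_;
        $ok{$ident} = $exit_code == 0 ? 1 : 0;
    });

    for my $url (@$urls) {
        $pm->start($url) and next;          # parent: move on to the next URL
        my $status = $checker->($url);      # child: do the slow check
        $pm->finish($status == 200 ? 0 : 1);
    }
    $pm->wait_all_children;
    return %ok;
}
</code>
<p>With a sub like <code>test_head</code>, the call would be something like <code>my %ok = check_parallel(\@urls, sub { (test_head($_[0]))[0] }, 10);</code>. The cost is one fork per URL, so whether this actually beats sequential requests would need measuring.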
<p>This was briefly discussed in the CB yesterday, and [bart] (I think; forgive me if I have the wrong monk) suggested that I check HEAD at some interval (every hour or day), independent of any user's request for the page, store the results in a database, and have my script always query the database instead of hitting the remote URLs directly every time the page is requested. He's right to a point... 45 or 100 or 200 database queries are MUCH faster than issuing a new HEAD request 3 times for each title displayed.
<p>After thinking about this, it presents a few possible problems:
<ol>
<li>If I check the links at 1am in a cron(1) job on the server-side, and a user visits the page at 7pm that night, the links may be down/invalid/changed/redirected.
<li>Coupling my script to system processes (i.e. a cron job) doesn't make it as clean and portable as I'd like if I have to move it from system to system (it also doesn't easily allow me to move it to an upstream hosting provider where I may not have access to cron).
</ol>
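<p>One way around both points might be to do the caching inside the script itself: store each result with a timestamp and re-check a link only when its entry is older than some limit, at the moment a page request needs it. A minimal sketch, assuming the core [cpan://Storable] module for persistence (the cache file path, the age limit and the names <code>cached_status</code> and <code>$checker</code> are placeholders, and locking between concurrent requests is left out):
<code>
use strict;
use warnings;
use Storable qw(retrieve nstore);

# Hypothetical helper: return the cached status for $url if it is younger
# than $max_age seconds; otherwise call $checker->($url), store the result
# together with the current time, and return it.
sub cached_status {
    my ($file, $url, $max_age, $checker) = @_;
    my $cache = -e $file ? retrieve($file) : {};

    my $entry = $cache->{$url};
    if ($entry && time - $entry->{checked} < $max_age) {
        return $entry->{status};            # fresh enough, no HEAD request
    }

    my $status = $checker->($url);          # stale or missing: check now
    $cache->{$url} = { status => $status, checked => time };
    nstore($cache, $file);
    return $status;
}
</code>
<p>A call might look like <code>my $status = cached_status($cache_file, $url, 3600, sub { (test_head($_[0]))[0] });</code>. Only the first visitor after an entry expires pays for the HEAD request, and no cron job is involved, though that one visitor still sees the full delay.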
<p>Another suggestion was that I use some AJAX glue, and let the end-user's browser figure out which links were dead or not, ONLY when they decide to click upon them.
<p>This, too, presents some problems:
<ol>
<li>It limits the feature to users whose browser supports JavaScript and has it enabled (from what I understand, a shrinking minority).
<li>It does not work in text-mode browsers or for web spiders.
<li>Visually, there is no indication of which titles are available in which format.
</ol>
<p>Is there an easier way to do this, so the end-user experience is not so hampered?