PerlMonks  

Laziness through CPAN: Screen-scrape to RSS with 3 Modules

by crashtest (Curate)
on Dec 21, 2009 at 02:44 UTC (#813661, CUFP)

I'm not sure this is a cool use for Perl, but once again, I am astounded by how easy the easy things really are. The script below is one of those where you're almost surprised when you're done writing it. "That's it?", you ask yourself. Yes, that's it.

Here's the background: there's a certain trail race that I'd like to run, but there are always more applicants than slots, so the organizers have resorted to a lottery system to pick entrants this year. Unfortunately, I didn't get in, but I am in the top 25 on the wait list.

The lottery winners have until midnight tonight to pay their entry fee - otherwise the wait-listed people move into their slots. On the lottery page, it is clearly indicated who has, and who hasn't, paid their entry fee yet. Now I could obsessively sit at my computer, refresh the page every five minutes and count the "Not Paid" entrants... or I could be obsessive and lazy, and enlist Perl for help.

With just three use directives, I'm in business:

    use LWP::UserAgent;
    use HTML::TableParser;
    use XML::RSS;

And now, in 50 non-optimized lines, I can easily write a script that screen-scrapes the web page (using LWP::UserAgent), counts the people who've paid and those who haven't (via HTML::TableParser), then prints a simple RSS file (with XML::RSS) to a web-accessible spot that I've added to my news reader (Google Reader).

The script is scheduled via cron. Since I can check my news reader on my phone, I am free to walk around, eat dinner etc. while tracking something I have absolutely no control over. Perfect!

I've done something like this before, in order to track the waiver wire in a fantasy league. But I am struck by how easy this really was, and how worthwhile it is even though I can put this script in the trash after midnight.

I've also thought that this basic process - scrape -> parse -> post - can be implemented in thousands of ways using many other tools and technologies. Have other monks done similar things in the past? How would you have approached my problem?

    use strict;
    use warnings;

    use LWP::UserAgent;
    use HTML::TableParser;
    use XML::RSS;

    use constant ENTRANTS_PAGE => 'http://www.example.com/lotteryentr?eventid=1221';
    use constant RSS_FILE      => '/var/www/html/lottery_entrants.xml';

    # Scrape: fetch the lottery page and bail out on any HTTP error.
    my $response = LWP::UserAgent->new()->get(ENTRANTS_PAGE);
    die $response->status_line() unless ($response->is_success());

    # Parse: match the table that has a "Paid" column and tally the
    # fourth cell of every data row.
    my ($paid, $not_paid) = (0, 0);
    my $p = HTML::TableParser->new(
        [
            {
                cols => 'Paid',
                row  => sub {
                    my $status = $_[2]->[3];
                    if    ($status eq 'Paid')     { $paid++ }
                    elsif ($status eq 'Not Paid') { $not_paid++ }
                },
            }
        ],
        { Decode => 1, Trim => 1, Chomp => 1 }
    );
    $p->parse($response->content());

    # Post: append an item to the existing feed, or start a new one
    # if the file is missing or empty.
    my $now = localtime;
    my $rss;
    if (-s RSS_FILE) {
        $rss = XML::RSS->new();
        $rss->parsefile(RSS_FILE);
    }
    else {
        $rss = XML::RSS->new( version => '2.0' );
        $rss->channel(
            title   => 'Lottery Entrants',
            pubDate => $now,
            syn     => {
                updatePeriod    => "hourly",
                updateFrequency => "3",
                updateBase      => "1901-01-01T00:00+00:00",
            },
        );
    }
    $rss->add_item(
        title       => "Entrants at $now",
        description => "$paid have paid, $not_paid haven't",
    );
    $rss->save(RSS_FILE);
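
One detail worth a note: the script passes the string from scalar localtime (something like "Mon Dec 21 02:44:00 2009") as pubDate, while RSS 2.0 expects dates in RFC 822 form. Most readers are forgiving, but a stricter one might not be. Below is a minimal sketch of a formatter using only the core POSIX module. The helper name rfc822_date is hypothetical (it is not part of XML::RSS), and always reporting GMT is an assumption made to keep it simple.

```perl
use strict;
use warnings;
use POSIX qw(strftime setlocale LC_TIME);

# Hypothetical helper (not part of XML::RSS): format an epoch time as an
# RFC 822 style date in GMT, the form RSS 2.0 expects for pubDate.
sub rfc822_date {
    my ($epoch) = @_;
    $epoch = time unless defined $epoch;

    # Day and month abbreviations must be English, so force the C locale
    # for LC_TIME. Note: this changes the locale for the whole process.
    setlocale(LC_TIME, 'C');

    return strftime('%a, %d %b %Y %H:%M:%S GMT', gmtime($epoch));
}

print rfc822_date(0), "\n";    # Thu, 01 Jan 1970 00:00:00 GMT
```

In the script above, this could stand in for $now in the channel's pubDate, e.g. pubDate => rfc822_date(), while the item title keeps the friendlier local time string.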
