http://www.perlmonks.org?node_id=193671

Things are slow here at work. Too slow. So I've hit upon a project to keep me busy and extend my Perl knowledge. (Let's just say I'm a few steps short of Adept. ;)

What would it take to write a web spider that would regex out the weather forecast from my local Bureau of Meteorology (www.bom.gov.au) and email that information to myself, my wife, and a few select friends?

Am I right in assuming that the LWP module is wanted for something like this? Also, is there any complication involved in scheduling this sort of thing as a cron job?

So I guess my questions are:

Update: For the record, this is not a homework question. I've long since graduated! I've taught myself all I know about Perl from reading the Llama book, Conway's OO Perl, etc., and I'm looking around for "mini-projects" I can do to further develop my Perl skills. Your patience and occasional help with this is much appreciated!

Re: Weather goest thou, spider?
by mojotoad (Monsignor) on Aug 29, 2002 at 05:38 UTC
    There are some handy modules out there already, namely Geo::Weather and Geo::WeatherNOAA. (I've never used either, so I can't give any reviews.)

    If you decide to roll your own, yes, LWP would be a natural starting point (use LWP::Simple for ease of implementation, or the main user agent and message modules if you want to learn a bit about sockets, etc).
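
    A minimal sketch of the LWP::Simple route. The sub name fetch_page is made up for illustration, and the page address is taken from the command line because the exact forecast URL isn't given in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_page is a hypothetical helper name, not part of LWP.
# LWP::Simple's get() returns the document as a string, or undef on failure.
sub fetch_page {
    my ($url) = @_;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Only fetch when a URL is supplied as the first argument.
print fetch_page($ARGV[0]) if @ARGV;
```

    Run it with the address of the forecast page as its only argument. get() reports failure only as undef, so if you need status codes or headers you'd move up to LWP::UserAgent.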

    As for mining the weather report results, a regexp might be the way to go, or you might want to consider using an HTML parser -- this is a judgement call depending on the nature of the page.
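
    To illustrate the regexp approach, here's a rough sketch. The sample markup and the class="forecast" attribute are invented; the real page will look different, and the pattern would have to be adjusted to match it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample markup; the real Bureau page will differ.
my $html = <<'HTML';
<html><body>
<h2>Forecast for Friday</h2>
<p class="forecast">Fine and sunny. Max 21.</p>
</body></html>
HTML

# extract_forecast is a hypothetical helper name.
sub extract_forecast {
    my ($page) = @_;
    # /s lets . match newlines, /i ignores tag case, .*? is non-greedy.
    if ($page =~ m{<p\s+class="forecast">\s*(.*?)\s*</p>}si) {
        return $1;
    }
    return;    # undef in scalar context when nothing matches
}

my $forecast = extract_forecast($html);
print defined $forecast ? "$forecast\n" : "No forecast found\n";
```

    A pattern like this breaks as soon as the page layout changes, which is the usual argument for a parser on anything more complicated.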

    The main things to consider with cron job scheduling are a) automation and b) output. The script must not wait for interaction from a user, and results (success or failure) will appear in logs or a notification email, so keep the format in mind.
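
    As a sketch of what that means in practice, assuming the script lives at a made-up path and logs to a made-up file: it never prompts, it prints one timestamped line per run, and it exits non-zero on failure. The run_job name and the placeholder job are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical crontab entry: run at 06:30 daily, appending all output to a log.
#   30 6 * * * /home/me/bin/weather.pl >> /home/me/weather.log 2>&1

# Run one unattended job. Nothing here reads STDIN or prompts the user.
# Returns 0 on success and 1 on failure, like a shell exit status.
sub run_job {
    my ($job) = @_;
    my $stamp = localtime;    # scalar context gives a readable timestamp
    if (eval { $job->(); 1 }) {
        print "$stamp weather job ok\n";
        return 0;
    }
    my $err = $@ || "unknown error\n";
    chomp $err;
    print STDERR "$stamp weather job failed: $err\n";
    return 1;
}

# Placeholder job; the real fetch, parse, and mail steps would go in this sub.
my $status = run_job(sub { return 1 });
exit $status if $status;    # non-zero exit only on failure
```

    Without the redirection in the crontab line, cron normally mails whatever the job prints to the crontab's owner, so a script that stays quiet on success only generates mail when something goes wrong.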

    Good luck!
    Matt

Re: Weather goest thou, spider?
by rattusillegitimus (Friar) on Aug 29, 2002 at 05:47 UTC

    Sounds like an interesting project to me ;) I'd definitely start with LWP to pull the forecast data from the website. Depending on how the site is set up, you'll probably want to use one of the HTML parsing modules, like HTML::Parser or HTML::TokeParser, to extract what you want from the resulting HTML. CPAN has a plethora of email modules, but I'm not terribly familiar with any of them, so all I can suggest is to check them out and see which one fits your needs and your system. ;)
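
    For a feel of the parser route, here's a small HTML::TokeParser sketch. The sample markup, the class="forecast" attribute, and the forecast_paragraphs name are all invented for illustration; the real page will need its own tag and attribute tests.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Invented sample markup; the real page will differ.
my $sample = <<'HTML';
<html><body>
<p>Issued at 5am.</p>
<p class="forecast">Fine and sunny.
   Max 21.</p>
</body></html>
HTML

# Return the text of every <p class="forecast"> in the given HTML string.
sub forecast_paragraphs {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";
    my @found;
    # get_tag returns [tagname, \%attr, ...] for a start tag, undef at the end.
    while (my $tag = $p->get_tag('p')) {
        my $attr = $tag->[1];
        next unless ($attr->{class} || '') eq 'forecast';
        # Text up to the closing </p>, with whitespace collapsed and trimmed.
        push @found, $p->get_trimmed_text('/p');
    }
    return @found;
}

print "$_\n" for forecast_paragraphs($sample);
```

    Unlike a single regexp, this doesn't care about line breaks inside the paragraph or about extra attributes on the tag.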

    As for other projects, there have been numerous threads with suggestions, like Do your homework! or Exercises (I've seen more, but I'm too tired and lazy to dig them up right now ;} ). Grab one of those, or think of a small tool you'd like to have at your fingertips and dive in. Some projects that have taught me the most came about because of off-hand remarks people made to me that made me think "I could do that in Perl!" Next thing you know, I've got blinkenlights and Perlmonks Age Stats. You could also surf through the CUFP, Code Catacombs, and Craft sections of this site to find scripts you could build onto/improve for your own use. The possibilities are limited only by your imagination. ;)

    --
    $rattusillegitimus = Eliza::PerlMonks::Robot::new()

Re: Weather goest thou, spider?
by Molt (Chaplain) on Aug 29, 2002 at 10:13 UTC

    If you want a nice and complete discussion about writing spiders and parsing HTML, you may want to look at the new O'Reilly tome Perl and LWP. This includes many examples of mining information from websites, ranging from using a few regexps to pull out the information, to rebuilding the HTML in tree form and throwing it out again, or spidering entire sites in the correct manner.

    I've recently had to write a spider for work, and whilst I'd got it working and doing what we needed, this book pointed out a few things I'd overlooked, allowing me to tighten things up and cut down the chances of things falling to pieces. Well recommended.

Re: Weather goest thou, spider?
by Aristotle (Chancellor) on Sep 01, 2002 at 21:29 UTC
    I wrote two scripts that use German web weather services; both only output to console but they should provide a starting point. One uses regexen to pull out the data (and doesn't work verbatim because their pages seem to frequently move), the other relies on HTML::TableExtract.
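
    For readers who haven't seen HTML::TableExtract, here's a rough sketch of the idea. It is not either of the scripts mentioned above; the sample table, its column headings, and the forecast_rows name are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample table; a real forecast page will have different markup.
my $sample = <<'HTML';
<table>
<tr><th>Day</th><th>Forecast</th><th>Max</th></tr>
<tr><td>Friday</td><td>Fine</td><td>21</td></tr>
<tr><td>Saturday</td><td>Showers</td><td>17</td></tr>
</table>
HTML

# Return one array ref per data row, holding only the Day and Forecast columns.
sub forecast_rows {
    my ($html) = @_;
    # Pick out tables by their column headings rather than by position.
    my $te = HTML::TableExtract->new(headers => ['Day', 'Forecast']);
    $te->parse($html);
    $te->eof;
    my @rows;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            my @cells = map { defined $_ ? $_ : '' } @$row;
            s/^\s+|\s+$//g for @cells;    # trim stray whitespace
            push @rows, \@cells;
        }
    }
    return @rows;
}

print join(': ', @$_), "\n" for forecast_rows($sample);
```

    Selecting by heading means the script keeps working if the site adds a column or moves the table elsewhere on the page, as long as the headings stay the same.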