Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

NASA's Astronomy Picture of the Day

by xyzzy (Pilgrim)
on Jun 25, 2012 at 08:42 UTC ( #978153=CUFP: print w/replies, xml ) Need Help??

Tonight I was going to do some work, but first I had to set up my environment on a new (shared) computer. That meant configuring my WM (fluxbox), which included replacing the drab light grey background that comes set by default. Since I had some time and didn't really want to work on real things, I ended up writing something I wanted to have for a while: a script to set the background (wallpaper) to NASA's Astronomy Picture of the Day.

The intended use is to run it as a cron job. The default functionality will maintain a symlink to the latest image providing easy access for a WM or whatever else you want to use it. The metadata (title, credit, description) is extracted from the page and stored in a YAML file, mainly because I'm considering adding an ImageMagick script to create an annotated image. This would require the display resolution to be provided in the config section so as to render the text in an appropriate size and position relative to the screen. Depending on your WM, it may be simpler to create overlays/widgets/gadgets/etc. to display this data. Or you can just ignore it.

You can also pass a date as the argument (in YYMMDD format, to match the URL format of the APOD archive) to fetch old pictures. The date is not validated beyond checking that it consists of 6 digits. Also the script will refuse to overwrite files, because that makes sense to me. I used a number of such "ghetto" Ways To Do It (noted in comments) because I wanted to keep it simple and functional. I had to use some pretty ghetto techniques to parse the APOD page, which is completely invalid and uses no semantic markup at all, so I figured I might as well ghetto-ize the whole script. If you want to use this, you should probably tinker with it to suit your needs.

Requires HTML::Parser, LWP and YAML::XS. Intended for POSIX systems only. If you're going to use this in Windows, you should at the very least set $LINK to undef

EDIT: added a line to remove old CURRENT link before updating it... apparently symlink will fail when asked to overwrite a link...

#!/usr/bin/perl -w # Downloads NASA Astromy Picture of the Day # with metadatas in YAML format (HTML is stripped) # # Images are downloaded to a directory and stored # by date. # # Script can also maintain a symlink to the latest # picture for anyone who cares to use it to any end # # Modify the configuration in the main package below use strict; package HTML::Parser::AOTD; use base q/HTML::Parser/; my $state = 'date'; my $count = 0; sub start { my ($self, $tagname, $attr, $attrseq, $text) = @_; if ($state eq 'date') { ++$count if $tagname eq 'p'; } elsif ($state eq 'image') { if ($tagname eq 'a') { $self->{image} = $attr->{href}; $state = 'title'; } } elsif ($state eq 'title') { if ($tagname eq 'b') { ++$count; } } elsif ($state eq 'desc') { if ($tagname eq 'center') { # ghetto whitespace tidying for ($self->{desc}) {s/\s/ /gs;s/ / /g;s/^ +//;s/ +$//} $count = 0; $state = 'done'; } } } sub text { my ($self, $text) = @_; if ($state eq 'date' && $count == 2) { $text =~ s/\s*$//; $text =~ s/^\s*//; $self->{date} = $text; $count = 0; $state = 'image'; } elsif ($state eq 'title' && $count) { for($text){s/^ +//;s/ +$//} $self->{title} = $text; $count = 0; $state = 'credit'; } elsif ($state eq 'credit') { if ($count > 1) { chomp $text; $self->{credit} .= $text; } elsif ($text =~ /Image Credit/i) { ++$count; } } elsif ($state eq 'desc') { if ($count > 1) { $self->{desc} .= $text; } elsif ($text =~ /Explanation/i) { ++$count; } } } sub end { my ($self, $tagname) = @_; if ($state eq 'credit') { ++$count if $tagname eq 'b'; if ($tagname eq 'center') { $count = 0; $state = 'desc'; } } elsif ($state eq 'desc') { ++$count if $tagname eq 'b'; } } 1; package main; use LWP::Simple; use YAML::XS; # CONFIG # base url my $AOTD_URL = q||; # directory to store data my $DIRECTORY = q|/home/shared/apod/|; # name of symlink to current image/yaml # written for Unix-like systems # to disable, set to undef my $LINK = "$DIRECTORY/CURRENT"; # ghetto date conversion my %month = ( January => '01', February => '02', March => '03', April => '04', May => '05', June => '06', July => '07', August => '08', September => '09', October => '10', November => '11', December => '12' ); sub usage { print <<END; Usage: [YYMMDD] See source for configuration options END exit } my $arg = shift; # ghetto help option usage if ($arg && $arg =~ /^--?h(elp)?$/); if ($arg) { if ($arg =~ /^\d{6}$/) { # ghetto date validation $AOTD_URL .= "ap$arg.html"; } else { warn "Invalid date. Using current image."; $arg = undef; } } my $p = HTML::Parser::AOTD->new(); my $html = get($AOTD_URL) or die "Could not fetch from $AOTD_URL. Maybe the date is wrong?\n"; $p->parse($html); for ($p->{date}) { s/ /-/g; s/([A-Za-z]+)/$month{$1}/e; # ghetto date conversion } my $file = "$DIRECTORY/$p->{date}.yaml"; die "$file exists - remove manually to download again\n" if -e $file; my ($ext) = $p->{image} =~ /\.(\w+)$/; my $imgfile = "$DIRECTORY/$p->{date}.$ext"; getstore("$AOTD_URL_BASE/$p->{image}", $imgfile) == RC_OK or die "Something went wrong when downloading $p->{image}: $!"; open my $fh, '>', $file or die "can't write to $file: $!"; print $fh Dump { title => $p->{title}, credit => $p->{credit}, desc => $p->{desc}, imgfile => $imgfile, }; close $fh; if ($LINK && !$arg) { -e && unlink for ($LINK, "$LINK.yaml") symlink $file, "$LINK.yaml"; symlink $imgfile, $LINK; } __END__ `,= (K) 3178 \/\/_____\/ /\ \ / / \ /__/__ ALL RIGHTS REVERSED

$,=qq.\n.;print q.\/\/____\/.,q./\ \ / / \\.,q.    /_/__.,q..
Happy, sober, smart: pick two.

Replies are listed 'Best First'.
Re: NASA's Astronomy Picture of the Day
by Anonymous Monk on Jun 25, 2012 at 11:04 UTC

    I hope you had fun, here is something new :) a walkthrough of how to shorten your html parsing stuff, declarative style (i think)

    $ lwp-download apod.html
    4.21 KB received

    $ perl apod.html _tag p | head -n 6

    HTML::Element=HASH(0xb5ed04) Milky Way Over Piton de l'Eau /html/body/center[2]/b /html/body/center[2]/b /html/body[@link='#0000FF' and @vlink='#7F0F9F' and @alink='#FF0000' a +nd @bgcolor='#F4F4FF' and @text='#000000']/center[2]/b ------------------------------------------------------------------

    Then plug stuff into scraper/Web::Scraper

    $ scraper apod.html scraper> d $VAR1 = {}; scraper> process '/html/body/center/p[2]' => 'Date' => 'TEXT'; scraper> d $VAR1 = { 'Date' => ' 2012 June 25 ' }; scraper> process '//b' => 'b[]' => 'TEXT'; scraper> y --- Date: ' 2012 June 25 ' b: - " Milky Way Over Piton de l'Eau " - ' Image Credit & Copyright: ' - ' Explanation: ' - ' Help Evaluate APOD: ' - " Tomorrow's picture: " - ' Authors & editors: ' - 'NASA Official: ' - 'A service of:' - '&' scraper> c all #!c:\perl\5.14.1\bin\MSWin32-x86-multi-thread\perl.exe use strict; use Web::Scraper; use URI; my $file = \do { my $file = "apod.html"; open my $fh, $file or die "$f +ile: $!"; join '', <$fh> }; my $scraper = scraper { process '/html/body/center/p[2]' => 'Date' => 'TEXT'; process '//b' => 'b[]' => 'TEXT'; }; my $result = $scraper->scrape($file); scraper> q

    And repeat. Firefox/Firebug can be useful for extracting xpaths. You can end up with

    my $scraper = scraper { process '//b[1]' => 'Title' => 'TEXT'; process '/html/body/center[2]/b[2]' => 'Credit' => 'TEXT'; process '/html/body/p[1]' => 'Desc' => 'TEXT'; process '/html/body/center/p[2]' => 'Date' => 'TEXT'; #~ process q{//a[ @href =~ "image/" ]} => 'Image' => '@HREF'; process q{//a[ contains(@href, "image/") ]} => 'Image' => '@HREF'; }; ## NOTE use URI object so scraper will download (read) file my $url = URI->new( 'file:apod.html' ); my $base = ''; my $ret = $scraper->scrape( $url , $base );

    You can also mirror the html file ( LWP::Simple::mirror() ) and only scrape-it if its new

      Very cool! Thanks for mentioning Web::Scraper. I had been using LWP::Simple for something similar, and I always like to check out other CPAN modules to see if there might be a better/more fun way to do it.

      This seems a bit complicated, imho. Isn't there a way to make it simpler by using the RSS instead of HTML?

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://978153]
Approved by cjb
Front-paged by Arunbear
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (4)
As of 2022-12-02 13:02 GMT
Find Nodes?
    Voting Booth?