Re: NASA's Astronomy Picture of the Day

I hope you had fun. Here is something new :) a walkthrough of how to shorten your HTML parsing code, declarative style (I think).

$ lwp-download http://apod.nasa.gov/apod/ apod.html
4.21 KB received

$ perl htmltreexpather.pl apod.html _tag p | head -n 6

HTML::Element=HASH(0xb5ed04)    0.1.1.0
Milky Way Over Piton de l'Eau
/html/body/center[2]/b
/html/body/center[2]/b
/html/body[@link='#0000FF' and @vlink='#7F0F9F' and @alink='#FF0000' and @bgcolor='#F4F4FF' and @text='#000000']/center[2]/b
------------------------------------------------------------------

Then plug the xpaths into the scraper shell that comes with Web::Scraper:

$ scraper apod.html
scraper> d
$VAR1 = {};
scraper> process '/html/body/center/p[2]' => 'Date' => 'TEXT';
scraper> d
$VAR1 = {
  'Date' => ' 2012 June 25  '
};
scraper> process '//b' => 'b[]' => 'TEXT';
scraper> y
---
Date: ' 2012 June 25  '
b:
  - " Milky Way Over Piton de l'Eau "
  - ' Image Credit & Copyright: '
  - ' Explanation: '
  - ' Help Evaluate APOD: '
  - " Tomorrow's picture: "
  - ' Authors & editors: '
  - 'NASA Official: '
  - 'A service of:'
  - '&'
scraper> c all
#!c:\perl\5.14.1\bin\MSWin32-x86-multi-thread\perl.exe
use strict;
use Web::Scraper;
use URI;

my $file = \do { my $file = "apod.html"; open my $fh, $file or die "$file: $!"; join '', <$fh> };
my $scraper = scraper {
    process '/html/body/center/p[2]' => 'Date' => 'TEXT';
    process '//b' => 'b[]' => 'TEXT';
};
my $result = $scraper->scrape($file);
scraper> q

And repeat. Firefox/Firebug can be useful for extracting XPaths. You can end up with:


my $scraper = scraper {
    process '//b[1]' => 'Title' => 'TEXT';
    process '/html/body/center[2]/b[2]' => 'Credit' => 'TEXT';
    process '/html/body/p[1]' => 'Desc' => 'TEXT';
    process '/html/body/center/p[2]' => 'Date' => 'TEXT';
#~     process q{//a[ @href =~ "image/" ]} => 'Image' => '@HREF';
    process q{//a[ contains(@href, "image/") ]} => 'Image' => '@HREF';
};

## NOTE: pass a URI object so the scraper will download (read) the file
my $url  =  URI->new( 'file:apod.html' );
my $base =  'http://apod.nasa.gov/apod/';
my $ret  =  $scraper->scrape( $url , $base );
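
To see what comes back from scrape(), here is a minimal sketch. It assumes a cut-down, made-up stand-in for apod.html (the image path in it is hypothetical) and passes the HTML as a scalar ref instead of a file: URI, so it runs without the downloaded page. The result should be a plain hashref keyed by the names given to process, and with a base URL supplied I would expect the relative href to come back resolved against it.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical, simplified stand-in for apod.html; the real page has far more markup
my $html = <<'HTML';
<html><body>
<center><b>Milky Way Over Piton de l'Eau</b></center>
<a href="image/1206/example.jpg">picture</a>
</body></html>
HTML

my $scraper = scraper {
    process '//b[1]' => 'Title' => 'TEXT';
    process q{//a[ contains(@href, "image/") ]} => 'Image' => '@href';
};

# A scalar ref is scraped as HTML content; the second argument is the base URL
my $ret = $scraper->scrape( \$html, 'http://apod.nasa.gov/apod/' );

# Walk the result hash and print every field that was extracted
for my $key ( sort keys %$ret ) {
    print "$key: $ret->{$key}\n";
}
```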

You can also mirror the HTML file ( LWP::Simple::mirror() ) and only scrape it if it's new.
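
A rough sketch of the mirror idea, assuming the same URL and file name as above. mirror() returns the HTTP status code, and a 304 means the local copy is already current, so the scrape can be skipped. Where the scrape call goes is only marked with a comment; wiring in $scraper is left to you.

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror is_success);
use HTTP::Status qw(:constants);

my $url  = 'http://apod.nasa.gov/apod/';
my $file = 'apod.html';

# mirror() sends If-Modified-Since based on the local file's timestamp
# and only rewrites the file when the server returns a newer copy
my $status = mirror( $url, $file );

if ( $status == HTTP_NOT_MODIFIED ) {
    print "$file is up to date, nothing to scrape\n";
}
elsif ( is_success($status) ) {
    print "fetched a new $file, time to scrape\n";
    # e.g. $scraper->scrape( URI->new("file:$file"), $url ) goes here
}
else {
    die "mirror failed with HTTP status $status\n";
}
```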

Re^2: NASA's Astronomy Picture of the Day by nightgoat (Acolyte) on Jul 05, 2012 at 20:18 UTC
Very cool! Thanks for mentioning Web::Scraper. I had been using LWP::Simple for something similar, and I always like to check out other CPAN modules to see if there might be a better/more fun way to do it.
Re^2: NASA's Astronomy Picture of the Day by grondilu (Friar) on Nov 04, 2012 at 04:21 UTC
This seems a bit complicated, imho. Isn't there a way to make it simpler by using the RSS instead of HTML?

