PerlMonks  

Re: NASA's Astronomy Picture of the Day

by Anonymous Monk
on Jun 25, 2012 at 11:04 UTC ( #978185=note )


in reply to NASA's Astronomy Picture of the Day

I hope you had fun. Here is something new :) a walkthrough of how to shorten your HTML parsing code, declarative style (I think).

$ lwp-download http://apod.nasa.gov/apod/ apod.html
4.21 KB received

$ perl htmltreexpather.pl apod.html _tag p | head -n 6

HTML::Element=HASH(0xb5ed04) 0.1.1.0
 Milky Way Over Piton de l'Eau
/html/body/center[2]/b
/html/body/center[2]/b
/html/body[@link='#0000FF' and @vlink='#7F0F9F' and @alink='#FF0000' and @bgcolor='#F4F4FF' and @text='#000000']/center[2]/b
------------------------------------------------------------------
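htmltreexpather.pl itself is not shown in this post, so the following is only a guess at what such a helper might do: a minimal sketch, assuming HTML::TreeBuilder is installed, that reports the address, text, and a positional XPath for every element with a given tag. The names xpath_for and list_xpaths are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Scalar::Util qw(refaddr);

# Build a positional XPath for one element by walking from it up to the root.
# (Hypothetical helper, not taken from htmltreexpather.pl.)
sub xpath_for {
    my ($node) = @_;
    my @steps;
    for my $el ($node, $node->lineage) {
        my $tag  = $el->tag;
        my $step = $tag;
        if (my $parent = $el->parent) {
            # Sibling elements that share this tag, in document order.
            my @same = grep { ref $_ && $_->tag eq $tag } $parent->content_list;
            if (@same > 1) {
                my ($pos) = grep { refaddr($same[$_]) == refaddr($el) } 0 .. $#same;
                $step .= '[' . ($pos + 1) . ']';
            }
        }
        unshift @steps, $step;
    }
    return '/' . join('/', @steps);
}

# Return [address, text, xpath] for every element with the given tag.
sub list_xpaths {
    my ($html, $tag) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @rows = map { [ $_->address, $_->as_text, xpath_for($_) ] }
               $tree->look_down(_tag => $tag);
    $tree->delete;    # free the tree explicitly
    return @rows;
}
```

Calling list_xpaths($html, 'b') on the contents of apod.html and printing each row should give a listing along the lines of the one above, though the real script also prints a variant with attribute predicates that this sketch leaves out.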

Then plug the XPath into scraper, the interactive shell from Web::Scraper:

$ scraper apod.html
scraper> d
$VAR1 = {};
scraper> process '/html/body/center/p[2]' => 'Date' => 'TEXT';
scraper> d
$VAR1 = {
          'Date' => ' 2012 June 25 '
        };
scraper> process '//b' => 'b[]' => 'TEXT';
scraper> y
---
Date: ' 2012 June 25 '
b:
  - " Milky Way Over Piton de l'Eau "
  - ' Image Credit & Copyright: '
  - ' Explanation: '
  - ' Help Evaluate APOD: '
  - " Tomorrow's picture: "
  - ' Authors & editors: '
  - 'NASA Official: '
  - 'A service of:'
  - '&'
scraper> c all
#!c:\perl\5.14.1\bin\MSWin32-x86-multi-thread\perl.exe
use strict;
use Web::Scraper;
use URI;

my $file = \do { my $file = "apod.html"; open my $fh, $file or die "$file: $!"; join '', <$fh> };
my $scraper = scraper {
    process '/html/body/center/p[2]' => 'Date' => 'TEXT';
    process '//b' => 'b[]' => 'TEXT';
};
my $result = $scraper->scrape($file);
scraper> q

And repeat. Firefox/Firebug can be useful for extracting XPaths. You can end up with:

my $scraper = scraper {
    process '//b[1]'                    => 'Title'  => 'TEXT';
    process '/html/body/center[2]/b[2]' => 'Credit' => 'TEXT';
    process '/html/body/p[1]'           => 'Desc'   => 'TEXT';
    process '/html/body/center/p[2]'    => 'Date'   => 'TEXT';
    #~ process q{//a[ @href =~ "image/" ]} => 'Image' => '@HREF';
    process q{//a[ contains(@href, "image/") ]} => 'Image' => '@HREF';
};

## NOTE use URI object so scraper will download (read) file
my $url  = URI->new( 'file:apod.html' );
my $base = 'http://apod.nasa.gov/apod/';
my $ret  = $scraper->scrape( $url , $base );
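The scraper session shows TEXT values coming back padded with whitespace (' 2012 June 25 '), so a small clean-up pass over the result hash may be handy before printing. This is a sketch in plain Perl; trim_fields and _trim are made-up names, and it assumes the result is a flat hash whose values are strings, objects that stringify (such as a URI), or arrays of those.

```perl
use strict;
use warnings;

# Strip leading and trailing whitespace from one value.
# Objects (e.g. a URI) are stringified first; undef is passed through.
sub _trim {
    my ($text) = @_;
    return $text unless defined $text;
    $text = "$text";
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Return a cleaned copy of a scrape result (hypothetical helper).
# Array values such as 'b[]' are trimmed element by element.
sub trim_fields {
    my ($result) = @_;
    my %clean;
    for my $key (keys %$result) {
        my $value = $result->{$key};
        $clean{$key} = ref $value eq 'ARRAY'
            ? [ map { _trim($_) } @$value ]
            : _trim($value);
    }
    return \%clean;
}
```

With the scraper above, something like my $clean = trim_fields($ret); followed by print "$clean->{Date}: $clean->{Title}\n"; would print the two fields without the surrounding spaces.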

You can also mirror the HTML file (LWP::Simple::mirror()) and scrape it only if it's new.
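A rough sketch of that idea, assuming LWP::Simple is installed: mirror() returns the HTTP status code, 200 when a fresh copy was written and 304 when the local file is already current, so the scrape can be skipped on 304. The name scrape_if_new and the injectable $fetch argument are made up for illustration.

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror);

# Mirror $url to $file and run $on_new->($file) only when the copy changed.
# $fetch is an optional stand-in for mirror() (hypothetical, for testing).
sub scrape_if_new {
    my ($url, $file, $on_new, $fetch) = @_;
    $fetch ||= \&mirror;
    my $status = $fetch->($url, $file);
    return $on_new->($file) if $status == 200;    # fresh copy was written
    return if $status == 304;                     # not modified, nothing to do
    die "mirror($url) failed with HTTP status $status\n";
}
```

It could be called as scrape_if_new($base, 'apod.html', sub { $scraper->scrape( URI->new('file:apod.html'), $base ) }); with the $scraper and $base from the example above.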


Re^2: NASA's Astronomy Picture of the Day
by nightgoat (Acolyte) on Jul 05, 2012 at 20:18 UTC

    Very cool! Thanks for mentioning Web::Scraper. I had been using LWP::Simple for something similar, and I always like to check out other CPAN modules to see if there might be a better/more fun way to do it.

Re^2: NASA's Astronomy Picture of the Day
by grondilu (Pilgrim) on Nov 04, 2012 at 04:21 UTC

    This seems a bit complicated, imho. Isn't there a way to make it simpler by using the RSS instead of HTML?
