Re: NASA's Astronomy Picture of the Day

by Anonymous Monk
on Jun 25, 2012 at 11:04 UTC


in reply to NASA's Astronomy Picture of the Day

I hope you had fun. Here is something new :) a walkthrough of how to shorten your HTML parsing, declarative style (I think).

$ lwp-download http://apod.nasa.gov/apod/ apod.html
4.21 KB received

$ perl htmltreexpather.pl apod.html _tag p | head -n 6

HTML::Element=HASH(0xb5ed04) 0.1.1.0
 Milky Way Over Piton de l'Eau
/html/body/center[2]/b
/html/body/center[2]/b
/html/body[@link='#0000FF' and @vlink='#7F0F9F' and @alink='#FF0000' and @bgcolor='#F4F4FF' and @text='#000000']/center[2]/b
------------------------------------------------------------------

Then plug those XPaths into scraper, the interactive shell that ships with Web::Scraper:

$ scraper apod.html
scraper> d
$VAR1 = {};
scraper> process '/html/body/center/p[2]' => 'Date' => 'TEXT';
scraper> d
$VAR1 = {
          'Date' => ' 2012 June 25 '
        };
scraper> process '//b' => 'b[]' => 'TEXT';
scraper> y
---
Date: ' 2012 June 25 '
b:
  - " Milky Way Over Piton de l'Eau "
  - ' Image Credit & Copyright: '
  - ' Explanation: '
  - ' Help Evaluate APOD: '
  - " Tomorrow's picture: "
  - ' Authors & editors: '
  - 'NASA Official: '
  - 'A service of:'
  - '&'
scraper> c all
#!c:\perl\5.14.1\bin\MSWin32-x86-multi-thread\perl.exe
use strict;
use Web::Scraper;
use URI;

my $file = \do { my $file = "apod.html"; open my $fh, $file or die "$file: $!"; join '', <$fh> };
my $scraper = scraper {
    process '/html/body/center/p[2]' => 'Date' => 'TEXT';
    process '//b' => 'b[]' => 'TEXT';
};

my $result = $scraper->scrape($file);
scraper> q

And repeat. Firefox/Firebug can be useful for extracting XPaths. You can end up with:

use strict;
use Web::Scraper;
use URI;

my $scraper = scraper {
    process '//b[1]' => 'Title' => 'TEXT';
    process '/html/body/center[2]/b[2]' => 'Credit' => 'TEXT';
    process '/html/body/p[1]' => 'Desc' => 'TEXT';
    process '/html/body/center/p[2]' => 'Date' => 'TEXT';
    #~ process q{//a[ @href =~ "image/" ]} => 'Image' => '@HREF';
    process q{//a[ contains(@href, "image/") ]} => 'Image' => '@HREF';
};

## NOTE use URI object so scraper will download (read) file
my $url = URI->new( 'file:apod.html' );
my $base = 'http://apod.nasa.gov/apod/';
my $ret = $scraper->scrape( $url , $base );
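
If it isn't obvious what comes back from scrape(), here is a minimal sketch that runs the same kind of rules against an inline snippet, so it works offline. The HTML is a made-up stand-in (not the real APOD markup), the XPaths are adjusted to fit it, and scrape_apod is just a name I picked for illustration. The result is a plain hashref keyed by the names given to process, with an arrayref for the 'b[]' key.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical helper: scrape an HTML string and return the result hashref.
# The XPaths match the made-up snippet below, not necessarily the live page.
sub scrape_apod {
    my ($html) = @_;
    my $scraper = scraper {
        process '//center[1]/p'    => 'Date'  => 'TEXT';
        process '//center[2]/b[1]' => 'Title' => 'TEXT';
        process '//b'              => 'b[]'   => 'TEXT';
        process q{//a[ contains(@href, "image/") ]} => 'Image' => '@href';
    };
    # A scalar reference tells scrape() this is HTML content, not a URL.
    return $scraper->scrape( \$html );
}

# Made-up stand-in for apod.html.
my $sample = <<'HTML';
<html><body>
<center><h1>Astronomy Picture of the Day</h1>
<p>2012 June 25</p>
<a href="image/1206/example.jpg"><img src="image/1206/example_small.jpg"></a>
</center>
<center><b>Milky Way Over Piton de l'Eau</b><br><b>Image Credit:</b></center>
</body></html>
HTML

my $ret = scrape_apod($sample);

# Scalar keys hold one value, 'b[]' collects every match into an arrayref.
print "Date:  $ret->{Date}\n";
print "Title: $ret->{Title}\n";
print "Image: $ret->{Image}\n";
print "bold:  $_\n" for @{ $ret->{b} };
```

Against the real page you would pass the URI object and base URL as in the code above instead of a string.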

You can also mirror the HTML file (LWP::Simple::mirror()) and only scrape it if it's new.
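
A rough sketch of the mirror idea, assuming the server honours If-Modified-Since: mirror() returns the HTTP status code, and a 304 means the local copy is still current. The helper name action_for_status and the command-line handling are my own choices for illustration, not anything from LWP.

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror is_success RC_NOT_MODIFIED);

# Hypothetical helper: map the status code returned by mirror() to an action.
#   'skip'   : 304, the local copy is already up to date
#   'scrape' : a fresh copy was written to disk
#   'error'  : anything else (network failure, 404, ...)
sub action_for_status {
    my ($rc) = @_;
    return 'skip'   if $rc == RC_NOT_MODIFIED;
    return 'scrape' if is_success($rc);
    return 'error';
}

# Run as: perl mirror_apod.pl http://apod.nasa.gov/apod/ apod.html
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;

    # mirror() sends If-Modified-Since based on the local file's timestamp
    # and only rewrites the file when the server sends new content.
    my $rc     = mirror($url, $file);
    my $action = action_for_status($rc);

    if ($action eq 'scrape') {
        print "$file updated, run the scraper on it now\n";
    }
    elsif ($action eq 'skip') {
        print "$file unchanged, nothing to do\n";
    }
    else {
        warn "fetching $url failed with status $rc\n";
    }
}
```

In the 'scrape' branch you would call the scraper on URI->new("file:$file") with the base URL, as in the code above.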

Replies are listed 'Best First'.
Re^2: NASA's Astronomy Picture of the Day
by nightgoat (Acolyte) on Jul 05, 2012 at 20:18 UTC

    Very cool! Thanks for mentioning Web::Scraper. I had been using LWP::Simple for something similar, and I always like to check out other CPAN modules to see if there might be a better/more fun way to do it.

Re^2: NASA's Astronomy Picture of the Day
by grondilu (Friar) on Nov 04, 2012 at 04:21 UTC

    This seems a bit complicated, imho. Isn't there a way to make it simpler by using the RSS instead of HTML?
