PerlMonks  

Re^2: Parsing HTML

by marcoss (Novice)
on Jun 07, 2012 at 11:59 UTC ( #974924=note: print w/ replies, xml ) Need Help??


in reply to Re: Parsing HTML
in thread Parsing HTML

Hi mirod, thank you so much for the solution provided! I had to remove some lines because (from what I understand) you're using Perl 6 and my version is v5.10.1. I'm not familiar with HTML::TreeBuilder::XPath and the findnodes function, so I've been doing some research. I want to see if, by using your script, I can obtain not only all of the trips with all their details, but all of the trips with the details separated. For example, this is the output I need for each trip:

Trip Name: Nordic seas
Price: 500
Itinerary: Denmark, Oslo, Helsinki
Departure date: 12/04/2012
Ship Name: Costa Magica
Includes: Cruise
Departure port: Copenhagen
Duration: 7 days
That way I can later load all those individual pieces of information into a database. Like I said, I'm new to Perl, and all I do is trial & error, so until I have more time to study during the summer I will appreciate all the help you guys at PerlMonks can provide. Thanks again for all the great work!


Re^3: Parsing HTML
by mirod (Canon) on Jun 07, 2012 at 12:34 UTC

    Perl6::Slurp is a regular Perl 5 module; it just emulates Perl 6's slurp builtin. Learning a bit of XPath is always useful; look at Zvon's tutorial, for example.
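    If you'd rather avoid the extra dependency, here is a minimal sketch (the demo file name is made up) of what Perl6::Slurp's `slurp '<:utf8', $file` boils down to in core Perl 5, namely opening the file with a UTF-8 layer and reading it with the record separator undefined:

```perl
use strict;
use warnings;

# write a small sample file so the example is self-contained
my $cache = 'slurp-demo.txt';            # hypothetical file name
open my $out, '>:utf8', $cache or die "cannot write $cache: $!";
print {$out} "line 1\nline 2\n";
close $out;

# core-Perl equivalent of Perl6::Slurp's  slurp '<:utf8', $cache
my $page = do {
    open my $in, '<:utf8', $cache or die "cannot open $cache: $!";
    local $/;                            # slurp mode: read to end of file
    <$in>;
};

print length($page), "\n";               # 14
unlink $cache;
```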

    As for the rest, you need to look at the source of the page, see what information you need, and figure out which XPath queries will get it for you. The cruise info, for example, is not in the p.itinerari-info element; it's in the div.sx element. From that element you can get the title and price, then go down some more and get the various other fields.
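    To make the findnodes/findvalue pattern concrete before the full script, here is a small sketch run against a cut-down stand-in for the real page (the markup below is invented for illustration; the real page is much richer):

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # the CPAN module used in this thread

# a minimal stand-in for one cruise block on the real page
my $html = <<'HTML';
<div class="info-cruise">
  <div class="sx"><h3>Nordic seas</h3></div>
  <span class="new-price">500</span>
</div>
HTML

my $p = HTML::TreeBuilder::XPath->new_from_content($html);

# findnodes returns matching elements; findvalue returns their text
my ($trip) = $p->findnodes('//div[@class="info-cruise"]');
my $title  = $trip->findvalue('.//div[@class="sx"]/h3');
my $price  = $trip->findvalue('.//span[@class="new-price"]');

print "Trip Name: $title\n";    # Trip Name: Nordic seas
print "Price: $price\n";        # Price: 500
```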

    Here is an example. It does not output the 'Includes' field; you'll have to do that one yourself:

#!/usr/bin/perl -w
use strict;
use warnings;

use LWP::Simple;
use Perl6::Slurp;              # to load the page from the cache
use HTML::TreeBuilder::XPath;  # easier to use than bare HTML::TreeBuilder

# during development we don't want to hit the real page,
# so we'll have a -c switch to use a cache
use Getopt::Std;
my %opt;
getopts( 'c', \%opt);          # if called with -c then $opt{c} is true

my $base  = 'http://www.costacrociere.it';
my $url   = '/it/lista_crociere/capitali_nord_europa-201206.html';
my $cache = 'capitali_nord_europa-201206.html';

# this will get rid of the bad characters you were seeing in the output
binmode( STDOUT, ':utf8');

if( ! $opt{c}) { getstore( $base.$url, $cache); }  # only get the live page without -c

my $page = slurp '<:utf8', $cache;

my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

my @trips = $p->findnodes( '//div[@class="info-cruise"]');

foreach my $trip (@trips) {
    my $title = $trip->findvalue( './/div[@class="sx"]/h3');
    print "$title\n";
    my $price = $trip->findvalue( './/span[@class="new-price"]');
    print "price: $price\n";

    # this is very brittle, but it gives you a base on which you can build
    foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')) {
        my $info_title = $info->findnodes( './b')->[0];
        print $info_title->as_text();
        $info_title->detach;
        my $info_value = $info->as_text;
        print ": ", $info_value, "\n";
    }
    print "\n";
}

      :D I might approach that like this (look ma, no slurping):

      $ lwp-download http://www.costacrociere.it/it/lista_crociere/capitali_nord_europa-201206.html
      Saving to 'capitali_nord_europa-201206.html'...
      134 KB received in 1 seconds (134 KB/sec)

      $ perl htmltreexpather.pl capitali_nord_europa-201206.html _tag p | ack Copenhagen -C3 | head

//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------
HTML::Element=HASH(0xb91ba4) 0.1.0.8.1.0.1.1.1.0.0
Itinerario Danimarca, fiordi norvegesi, Germania
Data partenza 17 giugno 2012
Nave Costa Fortuna
N.ro giorni crociera 7
Porto di partenza Copenhagen
Documenti di viaggio Passaporto o Carta d'identità valida per l'espatrio
Possono essere disponibili le seguenti tariffe
/html/body/form/div/div[2]/div/div[2]/div/div[2]/div/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p[@class='itinerari-info']
--
//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------

      Then plug stuff into Web::Scraper; it's like XML::Rules:

#!/usr/bin/perl --
use strict;
use warnings;
use Data::Dump;
use URI;
use Web::Scraper;

my $soy = scraper {
    ## only get leafs/twigs with this @class
    ## store the results into { info => \@info }
    process '.info-cruise' => 'info[]' => scraper {
        process './/div[@class="sx"]/h3' => 'title' => 'TEXT';
        process '.new-price' => 'price' => 'TEXT';
        process '.itinerari-info' => 'span[]' => scraper {
            #~ process '//span' => 'span[]' => 'RAW'; ## this
            process '//span/b | //span/child::text()' => 'span[]' => sub {
                my $ishtml   = $_[0]->isa('HTML::Element');
                my $keyOrVal = $ishtml ? 'key' : 'val';
                my %foo = ( $keyOrVal => $_[0]->getValue );
                $foo{raw} = $_[0]->as_XML if $ishtml;
                return \%foo;
            };
        };
    };
};

## NOTE Web::Scraper wants URI objects
my $url  = URI->new('file:capitali_nord_europa-201206.html');
my $base = 'http://www.costacrociere.it';
my $ret  = $soy->scrape( $url, $base );
#~ dd $ret;
dd $ret->{info}->[0];

__END__
{
  price => "\x{20AC} 510,00",
  span  => [
    {
      span => [
        { key => " Itinerario ", raw => "<b> Itinerario </b>\n" },
        { val => " Danimarca, fiordi norvegesi, Germania" },
        { val => " " },
        { key => "Data partenza", raw => "<b>Data partenza</b>\n" },
        { val => " 17\xA0giugno\xA02012 " },
        { key => " Nave ", raw => "<b> Nave </b>\n" },
        { val => " Costa Fortuna" },
        { key => " N.ro giorni crociera \xA0 ",
          raw => "<b> N.ro giorni crociera \xA0 </b>\n",
        },
        { val => " 7" },
        { key => " Porto di partenza ", raw => "<b> Porto di partenza </b>\n" },
        { val => " Copenhagen" },
        { key => " Documenti di viaggio ",
          raw => "<b> <a href=\"http://www.costacrociere.it/B2C/I/Before_you_go/documentation/travel.htm\" target=\"_blank\">Documenti di viaggio</a> </b>\n",
        },
        { val => " Passaporto\xA0o\xA0Carta d'identit\xE0 valida per l'espatrio" },
        { val => " Possono essere disponibili le seguenti tariffe " },
      ],
    },
  ],
  title => "Le terre dei vichinghi",
}

      I wouldn't be surprised if tobyink stops by with a Web::Magic example :)
