Re^3: Parsing HTML

by mirod (Canon)
on Jun 07, 2012 at 12:34 UTC (#974928=note)


in reply to Re^2: Parsing HTML
in thread Parsing HTML

Perl6::Slurp is a regular Perl 5 module; it just emulates Perl 6's slurp builtin. Learning a bit of XPath is always useful; Zvon's tutorial is a good place to start.
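To make the "emulates slurp" point concrete, here is a minimal core-Perl sketch of roughly what reading a whole UTF-8 file in one call amounts to. The name `slurp_utf8` is a hypothetical helper invented for illustration; it is not part of Perl6::Slurp, and it uses the stricter `:encoding(UTF-8)` layer rather than `:utf8`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from Perl6::Slurp): read an entire file
# into a single string, decoding it as UTF-8.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    local $/;               # undefine the record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

Under these assumptions, `my $page = slurp_utf8($cache);` would stand in for a `slurp '<:utf8', $cache` call if you would rather not install an extra module.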

As for the rest, you need to look at the source of the page, see what information you need, and work out which XPath queries will get it for you. The cruise info, for example, is not in the p.itinerari-info element; it's in the div.sx element. From that element you can get the title and price, then go further down the tree to get the various other fields.

Here is an example. It does not output the 'Includes' field; you'll have to do that one yourself:

#!/usr/bin/perl -w

use strict;
use warnings;

use LWP::Simple;
use Perl6::Slurp;             # to load the page from the cache
use HTML::TreeBuilder::XPath; # easier to use than bare HTML::TreeBuilder

# during development we don't want to hit the real page,
# so we'll have a -c switch to use a cache
use Getopt::Std;
my %opt;
getopts( 'c', \%opt); # if called with -c then $opt{c} is true

my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
my $cache= 'capitali_nord_europa-201206.html';

# this will get rid of the bad characters you were seeing in the output
binmode( STDOUT, ':utf8');

if( ! $opt{c}) { getstore( $base.$url, $cache); } # only get the live page without -c

my $page= slurp '<:utf8', $cache;

my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

my @trips= $p->findnodes( '//div[@class="info-cruise"]');

foreach my $trip (@trips){
    my $title = $trip->findvalue( './/div[@class="sx"]/h3');
    print "$title\n";
    my $price = $trip->findvalue( './/span[@class="new-price"]');
    print "price: $price\n";

    # this is very brittle, but it gives you a base on which you can build
    foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')) {
        my $info_title= $info->findnodes( './b')->[0];
        print $info_title->as_text();
        $info_title->detach;
        my $info_value= $info->as_text;
        print ": ", $info_value, "\n";
    }
    print "\n";
}


Re^4: Parsing HTML
by Anonymous Monk on Jun 08, 2012 at 04:57 UTC

    :D I might approach that like this (look ma, no slurping)

    $ lwp-download http://www.costacrociere.it/it/lista_crociere/capitali_nord_europa-201206.html
    Saving to 'capitali_nord_europa-201206.html'...
    134 KB received in 1 seconds (134 KB/sec)

    $ perl htmltreexpather.pl capitali_nord_europa-201206.html _tag p | ack Copenhagen -C3 | head

    //div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
    ------------------------------------------------------------------
    HTML::Element=HASH(0xb91ba4) 0.1.0.8.1.0.1.1.1.0.0
    Itinerario Danimarca, fiordi norvegesi, Germania Data partenza 17 giugno 2012 Nave Costa Fortuna N.ro giorni crociera 7 Porto di partenza Copenhagen Documenti di viaggio Passaporto o Carta d'identità valida per l'espatrio Possono essere disponibili le seguenti tariffe
    /html/body/form/div/div[2]/div/div[2]/div/div[2]/div/p
    //div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p
    //div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p[@class='itinerari-info']
    --
    //div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
    ------------------------------------------------------------------

    Then plug that into Web::Scraper; it's like XML::Rules

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump;
    use URI;
    use Web::Scraper;

    my $soy = scraper {
        ## only get leafs/twigs with this @class
        ## store the results into { info => \@info }
        process '.info-cruise' => 'info[]' => scraper {
            process './/div[@class="sx"]/h3' => 'title' => 'TEXT';
            process '.new-price' => 'price' => 'TEXT';
            process '.itinerari-info' => 'span[]' => scraper {
                #~ process '//span' => 'span[]' => 'RAW'; ## this
                process '//span/b | //span/child::text()' => 'span[]' => sub {
                    my $ishtml = $_[0]->isa('HTML::Element');
                    my $keyOrVal = $ishtml ? 'key' : 'val';
                    my %foo = ( $keyOrVal => $_[0]->getValue );
                    $foo{raw} = $_[0]->as_XML if $ishtml;
                    return \%foo;
                };
            };
        };
    };

    ## NOTE Web::Scraper wants URI objects
    my $url = URI->new('file:capitali_nord_europa-201206.html');
    my $base='http://www.costacrociere.it';
    my $ret = $soy->scrape( $url , $base );
    #~ dd $ret;
    dd $ret->{info}->[0];
    __END__
    {
      price => "\x{20AC} 510,00",
      span  => [
        {
          span => [
            { key => " Itinerario ", raw => "<b> Itinerario </b>\n" },
            { val => " Danimarca, fiordi norvegesi, Germania" },
            { val => " " },
            { key => "Data partenza", raw => "<b>Data partenza</b>\n" },
            { val => " 17\xA0giugno\xA02012 " },
            { key => " Nave ", raw => "<b> Nave </b>\n" },
            { val => " Costa Fortuna" },
            {
              key => " N.ro giorni crociera \xA0 ",
              raw => "<b> N.ro giorni crociera \xA0 </b>\n",
            },
            { val => " 7" },
            { key => " Porto di partenza ", raw => "<b> Porto di partenza </b>\n" },
            { val => " Copenhagen" },
            {
              key => " Documenti di viaggio ",
              raw => "<b> <a href=\"http://www.costacrociere.it/B2C/I/Before_you_go/documentation/travel.htm\" target=\"_blank\">Documenti di viaggio</a> </b>\n",
            },
            {
              val => " Passaporto\xA0o\xA0Carta d'identit\xE0 valida per l'espatrio",
            },
            { val => " Possono essere disponibili le seguenti tariffe " },
          ],
        },
      ],
      title => "Le terre dei vichinghi",
    }
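The dump above is a flat list that alternates `key` and `val` entries, sometimes with several `val` entries (or a whitespace-only one) after a single key. As a rough sketch of one way to fold that list into label/value pairs, using only core Perl: `fold_pairs` and `trim` are hypothetical helper names, the sample data is copied from the dump, and joining multiple values with "; " is an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise whitespace, treating the non-breaking
# spaces (\xA0) seen in the dump as ordinary spaces.
sub trim {
    my ($s) = @_;
    $s =~ s/\xA0/ /g;
    $s =~ s/\s+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# Hypothetical helper: fold a list of { key => ... } / { val => ... }
# hashes into ordered [label, value] pairs. Each value is attached to
# the most recent key; whitespace-only values are skipped.
sub fold_pairs {
    my @entries = @_;
    my @pairs;
    for my $entry (@entries) {
        if ( exists $entry->{key} ) {
            push @pairs, [ trim( $entry->{key} ), '' ];
        }
        elsif ( @pairs && exists $entry->{val} ) {
            my $text = trim( $entry->{val} );
            next unless length $text;
            my $last = $pairs[-1];
            $last->[1] = length $last->[1] ? $last->[1] . '; ' . $text : $text;
        }
    }
    return @pairs;
}

# A few entries copied from the dump above
my @sample = (
    { key => " Itinerario ", raw => "<b> Itinerario </b>\n" },
    { val => " Danimarca, fiordi norvegesi, Germania" },
    { val => " " },
    { key => "Data partenza", raw => "<b>Data partenza</b>\n" },
    { val => " 17\xA0giugno\xA02012 " },
);

print "$_->[0]: $_->[1]\n" for fold_pairs(@sample);
```

With the real result you would presumably pass in `@{ $ret->{info}[0]{span}[0]{span} }`, going by the structure shown in the dump.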

    I wouldn't be surprised if tobyink stops by with a Web::Magic example :)
