Re^3: Parsing HTML

Perl6::Slurp is a regular Perl 5 module; it just emulates Perl 6's slurp builtin. Learning a bit of XPath is always useful; look at Zvon's tutorial, for example.
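If you'd rather not install Perl6::Slurp, roughly the same thing can be done with core Perl 5. This is only a sketch: read_utf8_file is a made-up helper name, and the throwaway temp file is just there so the snippet runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Plain Perl 5 stand-in for: my $page = slurp '<:utf8', $file;
# read_utf8_file is a hypothetical helper name, not part of any module.
sub read_utf8_file {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    local $/;    # undefine the record separator so one read returns the whole file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Demo on a throwaway file so the sketch is self-contained.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh, ':encoding(UTF-8)';
print {$tmp_fh} "Carta d'identit\x{E0}\nseconda riga\n";
close $tmp_fh;

my $page = read_utf8_file($tmp_name);
binmode STDOUT, ':encoding(UTF-8)';
print $page;
```

The `local $/` line is the whole trick: with the input record separator undefined, a single read from the filehandle returns everything.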

As for the rest, you need to look at the source of the page, see what information you need, and work out which XPath queries will get it for you. The cruise info, for example, is not in the p.itinerari-info; it's in the div.sx element. From that element you can get the title and price, then go down some more and get the various other fields.

Here is an example. It does not output the 'Includes' field; you'll have to do that one yourself:

#!/usr/bin/perl -w

use strict;
use warnings;

use LWP::Simple;
use Perl6::Slurp;             # to load the page from the cache
use HTML::TreeBuilder::XPath; # easier to use than bare HTML::TreeBuilder

# during development we don't want to hit the real page, 
# so we'll have a -c switch to use a cache 
use Getopt::Std;
my %opt;
getopts( 'c', \%opt); # if called with -c then $opt{c} is true

my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
my $cache= 'capitali_nord_europa-201206.html';

# this will get rid of the bad characters you were seeing in the output
binmode( STDOUT, ':utf8');

if( ! $opt{c}) { getstore( $base.$url, $cache); } # only get the live page without -c
my $page= slurp '<:utf8', $cache;

my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

my @trips= $p->findnodes( '//div[@class="info-cruise"]');
foreach my $trip (@trips){
   my $title = $trip->findvalue( './/div[@class="sx"]/h3');
   print "$title\n";

   my $price = $trip->findvalue( './/span[@class="new-price"]');
   print "price: $price\n";

   # this is very brittle, but it gives you a base on which you can build
   foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]'))
     { 
       my $info_title= $info->findnodes( './b')->[0];
       next unless $info_title;   # skip spans that have no <b> label
       print $info_title->as_text();
       $info_title->detach;
       my $info_value= $info->as_text;
       print ": ", $info_value, "\n";
    }
  print "\n";
       
}

Re^4: Parsing HTML
by Anonymous Monk on Jun 08, 2012 at 04:57 UTC

:D I might approach that like this (look ma, no slurping)

$ lwp-download http://www.costacrociere.it/it/lista_crociere/capitali_nord_europa-201206.html
Saving to 'capitali_nord_europa-201206.html'...
134 KB received in 1 seconds (134 KB/sec)

$ perl htmltreexpather.pl capitali_nord_europa-201206.html _tag p | ack Copenhagen -C3 | head

//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------
HTML::Element=HASH(0xb91ba4)    0.1.0.8.1.0.1.1.1.0.0
Itinerario Danimarca, fiordi norvegesi, Germania Data partenza 17 giugno 2012 Nave Costa Fortuna N.ro giorni crociera
7 Porto di partenza Copenhagen Documenti di viaggio Passaporto o Carta d'identità valida per l'espatrio Possono essere
disponibili le seguenti tariffe
/html/body/form/div/div[2]/div/div[2]/div/div[2]/div/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p[@class='itinerari-info']
--
//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------

Then plug stuff into Web::Scraper; it's like XML::Rules.

#!/usr/bin/perl --
use strict; use warnings;
use Data::Dump;
use URI;
use Web::Scraper;


my $soy = scraper {
## only get leafs/twigs with this @class
## store the results into  { info => \@info }
    process '.info-cruise' => 'info[]' => scraper {
        process './/div[@class="sx"]/h3' => 'title'  => 'TEXT';
        process '.new-price'             => 'price'  => 'TEXT';
        process '.itinerari-info'        => 'span[]' => scraper {

#~             process '//span' => 'span[]' => 'RAW'; ## this
            process '//span/b | //span/child::text()' => 'span[]' => sub {
                my $ishtml   = $_[0]->isa('HTML::Element');
                my $keyOrVal = $ishtml ? 'key' : 'val';
                my %foo      = ( $keyOrVal => $_[0]->getValue );
                $foo{raw} = $_[0]->as_XML if $ishtml;
                return \%foo;
            };
        };
    };
};

## NOTE Web::Scraper wants URI objects
my $url = URI->new('file:capitali_nord_europa-201206.html');
my $base='http://www.costacrociere.it';
my $ret = $soy->scrape( $url , $base );

#~ dd $ret;
dd $ret->{info}->[0];

__END__
{
  price => "\x{20AC} 510,00",
  span  => [
             {
               span => [
                 { key => " Itinerario ", raw => "<b> Itinerario </b>\n" },
                 { val => " Danimarca, fiordi norvegesi, Germania" },
                 { val => " " },
                 { key => "Data partenza", raw => "<b>Data partenza</b>\n" },
                 { val => " 17\xA0giugno\xA02012 " },
                 { key => " Nave ", raw => "<b> Nave </b>\n" },
                 { val => " Costa Fortuna" },
                 {
                   key => " N.ro giorni crociera \xA0 ",
                   raw => "<b> N.ro giorni crociera \xA0 </b>\n",
                 },
                 { val => " 7" },
                 { key => " Porto di partenza ", raw => "<b> Porto di partenza </b>\n" },
                 { val => " Copenhagen" },
                 {
                   key => " Documenti di viaggio ",
                   raw => "<b> <a href=\"http://www.costacrociere.it/B2C/I/Before_you_go/documentation/travel.htm\" target=\"_blank\">Documenti di viaggio</a> </b>\n",
                 },
                 {
                   val => " Passaporto\xA0o\xA0Carta d'identit\xE0 valida per l'espatrio",
                 },
                 { val => " Possono essere disponibili le seguenti tariffe " },
               ],
             },
           ],
  title => "Le terre dei vichinghi",
}
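The flat key/val list in that dump isn't very convenient to work with, so here is a rough sketch of folding it into a hash with plain Perl. The sample entries are copied from the dump above (minus the raw fields); tidy, %field and $current are names I made up for the illustration, and appending every val to the most recent key is an assumption that may not suit every entry on the real page.

```perl
use strict;
use warnings;

# A few entries copied from the dump above (whitespace and \xA0 included).
my @span = (
    { key => " Itinerario " },
    { val => " Danimarca, fiordi norvegesi, Germania" },
    { val => " " },
    { key => "Data partenza" },
    { val => " 17\xA0giugno\xA02012 " },
    { key => " Nave " },
    { val => " Costa Fortuna" },
    { key => " N.ro giorni crociera \xA0 " },
    { val => " 7" },
);

# Replace no-break spaces with plain ones, then trim and squeeze whitespace.
sub tidy {
    my ($text) = @_;
    $text =~ tr/\xA0/ /;
    $text =~ s/^\s+|\s+$//g;
    $text =~ s/\s+/ /g;
    return $text;
}

my ( %field, $current );
for my $item (@span) {
    if ( exists $item->{key} ) {
        # a key entry starts a new field
        $current = tidy( $item->{key} );
        $field{$current} = '';
        next;
    }
    # a val entry is appended to the most recent key; blank ones are skipped
    my $val = tidy( $item->{val} );
    next if !defined $current || $val eq '';
    $field{$current} .= ( length $field{$current} ? ' ' : '' ) . $val;
}

print "$_: $field{$_}\n" for sort keys %field;
```

With the real data you would loop over $ret->{info} and run this over each trip's inner span list instead of the hard-coded @span.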

I wouldn't be surprised if tobyink stops by with a Web::Magic example :)
