in reply to Parsing HTML
Below is a solution. It uses HTML::TreeBuilder::XPath, which (like Corion) I find easier to use than "bare" HTML::TreeBuilder. I also added a -c option so that, while you work on the code, you don't have to keep hitting the live page: reading from a local cache is more polite to the site and much faster for you.
Also, the problems you had with weird characters can be solved by telling the code that you want to output UTF-8, using binmode( STDOUT, ':utf8');.
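To see what the layer changes before reading the full script, here is a minimal sketch using only core Perl. The sample string and the in-memory filehandles are my own illustration, not part of your data; the point is only that a non-ASCII character is written as a different byte sequence once ':utf8' is in place.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text: "Citta" with a grave accent on the final a.
# It has 5 characters, the last of which is non-ASCII (U+00E0).
my $text = "Citt\x{E0}";

# Write the same string to two in-memory filehandles:
# one with no encoding layer, one with ':utf8' (as binmode does for STDOUT).
my ( $raw, $utf8 ) = ( '', '' );

open my $fh_raw, '>', \$raw or die "cannot open in-memory handle: $!";
print {$fh_raw} $text;
close $fh_raw;

open my $fh_utf8, '>:utf8', \$utf8 or die "cannot open in-memory handle: $!";
print {$fh_utf8} $text;
close $fh_utf8;

# Compare how many bytes each handle actually received.
printf "characters: %d, bytes without layer: %d, bytes with :utf8: %d\n",
    length($text), length($raw), length($utf8);
```

Without the layer the accented letter goes out as a single Latin-1 byte, which a UTF-8 terminal cannot display properly; with ':utf8' it is encoded as two bytes, so the second count is one higher.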
#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;
use Perl6::Slurp;             # to load the page from the cache
use HTML::TreeBuilder::XPath; # easier to use than bare HTML::TreeBuilder

# during development we don't want to hit the real page,
# so we'll have a -c switch to use a cache
use Getopt::Std;
my %opt;
getopts( 'c', \%opt); # if called with -c then $opt{c} is true

my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
my $cache= 'capitali_nord_europa-201206.html';

# this will get rid of the bad characters you were seeing in the output
binmode( STDOUT, ':utf8');

if( ! $opt{c}) { getstore( $base.$url, $cache); } # only get the live page without -c

my $page= slurp '<:utf8', $cache;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
my @trips= $p->findnodes( '//p[@class="itinerari-info"]');

foreach my $trip (@trips){
    # you may want to do something more complex here, but for now it will do
    print "crociera: ", $trip->as_text, "\n";
}
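If you want to experiment with the XPath expression without touching the site or the cache at all, a rough sketch like the one below may help. The HTML snippet is made up for illustration (only the class name itinerari-info comes from the real page), so the real markup will certainly be richer than this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical sample markup: two paragraphs carry the class we are after,
# one does not and should be skipped by the XPath expression.
my $html = <<'HTML';
<html><body>
<p class="itinerari-info">Copenhagen, Stockholm, Tallinn</p>
<p class="other">not a trip</p>
<p class="itinerari-info">Oslo, Helsinki</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Same expression as in the script above; collect the text of each match.
my @trips = map { $_->as_text }
    $tree->findnodes('//p[@class="itinerari-info"]');

# HTML::TreeBuilder trees should be deleted explicitly when no longer needed.
$tree->delete;

print "crociera: $_\n" for @trips;
```

Note that @class="itinerari-info" matches only when the attribute is exactly that string; if the real page ever puts several classes on the paragraph, contains(@class, "itinerari-info") would be the more forgiving test.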