
Re^2: Parsing HTML

by marcoss (Novice)
on Jun 12, 2012 at 11:05 UTC (#975750)


in reply to Re: Parsing HTML
in thread Parsing HTML

Hi mirod, before I go ahead... THANK YOU!! XPath opened a brand new world of possibilities for me. I took a look at Zvon's page and also this page, which is a little more beginner-friendly. I was able to use your code and add a few things for the other pieces of information I needed to extract. Right now it's working just fine, but there's one detail I haven't been able to fix (basically because the last part of the code you wrote is almost hieroglyphics to me... xD). Anyway, this is the code:

#!/usr/bin/perl -w
use LWP::Simple;
use HTML::TreeBuilder::XPath;
use Data::Dumper;
use strict;

my $debug=1;
my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201207-2.html';

my $page = get($base.$url) or die $!;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
binmode( STDOUT, ':utf8');

my @trips= $p->findnodes( '//div[@class="info-cruise"]');
foreach my $trip (@trips){
    my $title = $trip->findvalue( './/div[@class="sx"]/h3');
    print "Trip name: $title\n";
    my $price = $trip->findvalue( './/span[@class="new-price"]');
    print "price: $price\n";
    my $includes = $trip->findvalue('.//p[@class="info-price"]/span[6]'); # I added this line
    print "Includes: $includes\n";
    foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')){
        my $info_title= $info->findnodes( './b')->[0];
        print $info_title->as_text();
        $info_title->detach;
        my $info_value= $info->as_text;
        print ":", $info_value, "\n";
    }
    my $pic = $trip->findvalue('.//img[@class="image_map"]/@src'); # I added this line.
    print "Picture: $base$pic\n";
    print "\n";
}

And this is the output (just one of the results; showing all of them isn't necessary):

Trip name: Fiordi norvegesi e grandi città del Baltico
price: € 2.615,00
Includes: Crociera + Volo
Itinerario : Danimarca, Estonia, Russia, Finlandia, Svezia, Norvegia
Data partenza: 7 luglio 2012
Nave : Costa Luminosa
N.ro giorni crociera   : 14
Porto di partenza : Copenhagen
Documenti di viaggio : Passaporto
Picture: http://www.costacrociere.it/B2C/Images/ItineraryV4/CPH11040__it-IT.gif#CPH11040

Yes, I know what you're thinking... "That's my code... this guy didn't do anything", and you're quite right, I just added those 2 lines. But the good thing is I'm learning!! Using only TreeBuilder was giving me a lot of headaches. OK, so about the detail I mentioned: as you can see in the output, certain pieces of information have an extra space at the beginning. I've tried chomp and different combinations of print and \n, but nothing does the trick. Where should I look? Right now I'm doing some research to understand what every line of the second foreach loop does. If you can give me some directions on this I will greatly appreciate it (again)!!

Cheers!!

marcos


Re^3: Parsing HTML
by Anonymous Monk on Jun 12, 2012 at 11:34 UTC
Re^3: Parsing HTML
by mirod (Canon) on Jun 12, 2012 at 11:56 UTC

    It's a bit of a pain to figure out where to look, but the as_text method comes from HTML::Element. If you look at the docs, you'll see that in addition to as_text there is also an as_trimmed_text method. It looks like you could use it.
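To make the difference concrete, here is a minimal sketch comparing the two methods on a one-line fragment. The markup, the class name "info" and the helper label_and_values are assumptions made up for illustration; the real page may look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup modelled on the thread; the class "info" is an assumption.
my $html = '<p class="itinerari-info"><span class="info"><b>Nave</b> Costa Luminosa</span></p>';

# Hypothetical helper: returns (label, raw value, trimmed value)
# for the first span with class "info".
sub label_and_values {
    my ($content) = @_;
    my $tree   = HTML::TreeBuilder::XPath->new_from_content($content);
    my ($span) = $tree->findnodes('//span[@class="info"]');
    my $bold   = $span->findnodes('./b')->[0];

    my $label = $bold->as_trimmed_text;
    $bold->detach;                        # take the <b> out so only the value is left

    my $raw     = $span->as_text;         # whitespace around the value is kept as parsed
    my $trimmed = $span->as_trimmed_text; # leading/trailing whitespace removed

    $tree->delete;                        # free the tree once we are done
    return ( $label, $raw, $trimmed );
}

my ( $label, $raw, $trimmed ) = label_and_values($html);
print "as_text:         $label:[$raw]\n";
print "as_trimmed_text: $label:[$trimmed]\n";
```

The brackets make any leftover whitespace visible. The HTML::Element docs also describe an extra_chars option for as_trimmed_text, which may be worth a look if the stray characters turn out to be non-breaking spaces rather than plain spaces.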

    The second foreach loop comes from looking at the HTML source for the page. The data you want is in the p with a class of itinerari-info, in consecutive spans. Some of the spans can be discarded: the ones with classes of note and strike. The XPath expression returns only the remaining spans. Each span includes a b element with the title, which I get in $info_title, display, then detach to get it out of the way. The rest of the span is the information itself.
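The walk-through above can be sketched on a small stand-alone fragment. The HTML here is invented to mirror that description (the class "info" and the helper extract_info are assumptions, not taken from the real page), so treat it as an illustration of the loop rather than of the site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical fragment: two useful spans plus a "note" and a "strike" span.
my $html = <<'HTML';
<div class="info-cruise">
  <p class="itinerari-info">
    <span class="info"><b>Nave</b> Costa Luminosa</span>
    <span class="note">(*) nota</span>
    <span class="info"><b>Porto di partenza</b> Copenhagen</span>
    <span class="strike">30 giugno 2012</span>
  </p>
</div>
HTML

# Hypothetical helper: returns one "label: value" string per span that is kept.
sub extract_info {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my @lines;

    # Same predicate as in the thread: skip spans classed "note" or "strike".
    foreach my $info ( $tree->findnodes(
        '//p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]') ) {

        my $info_title = $info->findnodes('./b')->[0];  # the <b> holds the label
        my $label      = $info_title->as_trimmed_text;
        $info_title->detach;                            # remove it from the span...
        my $value      = $info->as_trimmed_text;        # ...so only the value remains

        push @lines, "$label: $value";
    }

    $tree->delete;
    return @lines;
}

print "$_\n" for extract_info($html);
```

Collecting the strings in a list instead of printing inside the loop is only a convenience for reuse; the selection and detach steps are the same as in the original loop.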

    Does this help?

      Ok, this clarifies a lot. as_trimmed_text worked just fine. I tried commenting out the detach line and, like you said, it prints the title twice. But it seems you spotted something I completely overlooked. The strike class is only used for dates that have been removed, which is why I didn't see it before... but when I run the script, the date still shows up. Is it a matter of using an if statement? It looks to me like the foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')) should already take care of it. Hmm, I'm thinking of unless, but those are only assumptions... I'll let you know if I fix this, even though eventually I'll probably be crying out for help xD. Anyway, thank you very much for your time and your patience.

      cheers!

      marcos
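One possible explanation, offered only as a guess since the page source is not shown in the thread: the predicate decides which spans findnodes returns, but it does not strip anything from inside a span that is returned. If the strike span happened to be nested inside a span that does match, its text would still come out through as_text. The sketch below assumes exactly that nesting; the markup, the class "info" and the helper value_without_strike are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Assumed markup: the struck-out date sits INSIDE a span that matches the predicate.
my $html = <<'HTML';
<p class="itinerari-info">
  <span class="info"><b>Data partenza</b> <span class="strike">30 giugno 2012</span> 7 luglio 2012</span>
</p>
HTML

# Hypothetical helper: drop nested "strike" spans, then return the trimmed text.
sub value_without_strike {
    my ($span) = @_;
    $_->delete for $span->findnodes('.//span[@class="strike"]');
    return $span->as_trimmed_text;
}

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Same predicate as in the thread; only the outer span is returned here.
my ($info) = $tree->findnodes(
    '//p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]');

my $info_title = $info->findnodes('./b')->[0];
my $label      = $info_title->as_trimmed_text;
$info_title->detach;

my $before = $info->as_trimmed_text;        # still contains the struck-out date
my $after  = value_without_strike($info);   # nested strike span removed first

print "$label (before): $before\n";
print "$label (after):  $after\n";

$tree->delete;
```

If the real page is structured differently, this will not apply; dumping one matching span with as_HTML is a quick way to check where the struck-out text actually lives.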
