
Parsing HTML

by marcoss (Novice)
on Jun 07, 2012 at 09:59 UTC ( #974906=perlquestion )

marcoss has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to extract certain pieces of information from a website. There's a <p> tag with 5 <span> tags inside it. One of the spans has a class, so no problem, but the other 4 are just <span>info</span>. This is how the code looks in the website. I'm using Firebug.

<p class="itinerari-info">
  <span> <b> Itinerario </b> Danimarca, fiordi norvegesi, Germania </span>
  <span class="DepartureDateTitle"> <b>Data partenza</b> 17&nbsp;giugno&nbsp;2012 </span>
  <span> <b> Nave </b> Costa Fortuna </span>
  <span> <b> giorni crociera &nbsp; </b> 7 </span>
  <span> <b> Porto di partenza </b> Copenhagen </span>
</p>

My Perl knowledge is limited to the first nine chapters of "Learning Perl" (and that doesn't mean I understand everything, especially subroutines). I don't have any other programming skills.

This is the code I have so far:

#!/usr/bin/perl -w
use LWP::Simple;
use HTML::TreeBuilder;
use strict;

my $base='';
my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
my $page = get($base.$url) or die $!;
my $p = HTML::TreeBuilder->new_from_content( $page );
my @trips= $p->look_down(_tag=>'p',class=>'itinerari-info')->as_text;
foreach my $trip (@trips){
    print $trip;
}

And this is the output:

Itinerario  Danimarca, fiordi norvegesi, Germania  Data partenza 17&#65533;giugno&#65533;2012   Nave  Costa Fortuna giorni crociera &#65533;  7  Porto di partenza  Copenhagen Documenti di viaggio  Passaporto&#65533;o&#65533;Carta d'identit&#65533; valida per l'espatrio Possono essere disponibili le seguenti tariffe

So, this outputs all the information in one string and also with some strange characters, but I can use a regex to fix that. The main issue is that I need every string to be independent from each other (As if I wanted to add a title prior to each information itself). I see that the spans have <b>whatever</b> tags, but I just can't seem to understand how I could use those to do what I want. Like I said, my experience is close to zero. I've been trying different stuff with arrays and hashes and right now I just want to burn the computer. If the Monks could help me I would greatly appreciate it. Thank you so much!

Replies are listed 'Best First'.
Re: Parsing HTML
by Corion (Patriarch) on Jun 07, 2012 at 10:13 UTC

    I would use XPath or CSS expressions, and look at HTML::TreeBuilder::XPath to run the expressions against the HTML. Or rather, I would use App::scrape, which puts that approach into a module, or Web::Scraper and Web::Magic.

    With XPath expressions, you can specify the elements you want like paths to files in a directory. In your case, it looks like the following XPath expressions would work:

    # Each voyage
    //p[@class="itinerari-info"]

    # Itinerary within a voyage
    ./span[1]

    # Departure date
    ./span[2]

    # Ship
    ./span[3]

    ...

    Depending on whether your target page only lists one such itinerary, you can roll the XPath expressions into one expression, instead of using them relative to the voyage nodes:

    # Itinerary
    //p[@class="itinerari-info"]/span[1]

    ...

    You can test out these queries in Firebug (I think), or with the scrape-ff tool in WWW::Mechanize::Firefox, or with the scrape tool in App::scrape. Likely, Mojolicious and the modules mentioned before also contain tools for easy command-line testing of XPath expressions against URLs.
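As a rough sketch of how the relative XPath expressions above could be run from Perl, the following uses HTML::TreeBuilder::XPath against a trimmed copy of the markup from the question. The variable names and the choice to collect results in @voyages are assumptions made for illustration; they do not come from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Markup copied from the question, reduced to three spans.
my $html = <<'HTML';
<p class="itinerari-info">
  <span> <b> Itinerario </b> Danimarca, fiordi norvegesi, Germania </span>
  <span class="DepartureDateTitle"> <b>Data partenza</b> 17 giugno 2012 </span>
  <span> <b> Nave </b> Costa Fortuna </span>
</p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Find each voyage first, then query relative to it.
my @voyages;    # hypothetical container: one hash per voyage
for my $voyage ( $tree->findnodes('//p[@class="itinerari-info"]') ) {
    my ($itinerary) = $voyage->findnodes('./span[1]');
    my ($ship)      = $voyage->findnodes('./span[3]');
    push @voyages, {
        itinerary => $itinerary->as_trimmed_text,
        ship      => $ship->as_trimmed_text,
    };
}

print "$_->{itinerary}\n$_->{ship}\n" for @voyages;
```

Each printed string still contains the label from the b element; separating label and value is a further step.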

Re: Parsing HTML
by mirod (Canon) on Jun 07, 2012 at 10:37 UTC

    Below is a solution. It uses HTML::TreeBuilder::XPath, which (like Corion) I find easier to use than "bare" HTML::TreeBuilder. I also added an option so that, while working on the code, you don't have to keep hitting the live page. It is more polite, and much faster for you, to use a cache.

    Also, the problems you had with weird characters can be solved by telling the code that you want to output UTF-8, using binmode( STDOUT, ':utf8');.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use LWP::Simple;
    use Perl6::Slurp;             # to load the page from the cache
    use HTML::TreeBuilder::XPath; # easier to use than bare HTML::TreeBuilder

    # during development we don't want to hit the real page,
    # so we'll have a -c switch to use a cache
    use Getopt::Std;
    my %opt;
    getopts( 'c', \%opt); # if called with -c then $opt{c} is true

    my $base='';
    my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
    my $cache= 'capitali_nord_europa-201206.html';

    # this will get rid of the bad characters you were seeing in the output
    binmode( STDOUT, ':utf8');

    if( ! $opt{c}) { getstore( $base.$url, $cache); } # only get the live page without -c

    my $page= slurp '<:utf8', $cache;
    my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

    my @trips= $p->findnodes( '//p[@class="itinerari-info"]');
    foreach my $trip (@trips){
        # you may want to do something more complex here, but for now it will do
        print "crociera: ", $trip->as_text, "\n";
    }

      Hi mirod, before I go ahead... THANK YOU! XPath opened a brand new world of possibilities for me. I took a look at Zvon's page and also this page, which is a little bit more for beginners. I was able to use your code and also add a few things for the other pieces of information that I needed to extract. Right now it's working just fine, but there's a detail that I haven't been able to modify (basically because the last part of the code you wrote is almost hieroglyphics to me... xD). Anyway, this is the code:

      #!/usr/bin/perl -w
      use LWP::Simple;
      use HTML::TreeBuilder::XPath;
      use Data::Dumper;
      use strict;

      my $debug=1;
      my $base='';
      my $url='/it/lista_crociere/capitali_nord_europa-201207-2.html';
      my $page = get($base.$url) or die $!;
      my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
      binmode( STDOUT, ':utf8');

      my @trips= $p->findnodes( '//div[@class="info-cruise"]');
      foreach my $trip (@trips){
          my $title = $trip->findvalue( './/div[@class="sx"]/h3');
          print "Trip name: $title\n";
          my $price = $trip->findvalue( './/span[@class="new-price"]');
          print "price: $price\n";
          my $includes = $trip->findvalue('.//p[@class="info-price"]/span[6]'); # I added this line
          print "Includes: $includes\n";
          foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')){
              my $info_title= $info->findnodes( './b')->[0];
              print $info_title->as_text();
              $info_title->detach;
              my $info_value= $info->as_text;
              print ":", $info_value, "\n";
          }
          my $pic = $trip->findvalue('.//img[@class="image_map"]/@src'); # I added this line.
          print "Picture: $base$pic\n";
          print "\n";
      }

      And this is the output, well... just one of the results, all of it is not necessary

      Trip name: Fiordi norvegesi e grandi città del Baltico
      price: € 2.615,00
      Includes: Crociera + Volo
      Itinerario : Danimarca, Estonia, Russia, Finlandia, Svezia, Norvegia
      Data partenza: 7 luglio 2012
      Nave : Costa Luminosa
      giorni crociera   : 14
      Porto di partenza : Copenhagen
      Documenti di viaggio : Passaporto
      Picture: +it-IT.gif#CPH11040

      Yes, I know what you're thinking... "That's my code... this guy didn't do anything", and you're quite right, I just added those 2 lines. But the good thing is I'm learning! Using only TreeBuilder was giving me a lot of headaches. OK, so the detail I was telling you about: as you can see in the output, certain pieces of information have an extra space at the beginning. I've been trying chomp and different print and \n variations, but nothing does the trick. Where should I look? Right now I'm doing some research to understand what every line of the second foreach loop does. If you can give me some directions on this I will greatly appreciate it (again)!



        It's a bit of a pain to figure out where to look, but the as_text method comes from HTML::Element. If you look at the docs, you'll see that in addition to as_text there is also an as_trimmed_text method. It looks like you could use it.

        The second foreach loop comes from looking at the HTML source for the page. The data you want is in the p with a class of itinerari-info, in consecutive span elements. Some of the spans can be discarded: the ones with classes of note and strike. The XPath expression returns the remaining spans. Each span includes a b element with the title, which I get in $info_title, display, then detach to get it out of the way. The rest of the span is the information itself.

        Does this help?
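A minimal sketch of the label/value separation described above, using as_trimmed_text instead of as_text so that leading and trailing whitespace is dropped. The sample markup is abbreviated from the question, the note/strike filter is left out for brevity, and @fields is a hypothetical name chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Two spans in the shape shown in the question.
my $html = <<'HTML';
<p class="itinerari-info">
  <span> <b> Nave </b> Costa Fortuna </span>
  <span> <b> Porto di partenza </b> Copenhagen </span>
</p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @fields;    # list of [label, value] pairs, in page order
for my $span ( $tree->findnodes('//p[@class="itinerari-info"]/span') ) {
    my ($label_node) = $span->findnodes('./b');
    next unless $label_node;                  # skip spans without a <b> label
    my $label = $label_node->as_trimmed_text; # whitespace-trimmed title
    $label_node->detach;                      # remove the <b> so only the value is left
    my $value = $span->as_trimmed_text;
    push @fields, [ $label, $value ];
}

print "$_->[0]: $_->[1]\n" for @fields;
```

Keeping the pairs in an array rather than printing immediately makes it easier to pass them on later, for example to a database insert.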

      Hi mirod, thank you so much for the solution provided! I had to remove some lines because (from what I understand) you're using Perl 6 and my version is v5.10.1. I'm not familiar with HTML::TreeBuilder::XPath and the findnodes method, so I've been doing some research. I want to see if, by using your script, I can obtain not only all of the trips with all their details, but all of the trips with the details separately. For example, this is the output I need for each trip:

      Trip Name: Nordic seas
      Price: 500
      Itinerary: Denmark, Oslo, Helsinki
      Departure date: 12/04/2012
      Ship Name: Costa Magica
      Includes: Cruise
      Departure port: Copenhagen
      Duration: 7 days

      In this way I can later take all those individual pieces of information to a database. Like I said, I'm new to Perl, and all I do is trial & error, so until I have more time to study during the summer I will appreciate all the help you guys at PerlMonks can provide me. Thanks again for all the great work!!!

        Perl6::Slurp is a regular Perl 5 module, it just emulates Perl 6's slurp builtin. Learning a bit of XPath is always useful, look at Zvon's tutorial for example.

        As for the rest, you need to look at the source of the page, see what information you need and what XPath queries will get it for you. The cruise info is not for example in the p.itinerari-info, it's in the element. From that element you can get the title and price, then go down some more and get the various other fields.


        Here is an example, which does not output the 'Includes' field; you'll have to do that one yourself:

        #!/usr/bin/perl -w

        use strict;
        use warnings;

        use LWP::Simple;
        use Perl6::Slurp;             # to load the page from the cache
        use HTML::TreeBuilder::XPath; # easier to use than bare HTML::TreeBuilder

        # during development we don't want to hit the real page,
        # so we'll have a -c switch to use a cache
        use Getopt::Std;
        my %opt;
        getopts( 'c', \%opt); # if called with -c then $opt{c} is true

        my $base='';
        my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
        my $cache= 'capitali_nord_europa-201206.html';

        # this will get rid of the bad characters you were seeing in the output
        binmode( STDOUT, ':utf8');

        if( ! $opt{c}) { getstore( $base.$url, $cache); } # only get the live page without -c

        my $page= slurp '<:utf8', $cache;
        my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

        my @trips= $p->findnodes( '//div[@class="info-cruise"]');
        foreach my $trip (@trips){
            my $title = $trip->findvalue( './/div[@class="sx"]/h3');
            print "$title\n";
            my $price = $trip->findvalue( './/span[@class="new-price"]');
            print "price: $price\n";
            # this is very brittle, but it gives you a base on which you can build
            foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')) {
                my $info_title= $info->findnodes( './b')->[0];
                print $info_title->as_text();
                $info_title->detach;
                my $info_value= $info->as_text;
                print ": ", $info_value, "\n";
            }
            print "\n";
        }

Re: Parsing HTML
by ww (Archbishop) on Jun 07, 2012 at 12:16 UTC
    This is just an "I wonder..." observation about something that's very possibly not a factor, but the .html you show is peculiar, to say the least, and if the real markup isn't what Firebug is showing you, that might bear on your attempt to parse it.

    <span>... </span> tags without attributes amount to no-ops.

    I don't use Firebug so I have no solid reason to suspect that it's pruning tags for some reason... but, to me (YMMV), that makes at least as much sense as .html burdened with no-ops that have to ride the wire along with the substance of the page. It might be well to look at the source using view source and view generated source.

    OTOH, maybe the code generating the page was written -- with very limited knowledge of .html -- by the DBA responsible for the data. That supposition arises from the use of &nbsp; in the date (six keystrokes where one would have been sufficient -- for no good reason I can discern).

    PS: If you want each trip on its own, separate line, you need merely add a newline to the print $trip; at line 12 -- e.g. print "$trip \n"; or print $trip . "\n";

    PPS: This puzzled me enough to make me actually look at the page in question... and it does, indeed, appear to have code very similar to what you've shown. There are a couple support files that were inaccessible, when I looked, but imputing any issue to them is merely speculative and probably a non-starter.

      "<span>... </span> tags without attributes amount to no-ops."

      Not in this case - the page in question is using them to add line breaks within paragraphs. Something along the lines of:

      p.itinerari-info span { display: block }

      Hi, the source code is exactly the same. I usually work with Firebug and the source code simultaneously, but Firebug is cool. Putting newlines wouldn't be a solution, I think, because the output of the current script is cruise trip names with all the details, and what I need is an output that gets me the details separately (within the cruise trip).

        For a start, you might want to use split to break up $trip into its elements. But your recent sample-output-desired post involves additional data (for example, "Trip Name"...) which I ignored in checking the original .html. Therefore (among other reasons), I'm not sure that
        is an appropriate pattern for split.

        Even if so, you'll still have to hard-code some punctuation (such as the colons in the subheads) and, perhaps, the newlines.

        The previous newline suggestion was based on the output you showed with multiple itineraries as a single line.
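A rough sketch of the split idea, using core Perl only. The label list and the pattern built from it are assumptions made for illustration; this is not the pattern referred to above, and it only works when every label is known in advance and never appears inside a value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Flat text of the kind as_text returns for one trip (shortened sample).
my $trip = 'Itinerario  Danimarca, fiordi norvegesi, Germania  Nave  Costa Fortuna  Porto di partenza  Copenhagen';

# Hypothetical list of labels expected in the text.
my @labels   = ( 'Itinerario', 'Nave', 'Porto di partenza' );
my $label_re = join '|', map { quotemeta } @labels;

# The capturing parentheses make split return the labels as well as the
# text between them, so the result alternates label, value, label, value.
my @parts = split /\s*($label_re)\s*/, $trip;
shift @parts;          # drop the empty field before the first label
my %info = @parts;

print "$_: $info{$_}\n" for @labels;
```

Parsing the individual span elements is likely to be more robust than splitting flat text, since it does not depend on a hard-coded label list.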

Node Type: perlquestion [id://974906]
Approved by Ratazong
Front-paged by toolic