PerlMonks  

Re^4: Parsing HTML

by Anonymous Monk
on Jun 08, 2012 at 04:57 UTC (#975079)


in reply to Re^3: Parsing HTML
in thread Parsing HTML

:D I might approach that like this (look ma, no slurping):

$ lwp-download http://www.costacrociere.it/it/lista_crociere/capitali_nord_europa-201206.html
Saving to 'capitali_nord_europa-201206.html'...
134 KB received in 1 seconds (134 KB/sec)

$ perl htmltreexpather.pl capitali_nord_europa-201206.html _tag p | ack Copenhagen -C3 | head

//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------
HTML::Element=HASH(0xb91ba4) 0.1.0.8.1.0.1.1.1.0.0
 Itinerario Danimarca, fiordi norvegesi, Germania Data partenza 17 giugno 2012 Nave Costa Fortuna N.ro giorni crociera 7 Porto di partenza Copenhagen Documenti di viaggio Passaporto o Carta d'identità valida per l'espatrio Possono essere disponibili le seguenti tariffe
/html/body/form/div/div[2]/div/div[2]/div/div[2]/div/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p[@class='itinerari-info']
--
//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------
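htmltreexpather.pl itself isn't shown in this post, so here is a rough stand-in for the general idea: walk up from each matching element and print a positional path. This is only a sketch under my own assumptions. The helper name xpath_of and the sample markup are made up, and the real script evidently prints more (id/class based paths, the text content). It only uses HTML::TreeBuilder calls I'm sure of (new_from_content, look_down, parent, content_list, tag).

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder;
use Scalar::Util qw(refaddr);

# Hypothetical stand-in, NOT the real htmltreexpather.pl:
# build a positional path such as /html/body/div/p[2] for one element.
sub xpath_of {
    my ($el) = @_;
    my @steps;
    for ( my $node = $el ; defined $node ; $node = $node->parent ) {
        my $tag  = $node->tag;
        my $step = $tag;
        if ( defined( my $parent = $node->parent ) ) {
            # siblings with the same tag decide whether an index is needed
            my @same = grep { ref $_ and $_->tag eq $tag } $parent->content_list;
            if ( @same > 1 ) {
                my ($i) = grep { refaddr( $same[$_] ) == refaddr($node) } 0 .. $#same;
                $step .= '[' . ( $i + 1 ) . ']';
            }
        }
        unshift @steps, $step;
    }
    return '/' . join '/', @steps;
}

# made-up markup, just for the demo
my $tree = HTML::TreeBuilder->new_from_content(
    '<div id="a"><p>uno</p><p class="note">Porto di partenza Copenhagen</p></div>'
);

# same idea as passing "_tag p" on the command line
my @paths = map { xpath_of($_) } $tree->look_down( _tag => 'p' );
print "$_\n" for @paths;

$tree->delete;
```

Pipe that through ack the same way as above and you get candidate paths next to the text you searched for.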

Then plug those paths and class names into Web::Scraper; it's like XML::Rules in that you attach a rule to each path and get a data structure back
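If the rule idea is new, a stripped-down sketch may help before the real thing. The HTML here is invented for the demo (it is not the costacrociere markup), and I'm assuming a Web::Scraper recent enough to accept a plain HTML string in scrape(), which the docs list alongside URI objects.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use Web::Scraper;

# made-up markup that only mimics the class names used on the real page
my $html = <<'HTML';
<html><body>
<div class="info-cruise"><h3>Le terre dei vichinghi</h3><span class="new-price">EUR 510,00</span></div>
<div class="info-cruise"><h3>Esempio inventato</h3><span class="new-price">EUR 1,00</span></div>
</body></html>
HTML

my $cruises = scraper {
    ## one hash per matching div, collected into { info => [ ... ] }
    process '.info-cruise', 'info[]' => scraper {
        process 'h3',         title => 'TEXT';
        process '.new-price', price => 'TEXT';
    };
};

my $res = $cruises->scrape($html);
print "$_->{title}: $_->{price}\n" for @{ $res->{info} };
```

Each process line is a rule: selector (CSS or XPath), the key to store under, and what to extract. The `[]` suffix means "collect every match into an array".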

#!/usr/bin/perl --
use strict;
use warnings;
use Data::Dump;
use URI;
use Web::Scraper;

my $soy = scraper {
    ## only get leafs/twigs with this @class
    ## store the results into { info => \@info }
    process '.info-cruise' => 'info[]' => scraper {
        process './/div[@class="sx"]/h3' => 'title' => 'TEXT';
        process '.new-price' => 'price' => 'TEXT';
        process '.itinerari-info' => 'span[]' => scraper {
            #~ process '//span' => 'span[]' => 'RAW'; ## this
            process '//span/b | //span/child::text()' => 'span[]' => sub {
                my $ishtml   = $_[0]->isa('HTML::Element');
                my $keyOrVal = $ishtml ? 'key' : 'val';
                my %foo      = ( $keyOrVal => $_[0]->getValue );
                $foo{raw} = $_[0]->as_XML if $ishtml;
                return \%foo;
            };
        };
    };
};

## NOTE Web::Scraper wants URI objects
my $url  = URI->new('file:capitali_nord_europa-201206.html');
my $base = 'http://www.costacrociere.it';
my $ret  = $soy->scrape( $url, $base );
#~ dd $ret;
dd $ret->{info}->[0];
__END__
{
  price => "\x{20AC} 510,00",
  span  => [
    {
      span => [
        { key => " Itinerario ", raw => "<b> Itinerario </b>\n" },
        { val => " Danimarca, fiordi norvegesi, Germania" },
        { val => " " },
        { key => "Data partenza", raw => "<b>Data partenza</b>\n" },
        { val => " 17\xA0giugno\xA02012 " },
        { key => " Nave ", raw => "<b> Nave </b>\n" },
        { val => " Costa Fortuna" },
        {
          key => " N.ro giorni crociera \xA0 ",
          raw => "<b> N.ro giorni crociera \xA0 </b>\n",
        },
        { val => " 7" },
        { key => " Porto di partenza ", raw => "<b> Porto di partenza </b>\n" },
        { val => " Copenhagen" },
        {
          key => " Documenti di viaggio ",
          raw => "<b> <a href=\"http://www.costacrociere.it/B2C/I/Before_you_go/documentation/travel.htm\" target=\"_blank\">Documenti di viaggio</a> </b>\n",
        },
        {
          val => " Passaporto\xA0o\xA0Carta d'identit\xE0 valida per l'espatrio",
        },
        { val => " Possono essere disponibili le seguenti tariffe " },
      ],
    },
  ],
  title => "Le terre dei vichinghi",
}
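The key/val list in that dump is still a bit raw. If you want a plain hash out of it, something like the following might do. It's a sketch under assumptions: pairs_to_hash and _trim are names I made up, the sample data is a few entries copied from the dump above, and it assumes every value directly follows its label (whitespace-only text nodes and values with no label in front are dropped).

```perl
#!/usr/bin/perl --
use strict;
use warnings;

# Hypothetical helper: strip surrounding whitespace, including the
# non-breaking spaces (\xA0) that show up in the scraped text.
sub _trim {
    my ($s) = @_;
    $s =~ s/\xA0/ /g;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Hypothetical helper: fold alternating { key => ... } / { val => ... }
# entries into one hash of trimmed strings.
sub pairs_to_hash {
    my ($items) = @_;
    my ( %out, $key );
    for my $item (@$items) {
        if ( exists $item->{key} ) {
            $key = _trim( $item->{key} );
            next;
        }
        next unless defined $item->{val};
        my $val = _trim( $item->{val} );
        next unless defined $key and length $val;    # skip filler text nodes
        $out{$key} = $val;
        undef $key;    # a later stray value has no label to attach to
    }
    return \%out;
}

# a few entries copied from the dump above
my $fields = pairs_to_hash(
    [
        { key => " Nave ", raw => "<b> Nave </b>\n" },
        { val => " Costa Fortuna" },
        { val => " " },
        { key => "Data partenza", raw => "<b>Data partenza</b>\n" },
        { val => " 17\xA0giugno\xA02012 " },
        { key => " Porto di partenza ", raw => "<b> Porto di partenza </b>\n" },
        { val => " Copenhagen" },
        { val => " Possono essere disponibili le seguenti tariffe " },
    ]
);

print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

With the real result you would call it as pairs_to_hash( $ret->{info}[0]{span}[0]{span} ).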

I wouldn't be surprised if tobyink stops by with a Web::Magic example :)

