PerlMonks  

Re: Parsing HTML

by Corion (Pope)
on Jun 07, 2012 at 10:13 UTC (#974907)


in reply to Parsing HTML

I would use XPath or CSS expressions, and look at HTML::TreeBuilder::XPath to run the expressions against the HTML. Or rather, I would use App::scrape, which puts that approach into a module, or Web::Scraper or Web::Magic.

With XPath expressions, you can specify the elements you want like paths to files in a directory. In your case, it looks like the following XPath expressions would work:

    # Each voyage
    //p[@class="itinerari-info"]
    # Itinerary within a voyage
    ./span[1]
    # Departure date
    ./span[2]
    # Ship
    ./span[3]
    ...
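As a rough sketch of how the relative expressions could be run with HTML::TreeBuilder::XPath: the sample markup, its values, and the hash keys below are made up for illustration, and the real page will likely need adjustments.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup; the structure of the real page may differ.
my $html = <<'HTML';
<html><body>
<p class="itinerari-info"><span>Venice - Athens</span><span>2012-07-01</span><span>Example Ship</span></p>
<p class="itinerari-info"><span>Barcelona - Rome</span><span>2012-08-15</span><span>Another Ship</span></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @voyages;
# First find each voyage node, then evaluate the relative
# expressions with that node as the context.
for my $voyage ($tree->findnodes('//p[@class="itinerari-info"]')) {
    push @voyages, {
        itinerary => $voyage->findvalue('./span[1]'),
        departure => $voyage->findvalue('./span[2]'),
        ship      => $voyage->findvalue('./span[3]'),
    };
}

printf "%s | %s | %s\n", @{$_}{qw(itinerary departure ship)} for @voyages;

$tree->delete;    # free the parse tree
```

Looping over the voyage nodes keeps the three values of each voyage together, which matters as soon as the page lists more than one.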

If your target page lists only one such itinerary, you can roll the XPath expressions into one expression, instead of using them relative to the voyage nodes:

    # Itinerary
    //p[@class="itinerari-info"]/span[1]
    ...
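For the single-itinerary case, the combined expression can be handed straight to the tree. Again only a sketch, with invented sample markup:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page with exactly one voyage.
my $html = '<html><body><p class="itinerari-info">'
         . '<span>Venice - Athens</span><span>2012-07-01</span><span>Example Ship</span>'
         . '</p></body></html>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# One absolute expression, no intermediate voyage node needed.
my $itinerary = $tree->findvalue('//p[@class="itinerari-info"]/span[1]');
print "$itinerary\n";

$tree->delete;    # free the parse tree
```

If the page might contain several voyages after all, findnodes with the same expression is the safer call, since it returns each matching node separately.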

You can test out these queries in Firebug (I think), or with the scrape-ff tool in WWW::Mechanize::Firefox, or with the scrape tool in App::scrape. Likely, Mojolicious and the modules mentioned before also contain tools for easy command line testing of XPath expressions against URLs.

