Re^5: Extracting span and meta content with HTML::TreeBuilder

by poj (Prior)
on Jul 17, 2014 at 12:20 UTC (#1094025)


in reply to Re^4: Extracting span and meta content with HTML::TreeBuilder
in thread Extracting span and meta content with HTML::TreeBuilder

Ok, try another approach
    #!perl
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file('test.htm');    # change to ->parse($html) for a string

    # =~ is the regex-match extension from XML::XPathEngine, not standard XPath
    my $xpath = '//meta[@itemprop=~/author|datePublished|ratingValue/]';
    my @items = $tree->findnodes($xpath)
        or die "no items matched $xpath\n";

    my $rec   = {};
    my $count = 0;
    for my $item (@items) {
        my $prop = $item->attr('itemprop');
        $rec->{$prop} = $item->attr('content');

        # print the record once its datePublished has been seen
        if ( $prop eq 'datePublished' ) {
            print ++$count . " ";
            print $rec->{'author'} . " ; ";
            print $rec->{'ratingValue'} . " ; ";
            print $rec->{'datePublished'} . "\n";
        }
    }
poj

Replies are listed 'Best First'.
Re^6: Extracting span and meta content with HTML::TreeBuilder
by wrinkles (Pilgrim) on Jul 18, 2014 at 01:41 UTC
    poj,

    Yes, that's perfect, thank you!

    The docs for the HTML::TB::XP module are not sufficient (at least for me) to understand how your code works. Where is the documentation that would help me understand this? Did you rely on some key documentation to sort this out? What do you recommend I read to understand this?

      See all the links in Re: Retrieve select information from HTML. They are examples (for tree-xpath and others), walkthroughs, and tutorials. Tools like xpather.pl/htmltreexpather.pl can give you paths to start with.

      findnodes gives you nodes. With TreeBuilder these are HTML::Element objects you can call methods on. The other player (XML::LibXML) gives you XML::LibXML::Node objects, be they XML::LibXML::Element or something else (libxml follows the DOM closely).
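
      To make "objects you can call methods on" concrete, here is a minimal sketch. The markup is made up for illustration (it is not the file from this thread), and it assumes HTML::TreeBuilder::XPath is installed from CPAN.

```perl
#!perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, not the original poster's file
my $html = <<'HTML';
<html><head><title>t</title></head><body>
<div class="review-content">
  <meta itemprop="author" content="Alice">
  <span class="rating-qualifier">1/13/2011</span>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # tell the parser the document is complete

# findnodes returns a list of element objects ...
my ($meta) = $tree->findnodes('//meta[@itemprop="author"]');
my ($span) = $tree->findnodes('//span[@class="rating-qualifier"]');

# ... so the usual HTML::Element methods work on each node
my $is_element = $meta->isa('HTML::Element') ? 1 : 0;
my $tag        = $meta->tag;                # element name
my $author     = $meta->attr('content');    # attribute value
my $date       = $span->as_text;            # text content

# findvalue skips the node and gives the string value directly
my $date2 = $tree->findvalue('//span[@class="rating-qualifier"]');

print "$tag ; $author ; $date ; $date2\n";

$tree->delete;    # free the tree when done
```

      The same pattern (findnodes, then attr or as_text on each node) is all the script at the top of the thread relies on.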

      This tutorial needs JavaScript: http://zvon.org/comp/r/tut-XPath_1.html

      On the file you provided xpather spits out stuff like this

      /html/body/div/div/span
      # posy
      /html[1]/body[1]/div[1]/div[1]/span[1]
      # star
      /*[ local-name() = "html" and position() = 1 ]
      /*[ local-name() = "body" and position() = 1 ]
      /*[ local-name() = "div" and position() = 1 and @class = "review-content" ]
      /*[ local-name() = "div" and position() = 1 and @class = "biz-rating biz-rating-very-large clearfix" ]
      /*[ local-name() = "span" and @class = "rating-qualifier" and contains(string(), " 1/13/2011 ") ]
      # rats
      /html[1]
      /body[1]
      /*[ name() = "div" and position() = 1 and @class = "review-content" ]
      /*[ name() = "div" and position() = 1 and @class = "biz-rating biz-rating-very-large clearfix" ]
      /*[ name() = "span" and position() = 1 and @class = "rating-qualifier" ]

      It's a tree :) so //meta means find a <meta> anywhere, whereas /foo/meta means find every <meta> that is a direct child of the root element foo, as in <foo><meta></meta>....</foo>
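
      A small sketch of the "anywhere" versus "exact path" difference. The markup is hypothetical, and it uses div rather than meta so the counts do not depend on where the parser files a stray meta.

```perl
#!perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: two divs directly under body, one of them holding a nested div
my $html = <<'HTML';
<html><head><title>t</title></head><body>
<div class="review-content">
  <div class="biz-rating"><span class="rating-qualifier">1/13/2011</span></div>
</div>
<div class="footer">end</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# // searches at any depth; a full path only follows parent-to-child steps
my @anywhere = $tree->findnodes('//div');
my @children = $tree->findnodes('/html/body/div');
my @nested   = $tree->findnodes('/html/body/div/div');

printf "%d anywhere, %d directly under body, %d one level deeper\n",
    scalar @anywhere, scalar @children, scalar @nested;

$tree->delete;
```

      The three counts differ because each extra step in the path narrows the match to one more level of the tree.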

      The examples/tuts give better examples and fuller explanations.
