Re: Extracting span and meta content with HTML::TreeBuilder

by poj (Priest)
on Jul 16, 2014 at 20:57 UTC (#1093942)


in reply to Extracting span and meta content with HTML::TreeBuilder

Here is one way.

#!perl
use strict;
use HTML::TreeBuilder 5 -weak;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file(\*DATA);

my @items = $tree->look_down( '_tag', 'meta' )
  or die("no items: $!\n");

for my $item (@items) {
  print $item->attr('itemprop');
  print ' = ';
  print $item->attr('content')."\n";
}

__DATA__
<div class="review-content">
  <div class="biz-rating biz-rating-very-large clearfix">
    <div itemtype="http://schema.org/Rating" itemscope="" itemprop="reviewRating">
      <div class="rating-very-large">
        <i title="4.0 star rating" class="star-img stars_4">
          <img width="84" height="303" src="http://blah/v2/stars_map.png" class="offscreen" alt="4.0 star rating">
        </i>
        <meta content="4.0" itemprop="ratingValue">
      </div>
    </div>
    <span class="rating-qualifier">
      <meta content="2011-01-13" itemprop="datePublished">
      1/13/2011
    </span>
  </div>
  <p lang="en" itemprop="description" class="review_comment ieSucks">
    blah!!
  </p>
</div>
poj


Re^2: Extracting span and meta content with HTML::TreeBuilder
by wrinkles (Pilgrim) on Jul 16, 2014 at 21:32 UTC
    Thanks poj. I first have to extract the "review-content" elements and pull the span out of those, so I don't have that HTML snippet to work on directly. A nested look_down fails:
    for my $page (@$review_pages) {
        my $html = get $page->[1];
        $html =~ s/([^[:ascii:]]+)/unidecode($1)/ge;
        my $tree = HTML::TreeBuilder->new;    # empty tree
        $tree->parse($html);
        print "Review for $page->[0]\n";
        my @items = $tree->look_down( 'class', 'review-content' )
          or die("no items: $!\n");
        for my $item (@items) {
            my @meta = $item->look_down( '_tag', 'meta' )
              or die("no meta: $!\n");    # dies here
            for my $meta_item (@meta) {
                print $meta_item->attr('itemprop');
                print ' = ';
                print $meta_item->attr('content') . "\n";
            }
        }
    }
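The real page is not shown, so why the inner look_down comes back empty is unknown. As a rough diagnostic sketch, the loop below warns and moves on instead of dying, which makes it easier to see which block lacks a match. It looks for the rating-qualifier span mentioned above rather than the meta elements. The two-block page in `$html` is made up for illustration, and its second block deliberately has no span.

```perl
#!perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page: the second block deliberately has no span.
my $html = <<'HTML';
<html><head><title>reviews</title></head><body>
<div class="review-content"><span class="rating-qualifier">1/13/2011</span></div>
<div class="review-content"><p>no date here</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # tell the parser the document is complete

my @dates;
my $skipped = 0;
for my $item ( $tree->look_down( class => 'review-content' ) ) {

    # In scalar context look_down returns the first match, or undef.
    my $span = $item->look_down( _tag => 'span', class => 'rating-qualifier' );
    unless ($span) {
        warn "no rating-qualifier span in this block\n";
        $skipped++;
        next;
    }
    push @dates, $span->as_trimmed_text;
}

print "$_\n" for @dates;
print "skipped $skipped block(s)\n";

$tree->delete;
```

Note that look_down compares the class attribute as a whole string, so a block whose class attribute lists several classes would not match 'review-content' this way.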
      I guessed that might be the case. How about using XPath?
      #!perl
      use strict;
      use HTML::TreeBuilder::XPath;

      my $tree = HTML::TreeBuilder::XPath->new;
      $tree->parse_file(\*DATA);

      my @items = $tree->findnodes( '//div[@class="review-content"]' )
        or die("no items: $!\n");

      for my $item (@items) {
        for ( $item->findnodes( '//meta') ){
          print $_->attr('itemprop');
          print ' = ';
          print $_->attr('content')."\n";
        }
      }
      poj
        That's getting the meta data, but also way too much of what I don't want.
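One likely reason for the extra output is that in XPath a path beginning with `//` is evaluated from the document root, even when findnodes is called on an inner element. A path beginning with `.//` is relative to the current node. The sketch below contrasts the two on span elements. The two-review page in `$html` is made up for illustration, and whether the same change suffices for the meta elements on the real page is an assumption that has not been tested here.

```perl
#!perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical two-review page; the second review is invented for the example.
my $html = <<'HTML';
<html><head><title>reviews</title></head><body>
<div class="review-content"><span class="rating-qualifier">1/13/2011</span></div>
<div class="review-content"><span class="rating-qualifier">2/14/2012</span></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @reviews = $tree->findnodes('//div[@class="review-content"]');

my ( @absolute, @relative );
for my $review (@reviews) {

    # '//span' starts at the document root: every review sees every span.
    push @absolute, map { $_->as_trimmed_text } $review->findnodes('//span');

    # './/span' starts at the current node: only this review's own spans.
    push @relative, map { $_->as_trimmed_text } $review->findnodes('.//span');
}

print scalar(@absolute), " matches with //span\n";
print scalar(@relative), " matches with .//span: @relative\n";

$tree->delete;
```

With the relative form each review contributes only its own descendants, so the outer loop no longer repeats the whole document once per review.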
