PerlMonks  

Re: Extracting span and meta content with HTML::TreeBuilder

by poj (Priest)
on Jul 16, 2014 at 20:57 UTC (#1093942)


in reply to Extracting span and meta content with HTML::TreeBuilder

Here is one way.

    #!perl
    use strict;
    use HTML::TreeBuilder 5 -weak;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file(\*DATA);

    # collect every <meta> element in the parsed document
    my @items = $tree->look_down( '_tag', 'meta' )
      or die("no items: $!\n");

    for my $item (@items) {
      print $item->attr('itemprop');
      print ' = ';
      print $item->attr('content')."\n";
    }

    __DATA__
    <div class="review-content">
      <div class="biz-rating biz-rating-very-large clearfix">
        <div itemtype="http://schema.org/Rating" itemscope="" itemprop="reviewRating">
          <div class="rating-very-large">
            <i title="4.0 star rating" class="star-img stars_4">
              <img width="84" height="303" src="http://blah/v2/stars_map.png" class="offscreen" alt="4.0 star rating">
            </i>
            <meta content="4.0" itemprop="ratingValue">
          </div>
        </div>
        <span class="rating-qualifier">
          <meta content="2011-01-13" itemprop="datePublished">
          1/13/2011
        </span>
      </div>
      <p lang="en" itemprop="description" class="review_comment ieSucks">
        blah!!
      </p>
    </div>
poj


Re^2: Extracting span and meta content with HTML::TreeBuilder
by wrinkles (Pilgrim) on Jul 16, 2014 at 21:32 UTC
    Thanks poj, I first have to extract the "review-content" elements, and pull the span out of those. So I don't have that HTML snippet to work on directly. A nested look_down fails:
        for my $page (@$review_pages) {
            my $html = get $page->[1];
            $html =~ s/([^[:ascii:]]+)/unidecode($1)/ge;
            my $tree = HTML::TreeBuilder->new;    # empty tree
            $tree->parse($html);
            print "Review for $page->[0]\n";
            my @items = $tree->look_down( 'class', 'review-content' )
              or die("no items: $!\n");
            for my $item (@items) {
                my @meta = $item->look_down( '_tag', 'meta' )
                  or die("no meta: $!\n");    # dies here
                for my $meta_item (@meta) {
                    print $meta_item->attr('itemprop');
                    print ' = ';
                    print $meta_item->attr('content') . "\n";
                }
            }
        }
      I guessed that might be the case. How about using XPath?

          #!perl
          use strict;
          use HTML::TreeBuilder::XPath;

          my $tree = HTML::TreeBuilder::XPath->new;
          $tree->parse_file(\*DATA);

          my @items = $tree->findnodes( '//div[@class="review-content"]' )
            or die("no items: $!\n");

          for my $item (@items) {
            for ( $item->findnodes( '//meta') ){
              print $_->attr('itemprop');
              print ' = ';
              print $_->attr('content')."\n";
            }
          }
      poj
        That's getting the meta data, but also way too much of what I don't want.
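
The extra output is consistent with how XPath resolves paths: an expression that starts with `//` is evaluated from the document root even when `findnodes` is called on an element, so `$item->findnodes('//meta')` returns every meta element on the page. A leading dot (`.//meta`) keeps the search inside `$item`. The following is a minimal sketch, assuming HTML::TreeBuilder::XPath is installed and that the unwanted lines come from meta elements outside the review-content divs. The `review_meta` helper and the sample HTML are hypothetical, not from the thread.

```perl
#!perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not from the thread): returns [itemprop, content]
# pairs for the <meta> elements found inside each review-content div.
sub review_meta {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;    # signal end of input so the tree is complete

    my @pairs;
    for my $item ( $tree->findnodes('//div[@class="review-content"]') ) {
        # './/meta' is relative to $item; '//meta' would restart at the root
        for my $meta ( $item->findnodes('.//meta') ) {
            push @pairs, [ $meta->attr('itemprop'), $meta->attr('content') ];
        }
    }

    $tree->delete;    # release the tree; @pairs holds plain strings only
    return @pairs;
}

# Made-up sample page: one <meta> in the head, two inside the review div.
my $html = <<'HTML';
<html><head><meta content="ignored" itemprop="outside"></head>
<body>
<div class="review-content">
  <meta content="4.0" itemprop="ratingValue">
  <span class="rating-qualifier">
    <meta content="2011-01-13" itemprop="datePublished"> 1/13/2011
  </span>
</div>
</body></html>
HTML

print "$_->[0] = $_->[1]\n" for review_meta($html);
```

If some review-content divs legitimately contain no meta elements, skipping them with `next` rather than calling `die` may be the safer choice when looping over many pages.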
