PerlMonks  

Re^3: Extracting span and meta content with HTML::TreeBuilder

by poj (Priest)
on Jul 16, 2014 at 21:35 UTC (#1093946)


in reply to Re^2: Extracting span and meta content with HTML::TreeBuilder
in thread Extracting span and meta content with HTML::TreeBuilder

I guessed that might be the case. How about using XPath?

#!perl
use strict;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file(\*DATA);

my @items = $tree->findnodes( '//div[@class="review-content"]' )
  or die("no items: $!\n");

for my $item (@items) {
  for ( $item->findnodes( '//meta') ){
    print $_->attr('itemprop');
    print ' = ';
    print $_->attr('content')."\n";
  }
}
poj


Re^4: Extracting span and meta content with HTML::TreeBuilder
by wrinkles (Pilgrim) on Jul 16, 2014 at 22:17 UTC
    That's getting the meta data, but also way too much of what I don't want.
      poj has shown you how to get the meta properties - to get the date just add a test:
      next unless $_->attr('itemprop') eq 'datePublished';
      OK, try another approach:
      #!perl
      use strict;
      use HTML::TreeBuilder::XPath;

      my $tree = HTML::TreeBuilder::XPath->new;
      $tree->parse_file('test.htm'); # change to ->parse($html)

      my $xpath = '//meta[@itemprop=~/author|datePublished|ratingValue/]';
      my @items = $tree->findnodes( $xpath )
        or die("no items: $!\n");

      my $rec = {};
      my $count = 0;
      for my $item (@items) {
        my $prop = $item->attr('itemprop');
        $rec->{$prop} = $item->attr('content');
        if ($prop eq 'datePublished'){
          print ++$count." ";
          print $rec->{'author'}." ; ";
          print $rec->{'ratingValue'}." ; ";
          print $rec->{'datePublished'}."\n";
        };
      }
      poj
        poj,

        Yes, that's perfect, thank you!

        The docs for the HTML::TB::XP module are not sufficient (at least for me) to understand how your code works. Where is the documentation that would help me understand this? Did you rely on some key documentation to sort this out? What would you recommend I read to understand it?
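
The second script can be read in two parts. The XPath selects every meta element whose itemprop matches one of the three names; the `=~` regex match is, as far as I know, an extension offered by the underlying XML::XPathEngine rather than standard XPath. The loop then groups those elements into records. The sketch below isolates only that grouping step using core Perl, with a hand-written list of pairs standing in for the parsed nodes. The names `@metas` and `group_reviews` and all the sample values are made up for illustration, and it assumes datePublished is the last of the three properties in each review.

```perl
#!perl
use strict;
use warnings;

# Made-up (itemprop, content) pairs, in document order, standing in for
# the <meta> nodes that findnodes() would return.
my @metas = (
    [ author        => 'Alice'      ],
    [ ratingValue   => '4'          ],
    [ datePublished => '2014-07-01' ],
    [ author        => 'Bob'        ],
    [ ratingValue   => '5'          ],
    [ datePublished => '2014-07-02' ],
);

# Build one output line per review. Each property overwrites the value
# left by the previous review; datePublished is assumed to come last,
# so seeing it means the current record is complete.
sub group_reviews {
    my @pairs = @_;
    my %rec;
    my @lines;
    for my $pair (@pairs) {
        my ( $prop, $content ) = @$pair;
        $rec{$prop} = $content;
        if ( $prop eq 'datePublished' ) {
            push @lines, join ' ; ', @rec{qw(author ratingValue datePublished)};
        }
    }
    return @lines;
}

my $count = 0;
print ++$count, " $_\n" for group_reviews(@metas);
# prints:
# 1 Alice ; 4 ; 2014-07-01
# 2 Bob ; 5 ; 2014-07-02
```

Because the record is never cleared between reviews, a review that lacks an author or rating would silently reuse the previous review's value. Emptying the record after each printed line would be one way to guard against that.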
