PerlMonks  

Re^3: Extracting span and meta content with HTML::TreeBuilder

by poj (Priest)
on Jul 16, 2014 at 21:35 UTC (#1093946)


in reply to Re^2: Extracting span and meta content with HTML::TreeBuilder
in thread Extracting span and meta content with HTML::TreeBuilder

I guessed that might be the case. How about using XPath?

#!perl
use strict;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file(\*DATA);

my @items = $tree->findnodes( '//div[@class="review-content"]' )
  or die("no items: $!\n");

for my $item (@items) {
  for ( $item->findnodes( '//meta') ){
    print $_->attr('itemprop');
    print ' = ';
    print $_->attr('content')."\n";
  }
}
poj


Re^4: Extracting span and meta content with HTML::TreeBuilder
by wrinkles (Pilgrim) on Jul 16, 2014 at 22:17 UTC
    That's getting the meta data, but also way too much of what I don't want.
      poj has shown you how to get the meta properties - to get the date just add a test:
      next unless $_->attr('itemprop') eq 'datePublished';
      OK, try another approach:
      #!perl
      use strict;
      use HTML::TreeBuilder::XPath;

      my $tree = HTML::TreeBuilder::XPath->new;
      $tree->parse_file('test.htm'); # change to ->parse($html)

      my $xpath = '//meta[@itemprop=~/author|datePublished|ratingValue/]';
      my @items = $tree->findnodes( $xpath )
        or die("no items: $!\n");

      my $rec={};
      my $count = 0;
      for my $item (@items) {
        my $prop = $item->attr('itemprop');
        $rec->{$prop} = $item->attr('content');
        if ($prop eq 'datePublished'){
          print ++$count." ";
          print $rec->{'author'}." ; ";
          print $rec->{'ratingValue'}." ; ";
          print $rec->{'datePublished'}."\n";
        };
      }
      poj
        poj,

        Yes, that's perfect, thank you!

        The docs for the HTML::TreeBuilder::XPath module are not sufficient (at least for me) to understand how your code works. Where is the documentation that would help me understand this? Did you go to some key documentation to help you sort this out? What do you recommend I read to understand this?
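
On how the code works: HTML::TreeBuilder::XPath hands the XPath evaluation to XML::XPathEngine, and the `=~ /.../` test inside the predicate is a Perl-specific regex extension described in that module's documentation, not part of standard XPath 1.0. The rest is an ordinary Perl loop that relies on the meta nodes arriving in document order: each property is stored in a hash, and a line is printed when datePublished shows up. Below is a minimal sketch of just that loop, with hard-coded pairs standing in for the meta nodes. The sample values and the name `collect_records` are hypothetical, and the hash reset is an addition that is not in poj's code.

```perl
#!perl
use strict;
use warnings;

# Hypothetical stand-in for the (itemprop, content) pairs that the
# <meta> nodes would yield in document order; not taken from the real page.
my @pairs = (
    [ author        => 'alice' ],
    [ ratingValue   => '4' ],
    [ datePublished => '2014-07-01' ],
    [ author        => 'bob' ],
    [ ratingValue   => '5' ],
    [ datePublished => '2014-07-02' ],
);

# Store each property as it arrives; build one output line whenever
# datePublished shows up, since it is assumed to close each review.
sub collect_records {
    my @in = @_;
    my %rec;
    my @lines;
    for my $pair (@in) {
        my ($prop, $content) = @$pair;
        $rec{$prop} = $content;
        if ($prop eq 'datePublished') {
            push @lines, join ' ; ',
                map { $rec{$_} // '' } qw(author ratingValue datePublished);
            %rec = ();    # reset so values cannot leak into the next review
        }
    }
    return @lines;
}

my $count = 0;
print ++$count, " $_\n" for collect_records(@pairs);
```

If a review on the real page lists datePublished before its author or rating, this grouping assumption breaks, and selecting each review container first and querying inside it would be the safer route.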
