Update: Thanks to helpful nudges by toj and others, I have completed a script that checks the latest reviews for my school, and sends notifications by email. See Check popular review sites for new reviews.
I'm using HTML::TreeBuilder to parse some Yelp pages of my favorite Mexican restaurant(s). I want to create a list of review dates and star ratings for a specific business.
The problem is that the information I want is contained in span and meta tags, and these don't seem to be part of the element tree.
Here is the relevant section of HTML:
<div class="review-content">
<div class="biz-rating biz-rating-very-large clearfix">
<div itemtype="http://schema.org/Rating" itemscope="" itemprop="reviewRating">
<div class="rating-very-large">
<i title="4.0 star rating" class="star-img stars_4">
<img width="84" height="303" src="http://blah/v2/stars_map.png" class="offscreen" alt="4.0 star rating">
</i>
<meta content="4.0" itemprop="ratingValue">
</div>
</div>
<span class="rating-qualifier">
<meta content="2011-01-13" itemprop="datePublished">
1/13/2011
</span>
</div>
<p lang="en" itemprop="description" class="review_comment ieSucks">
blah!!
</p>
</div>
And here is the element tree:
$tree->look_down( 'class', 'review-content' )
<div class="review-content">
<div class="biz-rating biz-rating-very-large clearfix">
<div itemprop="reviewRating" itemscope="itemscope" itemtype="http://schema.org/Rating">
<div class="rating-very-large">
<i class="star-img stars_4" title="4.0 star rating">
<img alt="4.0 star rating" class="offscreen" height="303" src="http://blah/v2/stars_map.png" width="84" />
</i>
</div>
</div>
</div>
</div>
So far I have the working program below, which prints the rating, but I haven't been able to access the span that contains the date. Thanks for your help!
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Data::Dumper;
use LWP::Simple qw(get);
use Text::Unidecode qw(unidecode);
use HTML::TreeBuilder 5 -weak;    # Ensure weak references in use

# Each entry is [ label, URL ].
my $review_pages = [
    [ 'Jorges #1', 'http://www.yelp.com/biz/jorges-mexicatessen-encinitas' ],
    [ 'Jorges #2', 'http://www.yelp.com/biz/jorges-mexicatessen-encinitas-2' ],
];

for my $page (@$review_pages) {
    my ( $name, $url ) = @$page;

    # get() returns undef on failure, so check before parsing.
    my $html = get($url);
    defined $html or die "could not fetch $url\n";

    # Transliterate non-ASCII characters to plain ASCII.
    $html =~ s/([^[:ascii:]]+)/unidecode($1)/ge;

    my $tree = HTML::TreeBuilder->new;    # empty tree
    $tree->parse($html);
    $tree->eof;                           # signal end of input to the parser

    print "Review for $name\n";

    # $! is not meaningful here, so the messages no longer include it.
    my @items = $tree->look_down( 'class', 'review-content' )
        or die "no items\n";

    for my $item (@items) {
        # The star rating is in the title attribute of the <i> element.
        my $rating = $item->look_down( '_tag', 'i' )
            or die "no rating\n";
        my $rating_title = $rating->attr('title');
        print "$rating_title\n";
    }
}
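
As a stopgap while the tree question is open, one option I can think of is to read the meta values straight from the raw HTML string instead of the element tree. The following is only a minimal sketch under assumptions: `meta_itemprops` is a hypothetical helper name, it assumes double-quoted attribute values as in the sample above, and regex matching on HTML is fragile, so it is not a substitute for getting the parser to keep these elements.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): collect
# itemprop => [ content values ] from <meta> tags in a raw HTML string,
# without building an element tree.
sub meta_itemprops {
    my ($html) = @_;
    my %found;

    # Match each <meta ...> tag, then read its attributes separately so
    # that the order of content= and itemprop= does not matter.
    while ( $html =~ /<meta\b([^>]*)>/gi ) {
        my $attrs = $1;
        my ($prop)    = $attrs =~ /\bitemprop\s*=\s*"([^"]*)"/i or next;
        my ($content) = $attrs =~ /\bcontent\s*=\s*"([^"]*)"/i  or next;
        push @{ $found{$prop} }, $content;
    }
    return \%found;
}

# Sample input trimmed from the HTML shown in the question.
my $sample = <<'HTML';
<div class="review-content">
<meta content="4.0" itemprop="ratingValue">
<span class="rating-qualifier">
<meta content="2011-01-13" itemprop="datePublished">
1/13/2011
</span>
</div>
HTML

my $props = meta_itemprops($sample);
print "rating: $props->{ratingValue}[0]\n";
print "date: $props->{datePublished}[0]\n";
```

On a full page the values come back in document order, so pairing the Nth rating with the Nth date only holds if every review carries both meta tags, which I have not verified.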