http://www.perlmonks.org?node_id=1054800

faozhi has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have the HTML below from a page and I would like to extract the title and the price, but it is taking me a very long time.
This is a sample of the HTML file I am parsing:
<td width="135px">
  <a href="google.com" title="Please help " class="product-image"><img src="google.com/blabla.jpg" width="135" height="180" alt="Please help " /></a></td>
<td valign="top">
  <div class="category-description">
    <h2 class="product-name"><a href="google.com" title="Please help ">Please help </a></h2>
    <strong>Doodle Thomson </strong> <br/>
    At a time when many people are attempting to relate current events and trends in the world to interpretations of the prophecies contained in the Book of Revelations, and the writings of Nostradamus, and the predictions of fashionable clairvoyants, the author does much the same
  </div>
</td>
<td width="30%">
  <div class="categoty_price">
    <div class="price-box">
      <span class="regular-price" id="product-price-1139">
        <span class="price">$20.00</span>
      </span>
I need the title (Please help) and the price ($20.00). It is a long HTML file with many more entries like this one.
Please help. This is the code I have so far, but it is failing me:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

if (@ARGV != 1 ) {
    die "./quick.book.parsing.pl <html file>\n";
}

#open html file
open(HTML,$ARGV[0]) || die "Couldn't open file $ARGV[0]\n";

while (my $html = <HTML>) {
    next if $html =~ /<button/;
    chomp $html;
    if ($html =~m/title="(.*)" class/g) {
        my @columns =~ split (/\"/, $html);
        print "$columns[5]";
    }
}

#{
#    next if $html =~ /<button/;
#    chomp $html;
#    if ($html=~ m/title="(.*)" class/g)
#{
#    $columns[5] =~ s/^\s+|\s+$//g;
#    print "$columns[5]---";
#}
#
#    if ($html=~ m/<span class="price">(.*)<\/span>/g)
#{
#    $html =~ s/^\s+|\s+$//g;
#    print "$html\n";
#}
#}

close HTML;
exit;

Replies are listed 'Best First'.
Re: How can i have the titles and the prices?
by tobyink (Canon) on Sep 19, 2013 at 08:25 UTC

    Don't parse HTML with regexps. Use an HTML parser.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use HTML::HTML5::Parser;
    use XML::LibXML::QuerySelector;

    my $input = shift or die("Usage: $0 filename\n");
    my $html = HTML::HTML5::Parser->load_html(location => $input);

    print $html->querySelector('h2.product-name a')->textContent, "\n";
    print $html->querySelector('div.price-box span.price')->textContent, "\n";

    Output is:

    Please help
    $20.00
Re: How can i have the titles and the prices?
by Corion (Patriarch) on Sep 19, 2013 at 09:12 UTC

    If the page is simple enough, CSS selectors make working with an HTML parser a bit less daunting. Several modules implement this approach, such as Mojo::DOM, Web::Magic and Web::Query. Others, like Web::Scraper, provide a bit more scaffolding around running the data extraction.

    App::scrape is a minimalistic scraper that implements two steps: 1) fetching an HTML page and 2) extracting data according to CSS selectors. The basic invocation looks like the following (assuming your data lives in a file 1054800.html):

    C:\>scrape file:///1054800.html .product-name .price
    Please help $20.00

    So, armed with the two selectors, you can then move from the command-line tool to using the same selectors in a script with (for example) App::scrape:

    #!perl -w
    use strict;
    use App::scrape 'scrape';
    use LWP::Simple 'get';
    use Data::Dumper;

    my $html= get 'file:///1054800.html';

    my @info = scrape(
        $html,
        {
            product => '.product-name',
            price => '.price',
        },
    );

    print Dumper \@info;

    __END__
    C:\>perl -w tmp.pl
    $VAR1 = [
              {
                'product' => 'Please help',
                'price' => '$20.00'
              }
            ];

    Note that App::scrape assumes that your data is basically tabular. It does not cope well with data that has a more complex structure, and especially not with the situation where a product has no price tag.
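
    For pages where a price may be missing, one option is to walk each product's container and look up both fields inside it. The following is only a rough sketch using Mojo::DOM (one of the modules named above). The one-<tr>-per-product layout, the sample markup and the trim() helper are assumptions made for illustration, not taken from the original page:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: one <tr> per product is an assumption, and the
# second row deliberately has no price.
my $html = q{
<table>
<tr>
  <td><h2 class="product-name"><a href="google.com" title="Please help ">Please help </a></h2></td>
  <td><div class="price-box"><span class="price">$20.00</span></div></td>
</tr>
<tr>
  <td><h2 class="product-name"><a href="google.com" title="No price yet">No price yet</a></h2></td>
  <td></td>
</tr>
</table>
};

my $dom = Mojo::DOM->new($html);
my @products;

for my $row ($dom->find('tr')->each) {
    # Skip rows that do not contain a product at all.
    my $name = $row->at('h2.product-name a') or next;

    # at() returns undef when nothing matches, so a missing price is
    # recorded as undef instead of shifting the other columns.
    my $price = $row->at('span.price');

    push @products, {
        product => trim($name->text),
        price   => defined $price ? trim($price->text) : undef,
    };
}

# Hypothetical helper: strip leading and trailing whitespace.
sub trim { my $s = shift; $s =~ s/^\s+|\s+$//g; return $s }

printf "%s => %s\n", $_->{product}, $_->{price} // 'n/a' for @products;
```

    Because both lookups are scoped to the same row, a product without a price cannot be paired with the price of its neighbour.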

Re: How can i have the titles and the prices?
by hdb (Monsignor) on Sep 19, 2013 at 08:28 UTC

    An HTML parser is advisable. However, if you insist on using regexes (which can fail in many ways), keep two things in mind:

    1. Make them non-greedy by using (.*?). Otherwise, they match the longest possible string.
    2. Instead of the dot, use a negated character class that excludes the character following the string you are looking for. So instead of title="(.*)" use title="([^"]*)". This way you get everything up to the next quote.
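
    The difference between the three patterns can be seen on a single line that holds two title attributes. This is a small sketch in core Perl; the sample line and the variable names are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line with two title attributes.
my $line = '<a href="google.com" title="Please help " class="product-image">'
         . '<img alt="x" /></a> <a href="google.com" title="Other book" class="x">';

# Greedy: .* runs on to the LAST '" class' on the line.
my ($greedy) = $line =~ /title="(.*)" class/;

# Non-greedy: .*? stops at the first '" class' that lets the match succeed.
my ($lazy) = $line =~ /title="(.*?)" class/;

# Negated class: [^"]* can never cross a double quote.
my ($negated) = $line =~ /title="([^"]*)"/;

# In list context, /g returns every captured title on the line.
my @all = $line =~ /title="([^"]*)"/g;

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
print "all:     [", join('] [', @all), "]\n";
```

    The greedy capture swallows everything from the first title up to the end of the second one, while the non-greedy and negated-class versions both capture only "Please help " (with its trailing space).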

Re: How can i have the titles and the prices?
by Matt™ (Acolyte) on Sep 19, 2013 at 08:19 UTC
    Use an HTML Parser