How can i have the titles and the prices?

faozhi has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have the HTML below from a page and I would like to extract the title and the price, but it is taking me a long time.
This is a sample of the HTML file I am parsing:

 <td width="135px"> <a href="google.com" title="Please help " class="product-image"><img src="google.com/blabla.jpg" width="135" height="180" alt="Please help " /></a></td>
                    <td valign="top">
                    <div class="category-description">
                    <h2 class="product-name"><a href="google.com" title="Please help ">Please help </a></h2>
                    <strong>Doodle Thomson </strong> <br/>

                    At a time when many people are attempting to relate current events and trends in the world to interpretations of the prophecies contained in the Book of Revelations, and the writings of Nostradamus, and the predictions of fashionable clairvoyants, the author does much the same                    </div>
                    </td>
                    <td width="30%">
                        <div class="categoty_price">

    <div class="price-box">
                                                            <span class="regular-price" id="product-price-1139">
                                            <span class="price">$20.00</span>                                    </span>

I need the title (Please help) and the price ($20.00). It is a long HTML file with many more of these.
Please help. This is the code I have so far, but it is failing me...

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

if (@ARGV != 1 ) { die "./quick.book.parsing.pl <html file>\n";}

#open html file

open(HTML,$ARGV[0]) || die "Couldn't open file $ARGV[0]\n";
while (my $html = <HTML>)
{
        next if $html =~ /<button/;
        chomp $html;
        if ($html =~m/title="(.*)" class/g)
        {
                my @columns =~ split (/\"/, $html);
                print "$columns[5]";
        }
}




#{
#       next if $html =~ /<button/;
#       chomp $html;
#       if ($html=~ m/title="(.*)" class/g)
#{
#       $columns[5] =~ s/^\s+|\s+$//g;
#       print "$columns[5]---";
#}
#
#       if ($html=~ m/<span class="price">(.*)<\/span>/g)
#{
#       $html =~ s/^\s+|\s+$//g;
#       print "$html\n";
#}
#}

close HTML;

exit;


Replies are listed 'Best First'.
Re: How can i have the titles and the prices? by tobyink (Canon) on Sep 19, 2013 at 08:25 UTC
Don't parse HTML with regexps. Use an HTML parser.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use HTML::HTML5::Parser;
    use XML::LibXML::QuerySelector;

    my $input = shift or die("Usage: $0 filename\n");
    my $html  = HTML::HTML5::Parser->load_html(location => $input);

    print $html->querySelector('h2.product-name a')->textContent, "\n";
    print $html->querySelector('div.price-box span.price')->textContent, "\n";

Output is:

    Please help
    $20.00
Re: How can i have the titles and the prices? by Corion (Patriarch) on Sep 19, 2013 at 09:12 UTC
If the page is simple enough, an approach that makes the HTML parser a bit less daunting is to use CSS selectors. Several modules implement this approach, like Mojo::DOM, Web::Magic, Web::Query. Some others, like Web::Scraper, provide a bit more scaffolding around running data extraction.

App::scrape is a minimalistic scraper that implements the two steps of 1) fetching an HTML page and 2) extracting data according to CSS selectors. The basic invocation would be like the following (assuming your data lives in a file `1054800.html`):

    C:\>scrape file:///1054800.html .product-name .price
    Please help $20.00

So, armed with the two selectors, you can then turn from the command line tool to using the selectors with (for example) App::scrape:

    #!perl -w
    use strict;
    use App::scrape 'scrape';
    use LWP::Simple 'get';
    use Data::Dumper;

    my $html= get 'file:///1054800.html';

    my @info = scrape(
        $html,
        { product => '.product-name',
          price   => '.price',
        },
    );
    print Dumper \@info;

    __END__
    C:\>perl -w tmp.pl
    $VAR1 = [
      { 'product' => 'Please help', 'price' => '$20.00' }
    ];

Note that App::scrape assumes that your data is basically tabular. It does not cope well with data with a more complex structure, and especially not with the situation where a product has no price tag.
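For the missing-price case, a per-row lookup with Mojo::DOM (one of the modules named above) might look roughly like the sketch below. It is only an illustration: the `<table>`/`<tr>` wrapper, the second "No price" product and the `trim` helper are assumptions made up for this example and are not taken from the original page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical input: two rows shaped like the sample in the question.
# The second product deliberately has no price (assumption for illustration).
my $html = <<'HTML';
<table>
<tr>
  <td><h2 class="product-name"><a href="google.com" title="Please help ">Please help </a></h2></td>
  <td><div class="price-box"><span class="price">$20.00</span></div></td>
</tr>
<tr>
  <td><h2 class="product-name"><a href="google.com" title="No price ">No price </a></h2></td>
  <td><div class="price-box"></div></td>
</tr>
</table>
HTML

# Hypothetical helper: strip leading and trailing whitespace.
sub trim { my $s = shift; $s =~ s/^\s+|\s+$//g; return $s }

my @products;
for my $row (Mojo::DOM->new($html)->find('tr')->each) {
    # Skip rows that do not contain a product link at all.
    my $name = $row->at('h2.product-name a') or next;

    # at() returns undef when nothing matches, so a missing price is detectable.
    my $price = $row->at('div.price-box span.price');

    push @products, {
        title => trim($name->text),
        price => defined $price ? trim($price->text) : undef,
    };
}

# Print one product per line, with a placeholder for missing prices.
printf "%s\t%s\n", $_->{title}, $_->{price} // 'n/a' for @products;
```

Looking up the title and the price inside each row keeps them paired, so a product without a price does not shift the prices of the products after it.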
Re: How can i have the titles and the prices? by hdb (Monsignor) on Sep 19, 2013 at 08:28 UTC
An HTML parser is advisable. However, if you insist on using regexes (which can fail in many ways), two things. First, make them non-greedy by using `(.*?)`. Otherwise, they try to match the longest possible string. Second, instead of using dot, use a character class that excludes the character following the string you are looking for. So instead of `title="(.*)"` use `title="([^"]*)"`. This way you get everything up to the next quote.
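To make the difference between the greedy, non-greedy and character-class forms concrete, here is a small sketch using only core Perl. The sample lines are modelled on the HTML in the question, but they are shortened, and the extra `class="thumb"` attribute is an assumption added so that the greedy match visibly overshoots.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened sample line; class="thumb" is a made-up attribute for illustration.
my $line = '<a href="google.com" title="Please help " class="product-image">'
         . '<img src="x.jpg" alt="Please help " class="thumb" /></a>';

# Greedy: (.*) runs on to the LAST '" class' on the line.
my ($greedy) = $line =~ /title="(.*)" class/;

# Non-greedy: (.*?) stops at the FIRST '" class'.
my ($lazy) = $line =~ /title="(.*?)" class/;

# Negated character class: the capture cannot cross a double quote.
my ($negated) = $line =~ /title="([^"]*)"/;

# The same idea for the price: capture everything up to the next '<'.
my $price_line = '<span class="price">$20.00</span>  </span>';
my ($price) = $price_line =~ m{<span class="price">([^<]*)</span>};

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
print "price:   [$price]\n";
```

On this input the non-greedy and character-class patterns both capture `Please help ` (with its trailing space), while the greedy pattern swallows everything up to the second `class` attribute. This is still line-by-line matching, so it breaks as soon as an attribute or tag spans more than one line.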
Re: How can i have the titles and the prices? by Matt™ (Acolyte) on Sep 19, 2013 at 08:19 UTC
Use an HTML parser.
