
How can i have the titles and the prices?

by faozhi (Acolyte)
on Sep 19, 2013 at 08:05 UTC ( #1054800=perlquestion )
faozhi has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have the HTML below from a page and I would like to extract the title and the price, but it is taking me a very long time.
This is a sample of the HTML file I am parsing:
    <td width="135px">
      <a href="google.com" title="Please help " class="product-image"><img src="google.com/blabla.jpg" width="135" height="180" alt="Please help " /></a></td>
    <td valign="top">
      <div class="category-description">
        <h2 class="product-name"><a href="google.com" title="Please help ">Please help </a></h2>
        <strong>Doodle Thomson </strong> <br/>
        At a time when many people are attempting to relate current events and trends in the world to interpretations of the prophecies contained in the Book of Revelations, and the writings of Nostradamus, and the predictions of fashionable clairvoyants, the author does much the same
      </div>
    </td>
    <td width="30%">
      <div class="categoty_price">
        <div class="price-box">
          <span class="regular-price" id="product-price-1139">
            <span class="price">$20.00</span>
          </span>
I need the title (Please help) and the price ($20.00). It is a long HTML file with many more of these.
Please help. This is the code I have so far, but it is failing me...
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    if (@ARGV != 1 ) {
        die "./quick.book.parsing.pl <html file>\n";
    }

    #open html file
    open(HTML,$ARGV[0]) || die "Couldn't open file $ARGV[0]\n";

    while (my $html = <HTML>) {
        next if $html =~ /<button/;
        chomp $html;
        if ($html =~m/title="(.*)" class/g) {
            my @columns =~ split (/\"/, $html);
            print "$columns[5]";
        }
    }

    #{
    #    next if $html =~ /<button/;
    #    chomp $html;
    #    if ($html=~ m/title="(.*)" class/g)
    #{
    #        $columns[5] =~ s/^\s+|\s+$//g;
    #        print "$columns[5]---";
    #}
    #
    #    if ($html=~ m/<span class="price">(.*)<\/span>/g)
    #{
    #        $html =~ s/^\s+|\s+$//g;
    #        print "$html\n";
    #}
    #}
    close HTML;
    exit;

Re: How can i have the titles and the prices?
by Matt™ (Acolyte) on Sep 19, 2013 at 08:19 UTC
    Use an HTML Parser
Re: How can i have the titles and the prices?
by tobyink (Abbot) on Sep 19, 2013 at 08:25 UTC

    Don't parse HTML with regexps. Use an HTML parser.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use HTML::HTML5::Parser;
    use XML::LibXML::QuerySelector;

    my $input = shift or die("Usage: $0 filename\n");
    my $html = HTML::HTML5::Parser->load_html(location => $input);

    print $html->querySelector('h2.product-name a')->textContent, "\n";
    print $html->querySelector('div.price-box span.price')->textContent, "\n";

    Output is:

    Please help
    $20.00
Re: How can i have the titles and the prices?
by hdb (Parson) on Sep 19, 2013 at 08:28 UTC

    An HTML parser is advisable. However, if you insist on using regexes (which can fail in many ways) two things:

    1. Make them non-greedy by using (.*?). Otherwise they match the longest possible string.
    2. Instead of the dot, use a character class that excludes the character following the string you are looking for. So instead of title="(.*)" use title="([^"]*)". This way you get everything up to the next quote.
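
To see the difference between the three patterns side by side, here is a minimal sketch using only core Perl. The sample line and the sub names (title_greedy, title_lazy, title_negated, prices) are made up for illustration, and the usual caveat still applies: regexes over HTML remain fragile.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy: (.*) grabs as much as it can, so it ends at the LAST '" class' on the line.
sub title_greedy  { my ($s) = @_; my ($t) = $s =~ /title="(.*)" class/;  return $t }

# Non-greedy: (.*?) grabs as little as it can, so it ends at the FIRST '" class'.
sub title_lazy    { my ($s) = @_; my ($t) = $s =~ /title="(.*?)" class/; return $t }

# Negated class: [^"]* cannot cross a double quote, so it ends at the closing quote.
sub title_negated { my ($s) = @_; my ($t) = $s =~ /title="([^"]*)"/;     return $t }

# With /g in list context, one capture group returns every match on the line.
sub prices {
    my ($s) = @_;
    return $s =~ /<span class="price">\s*([^<]*?)\s*<\/span>/g;
}

# Hypothetical sample line with two '" class' occurrences, modeled on the question's HTML.
my $line = '<a href="google.com" title="Please help " class="product-image">'
         . '<img alt="x" class="thumb" /></a>';

print "greedy:  [", title_greedy($line),  "]\n";
print "lazy:    [", title_lazy($line),    "]\n";
print "negated: [", title_negated($line), "]\n";
print "price:   [$_]\n" for prices('<span class="price">$20.00</span>');
```

The greedy version swallows everything up to the second class attribute, while the other two stop at the end of the title. The negated class is usually the safer of the two because it does not depend on what follows the closing quote.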

Re: How can i have the titles and the prices?
by Corion (Pope) on Sep 19, 2013 at 09:12 UTC

    If the page is simple enough, CSS selectors make working with an HTML parser a bit less daunting. Several modules implement this approach, such as Mojo::DOM, Web::Magic and Web::Query. Others, like Web::Scraper, provide a bit more scaffolding around running data extraction.

    App::scrape is a minimalistic scraper that implements the two steps of 1) fetching an HTML page and 2) extracting data according to CSS selectors. The basic invocation would look like the following (assuming your data lives in a file 1054800.html):

    C:\>scrape file:///1054800.html .product-name .price
    Please help $20.00

    So, armed with the two selectors, you can then turn from the command line tool to using the selectors with (for example) App::scrape:

    #!perl -w
    use strict;
    use App::scrape 'scrape';
    use LWP::Simple 'get';
    use Data::Dumper;

    my $html= get 'file:///1054800.html';

    my @info = scrape(
        $html,
        {
            product => '.product-name',
            price => '.price',
        },
    );

    print Dumper \@info;

    __END__
    C:\>perl -w tmp.pl
    $VAR1 = [
              {
                'product' => 'Please help',
                'price' => '$20.00'
              }
            ];

    Note that App::scrape assumes that your data is basically tabular. It does not cope well with data that has a more complex structure, and especially not with a product that has no price tag.
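
If a missing price is a real possibility, one option is to walk the document product by product and look up the title and price inside each product's own container, so a missing price cannot shift the columns. Below is a rough sketch with Mojo::DOM (one of the modules mentioned above). It assumes every product sits in its own <tr>, which the sample in the question does not actually show, so adjust the 'tr' selector to whatever wraps one product in the real page. The names extract_products and trim are hypothetical helpers, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Strip leading and trailing whitespace from a string.
sub trim { my ($s) = @_; $s =~ s/^\s+|\s+$//g; return $s }

# Returns one hashref per product row; price is undef when the row has none.
sub extract_products {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    my @products;
    # ASSUMPTION: one <tr> per product (not visible in the question's sample).
    for my $row ($dom->find('tr')->each) {
        my $title = $row->at('h2.product-name a') or next;   # skip rows without a title
        my $price = $row->at('span.price');                  # undef if no price tag
        push @products, {
            title => trim($title->text),
            price => $price ? trim($price->text) : undef,
        };
    }
    return @products;
}

# Made-up sample: the second product has no price.
my $sample = <<'HTML';
<table>
  <tr><td><h2 class="product-name"><a href="#">Please help </a></h2></td>
      <td><span class="price">$20.00</span></td></tr>
  <tr><td><h2 class="product-name"><a href="#">No price here</a></h2></td></tr>
</table>
HTML

for my $p (extract_products($sample)) {
    printf "%s: %s\n", $p->{title}, $p->{price} // 'no price';
}
```

Scoping the price lookup to the row is what keeps titles and prices paired up; two independent flat lists of titles and prices would silently drift out of step at the first product without a price.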
