PerlMonks  

Price finding...

by hodge-podge (Novice)
on Jun 08, 2009 at 17:25 UTC ( #769621=perlquestion )
hodge-podge has asked for the wisdom of the Perl Monks concerning the following question:

Alright, so I am trying to write a script that searches eBay and Amazon for a product the user enters and returns just the price.
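use strict;
use warnings;
use LWP 5.64;    # loads LWP::UserAgent
use URI;

my $browser = LWP::UserAgent->new;

# Read the product name from standard input and drop the trailing newline.
chomp( my $prod = <> );

# Let query_form do the escaping instead of interpolating into the URL.
my $url = URI->new('http://www.amazon.com/s/ref=nb_ss_gw');
$url->query_form(
    'url'            => 'search-alias=tools',
    'field-keywords' => $prod,
);

my $response = $browser->get($url);
die "Can't get $url: ", $response->status_line unless $response->is_success;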
This is all I have so far, as I am running into the problem of picking out just the price in the HTML. Any help would be greatly appreciated... Thanks.

Re: Price finding...
by Corion (Pope) on Jun 08, 2009 at 17:30 UTC

    Personally, I use Web::Scraper for extracting data from HTML. You might also find the WWW::Search modules to be of use. But in the long run, you will have to do some work yourself and use an HTML-parser to extract the data you want from the HTML.
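
As a rough illustration of the "use an HTML parser" route, here is a minimal sketch with HTML::TokeParser (from the HTML::Parser distribution). The sample markup, the `price` class name, and the `extract_prices` helper are assumptions made up for this example; real result pages use different markup and change without notice.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup, not copied from any real site.
my $html = <<'HTML';
<div class="result"><span class="title">Cordless drill</span>
<span class="price">$49.99</span></div>
<div class="result"><span class="title">Drill bits</span>
<span class="price">$1,204.50</span></div>
HTML

# Return the numeric part of every <span class="price"> element.
sub extract_prices {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new( \$doc )
        or die "Cannot create parser";
    my @prices;
    while ( my $tag = $parser->get_tag('span') ) {
        # $tag->[1] is the attribute hash of the start tag.
        my $class = $tag->[1]{class} // '';
        next unless $class eq 'price';
        my $text = $parser->get_trimmed_text('/span');
        # Keep digits and the decimal part, then drop thousands separators.
        if ( $text =~ /(\d[\d,]*(?:\.\d\d)?)/ ) {
            ( my $number = $1 ) =~ tr/,//d;
            push @prices, $number;
        }
    }
    return @prices;
}

print "$_\n" for extract_prices($html);
```

Selecting on a tag and class rather than running a regex over the whole page keeps the match tied to the document structure, though it still breaks whenever the site changes its markup.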

Re: Price finding...
by Your Mother (Canon) on Jun 08, 2009 at 17:48 UTC

    Don't. At least for Amazon.com. They have a fairly deep API into the stuff (you have to sign up for a dev token). This is new and nice -- URI::Amazon::APA; and there are a couple of other things on the CPAN to get into there.

    You get back XML for different search types and do whatever you want with it; info including used items, lowest prices, related items, reviews, ratings, whatever. It's not super easy (because there are so many options) but they have pretty good docs and it will always be more robust than HTML parsing. They also have an XSLT service, so you can even get your stuff back preformatted.
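
To give a feel for the "get XML back and do what you want with it" step, here is a small sketch using XML::LibXML. The response below is a simplified, hypothetical sample: the element names and the `lowest_prices` helper are assumptions for illustration, not the confirmed schema of Amazon's API, and a real response would also need authentication and probably namespace handling.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, simplified response; consult the API docs for the real schema.
my $xml = <<'XML';
<Items>
  <Item>
    <Title>Cordless drill</Title>
    <LowestNewPrice><FormattedPrice>$49.99</FormattedPrice></LowestNewPrice>
  </Item>
  <Item>
    <Title>Drill bits</Title>
    <LowestNewPrice><FormattedPrice>$12.50</FormattedPrice></LowestNewPrice>
  </Item>
</Items>
XML

# Map each item title to its formatted lowest price.
sub lowest_prices {
    my ($source) = @_;
    my $doc = XML::LibXML->load_xml( string => $source );
    my %price_for;
    for my $item ( $doc->findnodes('/Items/Item') ) {
        my $title = $item->findvalue('./Title');
        my $price = $item->findvalue('./LowestNewPrice/FormattedPrice');
        $price_for{$title} = $price;
    }
    return %price_for;
}

my %price_for = lowest_prices($xml);
printf "%-15s %s\n", $_, $price_for{$_} for sort keys %price_for;
```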

Re: Price finding...
by CountZero (Bishop) on Jun 08, 2009 at 18:02 UTC
    You are aware that the terms of use of Amazon include the following:
    This license does not include (...) any use of data mining, robots, or similar data gathering and extraction tools.
    And eBay does not allow you to
    bypass our robot exclusion headers or other measures we may use to prevent or restrict access to the sites.
    But eBay's robots.txt is as follows:
    ### BEGIN FILE ###
    #
    # allow-all
    #
    #
    # The use of robots or other automated means to access the eBay site
    # without the express permission of eBay is strictly prohibited.
    # Notwithstanding the foregoing, eBay may permit automated access to
    # access certain eBay pages but soley for the limited purpose of
    # including content in publicly available search engines. Any other
    # use of robots or failure to obey the robots exclusion standards set
    # forth at <http://www.robotstxt.org/wc/exclusion.html> is strictly
    # prohibited.
    #
    User-agent: eBay-crawler
    Disallow:
    User-agent: *
    Disallow: /disney/
    ### END FILE ###
    As can readily be seen, this robots.txt is really broken. Its comment text forbids web-scraping and permits only very limited crawling when permission is granted, but its rules merely disallow crawler access to the /disney/ pages (copyright and licensing issues, no doubt) for all web-crawlers, while the eBay-crawler can go anywhere. So there is an incompatibility between the text and the rules. I'd say that in this case the rules win over the text, since a robot is not required to read (and act upon) the text. Still, it is very sloppy of eBay to publish such a file. On the other hand, there is a purpose limitation built into the text: even if your "robot" is allowed access, it may only do so "for the limited purpose of including content in publicly available search engines", and all other use or automated access is forbidden. And there is something to be said for the argument that this is directed at a human and not at a computer. So there is a risk that your scraping of eBay is wrong, especially if you specifically target their site rather than stumble upon it as a "dumb" crawler is likely to do.
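
The gap between the comment text and the machine-readable rules can be checked with WWW::RobotRules, which only ever looks at the directives. This is a sketch under assumptions: the agent name `MyPriceBot/0.1`, the two test paths, and the `rules_for` helper are made up, and a blank line has been added between the two records as the exclusion standard expects.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Only the directive lines of the quoted file; parsers skip comment lines.
my $robots_txt = <<'ROBOTS';
User-agent: eBay-crawler
Disallow:

User-agent: *
Disallow: /disney/
ROBOTS

# Build the rule set as seen by a robot with the given agent name.
sub rules_for {
    my ($agent) = @_;
    my $rules = WWW::RobotRules->new($agent);
    $rules->parse( 'http://www.ebay.com/robots.txt', $robots_txt );
    return $rules;
}

my $generic = rules_for('MyPriceBot/0.1');    # hypothetical agent name
for my $url (
    'http://www.ebay.com/disney/index.html',  # hypothetical paths
    'http://www.ebay.com/some/listing',
  )
{
    printf "%-40s %s\n", $url,
        $generic->allowed($url) ? 'allowed' : 'disallowed';
}
```

A parser passing the rules says nothing about the terms of use; it only shows what a robot that never reads the comments would conclude.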

    Update: You may find it silly that these sites forbid you to scrape their web pages while publishing an API that gets you the same info (and even more) than you can scrape from your screen. They may have very good reasons to steer you towards the published API: it almost guarantees you get "good data", they can control who is allowed to use this data (you must apply for a user ID), and it is generally "cheaper" on their resources. So I advise you to get a licence to use the API and then use that.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Price finding...
by superfrink (Curate) on Jun 08, 2009 at 19:53 UTC

Node Type: perlquestion [id://769621]
Approved by ikegami