
Scrappy user_agent error

by docster (Novice)
on Jan 03, 2012 at 16:32 UTC ( #946090=perlquestion )
docster has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I am attempting to write a web crawler with Scrappy. All I can find are posts on how easy it is to do, but not really how to do it... To me, it is like swimming through mud. I seem to have one issue after another with it. I thought maybe I had some corrupt Perl modules, so I tested it on a couple of different machines, Mac and Ubuntu Linux; they act the same. In this script it is the user_agent. This is pretty much directly from CPAN. What am I missing? ( Posts telling me to stop breathing and die will be ignored. ):
    #!/opt/local/bin/perl
    use strict;
    use warnings;
    use Scrappy qw/:syntax/;

    user_agent random_ua;

    my $url     = '';
    my $scraper = Scrappy->new;
    $scraper->get("$url");
    print $scraper->domain;    # print

    __END__

This script returns:

    Can't locate object method "user_agent" via package "random_ua"
    (perhaps you forgot to load "random_ua"?) at ./ line 5.
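For what it's worth, the error message itself points at Perl's indirect-object parsing: when user_agent is not an imported subroutine, user_agent random_ua; is compiled as the class-method call random_ua->user_agent. A minimal sketch reproducing the same class of error (Some::Package is a hypothetical, never-loaded package, not anything from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "method Bareword;" where "method" is not a known subroutine is
# parsed via indirect-object syntax as Bareword->method. Since
# Some::Package is never loaded, the call dies at run time with
# the same "Can't locate object method ... (perhaps you forgot
# to load ...?)" message the script above produced.
eval { no_such_method Some::Package; 1 } or print $@;
```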

Replies are listed 'Best First'.
Re: Scrappy user_agent error
by marto (Archbishop) on Jan 03, 2012 at 16:43 UTC

    Where are you copying and pasting your code from? What do you expect user_agent random_ua; to do?

      Some sites I visit automatically block robots. I was under the impression that this would change the default "Browser" id... Am I wrong?
      From CPAN: The user_agent attribute holds the Scrappy::Scraper::UserAgent object which is used to set and manipulate the user-agent header of the scraper.

          use Scrappy qw/:syntax/;
          user_agent random_ua;

      or

          user_agent random_ua 'firefox';          # firefox only
          user_agent random_ua 'firefox', 'linux'; # firefox on linux only

        Could it be that you are using a recent Scrappy (0.9xxx) but reading the documentation for an older version (like 0.6xxx)? I could find code like "qw/:syntax/" only in older documentation and in example scripts on the web (with a quick Google search).

        Have you checked (in their terms of use) that the sites that "automatically block robots" allow scraping? It would be pretty unusual to block robots and allow scraping!

        True laziness is hard work
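        On the robots point above: the libwww-perl suite ships WWW::RobotRules for honoring a site's robots.txt before crawling. A small sketch, with the robots.txt body inlined for illustration rather than fetched (example.com and the bot name are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;    # part of the libwww-perl suite

# Build a rule set for our (hypothetical) bot name, feed it a
# robots.txt body, then ask whether specific URLs may be fetched.
my $rules = WWW::RobotRules->new('MyExampleBot/0.1');

my $robots_txt = <<'TXT';
User-agent: *
Disallow: /private/
TXT

$rules->parse('http://example.com/robots.txt', $robots_txt);

print $rules->allowed('http://example.com/index.html')
    ? "index: allowed\n" : "index: blocked\n";
print $rules->allowed('http://example.com/private/secret.html')
    ? "private: allowed\n" : "private: blocked\n";
```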
Re: Scrappy user_agent error
by Anonymous Monk on Jan 04, 2012 at 00:12 UTC
Re: Scrappy user_agent error
by docster (Novice) on Jan 06, 2012 at 16:58 UTC
    I am not trying to do anything malicious or hammer sites. All I really wanted to do was download the Alabama city list from Wikipedia, once, and parse it correctly :o)

    I decided to do it as a learning experience in Perl web scraping. But if you connect to Wikipedia with Web::Scraper, it refuses the connection with "bad host name" or "invalid user agent", etc. Scrappy was supposed to let you tweak the user_agent, which is why I chose that package, but so far no one really knows how... I could have easily copied and pasted the information long before now, but that is not as challenging, time consuming, or fun. I enjoy solving challenges with Perl. It is truly the workhorse of the Internet.

    Thanks for all the tips. I may look into some of the other examples posted here. Scrappy looks promising but I think I need to work with an established method rather than an emerging one at this point. :)
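    As one established route, plain LWP::UserAgent already lets you set the agent string directly when fetching a page. A sketch under that assumption; the agent name and URL below are placeholders, not anything from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# A user agent with an explicit agent string and a timeout; some
# servers reject requests whose agent header looks like a default
# library string, so identifying your bot plainly can help.
my $ua = LWP::UserAgent->new(
    agent   => 'MyListFetcher/0.1 (one-off polite fetch)',
    timeout => 30,
);

my $res = $ua->get('http://example.com/');
if ( $res->is_success ) {
    print $res->decoded_content;
}
else {
    die "Fetch failed: ", $res->status_line, "\n";
}
```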

Node Type: perlquestion [id://946090]
Approved by marto