PerlMonks  

LWP::UserAgent gets 403 when other browser does not

by pileofrogs (Priest)
on Jul 01, 2010 at 19:56 UTC ( [id://847604]=perlquestion )

pileofrogs has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks!

I'm writing a link checker, and I've noticed that I get 403 errors for some pages with LWP::UserAgent, but when I try the same URLs from Firefox I get the content with no problem. I'm making $ua->get($uri) and $ua->head($uri) requests.

My first thought was that LWP::UserAgent was doing some kind of robots.txt handling without telling me, but there's LWP::RobotUA for that, so I assume that's not the case. (Is my assumption wrong?)

Does anyone know a reason why certain web pages would return a status 403 to LWP::UserAgent, but work fine with a normal web browser?

Thanks!

--Pileofrogs

Re: LWP::UserAgent gets 403 when other browser does not
by ikegami (Patriarch) on Jul 01, 2010 at 20:00 UTC

    Most likely because required cookies are missing or because of some robot detection being done by the server.

    Match your request more closely with what Firefox sends, possibly going as far as using Mozilla to do the request.
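One way to bring the request closer to what a browser sends is to give LWP::UserAgent a cookie jar and a few browser-like headers. The following is only a sketch: the helper name `browser_like_ua`, the header values, and the example URL handling are assumptions for illustration, and the real values are best copied from the headers your own Firefox sends.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Hypothetical helper: build a user agent that looks more like a browser.
sub browser_like_ua {
    my $ua = LWP::UserAgent->new(
        # In-memory cookie jar, so cookies set by the server are sent back.
        cookie_jar => HTTP::Cookies->new,
        timeout    => 30,
    );

    # Illustrative values only; copy the real ones from your own browser.
    $ua->agent('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6');
    $ua->default_header('Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
    $ua->default_header('Accept-Language' => 'en-us,en;q=0.5');

    return $ua;
}

my $ua = browser_like_ua();

# Only touch the network when URLs are given on the command line.
for my $uri (@ARGV) {
    my $response = $ua->get($uri);
    print "$uri: ", $response->status_line, "\n";
}
```

If a page still answers 403 with these settings, comparing the full request against what the browser sends is the next step.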

Re: LWP::UserAgent gets 403 when other browser does not
by pemungkah (Priest) on Jul 01, 2010 at 23:42 UTC
    WWW::Mechanize supports user agent aliases right out of the box. I too lean toward robot detection via signature, and suggest either Firefox or IE as your most-likely-to-be-accepted alternatives.
      So does LWP::UserAgent. The only difference is that WWW::Mechanize provides these aliases:
      'Windows IE 6' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
      'Windows Mozilla' => 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6',
      'Mac Safari' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/85 (KHTML, like Gecko) Safari/85',
      'Mozilla' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.4a) Gecko/20030401',
      'Linux Mozilla' => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624',
      'Linux Konqueror' => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
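With plain LWP::UserAgent the same effect can be had by passing the full string to `agent`. The sketch below keeps two of the alias strings quoted above in a local hash; the hash name `%alias` is an assumption for illustration and not part of either module.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Two of the alias strings from the list above, kept in a local hash.
# %alias is a name made up for this example.
my %alias = (
    'Windows IE 6'  => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Linux Mozilla' => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624',
);

my $ua = LWP::UserAgent->new;

# LWP::UserAgent takes the full User-Agent string rather than an alias name.
$ua->agent($alias{'Linux Mozilla'});

print $ua->agent, "\n";
```

With WWW::Mechanize the equivalent call would be `$mech->agent_alias('Linux Mozilla')`.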
Re: LWP::UserAgent gets 403 when other browser does not
by pileofrogs (Priest) on Jul 26, 2010 at 16:33 UTC

    Thanks all!

    I decided that I would try multiple times with different settings before declaring a link bad. Some servers don't handle HEAD requests, others need a browser-like User-Agent, and some want a Referer header.

    This makes the whole thing slower, but that doesn't matter when it runs automatically at night.
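A retry strategy along these lines could be sketched as follows. This is not the code actually used for the link checker: the helper name `check_link`, the order of the attempts, the browser string, and the choice of sending the page's own URL as the Referer are all assumptions made for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper: try progressively more browser-like requests and
# return the first successful response, or the last failed one.
sub check_link {
    my ($ua, $uri) = @_;

    # Assumed browser string; any of the aliases from this thread would do.
    my $browser = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624';

    my @attempts = (
        { method => 'HEAD' },                                       # cheapest first
        { method => 'GET' },                                        # server may reject HEAD
        { method => 'GET', agent => $browser },                     # server may sniff the agent
        { method => 'GET', agent => $browser, referer => $uri },    # server may want a Referer
    );

    my $response;
    for my $try (@attempts) {
        my $request = HTTP::Request->new($try->{method} => $uri);
        $request->header('User-Agent' => $try->{agent})   if $try->{agent};
        $request->header('Referer'    => $try->{referer}) if $try->{referer};

        $response = $ua->request($request);
        return $response if $response->is_success;
    }
    return $response;
}

# Only touch the network when URLs are given on the command line.
if (@ARGV) {
    my $ua = LWP::UserAgent->new(timeout => 30);
    for my $uri (@ARGV) {
        my $response = check_link($ua, $uri);
        print "$uri: ", $response->status_line, "\n";
    }
}
```

Ordering the attempts from cheapest to most elaborate keeps the common case fast, so only the troublesome links pay for the extra requests.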

    Thanks again!

    --Pileofrogs
