PerlMonks  

LWP fails where browser succeeds?

by jcabraham (Novice)
on Jul 19, 2012 at 20:26 UTC (#982713=perlquestion)
jcabraham has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I'm trying to script command-line scraping of a vendor website hosted at my company. There are many levels of redirection to go through after login, and while Firefox and Chrome handle them, LWP ends up with a "Bad Request" response. After the "SetSessionVars.php" request (below), the server returns 400 Bad Request, whereas in the browser the same request successfully redirects to the home page. For the life of me I can't figure out what I'm not doing. Here's my code:

# The "use" lines were not shown in the original post; they are the
# modules this snippet relies on. $authUser and $authPw are set elsewhere.
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Message;
use HTTP::Request::Common qw(GET POST);

my $ua = LWP::UserAgent->new();
push @{ $ua->requests_redirectable }, 'POST';

my $cookies = HTTP::Cookies->new(
    file           => '/Users/jcabraham/.cookies.txt',
    autosave       => 1,
    ignore_discard => 1,
);
$ua->cookie_jar($cookies);
$ua->default_header('Accept-Encoding' => scalar HTTP::Message::decodable());

# Dump every request and response for debugging.
$ua->add_handler("request_send",  sub { shift->dump; return });
$ua->add_handler("response_done", sub { shift->dump; return });

# log off first, just start clean
my $auth_response = $ua->request(GET "http://ap1492-dsr/LogOff.php");

# now login
my $response = $ua->request(POST "http://ap1492-dsr/authenticate.php",
    [user => $authUser, password => $authPw, TimezoneOffset => 14400, submit => 'User Login']);

# scrape home page
$response = $ua->request(GET "http://ap1492-dsr/Welcome.php");
if ($response->is_success) {
    my $html = $response->decoded_content;
    print $html;
}

And here's the trace output from LWP:

macbook:scripts jcabraham$ link_aperio.pl 12 12
GET http://ap1492-dsr/LogOff.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342557122; DontShowDisclaimer80=1
Cookie2: $Version="1"

(no content)

HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:13 GMT
Pragma: no-cache
Location: Login.php
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:13 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5

(no content)

GET http://ap1492-dsr/Login.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342557122; DontShowDisclaimer80=1
Cookie2: $Version="1"

(no content)

HTTP/1.1 200 OK
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:13 GMT
Pragma: no-cache
Server: Apache
Content-Length: 5078
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
Link: <./CSS/masterstyle.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Link: <./CSS/blue.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Link: <./CSS/blueLogin.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Link: <./CSS/custom.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Refresh: text/html
Set-Cookie: memory_limit=deleted; expires=Wed, 20-Jul-2011 20:23:12 GMT; path=/
Set-Cookie: PHPSESSID=1342729393; path=/
Set-Cookie: PHPSESSID=681877b8eaa1b7fd3a35cc9db713cfa7; path=/
Set-Cookie: PHPSESSID=1342557122; path=/; httponly
Title: Spectrum - Login
X-Powered-By: PHP/5.3.5

\r
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/loose.dtd"><html><head><meta content='text/html' http-equiv='refresh'>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><TITLE>Spectrum - Login</TITLE>
<link type='text/css' rel='stylesheet' href='./CSS/masterstyle.css?11.1.1.760'>
<script type='text/javascript' src='./Spectrum.js?11.1.1.760'> </script>
<script type='text/javascript' src='./Keyboard.js?11.1.1.760'> </script>
<script type='text/javascript' src='....
(+ 4566 more bytes not shown)

POST http://ap1492-dsr/authenticate.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Content-Length: 70
Content-Type: application/x-www-form-urlencoded
Cookie: PHPSESSID=1342557122; DontShowDisclaimer80=1
Cookie2: $Version="1"

user=jabraham&password=da!syd0g&TimezoneOffset=14400&submit=User+Login

HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: Disclaimer.php
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
Set-Cookie: PHPSESSID=1342729394; path=/
X-Powered-By: PHP/5.3.5

(no content)

GET http://ap1492-dsr/Disclaimer.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"

(no content)

HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: DetermineRole.php
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5

(no content)

GET http://ap1492-dsr/DetermineRole.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"

(no content)

HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: DetermineHierarchy.php?RoleId=102&HierarchyId=3
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5

(no content)

GET http://ap1492-dsr/DetermineHierarchy.php?RoleId=102&HierarchyId=3
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"

(no content)

HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: ../SetSessionVars.php?RoleId=102&HierarchyId=3
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5

(no content)

GET http://ap1492-dsr/../SetSessionVars.php?RoleId=102&HierarchyId=3
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"

(no content)

HTTP/1.1 400 Bad Request
Connection: close
Date: Thu, 19 Jul 2012 20:23:15 GMT
Server: Apache
Content-Length: 286
Content-Type: text/html; charset=iso-8859-1
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
Title: 400 Bad Request

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache Server at ap1492-dsr Port 80</address>
</body></html>

Re: LWP fails where browser succeeds?
by davido (Archbishop) on Jul 19, 2012 at 21:38 UTC

    Do you happen to know what role JavaScript plays in this process? Is it possible that it's altering the browser's request? If so, the page's JavaScript would never be executed by your JavaScript-unaware user agent. This is a pretty common issue.

    Just something to consider. This could be one of those situations where WWW::Mechanize::Firefox comes in handy. If only a true browser can come up with the correct scrape, just let Perl manipulate the browser for you.


    Dave

Re: LWP fails where browser succeeds?
by patcat88 (Deacon) on Jul 19, 2012 at 22:11 UTC
Re: LWP fails where browser succeeds?
by Anonymous Monk on Jul 20, 2012 at 00:18 UTC
Re: LWP fails where browser succeeds?
by Gangabass (Priest) on Jul 20, 2012 at 02:41 UTC

    Try the same thing in your browser but with JavaScript disabled (I recommend clearing cookies too).

    Some sites set cookies via JavaScript, and if your agent doesn't execute JavaScript you'll get a "Bad Request" error.

    I use the HttpFox extension for Firefox for this: just record the session, then compare the cookies sent to you by the site with the cookies sent by your browser on the next request.

    In any case, you need to send exactly the same request to the target site (I mean button coordinates, empty form fields, etc.) to get it to work.

Re: LWP fails where browser succeeds?
by ckj (Chaplain) on Jul 20, 2012 at 06:19 UTC
    I have never used LWP, nor would I suggest that anyone use it; I always prefer WWW::Mechanize. People here are pointing you to WWW::Mechanize::Firefox, but if you can find out the keys the site expects (contents, params and so on), this can easily be done with the plain Mechanize module. WWW::Mechanize::Firefox needs MozRepl installed in Firefox, and if the site is not compatible with Firefox there will be an issue. So it is better to find the keys using Firebug and then use those keys in your code with the Mechanize module to get the output. For a better explanation, let me know the site URL and what you want to fetch from it.
Re: LWP fails where browser succeeds?
by tobyink (Abbot) on Jul 20, 2012 at 06:28 UTC

    Your server is almost certainly objecting to this weird URL:

    GET http://ap1492-dsr/../SetSessionVars.php?RoleId=102&HierarchyId=3

    Note the superfluous ../. My copy of Apache returns HTTP 400 Bad Request responses for requests with a leading ../. I notice that the reason LWP is requesting this URL is that it receives an HTTP 302 Found response redirecting to it:

    Location: ../SetSessionVars.php?RoleId=102&HierarchyId=3

    Did you know that strictly speaking the Location header is supposed to contain an absolute URL, not a relative URI reference? Fixing the server to always provide correct absolute URLs in the Location header should solve the issue.

    A workaround could be to set $URI::ABS_REMOTE_LEADING_DOTS to 1, because LWP uses the URI library to resolve relative URI references.
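The effect of that flag can be checked without touching the server. The sketch below resolves the `Location` value from the trace against the URL that returned it, once with URI's default behaviour and once with `$URI::ABS_REMOTE_LEADING_DOTS` set. The variable names `$base`, `$location`, `$default` and `$stripped` are only illustrative, and the example assumes the `URI` distribution that LWP already depends on. Browsers most likely succeed because they drop a ".." that would climb above the root when resolving a relative reference, as RFC 3986 section 5.2.4 describes.

```perl
use strict;
use warnings;
use URI;

# Values copied from the trace in the question.
my $base     = 'http://ap1492-dsr/DetermineHierarchy.php?RoleId=102&HierarchyId=3';
my $location = '../SetSessionVars.php?RoleId=102&HierarchyId=3';

# Default behaviour: the excess ".." survives resolution,
# which is the URL LWP requested in the trace.
my $default = URI->new($location)->abs($base)->as_string;

# With the flag set, URI discards leading dot segments.
# "local" keeps the change confined to this block.
my $stripped = do {
    local $URI::ABS_REMOTE_LEADING_DOTS = 1;
    URI->new($location)->abs($base)->as_string;
};

print "default:  $default\n";
print "stripped: $stripped\n";
```

In the script from the question, a plain `$URI::ABS_REMOTE_LEADING_DOTS = 1;` placed before the first `$ua->request(...)` call should be enough, since the setting is a package global that LWP's redirect handling picks up through URI.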

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      That was indeed the problem. Thanks for such a great catch!
Re: LWP fails where browser succeeds?
by Ransom (Beadle) on Jul 20, 2012 at 13:32 UTC

    When posting long sections, please use the readmore tags. It really helps to clean up browsing the SoPW section.

    Thanks!
