Hi All,
I'm trying to script command-line scraping of a vendor website hosted at my company. There are many levels of redirection to get through after login, and while Firefox and Chrome handle them, LWP ends up with a "Bad Request" response. After the "SetSessionVars.php" request (below), the server returns 400 Bad Request, whereas in the browser it successfully redirects to the home page. For the life of me I can't figure out what I'm not doing.
Here's my code:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Message;
use HTTP::Request::Common qw(GET POST);

# $authUser and $authPw are defined elsewhere in the script (not shown)
my $ua = LWP::UserAgent->new();
# follow redirects after POST as well as GET
push @{ $ua->requests_redirectable }, 'POST';
my $cookies = HTTP::Cookies->new(file => '/Users/jcabraham/.cookies.txt', autosave => 1, ignore_discard => 1);
$ua->cookie_jar($cookies);
$ua->default_header('Accept-Encoding' => scalar HTTP::Message::decodable());
# dump every request and response to produce the trace
$ua->add_handler("request_send", sub { shift->dump; return });
$ua->add_handler("response_done", sub { shift->dump; return });
# log off first, just to start clean
my $auth_response = $ua->request(GET "http://ap1492-dsr/LogOff.php");
# now log in
my $response = $ua->request(POST "http://ap1492-dsr/authenticate.php", [user => $authUser, password => $authPw, TimezoneOffset => 14400, submit => 'User Login']);
# scrape home page
$response = $ua->request(GET "http://ap1492-dsr/Welcome.php");
if ($response->is_success) {
    my $html = $response->decoded_content;
    print $html;
}
And here's the trace output from LWP:
macbook:scripts jcabraham$ link_aperio.pl 12 12
GET http://ap1492-dsr/LogOff.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342557122; DontShowDisclaimer80=1
Cookie2: $Version="1"
(no content)
HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:13 GMT
Pragma: no-cache
Location: Login.php
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:13 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5
(no content)
GET http://ap1492-dsr/Login.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342557122; DontShowDisclaimer80=1
Cookie2: $Version="1"
(no content)
HTTP/1.1 200 OK
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:13 GMT
Pragma: no-cache
Server: Apache
Content-Length: 5078
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
Link: <./CSS/masterstyle.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Link: <./CSS/blue.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Link: <./CSS/blueLogin.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Link: <./CSS/custom.css?11.1.1.760>; rel="stylesheet"; type="text/css"
Refresh: text/html
Set-Cookie: memory_limit=deleted; expires=Wed, 20-Jul-2011 20:23:12 GMT; path=/
Set-Cookie: PHPSESSID=1342729393; path=/
Set-Cookie: PHPSESSID=681877b8eaa1b7fd3a35cc9db713cfa7; path=/
Set-Cookie: PHPSESSID=1342557122; path=/; httponly
Title: Spectrum - Login
X-Powered-By: PHP/5.3.5
\r
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/loose.dtd"><html><head><meta content='text/html' http-equiv='refresh'>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><TITLE>Spectrum - Login</TITLE>
<link type='text/css' rel='stylesheet' href='./CSS/masterstyle.css?11.1.1.760'>
<script type='text/javascript' src='./Spectrum.js?11.1.1.760'> </script>
<script type='text/javascript' src='./Keyboard.js?11.1.1.760'> </script>
<script type='text/javascript' src='....
(+ 4566 more bytes not shown)
POST http://ap1492-dsr/authenticate.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Content-Length: 70
Content-Type: application/x-www-form-urlencoded
Cookie: PHPSESSID=1342557122; DontShowDisclaimer80=1
Cookie2: $Version="1"
user=jabraham&password=da!syd0g&TimezoneOffset=14400&submit=User+Login
HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: Disclaimer.php
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
Set-Cookie: PHPSESSID=1342729394; path=/
X-Powered-By: PHP/5.3.5
(no content)
GET http://ap1492-dsr/Disclaimer.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"
(no content)
HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: DetermineRole.php
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5
(no content)
GET http://ap1492-dsr/DetermineRole.php
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"
(no content)
HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: DetermineHierarchy.php?RoleId=102&HierarchyId=3
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5
(no content)
GET http://ap1492-dsr/DetermineHierarchy.php?RoleId=102&HierarchyId=3
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"
(no content)
HTTP/1.1 302 Found
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Thu, 19 Jul 2012 20:23:14 GMT
Pragma: no-cache
Location: ../SetSessionVars.php?RoleId=102&HierarchyId=3
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
X-Powered-By: PHP/5.3.5
(no content)
GET http://ap1492-dsr/../SetSessionVars.php?RoleId=102&HierarchyId=3
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: PHPSESSID=1342729394; DontShowDisclaimer80=1
Cookie2: $Version="1"
(no content)
HTTP/1.1 400 Bad Request
Connection: close
Date: Thu, 19 Jul 2012 20:23:15 GMT
Server: Apache
Content-Length: 286
Content-Type: text/html; charset=iso-8859-1
Client-Date: Thu, 19 Jul 2012 20:23:14 GMT
Client-Peer: 10.100.50.80:80
Client-Response-Num: 1
Title: 400 Bad Request
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache Server at ap1492-dsr Port 80</address>
</body></html>
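One detail in the trace that may be relevant: the last Location header is "../SetSessionVars.php?RoleId=102&HierarchyId=3", and the request LWP then sends is for "http://ap1492-dsr/../SetSessionVars.php", with the ".." still in the path. Browsers normally drop a ".." that would climb above the root, so this could be where the two diverge. Below is a minimal sketch, not a confirmed fix, that resolves the same Location value with the URI module both ways. The helper name resolve_location is made up for illustration; $URI::ABS_REMOTE_LEADING_DOTS is a documented setting of the URI module (spelled that way in the module itself). Whether this is really what the server objects to is an assumption I have not verified.

```perl
use strict;
use warnings;
use URI;

# Values copied from the trace: the URL that issued the last redirect
# and the Location header it returned.
my $base     = 'http://ap1492-dsr/DetermineHierarchy.php?RoleId=102&HierarchyId=3';
my $location = '../SetSessionVars.php?RoleId=102&HierarchyId=3';

# Hypothetical helper: resolve a relative Location against a base URL,
# optionally asking URI to drop ".." segments that climb above the root.
sub resolve_location {
    my ($loc, $base_url, $drop_excess_dots) = @_;
    local $URI::ABS_REMOTE_LEADING_DOTS = $drop_excess_dots ? 1 : 0;
    return URI->new($loc)->abs($base_url)->as_string;
}

print "default:       ", resolve_location($location, $base, 0), "\n";
print "dots stripped: ", resolve_location($location, $base, 1), "\n";
```

The first line printed should match the "/../" URL from the failing request in the trace, and the second should be the same URL without the "..". If that is the cause, setting $URI::ABS_REMOTE_LEADING_DOTS = 1 near the top of the script, before any requests are made, might let LWP follow the redirect the way the browsers do, since LWP resolves redirect targets through the same URI module. That part is untested against this server.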