Log In To guardian.co.uk with WWW::Mechanize

by Cody Pendant (Prior)
on May 28, 2005 at 00:56 UTC (#461269=perlquestion)
Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to log on to the website of UK newspaper The Guardian using WWW::Mechanize.

Here's the login form: http://users.guardian.co.uk/signin/0,12930,-5,00.html.

If you look at the page source you'll see some odd JavaScript at work: there are hidden fields which are hashed in some way on submission. It's all rather strange.

My code so far is this:

    use WWW::Mechanize;

    my $browser = WWW::Mechanize->new( cookie_jar => {}, autocheck => 1 );
    $browser->get('http://users.guardian.co.uk/signin/0,12930,-1,00.html');
    $browser->form_name('regpss1') || die "$!";
    $browser->set_fields(
        AU_LOGIN_ID => 'my login',
        AU_PASSWORD => 'my password'
    );
    $browser->submit() || die "$!";
    print $browser->content();

And this is what I get:

    Method Not Allowed
    The requested method POST is not allowed for the URL /mydetails/0,,,00.html

OK now we get to the disclosure part, and I'm rather embarrassed by this. I asked the question before (it was 2003 and I was asking about WWW::Automate) and apparently I got a working answer, but I got it via someone's scratchpad and have now mislaid it. See this node.

So, mea culpa but I need your help again!



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Re: Log In To guardian.co.uk with WWW::Mechanize
by merzy (Scribe) on May 28, 2005 at 02:03 UTC
    I don't have time to play with this tonight, sorry, but one of the first things I'd do is try the login in firefox with the "Live HTTP Headers" extension turned on. That might give some insight into what's going back and forth.
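The same kind of trace can also be taken from the Perl side, so that the script's traffic can be compared with the browser's. A minimal sketch, assuming a libwww-perl recent enough to provide add_handler (WWW::Mechanize inherits it from LWP::UserAgent); the handler bodies are only illustrative:

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $browser = WWW::Mechanize->new( cookie_jar => {}, autocheck => 0 );

# Print the method, URL and headers of every request just before it is sent.
$browser->add_handler( request_send => sub {
    my ($request) = @_;
    print STDERR $request->method, ' ', $request->uri, "\n",
                 $request->headers->as_string, "\n";
    return;    # returning nothing lets the request go out normally
});

# Print the status line and headers of every response, including the
# intermediate 301 responses that are otherwise followed silently.
$browser->add_handler( response_done => sub {
    my ($response) = @_;
    print STDERR $response->status_line, "\n",
                 $response->headers->as_string, "\n";
    return;
});
```

Any get() or submit() made through $browser after this point should then show each hop on STDERR.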
      Good call, should have thought of that sooner.

      OK this is what I get:

      http://users.guardian.co.uk/signin/tr/1,13542,-1,00.html

      POST /signin/tr/1,13542,-1,00.html HTTP/1.1
      Host: users.guardian.co.uk
      User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
      Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
      Accept-Language: en-us,en;q=0.5
      Accept-Encoding: gzip,deflate
      Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
      Keep-Alive: 300
      Connection: keep-alive
      Referer: http://users.guardian.co.uk/signin/0,12930,-1,00.html?AU_LOGIN_ID=myusername&AU_PASSWORD=%2D%2D%2D%2D%2D%2D%2D%2D&AU_PASSWORD_HASH=12f6c69cf906afb85852b32bc04e4c19&AU_CHALLENGE=1117250755&AU_CHALLENGE2=c486109c620b57c4bc69b4792179cdb9
      Cookie: GU_MU=UVdvQE44Q29AamtBQUR2T2VMWXxpV3RHNEZCQmhZeVIzbEI5dzlPUWdBPT0%3d; GU_LOCATION=YXVzOjU6dmk6NDpyaWNobW9uZDozOi0xOmJyb2FkYmFuZDotMzcuODMzOjE0NS4wMDBAOTAxOTgyNDIxMjQ4OTE1NTYyMjUzNTI0NzUxOTE0MzIwNjc0MjQz; CP=*; GU_ST=http%3A//www.guardian.co.uk/
      Content-Type: application/x-www-form-urlencoded
      Content-Length: 199

      AU_LOGIN_ID=myusername&AU_PASSWORD=--------&AU_KEEP_ME_SIGNED_IN=on&AU_PASSWORD_HASH=f67c849de72c3939d7169374f761ab9e&AU_CHALLENGE=1117250906&AU_CHALLENGE2=fd62bbf5c99827b9b738eac3cb566c35

      HTTP/1.x 301 Moved Permanently
      Date: Sat, 28 May 2005 03:29:00 GMT
      Server: Apache/1.3.33 (Unix)
      Set-Cookie: GU_ME=myusername; path=/; expires=Thu, 27 May 2010 03:29:02 GMT; domain=.guardian.co.uk
      Set-Cookie: GU_MI=mi%5Fi%3D872201%3Bmi%5Fp%3DCRE%2CTLK%2CBRF%2CMGU%3Bgu%5Fpk%3DCRE%2CTLK%2CMGU%3Bmi%5Fe%3D%21200505310329%3Bmi%5Fs%3Dba40d2702ddb6ca1d9f0eb8c61793554; path=/; expires=Thu, 27 May 2010 03:29:02 GMT; domain=.guardian.co.uk; httponly;
      Set-Cookie: GU_MY=200505280339:67f4730c3bbbccb2723f33abb5d3e922; path=/; expires=Sat, 28 May 2005 03:39:02 GMT; domain=users.guardian.co.uk; httponly;
      Location: /signin/status/tr/1,13608,-1,00.html
      Cache-Control: no-cache
      Pragma: no-cache
      Expires: 0
      Connection: close
      Transfer-Encoding: chunked
      Content-Type: text/html; charset=iso-8859-1
      ----------------------------------------------------------
      http://users.guardian.co.uk/signin/status/tr/1,13608,-1,00.html

      GET /signin/status/tr/1,13608,-1,00.html HTTP/1.1
      Host: users.guardian.co.uk
      User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
      Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
      Accept-Language: en-us,en;q=0.5
      Accept-Encoding: gzip,deflate
      Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
      Keep-Alive: 300
      Connection: keep-alive
      Referer: http://users.guardian.co.uk/signin/0,12930,-1,00.html?AU_LOGIN_ID=myusername&AU_PASSWORD=%2D%2D%2D%2D%2D%2D%2D%2D&AU_PASSWORD_HASH=12f6c69cf906afb85852b32bc04e4c19&AU_CHALLENGE=1117250755&AU_CHALLENGE2=c486109c620b57c4bc69b4792179cdb9
      Cookie: GU_MU=UVdvQE44Q29AamtBQUR2T2VMWXxpV3RHNEZCQmhZeVIzbEI5dzlPUWdBPT0%3d; GU_LOCATION=YXVzOjU6dmk6NDpyaWNobW9uZDozOi0xOmJyb2FkYmFuZDotMzcuODMzOjE0NS4wMDBAOTAxOTgyNDIxMjQ4OTE1NTYyMjUzNTI0NzUxOTE0MzIwNjc0MjQz; CP=*; GU_ST=http%3A//www.guardian.co.uk/; GU_ME=myusername; GU_MI=mi%5Fi%3D872201%3Bmi%5Fp%3DCRE%2CTLK%2CBRF%2CMGU%3Bgu%5Fpk%3DCRE%2CTLK%2CMGU%3Bmi%5Fe%3D%21200505310329%3Bmi%5Fs%3Dba40d2702ddb6ca1d9f0eb8c61793554; GU_MY=200505280339:67f4730c3bbbccb2723f33abb5d3e922

      HTTP/1.x 301 Moved Permanently
      Date: Sat, 28 May 2005 03:29:03 GMT
      Server: Apache/1.3.33 (Unix)
      Set-Cookie: GU_ME=myusername; path=/; expires=Thu, 27 May 2010 03:29:05 GMT; domain=.guardian.co.uk
      Set-Cookie: GU_MI=mi%5Fi%3D872201%3Bmi%5Fp%3DCRE%2CTLK%2CBRF%2CMGU%3Bgu%5Fpk%3DCRE%2CTLK%2CMGU%3Bmi%5Fe%3D%21200505310329%3Bmi%5Fs%3Dba40d2702ddb6ca1d9f0eb8c61793554; path=/; expires=Thu, 27 May 2010 03:29:05 GMT; domain=.guardian.co.uk; httponly;
      Set-Cookie: GU_ST=; path=/; domain=.guardian.co.uk
      Location: http://www.guardian.co.uk/
      Cache-Control: no-cache
      Pragma: no-cache
      Expires: 0
      Connection: close
      Transfer-Encoding: chunked
      Content-Type: text/html; charset=iso-8859-1
      ----------------------------------------------------------
      http://www.guardian.co.uk/

      At which point I'm taken to the front page and I'm logged in.



        So what do you think you should do now?
        Still no time to work on this, but I'm curious enough to poke at it every once in a while. Between different requests to the login page, here's what changes:
        [11:23am] eero:~/tmp/guardian: diff 0,12930,-1,00.html o
        236c236
        < <input type="hidden" name="AU_CHALLENGE" value="1117293798"><input type="hidden" name="AU_CHALLENGE2" value="af7fb54d3a917e272e2b7abe1353bd51"></form></table></td></tr></table>
        ---
        > <input type="hidden" name="AU_CHALLENGE" value="1117293788"><input type="hidden" name="AU_CHALLENGE2" value="59e3978f05fde8396395a576645cd04b"></form></table></td></tr></table>
        [11:23am] eero:~/tmp/guardian:
        ...and here's where in the page source the work is done:
        function preparePassword() {
            var form = document.regpss1;
            var dummy = '----------------------------------------';
            form.AU_PASSWORD_HASH.value = binl2hex(core_hmac_md5(form.AU_CHALLENGE2.value, form.AU_PASSWORD.value));
            form.AU_PASSWORD.value = dummy.substr(0, form.AU_PASSWORD.value.length);
            regpss_submitted = true;
            form.submit();
        }

        I'm guessing that you'll need to take your password, run it through that hashing sequence and then return that as the actual password in the post. Or something like that.
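If the hash does turn out to be needed, the JavaScript above could be mirrored in Perl roughly as follows. This is a sketch under assumptions: it assumes core_hmac_md5(key, data) is a standard HMAC-MD5 keyed with the AU_CHALLENGE2 value, and it uses the CPAN module Digest::HMAC_MD5. The function names password_hash and hashed_login_fields are made up for illustration, and the password is a placeholder.

```perl
use strict;
use warnings;
use Digest::HMAC_MD5 qw(hmac_md5_hex);

# Mirror of preparePassword(): HMAC-MD5 keyed with AU_CHALLENGE2, computed
# over the plain-text password, returned as lowercase hex.
# Note the argument order: hmac_md5_hex takes (data, key).
sub password_hash {
    my ($challenge2, $password) = @_;
    return hmac_md5_hex($password, $challenge2);
}

# The password-related fields as the JavaScript leaves them before submit:
# the hash, plus a run of dashes (at most 40) as long as the real password.
sub hashed_login_fields {
    my ($challenge2, $password) = @_;
    return (
        AU_PASSWORD_HASH => password_hash($challenge2, $password),
        AU_PASSWORD      => substr('-' x 40, 0, length $password),
    );
}

# Example values only: the challenge is copied from the diff above,
# the password is invented.
my %fields = hashed_login_fields('af7fb54d3a917e272e2b7abe1353bd51', 'secret');
print "$_=$fields{$_}\n" for sort keys %fields;
```

With WWW::Mechanize the challenge would be read from the freshly fetched form, for example with $browser->value('AU_CHALLENGE2') after form_name('regpss1'), since the diff shows it changing on every request.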

        I'm surprised nobody's done this yet.
Re: Log In To guardian.co.uk with WWW::Mechanize
by DaWolf (Curate) on May 28, 2005 at 14:57 UTC
    Notice that your script is somehow sending you to a page in "mydetails", which doesn't happen with the normal login process (take a look at the Live HTTP Headers log: at no point are you redirected to any page in the mydetails directory).

    So, for some reason, you are trying to POST your data to a page that doesn't support the POST method, which I think is some kind of "protection" against this sort of scripted access.

    I've tried playing with WWW::Mechanize once and I found that while I could easily use it to send an email from my own site's contact page, I couldn't use it to manipulate other sites.
Re: Log In To guardian.co.uk with WWW::Mechanize
by Adrade (Pilgrim) on May 31, 2005 at 03:47 UTC
    The problem, as I see it, is that I think the HTTP standard calls for requests redirected via Location: to use the same method as the original call (if a POSTed page redirects to another, that page should also be POSTed). This, I think, is what WWW::Mechanize does, but it is not what Firefox and other standard browsers do, and the site developers are relying on the browser behaviour (even though they should be using a 303 See Other status, not a 301).

    You're hitting an error because autocheck is turned on: when Mechanize POSTs to a page that expects a GET, it checks whether the request worked, realizes that it didn't, and it all goes ka-ploowey.

    What you want to do is load up the cookie_jar with the authentication information, then request the particular pages you're looking for. So authenticate yourself, as you wonderfully did (but with autocheck off), then go ahead and request the user-specific page you wish to pull data from. For instance, this modification of your code will authenticate you, and then pull up the 'mydetails' page:
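A minimal sketch of that approach (not the code originally posted here), assuming the form and field names from the question. The 'mydetails' URL is an assumption pieced together from the error message, and the GUARDIAN_USER and GUARDIAN_PASS environment variables are hypothetical placeholders for the credentials:

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $login_url   = 'http://users.guardian.co.uk/signin/0,12930,-1,00.html';
# Assumed: host from the login URL, path from the "Method Not Allowed" message.
my $details_url = 'http://users.guardian.co.uk/mydetails/0,,,00.html';

sub login_and_fetch {
    my ($user, $pass) = @_;

    # autocheck off: a failed request returns a response instead of dying
    my $browser = WWW::Mechanize->new( cookie_jar => {}, autocheck => 0 );

    $browser->get($login_url);
    die "Cannot load login page: ", $browser->status, "\n"
        unless $browser->success;

    $browser->form_name('regpss1') or die "Login form 'regpss1' not found\n";
    $browser->set_fields( AU_LOGIN_ID => $user, AU_PASSWORD => $pass );

    # The POST is what sets the authentication cookies. The page it gets
    # redirected to may answer "Method Not Allowed"; that is ignored here.
    $browser->submit();

    # Now ask for the page we actually want, as a plain GET.
    $browser->get($details_url);
    die "Fetch failed: ", $browser->status, "\n" unless $browser->success;
    return $browser->content;
}

# Only talk to the site when credentials have been supplied.
if ( defined $ENV{GUARDIAN_USER} and defined $ENV{GUARDIAN_PASS} ) {
    print login_and_fetch( $ENV{GUARDIAN_USER}, $ENV{GUARDIAN_PASS} );
}
```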


    Now, there's no reason to parse all that funky JavaScript. Lots of folks have JS turned off in their browsers, and if the Guardian didn't allow these people to browse the site, it would be losing a good portion of its readers. All that JS hashing is for added security, but isn't required, as the above code demonstrates.

    I hope this helps - I mean given your request, I think this is what you're looking for. And you should give yourself a pat on the back - you were like 98% right already!

    Best,
      -Adam
      Thanks for that. Interesting stuff, and thanks for the encouragement. I promise not to lose the code and come back and ask again in another two years.


