PerlMonks  

Re: Secure Site Login with Mechanize

by naikonta (Curate)
on May 22, 2007 at 04:43 UTC


in reply to Secure Site Login with Mechanize

Hi lazibowel

Scraping login pages is rarely as straightforward as we'd like, and it's often tricky. It also differs from one site to another. I think there are four things to consider:

  • Cookies. Since you use WWW::Mechanize this shouldn't be much of a problem, because the module initializes an empty cookie jar by default (overriding the LWP::UserAgent default).
  • SSL connection. This doesn't seem to be your problem either, since you said you have Crypt::SSLeay installed and your script works with Yahoo!
  • Redirection. Most sites perform a few redirections before reaching the final URL that does the actual authentication.
  • JavaScript. Some sites return a page containing JavaScript code that stores the URL to be fetched in the next step. The site my program targets is one such example. Another variation is probably a hidden (i)frame.
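To see the redirection point in practice, here is a minimal sketch (not taken from the original script) that lists every URL a request passed through. The helper name redirect_chain is made up for illustration, and the URL comes from the command line. It relies on HTTP::Response's redirects method and on LWP::UserAgent's requests_redirectable list.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Return every URL visited on the way to the final response, in order.
# (redirect_chain is a hypothetical helper name, not part of LWP.)
sub redirect_chain {
    my ($res) = @_;
    return map { $_->request->uri->as_string } $res->redirects, $res;
}

# Only runs when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $ua = LWP::UserAgent->new(cookie_jar => {});
    # By default LWP follows redirects for GET and HEAD only;
    # a login POST that redirects needs this as well.
    push @{ $ua->requests_redirectable }, 'POST';
    my $res = $ua->get($url);
    print "$_\n" for redirect_chain($res);
}
```

If the list shows more hops than you expected, each of them is a candidate for a request your scraper has to reproduce, or let LWP follow for you.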

Update: There might also be a fifth thing: the old trick of the URL referrer (HTTP_REFERER). Sometimes it bites until we realize that some step in the process checks the Referer header. I just remembered this one, though I no longer consider it as much as I did years ago.
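If you suspect a Referer check, one way to test it is to send the header explicitly. This is only a sketch: the helper name build_request_with_referer and the example.com URLs are placeholders, and whether a given site checks the header at all is something you have to find out by observation.

```perl
use strict;
use warnings;
use HTTP::Request;

# Build a GET request that claims to come from $referer.
# (build_request_with_referer is a hypothetical helper name.)
sub build_request_with_referer {
    my ($url, $referer) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->header(Referer => $referer);
    return $req;
}

# Placeholder URLs; print the request to check what would be sent.
my $req = build_request_with_referer(
    'https://www.example.com/quota.php',
    'https://www.example.com/login.php',
);
print $req->as_string;
```

The resulting object can be passed to $ua->request($req). With LWP::UserAgent you can also pass the header inline, as in $ua->get($url, Referer => $previous_url).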

So you need to closely watch every transaction between the site and the browser before coding the emulation. One way is to use the LiveHTTPHeaders extension for Firefox/Mozilla. Another way is to use HTTP::Recorder, but I failed with this one even though the docs are very straightforward, and I never tried it again. There are some nodes discussing this module that you might want to inspect. After some googling, I then found Web Scraping Proxy (WSP), along with the article that talks about it.
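Once you start coding, it also helps to watch what your own script sends, so you can compare it with what the browser did. Here is a rough sketch using LWP::UserAgent's add_handler hooks. The helper names (summarize_request, summarize_response, trace_ua) are made up for this example, and it assumes an LWP version recent enough to have add_handler.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# One-line summary of an outgoing request.
sub summarize_request {
    my ($req) = @_;
    return sprintf '> %s %s', $req->method, $req->uri;
}

# One-line summary of a response, including the redirect target if any.
sub summarize_response {
    my ($res) = @_;
    my $loc = $res->header('Location');
    return '< ' . $res->status_line . (defined $loc ? " -> $loc" : '');
}

# Attach logging handlers to a user agent and return it.
sub trace_ua {
    my ($ua) = @_;
    # A request_send handler must return nothing, otherwise LWP
    # treats the return value as the response and skips the network.
    $ua->add_handler(request_send  => sub { print STDERR summarize_request($_[0]), "\n"; return });
    $ua->add_handler(response_done => sub { print STDERR summarize_response($_[0]), "\n"; return });
    return $ua;
}

# Only runs when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $ua = trace_ua(LWP::UserAgent->new(cookie_jar => {}));
    $ua->get($url);
}
```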

Well, this is actually not easy, because I still needed to construct my own final scraper program based on the skeleton produced by WSP. I had to remove some unwanted and irrelevant transactions (such as image fetching), and I had to examine the returned pages (we can control the number of pages it produces). But WSP did help me decide which requests to emulate. Below is the stripped-down version of my final program, which logs in to the target site (my network provider's website) and fetches quota information. Note that at some point the site returns a page containing JavaScript code, which in turn contains a URL that has to be fetched in the next step. So in summary, my script fetches four different URLs and parses two returned pages to do the job.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;
use File::Basename;
use subs 'fire';

my $basedir = $0;
$basedir = dirname $basedir;
my %auth = (username => 'username', password => 'password');
my $ua = LWP::UserAgent->new(cookie_jar => {});
my($content, $parser);

#### INITIAL REQUEST
fire get => 'http://www.example.com/mainpage.php';

### LOGIN
fire post => 'https://ip.address/session.php', \%auth;

# extract javascript content
$parser = HTML::TokeParser->new(\$content);
my $next_url;
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'S' && $token->[1] eq 'script';
    $next_url = $token->[2]{src}, last if $token->[2]{src};
}

### URL by JavaScript
fire get => $next_url;

# get the real content
fire get => "https://the.same.ip.address/?"; # final page
$parser = HTML::TokeParser->new(\$content);
# parse quota info from this page

sub fire {
    my($method, @args) = @_;
    my $res = $ua->$method(@args);
    #print STDERR "Checking for $args[0]\n";
    if ($res->is_success) {
        $content = $res->content
    }
    else {
        die $res->status_line . "\n"
    }
}
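The script above takes the next URL from the src attribute of a script tag. On other sites the URL may sit inside inline JavaScript instead, for example as a location assignment. Below is a rough sketch for that variation. The helper name url_from_inline_js and the assumed pattern (location or location.href assigned a quoted string) are my own guesses at a common case, so adjust the regex to whatever the target page really contains.

```perl
use strict;
use warnings;

# Return the first URL assigned to location / location.href inside an
# inline <script> block, or nothing if no such assignment is found.
# (Hypothetical helper; the pattern is an assumption about the page.)
sub url_from_inline_js {
    my ($html) = @_;
    while ($html =~ m{<script\b[^>]*>(.*?)</script>}gis) {
        my $js = $1;
        return $2 if $js =~ m{location(?:\.href)?\s*=\s*(["'])(.+?)\1};
    }
    return;
}

# Sample page content, invented for the demonstration.
my $page = q{<html><script type="text/javascript">
window.location.href = 'https://www.example.com/next.php?sid=1';
</script></html>};
my $next_url = url_from_inline_js($page);
print defined $next_url ? "$next_url\n" : "no URL found\n";
```

A regex is a blunt tool for this. It is fine for a single known site, but it will break as soon as the site changes how it writes the script.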

Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!
