PerlMonks  
Hi lazybowel

Scraping pages behind a login is rarely as straightforward as we would like, and it differs from one site to another. I think there are four things to consider:

  • Cookies. Since you use WWW::Mechanize this shouldn't be much of a problem, because the module initializes an empty cookie jar (overriding the LWP::UserAgent default).
  • SSL connection. This doesn't seem to be your problem either, since you said you have Crypt::SSLeay installed and your script works with Yahoo!
  • Redirection. Many sites perform a few redirections before reaching the final URL that does the actual authentication.
  • JavaScript. Some sites return a page containing JavaScript code that stores the URL to be fetched next. The site my program targets is one such example. Another variation is a hidden (i)frame.
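
For the cookie and redirection points, here is a minimal sketch using plain LWP::UserAgent. By default LWP follows redirects only for GET and HEAD, so a login POST answered with a 302 stops there unless POST is added to requests_redirectable. The helper names make_agent and redirect_chain are my own, not part of any module.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build an agent with an in-memory cookie jar that also follows
# redirects issued in response to a POST (LWP does not by default).
sub make_agent {
    my $ua = LWP::UserAgent->new(cookie_jar => {});
    push @{ $ua->requests_redirectable }, 'POST';
    return $ua;
}

# List every URL requested on the way to the final response, oldest first.
sub redirect_chain {
    my ($res) = @_;
    return map { $_->request->uri->as_string } $res->redirects, $res;
}
```

After a real request, something like print "$_\n" for redirect_chain($ua->post($url, \%auth)) shows which hops the site made you go through.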

Update: There might also be a fifth thing: the old trick of checking the referrer (the Referer request header, seen as HTTP_REFERER on the server side). It bites until you realize that some step in the process checks that header. I just remembered this one; I haven't had to consider it as much as I did years ago.
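
If a step does check the referrer, the header can be set per request. This is only a sketch: get_with_referer is a made-up helper name and the URLs in the usage note are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request;

# Build a GET request that claims to come from $referer.
# The header name is spelled "Referer" (one "r" in the middle), as in HTTP.
sub get_with_referer {
    my ($url, $referer) = @_;
    return HTTP::Request->new(GET => $url, [ Referer => $referer ]);
}
```

Send it with my $res = $ua->request(get_with_referer($url, $previous_url)). The shorter $ua->get($url, Referer => $previous_url) should do the same thing, since get accepts header name/value pairs after the URL.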

So you need to watch every transaction between the site and the browser closely before coding the emulation. One way is to use the LiveHTTPHeaders extension for Firefox/Mozilla. Another is HTTP::Recorder, but I failed with it even though the docs are very straightforward, and I never tried it again. There are some nodes discussing this module that you might want to look at. After some googling I found Web Scraping Proxy (WSP), along with an article about it.
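
Once you start coding, you can watch your own script's side of the conversation in the same way and compare it with what the browser did. A rough sketch using LWP::UserAgent's request_send and response_done handlers follows; traced_agent, describe_request and describe_response are names I made up for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# One-line summary of an outgoing request.
sub describe_request {
    my ($req) = @_;
    return sprintf '> %s %s', $req->method, $req->uri;
}

# One-line summary of a response, with the redirect target if there is one.
sub describe_response {
    my ($res) = @_;
    my $location = $res->header('Location');
    return sprintf '< %s%s', $res->status_line,
        defined $location ? " (Location: $location)" : '';
}

# Return an agent that reports each exchange on STDERR.
# Both handlers return nothing so that LWP carries on as usual.
sub traced_agent {
    my $ua = LWP::UserAgent->new(cookie_jar => {});
    $ua->add_handler(request_send  => sub { warn describe_request($_[0]),  "\n"; return });
    $ua->add_handler(response_done => sub { warn describe_response($_[0]), "\n"; return });
    return $ua;
}
```

As far as I can tell the handlers fire for each hop of a redirect as well, so the trace should line up with what LiveHTTPHeaders shows for the browser.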

Well, this was not actually easy, because I still had to construct my own final scraper program from the skeleton produced by WSP. I had to remove unwanted and irrelevant transactions (such as image fetching) and examine the returned pages (we can control the number of pages produced). But WSP did help me decide which requests to emulate. Below is the stripped-down version of my final program, which logs in to the target site (my network provider's website) and fetches quota information. Note that at some point the site returns a page containing JavaScript code, which in turn contains a URL that must be fetched next. In summary, my script fetches four different URLs and parses two of the returned pages to do the job.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;
use File::Basename;
use subs 'fire';

my $basedir = $0;
$basedir = dirname $basedir;
my %auth = (username => 'username', password => 'password');
my $ua = LWP::UserAgent->new(cookie_jar => {});
my($content, $parser);

#### INITIAL REQUEST
fire get => 'http://www.example.com/mainpage.php';

### LOGIN
fire post => 'https://ip.address/session.php', \%auth;

# extract javascript content
$parser = HTML::TokeParser->new(\$content);
my $next_url;
while (my $token = $parser->get_token) {
    next unless $token->[0] eq 'S' && $token->[1] eq 'script';
    $next_url = $token->[2]{src}, last if $token->[2]{src};
}

### URL by JavaScript
fire get => $next_url;

# get the real content
fire get => "https://the.same.ip.address/?";

# final page
$parser = HTML::TokeParser->new(\$content);
# parse quota info from this page

sub fire {
    my($method, @args) = @_;
    my $res = $ua->$method(@args);
    #print STDERR "Checking for $args[0]\n";
    if ($res->is_success) { $content = $res->content }
    else { die $res->status_line . "\n" }
}
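
The program above takes the next URL from the src attribute of a script tag. On sites where the URL sits inside inline JavaScript instead, a regex over the script bodies may be enough. The sketch below assumes the script assigns a quoted string to location or location.href; that pattern is my guess at a common case, and url_from_inline_script is a made-up name, so adjust both to what your target site actually sends.

```perl
use strict;
use warnings;

# Return the first URL assigned to location or location.href inside any
# inline <script> element of $html, or undef if nothing matches.
sub url_from_inline_script {
    my ($html) = @_;
    while ($html =~ m{<script\b[^>]*>(.*?)</script>}gis) {
        my $js = $1;    # copy before the next match overwrites $1
        return $2 if $js =~ m{\blocation(?:\.href)?\s*=\s*(["'])(.*?)\1};
    }
    return;
}
```

If the extracted URL is relative, URI->new_abs($url, $res->base) turns it into an absolute one before the next request.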

Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!


In reply to Re: Secure Site Login with Mechanize by naikonta
in thread Secure Site Login with Mechanize by lazybowel
