PerlMonks  

WWW::Mechanize (and LWP) should use CONNECT for HTTPS request when a proxy is used?

by spunk (Acolyte)
on Dec 10, 2006 at 07:22 UTC ( #588880=perlquestion )

spunk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I've developed some WWW::Mechanize scripts that work without any problems, but now I am trying to route these scripts through a proxy (Privoxy v3.0.5 Beta running on my local Linux machine) and I'm finding that for all HTTPS requests, I always get 500 response codes. If I remove the Privoxy proxy from my Perl script, everything works. If I keep the proxy and just go to HTTP sites, everything works. If I configure my browser to go through the proxy and load an HTTPS page, everything works.

So to summarize so far...
HTTP request from WWW::Mechanize -> Privoxy => works!
HTTPS request from browser -> Privoxy => works!
HTTPS request from WWW::Mechanize -> direct connection to internet => works!
HTTPS request from WWW::Mechanize -> Privoxy => does NOT work!

Looking at Privoxy's detailed log file, the first sign of things going wrong appears to be that WWW::Mechanize passes a GET request to the proxy. The browsers do not do this; they use CONNECT. I really don't know for sure if this is correct, since CONNECT isn't really specified in the W3 HTTP 1.1 spec that I Googled.

My hypothesis is that Firefox has got it right and that WWW::Mechanize is not smart enough to use CONNECT instead of GET when requesting HTTPS pages through a proxy.
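
For comparison, here is a minimal sketch of the two request forms a client can send to an HTTP proxy for an https URL. CONNECT is reserved in RFC 2616 and described in RFC 2817. The helper names `split_url`, `absolute_get_request` and `connect_request` are hypothetical, and the sketch only builds the request text; it opens no sockets.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only: they build request text
# and never touch the network.

# Split an http/https URL into scheme, host, and port (default by scheme).
sub split_url {
    my ($url) = @_;
    my ($scheme, $host, $port) = $url =~ m{^(https?)://([^/:]+)(?::(\d+))?}i
        or die "Unsupported URL: $url";
    $scheme = lc $scheme;
    $port ||= $scheme eq 'https' ? 443 : 80;
    return ($scheme, $host, $port);
}

# The form LWP is using here: a plain GET carrying the full https URL.
sub absolute_get_request {
    my ($url) = @_;
    my (undef, $host) = split_url($url);
    return "GET $url HTTP/1.0\r\nHost: $host\r\n\r\n";
}

# The form the browser uses: CONNECT host:port, asking the proxy for a tunnel.
sub connect_request {
    my ($url) = @_;
    my (undef, $host, $port) = split_url($url);
    return "CONNECT $host:$port HTTP/1.1\r\nHost: $host:$port\r\n\r\n";
}

print absolute_get_request('https://login.yahoo.com');
print connect_request('https://login.yahoo.com');
```

With CONNECT, once the proxy answers with a 200 status the client runs the SSL handshake over that same socket, so the proxy only relays encrypted bytes.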

My questions to the group are...
1) Does all of this sound right?
2) How would I force a CONNECT from either WWW::Mechanize or LWP in this circumstance? Nothing is mentioned in any of the docs I've seen. Grepping the code didn't reveal anything to me either.

Here's my code....

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP;
use LWP::DebugFile;
require HTTP::Request;

sub main {
    my $cookie_jar = HTTP::Cookies->new(
        file         => 'cookies.dat',
        autosave     => 1,
        hide_cookie2 => 1
    );
    my $bot = WWW::Mechanize->new;
    $bot->max_redirect(100);
    $bot->cookie_jar($cookie_jar);
    $bot->proxy(['http', 'https'], 'http://192.168.250.11:8118/');
    $bot->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3');
    my $url = "https://login.yahoo.com";
    #my $url = "https://us.etrade.com";
    my $response = $bot->get($url);
    my $content = $bot->content;
}

&main


Here is what the Privoxy log looks like when I use my Perl script...
Dec 09 22:53:30 Privoxy(b7f856c0) Info: Privoxy version 3.0.5
Dec 09 22:53:30 Privoxy(b7f856c0) Info: Program name: ./privoxy
Dec 09 22:53:30 Privoxy(b7f856c0) Info: Listening on port 8118 on IP address 192.168.250.11
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: New HTTP Request-Line: GET / HTTP/1.0
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: GET / HTTP/1.0
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: Accept-Encoding: identity
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: Host: login.yahoo.com
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: scan: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: addh-unique: Host: login.yahoo.com
Dec 09 22:53:44 Privoxy(b7f84bb0) Header: Adding: Connection: close
Dec 09 22:53:44 Privoxy(b7f84bb0) Request: login.yahoo.com/
Dec 09 22:53:44 Privoxy(b7f84bb0) Writing: �
Dec 09 22:53:45 Privoxy(b7f84bb0) Writing: GET / HTTP/1.0
Accept-Encoding: identity
Host: login.yahoo.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Connection: close
Dec 09 22:53:47 Privoxy(b7f84bb0) Header: Adding: Connection: close
Dec 09 22:53:47 Privoxy(b7f84bb0) Writing: Connection: close

Here is the LWP debug information generated...
# LWP::DebugFile logging to lwp_457baef8_5876.log
# Time now: {1165733624} = Sat Dec 9 22:53:44 2006
LWP::UserAgent::new: ()
LWP::UserAgent::proxy: ARRAY(0x8ce0c98) http://192.168.250.11:8118/
LWP::UserAgent::proxy: http http://192.168.250.11:8118/
LWP::UserAgent::proxy: https http://192.168.250.11:8118/
LWP::UserAgent::request: ()
HTTP::Cookies::add_cookie_header: Checking login.yahoo.com for cookies
HTTP::Cookies::add_cookie_header: Checking .yahoo.com for cookies
HTTP::Cookies::add_cookie_header: Checking yahoo.com for cookies
HTTP::Cookies::add_cookie_header: Checking .com for cookies
LWP::UserAgent::send_request: GET https://login.yahoo.com
LWP::UserAgent::_need_proxy: Proxied to http://192.168.250.11:8118/
LWP::Protocol::http10::request: ()
LWP::Protocol::http10::request: S>0 "GET https://login.yahoo.com HTTP/1.0\x0D\x0A"
LWP::Protocol::http10::request: S>+ "Accept-Encoding: identity\x0D\x0A"
LWP::Protocol::http10::request: S>+ "Host: login.yahoo.com\x0D\x0A"
LWP::Protocol::http10::request: S>+ "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3\x0D\x0A\x0D\x0A"
LWP::Protocol::http10::request: reading response
# Time now: {1165733627} = Sat Dec 9 22:53:47 2006
LWP::Protocol::http10::request: S>0 "Connection: close\x0D\x0A\x0D\x0A"
LWP::Protocol::http10::request: HTTP/0.9 assume OK
LWP::Protocol::collect: read 21 bytes
LWP::UserAgent::request: Simple response: OK

Here is what the Privoxy log file looks like when a browser (Firefox in this case) requests Yahoo's login page through the proxy...
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: CONNECT login.yahoo.com:443 HTTP/1.1
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.8) Gecko/20061109 CentOS/1.5.0.8-0.1.el4.centos4 Firefox/1.5.0.8 pango-text
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: Proxy-Connection: keep-alive
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: scan: Host: login.yahoo.com
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: crumble crunched: Proxy-Connection: keep-alive!
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: addh-unique: Host: login.yahoo.com:443
Dec 09 22:56:06 Privoxy(b7f84bb0) Header: Adding: Connection: close
Dec 09 22:56:06 Privoxy(b7f84bb0) Request: login.yahoo.com:443/
Dec 09 22:56:06 Privoxy(b7f84bb0) Writing: �
Dec 09 22:56:09 Privoxy(b7f84bb0) Writing: HTTP/1.0 200 Connection established
Proxy-Agent: Privoxy/3.0.5
(...encrypted traffic follows.)

I am at a real loss for what to do next; any help would be greatly appreciated. So many sites have encrypted login pages that this impacts almost all of the sites that I want to automate.

Replies are listed 'Best First'.
Re: WWW::Mechanize (and LWP) should use CONNECT for HTTPS request when a proxy is used?
by Anonymous Monk on Dec 10, 2006 at 09:23 UTC
    http is not https, try
    ->proxy('https' , 'https://192.168.250.11:8118/');
      Hello,

      Thank you for the response. I agree, HTTP is not HTTPS and I had not thought to try this. But changing this line of code only serves to confuse Perl about where the proxy is located. The proxy no longer sees any traffic and the script just hangs without doing anything. I haven't confirmed this with a packet sniffer, but I'd guess that this change directs all outgoing HTTPS traffic to port 443 of the proxy machine, rather than port 8118 where it should go.
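
The port guess can be checked against the URL itself: an explicit port overrides the scheme default, so the https:// form of the proxy URL still names port 8118 and only the scheme changes. A minimal sketch follows. The helper name `proxy_endpoint` is hypothetical, and it says nothing about what LWP then does with the URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report the scheme, host, and port a proxy URL names.
# An explicit port wins; the scheme default is used only when none is given.
sub proxy_endpoint {
    my ($proxy_url) = @_;
    my ($scheme, $host, $port) = $proxy_url =~ m{^(\w+)://([^/:]+)(?::(\d+))?}
        or die "Cannot parse proxy URL: $proxy_url";
    $scheme = lc $scheme;
    my %default_port = (http => 80, https => 443);
    $port = $default_port{$scheme} unless defined $port;
    die "No default port known for scheme '$scheme'" unless defined $port;
    return ($scheme, $host, $port);
}

for my $url ('http://192.168.250.11:8118/', 'https://192.168.250.11:8118/') {
    my ($scheme, $host, $port) = proxy_endpoint($url);
    print "$url => scheme=$scheme host=$host port=$port\n";
}
```

If the port is the same in both cases, the difference is more likely to be that the client now tries to speak SSL to the proxy itself, which is an assumption a packet sniffer could confirm.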

      I should point out that I need both HTTP and HTTPS to go through the proxy. I can think of very few sites that are purely HTTPS. Yahoo, for instance, would use HTTPS for authentication and then move back to HTTP for most other pages.

      I think that CONNECT needs to be implemented in HTTP::Request, then WWW::Mechanize needs to test if a proxy is defined and if it's for HTTPS, then call CONNECT rather than GET. I just wish I knew that this was the right thing to do before I modify these packages...

      Thanks again for your suggestion.

        Spunk, did you ever get this working?
Re: WWW::Mechanize (and LWP) should use CONNECT for HTTPS request when a proxy is used?
by andyford (Curate) on Dec 11, 2006 at 16:10 UTC

    I was having trouble with LWP & proxies until I found that it was better to set proxy support using the environment variable method. The Crypt::SSLeay docs say:

    The CONNECT method is used by Crypt::SSLeay's internal proxy support.
    Definitely read that whole document.

    So what you want is something like this:

    $ENV{HTTP_PROXY} = 'http://192.168.250.11:8118/';
    $ENV{HTTPS_PROXY} = 'http://192.168.250.11:8118/';

    non-Perl: Andy Ford
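
A minimal sketch of how that suggestion might fit into the original script, assuming Crypt::SSLeay is the SSL backend LWP ends up using. The Crypt::SSLeay documentation warns against mixing its HTTPS_PROXY support with LWP's own env_proxy handling, so this sketch registers only the http proxy with the user agent and leaves https to the environment variable. The helper name `configure_proxy` is hypothetical, and the script only fetches when a URL is given on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $PROXY = 'http://192.168.250.11:8118/';

# Hypothetical helper: route plain HTTP through LWP's proxy support and
# leave HTTPS to Crypt::SSLeay, which reads $ENV{HTTPS_PROXY} itself.
sub configure_proxy {
    my ($ua, $proxy) = @_;
    $ENV{HTTPS_PROXY} = $proxy;    # picked up by Crypt::SSLeay (uses CONNECT)
    $ua->proxy('http', $proxy);    # http only; no https entry registered here
    return $ua;
}

# Fetch only when a URL is passed as an argument, so the helper above can
# be loaded and inspected without touching the network.
if (my $url = shift @ARGV) {
    require WWW::Mechanize;
    my $bot = configure_proxy(WWW::Mechanize->new, $PROXY);
    my $response = $bot->get($url);
    print $response->status_line, "\n";
}
```

Run as, for example, `perl proxy_fetch.pl https://login.yahoo.com` (the file name is arbitrary). Whether the request then appears as CONNECT in the Privoxy log depends on the installed LWP and SSL modules.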
