PerlMonks  

HTTP GET without LWP

by bbfu (Curate)
on Jan 13, 2001 at 04:04 UTC ( [id://51510]=perlquestion )

bbfu has asked for the wisdom of the Perl Monks concerning the following question:

Okay, maybe not really a Perl question, per se... But here goes:

I have a small program (gethttp) to make simple HTTP requests and print the response to stdout. Unfortunately, LWP is not available so it had to be done manually using IO::Socket. I copied the program from one of the Perl man pages and modified it only slightly.

For the most part, this program works just great. Every once in a while, however, I find a page (usually a CGI) that doesn't seem to work. What I get back is a 404 error, though I know the page is there because I can access it using my web browser.

I figure there must be something going on that I'm just not getting. I can't find any documents anywhere explaining a different syntax for the HTTP GET, and I can't see anything wrong with my Perl code. I really just want to understand why it's not working and what's going on, though it might have a practical application in a project I'm working on if I can get it to work.

I'm including the code from my program below, as well as a URL that it doesn't work on and the response I get from it. (I don't know if everyone can get to the URL, since it might be set up as private to UF. Let me know if you have problems.)

I would appreciate any help very much!

The Program...

    #!/usr/bin/perl -w
    use IO::Socket;
    unless (@ARGV) { die "usage: $0 URL\n" }
    $EOL = "\015\012";
    $BLANK = $EOL x 2;
    $sep = (@ARGV > 1) ? "-------------------\n" : "";
    foreach $url ( @ARGV ) {
        unless($url =~ m{^http://(.*?)/}) { print "$0: invalid url: $url\n"; next }
        $host = $1;
        $remote = IO::Socket::INET->new( Proto    => "tcp",
                                         PeerAddr => $host,
                                         PeerPort => "http(80)",
                                       );
        unless ($remote) { die "Cannot connect to http daemon on $host\n" }
        $remote->autoflush(1);
        print $remote "GET $url HTTP/1.0" . $BLANK;
        while ( <$remote> ) { print }
        print "\n$sep";
        close $remote;
    }

The Response

    $ ./gethttp 'http://login.gatorlink.ufl.edu/authenticate.cgi'
    HTTP/1.0 404 Not Found
    Date: Fri, 12 Jan 2001 22:57:21 GMT
    Server: Apache/1.3.6 (Unix) mod_perl/1.19 mod_ssl/2.2.8 OpenSSL/0.9.2b
    Connection: close
    Content-Type: text/html

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    <TITLE>404 Not Found</TITLE>
    </HEAD><BODY>
    <H1>Not Found</H1>
    The requested URL http://login.gatorlink.ufl.edu/authenticate.cgi was not found on this server.<P>
    </BODY></HTML>

Replies are listed 'Best First'.
Re: HTTP GET without LWP
by sutch (Curate) on Jan 13, 2001 at 04:34 UTC
    Update: because of isotope's response to this posting, I've discovered that it is not HTTP/1.1 that is the solution, but providing the Host: header, which allows the server to respond with a redirect. Without LWP, you're safer and have less work if you stick to HTTP/1.0 combined with the Host: header.

    The page is not available, for whatever reason (probably because of authentication). Request http://login.gatorlink.ufl.edu/authenticate.cgi in a browser and notice that you are redirected to http://login.gatorlink.ufl.edu/retry.cgi? .

    You're making an HTTP request using HTTP/1.0. So the server responds with the "404 Not Found" page. Change your request to HTTP/1.1 and you will receive a redirect as the response:

    telnet login.gatorlink.ufl.edu 80
    Trying 128.227.128.87...
    Connected to dir2fe1.server.ufl.edu.
    Escape character is '^]'.
    GET /authenticate.cgi HTTP/1.1
    Host: login.gatorlink.ufl.edu

    HTTP/1.1 302 Found
    Date: Fri, 12 Jan 2001 23:30:14 GMT
    Server: Apache/1.3.6 (Unix) mod_perl/1.19 mod_ssl/2.2.8 OpenSSL/0.9.2b
    URI: retry.cgi?
    Set-Cookie: UF_GatorLinkState=none; path=/; domain=.ufl.edu;
    Location: retry.cgi?
    Transfer-Encoding: chunked
    Content-Type: text/html

    be
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    <TITLE>302 Found</TITLE>
    </HEAD><BODY>
    <H1>Found</H1>
    The document has moved <A HREF="retry.cgi?">here</A>.<P>
    </BODY></HTML>

    0

      Don't send HTTP/1.1 unless you're prepared to implement it properly. If you do send it, the server will expect to keep the connection open. The Host: header is supported just fine with HTTP/1.0, and with HTTP/1.0 the server will drop the connection as soon as the transfer is complete.
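
One part of "implementing it properly" is the chunked transfer coding seen in the telnet session above, where the lone "be" and "0" lines appear to be hexadecimal chunk sizes (0xbe = 190 bytes, then a zero-size chunk that ends the body). A minimal sketch of decoding such a body, assuming the whole response body is already in a string; decode_chunked is a hypothetical helper name, and trailers after the last chunk are simply ignored:

```perl
use strict;
use warnings;

# decode_chunked is a hypothetical helper. It takes the raw body of a
# response sent with "Transfer-Encoding: chunked" and returns the payload.
sub decode_chunked {
    my ($raw) = @_;
    my $body = '';
    # Each chunk is "<hex size>[;extensions]CRLF<data>CRLF".
    while ($raw =~ s/^([0-9A-Fa-f]+)[^\r\n]*\r?\n//) {
        my $size = hex $1;
        last if $size == 0;                     # a zero-size chunk ends the body
        die "truncated chunk\n" if length($raw) < $size;
        $body .= substr($raw, 0, $size, '');    # take $size bytes off the front
        $raw =~ s/^\r?\n//;                     # drop the line ending after the data
    }
    return $body;
}

my $raw = "5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n";
print decode_chunked($raw), "\n";               # prints "Hello, world"
```

Sticking to HTTP/1.0 sidesteps this entirely, since the body then simply runs until the server closes the connection.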

      --isotope
      http://www.skylab.org/~isotope/
Re: HTTP GET without LWP
by isotope (Deacon) on Jan 13, 2001 at 04:25 UTC
    Many servers won't like getting a request that includes the 'http://hostname' part of the URL. Typically only proxy servers actually accept that. You might have better luck if you change:
    m{^http://(.*?)/}
    #to
    m{^http://(.*?)(/.*)$}
    ...and then set $url = $2 so you only request the URI (/authenticate.cgi). Before anyone else lays into me, this is a very rough solution and doesn't necessarily take everything into account. This is really what LWP is designed for, as it will use RFC-compliant methods to parse the URL instead of this quick and dirty stuff.
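
The split described above can be sketched a little more defensively. This is still a rough regex and not an RFC-compliant parser; split_url is a hypothetical helper name, and the handling of an optional port, a missing path, and a trailing fragment are assumptions added for illustration:

```perl
use strict;
use warnings;

# split_url is a hypothetical helper, not part of any module.
# It only understands plain http:// URLs and returns an empty list otherwise.
sub split_url {
    my ($url) = @_;
    return unless $url =~ m{^http://([^/:?#]+)(?::(\d+))?([/?][^#]*)?(?:#.*)?$}i;
    my ($host, $port, $path) = ($1, $2, $3);
    $port = 80  unless defined $port;           # default HTTP port
    $path = '/' unless defined $path && length $path;
    $path = "/$path" if $path =~ /^\?/;         # "http://host?x=1" becomes "/?x=1"
    return ($host, $port, $path);
}

my ($host, $port, $path) = split_url('http://login.gatorlink.ufl.edu/authenticate.cgi');
print "host=$host port=$port path=$path\n";
# prints "host=login.gatorlink.ufl.edu port=80 path=/authenticate.cgi"
```

Only $path goes on the GET line; $host and $port are what the socket connects to.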

    Update: It may be a virtual server, in which case you also need to send the Host: header in your request, like this:
    print $remote "GET $url HTTP/1.0" . $EOL . "Host: $host" . $BLANK;
    I strongly suggest splitting the URI, too.

    --isotope
    http://www.skylab.org/~isotope/
      That was my first thought, but breaking the URL into host and URI didn't solve the issue. If I figure anything else out, I'll post it here.

      Update: sutch seems to have hit the nail on the head. If you send "GET /authenticate.cgi HTTP/1.0" alone it errors out. The key is attaching "Host: login.gatorlink.ufl.edu" to the end of the request, before your $BLANK variable.

      while(my $url = shift @ARGV) {
          unless($url =~ m{^http://([A-Za-z0-9\.\-]+)/(.*)$}) {
              print "$0: invalid url: $url\n";
              next;
          }
          my($host, $uri) = ($1, $2);
          my $remote = IO::Socket::INET->new(Proto    => "tcp",
                                             PeerAddr => $host,
                                             PeerPort => "http(80)");
          unless ($remote) { die "Cannot connect to http daemon on $host\n" }
          $remote->autoflush(1);
          # Header lines end in CRLF ($EOL), not a bare "\n"
          print $remote "GET /$uri HTTP/1.0" . $EOL . "Host: $host" . $BLANK;
          print while(<$remote>);
          print "\n$sep";
          close $remote;
      }

      Your end result might look something like that. You really should just use LWP. =)

      'kaboo

      Yes, I'd thought of that and tried it but it doesn't seem to work any better. =(

      All the pages I've tried it on accept the full URL but since you say many won't like it, I'll change it. It seems to work both ways for the ones that work. Unfortunately:

      $ ./gethttp 'http://login.gatorlink.ufl.edu/authenticate.cgi'
      HTTP/1.1 404 Not Found
      Date: Fri, 12 Jan 2001 23:27:55 GMT
      Server: Apache/1.3.6 (Unix) mod_perl/1.19 mod_ssl/2.2.8 OpenSSL/0.9.2b
      Connection: close
      Content-Type: text/html

      <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
      <HTML><HEAD>
      <TITLE>404 Not Found</TITLE>
      </HEAD><BODY>
      <H1>Not Found</H1>
      The requested URL /authenticate.cgi was not found on this server.<P>
      </BODY></HTML>

      Thanks for your help, though!

Re: HTTP GET without LWP
by dws (Chancellor) on Jan 13, 2001 at 04:41 UTC
    It's possible that the web server is configured to internally redirect the request based on HTTP request headers that your web browser is sending, but your script isn't. There are several possibilities:
    • Your browser is probably using HTTP/1.1, and is thus sending a Host: header.
    • Your browser is sending a User-Agent: header. (A likely culprit).
    • Your browser is sending an Accept: header. (A less likely culprit).
    Try adding these headers to your HTTP request.
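
A minimal sketch of building such a request by hand, assuming HTTP/1.0 with a Host: header plus whatever extra headers you want to experiment with. build_request is a hypothetical helper name, and the User-Agent and Accept values below are placeholders, not what any particular browser sends:

```perl
use strict;
use warnings;

my $EOL = "\015\012";   # HTTP lines end in CRLF

# build_request is a hypothetical helper. Extra headers are passed as a
# flat list of name/value pairs and are written in the order given.
sub build_request {
    my ($host, $path, @headers) = @_;
    my $req = "GET $path HTTP/1.0" . $EOL . "Host: $host" . $EOL;
    while (my ($name, $value) = splice(@headers, 0, 2)) {
        $req .= "$name: $value" . $EOL;
    }
    return $req . $EOL;   # a blank line ends the header block
}

print build_request(
    'login.gatorlink.ufl.edu', '/authenticate.cgi',
    'User-Agent' => 'gethttp/0.1',
    'Accept'     => '*/*',
);
```

Adding the headers one at a time makes it easier to see which one the server actually cares about.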

    Update: The correct answer (HTTP/1.1 + Host:) snuck in while I was typing this.

    Update^2: An invaluable reference to have, whether you're using LWP or not, is Webmaster in a Nutshell (O'Reilly, http://www.oreilly.com/catalog/webmaster2/). It includes a complete overview of HTTP, including request and response headers.

Thanks, everyone!
by bbfu (Curate) on Jan 13, 2001 at 04:50 UTC

    I appreciate all the help from everyone!

    The problem was pretty much exactly what sutch said. I've updated the request to use HTTP/1.1 and the Host: field. I might take the advice and use the User-Agent: field as well when I read more about it.

    What I've got now (below) works. Though I realize this is a pretty primitive hack, I just don't have access to LWP (at the moment).

    Again, thanks for all the help!

    #!/usr/bin/perl -w
    use IO::Socket;
    unless (@ARGV) { die "usage: $0 URL\n" }
    $EOL = "\015\012";
    $BLANK = $EOL x 2;
    $sep = (@ARGV > 1) ? "-------------------\n" : "";
    foreach $url ( @ARGV ) {
        unless($url =~ m{^http://(.*?)/(.*)$}) { print "$0: invalid url: $url\n"; next }
        $host = $1;
        $rest = $2;
        $remote = IO::Socket::INET->new( Proto    => "tcp",
                                         PeerAddr => $host,
                                         PeerPort => "http(80)",
                                       );
        unless ($remote) { die "Cannot connect to http daemon on $host\n" }
        $remote->autoflush(1);
        print $remote "GET /$rest HTTP/1.1". $EOL . "Host: $host" . $BLANK;
        while ( <$remote> ) { print }
        print "\n$sep";
        close $remote;
    }
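
Since the server in this thread answers with a 302 once the Host: header is present, the script would need to look at the status line and the Location: header to follow it. A minimal sketch, assuming the whole response has been read into one string; parse_head is a hypothetical helper name, and repeated headers simply overwrite each other here:

```perl
use strict;
use warnings;

# parse_head is a hypothetical helper. It splits a raw HTTP response into
# the status code, a hash of headers (names lowercased), and the body.
sub parse_head {
    my ($response) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $response, 2;
    my ($status_line, @lines) = split /\r?\n/, $head;
    my ($code) = $status_line =~ /^HTTP\/\d+\.\d+\s+(\d{3})/;
    my %header;
    for my $line (@lines) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $header{lc $name} = $value;
    }
    return ($code, \%header, defined $body ? $body : '');
}

my $sample = "HTTP/1.1 302 Found\r\nLocation: retry.cgi?\r\n"
           . "Content-Type: text/html\r\n\r\n<HTML>...</HTML>";
my ($code, $header, $body) = parse_head($sample);
print "status $code, redirect to $header->{location}\n"
    if $code =~ /^30[1237]$/ && exists $header->{location};
# prints "status 302, redirect to retry.cgi?"
```

Note that the Location value in this thread is relative (retry.cgi?), so it would still have to be resolved against the original URL before making the next request.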
Re: HTTP GET without LWP
by strredwolf (Chaplain) on Jan 13, 2001 at 10:30 UTC
    You may find my WolfSkunk Proxy "wsproxy" program helpful in this regard. There's a bit of regexp code to parse a URL into its component parts. Worth taking a look over.

    --
    $Stalag99{"URL"}="http://stalag99.keenspace.com";
