http://www.perlmonks.org?node_id=966140

MorayJ has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I'm having a little difficulty understanding an aspect of either LWP::Simple or web delivery and hoped for some enlightenment

I am using LWP::Simple and the following code snippet:

$content = head($link);
if ($content) {
    push( @links, $_, $link, $node );
}

This works with external web pages, but the pages served from my work system don't seem to return headers. I took the USB stick I run Perl from home and tried reaching the company pages from my home machine, and they still didn't return headers. I wanted to rule out it being because I was on the same network, which I'm not at home.

If I change from 'head' to 'get' then it works, but I don't want to download the full pages each time.

Does this make sense as a question, or does anyone understand what's going on?

Re: LWP::Simple - so they say!
by Corion (Patriarch) on Apr 20, 2012 at 09:20 UTC

    Obviously, your "work" webserver sends different data than other webservers. You will have to look at the data that actually gets sent over the network. Consider looking at LWP::Debug or Wireshark.

Re: LWP::Simple - so they say!
by bart (Canon) on Apr 20, 2012 at 11:26 UTC
    Perhaps your webserver just doesn't respond to "HEAD" requests. Shit happens.

    In that case, I'd issue a plain "GET", but abort it halfway. You will then have to use a callback. See getprint (in LWP::Simple) as an example, which I have reproduced here:

    sub getprint ($)
    {
        my ($url) = @_;
        my $request = HTTP::Request->new( GET => $url );
        local ($\) = "";    # ensure standard $OUTPUT_RECORD_SEPARATOR
        my $callback = sub { print $_[0] };
        if ( $^O eq "MacOS" ) {
            $callback = sub { $_[0] =~ s/\015?\012/\n/g; print $_[0] };
        }
        my $response = $ua->request( $request, $callback );
        unless ( $response->is_success ) {
            print STDERR $response->status_line, " <URL:$url>\n";
        }
        $response->code;
    }
    You'd have to reproduce this but of course with a different callback... With die and eval BLOCK, it might just work.
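    A minimal sketch of that approach, assuming LWP::UserAgent is available; the URL is a placeholder. When the content callback dies, LWP aborts the transfer and records the exception in the response's X-Died header, so the status line and headers remain available without downloading the whole body:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $url = 'http://example.com';    # placeholder URL
my $ua  = LWP::UserAgent->new;

my $request  = HTTP::Request->new( GET => $url );
my $response = $ua->request(
    $request,
    sub {    # called with each chunk of body content as it arrives
        die "have headers, stopping download\n";    # abort after the first chunk
    }
);

# The headers are still usable even though the body was truncated.
print $response->status_line, "\n";
print $response->header('Content-Type') // '(none)', "\n";
```

    Note that LWP itself catches the die from the callback, so you don't strictly need an enclosing eval BLOCK; check the response's X-Died header if you want to distinguish an aborted transfer from a complete one.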

      No need to abort with max_size

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use WWW::Mechanize;

      my $url = 'http://example.com';
      my $ua  = WWW::Mechanize->new(
          qw/ max_size 1 autocheck 1 show_progress 1 /
      );

      eval { $ua->head($url); 1 } or $ua->get($url);

      print $ua->dump_headers;
      __END__
Re: LWP::Simple - so they say!
by JavaFan (Canon) on Apr 20, 2012 at 10:33 UTC
    Isn't the LWP::Simple man page quite clear about what it does? Quoting said man page:
      head($url)
         Get document headers. Returns the following 5 values if successful:
         ($content_type, $document_length, $modified_time, $expires, $server)
    
         Returns an empty list if it fails.  In scalar context returns TRUE
         if successful.
    
    Perhaps none of those headers are actually sent by your work's server.

    Or perhaps your server doesn't respond nicely to HEAD requests. Telnet to its port, issue a proper HEAD request, and see what it returns.
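    A rough script equivalent of that telnet test, assuming the server listens on plain HTTP port 80 (the hostname is a placeholder for the work server):

```perl
use strict;
use warnings;
use IO::Socket::INET;

my $host = 'example.com';    # placeholder: substitute the work server
my $sock = IO::Socket::INET->new(
    PeerAddr => $host,
    PeerPort => 80,
    Proto    => 'tcp',
) or die "connect failed: $!";

# A minimal HTTP/1.1 HEAD request; Connection: close makes the server hang up
# after responding, so the read loop below terminates.
print $sock "HEAD / HTTP/1.1\r\n",
            "Host: $host\r\n",
            "Connection: close\r\n",
            "\r\n";

print while <$sock>;    # dump the raw status line and headers
```

    If the status line and headers come back here but not through head(), the problem is on the client side; if the server answers GET but not HEAD, you'll see it immediately.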

Re: LWP::Simple - so they say!
by Anonymous Monk on Apr 20, 2012 at 09:21 UTC

    Does this make sense as a question, or does anyone understand what's going on?

    Not really, and you're confused :)

    I'd switch to the latest WWW::Mechanize, it seems to be more friendly

    See https://metacpan.org/module/LWP::Simple#head

    $ lwp-request -m head http://example.com
    200 OK
    Connection: close
    Date: Fri, 20 Apr 2012 09:17:18 GMT
    Server: Apache/2.2.3 (CentOS)
    Vary: Accept-Encoding
    Content-Type: text/html; charset=UTF-8
    Last-Modified: Wed, 09 Feb 2011 17:13:15 GMT
    Client-Date: Fri, 20 Apr 2012 09:24:41 GMT
    Client-Peer: 192.0.32.8:80
    Client-Response-Num: 1

    $ perl -MLWP::Simple -le " print for head( shift ) " http://example.com
    text/html; charset=UTF-8
    1297271595
    Apache/2.2.3 (CentOS)
Re: LWP::Simple - so they say!
by MorayJ (Beadle) on Apr 20, 2012 at 12:03 UTC

    Hi

    Thanks for the tips and answers

    I'm going to have to revisit this, I think. 'Mech' looks like too much... possibly I should be looking at HTTP::Response, as all I want is for the server to tell me whether the page is there (the links aren't on a webpage, they're in an XML document).

    I'm just going to assume that, for the small number of pages I'm doing, the inefficiency in time etc. doesn't matter. Interrupting the download is something I might follow up if the links multiply too much.

    I used the 'HTTP Headers' Chrome extension, as I don't have telnet here, and that seemed to happily give me headers on my pages. I expect it's some quirk in what I'm doing, and maybe time will reveal it to me.

    Not helped by my ignorance of headers!

    Thanks for the help and if I refine it, I'll report back