Getting more out of LWP::Simple

Update: Upon request, a copy of this has been posted to Tutorials. Thanks all for the response!

Introduction

I have lately had reason to use LWP::Simple for lots of small tasks: downloading a PDF on the command line without wget (since my browser didn't get it right), fetching the Chatterbox XML ticker, and plenty of others. None of these would have been quite as easy without LWP::Simple, although there are of course alternatives. But, as I'm sure you have heard, it is recommended to "Do the simplest thing that could possibly work", which in Perl programming often means using the package named XXX::Simple.

While doing this, I've "discovered" a few neat tricks that make its use even simpler, or more effective, and I thought I'd also share a few other things that might not be a given (like the HEAD part) for those not so familiar with HTTP and web servers.

I'm sure there is more that I am missing out here, but these things made life easier on me, at least. So here goes:

Read the documentation

Sounds like a given, but it is easy to neglect - or to think that one remembers everything. lwpcook, LWP::Simple, LWP::UserAgent and LWP are good places to look. Or just type perldoc name on your command line - you should have this utility bundled with your perl distribution.

This mini tutorial assumes that you have some basic knowledge of using LWP::Simple.

Export the UserAgent

A poorly documented feature of LWP::Simple is that it supports exporting the LWP::UserAgent object it uses to fetch with.

Why would you want to do that? Well, the default timeout for LWP::Simple is the same as for LWP::UserAgent, that is 180 seconds, or three minutes. This might often be way too long. In one real life example of mine, I had a small script going live every minute, fetching something from the web - such a timeout might mean that I have several copies of the script running simultaneously, potentially accessing the same log files or something similar. There are other ways to work around this, of course, such as setting alarms or implementing file locking. But a long timeout made no sense either way, since if the page didn't respond within 30 seconds, it was probably down anyway.

This code will take care of this problem:

# Note that if you do this, you must explicitly
# import everything you want to use:
use LWP::Simple qw($ua get);

$ua->timeout(30);

my $html = get($webpage) or die "Could not fetch $webpage (timed out?)";

Another thing you might want to do is change your reported useragent:

$ua->agent('My agent/1.0');

If you want to do several requests, of which the first should include a login, or something else stateful which uses cookies, you can even attach a cookiejar to use with LWP::Simple:

use LWP::Simple qw($ua get);
use HTTP::Cookies;

$ua->cookie_jar(HTTP::Cookies->new);

get $webpage . $login_string;

my $logged_in_page = get $webpage . $private_page;

And, as usual with cookiejars, you can of course specify a file to save the cookies in, between invocations of the script.
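A minimal sketch of that file-backed variant. The file name is just something I made up for the example; file, autosave and ignore_discard are options described in the HTTP::Cookies docs:

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);
use HTTP::Cookies;

# Hypothetical file name; put it wherever suits your script.
my $cookie_file = 'lwp_cookies.txt';

# autosave writes the jar back to the file when the object goes away;
# ignore_discard also keeps cookies the server marks as session-only,
# which is often what a login cookie is.
my $jar = HTTP::Cookies->new(
    file           => $cookie_file,
    autosave       => 1,
    ignore_discard => 1,
);
$ua->cookie_jar($jar);

# From here on, get() sends and stores cookies through $jar.
```

Whether you want ignore_discard depends on the site; leave it out if session cookies should not survive between runs.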

As you can see, this opens up some possibilities for extra tweaking. But why not use LWP::UserAgent then, instead? Well, simply because this way is so much simpler if you only need those small extras. The corresponding LWP::UserAgent example for timeout looks like this:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

$ua->timeout(30);

my $request = HTTP::Request->new('GET', $webpage);

my $response = $ua->request($request);

my $html = $response->content;

As you can see, lots more typing. See LWP::UserAgent for all possibilities you have on this.

Update: I added the next section after getting inspiration from arunhorne's node below. I think that there is a simpler way to do this, again in some cases. It is somewhat related to the previous section, since you might use the UserAgent for this.

Use environment variables to set proxies

As pointed out by arunhorne below, it is sometimes necessary to use a proxy because you are behind a firewall. As suggested, one can always use the exported UserAgent to cope with this, by setting $ua->proxy.

But upon init, LWP::Simple will also call $ua->env_proxy as described at LWP::UserAgent, which means that if you use the same script somewhere else, or several LWP::Simple scripts, it might be easier to simply set your environment variables, like http_proxy for all http requests. However, if the proxy requires credentials, I don't think that is possible to do via the environment, in which case you must resort to the UserAgent way of doing things.

This is an easy way to set your proxy, on that machine, for all eternity - without modifying the script. That may, or may not be what you want. :)

The docs on LWP::UserAgent mention how to set these on *NIX based platforms. I just want to add that on Windows, the command you want is set - try typing set /? to get some instructions. Or just set it the GUI way, which should be somewhere below the control panel.
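To check that the environment was actually picked up, something like this sketch might help. The proxy address is invented, and the variable is only set inside the script (in a BEGIN block, so it happens before LWP::Simple loads) to keep the example self-contained; normally it would already be set in your shell:

```perl
use strict;
use warnings;

# Stand-in for an http_proxy that is normally set outside the script.
# proxy.example.com is a made-up host.
BEGIN { $ENV{http_proxy} = 'http://proxy.example.com:8080/' }

use LWP::Simple qw($ua);

# With a single argument, proxy() returns the current setting
# for that scheme instead of changing it.
my $proxy = $ua->proxy('http');
print defined $proxy ? "http requests go via $proxy\n"
                     : "no http proxy configured\n";

# The explicit alternative, for when the environment is not an option:
# $ua->proxy('http', 'http://proxy.example.com:8080/');
```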

Use LWP::Simple on the command line

This is well documented in lwpcook, but it is worth mentioning. *NIX people usually have the excellent wget program to take care of this stuff. It is probably available somewhere for other platforms too, and it is included in cygwin (though not by default).

But if you know how to use LWP::Simple on the command line, and you have perl available (you do have perl on all your computers, right?) then you already know how to fetch files and pages on any platform. This is a very nice tool to have in one's toolbox.

You could even use the chatterbox from the command line, using any of these (depending on if you are more fluent in XML or HTML) to read it:

perl -MLWP::Simple -e "getprint 'http://perlmonks.org?node_id=145587'"

perl -MLWP::Simple -e "getprint 'http://perlmonks.org?node=showchatmessages&displaytype=raw'"

...and something like this to post your own messages:

perl -MLWP::Simple -e "get 'http://perlmonks.org?op=login&user=Dog and Pony&passwd=doNotUseThisPW&op=message&message=Hi it is me on the command line!'"

Although, for your sanity's sake, I do not really recommend it... :)

Try using get to post data into forms

Many forms out there on the web don't really need a POST request to accept your data. One good example is the regular search box on the top of the perlmonks pages; it expects the field 'node' to contain some search words. But it doesn't care if it is a GET or POST, even though the form itself uses a POST.

This code works just fine:

my $words = 'LWP::Simple tutorial';
my $html = get "http://www.perlmonks.org/index.pl?node=$words";

What this really comes down to is that a server can check whether a request is really a POST or not. PerlMonks has wisely chosen not to, which makes this trick possible - not to mention that arbitrary linking such as [some words] relies on it to link as best it can. Very useful for names in the chatterbox, for instance.
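If the search words might contain spaces, '&' or other awkward characters, it is probably safer to let the URI module build the query string. A small sketch; form_url is a helper name I made up, while query_form is a regular URI method:

```perl
use strict;
use warnings;
use URI;
use LWP::Simple qw(get);

# Build a GET URL with the form fields escaped for us.
sub form_url {
    my ($base, %fields) = @_;
    my $uri = URI->new($base);
    $uri->query_form(%fields);   # escapes spaces, '&', '=' and friends
    return $uri->as_string;
}

my $url = form_url('http://www.perlmonks.org/index.pl',
                   node => 'LWP::Simple tutorial');
print "$url\n";

# my $html = get $url;   # uncomment to actually run the search
```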

The way to POST data described in lwpcook is not very hard or complex either, but this way still beats it.
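For the forms that do insist on a POST, one option that stays close to LWP::Simple is to go through the exported UserAgent, since LWP::UserAgent has a post method. This is only a sketch: post_form is a name I invented, and it simply gives up (returns undef) on any non-successful response.

```perl
use strict;
use warnings;
use LWP::Simple qw($ua);

# POST form fields to a URL using the UserAgent from LWP::Simple.
# Returns the response body, or undef if the request failed.
sub post_form {
    my ($url, %fields) = @_;
    my $response = $ua->post($url, \%fields);
    return $response->is_success ? $response->decoded_content : undef;
}

# Hypothetical usage, mirroring the search example:
# my $html = post_form('http://www.perlmonks.org/index.pl',
#                      node => 'LWP::Simple tutorial');
```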

Use the HTTP status codes when possible

LWP::Simple also exports the HTTP::Status constants and procedures, as documented. The author notes that this is a mistake and makes LWP::Simple slower, but while it is there, we should really take advantage of it for the functions that make it possible.

The functions in LWP::Simple that return an HTTP status code are getprint, getstore and mirror. This is for example the number '200' for a successful fetch, or '404' for 'Page not found', as documented in HTTP::Status. We can use these numbers to determine the success or failure of a fetch.

But unless we have special needs it is simpler than that, since the functions is_success and is_error are exported too. Feed them these numbers for a quick answer on whether everything is fine or not:

my $response_code = mirror $webpage, 'webpage.html';

die "Bad response $response_code" unless is_success($response_code);

Note: If you do the trick with exporting the UserAgent above, you will need to explicitly import these functions too.
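One thing to watch with mirror: as far as I can tell it reports 304 (RC_NOT_MODIFIED) when the local copy is already up to date, and is_success is false for that code. A sketch of how one might sort the cases apart, with an explicit import list; mirror_verdict is a made-up helper name:

```perl
use strict;
use warnings;
# $ua and mirror are listed only to show the explicit import style.
use LWP::Simple qw($ua mirror is_success status_message RC_NOT_MODIFIED);

# Turn a status code from mirror() into a short verdict.
sub mirror_verdict {
    my ($code) = @_;
    return 'fetched'   if is_success($code);
    return 'unchanged' if $code == RC_NOT_MODIFIED;
    return "failed: $code " . status_message($code);
}

# Hypothetical usage; $webpage is whatever URL you are mirroring:
# my $verdict = mirror_verdict(mirror($webpage, 'webpage.html'));
# die "$verdict\n" if $verdict =~ /^failed/;

print mirror_verdict($_), "\n" for 200, 304, 404;
```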

Use head to determine if a site is up

This is somewhat covered in lwpcook, but it doesn't mention that this is much easier on the network traffic and the web server (if that is an issue). So if all you want to do is check if the server is responding, or if the document exists, without actually fetching it - use the function head:

use LWP::Simple;

print "$webpage exists and server is up!\n" if (head($webpage));

It is also worth noting that pinging the server will not tell you if the web server is up, so head is the way to go for this.
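Some servers apparently do not answer HEAD requests properly even though the page itself is fine. If that bites you, a fallback to a plain get is one possible workaround, at the cost of fetching the whole document. A sketch, where url_is_up is a name of my own:

```perl
use strict;
use warnings;
use LWP::Simple qw(head get);

# True if the URL answers a HEAD request or, failing that, a GET.
sub url_is_up {
    my ($url) = @_;
    return 1 if head($url);             # cheap check first
    return defined get($url) ? 1 : 0;   # heavier fallback
}

# Pass a URL on the command line to try it out.
if (@ARGV) {
    print "$ARGV[0] is ", (url_is_up($ARGV[0]) ? 'up' : 'down'), "\n";
}
```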

Of course, head also gives you some information in the form of a list if you want it: Content-type, document length, last modified time, expiry date and server name, in that order.

my @headers = head $webpage;

print join "\n", @headers;

This will print the data for the webpage of your choice.
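Since the order is fixed, the list can also be unpacked into named variables. A sketch, assuming nothing beyond LWP::Simple itself; describe_url is an invented name, and the two time values are treated as epoch seconds:

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Describe a URL using the five values head() returns in list context.
sub describe_url {
    my ($url) = @_;
    # An empty list from head() means the request failed.
    my ($type, $length, $modified, $expires, $server) = head($url)
        or return "no response for $url";

    return join "\n",
        'Content-type:  ' . ($type   // 'unknown'),
        'Length:        ' . ($length // 'unknown'),
        'Last modified: ' . (defined $modified ? scalar localtime($modified) : 'unknown'),
        'Expires:       ' . (defined $expires  ? scalar localtime($expires)  : 'unknown'),
        'Server:        ' . ($server // 'unknown');
}

# Pass a URL on the command line to try it out.
print describe_url($ARGV[0]), "\n" if @ARGV;
```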

Drawbacks

Well, none that aren't advertised in the documentation, but there are some things that one may or may not like:

LWP::Simple might seem limited. Well, it is, by design. Of course it would be nice to be able to do POSTs as easily, but I've noticed that I rarely actually need that, and there are still ways to do it when you do need it. LWP::Simple seems to cover most of the basic cases you stumble upon.
LWP::Simple pollutes the name space. Indeed it does, and that tends to be something I don't really like. If I see a subroutine call 'get', how do I know if it is mine or someone else's? This can be a problem when using someone else's code, or your own old code. You can "solve" this by documenting the call with a comment, or by always calling your own subs with a leading '&'. LWP::Simple tends (for me) to show up in small scripts and oneliners, so then it isn't very hard to see what is going on, and it makes things much easier. It also allows you to easily use LWP::Simple on the command line.

Final words

As you can see, there is lots and lots to gain by using LWP::Simple, and by using it right. Simple doesn't always have to mean (too) limited. I hope this has been a help in your web programming and/or automation tasks - sometimes, simple is all it takes.

You have moved into a dark place.
It is pitch black. You are likely to be eaten by a grue.


Replies are listed 'Best First'.
LWP::Simple UserAgent and Fire-walls by arunhorne (Pilgrim) on May 20, 2002 at 17:47 UTC
Nice article, and the minimalist nature of LWP::Simple is really in the spirit of Perl (IMO). It also has a number of very handy features, like being able to save a webpage directly to a file... really useful for retrieving the biological data files that form the basis of my work (as they are stored as plain text). With regard to exporting the user agent from LWP::Simple, this is always a good idea. However, it becomes essential (in that you need a ua of some kind and it would be silly to create another ;D) if you are behind a firewall. In my post Incase You Need to Use a Proxy with LWP I have included some code snippets to show how I used LWP::Simple from behind a firewall. This may be necessary for many who are stuck behind a corporate/institution's firewall. Hope that adds to the material. Arun
(wil) Re: Getting more out of LWP::Simple by wil (Priest) on May 20, 2002 at 17:40 UTC
Great article, Dog and Pony! That made for some interesting reading and you covered a lot of ground. Have you considered converting this into a tutorial? I can see where I would have benefited from such an article when I ran into problems with the LWP::* branch of modules a few weeks back.

Use head to determine if a site is up

The only thing I would add here is that you do not always get a valid HEAD response back from some web servers. This really screwed me for days when trying to write a Link Checker a few weeks back, and I was directed to crzyinsomniac's LWP head replacement, which works great and is much more reliable than solely relying on the HEAD response.

- wil
Re: Getting more out of LWP::Simple by ignatz (Vicar) on May 21, 2002 at 20:27 UTC
D&P, I think that there is a lot of great information in here that would make this node a very welcome addition to Tutorials. Tye has informed me that there is no such thing as "move to tutorials" as I had requested and that the best thing to do is to ask the author to repost it under tutorials, and so I am doing so. ()-() \"/ `
Re: Re: Getting more out of LWP::Simple by Dog and Pony (Priest) on May 23, 2002 at 07:28 UTC
Done. I've reposted it under Tutorials here. I'd like to add a big thank you, I am very flattered by this response. :) You have moved into a dark place. It is pitch black. You are likely to be eaten by a grue.
Re: Re: Re: Getting more out of LWP::Simple by Anonymous Monk on Mar 23, 2003 at 20:55 UTC
This looks exactly like what I need, but being a bit of a Perl novice I'm still not able to use this advice as I'd like. I am trying to write a web page that will allow the surfer to enter a URL, fetch the page content, then process the content (partially translate it into German), and produce a new copy of the now modified page. This would be fine if all the pages the surfer wanted were behind the university firewall where my proto-script lives. Of course, I want them to be able to access the entire web. If anyone could explain why the LWP::UserAgent method below doesn't solve my problem I'd be very grateful. It's for intermediate learners of English.

Toby

#!/usr/local/bin/perl
use URI;
use LWP::Simple;

print "Content-type:text/html\n\n";
$ua->proxy('http', 'http://$myproxy:$myport');
# Store the output of the web page (html and all) in content
my $content = get("http://news.bbc.co.uk");
if (defined $content) {
    # $content will contain the html associated with the url mentioned above.
    print $content;
}
else {
    # If an error occurs then $content will not be defined.
    print "Error: Get stuffed";
}

Edit: Added `<code>` tags. larsen
Re^4: Getting more out of LWP::Simple (import it) by Aristotle (Chancellor) on Mar 23, 2003 at 22:44 UTC
Re: Re^4: Getting more out of LWP::Simple by Anonymous Monk on Mar 24, 2003 at 12:14 UTC
Re: Re: Re: Re: Getting more out of LWP::Simple by jasonk (Parson) on Mar 23, 2003 at 21:18 UTC
Re^5: Getting more out of LWP::Simple by Aristotle (Chancellor) on Mar 23, 2003 at 23:02 UTC

