Getting more out of LWP::Simple
by Dog and Pony (Priest) on May 20, 2002 at 16:28 UTC
I have lately had reason to use LWP::Simple for lots of small tasks, including downloading a PDF on the command line without wget (since my browser didn't get it right) and fetching the Chatterbox XML ticker. None of these would have been quite as easy without LWP::Simple, although there are of course alternatives. But, as I'm sure you have heard, it is recommended to "Do the simplest thing that could possibly work", which in Perl programming often means using the package named XXX::Simple.
While doing this, I've "discovered" a few neat tricks that make its use even simpler, or more effective, and I thought I'd also share a few other things that might not be a given (like the HEAD part) for those not so familiar with HTTP and web servers.
I'm sure there is more that I am missing here, but these things made life easier for me, at least. So here goes:
Read the documentation
Sounds like a given, but it is easy to neglect - or to think that one remembers everything. lwpcook, LWP::Simple, LWP::UserAgent and LWP are good places to look. Or just type perldoc name on your command line - you should have this utility bundled with your perl distribution.
This mini tutorial assumes that you have some basic knowledge of using LWP::Simple.
Export the UserAgent
A poorly documented feature of LWP::Simple is that it supports exporting the LWP::UserAgent object it uses to fetch with.
Why would you want to do that? Well, the default timeout for LWP::Simple is the same as for LWP::UserAgent, that is 180 seconds, or three minutes. This is often way too long. In one real life example of mine, I had a small script going live every minute, fetching something from the web - such a timeout might mean that I have several copies of the script running simultaneously, potentially accessing the same log files or something similar. There are other ways to work around this, of course, such as setting alarms or implementing file locking. But waiting that long made no sense either way, since if the page didn't respond within 30 seconds, it was probably down anyway.
This problem goes away if you import $ua together with get and set a 30 second timeout on it.
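A minimal sketch of that idea. Taking the URL from the first command-line argument is my own addition for illustration, not something LWP::Simple requires:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ask LWP::Simple for its internal user agent object as well as get().
# With an explicit import list, only the names listed here are imported.
use LWP::Simple qw($ua get);

# Give up after 30 seconds instead of the default 180.
$ua->timeout(30);

# Hypothetical usage: the URL to fetch comes from the command line.
if (my $url = shift @ARGV) {
    my $content = get($url);
    die "Could not fetch $url\n" unless defined $content;
    print $content;
}
```

Since get returns undef on any failure, a timeout simply shows up as a failed fetch.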
Another thing you might want to do is change your reported user agent string, which is a single call to $ua->agent.
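Something like the following should do it. The agent name 'MyFetcher/0.1' is made up for the example:

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);

# Replace the default identification string sent in the User-Agent header.
# 'MyFetcher/0.1' is a made-up name; use whatever describes your script.
$ua->agent('MyFetcher/0.1');

# Every later get() from this script now reports the new name.
if (my $url = shift @ARGV) {
    my $content = get($url);
    print defined $content ? $content : "Could not fetch $url\n";
}
```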
You can also give the exported user agent a cookie jar. And, as usual with cookie jars, you can of course specify a file to save the cookies in between invocations of the script.
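A sketch of a file-backed cookie jar on the exported agent. The file name lwp_cookies.txt is just a placeholder:

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);
use HTTP::Cookies;

# Attach a cookie jar to the agent that LWP::Simple uses internally.
# 'lwp_cookies.txt' is a placeholder file name. With autosave set, the
# jar is written back to that file when the object is destroyed, which
# normally happens at script exit.
$ua->cookie_jar(
    HTTP::Cookies->new(
        file     => 'lwp_cookies.txt',
        autosave => 1,
    )
);

# Cookies received here are sent back on later requests to the same site.
if (my $url = shift @ARGV) {
    my $content = get($url);
    print defined $content ? $content : "Could not fetch $url\n";
}
```

Leaving out the arguments to HTTP::Cookies->new gives a jar that only lives in memory for the duration of the script.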
As you can see, this opens up some possibilities for extra tweaking. But why not use LWP::UserAgent then, instead? Well, simply because this way is so much simpler if you only need those small extras. The corresponding LWP::UserAgent version of the timeout example has you create the agent, build a request, send it and check the response yourself.
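For comparison, here is roughly what the timeout example looks like with LWP::UserAgent directly. Again, reading the URL from the command line is only my assumption for the sake of a runnable example:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Create and configure the agent ourselves.
my $ua = LWP::UserAgent->new;
$ua->timeout(30);

# Hypothetical usage: the URL to fetch comes from the command line.
if (my $url = shift @ARGV) {
    # Build the request, send it, and inspect the response by hand.
    my $request  = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);

    if ($response->is_success) {
        print $response->content;
    }
    else {
        die "Could not fetch $url: ", $response->status_line, "\n";
    }
}
```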
As you can see, lots more typing. See LWP::UserAgent for all the possibilities you have here.
Update: I added the next section after getting inspiration from arunhorne's node below. I think that there is a simpler way to do this, again in some cases. It is somewhat related to the previous section, since you might use the UserAgent for this.
Use environment variables to set proxies
As pointed out by arunhorne below, it is sometimes necessary to use a proxy because you are behind a firewall. As suggested, one can always use the exported UserAgent to cope with this, by setting $ua->proxy.
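A small sketch of the $ua->proxy approach. The proxy address proxy.example.com:8080 is a placeholder, not a real server:

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);

# Route all http requests through a proxy.
# 'proxy.example.com:8080' is a placeholder; use your own proxy here.
$ua->proxy('http', 'http://proxy.example.com:8080/');

# Hypothetical usage: the URL to fetch comes from the command line.
if (my $url = shift @ARGV) {
    my $content = get($url);
    print defined $content ? $content : "Could not fetch $url\n";
}
```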
But upon init, LWP::Simple will also call $ua->env_proxy as described at LWP::UserAgent, which means that if you use the same script somewhere else, or several LWP::Simple scripts, it might be easier to simply set your environment variables, like http_proxy for all http requests. However, if the proxy requires credentials, I don't think that is possible to do via the environment, in which case you must resort to the UserAgent way of doing things.
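If you want to check what LWP::Simple actually picked up from the environment, a quick sketch like this can help. It only reads the setting; it does not change anything:

```perl
use strict;
use warnings;
use LWP::Simple qw($ua);

# Called with only a scheme, proxy() returns the current setting for it.
# Anything found in http_proxy when LWP::Simple was loaded shows up here.
my $proxy = $ua->proxy('http');

print defined $proxy
    ? "HTTP proxy in use: $proxy\n"
    : "No HTTP proxy picked up from the environment\n";
```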
This is an easy way to set your proxy, on that machine, for all eternity - without modifying the script. That may, or may not be what you want. :)
The docs on LWP::UserAgent mention how to set these on *NIX based platforms. I just want to add that on Windows, the command you want is set - try typing set /? to get some instructions. Or just set it the GUI way, which should be somewhere below the control panel.
Use LWP::Simple on the command line
This is well documented in lwpcook, but it is worth mentioning. *NIX people usually have the excellent wget program to take care of this stuff. It is probably available somewhere for other platforms too, and it is included in cygwin (though not by default).
But if you know how to use LWP::Simple on the command line, and you have perl available (you do have perl on all your computers, right?) then you already know how to fetch files and pages on any platform. This is a very nice tool to have in one's toolbox.
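As a sketch of what this looks like in practice, the comments below show the one-liner forms, and the script does the same thing with an error check. The example.com URLs and the helper name save_url are placeholders of mine:

```perl
#!/usr/bin/perl
# One-liner forms, typed at a Unix shell prompt (the URLs are placeholders):
#
#   perl -MLWP::Simple -e 'getprint "http://www.example.com/"'
#   perl -MLWP::Simple -e 'getstore "http://www.example.com/doc.pdf", "doc.pdf"'
#
# The script below does the same as the second one-liner, with an error check.
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: save $url to $file and return true on success.
sub save_url {
    my ($url, $file) = @_;
    return is_success(getstore($url, $file));
}

# Usage: perl fetch.pl URL FILE
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    save_url($url, $file) or die "Could not save $url to $file\n";
    print "Saved $url to $file\n";
}
```

On the Windows command prompt the quoting has to be swapped around: double quotes around the code after -e, single quotes around the URL.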
You could even use the chatterbox from the command line, using any of these (depending on if you are more fluent in XML or HTML) to read it:
...and something like this to post your own messages:
Although, for your sanity's sake, I do not really recommend it... :)
Try using get to post data into forms
Many forms out there on the web don't really need a POST request to accept your data. One good example is the regular search box at the top of the perlmonks pages; it expects the field 'node' to contain some search words. But it doesn't care if it is a GET or POST, even though the form itself uses a POST.
So a plain get with the search words in the query string works just fine.
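Here is a sketch of such a search. The base URL http://www.perlmonks.org/index.pl is my assumption about where the search form submits to; only the field name 'node' comes from the text above. The helper name search_url is made up:

```perl
use strict;
use warnings;
use LWP::Simple;
use URI;

# Hypothetical helper: build a search URL with the words in the 'node' field.
# The base URL is an assumption; adjust it if the site lives elsewhere.
sub search_url {
    my ($words) = @_;
    my $uri = URI->new('http://www.perlmonks.org/index.pl');
    $uri->query_form(node => $words);    # takes care of the URL escaping
    return $uri->as_string;
}

# Usage: perl search.pl some search words
if (@ARGV) {
    my $page = get(search_url("@ARGV"));
    die "Search request failed\n" unless defined $page;
    print $page;
}
```

Letting URI build the query string saves you from escaping spaces and other special characters by hand.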
What this really comes down to is that the server can check whether a request is really a POST or not. PerlMonks has wisely chosen not to do so, thus making it much simpler for people to use this ability - not to mention that arbitrary linking such as [some words] uses this to link as best it can. Very useful for names in the chatterbox, for instance.
The way to POST data described in lwpcook is not very hard or complex either, but this way still beats it.
Use the HTTP status codes when possible
LWP::Simple also exports the HTTP::Status constants and procedures, as documented. The author notes that this is a mistake and makes LWP::Simple slower, but while it is there, we should really take advantage of it for the functions that make it possible.
The functions in LWP::Simple that return an HTTP status code are getprint, getstore and mirror. This is for example the number '200' for a successful fetch, or '404' for 'Page not found', as documented in HTTP::Status. We can use these numbers to determine the success or failure of a fetch.
But it is simpler than that, unless we have special needs, as we also get the functions is_success and is_error exported. We can feed these numbers to them and get a quick answer as to whether everything is fine or not.
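A short sketch of checking the status from getstore. The helper name verdict and the command-line handling are made up for the example:

```perl
use strict;
use warnings;
use LWP::Simple;    # no import list, so is_success and is_error come along

# Hypothetical helper: turn a status code into a short verdict.
# Codes that are neither success nor error (for example the
# 304 "Not Modified" that mirror can return) end up as "other".
sub verdict {
    my ($status) = @_;
    return is_success($status) ? "ok ($status)"
         : is_error($status)   ? "failed ($status)"
         :                       "other ($status)";
}

# Usage: perl store.pl URL FILE
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    my $status = getstore($url, $file);
    print "$url: ", verdict($status), "\n";
}
```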
Note: If you do the trick with exporting the UserAgent above, you will need to explicitly import these functions too, since an explicit import list replaces the default one.
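In other words, the use line grows a bit, roughly like this:

```perl
use strict;
use warnings;

# Name everything you use: the exported agent, the fetch function,
# and the status helpers.
use LWP::Simple qw($ua getstore is_success is_error);

$ua->timeout(30);

# Usage: perl store.pl URL FILE
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    my $status = getstore($url, $file);
    print is_success($status) ? "Saved $file\n" : "Failed with status $status\n";
}
```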
Use head to determine if a site is up
This is somewhat covered in lwpcook, but it doesn't mention that this is much easier on the network traffic and the web server (if that is an issue). So if all you want to do is check if the server is responding, or if the document exists, without actually fetching it - use the function head.
It is also worth noting that pinging the server will not tell you if the web server is up, so a HEAD request is what you want for this.
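A minimal sketch of such a check. The helper name is_up is my own:

```perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: true if a HEAD request for $url succeeds.
# In scalar context head() returns a true value on success and
# nothing on failure, so it can be used directly as a condition.
sub is_up {
    my ($url) = @_;
    return head($url) ? 1 : 0;
}

# Usage: perl isup.pl URL
if (my $url = shift @ARGV) {
    print "$url is ", (is_up($url) ? "up" : "down or missing"), "\n";
}
```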
Of course, you also get some information in the form of a list from head if you want it. Namely Content-type, document length, last modified time, expiry date and server name, in that order.
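A sketch of using head in list context. The helper describe_head is made up for the example, and the two time values are converted from epoch seconds to readable dates:

```perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: format the five values that head() returns.
sub describe_head {
    my ($type, $length, $mtime, $expires, $server) = @_;
    my $show = sub { defined $_[0] ? $_[0] : 'unknown' };
    # The two time values arrive as epoch seconds.
    my $time = sub { defined $_[0] ? scalar localtime($_[0]) : 'unknown' };
    return join '',
        "Content-type:  ", $show->($type),    "\n",
        "Length:        ", $show->($length),  "\n",
        "Last modified: ", $time->($mtime),   "\n",
        "Expires:       ", $time->($expires), "\n",
        "Server:        ", $show->($server),  "\n";
}

# Usage: perl headinfo.pl URL
if (my $url = shift @ARGV) {
    my @fields = head($url) or die "No response for $url\n";
    print describe_head(@fields);
}
```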
Printing those five values shows this data for the webpage of your choice.
Well, none that aren't advertised in the documentation, but there are some things that one may or may not like:
As you can see, there is lots and lots to gain by using LWP::Simple, and by using it right. Simple doesn't always have to mean (too) limited. I hope this has been a help in your web programming and/or automation tasks - sometimes, simple is all it takes.