PerlMonks  

LWP: Downloading First 2KB of an HTML File

by Anonymous Monk
on Feb 21, 2003 at 22:23 UTC (#237586)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to use LWP to get a page's META tags. However, to save bandwidth, is there a way to only download the first 2KB of the file (since most pages have META tags right at the top)?

Replies are listed 'Best First'.
Re: LWP: Downloading First 2KB of an HTML File
by Aristotle (Chancellor) on Feb 22, 2003 at 00:17 UTC

    Of course, a Range header will fail in a lot of cases, where the webserver doesn't know ahead of time the size of the document it is going to transmit.

    In that case you can use the $ua->request($request, \&callback, 4096); form of the request method. LWP::UserAgent will then call your callback function as it downloads, passing it chunks of the specified length.

    The POD for LWP::UserAgent says this:

    The request can be aborted by calling die in the callback routine. The die message will be available as the "X-Died" special response header field.

    The easiest route, then, is to request chunks of 2 KB and die unconditionally the first time the callback is called. That loads the first chunk and aborts the request. Note that the chunk size is only a hint to LWP, so the first chunk may be shorter than 2 KB.

    my $data;
    # Keep the first chunk, then die to abort the rest of the download.
    $ua->request($request, sub { $data = shift; die }, 2048);

    There you go.
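
    If you want to be sure of getting close to the full 2 KB, one option is to accumulate chunks and die only once enough has arrived. A rough sketch follows, assuming a URL passed on the command line; make_collector and fetch_head_bytes are names made up for this example, not part of LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a content callback that appends chunks to $$buf_ref and aborts
# the transfer (via die) once at least $limit bytes have arrived.
sub make_collector {
    my ($buf_ref, $limit) = @_;
    return sub {
        my ($chunk) = @_;    # LWP also passes the response and protocol objects
        $$buf_ref .= $chunk;
        die "limit reached\n" if length($$buf_ref) >= $limit;
    };
}

# Fetch roughly the first $limit bytes of $url (hypothetical helper).
sub fetch_head_bytes {
    my ($url, $limit) = @_;
    my $ua   = LWP::UserAgent->new(timeout => 10);
    my $data = '';
    my $res  = $ua->request(
        HTTP::Request->new(GET => $url),
        make_collector(\$data, $limit),
        $limit,              # read size hint, not a guarantee
    );
    # With a content callback the body goes to the callback, not into $res.
    return ($data, $res);
}

# Only touch the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my ($data, $res) = fetch_head_bytes($url, 2048);
    printf "%s: got %d bytes (X-Died: %s)\n",
        $res->status_line, length $data, $res->header('X-Died') // 'none';
}
```

    The buffer can end up slightly larger than the limit, because the last chunk is appended whole before the callback dies.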

    Makeshifts last the longest.

Re: LWP: Downloading First 2KB of an HTML File
by zengargoyle (Deacon) on Feb 21, 2003 at 22:51 UTC

    Add a Range header to your request.

    $ GET -H 'Range: bytes=0-13' http://localhost/foo.html
    <html>
    <head>
    $ GET -H 'Range: bytes=0-42' http://localhost/foo.html
    <html>
    <head>
    <title>foo</title>
    <meta http$

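    The same request can be made from LWP itself. This is a sketch only, assuming a URL given on the command line; first_bytes_range is a made-up helper. A server that honours the header should answer 206 Partial Content, while one that ignores it will typically send the whole page with a plain 200.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build the Range header value for the first $n bytes. Byte offsets are
# zero-based and inclusive, so 2048 bytes is "bytes=0-2047".
sub first_bytes_range {
    my ($n) = @_;
    return 'bytes=0-' . ($n - 1);
}

# Only touch the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->get($url, Range => first_bytes_range(2048));

    if ($res->code == 206) {
        # 206 Partial Content: the server honoured the range.
        print "partial: ", length($res->content), " bytes\n";
    }
    elsif ($res->is_success) {
        # Any other success: the server ignored the header and sent everything.
        print "full body: ", length($res->content), " bytes\n";
    }
    else {
        print "failed: ", $res->status_line, "\n";
    }
}
```
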
Re: LWP: Downloading First 2KB of an HTML File
by Cabrion (Friar) on Feb 22, 2003 at 00:57 UTC
    Read up on LWP::UserAgent's head() method. It will retrieve everything (including meta tags) in the target document's header section without retrieving the body.

      Meta tags, by definition, are not in the header section (the header section is not the same as the <head> section of the page). The point of meta tags is to include information in the body of the document that really should have been in the header, when you have no way to influence how the web server builds the headers.
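
    Since the meta tags live in the document itself, they have to be parsed out of whatever part of the page was downloaded. One way to do that, sketched here with HTML::HeadParser (from the HTML::Parser distribution), is to feed it the truncated chunk. The HTML below is an invented stand-in for the first bytes of a real page.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical truncated chunk, standing in for the first bytes of a page.
my $chunk = <<'HTML';
<html>
<head>
<title>foo</title>
<meta name="description" content="An example page">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body><p>trunc
HTML

my $p = HTML::HeadParser->new;
$p->parse($chunk);    # stops on its own once the <head> section has ended

# <meta name="..."> shows up as an X-Meta-* pseudo-header,
# <meta http-equiv="..."> under the given header name.
my $title       = $p->header('Title');
my $description = $p->header('X-Meta-Description');
my $ctype       = $p->header('Content-Type');

print "title:        $title\n";
print "description:  $description\n";
print "content-type: $ctype\n";
```

    The parser does not need the rest of the document, so a chunk that is cut off somewhere in the body is fine as long as the tags you care about arrived complete.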
