PerlMonks  

Re^4: Split file, first 30 lines only

by wrkrbeee (Scribe)
on Mar 01, 2017 at 20:01 UTC


in reply to Re^3: Split file, first 30 lines only
in thread Split file, first 30 lines only

Hi Hippo, your answer above states "If you don't want to download the whole file then that's a different matter entirely and would require use of a technique such as HTTP Ranges." I've tried to Google HTTP ranges but had no luck. Any ideas where I can get a sense of what you have in mind? Nothing else seems to work (I'm just trying to nab the first few lines from web pages). Thanks!

Re^5: Split file, first 30 lines only (HTTP Ranges)
by hippo (Bishop) on Mar 02, 2017 at 09:38 UTC

    Ranges are documented in section 14.35 of the HTTP/1.1 RFC (RFC 2616). They allow an HTTP client to request only part (or parts) of a resource which would ordinarily be retrieved in full (or in server-chosen chunks) from the server.

    The RFC only mandates byte-count ranges, so you should use those instead of lines in order to be portable. However, if you are after the first 30 lines of a 50,000-line response, just pick a byte range large enough that you will likely retrieve at least your 30 lines. If fewer lines are returned, you can issue subsequent requests until you have all the data you require.
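The byte-range approach described above could be sketched roughly as follows. This is only a sketch under assumptions, not code from the thread: the helper names `byte_range`, `first_lines` and `fetch_first_lines`, the 4096-byte step and the command-line handling are all made up for illustration. It also assumes the server honours the `Range` header by answering 206 Partial Content; if it answers 200 instead, the whole body has already been sent and the sketch just uses that.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a Range header value for an inclusive span of bytes.
sub byte_range {
    my ($first, $last) = @_;
    return "bytes=$first-$last";
}

# Return the first $want complete lines of $text, or undef if $text
# does not yet contain that many newline-terminated lines.
sub first_lines {
    my ($text, $want) = @_;
    my @complete = grep { /\n\z/ } split /^/, $text;
    return if @complete < $want;
    return join '', @complete[0 .. $want - 1];
}

# Request successive byte ranges of $step bytes until $want lines are
# in hand, the resource ends, or the server ignores the Range header.
sub fetch_first_lines {
    my ($url, $want, $step) = @_;
    require LWP::UserAgent;          # loaded only when we really fetch
    my $ua   = LWP::UserAgent->new;
    my $data = '';
    while (1) {
        my $start = length $data;
        my $res   = $ua->get($url,
            Range => byte_range($start, $start + $step - 1));
        last unless $res->is_success;    # e.g. 416 past the end
        if ($res->code != 206) {         # Range ignored: full body sent
            $data = $res->content;
            last;
        }
        $data .= $res->content;
        last if defined first_lines($data, $want);
        last if length($res->content) < $step;   # short read: end of resource
    }
    my $head = first_lines($data, $want);
    return defined $head ? $head : $data;
}

# Hypothetical usage: perl first30.pl URL
print fetch_first_lines($ARGV[0], 30, 4096) if @ARGV;
```

Each request asks only for the bytes not yet received, so the total transferred stays close to what the 30 lines actually need, at the cost of one round trip per step.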

      Is that not what :read_size_hint => $bytes of LWP::UserAgent is for?

      Or in other words: is :read_size_hint the implementation of the HTTP ranges you are talking about?

      If I remember correctly, the word "hint" is there because there is no guarantee that the chunk retrieved will be exactly $bytes long: it is merely a hint, which LWP may disregard.

      Despite that caveat, which I remember reading somewhere, the following example seems to demonstrate that data is retrieved in chunks of exactly the desired length, even for bizarre values of $bytes.

      Obviously the last chunk will be of arbitrary length.
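A test of this kind might look something like the sketch below. It is not the code originally posted: the callback name `record_chunk`, the default of 333 bytes and the command-line handling are assumptions for illustration. It only records the length of each chunk handed to the content callback so that the sizes can be compared with the hint.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @chunk_lengths;

# Content callback: LWP calls this once per chunk with the chunk data,
# the response object and the protocol object.
sub record_chunk {
    my ($chunk, $res, $proto) = @_;
    push @chunk_lengths, length $chunk;
}

# Hypothetical usage: perl chunks.pl URL [BYTES]
if (@ARGV) {
    require LWP::UserAgent;
    my ($url, $bytes) = @ARGV;
    $bytes //= 333;                  # deliberately odd chunk size
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($url,
        ':content_cb'     => \&record_chunk,
        ':read_size_hint' => $bytes);
    print 'Status: ', $res->status_line, "\n";
    print scalar(@chunk_lengths), " chunks, lengths: @chunk_lengths\n";
}
```

Note that this still downloads the entire resource; it only changes how the data is handed to the callback.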

      thanks

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
        is :read_size_hint the implementation of the HTTP ranges you are talking about?

        It is one way of getting a similar effect, but it requires careful use of the callback. As you can see from your code, it downloads all the content but in chunks of your specified size. Since the aim here is to download only the minimum amount of data from the server, the callback must die to stop the subsequent chunks being retrieved, e.g.:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use utf8;
        use LWP::UserAgent;

        # Modify these three variables only to suit
        my $url       = 'http://www.gutenberg.org/ebooks/1533.txt.utf-8'; # MacBeth
        my $wantlines = 30;  # Retrieve this number of lines
        my $bytes     = 256; # Chunk size to download

        my $firstndata;
        my $linecount  = 0;
        my $chunkcount = 0;

        sub add_chunk {
            my ($chunk, $res, $proto) = @_;
            $firstndata .= $chunk;
            $linecount += () = $chunk =~ /\n/g;
            $chunkcount++;
            die if $linecount >= $wantlines;
        }

        my $ua  = LWP::UserAgent->new;
        my $res = $ua->get ($url, ':content_cb' => \&add_chunk, ':read_size_hint' => $bytes);
        print "Retrieved $linecount lines in $chunkcount chunks from $url:\n\n$firstndata\n";

        If you run this, you will see that it retrieves slightly more than the 30 lines required, but substantially less than the full text. This seems like a reasonable compromise and is, of course, tunable by the user to the specific task at hand by varying $wantlines and $bytes.
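If exactly the requested number of lines is wanted rather than slightly more, the surplus can be trimmed after the callback has died. The following is a rough sketch along those lines, not part of the code above: `make_collector` and the command-line handling are hypothetical names introduced for illustration. It relies on LWP adding a `Client-Aborted: die` header to the response when the content callback dies, which lets the script tell an intentional early stop from a transfer that simply ran to the end.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a content callback that accumulates chunks and dies once $want
# newlines have been seen, plus a getter that returns at most the first
# $want lines collected so far.
sub make_collector {
    my ($want) = @_;
    my $data  = '';
    my $lines = 0;
    my $callback = sub {
        my ($chunk) = @_;
        $data  .= $chunk;
        $lines += () = $chunk =~ /\n/g;
        die "enough lines\n" if $lines >= $want;
    };
    my $result = sub {
        my @all  = split /^/, $data;
        my $last = $#all < $want - 1 ? $#all : $want - 1;
        return join '', @all[0 .. $last];
    };
    return ($callback, $result);
}

# Hypothetical usage: perl head30.pl URL
if (@ARGV) {
    require LWP::UserAgent;
    my ($callback, $result) = make_collector(30);
    my $res = LWP::UserAgent->new->get($ARGV[0],
        ':content_cb'     => $callback,
        ':read_size_hint' => 256);
    my $aborted = $res->header('Client-Aborted') // '';
    warn 'Transfer was not cut short: ', $res->status_line, "\n"
        unless $aborted eq 'die';
    print $result->();
}
```

Keeping the state inside a closure avoids the file-level variables, so several URLs could be fetched in one run with a fresh collector each.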

        thank you Discipulus!
      Thanks hippo!!
