download JPG series with error-handling

by Anonymous Monk
on Mar 28, 2005 at 02:05 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Let's say that hypothetically I'm downloading JPGs from a website.

My script is downloading a series of JPGs:

www.website.com/1/1.jpg
www.website.com/1/2.jpg
www.website.com/1/3.jpg
[...]
www.website.com/2/1.jpg
www.website.com/2/2.jpg
www.website.com/2/3.jpg
and so on.

And let's say that some of those JPGs, despite their existence being implied, might not exist.

When my code gets to a "bad patch", say there's no 42 folder, it goes on downloading

www.website.com/42/1.jpg
www.website.com/42/2.jpg
www.website.com/42/3.jpg
and saving them to disk as .JPG files, but in reality it's grabbing the 404 message from the website.

My question is, what's the easiest way to check for the case when the URL "www.website.com/42/1.jpg" is sending back an HTML "sorry" page, not a JPG?

I'm just using LWP::Simple at the moment with getstore().

Should I do a HEAD request first? Or just GET the URL anyway, check its MIME type, and only store it if it's image/jpeg? Or can I trust that a site will always return a 404 response code for a URL that doesn't exist, and go by that? What's most efficient? Should I start using LWP::UserAgent instead of LWP::Simple?

Replies are listed 'Best First'.
Re: download JPG series with error-handling
by moot (Chaplain) on Mar 28, 2005 at 02:18 UTC
    If you really don't want to make the leap to LWP::UserAgent, what's wrong with LWP::Simple's is_success or is_error functions? Or you could check the response code returned by getstore.

    Update: I wouldn't put money on the server sending back a 404 for a missing image, although any half-decent webmaster will do this (in addition to possibly supplying an error document), but I think you could rely on it for most applications. If you find this is not the case, I would think testing the mime type of the returned document would be more efficient than a HEAD request potentially followed by a GET.
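
    For instance, a minimal sketch of the getstore-plus-status-check approach (the URL and filename are placeholders taken from the question):

        use LWP::Simple qw(getstore is_success);

        my $url  = 'http://www.website.com/1/1.jpg';   # placeholder
        my $file = '1-1.jpg';

        my $code = getstore($url, $file);
        unless (is_success($code)) {
            warn "fetch of $url failed with HTTP $code\n";
            unlink $file;   # discard anything that may have been written
        }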

      410 (Gone) is another valid "this doesn't exist" response, and 403 might also be important. You might as well just detect any error return--the design might change in the future, after all. (HTTP errors start with a 4 or 5.)
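
      A sketch of that idea: rather than enumerating codes, classify the status with HTTP::Status, which is where LWP::Simple's is_error comes from anyway (URL and filename are placeholders):

          use LWP::Simple qw(getstore);
          use HTTP::Status qw(is_error);

          my $code = getstore('http://www.website.com/42/1.jpg', '42-1.jpg');
          # is_error() is true for every 4xx and 5xx status, so 403, 404
          # and 410 are all caught without listing them individually
          unlink '42-1.jpg' if is_error($code);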

      =cut
      --Brent Dax
      There is no sig.

        To be really safe here, you might check for the success code (200 for a plain GET, but don't quote me) rather than trying to think of all the possible error codes you might get.

        Also, just to throw another wrench in: checking the MIME type alone might not be good enough. What if the error "page" you get back is itself actually a JPEG image? I've seen that before. I don't know a way around it unless you are also checking the error code (and the webmaster configured the server sensibly to send back error codes).

        --DrWhy

        "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."

      I would definitely check is_error first, but as mentioned before, many times a missing file won't properly return a 404 code. Normally this is because the webmaster put a full URL in the ErrorDocument directive, which causes Apache to send a 302 response. What I've done before is check the file size. I could be fairly confident that the images would be 80K or more, while an error page isn't likely to be more than 10K-20K. So anything under 50K is assumed to be an error, anything over 50K is assumed to be the image.
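
      A hedged sketch of that size heuristic (the 50K cutoff is specific to that site and would need tuning):

          use LWP::Simple qw(getstore is_success);

          my $file = '1-1.jpg';   # placeholder name
          my $code = getstore('http://www.website.com/1/1.jpg', $file);

          # even a "successful" fetch may really be an error page; anything
          # under the site-specific 50K threshold is treated as suspect
          if (is_success($code) and -s $file < 50 * 1024) {
              warn "$file is under 50K, assuming it's an error page\n";
              unlink $file;
          }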
Re: download JPG series with error-handling
by ikegami (Patriarch) on Mar 28, 2005 at 02:20 UTC

    You could use LWP::UserAgent's get method (my $response = $ua->get('http://...')), and check $response->is_success. In case the web server incorrectly returns success, you could also try checking the content type ($response->content_type) against m/^image\//.
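
    Spelled out as a sketch (the URL and filename are placeholders; the content-type test is the belt-and-braces part for servers that return 200 regardless):

        use LWP::UserAgent;

        my $ua       = LWP::UserAgent->new;
        my $response = $ua->get('http://www.website.com/1/1.jpg');

        if ($response->is_success and $response->content_type =~ m/^image\//) {
            open my $fh, '>', '1-1.jpg' or die "can't write 1-1.jpg: $!";
            binmode $fh;                    # JPEG data is binary
            print $fh $response->content;
            close $fh;
        }
        else {
            warn 'skipped: ', $response->status_line, "\n";
        }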

Re: download JPG series with error-handling
by bart (Canon) on Mar 28, 2005 at 10:51 UTC
    (...) but in reality it's grabbing the 404 message from the website.

    My question is, what's the easiest way to check for the case when the URL "www.website.com/42/1.jpg" is sending back an HTML "sorry" page, not a JPG?

    I'm just using LWP::Simple at the moment with getstore().
    No, that's not correct. LWP::Simple's getstore will not save a file if the status code is not 200 OK.

    Just check the return value of getstore(), it'll simply tell you whether the fetch has failed, or whether it saved anything:

        getstore($url, $file)
           Gets a document identified by a URL and stores it in the file. The
           return value is the HTTP response code.
    
    Just check if its return value is equal to 200.

    If the files did indeed get saved and they're HTML pages with error messages, then the webserver is misbehaving: it returned a "200 OK" status regardless. In that case, no other way of checking the webserver's status will make a difference.

    You can still snoop the saved file, for example using the command line utility file on Unixy systems — perhaps Cygwin has it ported to Windows.

    If this is for Windows, and you can't find file, but you do have ImageMagick installed, then its command line utility identify can recognize whether you do have valid JPEG files.
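
    If you'd rather stay inside Perl, a minimal sketch of the same snooping is to check for the JPEG magic bytes yourself; every JPEG file begins with the two-byte SOI marker 0xFF 0xD8 (the filename is a placeholder):

        # check that a saved file really starts with the JPEG SOI marker
        open my $fh, '<', '1-1.jpg' or die "can't read 1-1.jpg: $!";
        binmode $fh;
        read $fh, my $magic, 2;
        close $fh;
        warn "1-1.jpg does not look like a JPEG\n"
            unless defined $magic and $magic eq "\xFF\xD8";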

      A Perl implementation of the file utility is available from the Perl Power Tools project.

Re: download JPG series with error-handling
by ambs (Pilgrim) on Mar 28, 2005 at 12:38 UTC
    A completely different approach would be to download the file and then check its type using the Unix file command or the File::Type Perl module.
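
    A sketch with File::Type, assuming the file is already on disk and that I'm remembering its interface correctly (the filename is a placeholder):

        use File::Type;

        my $ft   = File::Type->new;
        my $type = $ft->checktype_filename('1-1.jpg');   # e.g. 'image/jpeg'
        unlink '1-1.jpg' unless $type eq 'image/jpeg';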

    Alberto Simões

Re: download JPG series with error-handling
by inq123 (Sexton) on Mar 28, 2005 at 15:32 UTC
    I think checking the MIME type is better. But as another monk pointed out, you might also want to check the returned file size to make sure it's not a generic image representing an error. One could minimize the possibility of accidentally deleting non-error images of the same size by storing the names of suspicious images and deleting them only once two or more of the same size turn up in the same directory.
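
    Here is what that bookkeeping might look like (the directory name is a placeholder; the assumption is that a generic error image repeats with exactly the same byte size):

        my %names_by_size;
        push @{ $names_by_size{ -s $_ } }, $_ for glob '42/*.jpg';

        for my $size (keys %names_by_size) {
            my @suspects = @{ $names_by_size{$size} };
            # two or more files of identical size in one directory are
            # probably copies of the same generic "sorry" image
            unlink @suspects if @suspects >= 2;
        }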
Re: download JPG series with error-handling
by reasonablekeith (Deacon) on Mar 29, 2005 at 08:29 UTC
    If you're just getting the response and printing it, you might want to consider using LWP::Simple's mirror function. It'll handle writing your file to disk, and return you the response code too...
    my $res_code = mirror('http://www.yadaa.com/file.jpg', './file.jpg');
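
    One wrinkle worth noting: mirror() can also return 304 Not Modified when the local copy is already up to date, so a sketch of the check might look like:

        use LWP::Simple qw(mirror is_success);
        use HTTP::Status qw(RC_NOT_MODIFIED);

        my $res_code = mirror('http://www.yadaa.com/file.jpg', './file.jpg');
        warn "mirror failed: $res_code\n"
            unless is_success($res_code) or $res_code == RC_NOT_MODIFIED;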
Re: download JPG series with error-handling
by Anonymous Monk on Mar 29, 2005 at 08:51 UTC
    I've been having a problem with my web site. Some user keeps ripping my copyrighted images very quickly, which is bringing my web server to its knees.


    Is there any kind of mod_perl handler I can create to prevent a user from ripping my sequentially named images?

    Thanks.

Re: download JPG series with error-handling
by ihb (Deacon) on Mar 29, 2005 at 10:30 UTC

    When I do stuff similar to this I usually do it in a one-liner, so I tend to want to keep it short. I use different approaches depending on the situation, but sometimes I just keep a simple log (like print getstore($url), " $url\n";) which I can then use to filter away bad files (I don't want to have to download the same files twice). If for some reason you don't care for File::Type, then a simple way I've used is the -T/-B file test operators to see whether a file is text or binary. If I want images only, I check that -B succeeds.
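
    As a rough sketch of that log-and-filter loop (the URL list and file naming scheme are made up for illustration):

        use LWP::Simple qw(getstore);

        my @urls = map { "http://www.website.com/1/$_.jpg" } 1 .. 3;

        for my $url (@urls) {
            (my $file = $url) =~ s{^.*/}{};   # hypothetical naming scheme
            my $code = getstore($url, $file);
            print "$code $url\n";             # the log, for filtering later
            unlink $file if -e $file and -T $file;   # text => probably an HTML "sorry" page
        }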

    If you have any favourite command-line website downloader, you can easily generate a page indexing all the other pages or images you want to download and let the program do the job for you.

    Hope this helps,
    ihb

    See perltoc if you don't know which perldoc to read!

Re: download JPG series with error-handling
by mhacker (Sexton) on Mar 29, 2005 at 11:51 UTC
    You might want to start checking for the existence of directories as well, depending on the maximum index you're trying. If the "42" dir doesn't exist, it doesn't make sense to try to get images 1.jpg through 1000.jpg from there either. Not sure if there is a server-independent way of checking this, though.
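
    Since plain HTTP rarely lets you test a remote directory directly, one cheap approximation is to probe the first image of each directory with a HEAD request and skip the whole directory if that fails; a sketch (the ranges and URL are placeholders, and a server that returns 200 for everything defeats this too):

        use LWP::Simple qw(head getstore);

        DIR: for my $dir (1 .. 50) {
            # head() is false on 404, so one probe can rule out a directory
            next DIR unless head("http://www.website.com/$dir/1.jpg");
            for my $n (1 .. 250) {
                getstore("http://www.website.com/$dir/$n.jpg", "$dir-$n.jpg");
            }
        }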
      Forgot to mention that a very easy way to download URLs with numbers in them is with the "curl" utility.
      curl -O http://www.website.com/images[1-4]/pic[001-250].jpg
      will download images1/pic001.jpg, images1/pic002.jpg etc up to images4/pic250.jpg. Will also handle cookies etc for you if you want it to.
