Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Find images regardless of filetype extension.

by zzspectrez (Hermit)
on Aug 01, 2005 at 05:32 UTC ( [id://479800]=CUFP: print w/replies, xml ) Need Help??

I was looking for a quick way to scan a directory and subdirectories for valid image files without relying on the file extension to identify image type.

This reads small chunk of each file and identifies images by header info.

Looking for code review/suggestions. As mentioned in threads bellow, this is work is for educational purposes.

zzSPECTREz

updates

Please read following threads to be aware of limitations and possible bugs.

  • Changed !/^\.{1,2}$/ to !/^\.{1,2}\z/ to hopefully eliminate the problem with filenames name: "..\n".

use strict; use warnings; sub read_dir { my $dir = shift; if ( opendir( DIR, $dir ) ) { my @tmp = readdir(DIR); my @files = map { "$dir/$_" } grep { !/^\.{1,2}\z/ && -f "$dir +/$_" } @tmp; my @dirs = map { "$dir/$_" } grep { !/^\.{1,2}\z/ && -d "$dir +/$_" } @tmp; closedir( DIR ); return ( \@files, \@dirs ); }else{ return; } } sub glob_files { my ($dir, $recurse) = @_; my @files; if ($dir) { my ($files, $dirs) = read_dir($dir) or return; if (@$files) { push @files, @$files; } recursion: { last unless $recurse; while (@$dirs) { my $d = shift (@$dirs); my @f = glob_files("$d", 1); push @files, @f; } } return @files; }else{ return; } } sub find_images { my @files = @_; my $file; my @images; foreach $file (@files) { open FH, $file or die "Error opening [$file]: $!\n"; my $data; my $type; next unless ( -s $file > 9 ); read(FH, $data, 10) or die "Error reading from [$file]: $!\n"; if ( $data =~ /^BM/ ) { $type = 'BMP'; }elsif ( $data =~ /^GIF8[79]a/) { $type = 'GIF'; }elsif ( $data =~ /^\xFF\xD8/ ) { $type = 'JPG'; }else { $type = undef; } push @images, ( [ $file, $type ] ) if ($type); close FH or die "Error closing file [$file]: $!\n"; } return @images if ( scalar (@images) ); return; } my (@files, @images); my $recurse = 1; my $verbose = 1; my $dir = '.'; @files = glob_files($dir, $recurse); @images = find_images(@files); if (@images){ foreach my $a (@images) { if ($verbose){ print "Found [",$a->[0],"] which appears to be an image of + type ",$a->[1],".\n"; }else{ print $a->[0],"\n"; } } } if ($verbose) { print "\n\tFound ", scalar(@files), " files."; print "\n\tFound ", scalar(@images), " images.\n"; }

Replies are listed 'Best First'.
Re: Find images regardless of filetype extension.
by merlyn (Sage) on Aug 01, 2005 at 05:38 UTC
    Let's see. Reinventing File::Find (in core) and Image::Size (in CPAN). Let's leverage instead of reinvent:
    use File::Finder; use Image::Size; my @starting_points = qw(.); my %images = File::Finder ->type('f') ->gather(sub { my($x, $y, $type) = imgsize($_); $type ? ($File::Find::name, $type) : (); }, @starting_points);

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Yes, all true... But not nearly as fun, nor nearly as educational.. Actually not using any other module was the point for me creating it. Perl is my hobby, not proffession and I find many times the best way to learn is to reinvent and usually more fun. I'm not trying to leverage anything :)

      I would appreciate any comments on code quality, if you have any. I personally do not see anything wrong with reinventing the wheel as long as you know it's not necessary. However, many times it is done poorly. Hopefully that is'nt the case here.



      zzSPECTREz

      P.S. File::Finder --> interesting wrapper! I think I will have to check that out.

        But not nearly as fun, nor nearly as educational. Actually not using any other module was the point for me creating it.
        Fine, then post it here with "I know there are ways to do this with a few modules, but one of my design goals was to not use modules". Otherwise, I waste a lot of time trying to figure out why you didn't use modules, which is a silly goal for production code.
        I would appreciate any comments on code quality, if you have any.
        Well, let's just start with the inefficiencies and errors of these two lines:
        my @files = map { "$dir/$_" } grep { !/^\.{1,2}$/ && -f "$dir/ +$_" } @tmp; my @dirs = map { "$dir/$_" } grep { !/^\.{1,2}$/ && -d "$dir/ +$_" } @tmp;
        Let's see. Twice as many stats as you need (because you can get whether it's a file or a dir with one stat). Breaks on files named "..\n" (because you'll reject that with your regex match needlessly). So there's a bug and a misfeature, just in those two lines.
        I personally do not see anything wrong with reinventing the wheel as long as you know it's not necessary.
        I do. You've posted this code here. Some crazy fool is going to cargo-cult your code without paying attention to your design goals or the following commentary. And that puts more bad Perl code in the world, not good code. {sigh}

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Find images regardless of filetype extension.
by chanio (Priest) on Aug 01, 2005 at 08:13 UTC
    Thank you both: zzspectrez & merlyn for your clear examples.

    You know, even at CPANs PMs it is not easy to understand how any module works unless you read all its code. In general, pods are poor in examples. Your code is a way of illustrating the main idea.

    In order to understand the importance of most CPANs modules I would rather try to do it first by myself (without cheating) and then compare mine with one masterpiece to know how much of experience was involved in developing them.

    I guess that that is better than taking for granted that the module is valuable. (consider for example the development of CGI.pm. It might be the module with the shortest name at all CPAN!)

    { \ ( ' v ' ) / }
    ( \ _ / ) _ _ _ _ ` ( ) ' _ _ _ _
    ( = ( ^ Y ^ ) = ( _ _ ^ ^ ^ ^
    _ _ _ _ \ _ ( m _ _ _ m ) _ _ _ _ _ _ _ _ _ ) c h i a n o , a l b e r t o
    Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established
      In order to understand the importance of most CPANs modules I would rather try to do it first by myself (without cheating) and then compare mine with one masterpiece to know how much of experience was involved in developing them.
      And I actively encourage that, as long as you keep the result in the privacy of your own cubicle, or if you post it, you label it as such.

      It's one thing to post your copy of the Mona Lisa as "here's my attempt to make a copy". It's an entirely different thing to post it as "here's my original art!" and then people go "blech", and rightfully so.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

      ... CGI.pm. It might be the module with the shortest name at all CPAN!

      There are modules with shorter names on CPAN.
      Just to name a few:
      B
      DB
      GD
      IO
      Pg
      Tk

Re: Find images regardless of filetype extension.
by blazar (Canon) on Aug 01, 2005 at 15:46 UTC
    Looking for code review/suggestions. As mentioned in threads bellow, this is work is for educational purposes.
    First of all I would like to stress that I have understood your aim and point of view. However I contend that "for educational purposes" goes in the same direction of "using no modules at all". More precisely I would say that these are mostly orthogonal concepts.

    For example I know next to nothing myself about network programming. Now it happened that I had to perform a simple task of communicating to a device (a webcam) to make it do some actions. I tried with IO::Socket::INET but had a hard time with it and lately resorted to Net::Telnet which basically made my program a one liner, with all of its intelligent defaults...

    I'd still be curious to try my hand at IO::Socket::INET on that task, for educational purposes. But the situation is quite different from that suggested by your approach.

    Don't forget that: "CPAN is my programming language of choice; the rest is just syntax." (From Perl 6 is Here Today!)

    This is especially true if you're having to do with modules in the core distribution, e.g. File::Find. Now, you may want to try your hand at coding a replacement for say File::Find or any other module you like, or a specialized version of it, adapted to the application under consideration...

    Now, on to the code...

    use strict; use warnings;
    Excellent!
    sub read_dir { my $dir = shift; if ( opendir( DIR, $dir ) ) {
    And what if opendir fails?

    My fault, as duly pointed out at Re^2: Find images regardless of filetype extension.: I hadn't noticed it was inside an if clause.

    However the use of either short circuiting logical operators or of statement modifiers is so widespread in Perl and so typical of it that IMHO a full

    if ( whatever ) { # ... } else { # ... }
    structure is rarely necessary and I doubt that it would contribute significantly to readability. In any case I would like to stress that I'm expressing here a stylistic preference of mine, certainly not an absolute truth!. (My point here being that the the  else { return } is far enough from the test not to be noticed at a glance and put in relation with the former.)
    my @tmp = readdir(DIR); my @files = map { "$dir/$_" } grep { !/^\.{1,2}$/ && -f "$dir/ +$_" } @tmp; my @dirs = map { "$dir/$_" } grep { !/^\.{1,2}$/ && -d "$dir/ +$_" } @tmp;
    Here you're doing twice as many stats as necessary. You're also doing twice as many pattern matchings. Perhaps if you really wanted to follow this approach,
    my @tmp = grep !/^\.{1,2}$/, readdir(DIR);
    may have been slightly better. Also either a pre-mapping to "$dir/$_" or a chdir into $dir may have been advantageous. Incidentally, as far as the former option is concerned, you know that to do it in a really portable way you should have used File::Spec, don't you? (This is just FYI, I use stuff like "$dir/$_" myself all the time.)

    All in all, and maintaining an approach similar to yours, I may have done:

    my (@files,@dirs); for (readdir DIR) { next if $_ eq '.' or $_ eq '..'; $_="$dir/$_"; push @files, $_ if -f; push @dirs, $_ if -d _; }
    I used string comparisons because they should be faster and they are done on every cycle.
    closedir( DIR );
    Now most file and directory opening functions support lexical handles. Well, that's not strictly true (from the technical point of view), but that's a good approximation to the truth. And with a lexical dirhanle you wouldn't need an explicit close.
    sub glob_files { my ($dir, $recurse) = @_;
    Well, I can't comment on your whole logic. But rather than having $recurse flag I would try to arrange things in other to have at some point a recursion for @dirs (this hypothetical @dirs not being yours), so that when @dirs is empty simply no recursion happens...
    if (@$files) { push @files, @$files; }
    Why not
    push @files, @$files;
    instead? (I don't know which is faster, but I suspect they execute at much the same speed, and the latter is more clear, IMHO.)
    recursion: { last unless $recurse;
    Incidentally, why did you put a label there, if you didn't use it? If is't for documenting reasons, then I get your point and I may agree with you.
    while (@$dirs) { my $d = shift (@$dirs);
    No problem with this, but why not
    for (@$dirs) {
    instead?
    my @f = glob_files("$d", 1);
    No need to quote all your variables!
    }else{ return; } }
    Ditto as above!
    foreach $file (@files) { open FH, $file or die "Error opening [$file]: $!\n";
    Ditto as above wrt using lexical filehandles. Also, some disagree, but I always recommend using the three args form of open.
    next unless ( -s $file > 9 );
    You may have done this check in your file finding routine on _ so as to avoid a third stat.
    read(FH, $data, 10) or die "Error reading from [$file]: $!\n";
    You may want to just warn and go on to the next one... just an idea!
    if ( $data =~ /^BM/ ) { $type = 'BMP'; }elsif ( $data =~ /^GIF8[79]a/) { $type = 'GIF'; }elsif ( $data =~ /^\xFF\xD8/ ) { $type = 'JPG'; }else { $type = undef; }
    Ouch! It doesn't mach my PNG images, not to say a few other tenths of popular formats, not to mention more exotic ones...

    Seriously this may be a good reason for using a dedicated module...

    return @images if ( scalar (@images) ); return;
    No need for scalar. No need for the outer parentheses. No need for the inner ones either. No need for the if modifier, and no need for the second statement:
    return @images;
    is just as fine!
    my (@files, @images); my $recurse = 1;
    Huh?!? Isn't $recurse a lexical variable in your file finding sub?
    @files = glob_files($dir, $recurse); @images = find_images(@files);
    Here you're returning a possibly huge list of files to scan them subsequently for images. I would check them on the fly, that is what I would do in the File::Find 'wanted' subroutine.

      It looks like if  if ( opendir( DIR, $dir) ) fails then the sub returns undef. The code is checking the return of the call.

      what's wrong with that?

        Indeed! My fault. That's the first thing I'm editing when I start terminating that post.

      Wow... Thanks for such an exhaustive review. I am off to work so dont have time to really comment yet. But I will be back. :)

      Thanks again for such a thought out response.

      zzSPECTREz
      may have been slightly better. Also either a pre-mapping to "$dir/$_" or a chdir into $dir may have been advantageous. Incidentally, as far as the former option is concerned, you know that to do it in a really portable way you should have used File::Spec, don't you? (This is just FYI, I use stuff like "$dir/$_" myself all the time.)

      Actually, no I know there are modules/methods that should be used to clean up paths, but it isn't something I have really done as of yet. Exactly what should I be doing?? If you dont mind explaining the whats and whys. Should I be using the canonpath or catfile methods? The manpage says that canonpath does a logical cleanup.. What exactly is that?


      All in all, and maintaining an approach similar to yours, I may have done:
      
      my (@files,@dirs);
      for (readdir DIR) {
          next if $_ eq '.' or $_ eq '..';
          $_="$dir/$_";
          push @files, $_ if -f;
          push @dirs, $_ if -d _;
      }
      
      

      Thanks for your variation. The string comparison would be a better way to go in this situation. Regexes are probably my weakest point with perl, so this is something I have been overusing trying to get used to. The performace hit is good mention. thanks.


      Now most file and directory opening functions support lexical handles. Well, that's not strictly true (from the technical point of view), but that's a good approximation to the truth. And with a lexical dirhanle you wouldn't need an explicit close.

      I am almost possitive I have seen mention here or in one of the perl books that you should close filehandles and check for error code ( oops ) just as you do with calls to open... And that it is common mistake that people only check for success on open and not the close.


      Well, I can't comment on your whole logic. But rather than having $recurse flag I would try to arrange things in other to have at some point a recursion for @dirs (this hypothetical @dirs not being yours), so that when @dirs is empty simply no recursion happens...

      The point of the flag is if you do not want recursion at all, the main $recursion flag can be set to 0 and then it will only return images in the top directory.


              if (@$files) {    
                  push @files, @$files;
              }    
      
      Why not
      
              push @files, @$files;
      
      instead? (I don't know which is faster, but I suspect they execute at much the same speed, and the latter is more clear, IMHO.)

      Actually I added that test trying to search out a bug and left it in. The error was that I made the mistake of doing an explicit return undef; instead of just return; when being called in a list context. So I was returning a list of one value "undef" and then later trying to dereference the undef value... ooops.


          foreach $file (@files) {
              open FH, $file or
                  die "Error opening $file: $!\n";
      
      Ditto as above wrt using lexical filehandles. Also, some disagree, but I always recommend using the three args form of open.

      I obviously do not understand your reference to lexical filehandles. Are you sugesting I dont check for error? HuH?? I must be having a brain fart because I am not following you.

      I have heard mention that calling the three arg method is better, but have not heard good reasoning for this.


              if ( $data =~ /^BM/ ) {
                  $type = 'BMP';
              }elsif ( $data =~ /^GIF879a/) {
                  $type = 'GIF';
              }elsif ( $data =~ /^\xFF\xD8/ ) {
                  $type = 'JPG';
              }else {
                  $type = undef;
              }
      
      
      Ouch! It doesn't mach my PNG images, not to say a few other tenths of popular formats, not to mention more exotic ones... Seriously this may be a good reason for using a dedicated module...

      Here....

      • elsif ( data =~ /^\x89PNG\x0d\x0a\x1a\x0a/) { type = 'PNG'; }
      Now it checks your pngs... Happy? :) Actually, this wasnt written just for the goal of seeing how easy it would be to identify a couple common formats. Also, if you are looking just for jpegs and gifs why search for pngs? I was not trying to identify every obscure image, even some of the more popular formats in this case.


      No need for scalar. No need for the outer parentheses. No need for the inner ones either. No need for the if modifier, and no need for the second statement:
      return @images;
      is just as fine!

      That is definetly cleaner. Perls flexibility of giving different values depending on how it is called can sometimes be confusing. So I have developed a defense mechanism that whenever any errors are encountered or I am unsure on precedence order then add paranthesis or force return mode scalar. Although not necessary, I think if not tooo overboard it isnt bad habit. What do you think given that reasoning?


      Here you're returning a possibly huge list of files to scan them subsequently for images. I would check them on the fly, that is what I would do in the File::Find 'wanted' subroutine.

      Ultimately, this is what should be done. However, it was a fun adventure and I learned a few new things... Namely the use of /z instead of $ in regexes when you dont want match before the newline. Not to do an explicit return undef; because you may get unexpected results when called in list context. As well as the many thing you have pointed out..

      Thanks.....

      zzSPECTREz
        Actually, no I know there are modules/methods that should be used to clean up paths, but it isn't something I have really done as of yet. Exactly what should I be doing?? If you dont mind explaining the whats and whys. Should I be using the canonpath or catfile methods? The manpage says that canonpath does a logical cleanup.. What exactly is that?
        I meant, to be absoultely sure about portability, instead of "$dir/$_" one {c,sh}ould use File::Spec's catfile method.
        Thanks for your variation. The string comparison would be a better way to go in this situation. Regexes are probably my weakest point with perl, so this is something I have been overusing trying to get used to. The performace hit is good mention. thanks.
        Regexen rock! However people often tends to overuse them. Including me, sometimes...
        I am almost possitive I have seen mention here or in one of the perl books that you should close filehandles and check for error code ( oops ) just as you do with calls to open... And that it is common mistake that people only check for success on open and not the close.
        Well, indeed some people recommend to check the return value of close calls too. To me, that's a bit of an exaggeration. As far as regular files are concerned, that is. (More info about this topic below.)
        Well, I can't comment on your whole logic. But rather than having $recurse flag I would try to arrange things in other to have at some point a recursion for @dirs (this hypothetical @dirs not being yours), so that when @dirs is empty simply no recursion happens...
        The point of the flag is if you do not want recursion at all, the main $recursion flag can be set to 0 and then it will only return images in the top directory.
        So far, so fine. I hadn't thought of that. As I said I've not practiced much the sport of reimplementing File::Find. If it were me I would just use the latter if I wanted recursion (and I could also control the depth of the search if needed) and perhaps a simple glob if I didn't.

        Now, glob is for some reason I ignore an oft underused and underestimated function. I frequently find myself advocating its use when I see people explicitly using opendir, readdir and grepping on filenames whereas it would take care of doing all of this for them. Granted: behind the curtain it does use a module, but I hope that doesn't overwhelmingly bother you:

        $ perl -MO=Deparse -e '<*>' use File::Glob (); glob('*'); -e syntax OK
        foreach $file (@files) { open FH, $file or die "Error opening $file: $!\n";
        Ditto as above wrt using lexical filehandles. Also, some disagree, but I always recommend using the three args form of open.
        I obviously do not understand your reference to lexical filehandles. Are you sugesting I dont check for error? HuH?? I must be having a brain fart because I am not following you.
        I meant something like this:
        for my $file (@files) { open my $fh, '<', $file or die "Error opening `$file': $!\n"; # do something with $fh # ... # no need for an explit close() }
        I have heard mention that calling the three arg method is better, but have not heard good reasoning for this.
        Well, for one thing it clearly stresses at a glance what the file is being opened for. And as such is IMHO more elegant and terse. Also, consider this oversimplified example:
        $ cat foo.pl #!/usr/bin/perl use strict; use warnings; my $file=shift; open my $fh, $file or die $!; print "The contents of `$file' are:\n", <$fh>; __END__ $ ./foo.pl aaa The contents of `aaa' are: asdfdaf sfdfdd sffgsdd $ ./foo.pl '|echo "Gotcha!">foo.pl' The contents of `|echo "Gotcha!">foo.pl' are: $ cat foo.pl Gotcha!
        This is a security hole that has actually been exploited e.g. in poorly written CGI scripts. Not to say that people could not write secure programs with the two args form of open nor that the three args one is bullet proof. Indeed in any situation where it may matter, one must validate his imput to protect from malicious users, but the latter provided a first good protection for a small expense.
        Ouch! It doesn't mach my PNG images, not to say a few other tenths of popular formats, not to mention more exotic ones... Seriously this may be a good reason for using a dedicated module...
        Here....
        elsif ( data =~ /^\x89PNG\x0d\x0a\x1a\x0a/) { type = 'PNG'; }
        Now it checks your pngs... Happy? :)
        It doesn't check my TIFF and PNM images!! :-)
        That is definetly cleaner. Perls flexibility of giving different values depending on how it is called can sometimes be confusing. So I have developed a defense mechanism that whenever any errors are encountered or I am unsure on precedence order then add paranthesis or force return mode scalar. Although not necessary, I think if not tooo overboard it isnt bad habit. What do you think given that reasoning?
        I think that a concise code is generally more clear to read and understand. Well written perl code tends to be concise. Of course I'm not talking about extreme conciseness like the one people tries to achieve e.g. when golfing, as that brings into obfuscation instead. I'm talking about the Right(TM) amount conciseness...

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://479800]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2024-03-29 14:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found