
compare images

by Anonymous Monk
on Mar 11, 2006 at 06:05 UTC (#535885, perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is there a Perl way to compare images with each other to see which images are identical?

I have a lot of user uploaded images on my server and a number of them are identical. I'm looking for a way to scan the directory and report back which images are identical.

I searched here but didn't find anything, and I wouldn't know what to search for on CPAN.

Replies are listed 'Best First'.
Re: compare images
by Fletch (Chancellor) on Mar 11, 2006 at 06:52 UTC

    Depends on what you mean by "identical images". If you want to check for copies of the same file, use something like Digest::MD5 or Digest::SHA1 and compare the generated digests (same digest, almost certainly identical file contents; different digest, different file contents).

    If you mean "looks the same but the file's different" (e.g. a PNG that's been converted to a JPG of identical size), that'd be much more difficult (and I can't even think of where to tell you to start looking, although it'd probably involve some sort of deep Computer Science academic literature).

      In response to the "looks the same but the file's different" part, maybe something like what Simon Cozens hacked up: Image::Seek. I haven't tried it on my photos personally, but it seemed to work pretty well on his photos.
Re: compare images
by atcroft (Abbot) on Mar 11, 2006 at 06:55 UTC

    If you know the images have not been modified after being uploaded, then you could compare checksums (MD5 or SHA) for each file. An example might progress like:

    use strict;
    use warnings;
    use Digest::MD5;

    my %files;    # hex digest => list of files with that digest
    my $md5 = Digest::MD5->new;

    foreach my $dir (@ARGV) {
        next unless -d $dir;
        foreach my $f ( glob "$dir/*" ) {
            next if -d $f;
            # Lexical filehandle and three-argument open, read as raw bytes
            open( my $fh, '<', $f ) or die qq{Can't open $f: $!\n};
            binmode($fh);
            $md5->addfile($fh);
            close($fh);
            push( @{ $files{ $md5->hexdigest } }, $f );
            $md5->reset;
        }
    }

    print qq{The following entries appear to be duplicates, };
    print qq{and warrant closer examination:\n};
    foreach my $k ( keys %files ) {
        # Only digests shared by more than one file are of interest
        if ( scalar @{ $files{$k} } > 1 ) {
            print qq{\t}, q{Checksum: }, $k, qq{\n};
            print qq{\t\t}, join( qq{\n\t\t}, @{ $files{$k} } ), qq{\n};
        }
    }

    This will give you output that looks like the following (actual file names changed to protect the clueless):

    $ ./compare-test.pl .
    The following entries appear to be duplicates, and warrant closer examination:
            Checksum: d41d8cd98f00b204e9800998ecf8427e
                    ./file-01.txt
                    ./file-1.txt
                    ./file01.txt
                    ./file1.txt
            Checksum: 520bd68306a6bd9aa586a80ee692c750
                    ./file-2.txt
                    ./file2.txt
            Checksum: 024cde173d464feee746320b300b5c35
                    ./file-04.txt
                    ./file04.txt
            Checksum: 3ef60d5a2fa552f395e154bb5418893c
                    ./file-5.txt
                    ./file5.txt

    Hope that helps.

Re: compare images
by BrowserUk (Pope) on Mar 11, 2006 at 07:02 UTC

    This detects identical files in the current directory and, for each set of duplicates, lists every copy after the first as a candidate for deletion.

    Of course, it won't detect the same image in two different formats (.jpg -v- .png), or the same format at different levels of compression, or 8-bit -v- 24-bit color etc.

    wrapped for posting...quoted for win32

    perl -MDigest::MD5=md5 -000le" print for grep{ ++$h{ md5( do{ local @ARGV = $_;<> } ) } > 1 } + map{ glob} @ARGV" *

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: compare images
by superfrink (Curate) on Mar 11, 2006 at 07:14 UTC
    Hash functions like MD5 can have collisions. This means two different files can have the same MD5 checksum.

    I would calculate the MD5 when the file is uploaded and store the MD5 sum in a database. The database could be used to find files with the same checksum. The important part is that, before removing one of the files with a duplicate MD5 checksum, I would compare the contents of the files that have matching MD5 checksums.

    In the unix world you could use the "cmp" command line program. I would probably do the file comparison in Perl so the script did not have to fork child processes to run "cmp" and so the script will run on systems without "cmp".

    Have a look at How can I compare (the content of) two files?.
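The approach described above (record a digest at upload time, then confirm any digest match with a real file comparison before treating it as a duplicate) could be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory hash %$index stands in for the database, and the names md5_of_file and register_upload are hypothetical. The comparison uses the core File::Compare module instead of forking "cmp".

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Compare qw(compare);

# Return the hex MD5 digest of a file, read in binary mode.
sub md5_of_file {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!\n";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# %$index is a hypothetical stand-in for the database table:
# hex digest => array ref of stored paths with that digest.
# Returns the path of an existing byte-identical file, or undef
# (in which case $path is recorded in the index).
sub register_upload {
    my ( $index, $path ) = @_;
    my $digest = md5_of_file($path);
    for my $known ( @{ $index->{$digest} || [] } ) {
        # compare() returns 0 if the contents are equal, 1 if they differ, -1 on error
        my $rc = compare( $path, $known );
        die "Can't compare $path and $known: $!\n" if $rc < 0;
        return $known if $rc == 0;
    }
    push @{ $index->{$digest} }, $path;
    return;
}
```

Calling register_upload(\%index, $path) for each new upload returns the earlier identical file, if any, so the caller can decide what to do with the new copy. Because the byte-for-byte check only runs on files whose digests already match, it adds little work in practice.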
      Hash functions like MD5 can have collisions. This means two different files can have the same MD5 checksum

      You're right, they can. Indeed, I swear I actually saw this once when generating md5s from 1,000,000 web pages, but I never succeeded in reproducing it. People who understand statistics tell me that the likelihood is extremely low. Like so low (unless you deliberately set out to achieve it) that hell is likely to freeze over first--or something like that :)
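To put a rough number on "extremely low": the usual birthday-bound approximation says that for n inputs and a b-bit digest, the chance of at least one accidental collision is about n(n-1)/2 divided by 2^b. The sketch below assumes digests behave like uniformly random values, so it says nothing about deliberately constructed collisions; the function name collision_probability is made up for this example.

```perl
use strict;
use warnings;

# Birthday-bound approximation: with $n inputs and a $bits-bit digest,
# P(at least one collision) is roughly n*(n-1)/2 divided by 2**bits.
# Only meaningful while the result is much smaller than 1.
sub collision_probability {
    my ( $n, $bits ) = @_;
    return $n * ( $n - 1 ) / 2 / 2**$bits;
}

# 1,000,000 files hashed with a 128-bit digest such as MD5
printf "%.2e\n", collision_probability( 1_000_000, 128 );
```

For a million files and a 128-bit digest this comes out at roughly 1.5 x 10^-27, which is why an accidental MD5 collision among real files would be so surprising.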

      If you ever actually encounter two real files with the same md5s, and they are not proprietary or private, could you let me have a copy of each? I have some analysis code I would like to run on them.


        It is supposed to be very rare. I don't know how rare. I don't think I have ever seen it happen by accident. At least I have never noticed. Even though it is rare I try to avoid programming with the view "that probably won't ever cause problems".

        I was curious so I googled and found two postscript files with the same MD5 hash. To be fair someone did set out to generate two files with the same MD5 checksum.
        Some people actually have developed methods to create MD5 collisions. See http://www.cits.rub.de/MD5Collisions/, where you'll find two very different PostScript files sharing the same MD5.
        There were interesting discussions about this on Bruce Schneier's blog and in his Crypto-Gram newsletter; see http://www.schneier.com/
Re: compare images
by zentara (Archbishop) on Mar 11, 2006 at 12:04 UTC
    Hash functions like MD5 can have collisions.

    That's no excuse; just choose a better hashing algorithm like SHA-512. It's really a compromise between speed and security. md5sum will be faster, and some low-bit SHA has been reported to have collisions, but high-bit SHA, for example running "shasum -a 512 $file", will not give you collisions in any real setting.

    In the interest of speed, you can use the lowest hash, a 16-bit checksum, like "sum $file", which is very fast. Then only if the sums match, do a sha-512 or md5sum.
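A rough Perl-only sketch of that two-stage idea follows. It assumes the cheap first pass can be the checksum idiom that perlfunc's unpack entry gives for a System V style "sum", and it uses Digest::SHA for the second pass; the function names (slurp_bytes, cheap_sum, sha512_of_file, duplicate_groups) are invented for this example.

```perl
use strict;
use warnings;
use Digest::SHA;

# Read a whole file as raw bytes.
sub slurp_bytes {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!\n";
    binmode($fh);
    local $/;
    my $data = <$fh>;
    close($fh);
    return defined $data ? $data : '';
}

# Cheap first pass: sum of all bytes, folded as in perlfunc's unpack example.
sub cheap_sum {
    my ($path) = @_;
    return unpack( '%32C*', slurp_bytes($path) ) % 65535;
}

# Strong second pass: SHA-512 of the file, read in binary mode.
sub sha512_of_file {
    my ($path) = @_;
    return Digest::SHA->new(512)->addfile( $path, 'b' )->hexdigest;
}

# Returns a list of array refs, each holding files that share both digests.
sub duplicate_groups {
    my (@paths) = @_;
    my %by_sum;
    push @{ $by_sum{ cheap_sum($_) } }, $_ for @paths;

    my @groups;
    for my $candidates ( grep { @$_ > 1 } values %by_sum ) {
        # Only files whose cheap sums collide get the expensive hash
        my %by_sha;
        push @{ $by_sha{ sha512_of_file($_) } }, $_ for @$candidates;
        push @groups, grep { @$_ > 1 } values %by_sha;
    }
    return @groups;
}
```

Note that the cheap pass still reads every file in full, so the saving is in CPU time rather than I/O. Grouping by file size (-s) first would be cheaper still, though that is a different first-pass test from the one suggested above.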

    To your original question of comparing images, the difference method documented in Imager::Filters (part of Imager) can detect differences between two images, which you can probably use for deeper comparison.

    $out = $img->difference(other=>$other_img);

    I'm not really a human, but I play one on earth. flash japh
