Comparing images

by Anonymous Monk
on Nov 27, 2006 at 02:22 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

There once was a post here on how to compare images to possibly detect if some images were IDENTICAL to another image to prevent users from uploading an image that is already on file. I don't remember the title of it and I remember someone saying it's not 100% accurate but it's more than accurate enough for a message board.

My question is, is there a way to compare just, well, the CODE of images to each other? Or like the image comparison stuff?

It's hard to explain but to compare a new uploaded image to ALL uploaded images (a few thousand) isn't exactly possible so I was wondering

1) How do we compare images to each other?
2) Is it like a hex code or such where I can save the string of chars to a database and just compare that way instead of re-analyzing each image?

Thanks for your help!

Replies are listed 'Best First'.
Re: Comparing images
by SheridanCat (Pilgrim) on Nov 27, 2006 at 02:32 UTC
    You could retrieve and store the MD5 checksum for each image. Check out the Digest::MD5 module for more on calculating the value. There are caveats to doing it this way, such as getting false positive matches, but it may at least point you in a useful direction.

      For a ready made solution of mine along these lines, see Re^3: Identical Files to Symbolic Links. It is not focused on images, but on files -like any solution of this kind- but that's what the OP seems to want anyway. BTW: I still plan on rewriting it, but "ASAP" has not come yet. And it's still serving me right for the moment. Go figure!

      I might be more concerned about false negative matches, but it depends upon what the OP meant by comparing images. You could upload a jpeg of a dog that gets an MD5sum of 00000 (or whatever). I could upload a png of that same photo and that gets an MD5sum of ABDEF. Someone else could come along and upload another jpeg at a slightly different compression setting and get an MD5sum of A188F.

      They're all different files, but they all represent the same image. Compensating for that would require a more advanced technique, like the one in the article linked later in this thread, that compares how the image actually looks, not merely how it's stored.

      Of course, if you just want to make sure the same file isn't uploaded twice, the md5sum should do just fine.
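
A minimal sketch of the checksum approach discussed above, using the core Digest::MD5 module. The helper names `file_md5` and `write_temp` and the sample byte strings are made up for illustration; the "images" are just stand-in bytes, not real image files.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Hypothetical helper: return the hex MD5 digest of a file's raw bytes.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # images are binary data
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Hypothetical helper for the demo: write bytes to a temporary file.
sub write_temp {
    my ($bytes) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $bytes;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

my $first   = write_temp("\x89PNG fake image data");
my $copy    = write_temp("\x89PNG fake image data");
my $altered = write_temp("\x89PNG fake image datA");

print "copy matches\n"    if file_md5($first) eq file_md5($copy);
print "altered differs\n" if file_md5($first) ne file_md5($altered);
```

The digest string is what would be stored alongside each image, so a new upload needs only one digest calculation and a string comparison against the stored values.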

Re: Comparing images
by GrandFather (Sage) on Nov 27, 2006 at 02:59 UTC
Re: Comparing images
by arkturuz (Curate) on Nov 27, 2006 at 09:51 UTC
    A few years ago I read an article by merlyn on comparing images. Unfortunately, the site seems to be down right now, but I found the link on Google (cached page): Finding Similar Images, so check it out later. It might be of help; as I remember, it was an interesting article.
Re: Comparing images
by BrowserUk (Pope) on Nov 27, 2006 at 15:16 UTC

    Once you've checked the size of the file and found it to be the same as an existing file, if you read a 32-bit word from the middle of both files and compare them, it will eliminate the need to run a full md5 checksum in 99.999999977% of cases.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That's true for any sort of compressed image. But if some of the files are raw bitmap, then filesize is just a function of resolution, and I don't think it's unlikely to find two files with a black or white pixel in the same place.

        This is a statistical tactic, so the file format is not a factor, assuming binary files where all 256 byte values are possible.

        If you pick a random offset within two files being compared, and compare the byte values at that offset, then the odds that they will have the same value is:

        1 - 256! / (256-2)! / 256^2 = 1/256 = 0.00390625 or about 0.4%

        And if you pick two random offsets and compare the bytes drawn from both files at those offsets, then the odds of both pairs matching (if the files are different) are the above value squared:

        0.00390625^2 = 0.0000152587890625 or 0.0015%

        Now, seeking to a random offset is much more expensive than reading 2 bytes instead of one once you are there. So, what are the odds of a word (2 sequential bytes), read at the same (random) offset from two files being the same?

        1 - 65536! / 65534! / 65536^2 = 1/65536 = 0.0000152587890625 or about 0.0015%

        I.e., the same as the two random offset case above. And by extension, selecting two words at 2 offsets gives us a probability of:

        0.0000000002328 or 0.000000023%

        Again, if the 2 words are read as a single dword at a single offset, then the odds remain the same.

        In other words, the odds of two non-identical files containing the same 32-bit value at the same offset are statistically vanishingly small.
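
The arithmetic above can be checked with a few lines of Perl. This is only a sketch under the same assumption the post makes, namely that every byte value is equally likely and independent of its neighbours; real image data will not be that uniform. The function name `match_probability` is made up for illustration.

```perl
use strict;
use warnings;

# Chance that two independent, uniformly random values of $bits bits are equal.
sub match_probability {
    my ($bits) = @_;
    return 1 / 2**$bits;
}

# One byte, one 16-bit word, one 32-bit dword.
for my $bits (8, 16, 32) {
    printf "%2d bits: %.10g\n", $bits, match_probability($bits);
}

# Two bytes at two offsets give the same odds as one word at one offset.
print "two bytes == one word\n"
    if match_probability(8)**2 == match_probability(16);
```

Because all the values involved are powers of two, the floating-point comparisons here are exact.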

        So, when comparing files, if the sizes are the same, there is still no need to read the whole file and perform a checksum or hashing algorithm on them quite yet. By storing a random offset/32-bit value pairing, along with a file's size and checksum/hash, the occasions on which it will be necessary to actually compute the checksum/hash are reduced almost to nil.

        The choice of random offset deserves some thought.

        Many filetypes have headers which contain control information which is either

        • static (as in the 'MZ' in the first two bytes of a dos/win executable; the 'GIF8' bytes in .gif files; the 'JFIF' at offset 6 of jpeg files; etc.).
        • or frequently the same in files of the same type containing different data. For example, the 32-bit fields at offset 16 & 20 (width & height respectively) of a .png (and similar fields in other image formats) will contain the same values for images of the same width/height regardless of the content of the images.

        There are many other similarly non-diagnostic fields which should be avoided. A simple strategy for avoiding these (in most cases) is to use an offset derived from the filesize (which also reduces the volume of data that needs to be accumulated/stored). E.g., reading the 32-bit value stored at the halfway point (suitably rounded down to the nearest 4-byte boundary) will avoid most headers in most file formats.

        Although a match on this simple (and fast) test does not guarantee that two files are duplicates, the final sanctions are still the calculation of a full checksum or hash from the entire file, or even a full byte-wise comparison, so the risk is negligible. But the tactic serves to eliminate those final, relatively expensive strategies in all but a minuscule number of cases.
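
The size-plus-midpoint test described above might be sketched as follows. The names `quick_fingerprint`, `needs_full_check` and `write_temp` are hypothetical, as is the choice to read the dword big-endian and to send files shorter than 4 bytes straight to the full check; the temporary files hold stand-in bytes rather than real images.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return (size, 32-bit value at the file's midpoint),
# with the offset rounded down to a 4-byte boundary.
sub quick_fingerprint {
    my ($path) = @_;
    my $size = (stat $path)[7];
    die "Cannot stat $path: $!" unless defined $size;
    return ($size, undef) if $size < 4;    # too small to sample a dword

    my $offset = int($size / 2);
    $offset -= $offset % 4;                # align to a 4-byte boundary

    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    seek $fh, $offset, 0 or die "Cannot seek in $path: $!";
    my $got = read $fh, my $buf, 4;
    die "Short read from $path" unless defined $got && $got == 4;
    close $fh;
    return ($size, unpack 'N', $buf);      # big-endian unsigned 32-bit
}

# True if the files might be identical, so a full checksum is still needed.
sub needs_full_check {
    my ($path_a, $path_b) = @_;
    my ($size_a, $word_a) = quick_fingerprint($path_a);
    my ($size_b, $word_b) = quick_fingerprint($path_b);
    return 0 if $size_a != $size_b;
    return 1 if !defined $word_a;          # tiny files: go straight to the full check
    return $word_a == $word_b ? 1 : 0;
}

# Hypothetical helper for the demo: write bytes to a temporary file.
sub write_temp {
    my ($bytes) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $bytes;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

my $original = write_temp('A' x 16);
my $changed  = write_temp(('A' x 8) . 'BBBB' . ('A' x 4));
print "full checksum skipped\n" unless needs_full_check($original, $changed);
```

In a stored-index setting the size and the sampled word would be saved with each file, so only the new upload has to be read at comparison time.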


      Read one pixel out of an arbitrarily large file?

      Nine nines and two sevens? That's 23 out of a hundred billion!

        Yup. Did you read this? Remember, if the two values are the same, you then go on and run a full md5, so there is no risk. But if they are different you saved yourself the bother/expense of doing it. And statistically, that should be the case in a very large percentage of cases.

        The reality of any given set of data probably won't reach that theoretical maximum. For a start, many if not most images don't use the alpha byte, so the range of values is reduced to

        ( ( 2**24 * (2**24 - 1) ) / 2**48 ) == 0.999_999_940_40

        but that's still pretty good odds for the effort of reading 4 bytes and comparing two integers.

        Of course, if the two pictures being compared are

        1. A 640x480x24-bit color image of a black cat in a coal cellar inside the Arctic Circle in winter, with no flash.
        2. A 640x480x24-bit color image closeup of a black hole.

        Then this quick and simple test may not discriminate between them. But then, will the viewers? :)

        More seriously, it's possible that the camera that took the images has a bad cell in the CCD that means that one pixel in the same place on every image is always black (or white or red), and this test would fail to distinguish them if it happens to check that exact pixel. But if the dword compared is (semi-)randomly chosen, that would be pretty unlucky.

        Even if you mount the camera on a tripod and use a remote trigger to avoid micro-seismic disturbances, and take two frames one after the other, with 16 million colors to choose from for each pixel, even the slightest variation in the light, or focus, or even the battery charge is likely to cause variations in the pixel colors at identical positions in identical shots by the same camera. In 8-bit color/grey scale images, the variation will be less, but then you are comparing 4 pixels not one. In a strictly B&W image, you would be comparing 32 adjacent pixels.

        But when the statistically improbable happens and you get a false positive, that false positive would be caught by the full md5 anyway. The point is to save time by avoiding that full md5 (or similar) where possible, not to ditch it altogether.

Re: Comparing images
by ambrus (Abbot) on Nov 27, 2006 at 21:47 UTC

    I think it's usually enough to compare the images byte-to-byte. You can do this by storing the md5 sums of all the images somewhere, and when someone wants to upload an image, calculate its md5 sum, compare to all existing ones, and if it's new, add the image and store the md5 sum.

    You can calculate md5 sums either with the Digest::MD5 module (which is now distributed with perl) or with the md5sum program found on most unices, or with the GnuPG program.
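
The store-and-compare scheme described above could look roughly like this. It is only a sketch: the `%seen` hash stands in for whatever persistent store (for example a database column) a real site would use, and the names `register_upload`, `dog.jpg` and so on are invented for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;    # digest => name of the first upload with those bytes

# Hypothetical upload hook: returns the name of the existing duplicate,
# or undef after recording the new image.
sub register_upload {
    my ($name, $bytes) = @_;
    my $digest = md5_hex($bytes);
    return $seen{$digest} if exists $seen{$digest};
    $seen{$digest} = $name;
    return;
}

# Stand-in byte strings, not real image data.
my @uploads = (
    [ 'dog.jpg',      "fake jpeg bytes of a dog" ],
    [ 'cat.jpg',      "fake jpeg bytes of a cat" ],
    [ 'dog-copy.jpg', "fake jpeg bytes of a dog" ],
);

for my $upload (@uploads) {
    my ($name, $bytes) = @$upload;
    my $duplicate_of = register_upload($name, $bytes);
    if (defined $duplicate_of) {
        print "$name rejected: same bytes as $duplicate_of\n";
    }
    else {
        print "$name stored\n";
    }
}
```

Looking the digest up in a hash (or an indexed column) avoids comparing the new image against each of the few thousand existing ones in turn.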

    If you really want to compare the images pixel-by-pixel, that is, treat different formats of the same image as equal, then you have to recompress all the images to a certain format of your choice (also removing all metadata). I don't think this is really necessary.

Re: Comparing images
by gam3 (Curate) on Nov 28, 2006 at 01:27 UTC
    I am currently doing this here by simply letting mysql compare the first 256 bytes of each image. While images could be created that would cause a false positive with this test it is not a problem in this application. I am simply trying to keep the exact same file from being uploaded more than once.
    CREATE TABLE `image` (
      `id` mediumint(9) NOT NULL auto_increment,
      `type` varchar(20) default 'jpeg',
      `mime_type` varchar(20) default 'image/jpeg',
      `image` mediumblob,
      `size` mediumint(9) default NULL,
      `width` int(11) default NULL,
      `height` int(11) default NULL,
      `user_id` mediumint(9) default NULL,
      `created` datetime NOT NULL default '0000-00-00 00:00:00',
      `copyright_id` mediumint(9) default NULL,
      `comments` text,
      `hidden` enum('true','false') default 'false',
      PRIMARY KEY (`id`),
      UNIQUE KEY `uk_image_image` (`image`(256))
    )
    -- gam3
    A picture is worth a thousand words, but takes 200K.
Re: Comparing images
by Anonymous Monk on Nov 27, 2006 at 15:49 UTC
    When attached to an email, a file of any type (so-called MIME type) is encoded into a 7-bit ASCII format. In this format the files may be easily compared.
