
Comparing images to find similar images in a database

by walkingthecow (Friar)
on Dec 03, 2014 at 10:51 UTC ( #1109097=perlquestion )

walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

G'evening Monks!

Alright, so my title is a bit misleading, my apologies. I've been reading the article here on comparing images for similarities, deconstructing it, etc. Now, this script actually works pretty well for comparing images on my local filesystem. However, I'd like to grab the data from an image (whatever that data may be, a hash, etc), and find similar images in a database.

Now, from reading the article on Stonehenge, I would need to store the 48 vectors for each image. Then, when I want to compare an image to images stored in the database, I would need to loop over the 48 vectors of the image I am comparing, against the 48 vectors for each image in my database (i.e., $val += abs($image_to_compare[$i] - $image_from_db[$i]) for each $i in 1 .. 48). Obviously that is not an efficient way of doing this. I thought about just summing all the values for the vectors for each image, and storing that value in the database; however, the issue is that it's not the sum total of vectors from image A minus the sum total of vectors from image B. Rather, it's the sum of the absolute differences of each pair of corresponding vectors in the two images.
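
To make that distinction concrete, here is a minimal sketch of the comparison, assuming each image's values are held in a plain Perl array. The name l1_distance and the 4-element sample vectors are made up for illustration (the article uses 48 elements).

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum of absolute element-wise differences between two equal-length
# vectors, passed as array references. (Hypothetical helper name.)
sub l1_distance {
    my ($x, $y) = @_;
    die "vectors differ in length\n" unless @$x == @$y;
    return sum(map { abs($x->[$_] - $y->[$_]) } 0 .. $#$x) // 0;
}

# Two made-up vectors holding the same values in a different order.
my @query  = (10, 200, 30, 40);
my @stored = (40, 30, 200, 10);

my $distance = l1_distance(\@query, \@stored);    # 30 + 170 + 170 + 30
my $sum_diff = abs(sum(@query) - sum(@stored));   # |280 - 280|

print "element-wise distance: $distance, difference of sums: $sum_diff\n";
```

The two vectors have identical sums yet are 400 apart element by element, which is why a stored sum alone cannot replace the per-element comparison.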

Wow, so a long road to my question: What is the best way, using Perl, to take an image and find similar images in a database, given a threshold? I'd like to store some sort of value in the database, calculate the value from the image to be compared, and then search for any value in the database that matches within a certain threshold.

P.S., Sorry if I was incoherent. My eyes are trying to close on me as I type this. Must get sleep! :)

Replies are listed 'Best First'.
Re: Comparing images to find similar images in a database
by Corion (Pope) on Dec 03, 2014 at 11:15 UTC

    Also, this article outlines a nice idea of finding "similar" images "simply" by comparing the colour histograms of five areas of the image. This doesn't necessarily find images with the same motifs, but it finds images with the same colour composition ("blue sky", "yellow sand"). The linked code is mostly in Python, but I think the general concepts translate to Perl too, potentially using PDL.
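
    As a rough illustration of the histogram idea in plain Perl (not the linked article's code), the sketch below builds a quantised colour histogram for one region and compares two histograms. The pixel format (array refs of [r, g, b]), the 4 levels per channel, the distance measure and both function names are assumptions; a real version would read pixels from an image library and repeat this for each of the five areas.

```perl
use strict;
use warnings;

# Normalised colour histogram for one image region.
# $pixels is an array ref of [r, g, b] triples (0..255 per channel);
# each channel is quantised into $bins levels, giving $bins**3 buckets.
sub colour_histogram {
    my ($pixels, $bins) = @_;
    $bins ||= 4;
    my @hist = (0) x $bins**3;
    for my $pixel (@$pixels) {
        my ($red, $green, $blue) = map { int($_ * $bins / 256) } @$pixel;
        $hist[ ($red * $bins + $green) * $bins + $blue ]++;
    }
    my $count = @$pixels || 1;
    return [ map { $_ / $count } @hist ];
}

# Sum of absolute bucket differences: 0 for identical histograms,
# 2 for normalised histograms that share no bucket.
sub histogram_distance {
    my ($h1, $h2) = @_;
    my $distance = 0;
    $distance += abs($h1->[$_] - $h2->[$_]) for 0 .. $#$h1;
    return $distance;
}

# Tiny made-up regions: two "blue sky" patches and one "yellow sand" patch.
my @sky_a = ([30, 80, 220],  [35, 90, 230],  [240, 240, 250]);
my @sky_b = ([28, 85, 210],  [40, 95, 225],  [235, 245, 255]);
my @sand  = ([230, 200, 90], [220, 190, 80], [240, 210, 100]);

my $sky_vs_sky  = histogram_distance(colour_histogram(\@sky_a), colour_histogram(\@sky_b));
my $sky_vs_sand = histogram_distance(colour_histogram(\@sky_a), colour_histogram(\@sand));

print "sky vs sky: $sky_vs_sky, sky vs sand: $sky_vs_sand\n";
```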

Re: Comparing images to find similar images in a database
by james28909 (Deacon) on Dec 03, 2014 at 10:58 UTC
    I am not absolutely sure, but Image::Seek looks promising, and it looks pretty straightforward to use as well.
      > but Image::Seek looks promising,

      yep, indeed.

      The first thing which came to mind after reading the title is wavelets, and Image::Seek is such an implementation.

      Cheers Rolf

      (addicted to the Perl Programming Language and ☆☆☆☆ :)

      From what I can tell, it almost looks as if the Stonehenge post and Image::Seek are using the same algorithm. I'm looking specifically at haar.cpp from Image::Seek. Anyway, Image::Seek is a really good starting point. I think what I am going to do is make some minor modifications to the source to fit my needs. I will post my results once finished.

      Thanks for pointing me in the right direction!
        > it almost looks as if the Stonehenge post and Image::Seek are using the same algorithm

        I don't think so... wavelets are far more sophisticated and reliable.

        But it seems like the results of Merlyn's algorithm were already sufficient for you and you only need a good lookup strategy for your SQL-query.

        For this approach you are calculating whether the sum of the distances of the 48 vectors is below a threshold. (BTW: could you please fix your OP by adding code-tags?)

        successive narrowing

        So start by only looking up the set of images where the first vector's distance is below the threshold.

        Then you need to look up the images where the 2nd distance < threshold - 1st distance, and so on.
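
        A minimal in-memory sketch of that narrowing, assuming the vectors are already loaded into a hash of array refs keyed by image id. The function name, the sample data and the choice of "<=" rather than "<" for the cut-off are assumptions; in a database each pass would be a query over the surviving ids.

```perl
use strict;
use warnings;

# Keep only the images whose running distance stays within $threshold,
# looking at one vector position at a time. (Hypothetical helper.)
sub narrow_candidates {
    my ($query, $images, $threshold) = @_;

    # Every candidate starts with the full threshold as its budget.
    my @candidates = map { +{ id => $_, left => $threshold } } sort keys %$images;

    for my $i (0 .. $#$query) {
        my @kept;
        for my $candidate (@candidates) {
            # Spend part of the budget on this position's distance.
            $candidate->{left} -= abs($query->[$i] - $images->{ $candidate->{id} }[$i]);
            push @kept, $candidate if $candidate->{left} >= 0;
        }
        @candidates = @kept;
        last unless @candidates;    # nothing left to narrow
    }
    return map { $_->{id} } @candidates;
}

# Made-up 3-element vectors for illustration.
my %images = (
    a => [12, 25, 31],    # total distance 8: survives
    b => [30, 20, 30],    # dropped at the first position
    c => [10, 28, 38],    # dropped at the third position
);
my @similar = narrow_candidates([10, 20, 30], \%images, 15);
print "similar: @similar\n";
```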

        grid approach

        Grouping and indexing the vectors in a grid where the cells have the widths of the threshold should speed up the look up considerably.

        Then two vectors can only have a smaller distance if they are in the same or neighboring cells!

        Prefiltering over the 48 vectors like this would narrow the set of possible results considerably, to allow a more detailed comparison.

        I don't know the inner optimization of your DB server sufficiently well to formulate an optimal SQL query for this, but this would be off topic anyway and I hope you get the idea now. :)
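
        One way the grid could look in plain Perl, as a sketch only: cells are as wide as the threshold, only the first two vector elements are gridded, and the index is a hash keyed by cell coordinates. All names, the threshold of 16 and the sample vectors are assumptions. In a database the cell numbers would be stored in indexed columns instead.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $THRESHOLD = 16;    # assumed cell width, equal to the distance threshold
my $DIMS      = 2;     # assumed: only the first two elements are gridded

# Cell coordinates of a vector along the gridded dimensions.
sub cell_of {
    my ($vector) = @_;
    return map { floor($vector->[$_] / $THRESHOLD) } 0 .. $DIMS - 1;
}

# Index image ids by the cell their vector falls into.
sub build_grid {
    my ($images) = @_;
    my %grid;
    for my $id (sort keys %$images) {
        push @{ $grid{ join ',', cell_of($images->{$id}) } }, $id;
    }
    return \%grid;
}

# Ids found in the query's own cell and in all neighbouring cells.
sub grid_candidates {
    my ($grid, $query) = @_;
    my @cells = ([]);
    for my $coord (cell_of($query)) {
        # Extend every partial cell address by coord-1, coord, coord+1.
        @cells = map { my $prefix = $_; map { [@$prefix, $coord + $_] } -1, 0, 1 } @cells;
    }
    my @ids = map { @{ $grid->{ join ',', @$_ } // [] } } @cells;
    return sort @ids;
}

# Made-up vectors for illustration.
my %images = (
    near => [35, 45, 52],     # same or adjacent cell: candidate
    edge => [15, 60, 50],     # adjacent cell but actually too far: still a candidate
    far  => [200, 40, 50],    # distant cell: skipped without any arithmetic
);
my @found = grid_candidates(build_grid(\%images), [30, 40, 50]);
print "candidates: @found\n";
```

        Note that "edge" is returned although it is more than 16 away in its second element; the grid only prefilters, so the survivors still need the detailed comparison.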

        Cheers Rolf

        (addicted to the Perl Programming Language and ☆☆☆☆ :)

Re: Comparing images to find similar images in a database
by choroba (Archbishop) on Dec 03, 2014 at 10:58 UTC
Re: Comparing images to find similar images in a database
by jonadab (Parson) on Dec 03, 2014 at 11:11 UTC

    If you only have 48 images to compare against, or 480 for that matter, you could indeed just loop through them all, calculating a total-difference statistic, sort them by that, and take the one with the lowest total difference.

    The problem arises when you have more like 48 million images in your database, at which point looping through all of them becomes prohibitive.

    The solution I can think of is going to be more SQL than Perl, though you could use Perl to build the SQL you want. The idea would be to develop a query that looks for images with some of their statistics being very similar to the ones for the current image. Since you can set up the database to pre-emptively index these fields, you can thus avoid the need to do all the difference calculations for every image in the database every time.
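
    A sketch of using Perl to build such a query. The table name images, the columns v1, v2, ... and the function name are hypothetical; the code only produces the SQL text and bind values, which could then be handed to DBI (for example via selectcol_arrayref).

```perl
use strict;
use warnings;

# Build a prefilter on the first $columns vector elements: each stored
# value must lie within +/- $tolerance of the query image's value.
sub build_prefilter_query {
    my ($vector, $tolerance, $columns) = @_;
    my (@where, @bind);
    for my $i (0 .. $columns - 1) {
        push @where, sprintf('v%d BETWEEN ? AND ?', $i + 1);
        push @bind, $vector->[$i] - $tolerance, $vector->[$i] + $tolerance;
    }
    my $sql = 'SELECT id FROM images WHERE ' . join(' AND ', @where);
    return ($sql, @bind);
}

# Made-up query vector; only the first three elements are used here.
my ($sql, @bind) = build_prefilter_query([120, 64, 200, 33], 10, 3);
print "$sql\n";
print "bind values: @bind\n";
```

    Because no single element can differ by more than the total allowed difference, setting the tolerance to that total never excludes a true match; the full comparison then runs only on the rows that come back.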

Re: Comparing images to find similar images in a database
by wollmers (Scribe) on Dec 05, 2014 at 19:22 UTC

    As I understand it, the author merlyn defines a match as follows:

        my $FUZZ = 5; # permitted average deviation in the vector elements
        ...
        BUCKET: for my $bucket (@buckets) {
          my $error = 0;
          INDEX: for my $index (0..$#vector) {
            $error += abs($bucket->[0][$index] - $vector[$index]);
            next BUCKET if $error > $FUZZ * @vector;
          }
          ...

    IMHO the above set of matches is a subset of all matches where

        use List::Util qw(sum);

        my $pattern_sum = sum(@pattern_vector);
        my $upper_bound = $pattern_sum + $FUZZ * @pattern_vector;
        my $lower_bound = $pattern_sum - $FUZZ * @pattern_vector;  # lower edge of the window
        BUCKET: for my $bucket (@buckets) {
          my $bucket_sum = sum(@{ $bucket->[0] });
          next BUCKET if ($bucket_sum > $upper_bound || $bucket_sum < $lower_bound);
          # found, do something
        }

    Depending on the randomness, the number of matches will roughly double. For example, with $FUZZ=5 and a vector of 48 elements, each an 8-bit integer, the number of possible different sums of the vector values is 48*255+1=12_241. The original method allows a maximal sum of absolute differences of 48*5=240, so the set of all possible sums is reduced by a factor of about 12_241/240=51. With an interval of +/- 5 per element around the sum, the window is 48*5*2=480 wide and the reduction factor is only about 25. That means roughly 1_000 images found out of a total of 25_000 images.

    But if we calculate the sum of the vector, we can store it as an integer field in the database and use SQL comparisons.
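
    A sketch of what that lookup might be built from, assuming a hypothetical table images with an indexed integer column vector_sum. The function name and the sample vector are made up; the result is just the SQL text plus the two bounds to bind.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my $FUZZ = 5;    # same meaning as in the snippet quoted from the article

# SQL and bind values selecting rows whose stored sum lies inside the
# window around the pattern's own sum.
sub sum_window_query {
    my (@vector) = @_;
    my $total  = sum(@vector);
    my $margin = $FUZZ * @vector;    # @vector in numeric context is its length
    my $sql    = 'SELECT id FROM images WHERE vector_sum BETWEEN ? AND ?';
    return ($sql, $total - $margin, $total + $margin);
}

# Made-up 4-element vector (the article uses 48 elements).
my ($sum_sql, $lower, $upper) = sum_window_query(100, 120, 90, 110);
print "$sum_sql  [$lower, $upper]\n";
```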

    The query result could still be refined using the original method, or something better like e.g. cosine similarity, which should be fast enough for ~1_000 vectors.
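
    For the refinement step, cosine similarity over two equal-length vectors could be sketched like this. The function name and sample vectors are made up for illustration; candidates returned by the SQL prefilter would be ranked by this value, highest first.

```perl
use strict;
use warnings;

# Cosine of the angle between two equal-length vectors (array refs).
# Returns 0 if either vector is all zeros.
sub cosine_similarity {
    my ($x, $y) = @_;
    my ($dot, $norm_x, $norm_y) = (0, 0, 0);
    for my $i (0 .. $#$x) {
        $dot    += $x->[$i] * $y->[$i];
        $norm_x += $x->[$i] ** 2;
        $norm_y += $y->[$i] ** 2;
    }
    return 0 unless $norm_x && $norm_y;
    return $dot / sqrt($norm_x * $norm_y);
}

# Rank a few made-up candidate vectors against a pattern vector.
my @pattern    = (1, 2, 3);
my %candidates = (
    scaled    => [2, 4, 6],    # same direction as the pattern
    reversed  => [3, 2, 1],
    unrelated => [9, 0, 0],
);
my @ranked = sort {
    cosine_similarity(\@pattern, $candidates{$b})
        <=> cosine_similarity(\@pattern, $candidates{$a})
} keys %candidates;

print "best match first: @ranked\n";
```

    One caveat: cosine similarity ignores overall magnitude, so a uniformly brighter copy of an image scores the same as the original, which differs from the absolute-difference measure.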

Node Type: perlquestion [id://1109097]
Front-paged by Arunbear