http://www.perlmonks.org?node_id=614223

danmcc has asked for the wisdom of the Perl Monks concerning the following question:

Kindly monks,

I have a large swathe of color information about a photo collection in RGB format. Given a range of RGB values, I would like to retrieve photos containing those colors.

Right now, all this data is in a database table, with four columns: a red value, a green value, a blue value, and a photo ID. Lookups now occur in some sort of fashion like: select photo_id from colors where r in (12,13,14,15) and g in (78,79,80,81) and b in (23,24,25,26)

This takes a long time.

I would like to put all this information in a giant in-memory data structure for super-fast lookups. But just what sort of data structure I need escapes my current spiritual wisdom.

A brute force approach of pre-compiling all possible RGB combinations and the photos they correspond to is possible, but I'm hoping a lone monk out there might know of a more elegant approach.

Humbly,
danmcc

Replies are listed 'Best First'.
Re: multi-dimensional range lookup
by BrowserUk (Patriarch) on May 08, 2007 at 20:14 UTC

    How many images are involved?

    You can fully index the colors used in your images using 768 strings. The length of those strings will be one bit for every image. So, you can fully index 1,000,000 images using 96MB.

    The data would be represented something like this:

    $var = { red => [ 0 => b'0101011100010101111 ...', 1 => b'"1000111100010101100 ...', ... ], green => [ 0 => b'0101011100010101111 ...', ... ], blue => [ 0 => b'0101011100010101111 ...', ... ], };

    To lookup all the images that contain red = 12, 13, 14, 15, and greenb = 78, 79, 80, 81 and blue = 23, 24, 25, 26, you simply bitwise-and (&) the appropriate 12 strings together, and then translate whatever bits are left set, back into the identities of the images. Very fast and very memory efficient also.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
        ... your code will also match for rgb(2,2,4) etc.

        I do not believe so. Given,

        • picture 1 contains r=1, g=2, b=3
        • picture 2 contains r=2, g=3, b=4

        And assuming for simplicity, 3-bit/color images, the data would be encoded as

        my %lookup = { ## 0 1 2 3 4 5 6 7 red => [ 0b00, 0b10, 0b01*, 0b00, 0b00, 0b00, 0b00, 0b00, ], green => [ 0b00, 0b00, 0b10*, 0b01, 0b00, 0b00, 0b00, 0b00, ], blue => [ 0b00, 0b00, 0b00, 0b10, 0b01*, 0b00, 0b00, 0b00, ], };

        Anding the 3 bitstrings selected by rgb( 2,2,4 ) (* above) together

        0b01 & 0b10 & 0b01 == 00 ==> No hits.

        So, for the simple case of looking up one rgb value at a time, there is no overlap.

        For the more complex case of locating images that contain either rgb( 1,2,3 ) or rgb( 2,3,4 ), there is the problem that oring all six selected bitstrings together would also select rgb( 1,2,4 ) and rgb( 1,3,4 ) and rgb( 2,2,4 ) etc., but that is different from the problem description, which is more akin to r = 1 or 2 and g = 2 or 3, and b = 3 or 4, for which my code algorithm description would work correctly.

        For the case you describe, that of only matching those images that contain either rgb( 1,2,3 ) or rgb( 2,3,4 ), you have to do it in two separate operations. And the individual strings of the first 3 values together and extract the image identities from the result. Then do the second set of three and extract the image identities.

        It's and versus or.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: multi-dimensional range lookup
by Joost (Canon) on May 08, 2007 at 21:32 UTC
Re: multi-dimensional range lookup
by injunjoel (Priest) on May 08, 2007 at 20:18 UTC
    One thing that comes to mind is possibly converting each of the color triplets into HSV color space and indexing them based on Hue... if you are not interested in the saturation or brightness of the color, but rather whether its there or not. Though there is no standard for the conversion most seem to use the following for conversion:
    V = max (r,g,b) S = (max (r,g,b)) - min (r,g,b)/max (r,g,b) H = depends on which of r,g,b is the maximum where r,g,b are your 0 - 255 values.
    just a thought... perhaps a CPAN module in the making eh?

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: multi-dimensional range lookup
by scorpio17 (Canon) on May 08, 2007 at 21:06 UTC
    This sounds like a good place to use PDL (on CPAN, of course). It's fast, handles large amounts of data (and hides the underlying data structures so you don't have to worry about it). And image processing is one of its main applications - check it out. :-)
Re: multi-dimensional range lookup
by wjw (Priest) on May 09, 2007 at 00:41 UTC
    A different approach: You seem not to be concerned about memory usage.. so.. . I might try to:
    • create a heap table (in memory table in MySql terms)
    • query just the red range into that table
    • run a query against the heap table to delete records with green values outside the range of what you want
    • then do the same type of query to delete the records which are outside the range of blue values you want
    The initial heap table will probably be huge, but I am guessing you have plenty of memory.

    This is of course a brute force method, and you may have to limit your initial query a bit more. However, once the initial query is done:

    • the subsequent delete queries will be very fast
    • you will recover unused memory

    Once you are down to the initially required set, further queries will be very, very fast.

    Not elegant, but I have done this with large data sets when I don't have the time to mess with re-designing the schema, or coding fancy algorithms (which I am not good at anyway), and it worked great!..

    Just a thought....

    • ...the majority is always wrong, and always the last to know about it...
    • The Spice must flow...
    • ..by my will, and by will alone.. I set my mind in motion
Re: multi-dimensional range lookup
by RL (Monk) on May 08, 2007 at 20:08 UTC

    How many datasets are in the database?

    Do you have indexes on the columns included in your query?

      Hi, There are about three million rows in the table. There is a multi-column index on the r, g, and b columns, as well as individual indexes on each of the r, g, and, b columns. danmcc
Re: multi-dimensional range lookup
by roboticus (Chancellor) on May 09, 2007 at 12:38 UTC
    danmcc:

    You've gotten plenty of interesting responses for doing the in-memory data structure. However, RL alludes to database indexing, and I thought the point was worth expanding on. If you have indexes on the color columns, the database access ought to be pretty zippy.

    If you're missing those indices, then the database will have to do a table scan (i.e., read every record to see if it matches the criteria). With those indices, the database software can just reach right in and grab the rows of interest.

    Making a single color lookup index on the three columns r, g, b would be a good idea if you're always going to supply all three components. However, if you want to just query for individual components, you may want to have a different index on each color. This way, the database server won't have to drop down to a table scan if you choose the last color in your index. So if you have a query like:

    select photo_id from colors where g in (78,120,180) and blue in (23,25 +,27)
    then the database could just read the list of photos for the acceptable green components from the index, and the same for blue, and then return the photos in both lists. If you have enough photos in your database, it will be *much* faster than a table scan.

    ...roboticus

Re: multi-dimensional range lookup
by Dervish (Friar) on May 09, 2007 at 04:43 UTC
    Err... are the /combinations/ of colors important? If I'm reading the suggestions correctly, I could make an image with 768 colors (all 256 pure red values, all 256 pure green values, all 256 pure blue values), and they would always match any query -- but what if you want to match an RGB value of (127,127,127)?

    My image would not contain that value, but the suggestions I'm seeing would match it.

    Now if you kept 16 million indices, with one bit for each image in your data set, you could use the suggested method. A set of 1000 images would take 2 billion bytes, which might be more than you want to allocate.

    Or am I making the problem too difficult? edit: note that 16 million is 2^24

Re: multi-dimensional range lookup
by Moron (Curate) on May 09, 2007 at 10:55 UTC
    Your analysis is basically correct, notwithstanding the various tangential tuning attempts. But I did notice you could probably gain significant performance instantly using BETWEEN instead of IN, that is with most DBMS's I know, e.g.:
    select photo_id from colors where ( r between 12 and 15 ) and ( g betw +een 78 and 81 ) and b between 23 and 26
    As for the data structure to load the lot into for a memory scan, there isn't anything purpose built for your problem (except possibly some linear programming modules that are OTT for your simple need). But whether hoh or aoh is the best fit depends on whether photo_id is numeric without gaps, but the lower level hash should have keys r,g,b with the respective values. Also use Storable and only update it if the database changes to avoiding having to read the whole database every time. To search the hash or array, for the aoh case something like:
    my @match = map Filter( 'r', $minred, $maxred, Filter( 'g', $mingreen, + $maxgreen, Filter( 'b', $minblue, $maxblue, $array[ $_ ] ))), @array +; sub Filter { my ( $colour, $min, $max, $href ) = @_; $href or return undef(); ( $href -> { $colour } >= $min ) or return undef(); ( $href -> { $colour } <= $max ) or return undef(); return $href; }
    for hoh just substitute $array[ $_ ]with $hash{ $_ } in the map statement;
    __________________________________________________________________________________

    ^M Free your mind!

    Key to hats: ^I=white ^B=black ^P=yellow ^E=red ^C=green ^M=blue - see Moron's scratchpad for fuller explanation.

Re: multi-dimensional range lookup
by halley (Prior) on May 09, 2007 at 18:39 UTC

    Coming from an image-processing background, I have to ask if you're blindly indexing a whole image for just one or two stray pixels of a given RGB value.

    If you request a bright blue RGB, you probably aren't interested in 50 pictures of roses, each of which happen to have two or three pixels in obscure areas that reflect sky from the tinest of dewdrops, or worse, sensor noise. I expect you're more interested in finding the images that have bright blue as a dominant or at least prominent color.

    If this is indeed your intent, this is what I would do (and have done similarly):

    1. downscale your original photograph to something like 100x100px max, by blending not sampling,
    2. quantize it to the Safe216 color palette; this gives all combinations at six levels of R, G and B components,
    3. mark all the Safe216 colors that have at least a minimum count in the histogram,
    4. index your original image to all these marked Safe216 colors with a very simple table for prominence,
    5. repeat 1~4 with a 16x16px downsample, and index these colors for dominance,
    6. when querying, search using the nearest Safe216, and/or the nearest adjacent or similar Safe216 values.

    The step 1 is to reduce your photo to a color set including only colors that have prominence. The use of Safe216 is to work with a much smaller and consistent index; 216 colors is easier than 16777216, and still works fine when you use RAW or HDRI. The threshold to include a Safe216 is adjustable, perhaps even N>1 is fine for most of your uses. It's also a lot easier to figure out a small number of adjacent colors, for matching all "basically orange-ish" colors, than to worry about database range-matching implementation details.

    --
    [ e d @ h a l l e y . c c ]

Re: multi-dimensional range lookup
by 13warrior (Acolyte) on May 09, 2007 at 06:25 UTC
    Hi, If you want giant in-memory storage - you can use MDBM or an SQLITE. Sqllite is an memory DB, so query should be more responsive
Re: multi-dimensional range lookup
by mattr (Curate) on May 10, 2007 at 08:26 UTC
    I looked at mysql's spatial point types which are 2D but you can add an index to them to perhaps make them 3D.. but hey that's not fun right?

    I treated the problem as finding points within a sphere of some radius around a point in question. It's not something you want to do for a million photos just yet, but... Actually it's quite fast! (So you can use Perl DBI or DBIx::Class with about the same speed.) This is a search for all points within a 20 pixel radius sphere centered around the color (25,25,25).

    # mysql -uroot -pPu5huqep space -e 'desc point' +-------+---------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +-------+---------+------+-----+---------+----------------+ | id | int(11) | | PRI | NULL | auto_increment | | x | int(11) | | | 0 | | | y | int(11) | | | 0 | | | z | int(11) | | | 0 | | | alpha | int(11) | | | 0 | | +-------+---------+------+-----+---------+----------------+ # for (( i=1; i<=1; i++ )); do `mysql -uroot -pPu5huqep space -e 'ins +ert into point (x,y,z) values (255*RAND(),255*RAND(),255*RAND());'` ; +done # mysql space -e 'select count(*) from point' +----------+ | count(*) | +----------+ | 10105 | +----------+ # time mysql space -e 'select count(*) from point where (( ((x-25)*(x- +25))+((y-25)*(y-25))+((z-25)*(z-25))) < 400)' +----------+ | count(*) | +----------+ | 39 | +----------+ real 0m0.013s user 0m0.006s sys 0m0.003s
    So the idea would be to generate a palette of say the top 100 colors for each photo, where the photo id goes in the 'w' column and the r,g,b for each point goes in columns x,y,z. Then the 10,000 points I used would represent 100 photos. So it would seem in 10 seconds mysql could return a query for 100,000 photos using this method (I haven't timed it for that many yet though). Here's a bigger test, the search takes the same amount of time regardless of radius.
    # mysql space -e 'select count(*) from point ' +----------+ | count(*) | +----------+ | 100106 | +----------+ # time mysql space -e 'select count(*) from point where (( ((x-25)*(x- +25))+((y-25)*(y-25))+((z-25)*(z-25))) < 400)' +----------+ | count(*) | +----------+ | 310 | +----------+ real 0m0.049s user 0m0.006s sys 0m0.002s
Re: multi-dimensional range lookup
by CountZero (Bishop) on May 09, 2007 at 20:27 UTC
    This takes a long time
    What do you mean by "a long time"? Seconds, minutes, hours?

    Normally a good database should find an answer to your query quite fast using indices on all columns. After all, databases are purpose built for such look-ups! What type of database are you using?

    Getting all the data from the database into an in-memory datastructure will take quite some time too, so the initial start-up cost will be significant and hoping you can keep everything in RAM and the computer does not start swapping it out to swap-space.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law