PerlMonks
Perl Hashes in C?

by BrianP (Acolyte)
on Aug 11, 2015 at 18:31 UTC ( [id://1138218] )

BrianP has asked for the wisdom of the Perl Monks concerning the following question:

I have a small Perl script that takes a crop spec with the XY position, width and height of a box, grabs each pixel's 48-bit RGB value as a hash key and increments the count. It takes just a second on a ~100x100 pixel rectangle.

I use it to measure the quality of color spaces and various photo processes.

I wrote something much clunkier in C for use on entire 216 MB pictures. But it takes 2.17 HOURS to run on a D800E pic with 27 million unique colors. The lookup table of 64-bit long longs grows to 27E6 entries toward the end and takes ~25 bsearch comparisons per lookup.

What I need is a good algorithm to take a 48-bit integer with essentially random values and spit out an integer in the range 0-36E6. Perl has some pretty slick hashing capabilities. I wonder whether any of them could be leveraged here.

GORY DETAILS: ----------

I have a C program which uses an array of 36 Million uint64s to hold 48 bits of RGB color data (16bits/channel) and 16 bits for a counter for the distinct colors found when scanning a 216 MB .RAW file.

For extremely colorful pictures, the length of this lookup table gets as high as 27 Million items. Even with a bsearch, it takes over 2 hours to count the number of pixels sharing each distinct color.

How does one HASH 48 bits of ~random RGB data and return an integer in a range from 36E6 up to as large as will fit in 32 GB of system RAM?

CURRENT PROCEDURE: -------------------------------

BSearch the known colors for the new color and increment the 16-bit counter if found. Otherwise, append the new, unique color to the end of the array and periodically QSort the new ones. Then I do a shuffle merge on the millions of old, sorted colors and the hundreds of new ones (after qsorting the new), appending as many as possible from each array onto a merge array, going back and forth until old and new are fully merged.

QSorting an array with millions of sorted items and a few hundred unsorted ones seems like a waste. If bsearch fails on the sorted bottom, I do a linear search on the unsorted top and append the color if it is not found.

The BSearch comparison function masks off the highest 16 bits of *A and *B (keeping 0x0000FFFFFFFFFFFF) and returns 1 if (*A & MASK) > (*B & MASK), -1 if less, else 0. I can't think of a faster way to compare two long longs through pointers.

What I need is a Perl hash (just do it) with C speed; a Cerl language?

Replies are listed 'Best First'.
Re: Perl Hashes in C?
by BrowserUk (Patriarch) on Aug 11, 2015 at 20:28 UTC

    This drops into Inline::C to explore the bitmap and construct the hash. It takes 23 seconds to process a 125 megapixel image:

    #! perl -slw
    use strict;
    use Inline C => Config => BUILD_NOISY => 1;
    use Inline C => <<'END_C', NAME => '_1138218', CLEAN_AFTER_BUILD => 0;
    typedef unsigned __int64 U64;
    typedef unsigned int U32;
    typedef unsigned char U8;

    HV* countColors( SV *img ) {
        STRLEN l;
        U32 *rgba = (U32*)SvPVx( img, l ), i;
        HV* counts = newHV();
        l /= 4;
        for( i = 0; i < l; ++i ) {
            if( hv_exists( counts, (char*)&rgba[ i ], 4 ) ) {
                SV **val = hv_fetch( counts, (char*)&rgba[ i ], 4, 0 );
                SvIV_set( *val, SvIV( *val ) + 1 );
            }
            else {
                SV *val = newSViv( 1 );
                hv_store( counts, (char*)&rgba[ i ], 4, val, 0 );
            }
        }
        return counts;
    }
    END_C
    use Time::HiRes qw[ time ];
    use Data::Dump qw[ pp ];
    use List::Util qw[ sum ];
    use GD;

    GD::Image->trueColor( 1 );
    my $i = GD::Image->new( $ARGV[ 0 ] ) or die $!;
    my $gd = $i->gd;
    my( undef, $width, $height, undef, undef ) = unpack 'nnnCV', substr( $gd, 0, 11, '' );
    printf "Width:%u height:%u pixels:%u\n", $width, $height, length( $gd ) / 4;
    my $start = time;
    my $href = countColors( $gd );
    printf "Took %f seconds\n", time() - $start;
    printf "Found %u colors in %u pixels\n", scalar( keys %{ $href } ), sum( values %{ $href } );
    __END__
    C:\test>1138218 mid.png
    Width:12800 height:10240 pixels:131072000
    Took 23.391081 seconds
    Found 81814 colors in 131072000 pixels

    The handling of the hash could be made more efficient by only calculating the hash-values once instead of several times; and by pre-extending the hash.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      A small change doubled the performance:

      HV* countColors2( SV *img ) {
          STRLEN l;
          U32 *rgba = (U32*)SvPVx( img, l ), i;
          HV* counts = newHV();
          l /= 4;
          for( i = 0; i < l; ++i ) {
              SV **val;
              if( val = hv_fetch( counts, (char*)&rgba[ i ], 4, 0 ) ) {
                  SvIV_set( *val, SvIV( *val ) + 1 );
              }
              else {
                  SV *val = newSViv( 1 );
                  hv_store( counts, (char*)&rgba[ i ], 4, val, 0 );
              }
          }
          return counts;
      }

      C:\test>1138218 mid.png
      Width:12800 height:10240 pixels:131072000
      Took 12.642578 seconds
      Found 81814 colors in 131072000 pixels


      Those are TrueColor, 24 bits/pixel. I am dealing with 281 TRILLION possible colors: full 16 bits/channel, 48 bits/pixel.

      "Found 81814 colors in 131072000 pixels" -> 1602 Pixels per color.

      The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M "Pixels=36152321, unique Colors=27546248=76.19%"

      76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup.

      I am monkeying with the MAX_UNSORTED parameter which determines when a sort has to be done after so many new, random colors have been piled on top of the lookup table.

      I had it set at a way, way too low 200. I wrote a Perl script to run the C program with varying MAX_UNSORT numbers and am seeing vastly better performance, with 3805 the best so far. The linear searches on top of the pile are pretty cheap compared to QSorting and merging.

      With a 1 in 3 sampling (12M of 36M), I have it down to < 46 seconds with 88.55% unique colors

      The larger the number of unique colors, the more it pays to leave a pile of unsorted colors on top.

      The one I did before was a ColorMatch colorspace and it had ~76% unique colors. This one is ProPhoto and is over 85%! Same NEF file, same ACR settings, no Photoshop other than to import from ACR and save as RAW.

      It looks like I need to work on the Sort_Merge. QSort on the entire 27-million-tall stack, 99% of it already sorted, was taking 98% of the program time. The shuffle_merge is 100 times faster on this problem.

        The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M

        This is a perl creating a hash with 27 million keys:

        [0] Perl> $t = time; ++$h{$_} for 1 .. 27e6; print time() - $t;;
        52.9575479030609

        53 seconds!

        76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup.

        Sorry, but if you mean "rehash every preexisting key", you are wrong. Why do you believe that?

        (If you mean something else by the highlighted phrase above, you're gonna have to explain yourself better.)

        The beauty of hashing for this application is that it doesn't matter what the range is, only the actual total.

        For each pixel in your image you either need to add a new key; or increment an existing value. Either takes approximately the same amount of time: circa: 0.000000457763671875 of a second on my rather ancient hardware.

        Indexing your 48-bit values (as opposed to my 32-bit ones) will take ~50% longer; so perhaps 40 seconds to count the colours in my 125 mega-pixel image.

        I have it down to < 46 seconds with 88.55% unique colors

        If you've already trimmed your OP time of "2.17 hours" to 48 seconds, why have you wasted our time by asking this question?

        Another monk that I won't waste my time reading and producing solutions for in future.


Re: Perl Hashes in C? (Updated)
by flexvault (Monsignor) on Aug 11, 2015 at 21:02 UTC

    Welcome BrianP,

    On a Linux 32-bit system, Perl still allows 2**53 integers if you compile Perl with 64-bit integer support. (Note: if Perl is a 64-bit build, the problem goes away entirely!)

    perl -e 'use integer;$i=((2**17)**3);$m=2**53;print "$i\n$m\n";'
    2251799813685248
    9007199254740992
    As you can see, the maximum integer is larger than what you need, so you can generate integers directly by combining the three 16-bit
    <UPDATE>
    To make sure nobody just multiplies the RGB values together, remember that to represent the number 309 in decimal you use:
    perl -e '$n = (3*(10**2)) + (0*(10**1)) + (9*(10**0)); print "$n\n";'
    So the formula is ($B is the base):
    perl -e '$B = 10; $n = (3*($B**2)) + (0*($B**1)) + (9*($B**0)); print "$n\n";'
    So to generate 48-bit RGB integers you need ($base renamed here so it does not collide with the Blue value $B):
    perl -e '$base = 2**16; $n = ($R*($base**2)) + ($G*($base**1)) + ($B*($base**0)); print "$n\n";'
    Clarification complete!
    </UPDATE>
    values and then use that value as a hash key. Then each time you add a count just do:
    $hash{$key}++;   ## $key is the 48-bit RGB as a single integer
                     ## The value $hash{$key} is the number of times it was found

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Perl Hashes in C?
by FreeBeerReekingMonk (Deacon) on Aug 11, 2015 at 19:30 UTC

    Stick to Perl, use the underlying C routines, and the hash capabilities of Perl (unless you run out of memory).
    Once you binary-read a pixel and have 48 bits of data, it can be unpacked to a large integer.
    However, for speed, you should not unpack 3 separate 16-bit values and shift them together, but use quad values, which are only available if your Perl is 64-bit. Check Pack/Unpack Tutorial (aka How the System Stores Data).
    But still, in Perl, it would be better to unpack to hex strings, which are much shorter (thus faster), and use those as keys in a hash counter, like so:

    my %SEEN;
    my $key;
    open( FH, '<', "image.raw" ) or die $!;
    binmode(FH);
    my $startpos = 10;   # skip header
    seek(FH, $startpos, 0);
    $/ = \42;            # set record size
    while( <FH> ) {
        $key = join("", unpack 'H42', $_ );
        $SEEN{$key}++;
        # not sure how to calculate stop
    }
    for my $key (keys %SEEN){
        print "key $key seen $SEEN{$key} times\n";
    }
    # Add method to go from hex key to numeric rgb values if you want
      I have PerlMagick read the file into a blob,
      unpack the blob into a very long array of unsigned shorts,
      calc the compound, multi-channel values for Yellow, Purple, Cyan and White, and hash them all.
      The array slice makes it a great deal faster than segmenting it into pixels.
      $im = Image::Magick->new(size=>$xyr, type=>'TrueColor', depth=>16);
      $err = $im->Read(filename=>"RGB:$raw");
      warn "$err" if $err;                  # Read RAW
      ($blob) = $im->ImageToBlob();
      $bl = length $blob;
      @ushort = unpack("S*", $blob);
      $blen = length $blob;
      $ulen = scalar @ushort;
      printf("CR: $lbl Blob len $blen unpacks to $ulen uint16s -> %d pix -> %4.6f B/us\n",
             $blen / 6, $blen/$ulen);
      for($ii = 0; $ii < scalar @ushort; $ii += 3) {
          ($rr, $gg, $bb) = @ushort[$ii .. $ii+2];   # Array slice
          # Show full integers in 4Gig range for Yellow, Purple, Cyan
          $yy = sprintf("%10d", $rr*$gg);   # White=4294967296 (4,294,967,296) 10 dig
          $pp = sprintf("%10d", $rr*$bb);
          $cc = sprintf("%10d", $gg*$bb);
          # (2^16)^3 = 2^48 = 2.815E14, 15 sig figs MAX.
          # Doubles have 52/53 bits of Significand
          $tt = sprintf("%15d", $rr*$gg*$bb);   # TRUE color, R*G*B???
          $c2v2c{r}{$rr}++;   # Count Red channel points with this value
          $c2v2c{g}{$gg}++;   # Count Green channel points with this value
          $c2v2c{b}{$bb}++;   # Count Blue channel points with this value
          $c2v2c{y}{$yy}++;   # Count Yellow (R*G) points with this value
          $c2v2c{p}{$pp}++;   # Count Purple (R*B) points with this value
          $c2v2c{c}{$cc}++;   # Count Cyan (G*B) points with this value
          $c2v2c{t}{$tt}++;   # Count TrueRGB (R*G*B) points with this value
      }
      $vt = 0;   # Values Total; sum of all values
      foreach $chan (@leg) {            # sort keys %c2v2c
          $clr = $leg{$chan};           # y -> Yellow
          %v2c = %{$c2v2c{$chan}};      # Values -> Count_of_this_value hash
          $vc  = scalar keys %v2c;      # Distinct values for this channel
          $vt += $vc;                   # Value Count accumulator
          push @vc, "$clr $vc";
      }
      printf("CR: $lbl Chan Value Counts: %s, tot=$vt\n", join(", ", @vc));
      return $vc;   # Last ValueCount should be for T, TrueRGB
      # End Count_Rgb().
      Looking at the log file:
      3), R pf-2015.0531-249465.srgb.7360x4912.raw , crop_wh=577,428, blob=1481736 B -> 6.000000 B/p
      CR: srgb Blob len 1481736 unpacks to 740868 uint16s -> 246956 pix -> 2.000000 B/us
      CR: srgb Chan Value Counts: Red 18955, Green 16919, Blue 10862, Yellow 236778, Purple 221564, Cyan 222509, TrueRGB 225558, tot=953145
      Distinct RGB colors=225558 = 91.335% -> 1 dup clr / 11.54 pix
      Elapsed time = 9.67 sec
      It does 4 of these in almost 10 seconds, but look at the pixel count: < 1/4 megapixel.

      The hash only has to deal with 91% of 1/4 megapixel, i.e. 225,558 (about 226 thousand) unique colors.

      Scale this to 30 MILLION unique colors and the time factor goes from 9 seconds to 9 DAYS!

      I have the C version down to 3 minutes flat with a 7920 Unsorted limit and 2.77 Million unique colors!

        This was the code example from FreeBeerReekingMonk (quoted in his reply above). I have no idea where the \42 came from. I changed it to a record size of 6, then changed the unpack from 'H42' to 'H12' (12 hex digits = 6 bytes). It works:

        # The RIGHT ANSWER IS: RGB hash has 27645898 distinct colors.
        sub rgb_hash2() {
            my %SEEN;
            my $key;
            # 27645898 colors 216913920 B 36152320 pixels!
            $raw = "pf-2015.0531-249465.pprgb.7360x4912.raw";
            # $fs = -s $raw;   # RAW file size in BYTES
            open( FH, '<', "$raw" ) or die $!;
            binmode(FH);
            # RGB 16 BITS/CHANNEL = 48 BITS/PIXEL = 6 BYTE QUANTA -> 12 HEX digits
            $/ = \6;     # set record size. REF 2 INT, "\6" == 6 B
            #$/ = \42;   # set record size
            while( <FH> ) {
                # Smash each 12 Hex rec into 48 BIT UINT48?
                #$key = join("", unpack 'S3', $_ );
                $key = join("", unpack 'H12', $_ );
                $SEEN{$key}++;
            }
            close FH;
            $sk = scalar keys %SEEN;   # Scalar Keys. How many COLORS?
            printf("Seen{%d}, file_size=%2.6fMB, -> %1.6f B/key\n",
                   $sk, $fs/1.0E6, $fs/$sk);
            return $sk;
        }

        I also tried it with 3 unsigned shorts.

        Monk Ed suggested something at a bit lower level, with a sysread into a tiny 4096 * 6 byte buffer and nibbling the buffer with substr. If you monkey with the buffer size and try a size not evenly divisible by 6, it breaks. D'oh!

        Ed's ~ code (may be partially mangled by me):

        %Image = ();
        keys %Image = 4096 * 128;
        $rdamt = 4194304 unless $rdamt;   # 4 MByte! (or 4096 * 6)
        $compraw = -s $file;
        $pixels = int( $compraw / 6 );
        open( $in, "<", "$file") or die "$!\n";
        while( 1 ) {
            $size = sysread( $in, $buffer, $rdamt );
            if ( $size == 0 ) { last; }
            while( $buffer ) {
                if ( length( $buffer ) < 6 ) { last; }
                $key = substr( $buffer, 0, 6, '' );
                $Image{$key}++;
            }
        }
        close $in;
        $uniq = keys %Image;

        PERFORMANCE: SYSREAD with buffer size
        Event 'Read_Raw_16384' elapsed time = 0.520 min = 10.43%.
        Event 'Read_Raw_24576' elapsed time = 0.519 min = 10.41%.
        Event 'Read_Raw_4096'  elapsed time = 0.525 min = 10.54%.
        10.54% -> Read_Raw_4096
        10.43% -> Read_Raw_16384
        10.41% -> Read_Raw_24576   << Ed's Magic Number 4096 * 6
        Average ~= 0.5 min

        Record read, unpack to HEX:
        Event 'Read_Raw_Hex0' elapsed time = 0.781 min = 15.68%.
        Event 'Read_Raw_Hex1' elapsed time = 0.711 min = 14.26%.
        Event 'Read_Raw_Hex2' elapsed time = 0.704 min = 14.13%.
        Event 'Read_Raw_Hex3' elapsed time = 0.707 min = 14.18%.
        15.68% -> Read_Raw_Hex0
        14.26% -> Read_Raw_Hex1
        14.18% -> Read_Raw_Hex3
        14.13% -> Read_Raw_Hex2
        Average ~= 0.7 min

        Ed's magic number beat out the others tried by an eyelash!

        I tried unpacking 3 ushorts also. It seems like it should be faster to unpack three 2-byte ints than 12 hex digits:

        Event 'Read_Raw_UINT16x3Pack0' elapsed time=0.959 min=25.82%
        Event 'Read_Raw_UINT16x3Pack1' elapsed time=0.919 min=24.76%
        Event 'Read_Raw_UINT16x3Pack2' elapsed time=0.918 min=24.73%
        Event 'Read_Raw_UINT16x3Pack3' elapsed time=0.917 min=24.70%
        25.82% -> Read_Raw_UINT16x3Pack0
        24.76% -> Read_Raw_UINT16x3Pack1
        24.73% -> Read_Raw_UINT16x3Pack2
        24.70% -> Read_Raw_UINT16x3Pack3

        NO! Hex unpack is faster than UINT16 on the same data, and by a sizable margin. The low-level sysreads without buffering get the GOLD MEDAL!

        There was another Inline C method from BrowserUk (quoted in full in his reply above) that looked perfect, but it used GD with "TrueColor", which on my system means 24 bit, and my data are all 16 bit, so I didn't try it.

        What started as a 2 hour problem has been whittled down to 20 seconds. Ye Ha! Thanks for all the great ideas!

        Brian
Re: Perl Hashes in C?
by pme (Monsignor) on Aug 11, 2015 at 19:01 UTC
Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 12, 2015 at 14:52 UTC

    BrianP,

    Is there a sample image on the Internet that demonstrates your requirements?

    The mathematics of your problem can't be duplicated easily, but an image that demonstrates your actual problem could give those of us with a math background the ability to test for a better solution.

    BrowserUk's solution is excellent, but without data representing your problem, we're just guessing!

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      Ed,

      I am exploring alternate color spaces, bizarre processes and different printing options. I like the ProPhoto colorspace and find that I have many fewer problem spots when I use it: fewer burnouts, fewer jet-black holes and better skin tones.

      A large percentage of the photo establishment insists that no camera, monitor, printer or eyeball can see anything other than sRGB. It's easier to print, everything looks the same everywhere, and the least common denominator makes your life easier.

      HOGWASH!

      I suspect that these people are keeping McDonald's in business because it tastes the same everywhere. I am building analytic tools to measure the degradation that occurs with 20-year-old color spaces from a formerly large, low-quality software empire.

      I don't know if you will be able to see this, but it is a fairly colorful picture minimally processed in both ProPhoto and sRGB, converted to 16 bit/channel raw, subtracted as uint16 and normalized. There is a mountain of difference.

      <<Link>> https://picasaweb.google.com/lh/photo/u4_oeKwuv6vXjOQesjBQri_P6Kt2i5G4X9d1GVZOYtg?feat=directlink

      This is one of the reasons I wanted this tool. I need to do zillions of these and 2 hours each was objectionable. These Perl hashes are astonishing in their raw power on big data.

      Crusading for Better Color,

      BrianP

Re: Perl Hashes in C?
by anonymized user 468275 (Curate) on Aug 12, 2015 at 18:59 UTC
    Long before Perl and C++ were invented, a hashing algorithm meant mapping the first so many bytes of the key to a starting address, these being designed to spread evenly across the total available addressing space; the remainder of the key and the value were then added in sequentially (this needs an end-pointer to be stored at the landing point). Old-school hash lookup therefore expects to land immediately on the full-match value most of the time, and in the worst case to trawl along sequentially from there, but not far enough to hurt performance. Somehow it looks like this problem would tempt me into going that far old school to achieve all the theoretical benefits, especially if straight C is an option.

    One world, one people

Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 13, 2015 at 11:33 UTC

    BrianP

    I downloaded the file, but it was in '.jpg' format, not a raw file. I used it in the following script, then copied it until the file size was larger than 217 MB. The results of the 2 runs are below. I may be wrong, but I don't think you need to convert to an integer and then use pack; the result will be the same as the raw 6 bytes (48 bits) per pixel.

    use strict;
    use warnings;

    my %Image = ();
    keys %Image = 4096 * 128;
    my $rdamt  = 4096 * 6;
    my $buffer;
    # my $file = '249465.pprgb-srgb.absv.4000.cs.jpg';    # Original File downloaded
    my $file   = '249465.pprgb-srgb.absv.4000.cs.jpg2';   # New File, multiple copies
    my $compraw = -s $file;
    my $pixels  = int( $compraw / 6 );
    open ( my $in, "<", "./$file") or die "$!\n";
    while( 1 ) {
        my $size = sysread( $in, $buffer, $rdamt );   # No buffering
        if ( $size == 0 ) { last; }
        while( $buffer ) {
            if ( length( $buffer ) < 6 ) { last; }    # Throw away odd number of pixels
            my $key = substr( $buffer, 0, 6, '' );
            $Image{$key}++;
        }
    }
    close $in;
    my $uniq   = keys %Image;
    my $factor = sprintf("%.2f", $uniq / $pixels );
    print "Found $uniq colors in $pixels pixels $factor\%\n";
    __END__

    This is the results of run1, almost all unique pixels. About 1 second.

    > time pyrperl uniqcolors.plx
    Found 153601 colors in 153606 pixels 1.00%   ( File size: 921,638 )

    real    0m0.098s
    user    0m0.092s
    sys     0m0.008s

    This is using the larger file. We got some more uniques since the '.jpg' file was not a multiple of 6 bytes. About 15-16 seconds.

    > time pyrperl uniqcolors.plx
    Found 460775 colors in 41473710 pixels 0.01%   ( File size: 248,842,260 )

    real    0m15.015s
    user    0m14.937s
    sys     0m0.080s

    Try it on your real data and let us know if it helps.

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      Ed,

      Your hard core, direct, laser focused code got the answer spot on the first go in the delightfully brief time of < 37 seconds:

      Running C:\bin\bb.pl  Thu Aug 13 11:06:26 2015
      Found 27645898 colors in 36152320 pixels 0.76%
      Elapsed time = 36.82 sec

      Franklin would have marveled at the design efficiency.

      I have been attempting to arrive at the same 6 byte Quanta size for hours using (*!_))#% pack/unpack. UINT16s work perfectly as do UINT64s. How is it possible that nobody has ever thought of a UINT48?

      32 bits is too wimpy; 4.3GB is not enough. But 4.3G*4.3G BYTES is clearly OverTheTop! 18446744073709551616 ????

      Wouldn't 65536 * 4294967296 Bytes be just about right? Surely 281474976710656 B "is more memory than anybody will ever need"?? 281 TER! Has a ring to it.

      A 24 bit processor would be ideally suited for GRAPHICS PROCESSING.

      I hate to be a pest (but it does not usually stop me). While I still have a residual amount of hair left, might I ask you if you could point out my obvious comprehensional deficit with UNPACK?

      I have all 217 MB in memory. 8-byte quanta are too large and 4-byte are too small, so I am stuck with the 2-byte type "S", uint16_t. The inane method is all I can get to work: bit shifting and ORing:

      @ushort = unpack("S*", $buf);   # < Ushort[108456960] == RIGHT Number!
      for($ii = 0; $ii < scalar @ushort; $ii += 3) {
          ($rr, $gg, $bb) = @ushort[$ii .. $ii+2];   # Array slice
          # MORONIC>
          $bs = $rr | ($gg << 16) | ($bb << 32);     # << WORKS ;(
          $rgb2c{$bs}++;                             # Increment count for this color
      }

      This works, but as another Monk pointed out, finely slicing and dicing, then bit-shifting the diminutive chunks and ORing them back together, is hardly as satisfying as using 6-byte native quanta.

      I usually need the individual colors so I need this type of code, just not here. This is a case of wanting to find my error rather than fixing a blocking bug.

      How hard can it be to unpack 3 of the units from my array and smash them into the $RGB key I need with NO MONKEY BUSINESS? I tried every permutation of type S I could think of. Type Q worked fine, except that it gave 1 1/3 pixels at a time. Is there a way to unpack 3 UINT16s at a time with unpack()?

      WORKS!> @q = unpack("Q*", $buf);           $sq  = scalar(@q) || -1;
      FAIL!   @uint48 = unpack("S3", $buf);      $s48 = scalar(@uint48) || -1;
      FAIL!   @uint48 = unpack("S@3", $buf);     $s48 = scalar(@uint48) || -1;
      FAIL!   @uint48 = unpack("S[3]", $buf);    $s48 = scalar(@uint48) || -1;
      FAIL!   @uint48 = unpack("(SSS)*", $buf);  $s48 = scalar(@uint48) || -1;
      And other, Quixotic attempts at 48 BITness!

      If you can't UNPACK 'em, PACK THEM!

      I tried packing 3 shorts, a quad with a pair of NULLs chasing and many other schemes:

      #$quad = pack('Q', $rr, $gg, $bb, 0x0000);
      #$q2 = pack('Q1', $rr, $gg, $bb, 0x0000);   # Q2=0x0000000000000000
      #$q4 = pack('S4', $rr, $gg, $bb, 0x0000);
      #$q5 = pack("SSSS", $rr, $gg, $bb, 0x0000);
      #$q3 = pack('Q*', $rr, $gg, $bb, 0x0000);   # Q3=0x0000000000000000
      #$q4 = pack("Q", $rr, $gg, $bb, 0x0000);    # Q4=0x0000000000000000
      #$q5 = pack("S*", $rr, $gg, $bb, 0x0000);   # Q5=0x0000000000000000
      #$q5 = pack("Q*", @ushort[$ii .. $ii+2]);

      I always got zero or some error or something unprintable.

      Obviously reading a buffer-full and carving 6-byte slices works. And reading 3 uint16s and clumsily bit-stitching them together gets the job done. But reading the whole file and unpacking an entire array of finished products in one line would be the most elegant, and likely the fastest.

      Where is DATATYPE "G"?

      @UINT48=unpack("G*", $buf); # NATIVE, 48BIT UNSIGNED GRAPHIC INTS!

      It is unlikely that either K or R had digital cameras offering RAW file output so they can be forgiven for overlooking the obvious utility of UINT48.

      Perhaps what K&R missed, the Wall gang can substantiate?

      Thank you, Brian

        BrianP,

          32 bits is too wimpy; 4.3GB is not enough. But 4.3G*4.3G BYTES is clearly OverTheTop! 18446744073709551616 ????
        You have to go back to the math. 8bit or 128bit machines can get the same answer, it's knowing how the bits need to be put together :-)

          Wouldn't 65536 * 4294967296 Bytes be just about right?
        For you: Yes, for me, 32bits are fine for 98% of my work. All of my servers have at least 16GB, and many have many times that amount. But I can use 32bit Perl for 98% of the work (smaller footprint), and 64 bit Perl for the rest. I also have 32bit Perl with 64bit Integers.

        You are used to working with decimal numbers, but pack/unpack can be used to convert between binary, octal, decimal and hexadecimal. To use 48bit RGB, just think of the 6 octets as 3 16 bit numbers. Then this works:

        my $myNum = 65000 * (2**32);
        my ( $R,$G,$B ) = unpack( "nnn", substr( pack( "Q>", $myNum ), 2 ) );
        print "\$myNum: $myNum\n(\$R,\$G,\$B): $R\t$G\t$B\n";
        A lot of monks here are better at the math than I, but I can hold my own most of the time!

        For the future, ask specific questions that show your problem, and when possible show the code demonstrating it. For your initial problem you didn't have to worry about endianness, but you may have to consider it if you're working with different architectures.

        Good Luck...Ed

        "Well done is better than well said." - Benjamin Franklin

Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 14, 2015 at 16:54 UTC

    BrianP,

    Maybe this helps?

    use strict;
    use warnings;

    # my $file = '249465.pprgb-srgb.absv.4000.cs.jpg';    # Original File downloaded
    my $file   = '249465.pprgb-srgb.absv.4000.cs.jpg2';   # New File, multiple copies
    my $compraw = -s $file;
    my $pixels  = int( $compraw / 6 );
    my @UINT48;
    $UINT48[$pixels] = '';   ## Allocate size of array ( faster )
    open ( my $in, "<", "./$file") or die "$!\n";
    if ( 1 ) {
        my $loc = 0;
        my $size = sysread( $in, my $buffer, $compraw );   # No buffering
        if ( $size ne $compraw ) { die "Disk error: $!\n" }
        while( $buffer ) {
            $UINT48[$loc++] = substr( $buffer, 0, 6, '' );
        }
    }
    close $in;
    ## $buffer is gone! @UINT48 is populated with 6 byte RGB strings
    my $uint48 = scalar @UINT48;
    print "\nImage has $uint48 pixels\n";

    ## How to get the values back!
    my ( $R,$G,$B ) = unpack( "SSS", $UINT48[0] );
    print "\t$R,$G,$B\n";
    ( $R,$G,$B ) = unpack( "SSS", $UINT48[$pixels] );
    print "\t$R,$G,$B\n";
    __END__

    > time pyrperl uniqcolors.plx

    Image has 41473711 pixels
            17994,17993,256
            55551,57599,4096

    real    0m13.745s
    user    0m12.905s
    sys     0m0.840s
    >
    For your earlier question about reading from disk: Linux reads/writes in 4096-byte blocks, so I multiply the page size by 6 (since 4096 is not exactly divisible by 6). Use your system page size and multiply by a factor of 6 ( 6, 12, ... ).

    If you want '$buffer' around after the populate cycle, just remove the "if ( 1 ) { }" or define '$buffer' before the loop.

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Perl Hashes in C? (just sort)
by Anonymous Monk on Aug 15, 2015 at 13:26 UTC

    Counting unique pixel values is the equivalent of sort | uniq. No hashing, no associative maps are necessary.

    A good solution involves picking the most suitable sort algorithm and implementation. One might bucket by one color, then mergesort and count on the 32-bit values. Anyway, 30 million items single-threaded ought to be about a 1-second job.

      This stuff is absolutely not my forte but surely sorting is more expensive and slower than hashing unless memory use is thrashing... In any case, whatever it ought to be, others have given code and timings. :P

        Well... see Integer sorting. Sort can do better than n log n in specific circumstances. Whether it's sorting or hashing, the items are stashed in some location and the more you know about the distribution, the less you need to shuffle them around. Underneath, similar data structures may be used, indeed it may be hard to tell sorting and hashing apart here. (And remember, computer memory is linearly addressed which matches the single-dimensional sort ordering.)

        Mergesort is well amenable to parallel processing and SIMD streaming. It is practical and robust. Think of the same counting problem broken down sufficiently: you have a small strip of values, say couple hundred. Do you start hashing them or do you just run them through the pipeline?

Node Type: perlquestion [id://1138218]
Front-paged by BrowserUk