BrianP has asked for the wisdom of the Perl Monks concerning the following question:
I have a small Perl script that takes a crop spec with the XY position, width and height of a box, grabs the 48-bit RGB as a hash key and increments the count. It takes just a second on a ~100x100 pixel rectangle.
I use it to measure the quality of color spaces and various photo processes.
I wrote something much clunkier in C for use on entire 216 MB pictures, but it takes 2.17 HOURS to run on a D800E pic with 27 million unique colors. The lookup table of 64-bit long longs grows to 27E6 entries toward the end and takes ~25 bsearch comparisons per lookup.
What I need is a good algorithm that takes a 48-bit integer with essentially random values and spits out an integer in the range 0-36E6. Perl has some pretty slick hashing capabilities; I wonder whether any of them could be leveraged here.
GORY DETAILS: ----------
I have a C program which uses an array of 36 Million uint64s to hold 48 bits of RGB color data (16bits/channel) and 16 bits for a counter for the distinct colors found when scanning a 216 MB .RAW file.
For extremely colorful pictures, the length of this lookup table gets as high as 27 Million items. Even with a bsearch, it takes over 2 hours to count the number of pixels sharing each distinct color.
How does one HASH 48 bits of ~random RGB data and return an integer in the range 36E6 up to as many GB as will fit in 32 GB of system RAM?
CURRENT PROCEDURE: -------------------------------
BSearch the known colors for the new color and increment the 16-bit counter if found. Otherwise, append the new, unique color to the end of the array and periodically QSort the new ones. Then I do a shuffle merge of the millions of old, sorted colors and the hundreds of new ones (after qsorting the new), appending as many as possible from each array onto a merge array, going back and forth until old and new are fully merged.
QSorting an array with millions sorted and a few hundred unsorted seems like a waste. If bsearch fails on the sorted bottom, I do a linear search on the unsorted top and append the color if not found.
The BSearch comparison function masks off the highest 16 bits (0x0000FFFFFFFFFFFF) of *A and *B and returns 1 if (*A & MASK) > (*B & MASK), -1 if less, else 0. I can't think of a faster way to compare two long longs through pointers.
What I need is a Perl Hash (just do it) at C speed; the Cerl language?
Re: Perl Hashes in C?
by BrowserUk (Patriarch) on Aug 11, 2015 at 20:28 UTC
This drops into Inline::C to explore the bitmap and construct the hash. It takes 23 seconds to process a 125 megapixel image:
#! perl -slw
use strict;
use Inline C => Config => BUILD_NOISY => 1;
use Inline C => <<'END_C', NAME => '_1138218', CLEAN_AFTER_BUILD =>0;
typedef unsigned __int64 U64;
typedef unsigned int U32;
typedef unsigned char U8;
HV* countColors( SV *img ) {
    STRLEN l;
    U32 *rgba = (U32*)SvPVx( img, l ), i;
    HV* counts = newHV();
    l /= 4;
    for( i = 0; i < l; ++i ) {
        if( hv_exists( counts, (char*)&rgba[ i ], 4 ) ) {
            SV **val = hv_fetch( counts, (char*)&rgba[ i ], 4, 0 );
            SvIV_set( *val, SvIV( *val ) + 1 );
        }
        else {
            SV *val = newSViv( 1 );
            hv_store( counts, (char*)&rgba[ i ], 4, val, 0 );
        }
    }
    return counts;
}
END_C
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];
use List::Util qw[ sum ];
use GD;
GD::Image->trueColor(1);
my $i = GD::Image->new( $ARGV[ 0 ] ) or die $!;
my $gd = $i->gd;
my( undef, $width, $height, undef, undef ) = unpack 'nnnCV', substr( $gd, 0, 11, '' );
printf "Width:%u height:%u pixels:%u\n", $width, $height, length( $gd ) / 4;
my $start = time;
my $href = countColors( $gd );
printf "Took %f seconds\n", time() -$start;
printf "Found %u colors in %u pixels\n", scalar( keys %{ $href } ), sum( values %{ $href } );
__END__
C:\test>1138218 mid.png
Width:12800 height:10240 pixels:131072000
Took 23.391081 seconds
Found 81814 colors in 131072000 pixels
The handling of the hash could be made more efficient by only calculating the hash-values once instead of several times; and by pre-extending the hash.
HV* countColors2( SV *img ) {
    STRLEN l;
    U32 *rgba = (U32*)SvPVx( img, l ), i;
    HV* counts = newHV();
    l /= 4;
    for( i = 0; i < l; ++i ) {
        SV **val;
        if( val = hv_fetch( counts, (char*)&rgba[ i ], 4, 0 ) ) {
            SvIV_set( *val, SvIV( *val ) + 1 );
        }
        else {
            SV *val = newSViv( 1 );
            hv_store( counts, (char*)&rgba[ i ], 4, val, 0 );
        }
    }
    return counts;
}
C:\test>1138218 mid.png
Width:12800 height:10240 pixels:131072000
Took 12.642578 seconds
Found 81814 colors in 131072000 pixels
These are TrueColor. I am dealing with 281 TRILLION possible colors: a full 16 bits/channel, 48 bits/pixel.
"Found 81814 colors in 131072000 pixels" -> 1602 pixels per color.
The 216 MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M pixels:
"Pixels=36152321, unique Colors=27546248=76.19%"
76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup.
I am monkeying with the MAX_UNSORTED parameter, which determines when a sort has to be done after so many new, random colors have piled up on top of the lookup table.
I had it set at a way, way too low 200. I wrote a Perl script to run the C program with varying MAX_UNSORTED numbers and am seeing vastly better performance; 3805 is the best so far. The linear searches on top of the pile are pretty cheap compared to QSorting and merging.
With 1-in-3 sampling (12M of 36M pixels), I have it down to < 46 seconds with 88.55% unique colors.
The larger the number of unique colors, the more it pays to leave a pile of unsorted colors on top.
The one I did before was a ColorMatch colorspace and it had ~76% unique colors. This one is ProPhoto and is over 85%! Same NEF file, same ACR settings, no Photoshop other than to import from ACR and save as RAW.
It looks like I need to work on the Sort_Merge. QSorting the entire 27-million-tall stack, 99% already sorted, was taking 98% of the program time. The shuffle_merge is 100 times faster on this problem.
[0] Perl> $t = time; ++$h{$_} for 1 .. 27e6; print time() - $t;;
52.9575479030609
53 seconds!
76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup.
Sorry, but if you mean "rehash every preexisting key", you are wrong. Why do you believe that?
(If you mean something else by the highlighted phrase above, you're gonna have to explain yourself better.)
The beauty of hashing for this application is that it doesn't matter what the range is, only the actual total.
For each pixel in your image you either need to add a new key; or increment an existing value. Either takes approximately the same amount of time: circa: 0.000000457763671875 of a second on my rather ancient hardware.
Indexing your 48-bit values (as opposed to my 32-bit ones) will take ~50% longer; so perhaps 40 seconds to count the colours in my 125 mega-pixel image.
I have it down to < 46 seconds with 88.55% unique colors
If you've already trimmed your OP time of "2.17 hours" to 48 seconds, why have you wasted our time by asking this question?
Another monk that I won't waste my time reading and producing solutions for in future.
Re: Perl Hashes in C? (Updated)
by flexvault (Monsignor) on Aug 11, 2015 at 21:02 UTC
Welcome BrianP,
On a Linux 32-bit system, Perl still allows 2**53 integers if you compile Perl with 64-bit integers. (Note: if Perl is a 64-bit version, then the problem goes away!)
perl -e 'use integer;$i=((2**17)**3);$m=2**53;print "$i\n$m\n";'
2251799813685248
9007199254740992
As you can see the maximum integer is larger than what you need.
So you can generate integers directly by multiplying the 3 16 bit
<UPDATE> To make sure nobody just multiplies the RGB together, please remember that to represent the number 309 in decimal you use:
perl -e '$n = (3*(10**2)) + (0*(10**1)) + (9*(10**0)); print "$n\n";'
So the formula is ($B is the base):
perl -e '$B = 10; $n = (3*($B**2)) + (0*($B**1)) + (9*($B**0)); print "$n\n";'
So to generate 48-bit RGB integers you need ($base renamed so it does not collide with the $B channel value):
perl -e '$base = 2**16; $n = ($R*($base**2)) + ($G*($base**1)) + ($B*($base**0)); print "$n\n";'
Clarification complete!
</UPDATE>
values and then use that value as a hash key. Then each time you add a count just do:
$hash{$key}++;   ## $key is 48 bit RGB as a single integer
                 ## The value $hash{$key} is the sum of the times found
Regards...Ed
"Well done is better than well said." - Benjamin Franklin
Re: Perl Hashes in C?
by FreeBeerReekingMonk (Deacon) on Aug 11, 2015 at 19:30 UTC
Stick to Perl, use the underlying C routines, and the hash capabilities of perl (unless you run out of memory)
Once you binary-read a pixel you have 48 bits of data, which can be unpacked to a large integer. For speed, though, you should not unpack 3 separate 16-bit values and shift them together, but use quad values, which are only available if your perl is 64-bit. Check Pack/Unpack Tutorial (aka How the System Stores Data)
But still, in perl, it would be better to unpack to hex strings, which are much shorter (thus faster), and use those as keys in a hash counter, like so:
my %SEEN;
my $key;
open( FH, '<', "image.raw" ) or die $!;
binmode(FH);
my $startpos = 10; # skip header
seek(FH, $startpos, 0);
$/ = \42; # set record size
while( <FH> ) {
$key = join("", unpack 'H42', $_ );
$SEEN{$key}++;
# not sure how to calculate stop
}
for my $key (keys %SEEN){
print "key $key seen $SEEN{$key} times\n";
}
# Add method to go from hex key to numeric rgb values if you want
I have PerlMagick read the file into a blob
Unpack the blob into a very long array of unsigned shorts
Calc the compound, multi-channel values for Yellow, Purple, Cyan and White and hash them all
The array slice makes it a great deal faster than segmenting it into pixels
$im = Image::Magick->new(size=>$xyr, type=>'TrueColor', depth=>16);
$err=$im->Read(filename=>"RGB:$raw"); warn "$err" if $err; # Read RAW
($blob) = $im->ImageToBlob(); $bl=length $blob;
@ushort=unpack("S*", $blob);
$blen=length $blob; $ulen=scalar @ushort;
printf("CR: $lbl Blob len $blen unpacks to $ulen uint16s -> %d pix -> %4.6f B/us\n",
    $blen / 6, $blen/$ulen);
for($ii=0; $ii < scalar @ushort; $ii+=3) {
    ($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice
    # Show full integers in 4Gig range for Yellow, Purple, Cyan
    $yy=sprintf("%10d", $rr*$gg); # White=4294967296 (4,294,967,296) 10 dig
    $pp=sprintf("%10d", $rr*$bb);
    $cc=sprintf("%10d", $gg*$bb);
    # (2^16)^3 = 2^48 = 2.815E14, 15 sig figs MAX.
    # Doubles have 52/53 bits of Significand
    $tt=sprintf("%15d", $rr*$gg*$bb); # TRUE color, R*G*B???
    $c2v2c{r}{$rr}++; # Count Red channel points with this value
    $c2v2c{g}{$gg}++; # Count Gre channel points with this value
    $c2v2c{b}{$bb}++; # Count Blu channel points with this value
    $c2v2c{y}{$yy}++; # Count Yellow compound points with this value
    $c2v2c{p}{$pp}++; # Count Purple compound points with this value
    $c2v2c{c}{$cc}++; # Count Cyan compound points with this value
    $c2v2c{t}{$tt}++; # Count TrueRGB compound points with this value
}
$vt=0; # Values Total; sum of all values
foreach $chan (@leg) { # sort keys %c2v2c
    $clr = $leg{$chan}; # y -> Yellow
    %v2c = %{$c2v2c{$chan}}; # Values -> Count_of_this_value hash
    $vc = scalar keys %v2c; # Distinct values for this channel
    $vt += $vc; # Value Count accumulator
    push @vc, "$clr $vc";
}
printf("CR: $lbl Chan Value Counts: %s, tot=$vt\n", join(", ", @vc));
return $vc; # Last ValueCount should be for T, TrueRGB
} # End Count_Rgb().
Looking at the log file:
3), R pf-2015.0531-249465.srgb.7360x4912.raw , crop_wh=577,428, blob=1481736 B -> 6.000000 B/p
CR: srgb Blob len 1481736 unpacks to 740868 uint16s -> 246956 pix -> 2.000000 B/us
CR: srgb Chan Value Counts: Red 18955, Green 16919, Blue 10862, Yellow 236778, Purple 221564, Cyan 222509, TrueRGB 225558, tot=953145
Distinct RGB colors=225558 = 91.335% -> 1 dup clr / 11.54 pix
Elapsed time = 9.67 sec
It does 4 of these in almost 10 seconds, but look at the pixel count: < 1/4 megapixel.
The hash only has to deal with 91% of 1/4 megapixel, or 225,558 (~226 thousand) unique colors.
Scale this to 30 MILLION unique colors and the time factor goes from 9 seconds to 9 DAYS!
I have the C version down to 3 minutes flat with a 7920 unsorted limit and 2.77 million unique colors!
This was the code example from FreeBeerReekingMonk:
-----------------------------
my %SEEN;
my $key;
open( FH, '<', "image.raw" ) or die $!;
binmode(FH);
my $startpos = 10; # skip header
seek(FH, $startpos, 0);
$/ = \42; # set record size
while( <FH> ) {
$key = join("", unpack 'H42', $_ );
$SEEN{$key}++;
# not sure how to calculate stop
}
for my $key (keys %SEEN){
print "key $key seen $SEEN{$key} times\n";
}
# Add method to go from hex key to numeric rgb values if you want
=================================
I have no idea where the \42 came from. I changed it to a record size of 6, then changed the unpack from 'H42' to 'H12': 12 hex digits = 6 bytes.
It works:
# The RIGHT ANSWER IS: RGB hash has 27645898 distinct colors.
sub rgb_hash2 {
    my %SEEN;
    my $key; # 27645898 colors 216913920 B 36152320 pixels!
    $raw="pf-2015.0531-249465.pprgb.7360x4912.raw";
    $fs=-s $raw; # RAW file size in BYTES
    open( FH, '<', "$raw" ) or die $!;
    binmode(FH);
    # RGB 16 BITS/CHANNEL = 48 BITS/PIXEL = 6 BYTE QUANTA -> 12 HEX digits
    $/ = \6; # set record size. REF to INT: \6 == 6 B
    #$/ = \42; # set record size
    while( <FH> ) {
        # Smash each 12-hex rec into a 48 BIT key
        #$key = join("", unpack 'S3', $_ );
        $key = join("", unpack 'H12', $_ );
        $SEEN{$key}++;
    }
    close FH;
    $sk=scalar keys %SEEN; # Scalar Keys. How many COLORS?
    printf("Seen{%d}, file_size=%2.6fMB, -> %1.6f B/key\n",
        $sk, $fs/1.0E6, $fs/$sk);
    return $sk;
}
I also tried it with 3 Unsigned Shorts too.
=====================================================
=====================================================
Monk Ed suggested something at a bit lower level: a sysread into a tiny 4096 * 6 byte buffer, nibbling the buffer with substr. If you monkey with the buffer size and try a size not evenly divisible by 6, it breaks. D'oh!
===============================================
Ed's ~ code (may be partially mangled by me):
%Image = (); keys %Image = 4096 * 128;
$rdamt = 4194304 unless $rdamt; # 4 MByte! (was 4096 * 6)
$compraw = -s $file; $pixels = int ( $compraw / 6 );
open ( $in, "<", "$file") or die "$!\n";
while( 1 )
{
    $size = sysread( $in, $buffer, $rdamt );
    if ( $size == 0 ) { last; }
    while( $buffer )
    {
        if ( length( $buffer ) < 6 ) { last; }
        $key = substr( $buffer, 0, 6, '' );
        $Image{$key}++;
    }
}
close $in;
$uniq = keys %Image;
===============================================
PERFORMANCE
SYSREAD with buffer size
Event 'Read_Raw_16384' elapsed time = 0.520 min = 10.43%.
Event 'Read_Raw_24576' elapsed time = 0.519 min = 10.41%.
Event 'Read_Raw_4096' elapsed time = 0.525 min = 10.54%.
10.54% -> Read_Raw_4096
10.43% -> Read_Raw_16384
10.41% -> Read_Raw_24576 << Ed's Magic Number 4096 * 6
Average ~= 0.5 min
Record read, unpack to HEX
Event 'Read_Raw_Hex0' elapsed time = 0.781 min = 15.68%.
Event 'Read_Raw_Hex1' elapsed time = 0.711 min = 14.26%.
Event 'Read_Raw_Hex2' elapsed time = 0.704 min = 14.13%.
Event 'Read_Raw_Hex3' elapsed time = 0.707 min = 14.18%.
15.68% -> Read_Raw_Hex0
14.26% -> Read_Raw_Hex1
14.18% -> Read_Raw_Hex3
14.13% -> Read_Raw_Hex2
Average ~= 0.7 min
Ed's magic number beat out the others tried by an eyelash!
Tried Unpacking 3 UShorts also. Seems like it should
be faster to unpack 3, 2 byte ints than 12 hex:
Event 'Read_Raw_UINT16x3Pack0' elapsed time=0.959 min=25.82%
Event 'Read_Raw_UINT16x3Pack1' elapsed time=0.919 min=24.76%
Event 'Read_Raw_UINT16x3Pack2' elapsed time=0.918 min=24.73%
Event 'Read_Raw_UINT16x3Pack3' elapsed time=0.917 min=24.70%
25.82% -> Read_Raw_UINT16x3Pack0
24.76% -> Read_Raw_UINT16x3Pack1
24.73% -> Read_Raw_UINT16x3Pack2
24.70% -> Read_Raw_UINT16x3Pack3
NO! Hex unpack is faster than UINT16 on the same data, and by a sizable margin.
The low-level sysreads without buffering get the GOLD MEDAL!
There was another Inline C method from BrowserUk that looked perfect, but it used GD with "TrueColor", which on my system means 24-bit, and my data are all 16 bits/channel, so I didn't try it.
What started as a 2 hour problem has been whittled down to
20 seconds. Ye Ha!
Thanks for all the great ideas!
Brian
========================================================
Inline C version from BrowserUk:
by BrowserUk on Aug 11, 2015 at 20:28 UTC
(Quoted in full in BrowserUk's reply above; it drops into Inline::C to construct the hash and takes 23 seconds on a 125-megapixel image.)
Re: Perl Hashes in C?
by pme (Monsignor) on Aug 11, 2015 at 19:01 UTC
Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 12, 2015 at 14:52 UTC
BrianP,
Is there a sample image on the Internet that demonstrates your requirements?
The mathematics of your problem can't be duplicated easily, but an image that demonstrates your actual problem would give those of us with a math background the ability to test for a better solution.
BrowserUk's solution is excellent, but without data representing your problem, we're just guessing!
Regards...Ed
"Well done is better than well said." - Benjamin Franklin
Ed,
I am exploring alternate color spaces, bizarre processes and
different printing options. I like the ProPhoto colorspace and
find that I have many fewer problem spots when I use it; less
burnouts, fewer jet black holes and better skin tones.
A large percentage of the photo establishment insists that no camera, monitor, printer or eyeball can see anything other than sRGB. It's easier to print, everything looks the same everywhere, and the least common denominator makes your life easier.
HOGWASH!
I suspect that these people are keeping McDonald's in business because it tastes the same everywhere. I am building analytic tools to measure the degradation that occurs with 20-year-old color spaces from a formerly large, low-quality software empire.
I don't know if you will be able to see this, but it is a fairly colorful picture minimally processed in both ProPhoto and sRGB, converted to 16-bit/channel raw, subtracted by the uint16 and normalized. There is a mountain of difference.
<<Link>>
https://picasaweb.google.com/lh/photo/u4_oeKwuv6vXjOQesjBQri_P6Kt2i5G4X9d1GVZOYtg?feat=directlink
This is one of the reasons I wanted this tool. I need to do
zillions of these and 2 hours each was objectionable. These
Perl hashes are astonishing in their raw power on big data.
Crusading for Better Color,
BrianP
Re: Perl Hashes in C?
by anonymized user 468275 (Curate) on Aug 12, 2015 at 18:59 UTC
Long before Perl and C++ were invented, a hashing algorithm meant mapping the first so many bytes of the key to a starting address, these being designed to spread evenly across the total available addressing space, and then adding the remainder of the key and the value in sequentially (this needs an end-pointer to be stored at the landing point). Old-school hash lookup therefore expects to land immediately on the full-match value most of the time, and in the worst case has to trawl along sequentially to find it from there, but not far enough to hurt performance. Somehow this problem would tempt me into going that far old school to achieve all the theoretical benefits, especially if straight C is an option.
Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 13, 2015 at 11:33 UTC
BrianP
I downloaded the file, but it was in '.jpg' format and not a raw file. I used it in the following script, and then I copied it until the file size was larger than 217 MB. The results of the 2 runs are below. I may be wrong, but I don't think you need to convert to an integer and then use pack; the result will be the same as the raw 6 bytes ( 48 bits ) for the pixel.
use strict;
use warnings;

my %Image = (); keys %Image = 4096 * 128;
my $rdamt = 4096 * 6;
my $buffer;
# my $file = '249465.pprgb-srgb.absv.4000.cs.jpg';  # Original file downloaded
my $file = '249465.pprgb-srgb.absv.4000.cs.jpg2';   # New file, multiple copies
my $compraw = -s $file; my $pixels = int ( $compraw / 6 );
open ( my $in, "<", "./$file") or die "$!\n";
while( 1 )
{
    my $size = sysread( $in, $buffer, $rdamt );     # No buffering
    if ( $size == 0 ) { last; }
    while( $buffer )
    {   if ( length( $buffer ) < 6 ) { last; }      # Throw away odd number of pixels
        my $key = substr( $buffer, 0, 6, '' );
        $Image{$key}++;
    }
}
close $in;
my $uniq = keys %Image;
my $factor = sprintf("%.2f", $uniq / $pixels );
print "Found $uniq colors in $pixels pixels $factor\%\n";
__END__
This is the result of run 1, almost all unique pixels. Well under a second.
> time pyrperl uniqcolors.plx
Found 153601 colors in 153606 pixels 1.00% ( File size: 921,638 )
real 0m0.098s
user 0m0.092s
sys 0m0.008s
This is using the larger file. We got some more uniques since the '.jpg' file was not a multiple of 6 bytes. About 15-16 seconds.
> time pyrperl uniqcolors.plx
Found 460775 colors in 41473710 pixels 0.01% ( File size: 248,842,260 )
real 0m15.015s
user 0m14.937s
sys 0m0.080s
Try it on your real data and let us know if it helps.
Regards...Ed
"Well done is better than well said." - Benjamin Franklin
Running C:\bin\bb.pl Thu Aug 13 11:06:26 2015
Found 27645898 colors in 36152320 pixels 0.76%
Elapsed time = 36.82 sec
Franklin would have marveled at the design efficiency.
I have been attempting to arrive at the same 6-byte quanta size for hours using (*!_))#% pack/unpack. UINT16s work perfectly, as do UINT64s. How is it possible that nobody has ever thought of a UINT48?
32 bits is too wimpy; 4.3 GB is not enough. But 4.3G*4.3G BYTES is clearly OverTheTop! 18446744073709551616????
Wouldn't 65536 * 4294967296 bytes be just about right? Surely 281474976710656 B "is more memory than anybody will ever need"?? 281 TB! Has a ring to it.
A 24-bit processor would be ideally suited for GRAPHICS PROCESSING.
I hate to be a pest (but it does not usually stop me). While I still have a residual amount of hair left, might I ask you to point out my obvious comprehension deficit with UNPACK?
I have all 217 MB in memory. 8-byte quanta are too large and 4-byte are too small, so I am stuck with 2-byte type "S", uint16_t. The inane method is all I can get to work: bit shifting and ORing:
@ushort=unpack("S*", $buf); # < Ushort[108456960] == RIGHT number!
for($ii=0; $ii < scalar @ushort; $ii+=3) {
    ($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice
    MORONIC> $bs = $rr | ($gg << 16) | ($bb << 32); # << WORKS ;(
    $rgb2c{$bs}++; # Increment count for this color
}
This works, but as another monk pointed out, finely slicing and dicing then bit-shifting the diminutive chunks and ORing them back together is hardly as satisfying as using 6-byte, native quanta. I usually need the individual colors, so I need this type of code, just not here. This is a case of wanting to find my error rather than fixing a blocking bug.
How hard can it be to unpack 3 of the units from my array and smash them into the $RGB key I need with NO MONKEY BUSINESS? I tried every permutation of type S I could think of. Type Q worked fine except that it gave 1 1/3 pixels at a time.
Is there a way to unpack 3 UINT16s at a time with UNPACK()?
WORKS!> @q =unpack("Q*", $buf); $sq = scalar(@q) || -1;
FAIL! @uint48=unpack("S3", $buf); $s48 = scalar(@uint48) || -1;
FAIL! @uint48=unpack("S@3", $buf); $s48 = scalar(@uint48) || -1;
FAIL! @uint48=unpack("S[3]", $buf); $s48 = scalar(@uint48) || -1;
FAIL! @uint48=unpack("(SSS)*", $buf); $s48 = scalar(@uint48) || -1;
And other, Quixotic attempts at 48-BITness!
If you can't UNPACK 'em, PACK THEM!
I tried packing 3 shorts, a quad with a pair of NULLs chasing and many
other schemes:
#$quad=pack('Q', $rr, $gg, $bb, 0x0000);
#$q2=pack('Q1', $rr, $gg, $bb, 0x0000); # Q2=0x0000000000000000
#$q4=pack('S4', $rr, $gg, $bb, 0x0000); #
#$q5=pack("SSSS", $rr, $gg, $bb, 0x0000); #
#$q3=pack('Q*', $rr, $gg, $bb, 0x0000); # Q3=0x0000000000000000
#$q4=pack("Q", $rr, $gg, $bb, 0x0000); # Q4=0x0000000000000000
#$q5=pack("S*", $rr, $gg, $bb, 0x0000); # Q5=0x0000000000000000
#$q5=pack("Q*", @ushort[$ii .. $ii+2]);
I always got zero or some error or something unprintable.
Obviously reading a buffer-full and carving 6 byte slices works.
And, reading 3 uint16s and clumsily bit-stitching them together
gets the job done. But reading the whole file and unpacking an
entire array of finished products in 1 line would be the most
elegant and likely the fastest.
Where is DATATYPE "G"?
@UINT48=unpack("G*", $buf); # NATIVE, 48BIT UNSIGNED GRAPHIC INTS!
It is unlikely that either K or R had digital cameras offering RAW
file output so they can be forgiven for overlooking the obvious
utility of UINT48.
Perhaps what K&R missed the Wall Gang can substantiate?
Thank you,
Brian
BrianP,
32 bits is too wimpy; 4.3GB is not enough. But 4.3G*4.3G BYTES
is clearly OverTheTop! 18446744073709551616 ????
You have to go back to the math. 8-bit or 128-bit machines can get the same answer; it's knowing how the bits need to be put together :-)
Wouldn't 65536 * 4294967296 Bytes be just about right?
For you: yes. For me, 32 bits are fine for 98% of my work. All of my servers have at least 16 GB, and many have many times that amount.
But I can use 32-bit Perl for 98% of the work (smaller footprint) and 64-bit Perl for the rest. I also have 32-bit Perl with 64-bit integers.
You are used to working with decimal numbers, but pack/unpack can be used to convert between binary, octal, decimal and hexadecimal. To use 48-bit RGB, just think of the 6 octets as 3 16-bit numbers. Then this works (note that unpack needs a byte string, so the number is packed to bytes first; "Q>" needs 64-bit integer support):
my $myNum = 65000 * (2**32);                    # R=65000, G=0, B=0
my $bytes = substr( pack( "Q>", $myNum ), 2 );  # low 6 of 8 big-endian bytes
my ( $R,$G,$B ) = unpack( "nnn", $bytes );
print "\$myNum: $myNum\n(\$R,\$G,\$B): $R\t$G\t$B\n";
A lot of monks here are better at the math than I, but I can hold my own most of the time!
For the future, ask specific questions that show your problem, and when possible show the code that demonstrates the problem. For your initial problem you didn't have to worry about endianness, but you may have to consider it if you're working with different architectures.
Good Luck...Ed
"Well done is better than well said." - Benjamin Franklin
Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 14, 2015 at 16:54 UTC
use strict;
use warnings;

# my $file = '249465.pprgb-srgb.absv.4000.cs.jpg';  # Original file downloaded
my $file = '249465.pprgb-srgb.absv.4000.cs.jpg2';   # New file, multiple copies
my $compraw = -s $file; my $pixels = int ( $compraw / 6 );
my @UINT48;
$UINT48[$pixels] = '';  ## Allocate size of array ( faster )
open ( my $in, "<", "./$file") or die "$!\n";
if ( 1 )
{   my $loc = 0;  ## (was -1 in the original post; post-increment from -1 writes one entry to $UINT48[-1])
    my $size = sysread( $in, my $buffer, $compraw );  # No buffering
    if ( $size ne $compraw ) { die "Disk error: $!\n" }
    while( $buffer )
    {   $UINT48[$loc++] = substr( $buffer, 0, 6, '' );
    }
}
close $in;              ## $buffer is gone!
                        ## @UINT48 is populated with 6-byte RGB strings
my $uint48 = scalar @UINT48;
print "\nImage has $uint48 pixels\n";
## How to get the values back!
my ( $R,$G,$B ) = unpack( "SSS", $UINT48[0] );
print "\t$R,$G,$B\n";
( $R,$G,$B ) = unpack( "SSS", $UINT48[$pixels - 1] );  ## last full pixel
print "\t$R,$G,$B\n";
__END__
> time pyrperl uniqcolors.plx
Image has 41473711 pixels
17994,17993,256
55551,57599,4096
real 0m13.745s
user 0m12.905s
sys 0m0.840s
>
For your earlier question about reading from disk: Linux reads/writes in 4096-byte blocks, so I multiply the page size by 6 (since 4096 is not exactly divisible by 6). Use your system page size and multiply by a factor of 6 ( 6, 12, ... ).
If you want '$buffer' around after the populate cycle, just remove the "if ( 1 ){ }" or define '$buffer' before the loop.
Regards...Ed
"Well done is better than well said." - Benjamin Franklin
Re: Perl Hashes in C? (just sort)
by Anonymous Monk on Aug 15, 2015 at 13:26 UTC
Counting unique pixel values is the equivalent of sort | uniq. No hashing, no associative maps are necessary.
A good solution involves picking the most suitable sort algorithm and implementation. One might bucket by one color, then mergesort and count on the remaining 32-bit values. Anyway, 30 million items single-threaded ought to be a ~1 sec job.
This stuff is absolutely not my forte, but surely sorting is more expensive and slower than hashing unless memory use is thrashing... In any case, whatever it ought to be, others have given code and timings. :P
Well... see Integer sorting. Sort can do better than n log n in specific circumstances. Whether it's sorting or hashing, the items are stashed in some location and the more you know about the distribution, the less you need to shuffle them around. Underneath, similar data structures may be used, indeed it may be hard to tell sorting and hashing apart here. (And remember, computer memory is linearly addressed which matches the single-dimensional sort ordering.)
Mergesort is well amenable to parallel processing and SIMD streaming. It is practical and robust. Think of the same counting problem broken down sufficiently: you have a small strip of values, say couple hundred. Do you start hashing them or do you just run them through the pipeline?