sub log2{ int( ( log( $_[0] ) + 1e-15 ) / log( 2 ) ) }
sub PP_EliasPack {
my $packed = '';
my $out = 0;
for my $num ( @_ ) {
my $len = log2( $num );
$out += $len;
vec( $packed, $out++, 1 ) = 1;
vec( $packed, $out++, 1 ) = ( $num & ( 1 << $_ ) ) ? 1 : 0
for 0 .. $len - 1;
}
return $packed;
}
sub PP_EliasUnpack {
my $packed = shift;
my $bits = length( $packed ) * 8;
my $in = 0;
my @unpacked;
while( $in < $bits ) {
my( $len, $num ) = ( 0 ) x 2;
$len++ while $in < $bits && vec( $packed, $in++, 1 ) == 0;
last if $in == $bits;
vec( $packed, $in++, 1 ) and $num |= ( 1 << $_ ) for 0 .. $len
+ - 1;
$num |= ( 1 << $len );
push @unpacked, $num;
}
return @unpacked;
}
And for reference, here is the results of my benchmark with that included:
C:\test>678848
Run with 15000 unique words in 1000 documents (Ave: 554 words/doc)
ASCII uses 4755336 bytes
W-BER uses 3196819 bytes
Binary uses 3279128 bytes
Elias uses 3980063 bytes
PP_Elias uses 3980063 bytes
1 trial of Packing ascii (10.203s total)
1 trial of Unpacking ascii (3.159s total)
1 trial of Packing W-BER (18.159s total)
1 trial of Unpacking W-BER (1.516s total)
1 trial of Packing binary (9.910s total)
1 trial of Unpacking binary (1.455s total)
1 trial of Packing Elias (13.613s total)
1 trial of Unpacking Elias (2.739s total)
1 trial of Packing PP_Elias (31.128s total)
1 trial of Unpacking PP_Elias (18.094s total)
Notice that PP_Elias achieves identical compression to Elias in C (as you'd expect :), but that it's twice as slow at packing, and nearly 10 times slower when unpacking.
If you want to continue with using Elias, and can't build the C version yourself, I could let you have it pre-built (for win32), but I have to wonder why you would when W-BER is faster and achieves better compression?
Also, I wonder if you saw Re^8: Byte allign compression in Perl.. where I demonstrated that you can have the DB do the selection for you using the schema I suggested way back when, in 0.312 of a second? For the record, I found a small optimisation in the schema that reduce that by a factor of 10, to 31 milliseconds.
So the DB does the selection, sends you just the data you need to do your proximity calculations, and does it all faster than you could pack a single integer. Interested?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|