http://www.perlmonks.org?node_id=952850

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
I have millions of strings that are exactly 6 characters (bytes) long. Each position in the string can be one of 37 possible characters (space, 0-9 and A-Z). I need to very quickly compress these strings into the smallest space possible and also be able to expand them back again later as quickly as possible.

The implementation I have now uses 6 bits per character and compresses the string into 36 bits. I split the string in half (3 characters each) and use a hash look up to convert each half to a series of 1s and 0s. I concatenate the bitstrings together (with 4 bits of padding) and used pack('B*', $bitstr) to get from 6 bytes to 5.

I know that I can do better. 37 ^ 6 = 2_565_726_409 while 2 ^ 32 = 4_294_967_296. If I treated the string as a base 37 number, it would easily fit into a 32 bit integer (4 bytes). What I am at a loss for is how to quickly translate 0 into '      ' and 2_565_726_409 into 'ZZZZZZ';

My environment is perl 5.8.8 with no compiler so I am limited to pure perl modules only. What's the best that you can do?

Cheers - L~R

Replies are listed 'Best First'.
Re: Fast - Compact That String
by BrowserUk (Patriarch) on Feb 09, 2012 at 22:46 UTC

    How does 16 second/million (both ways) including sub call overhead compare?

    #! perl -slw use strict; use Time::HiRes qw[ time ]; my @c = (' ', '0'..'9', 'A'..'Z' ); sub fromB37 { my $n = shift; join '', map { my $c = $n %37; $n = int( $n / 37 ); $c[ $c ]; } 1 .. 6; } my %c = map{ $c[ $_ ] => $_ } 0 .. 36; sub toB37 { my $n = 0; $n = $n * 37 + $c{$_} for reverse unpack '(a)*', $_[0]; return $n; } my $start = time; for ( 1 .. 1e6 ) { my $n = int( rand 37**6 ); $n == toB37( fromB37( $n ) ) or die $n; } printf "Took %.3f second\n", time() - $start; __END__ C:\test>junk45 Took 16.404 second

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Doubled the speed of the implementation:

      #! perl -slw use strict; use Time::HiRes qw[ time ]; my @c1 = (' ', '0'..'9', 'A'..'Z' ); sub fromB37 { my $n = shift; my $s = ' '; substr( $s, $_, 1, $c1[ $n%37 ] ), $n /= 37 for 0 .. 5; $s; } my @c2; $c2[ ord( $c1[ $_ ] ) ] = $_ for 0 .. 36; sub toB37 { my $n = 0; $n = $n * 37 + $c2[$_] for reverse unpack 'C*', $_[0]; $n; } my $start = time; for ( 1 .. 1e6 ) { my $s = fromB37( rand 37**6 ); } printf "fromB37 took %.3f seconds/million\n", time() - $start; $start = time; for ( 'AAAAA' .. 'CEXHN' ) { my $n = toB37( $_ ); } printf "toB37 took %.3f second/million\n", time() - $start; my @data = map int( rand 37**6 ), 1 .. 1e6; $start = time; $_ == toB37( fromB37( $_ ) ) or die $_ for @data; printf "Both ways took %.3f second/million\n", time() - $start; __END__ C:\test>junk45 fromB37 took 5.309 seconds/million toB37 took 3.234 second/million Both ways took 8.953 second/million

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        The method is pretty much optimal but a little loop unrolling and speed/memory tradeoff can again speed it up a bit (a bit over twice as fast for fromB37() and a good 50% faster for toB37()). Using a couple 100K extra, it probably depends on the CPU's cache size though. Only changed/extra parts:

        my @c3 = map { my $c = $_; (map { "$_$c" } @c1) } @c1; sub myFromB37 { my $n = shift; my $s = ' '; substr( $s, 0, 2, $c3[$n % 1369] ); $n /= 1369; substr( $s, 2, 2, $c3[$n % 1369] ); $n /= 1369; substr( $s, 4, 2, $c3[$n] ); $s; } my @c4; $c4[unpack 'S', $c3[$_]] = $_ foreach 0 .. 1368; sub myToB37 { 1874161 * $c4[unpack 'S', substr($_[0], 4, 2)] + 1369 * $c4[unpack 'S', substr($_[0], 2, 2)] + $c4[unpack 'S', substr($_[0], 0, 2)]; } my @num_data = map int( rand 37**6 ), 1 .. 1e6; my @asc_data = map " $_", 'AAAAA' .. 'CEXHN';

        The last lines pre-generate test data because my version depends on strings being no less then 6 characters; the rest of the benchmarking is analogous to yours. You could save one "reverse" in your version though by putting the characters in reverse order, which they should be anyway to represent a normal left-to-right base37 number.

        An explanation of the code as requested:

        ## Map the numbers, 0 .. 36 to the symbols we use ## to represent the number in base37 my @c1 = (' ', '0'..'9', 'A'..'Z' ); sub fromB37 { my $n = shift; ## Get the number to convert ## Allocate space for the Base37 representation ## Initialise it to the representation of 0 (six spaces) my $s = ' '; ## For each position in the string for( 0 .. 5 ) { ## extract the next base37 digit value ## look up its representaion character ## and assign it to the 'right place' i the string. substr( $s, $_, 1 ) = $c1[ $n%37 ] ); ## dividing by 37 effectively right-shifts ## the last digit's value out of the number $n /= 37; } $s; } my @c2; ## Map the ordinal values of the symbols ## to their numeric values (0 .. 37) ## The sparse array is faster than a hash $c2[ ord( $c1[ $_ ] ) ] = $_ for 0 .. 36; sub toB37 { my $n = 0; ## initialise our return value to 0 ## split the base37 representation ## into a list of the ordinal values of the symbols ## and reverse their order to match that produced by fromB37() for( reverse unpack 'C*', $_[0] ) { ## multiple the running total by 37 ## (effectively left-shifting the accumulator ## to accommodate the next digit.) ## and add value of the next base37 digit ## by looking it up in the mapping array $n = $n * 37 + $c2[ $_ ]; } ## return the accumulated value. $n; }

        As mbethke points out, these treat the base37 number in 'little-endian' fashion. This because you emphasised compression and speed, with no mention of needing to manipulate the compressed values numerically (sorting).

        To get the sortable, big-endian representation, you could use:

        my @c1 = (' ', '0'..'9', 'A'..'Z' ); sub fromB37 { my $n = shift; my $s = ' '; substr( $s, $_, 1, $c1[ $n%37 ] ), $n /= 37 for 5, 4, 3, 2, 1, 0; $s; } my @c2; $c2[ ord( $c1[ $_ ] ) ] = $_ for 0 .. 36; sub toB37 { my $n = 0; $n = $n * 37 + $c2[$_] for unpack 'C*', $_[0]; $n; }

        Which actually works out a bit quicker still, but not as fast as mbethke's unrolled version.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: Fast - Compact That String
by CountZero (Bishop) on Feb 09, 2012 at 23:42 UTC
    Using Bit::Vector.
    use Modern::Perl; use Bit::Vector; use Data::Dump qw/dump/; my %conversion = ( ' ' => 0, 0 => 1, 1 => 2, 2 => 3, 3 => 4, 4 => 5, 5 => 6, 6 => 7, 7 => 8, 8 => 9, 9 => 10, A => 11, B => 12, C => 13, D => 14, E => 15, F => 16, G => 17, H => 18, I => 19, J => 20, K => 21, L => 22, M => 23, N => 24, O => 25, P => 26, Q => 27, R => 28, S => 29, T => 30, U => 31, V => 32, W => 33, X => 34, Y => 35, Z => 36, ); my %inverted; my $vector = Bit::Vector->new(6); for ( keys %conversion ) { $vector->from_Dec( $conversion{$_} ); $conversion{$_} = $vector->to_Bin(); $inverted{ $vector->to_Bin() } = $_; } while (<DATA>) { chomp; my $string = join '', map { $conversion{$_} } split //; my $vector = Bit::Vector->new_Bin( 36, $string ); my $bits = $vector->to_Bin; $bits =~ s/([01]{6})/$inverted{$1}/ge; say $bits; } __DATA__ 123456 ZZZZZZ ABCDEF
    I know, Bit::Vector is an XS-module, so it doesn't comply with your specs.

    On my old laptop, it reads a file of 1 million strings, converts them to the compacted format and reconverts them to the original format (to make sure it worked) in 100 seconds.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Fast - Compact That String
by tobyink (Canon) on Feb 09, 2012 at 23:06 UTC

    Here's my attempt. There's two versions of the functions here. The first uses multiplication and division to implement something along the lines of what you say, where each possible six character string is mapped to a the integers 0 .. 2_565_726_408, encoded as 4 bytes.

    The second deals with the first three bytes and the second three bytes separately, mapping each to an integer in the range 0 .. 50652, and encoding each to strings of length 2 bytes, 4 bytes altogether.

    Although I haven't benchmarked them, my gut tells me that the second is faster. Whatsmore, the second can run within a "use integer" block, which allows Perl to use fast integer maths. The first will not run within "use integer" because it sometimes overflows.

    #!/usr/bin/perl use strict; use warnings; { use bytes; my %lookup; my @reverse; my $scale; BEGIN { my $x = 0; my @chars = (' ', 0..9, 'A'..'Z'); keys %lookup = 65_536; # preallocate hash buckets for my $i (@chars) { for my $j (@chars) { for my $k (@chars) { $lookup{ $i.$j.$k } = $x; $reverse[$x++] = $i.$j.$k; } } } $scale = $x; } # Functions using multiplication... sub alphanum_to_bytes_M { my $head = $lookup{ substr($_[0], 0, 3) }; my $tail = $lookup{ substr($_[0], 3, 3) }; my $n = ($head * $scale) + $tail; pack(N => $n) } sub bytes_to_alphanum_M { my $n = unpack(N => $_[0]); my $head = int($n / $scale); my $tail = $n - ($head * $scale); $reverse[$head] . $reverse[$tail] } # Functions using bitshifting... { use integer; sub alphanum_to_bytes_B { my $head = $lookup{ substr($_[0], 0, 3) }; my $tail = $lookup{ substr($_[0], 3, 3) }; pack(nn => $head, $tail) } sub bytes_to_alphanum_B { join q{}, @reverse[ unpack(nn => $_[0]) ] } } # Function to pretty-print byte strings for display purposes... sub show_bytes { my ($str) = @_; sprintf 'bytes[%s]', join q{ }, map { sprintf('%02x', ord(substr($str, $_, 1))) } 0 .. length($str)-1 } } my @lines = <DATA>; print "MULTIPLICATION:\n"; foreach (@lines) { my $b; chomp; printf( "'%s' => '%s' => '%s'\n", $_, show_bytes($b = alphanum_to_bytes_M($_)), bytes_to_alphanum_M($b), ); } print "BIT SHIFTING:\n"; foreach (@lines) { my $b; chomp; printf( "'%s' => '%s' => '%s'\n", $_, show_bytes($b = alphanum_to_bytes_B($_)), bytes_to_alphanum_B($b), ); } __DATA__ 0 1 A Z ABCDEF ABCDEG ZAAAAA ZZZZZZ

      I've had some time to benchmark it now. Run on an input of one million strings, the multiplying version takes 35 seconds and the bitshifting version takes 28 seconds. This confirms my hunch that the bitshifting version is faster. By about 20% it seems.

      Out of curiosity, I tried CountZero's XS version on the same processor, expecting it to be faster. But it took 124 seconds: significantly slower than either of the pure Perl versions. I imagine that the overhead of object construction/destruction, along with its use of regular expressions slows it down.

      (PS: I know the so-called bitshifting version doesn't use the bitshift operators. My initial implementation used "N" as the template for pack and unpack and used bitshifting to separate out the head and tail parts. I tweaked it to use "nn", which allowed me to drop the bitshifting, but I've kept referring to it as the bitshifting version anyway.)

Re: Fast - Compact That String
by Eliya (Vicar) on Feb 09, 2012 at 22:12 UTC
    my @ch = (' ', '0'..'9', 'A'..'Z'); my $base = @ch; my $num = 2_565_726_408; my $s = ''; while ($num > 0) { $s = $ch[$num % $base] . $s; $num = int($num / $base); } print $s; # ZZZZZZ

    (for small numbers you could space-pad the string on the left, if you like)

    And the reciprocal procedure (string to number) could be

    my @val; $val[ ord($ch[$_]) ] = $_ for 0..$#ch; my $s = ... my $num = 0; my $m = 1; for (reverse split //, $s) { $num += $m * $val[ ord($_) ]; $m *= $base; } print $num;
Re: Fast - Compact That String
by zebedee (Pilgrim) on Feb 13, 2012 at 00:25 UTC
    My environment is perl 5.8.8 with no compiler so I am limited to pure perl modules only.

    You could, however, compile a (C/C++/asm/whatever) binary on a machine with a compiler and copy that binary across?

    Not exactly in the spirit of the challenge, I know! Just sayin' ...
      zebedee,
      Actually no - even that has problems. First, you need to be able to introduce a compiler. Of course this doesn't need to be the production environment but there is still a change control process that must be adhered to. The next problem you will face is that compiling C extensions for Perl using a compiler other than the one that compiled Perl usually doesn't work. This means either recompiling Perl or creating a stand alone executable and using system calls. These both have the same problems. Keep in mind that I fully recognize that these hurdles are not technical and are quite silly but one has to choose what they want to spend their time doing.

      Cheers - L~R

        Come now, your criteria were:

        "I need to very quickly compress these strings into the smallest space possible and also be able to expand them back again later as quickly as possible."

        and

        "My environment ... no compiler"

        No mention of change control or process there - you wanted raw speed and little space used and you couldn't compile on the production machine.

        But I can imagine the approach I was thinking of - spawning an external process - (and the communication with that process) would negate any speed advantages and I'm too lazy to prove it one way or another so I'll shut up and go away!