
Efficient 7bit compression

by Limbic~Region (Chancellor)
on Mar 14, 2005 at 15:21 UTC ( [id://439303] : perlquestion )

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

hacker asked in this question how to compress relatively short strings. He had a unique goal in mind with specific requirements. Since the only way I could think to do it likely wouldn't meet his requirements, I provided an intentional obfu solution. The general idea is to use 7 bits to store each character. This gives a 12.5% reduction in size but only supports ASCII 0-127.
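As a sanity check on that figure: the packed size is ceil(7n/8) bytes for n input chars, so the full 12.5% only shows up when the length is a multiple of 8 (a quick editorial sketch, not from the original post):

```perl
use strict;
use warnings;

# 7-bit packing stores n chars in ceil(7n/8) bytes; the saving
# approaches 12.5% but is exact only when n is a multiple of 8.
for my $n (1, 4, 8, 9, 16, 100) {
    my $bytes = int( (7 * $n + 7) / 8 );
    printf "%3d chars -> %3d bytes (%4.1f%% saved)\n",
        $n, $bytes, 100 * (1 - $bytes / $n);
}
```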

This wasn't as straightforward as I had guessed. I asked around in the CB whether there was something I was missing; apparently there wasn't. diotalevi indicated that this was one of the first problems he tried solving in Perl. Traditional compression techniques fail on short strings because the output ends up bigger than the original. When you can afford the runtime cost but not the disk space (apparently every MB had to be justified on the campaign he was working on), 7bit strings start to look appealing.

I mentioned on Monday (today) that I would try to provide a more straightforward solution as well as multiple implementations: one for runtime speed (assuming RAM isn't a factor), one for RAM, and one that balances both.

I didn't do very well. I was hoping others would find the task fun and provide better solutions. Here is my solution optimized for memory.

sub shrink {
    my $str = '';
    for ( 0 .. length( $_[0] ) - 1 ) {
        my $bit = 7 * $_;
        vec($str, $bit++, 1) = $_
            for split //, unpack('b8', substr($_[0], $_, 1));
    }
    return $str;
}

sub grow {
    my ($str, $chr) = ('', '');
    my $end = length( $_[0] ) * 8;
    $end -= ($end % 7) + 1;
    for ( 0 .. $end ) {
        vec($chr, $_ % 7, 1) = vec($_[0], $_, 1);
        if ( $_ % 7 == 6 ) {
            vec($chr, 7, 1) = 0;
            $str .= $chr;
            $chr  = '';
        }
    }
    return $str;
}

Cheers - L~R

Replies are listed 'Best First'.
Re: Efficient 7bit compression
by Roy Johnson (Monsignor) on Mar 14, 2005 at 15:49 UTC
    It's easy, but does it count as efficient to expand the strings this way?
    use strict;
    use warnings;

    my $str = 'abcdefghijklmnop';

    sub shrink {
        my $bitstr = unpack('B*', $_[0]);
        $bitstr =~ s/.(.{7})/$1/g;
        pack('B*', $bitstr);
    }

    sub grow {
        my $bitstr = unpack('B*', $_[0]);
        $bitstr =~ s/(.{7})/0$1/g;
        pack('B*', $bitstr);
    }

    my $shrunk = shrink($str);
    printf "Was: %s, Is: %s\n", length($str), length($shrunk);
    printf "Restored: <%s>\n", grow $shrunk;

    Caution: Contents may have been coded under pressure.
      Roy Johnson,
      This looks similar to my unshown trading memory for time solution. The $bitstr will temporarily be 8 times larger than the original $str. I outlined 2 kinds of efficiency in my post (RAM & runtime). Now if there was a way to have our cake and eat it too.

      Cheers - L~R

        This has constant overhead by expanding only 8 chars at a time.
        sub shrink {
            my $packed = '';
            while ($_[0] =~ /(.{1,8})/gs) {
                my $bitstr = unpack('B*', $1);
                $bitstr =~ s/.(.{7})/$1/g;
                $packed .= pack('B*', $bitstr);
            }
            $packed;
        }

        sub grow {
            my $unpacked = '';
            while ($_[0] =~ /(.{1,7})/gs) {
                my $bitstr = unpack('B*', $1);
                $bitstr =~ s/(.{7})/0$1/g;
                $unpacked .= pack('B*', $bitstr);
            }
            $unpacked;
        }

        Caution: Contents may have been coded under pressure.
        I'm not sure how this fares with regard to speed, but it only uses 2-5 scalars, tries to avoid generating intermediate lists, and modifies its arguments in place.
        sub shrink {
            my $dbit = 0;
            for (my $sbit = 0; $sbit < length($_[0]) * 8; $sbit++) {
                next if $sbit % 8 == 7;
                vec($_[0], $dbit++, 1) = vec($_[0], $sbit, 1);
            }
            my $dlen = length($_[0]) * 7 / 8;
            $dlen++ unless $dlen == int($dlen);
            $dlen = int($dlen);
            my $extra = length($_[0]) - $dlen;
            if ($extra > 0) {
                substr($_[0], $dlen, $extra, '');
            }
            for (my $pbit = $dbit; $pbit < $dlen * 8; $pbit++) {
                vec($_[0], $pbit, 1) = 0;
            }
        }

        sub grow {
            my $sbit = int(length($_[0]) * 8 / 7) * 7 - 1;
            for (my $dbit = int(length($_[0]) * 8 / 7) * 8 - 1; $dbit >= 0; $dbit--) {
                vec($_[0], $dbit, 1) = $dbit % 8 == 7 ? 0 : vec($_[0], $sbit--, 1);
            }
        }
        I have the nagging feeling there is a better way to implement ceil (near the middle of shrink).
        If padding the compressed string with 0 bits is not needed, the second for loop of shrink can be omitted, saving the work of zeroing (on average) 4 padding bits.
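        For what it's worth, that ceil can be done with integer arithmetic alone (an editorial sketch; packed_len is a made-up name, not from the thread):

        ```perl
        use strict;
        use warnings;

        # ceil(len * 7 / 8) without going through floating point:
        sub packed_len {
            my ($len) = @_;
            return int( ($len * 7 + 7) / 8 );
        }

        printf "%3d chars -> %3d bytes\n", $_, packed_len($_) for 1, 8, 9, 100;
        ```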
Re: Efficient 7bit compression
by talexb (Chancellor) on Mar 14, 2005 at 17:12 UTC

    I'm not sure if the original request was for 7-bit compression or for text strings, which theoretically use values 32-127. If the latter, you can take six values in the range 1-96 and compress them into a five-byte 'chunk', since:

    96^6  =   782,757,789,696
    256^5 = 1,099,511,627,776
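    That chunking might be sketched as follows (an editorial sketch assuming a 64-bit Perl; pack6/unpack6 and the offset-by-32 mapping are my own choices, not from the thread):

    ```perl
    use strict;
    use warnings;

    # Six chars from ASCII 32-127 become base-96 digits 0-95; the
    # resulting integer fits in five bytes because 96**6 < 256**5.
    sub pack6 {
        my ($six) = @_;                    # exactly six chars
        my $n = 0;
        $n = $n * 96 + (ord($_) - 32) for split //, $six;
        # spread the integer over five bytes, big-endian
        return join '', map { chr( ($n >> (8 * $_)) & 0xFF ) } reverse 0 .. 4;
    }

    sub unpack6 {
        my ($five) = @_;
        my $n = 0;
        $n = ($n << 8) | ord($_) for split //, $five;
        my @chars;
        for (1 .. 6) {
            unshift @chars, chr( ($n % 96) + 32 );
            $n = int($n / 96);
        }
        return join '', @chars;
    }

    printf "6 chars -> %d bytes\n", length pack6('Monks!');
    ```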

    However, there may be better compression rates available depending on what you know about the data ahead of time. I know I spent a happy two or three weeks researching this (in the late '80s, using Borland's Turbo C) -- it's a very neat area to research.

    One more thought -- if you've got a large enough sample size, you can start to encode pairs ('en', 't ') or triplets ('en ', 'e, '). It's fun stuff.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Efficient 7bit compression
by beauregard (Monk) on Mar 14, 2005 at 17:41 UTC
    You might also be able to get it down to a 5 or 6 bit encoding by pre-scanning the string and determining the lowest ASCII value and the range. Then if that range is small enough, you can use fewer bits to encode. Of course, since hacker was trying to compress URLs, the range is usually going to be 73 chars (URLs always contain period (ASCII 46) and usually go up to 'w' (ASCII 119)), so 7 bits is about minimum. And by encoding the normalized value and bit count you just added two bytes...
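    The pre-scan described above might look like this (an editorial sketch; bits_needed is a made-up name, and the URL is just an example):

    ```perl
    use strict;
    use warnings;
    use List::Util qw(min max);

    # Pre-scan the string for the lowest code point and the range,
    # then work out how many bits each normalized char needs.
    sub bits_needed {
        my ($str) = @_;
        my @ord = map { ord } split //, $str;
        my ($lo, $hi) = (min(@ord), max(@ord));
        my $span = $hi - $lo + 1;      # distinct values to cover
        my $bits = 1;
        $bits++ while (1 << $bits) < $span;
        return ($lo, $bits);
    }

    my ($base, $bits) = bits_needed('');
    printf "base=%d ('%s'), %d bits per char\n", $base, chr($base), $bits;
    # prints: base=46 ('.'), 7 bits per char
    ```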


Re: Efficient 7bit compression
by perlfan (Vicar) on Mar 14, 2005 at 15:36 UTC
    Traditional compression techniques fail because with short strings they end up being bigger than the original.

    I am not sure I understand how this would be the case. A worst-case Huffman encoding could result in 0% compression, and I don't see how inflation is even possible. I am not familiar enough with other compression techniques to say that for *all* of them.

    Nonetheless, this will surely be useful in some applications.
      Some compression algorithms build a dictionary of common substrings and replace each substring with its dictionary index. For example, take "My life if I buy a dog". The substrings "y " and "if" are repeated. So, if you replace them with a 1-byte number and hoist the substrings into a dictionary somewhere else, you can shorten the string.

      Problem is two-fold. First, you have to either mark which substrings are in the dictionary or have all strings in the dictionary. Second, your dictionary costs a certain amount of space. So, it's only likely that large strings will sufficiently amortize the cost of a dictionary to realize savings. Smaller strings will not.
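      The effect is easy to see with the zlib bindings that ship with Perl (Compress::Zlib); deflate's fixed header, checksum, and dictionary bookkeeping swamp a short input:

      ```perl
      use strict;
      use warnings;
      use Compress::Zlib qw(compress);

      # Short inputs come out *longer* than they went in; only long
      # inputs amortize the fixed overhead.
      for my $str ('abc', 'x' x 50, 'x' x 5000) {
          my $z = compress($str);
          printf "%5d bytes in -> %5d bytes out\n", length($str), length($z);
      }
      ```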

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        Ok, I hear ya. I was not taking into account the string-specific character frequency table that is needed to decode a Huffman-encoded string. I guess one could get around that by using a general table based on some character-usage criteria, but that would probably not meet their needs.
      You forgot to factor in the size of the dictionary. That's why short strings end up being bigger than the original. For long strings, the size of the dictionary becomes an irrelevant percentage.
      I was paraphrasing diotalevi's and hacker's experience. I do not know which specific technique they were using or why it resulted in larger strings (I am assuming bookkeeping that didn't result in any gain). The point was that whatever methods were available/tried didn't help.

      Cheers - L~R

      In general, all lossless compression algorithms will produce output longer than the input for some values of the input.

      Here is a sketchy proof:
      There are finitely many strings of length at most n. If none of them were compressed to a longer string, their compressed values would again be the set of all strings of length at most n (i.e. the compression would be a permutation of that set). This implies that any string would be "compressed" to a string of the same length.
      In other words, the only injective (i.e. lossless) mappings that never cause length to increase are those that keep length the same.

      Update: I changed 1-1 with injective, which makes the statement a bit stronger and more directly related to the sketch of proof. As a side note, I should point out that compression algorithms should tend to be 1-1 anyway, so as to avoid "wasting" short strings.
      As to MidLifeXis's comment, of course by "compression algorithm" I meant something which strictly compresses at least one input, as pointed out by tilly.
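      The sketch can be restated as a counting argument (an editorial restatement, not from the thread). Write $S_k$ for the set of bit strings of length at most $k$; the key step is that the strings of length exactly $k$ must map onto strings of length exactly $k$, because the shorter outputs are already the image of $S_{k-1}$:

      ```latex
      \[
        |S_k| \;=\; \sum_{j=0}^{k} 2^j \;=\; 2^{k+1} - 1,
        \qquad
        |C(s)| \le |s| \ \forall s
        \;\Longrightarrow\;
        C(S_k) \subseteq S_k
        \;\Longrightarrow\;
        C|_{S_k} \text{ is a bijection of } S_k
        \;\Longrightarrow\;
        |C(s)| = |s| \ \forall s .
      \]
      ```

      That is: a lossless compressor that never expands any input can never shrink any input either.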



      The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket

        One small nit...

        In general, all lossless compression algorithms will produce output equal in length or longer than the input for some values of the input.

        Use an identity function as your compressor.

        Otherwise, good back of the napkin / handwavey proof.


Re: Efficient 7bit compression
by gam3 (Curate) on Mar 14, 2005 at 21:54 UTC
    You can get something closer to 30% compression using a Huffman coding.

    Code at

    Using only 4 urls, as example input, I was able to get between 34% and 40% compression.

    use strict;
    use lib qw(./);
    use Huffman qw(:All);
    use Data::Dumper;

    my @word = (
        ';node_id=3333',
        '',
        '',
        'art=&start=&ie=utf-8&oe=utF-8&client=firefox&rls=org.mozilla:en-US:unofficial',
        'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    );

    # build tree
    my @paths     = build_encodings(join('', @word));
    my $path_hash = build_encodings_hash(join('', @word));
    my $tree      = build_tree(join('', @word));

    print Dumper $path_hash;

    for my $string (qw( agg/North-Carolina )) {
        my $encoded_str;
        foreach my $item (split('', $string)) {
            if (my $ele = $path_hash->{$item}) {
                $encoded_str .= $ele->{ENCODING};
            }
            else {
                die "No encoding for $item\n";
            }
        }
        print "String: $string\n";
        print "Encoded:\n$encoded_str\n";
        # converting this to a bit vector is left as an exercise.
        my $decoded = decode_str($encoded_str, @paths);
        print "Decoded:\n" . $decoded . "\n";
        print "Stats:\n";
        print "Original chars: " . length($string) . "\n";
        print "Original size (in bits): " . (8 * length($string)) . "\n";
        print "Encoded size (in bits): " . length($encoded_str) . "\n";
        print "Size reduction: %"
            . ((1 - (length($encoded_str) / (8 * length($string)))) * 100) . "\n";
    }
    Original chars: 80
    Original size (in bits): 640
    Encoded size (in bits): 420
    Size reduction: %34.375
    Original chars: 34
    Original size (in bits): 272
    Encoded size (in bits): 161
    Size reduction: %40.8088235294118
    Original chars: 45
    Original size (in bits): 360
    Encoded size (in bits): 231
    Size reduction: %35.8333333333333
    A picture is worth a thousand words, but takes 200K.
Re: Efficient 7bit compression
by QM (Parson) on Mar 15, 2005 at 00:06 UTC
    Does it make any sense to incorporate several approaches, and make the system adaptive?

    Let's say we've done some benchmarking (which I haven't), and I have these systems, breakpoints and savings (negative is space saved):

    string length   system
    -------------   -------------------
    0-10            plain (+10%)
    11-50           7bit packing (-10%)
    51-100          Huffman (-20%)
    101-            LZ (-35%)
    Now, the first 2 bits (or byte, if plain) indicates which system is being used. Note that the plain text system suffers because of the overhead of the leading indicator.
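    A dispatch along those lines might look like this (an editorial sketch using a whole tag byte instead of 2 bits for simplicity; the compressors themselves are identity stand-ins, and all names are made up):

    ```perl
    use strict;
    use warnings;

    # Scheme names and breakpoints follow the table above.
    my %scheme = ( 0 => 'plain', 1 => '7bit', 2 => 'huffman', 3 => 'lz' );

    sub pick_scheme {
        my ($len) = @_;
        return 0 if $len <= 10;
        return 1 if $len <= 50;
        return 2 if $len <= 100;
        return 3;
    }

    sub wrap {
        my ($str) = @_;
        my $tag = pick_scheme(length $str);
        # real code would call the matching compressor here
        return chr($tag) . $str;
    }

    sub unwrap {
        my ($blob) = @_;
        my $tag = ord(substr $blob, 0, 1);
        return ($scheme{$tag}, substr($blob, 1));
    }

    my ($name, $payload) = unwrap(wrap('perlmonks.org'));
    print "$name: $payload\n";   # prints "7bit: perlmonks.org"
    ```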

    Would a system like this work? What would the breakpoints be? Is it worthwhile to combine any elements together?

    Or is it the case that benchmarking should be used to find the system that gives the best performance on the typical data, and stick with that?

    Quantum Mechanics: The dreams stuff is made of