Falkkin is working on a Java Huffman encoder...
Moved only by the will to serve, I propose him an example of what he could do if he did that in Perl... :-)
$"=$/=$_;map{$0{$_}++}split//,$_=<>;@0=map{[$_,$0{$_},$_]}keys%0; $/=1+1;until(@0==1){@0=sort{$$b[1]<=>$$a[1]}@0;$t=pop@0;$p=pop@0; push@0,[[$$t[0],$$p[0]],$$t[1]+$$p[1],$$t[$/].$$p[$/]];map{map{$t {$_}=$;%$/.$t{$_}}split//;$;++}$$t[$/],$$p[$/]}s/(.)/$t{$1}/gs;sub _{my$t=pop;if(ref$t){my@t=map{if(!ref$$t[$_]){1,unpack"b8",$$t[$_]} else{0,""}}0,1;"@t[0,$/,1]"._($$t[0])._($$t[1])."$t[-1]"}}print pack"sB*",length,_($0[0][0]).$_
(This was one of my "helpers" for OPC5)
Corresponding Huffman decoder to be posted soon.
|
---|
Replies are listed 'Best First'. | |
---|---|
Huffman encoder SOLUTION
by BooK (Curate) on Feb 20, 2001 at 21:48 UTC | |
Cet assombrissement est soumis au nom de Paris.pm canal assombri.
This program perform a Huffman compression to its STDIN or to the files given on the command-line. One drawback is that I was too lazy to write the code to output the compressed data of different files to different output files... HOW DOES IT WORK?First $/ and $" are undefined, so that the first use of the diamond operator slurps the whole file, and that the double-quoting of array(slice)s don't result in unwanted spaces. map is used instead of for because I like returning lists in a void context. Furthermore, it saves a few ()'s. Stéphane Payrard (of OPC3 fame) taught me that @0 is a perfectly valid array. That's a good enough reason to use it. The same goes for %0. Thanks to O::Deparse, I learned a few tricks like writing $a->[1] as $$a[1]... (Take this as a proof that one can learn something through obfuscation.) Naturally, I doubt that O::Deparse helped you much, since here it spits out:
Can't call method "sibling" on an undefined value at /usr/lib/perl5/5.6.0/i386-linux/B/Deparse.pm line 257. CHECK failed--call queue aborted. I must admit I don't know at all how it is possible. But, hey that's one more hurdle on your way to understanding this program. So, %0 holds the count for each character (the character is the key, the count is the value). Given the character, we create a array containing one array by character, containg: one sub-tree of the Huffman tree (at first, it's only a leaf (i.e. the character itself)), its weight, and a string containing all the sub-tree leaves. THE HUFFMAN TREEThe Huffman algorithm pairs the sub-trees by weight. As long as there are more than one sub-tree, pairs the lightest two into one sub-tree. I chose to have the lightest branches on the left (bit 0). 49 For a file containing: 27 A /\ 15 B 22 A 3 C /\ 1 D 7 B 1 E /\ 1 F C 4 1 G /\ 2 2 the resulting tree is (with the weight of each /\ /\ sub-tree noted at its root) drawn on the left. D EF G This tree is encoded at the beginning of the file in the following manner: 01 Two bits are used by node (in the previous example, /\ there are 6 nodes and 7 leaves), one for each branch. 01 A /\ 0 means non-terminal node and 1 means terminal node 10 B (i.e. there is a leaf hanging there). /\ C 00 Our tree looks like the one on the left. /\ 11 11 The recursive procedure that builds the tree /\ /\ represents the data in the following manner: D EF G [left-leaf][sub-tree][right-leaf] Applying this procedure (named _) recursively, we obtain: 010110C0011DE11FGBA (every character being replaced with its bit representation).
SO IT'S JUST A REGEX?While we build the tree (line 2 and 3), we also create a HASH %t which keys are the characters in the file to be compressed and which values are the bit string (the path in the tree). This path is constructed as we creare the tree itself. $; and $/ are used to create the needed 0's and 1's... $/ equals 2 (whose modulus helps), and $; is first undefined and is an even number each time we need it again in our map. Finally, we perform a substitution on the data, replacing each character by the bit string that represents the path in the tree. The compressed file is the concatenation of the length of the data (since pack will pad the file with null bits, we need this information for the decompressor), the dictionary (which is self-explaining: no need for its size), and the data itself. FINAL NOTEThis program was designed to have as little distinct characters as possible, and also so as some character appear really a lot. It was too difficult for me to write it so that the respective probability of each character would be a negative power of 2 (1/2**$k if you like), which is when Huffman encoding performs at its best. My goal was to write a compressor / decompressor all in one obfuscation. Alas, Huffman doesn't perform well enough (it's that lousy dictionary's fault!). OK, next time I'll try with Lempel-Ziv... Post-Scriptum:crazyinsomniac asked me to add this link to a site with explanations and a Java animation. | [reply] |
Re: Huffman encoder
by japhy (Canon) on Feb 20, 2001 at 22:29 UTC | |
become two PRECIOUS characters shorter:
japhy -- Perl and Regex Hacker | [reply] [d/l] [select] |
New Huffman Obfus
by demerphq (Chancellor) on Sep 16, 2001 at 05:11 UTC | |
My first solution is the equivelent to BooK's original encoder, ie it only encodes the message, without a header containing the information necessary to decode the message. It is 316 characters long (on a w32 box). My next solution is the same thing but this time with the header information. I call it obfu_enhuff.pl, it is 337 bytes long. The next is the corresponding decoder, as might be guessed I call it obfu_dehuff.pl, and it is 259 bytes long. The output of the encoder can be piped into this script and the result will be the original file. such as by perl obfu_enhuff.pl <obfu_enhuff.pl | perl obfu_dehuff.pl (usage on *nix machines might be little more graceful..) And for my last trick, I humbly present the combination of the two. This version I call obfu_huff.pl and is 622 bytes long. Like the others it uses STDIN and STDOUT but also takes an argument to determine if it decodes or encodes (no arguments at all). perl obfu_huff.pl <obfu_huff.pl | perl obfu_huff.pl -d Many thanks to the Obfu crew and especially BooK for the inspiration and the walk through of his implementation, especially in regard to the use of funky global variables. BTW, the code is set up for 80 columns, which means if you dont want it split you need to see your wrap code length to be 81 (go figure) in the User Settings to see all of them properly. Oh and apologies to any who think this is a waste of time as its already been done, for me it was a _great_ exercise (I think that there are imprtant lessons to be learned by doing an obfu), and I'm damn proud of it. (never even looked at BooK's Huffman Decoder!) :-)
Yves | [reply] [d/l] [select] |
by Anonymous Monk on Mar 21, 2012 at 09:06 UTC | |
| [reply] |
Re: Huffman encoder
by kh (Initiate) on Mar 03, 2001 at 19:30 UTC | |
| [reply] |
by BooK (Curate) on Mar 03, 2001 at 20:19 UTC | |
Well, uh, I also did a Lempel-Ziv encoder/decoder pair, but I discovered that it only worked with some processors only (it is dependent on the big endian/little endian stuff). Lempel-Ziv and its derivatives are not that difficult to code. Once you get the principle (which is explained in many web sites), it's quite easy. And please, don't assume code is functionnal: test it, read it, try to understand how it works. Don't just believe what those evil obfuscators say. :-) | [reply] |