Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Cet assombrissement est soumis au nom de Paris.pm canal assombri.

This program perform a Huffman compression to its STDIN or to the files given on the command-line. One drawback is that I was too lazy to write the code to output the compressed data of different files to different output files...

HOW DOES IT WORK?

First $/ and $" are undefined, so that the first use of the diamond operator slurps the whole file, and that the double-quoting of array(slice)s don't result in unwanted spaces. map is used instead of for because I like returning lists in a void context. Furthermore, it saves a few ()'s.

Stéphane Payrard (of OPC3 fame) taught me that @0 is a perfectly valid array. That's a good enough reason to use it. The same goes for %0.

Thanks to O::Deparse, I learned a few tricks like writing $a->[1] as $$a[1]... (Take this as a proof that one can learn something through obfuscation.) Naturally, I doubt that O::Deparse helped you much, since here it spits out:

Can't call method "sibling" on an undefined value at /usr/lib/perl5/5.6.0/i386-linux/B/Deparse.pm line 257.
CHECK failed--call queue aborted.

I must admit I don't know at all how it is possible. But, hey that's one more hurdle on your way to understanding this program.

So, %0 holds the count for each character (the character is the key, the count is the value). Given the character, we create a array containing one array by character, containg: one sub-tree of the Huffman tree (at first, it's only a leaf (i.e. the character itself)), its weight, and a string containing all the sub-tree leaves.

THE HUFFMAN TREE

The Huffman algorithm pairs the sub-trees by weight. As long as there are more than one sub-tree, pairs the lightest two into one sub-tree. I chose to have the lightest branches on the left (bit 0).

            49           For a file containing: 27 A
            /\                                  15 B
          22  A                                  3 C
          /\                                     1 D
         7  B                                    1 E
        /\                                       1 F
       C  4                                      1 G
         /\
        2  2             the resulting tree is (with the weight of each
       /\  /\            sub-tree noted at its root) drawn on the left.
      D  EF  G

This tree is encoded at the beginning of the file in the following manner:

            01           Two bits are used by node (in the previous example,
            /\           there are 6 nodes and 7 leaves), one for each branch.
          01  A
          /\             0 means non-terminal node and 1 means terminal node
        10  B            (i.e. there is a leaf hanging there).
        /\
       C 00              Our tree looks like the one on the left.
         /\
       11  11            The recursive procedure that builds the tree
       /\  /\            represents the data in the following manner:
      D  EF  G           [left-leaf][sub-tree][right-leaf]

Applying this procedure (named _) recursively, we obtain: 010110C0011DE11FGBA (every character being replaced with its bit representation).

SO IT'S JUST A REGEX?

While we build the tree (line 2 and 3), we also create a HASH %t which keys are the characters in the file to be compressed and which values are the bit string (the path in the tree). This path is constructed as we creare the tree itself. $; and $/ are used to create the needed 0's and 1's... $/ equals 2 (whose modulus helps), and $; is first undefined and is an even number each time we need it again in our map.

Finally, we perform a substitution on the data, replacing each character by the bit string that represents the path in the tree. The compressed file is the concatenation of the length of the data (since pack will pad the file with null bits, we need this information for the decompressor), the dictionary (which is self-explaining: no need for its size), and the data itself.

FINAL NOTE

This program was designed to have as little distinct characters as possible, and also so as some character appear really a lot. It was too difficult for me to write it so that the respective probability of each character would be a negative power of 2 (1/2**$k if you like), which is when Huffman encoding performs at its best. My goal was to write a compressor / decompressor all in one obfuscation. Alas, Huffman doesn't perform well enough (it's that lousy dictionary's fault!). OK, next time I'll try with Lempel-Ziv...

Post-Scriptum:

crazyinsomniac asked me to add this link to a site with explanations and a Java animation.


In reply to Huffman encoder SOLUTION by BooK
in thread Huffman encoder by BooK

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others musing on the Monastery: (4)
    As of 2014-07-31 03:55 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite superfluous repetitious redundant duplicative phrase is:









      Results (244 votes), past polls