Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

LZW is one of those algorithms that for some reason I was always afraid of. Huffman encoding made sense, but LZW always seemed mysterious and arcane, something that I would never be able to wrap my poor math skills around. Well, recently I had a look at it and it turns out to be quite approachable indeed. In fact, when you avoid some implementation specific details it becomes nearly simple. :-)

Long

LZW is interesting because its an adaptive technique, in that it doesn't require two passes over the data, and it doesnt need to transmit a decoding table as part of the compressed data. This issue causes static huffman encoding to be relatively useless for small files so avoiding it is nice. The encoding table is deterministically created during compression is such a way that during decompression the exact same table is constructed.

So how does it work?

LZW works by encoding variable length input strings as variable length (in terms of bits) numeric codes. Single chars are encoded as their ordinal value, with codes being assigned to new substrings as they are encountered. The first char is emitted as simply that, an 8 bit value. Following that point all codes are emitted in longer than 8 bit form (starting with 9 bits until the table grows to have 512 values, then to 10 bits etc.). Its important to keep this point clear in your head. The code is the same, but can be different lengths.

In more detail, we try to find the longest already coded substring (starting at the beginning) of the input stream. As soon as we find a substring that has not been coded we emit the last good code value we found and at the same time add the unknown substring to the table (and give it a code value as well). This leaves us with the single character that caused us to miss. Since our table is initialized with the normal charset, we always have a code for the remaining letter, and if we cant find a longer substring then that is what we will emit. This behaviour means that we always emit enough known codes to reproduce the new codes we are adding.

An example probably will help. Lets take the following string

    "ABAABAAAAABBBBBBBBAAAAAAABBBBBBAAAAAAAABBBBBBBBBB". "AAAAAAAAAAAAAABBBBBBBBBBBBBAAAAAAAAAAA"
when we run it through the compression we see output like this:
O > char : "A" READ: "BAABAAAAABBBBBBBBAAAAAAABBBBBBAAAAAAAABBBBBBBBBBAAAAAAAAAAAAAABBBBBBB +BBBBBBAAAAA". "AAAAAA\r\n" C>> 257 : "AB" O > 66 : 001000010 : "B" | C>> 258 : "BA" O > 65 : 001000001 : "A" | C>> 259 : "AA" O > 257 : 100000001 : "AB" | C>> 260 : "ABA" O > 259 : 100000011 : "AA" | C>> 261 : "AAA" O > 261 : 100000101 : "AAA" | C>> 262 : "AAAB" O > 66 : 001000010 : "B" | C>> 263 : "BB" O > 263 : 100000111 : "BB" | C>> 264 : "BBB" O > 264 : 100001000 : "BBB" | C>> 265 : "BBBB" O > 263 : 100000111 : "BB" | C>> 266 : "BBA" O > 261 : 100000101 : "AAA" | C>> 267 : "AAAA" O > 267 : 100001011 : "AAAA" | C>> 268 : "AAAAB" O > 265 : 100001001 : "BBBB" | C>> 269 : "BBBBB" O > 266 : 100001010 : "BBA" | C>> 270 : "BBAA" O > 267 : 100001011 : "AAAA" | C>> 271 : "AAAAA" O > 262 : 100000110 : "AAAB" | C>> 272 : "AAABB" O > 269 : 100001101 : "BBBBB" | C>> 273 : "BBBBBB" O > 265 : 100001001 : "BBBB" | C>> 274 : "BBBBA" O > 271 : 100001111 : "AAAAA" | C>> 275 : "AAAAAA" O > 275 : 100010011 : "AAAAAA" | C>> 276 : "AAAAAAA" O > 272 : 100010000 : "AAABB" | C>> 277 : "AAABBB" O > 273 : 100010001 : "BBBBBB" | C>> 278 : "BBBBBBB" O > 269 : 100001101 : "BBBBB" | C>> 279 : "BBBBBA" O > 276 : 100010100 : "AAAAAAA" | C>> 280 : "AAAAAAAA" O > 267 : 100001011 : "AAAA" | C>> 281 : "AAAA\r" O > 13 : 000001101 : "\r" | C>> 282 : "\r\n"

The first line indicates that the first char was output literally, we dont need nine bits for it. (In fact we really could output the first two bytes verbatim. compress doesnt however and the above is meant to be compliant with that.) The second to fourth lines obviously shows us whats left of the input, the fifth line shows that the string "AB" was added to the code table with id #257. The sixth line shows that code 66 was emitted, and that the string "BA" was added to the code table with id #258. Followed by code 65 being emitted and the string "AA" being added to the symbol table...

This sequence demonstrates the reason why no explicit table of encodings has to be transmitted. The first data emitted was 'A' followed by code 66 (for 'B') followed by code 65. And the new entries to the table are 'AB' 'BA' (well overlook 'AA' here for a moment). Each new entry is constructable by known data we have already received. It turns out however that there is a special case where we get confused on decompression: when the pattern matches KwKwK (where K is a code, (and thus represents a string) and w represents a character) the compression routine will emit a code before the decompression will have defined it. Luckily we still have sufficient information to reconstruct the code, so a special case handler can take care of it. The simple input string 'AAABBB' illustrates this:

O > char : "A" READ:"AABBB\r\n" C>> 257 : "AA" O > 257 : 100000001 : "AA" | C>> 258 : "AAB" O > 66 : 001000010 : "B" | C>> 259 : "BB" O > 259 : 100000011 : "BB" | C>> 260 : "BB\r" O > 13 : 000001101 : "\r" | C>> 261 : "\r\n" O > 10 : 000001010 : "\n" |

We first read in and output the char 'A', we then create the code 'AA' and then output it. That will trip up the decompression. However we can use the old code value and the old last char read to reconstruct things as the below shows:

First char: A Read: 257 undef Initializing current string for 257 ("A") from "A" and "" E<< Created: 257 : "AA" Read: 66 "B" E<< Created: 258 : "AAB" Read: 259 undef Initializing current string for 259 ("BB") from "B" and "B" E<< Created: 259 : "BB" Read: 13 "\r" E<< Created: 260 : "BB\r" Read: 10 "\n" E<< Created: 261 : "\r\n"

Dynamic encoding

One tricky bit of the compression is bit twiddling to handle the variable length output codes. In C this is a nasty mess of binary operators. In perl we can do it a lot more easily (but obviously less efficiently) by keeping our bit pattern in string format and then using pack to convert it to bytes at a later time. This means that we dont have to worry about bit offsets, we can simply concatenate the stringified form of the code to a buffer and let pack sort out the byte boundaries as it needs to. Likewise for reading the data in, we read in a chunk of data, convert it to a bit string, and then cut off chunks at the front as we go, converting the appropriate length back to a number for processing purposes.

Table OverFlow

Even though the underlying algorithm is dynamic, there does come a point where it becomes unfeasable to keep increasing the table size. LZW propper doesnt really say what to do in this case. compress however, will cease growing the table and simply output codes it knows, until the compression ration starts to degrade, whereapon it flushes the table and begins anew. This is where the special code #256 comes in. When this code is read by the decompressor it will wipe its string and code tables, so the compressor outputs it before it wipes the table. (It has to do it before otherwise the wrong number of bits are read on the next input_code() )

If compatibility to compress were not an issue then other options are available. I was thinking that switching to dynamic huffman encoding of the table when it fills up would be an interesting twist. Unfortunately ive not been able to get Vitters algorithm running in Perl (yet) and I havent yet found a good overview of Algorithm FGK to try it instead either.

Implementation

My implementation is fairly straight forward. The key routines are compress_fh, output_code, expand_fh, and input_code. The two "_code" routines handle the input buffers and the output buffers as well as for ensuring the correct number of bits per code are output or input. The other two routines handle the table maintainence and the rest.

The other routines are fairly straight forward, with the exception possibly of _init(). _init() is called when the table is wiped, as well as when compression or decompression begins (by init()).

One implementation detail that i've not bothered with is a way to deal with a serious deficiency in the LZW algorithm: it can potentially require a lot of memory. This is dealt with by observing that every code (above 256) is represented by another code and a character. So in the string tables (a hash in our case $self->{strs}), we need not store full strings for each code, but only the previous code plus the new char. This brings the size of the string tables down considerably but represents more added complexity than I felt was necessary for learning the algorithm. If this was added it would mean minor changes to the compression, but considerable changes to the decompression, requiring a walk thought the code/char combinations, and then reversing the resulting list for every output string.

Implementation Bugs

The way I am handling the output bit stream differs from compress. Ive not been able to work out what to do to make my current code compatible with compress in this way. From what I can tell the code below behaves pretty much the same as compress, producing the same file sizes etc, but for some reason a different bit code is being output. Anybody have any ideas on that, let me know.

Other Implementations

One of our own has placed Compress::LZW online. His variation is somewhat closer to the "pure" LZW implementation, in that it doesnt do dynamic code sizes (the algorithm can output specific sizes, but they are fixed at start time). I havent benchmarked or done any further investigation his code (mostly becuase I had written most of mine before I even saw his).

Thanks to diotalevi for his comments here which lead me to investigating compress in detail and implementing most of its behaviour (if only I could get the code output right...). And of course hatmoward for publishing his version as well :-) castaway was very kind in helping me with source files, and Zaxo pointed me at a working copy of compress, (ncompress from Debian), which helped a lot. Thanks.

See also Compress::LZW, Any news on Compress::LZW?, RFC: Compress::LZW, Dr Dobbs on LZW, Vitters Dynamic Huffman Algorithm 673, Data Compression and of course the accompanying source code which will be in LZW Demystified (the code).


---
demerphq

<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

In reply to LZW Demystified by demerphq

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others pondering the Monastery: (6)
    As of 2014-12-25 12:56 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      Is guessing a good strategy for surviving in the IT business?





      Results (160 votes), past polls