LZW is one of those algorithms that for some reason I was always afraid of. Huffman encoding made sense, but LZW always seemed mysterious and arcane, something that I would never be able to wrap my poor math skills around. Well, recently I had a look at it and it turns out to be quite approachable indeed. In fact, when you avoid some implementation specific details it becomes nearly simple. :-)

Long

LZW is interesting because its an adaptive technique, in that it doesn't require two passes over the data, and it doesnt need to transmit a decoding table as part of the compressed data. This issue causes static huffman encoding to be relatively useless for small files so avoiding it is nice. The encoding table is deterministically created during compression is such a way that during decompression the exact same table is constructed.

So how does it work?

LZW works by encoding variable length input strings as variable length (in terms of bits) numeric codes. Single chars are encoded as their ordinal value, with codes being assigned to new substrings as they are encountered. The first char is emitted as simply that, an 8 bit value. Following that point all codes are emitted in longer than 8 bit form (starting with 9 bits until the table grows to have 512 values, then to 10 bits etc.). Its important to keep this point clear in your head. The code is the same, but can be different lengths.

In more detail, we try to find the longest already coded substring (starting at the beginning) of the input stream. As soon as we find a substring that has not been coded we emit the last good code value we found and at the same time add the unknown substring to the table (and give it a code value as well). This leaves us with the single character that caused us to miss. Since our table is initialized with the normal charset, we always have a code for the remaining letter, and if we cant find a longer substring then that is what we will emit. This behaviour means that we always emit enough known codes to reproduce the new codes we are adding.

An example probably will help. Lets take the following string

when we run it through the compression we see output like this:
O > char : "A" READ: "BAABAAAAABBBBBBBBAAAAAAABBBBBBAAAAAAAABBBBBBBBBBAAAAAAAAAAAAAABBBBBBB +BBBBBBAAAAA". "AAAAAA\r\n" C>> 257 : "AB" O > 66 : 001000010 : "B" | C>> 258 : "BA" O > 65 : 001000001 : "A" | C>> 259 : "AA" O > 257 : 100000001 : "AB" | C>> 260 : "ABA" O > 259 : 100000011 : "AA" | C>> 261 : "AAA" O > 261 : 100000101 : "AAA" | C>> 262 : "AAAB" O > 66 : 001000010 : "B" | C>> 263 : "BB" O > 263 : 100000111 : "BB" | C>> 264 : "BBB" O > 264 : 100001000 : "BBB" | C>> 265 : "BBBB" O > 263 : 100000111 : "BB" | C>> 266 : "BBA" O > 261 : 100000101 : "AAA" | C>> 267 : "AAAA" O > 267 : 100001011 : "AAAA" | C>> 268 : "AAAAB" O > 265 : 100001001 : "BBBB" | C>> 269 : "BBBBB" O > 266 : 100001010 : "BBA" | C>> 270 : "BBAA" O > 267 : 100001011 : "AAAA" | C>> 271 : "AAAAA" O > 262 : 100000110 : "AAAB" | C>> 272 : "AAABB" O > 269 : 100001101 : "BBBBB" | C>> 273 : "BBBBBB" O > 265 : 100001001 : "BBBB" | C>> 274 : "BBBBA" O > 271 : 100001111 : "AAAAA" | C>> 275 : "AAAAAA" O > 275 : 100010011 : "AAAAAA" | C>> 276 : "AAAAAAA" O > 272 : 100010000 : "AAABB" | C>> 277 : "AAABBB" O > 273 : 100010001 : "BBBBBB" | C>> 278 : "BBBBBBB" O > 269 : 100001101 : "BBBBB" | C>> 279 : "BBBBBA" O > 276 : 100010100 : "AAAAAAA" | C>> 280 : "AAAAAAAA" O > 267 : 100001011 : "AAAA" | C>> 281 : "AAAA\r" O > 13 : 000001101 : "\r" | C>> 282 : "\r\n"

The first line indicates that the first char was output literally, we dont need nine bits for it. (In fact we really could output the first two bytes verbatim. compress doesnt however and the above is meant to be compliant with that.) The second to fourth lines obviously shows us whats left of the input, the fifth line shows that the string "AB" was added to the code table with id #257. The sixth line shows that code 66 was emitted, and that the string "BA" was added to the code table with id #258. Followed by code 65 being emitted and the string "AA" being added to the symbol table...

This sequence demonstrates the reason why no explicit table of encodings has to be transmitted. The first data emitted was 'A' followed by code 66 (for 'B') followed by code 65. And the new entries to the table are 'AB' 'BA' (well overlook 'AA' here for a moment). Each new entry is constructable by known data we have already received. It turns out however that there is a special case where we get confused on decompression: when the pattern matches KwKwK (where K is a code, (and thus represents a string) and w represents a character) the compression routine will emit a code before the decompression will have defined it. Luckily we still have sufficient information to reconstruct the code, so a special case handler can take care of it. The simple input string 'AAABBB' illustrates this:

O > char : "A" READ:"AABBB\r\n" C>> 257 : "AA" O > 257 : 100000001 : "AA" | C>> 258 : "AAB" O > 66 : 001000010 : "B" | C>> 259 : "BB" O > 259 : 100000011 : "BB" | C>> 260 : "BB\r" O > 13 : 000001101 : "\r" | C>> 261 : "\r\n" O > 10 : 000001010 : "\n" |

We first read in and output the char 'A', we then create the code 'AA' and then output it. That will trip up the decompression. However we can use the old code value and the old last char read to reconstruct things as the below shows:

First char: A Read: 257 undef Initializing current string for 257 ("A") from "A" and "" E<< Created: 257 : "AA" Read: 66 "B" E<< Created: 258 : "AAB" Read: 259 undef Initializing current string for 259 ("BB") from "B" and "B" E<< Created: 259 : "BB" Read: 13 "\r" E<< Created: 260 : "BB\r" Read: 10 "\n" E<< Created: 261 : "\r\n"

Dynamic encoding

One tricky bit of the compression is bit twiddling to handle the variable length output codes. In C this is a nasty mess of binary operators. In perl we can do it a lot more easily (but obviously less efficiently) by keeping our bit pattern in string format and then using pack to convert it to bytes at a later time. This means that we dont have to worry about bit offsets, we can simply concatenate the stringified form of the code to a buffer and let pack sort out the byte boundaries as it needs to. Likewise for reading the data in, we read in a chunk of data, convert it to a bit string, and then cut off chunks at the front as we go, converting the appropriate length back to a number for processing purposes.

Table OverFlow

Even though the underlying algorithm is dynamic, there does come a point where it becomes unfeasable to keep increasing the table size. LZW propper doesnt really say what to do in this case. compress however, will cease growing the table and simply output codes it knows, until the compression ration starts to degrade, whereapon it flushes the table and begins anew. This is where the special code #256 comes in. When this code is read by the decompressor it will wipe its string and code tables, so the compressor outputs it before it wipes the table. (It has to do it before otherwise the wrong number of bits are read on the next input_code() )

If compatibility to compress were not an issue then other options are available. I was thinking that switching to dynamic huffman encoding of the table when it fills up would be an interesting twist. Unfortunately ive not been able to get Vitters algorithm running in Perl (yet) and I havent yet found a good overview of Algorithm FGK to try it instead either.

Implementation

My implementation is fairly straight forward. The key routines are compress_fh, output_code, expand_fh, and input_code. The two "_code" routines handle the input buffers and the output buffers as well as for ensuring the correct number of bits per code are output or input. The other two routines handle the table maintainence and the rest.

The other routines are fairly straight forward, with the exception possibly of _init(). _init() is called when the table is wiped, as well as when compression or decompression begins (by init()).

One implementation detail that i've not bothered with is a way to deal with a serious deficiency in the LZW algorithm: it can potentially require a lot of memory. This is dealt with by observing that every code (above 256) is represented by another code and a character. So in the string tables (a hash in our case $self->{strs}), we need not store full strings for each code, but only the previous code plus the new char. This brings the size of the string tables down considerably but represents more added complexity than I felt was necessary for learning the algorithm. If this was added it would mean minor changes to the compression, but considerable changes to the decompression, requiring a walk thought the code/char combinations, and then reversing the resulting list for every output string.

Implementation Bugs

The way I am handling the output bit stream differs from compress. Ive not been able to work out what to do to make my current code compatible with compress in this way. From what I can tell the code below behaves pretty much the same as compress, producing the same file sizes etc, but for some reason a different bit code is being output. Anybody have any ideas on that, let me know.

Other Implementations

One of our own has placed Compress::LZW online. His variation is somewhat closer to the "pure" LZW implementation, in that it doesnt do dynamic code sizes (the algorithm can output specific sizes, but they are fixed at start time). I havent benchmarked or done any further investigation his code (mostly becuase I had written most of mine before I even saw his).

Thanks to diotalevi for his comments here which lead me to investigating compress in detail and implementing most of its behaviour (if only I could get the code output right...). And of course hatmoward for publishing his version as well :-) castaway was very kind in helping me with source files, and Zaxo pointed me at a working copy of compress, (ncompress from Debian), which helped a lot. Thanks.

See also Compress::LZW, Any news on Compress::LZW?, RFC: Compress::LZW, Dr Dobbs on LZW, Vitters Dynamic Huffman Algorithm 673, Data Compression and of course the accompanying source code which will be in LZW Demystified (the code).


---
demerphq

<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

Comment on LZW Demystified
Select or Download Code
LZW Demystified (the code)
by demerphq (Chancellor) on Jun 29, 2003 at 20:44 UTC

    And heres the code. It may be a bit scruffy right now. Theres lots of $DEBUG stuff going on that I normally would remove, except that its no fun to play with if you cant see what happening under the hood! $DEBUG=1 shows lots of diagnostics. $DEBUG>1 shows even more. If run directly as a script it will compress and expand itself under $DEBUG circumstances. Otherwise it should be usable now. I will add POD later, but for now I'll leave it be.

    Note that once the output code bug is fixed changing the magic number output should make it compress compatible. Currently it is not however.

    The part after the __END__ marker is the LZW algorithm in pseudo code from the Dr. Dobbs Journal article mentioned above.


    ---
    demerphq

    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
Re: LZW Demystified
by dws (Chancellor) on Jun 29, 2003 at 22:16 UTC
    The first char is emitted as simply that, an 8 bit value. Following that point all codes are emitted in longer than 8 bit form (starting with 9 bits until the table grows to have 512 values, then to 10 bits etc.).
    A historical side-note: Years back I was doing some patent analysis, and read the Unisys LZW patent carefully several times.

    Expanding the table to increasing the output byte width is Welch's improvement to the Lempel-Ziv compression algorithm, and this is what Unisys obtained a patent on. If you crippled your compression such that the table never grew past 256 values, you weren't infringing (in my "Not A Lawyer" opinion). Unisys allowed some FUD to surround this issue, not correcting the impression that it was GIF they held a patent on. Not so.

      Theres an interesting note in the debian ncompress package about this. I have no idea how much to weight to give it or whatnot, but is suspect that as long as nobody is offering this in a package for sale then they cant do anything about it.

      Also, only using 256 codes, wouldnt cripple the algorithm it would kill it dead completely. :-) Do you mean that if the table size is fixed at any aribitrary level then its ok?

      The following article from James A. Woods, one of the earlier authors of compress, explains its relationship to the Unisys patent on the LZW compression method: From uunet!zephyr.ens.tek.com!uw-beaver!mit-eddie!wuarchive!usc!ucsd!u +cbvax!agate!iacs!jaw Wed Aug 1 15:06:59 EDT 1990 Article: 1282 of gnu.misc.discuss Path: alembic!uunet!zephyr.ens.tek.com!uw-beaver!mit-eddie!wuarchive!u +sc!ucsd!ucbvax!gate!riacs!jaw From: jaw@riacs.edu (James A. Woods) Newsgroups: gnu.misc.discuss Subject: Sperry patent #4,558,302 does *not* affect 'compress' Keywords: data compression, algorithm, patent Message-ID: <1990Jul31.220935.1424@riacs.edu> Date: 31 Jul 90 22:09:35 GMT Organization: RIACS, NASA Ames Research Center Lines: 69 # "The chief defect of Henry King Was chewing little bits of string." -- Hilaire Belloc, Cautionary Tales [1907] As a co-author of 'compress' who has had contact with an attorney + for Unisys (nee Sperry), I would like to relay a very basic admission from + Unisys that noncommercial use of 'compress' is perfectly legal. 'Compress' i +s also commercially distributed by AT&T as part of Unix System 5 release 4, with no further restrictions placed upon the use of the binary, as far as I am aware. From conversations with Professor Abraham Lempel and others, it appears that neither AT&T, Sun Microsystems, Hewlett Packard, nor IBM are paying any sort of license fees to Unisys in conjunction with pate +nt #4,558,302. It may be true that some organizations are paying fees fo +r data compression technology licensed from one or more of the many hold +ers of compression patents, but this is all independent from 'compress'. In particular, I received a letter at NASA dated October 1, 1987 +from John B. Sowell of the Unisys law department, informing me for the firs +t time that some form of LZW was patented. I naturally expressed skepticism that an algorithm could be patented (a murky legal area which remains so), stated that 'compress' is not identical to LZW, and in fact was designed, developed, and distributed before the ink on the patent was dry. Several telephone conversations later, Mr. Sow +ell intimated that they would *not* seek any fees from users of 'compress' but instead were signing licensees for hardware implementations of LZW +. So, regardless of what you believe about a shady legal area, if a +nyone from Unisys contacts you to extract tribute for the use of 'compress', + please tell them that, first, it is not theirs to begin with, and, second, th +ere is someone who will testify in court about the conversation above. It is not even clear if anyone can "own" 'compress', since original de +veloper Spencer Thomas, myself, and others placed the code in the public domai +n long before the adoption of the Berne copyright convention. In light of the events above, it seems that the Free Software Foundation is being unduly paranoid about the use of 'compress'. Now I can well believe that FSF is more likely to be a legal target than a behemoth like AT&T, but if they are simply redistributing untouched free software developed years ago in the public sector, I see no problem. Aside: I am investigating, possibly for a case history to be recycled to USENET, the particulars of data compression patents. I am aware of the following patents: IBM's Miller-Wegman LZ variant, those of Telcor and ACT [losing candidates for the British Telecom mod +em standard], James A. Storer's work on limited lookahead as explicated i +n his text "Data Compression (methods and theory)", Computer Science Press, +1988, and the various patents pending associated with the Fiala and Greene CACM article of April, 1989 on textual substitution methods. If you have any lore, send it this way. Sincerely, James A. Woods NASA Ames Research Center (RIACS) jaw@riacs.edu (or ames!jaw) P.S. The algorithm patent issue certainly is a "topic A" at the momen +t. One useful reference is the review article by Anthony and Colwell -- "Litigating the Validity and Infringement of Software Patents" in Washington and Lee Law Review, volume 41, fall 1984. I know Robert Co +lwell personally. As a practicing patent attorney, he tells me that, at a m +inimum, use of an invention "for research purposes" is legitimate.

      ---
      demerphq

      <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
Re: LZW Demystified
by toma (Vicar) on Jun 30, 2003 at 05:19 UTC
    According to the League for Programming Freedom, "The United States LZW patent expired on June 20th, 2003. According to Unisys, the counterpart Canadian patent expires July 7, 2004, the counterpart patents in the United Kingdom, France, Germany and Italy expire June 18, 2004, and the Japanese counterpart patents expire June 20, 2004."

    So perhaps the time is near to put GIF support back into the current version of GD.

    It should work perfectly the first time! - toma

      See the note on the GD home page about the timetable for reintroduction of GIF.

      /J\