15-211 Fundamental Structures of Computer Science

Lempel-Ziv Compression 15-211 Fundamental Structures of Computer Science Feb. 24, 2005 Ananda Guna

Recap

Huffman Trees • Huffman Trees can be used to construct an optimal prefix code. • What does optimal mean? • Greedy algorithm to assemble a Huffman tree. • locally optimal steps to global optimization • Requires symbol frequencies. • read the file twice – counting and encoding
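As a reminder of the greedy construction, here is a minimal Python sketch. It is an illustration only, not the course's reference code: the function name `huffman_code` and the tie-breaking order are assumptions. It counts symbol frequencies (the first pass), then repeatedly merges the two lightest trees.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a dict mapping each symbol of `text` to a prefix-code bit string."""
    freq = Counter(text)                   # first pass: count symbol frequencies
    if not freq:
        return {}
    if len(freq) == 1:                     # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, tree).
    # A tree is either a symbol (leaf) or a (left, right) pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)    # greedy step: take the two lightest trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tie, (t1, t2)))
        tie += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):        # internal node: 0 goes left, 1 goes right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                              # leaf: record the code word
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes
```

Encoding a file is then the second pass: replace each symbol by `codes[symbol]`.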

Huffman Encoding Process

Adaptive Huffman or Dynamic Huffman • Clearly, having to read the data twice (first for frequency count, then for actual compression) is a bit cumbersome. • Perhaps data is available in blocks (streaming data) • Can build an adaptive Huffman tree that adjusts itself as more frequency data become available.

Adaptive Huffman, continued • Mapping from source messages to code words is based upon a running estimate of the source message probabilities • Change the tree to remain optimal for the current estimates • Adaptive Huffman codes respond to locality • Requires only a single pass over the data

Beating Huffman How about beating the compression achieved by Huffman? Impossible! It produces an optimal prefix code. Right. But who says we have to use a prefix code?

Dictionary-Based Compression

Dictionary-based methods • Here is a simple idea: • Keep track of “words” that we have seen, and replace them with a code number when we see them again. • We can maintain dictionary entries • (word, code) • and make additions to the dictionary as we read the input file.

Lempel & Ziv (1977/78)

Fred Hacker’s algorithm… • Fred now knows what to do… ( <the-whole-file>, 1 ) Transmit 1, done.

Right? • Fred’s algorithm provides excellent compression, but… • …the receiver does not know what is in the dictionary! • And sending the dictionary is the same as sending the entire uncompressed file • Thus, we can’t decompress the “1”.

Hence… • …we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.

LZW Compression: The Byte Version

Byte method LZW • We start with a trie that contains a root and n children • one child for each possible character • each child labeled 0…n-1 • We compress as before, by walking down the trie • but, after emitting a code and growing the trie, we must start from the root’s child labeled c, where c is the character that caused us to grow the trie

LZW: Byte method example • Suppose our entire character set consists only of the four letters: • {a, b, c, d} • Let’s consider the compression of the string • baddad

Byte LZW: Compress example Input: baddad Dictionary: a:0 b:1 c:2 d:3 Output:

Byte LZW: Compress example Input: baddad Dictionary: a:0 b:1 c:2 d:3 ba:4 Output: 1

Byte LZW: Compress example Input: baddad Dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 Output: 10

Byte LZW: Compress example Input: baddad Dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6 Output: 103

Byte LZW: Compress example Input: baddad Dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6 da:7 Output: 1033

Byte LZW: Compress example Input: baddad Dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6 da:7 Output: 10335

Byte LZW output • So, the input • baddad • compresses to • 10335 • which again can be given in bit form, just like in the binary method… • …or compressed again using Huffman
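The trie walk can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the names `TrieNode` and `lzw_compress_trie` are made up here, and the sketch assumes every input character belongs to the given alphabet.

```python
class TrieNode:
    """One trie node; `code` is the dictionary code of the word ending here."""
    def __init__(self, code):
        self.code = code
        self.children = {}

def lzw_compress_trie(text, alphabet):
    """Compress `text` with byte-method LZW; returns a list of integer codes."""
    root = TrieNode(None)
    for code, ch in enumerate(alphabet):       # one child per possible character
        root.children[ch] = TrieNode(code)
    next_code = len(alphabet)
    output = []
    if not text:
        return output
    node = root.children[text[0]]
    for ch in text[1:]:
        child = node.children.get(ch)
        if child is not None:                  # still inside the trie: keep walking
            node = child
        else:                                  # fell out of the trie
            output.append(node.code)           # emit the code of the longest match
            node.children[ch] = TrieNode(next_code)   # grow the trie
            next_code += 1
            node = root.children[ch]           # restart at the root's child labeled ch
    output.append(node.code)                   # flush the final match
    return output
```

With the alphabet given as the string "abcd" (so a is 0, b is 1, c is 2, d is 3), `lzw_compress_trie("baddad", "abcd")` returns `[1, 0, 3, 3, 5]`.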

Byte LZW: Uncompress example • The uncompress step for byte LZW is the most complicated part of the entire process, but is largely similar to the binary method

Byte LZW: Uncompress example Input: 10335 Dictionary: a:0 b:1 c:2 d:3 Output:

Byte LZW: Uncompress example Input: 10335 Dictionary: a:0 b:1 c:2 d:3 Output: b

Byte LZW: Uncompress example Input: 10335 Dictionary: a:0 b:1 c:2 d:3 ba:4 Output: ba

Byte LZW: Uncompress example Input: 10335 Dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 Output: bad

Byte LZW: Uncompress example Input: 10335 Dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6 Output: badd

Byte LZW: Uncompress example Input: 10335 Dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6 da:7 Output: baddad
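The uncompress example can be replayed with a short Python sketch. The function name `lzw_decompress_simple` is an assumption made for this illustration, and the sketch deliberately handles only the easy case where every code is already in the table when it arrives; it raises an error otherwise.

```python
def lzw_decompress_simple(codes, alphabet):
    """Decode a list of LZW codes; returns (text, table).

    Assumes each code is already in the table when it is read.
    """
    table = list(alphabet)            # table[code] is the word for that code
    if not codes:
        return "", table
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code >= len(table):
            raise ValueError("code %d is not in the dictionary yet" % code)
        word = table[code]
        # Mirror the compressor: it added (previous word + next character),
        # and the next character is the first character of the current word.
        table.append(prev + word[0])
        out.append(word)
        prev = word
    return "".join(out), table
```

For the codes 1 0 3 3 5 over the alphabet "abcd", this yields the text baddad and rebuilds the entries ba, ad, dd, da as codes 4 through 7.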

LZW Byte method: An alternative presentation

Getting off the ground Suppose we want to compress a file containing only letters a, b, c and d. It seems reasonable to start with a dictionary a:0 b:1 c:2 d:3 At least we can then deal with the first letter. And the receiver knows how to start.

Growing pains Now suppose the file starts like so: a b b a b b … We scan the a, look it up and output a 0. After scanning the b, we have seen the word ab. So, we add it to the dictionary a:0 b:1 c:2 d:3 ab:4

Growing pains Then we get another b, so we have now seen the word bb. a b b a b b … We output a 1 for the first b, and add bb to the dictionary a:0 b:1 c:2 d:3 ab:4 bb:5

So? Right, so far zero compression: the second b also costs us a 1, and adds ba to the dictionary. But now we get a followed by b, and ab is in the dictionary a b b a b b … so we output 4, and put abb (ab plus the next symbol, b) into the dictionary … d:3 ab:4 bb:5 ba:6 abb:7

And so on Suppose the input continues a b b a b b b b a … We output 5, and put bbb into the dictionary … ab:4 bb:5 ba:6 abb:7 bbb:8

More Hits As our dictionary grows, we are able to replace longer and longer blocks by short code numbers. a b b a b b b b a … 0 1 1 4 5 6 And we increase the dictionary at each step by adding another word.

More importantly • Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end. • Start with the same initialization, then • Read one code number after the other, look each one up in the dictionary, and extend the dictionary as you go along.

Again: Extending We scan a sequence of symbols a1 a2 a3 …. ak where each prefix is in the dictionary. We stop when we fall out of the dictionary: a1 a2 a3 …. ak b

Again: Extending We output the code for a1 a2 a3 …. ak and put a1 a2 a3 …. ak b into the dictionary. Then we set a1 = b and start all over.
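The extension rule can be written directly with a dictionary of strings. The following Python sketch is one possible rendering, with `lzw_compress` as an illustrative name; it assumes every input symbol appears in the initial alphabet.

```python
def lzw_compress(text, alphabet):
    """Compress `text`; returns (codes, dictionary) so the entries can be inspected."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    word = ""                                   # the current match a1 a2 ... ak
    for ch in text:
        if word + ch in dictionary:             # still in the dictionary: extend
            word += ch
        else:                                   # fell out of the dictionary at ch
            codes.append(dictionary[word])      # output the code for a1 ... ak
            dictionary[word + ch] = len(dictionary)   # add a1 ... ak b
            word = ch                           # set a1 = b and start over
    if word:
        codes.append(dictionary[word])          # flush the last match
    return codes, dictionary
```

Running it on a b b a b b b b a with the alphabet "abcd" gives the codes 0 1 1 4 5 6 and adds the entries ab:4, bb:5, ba:6, abb:7, bbb:8.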

Sort of Let's take a closer look at an example. Assume alphabet {a,b,c}. The code for aabbaabb is 0 0 1 1 3 5. The decoding starts with dictionary a:0, b:1, c:2

Moving along The first 4 code words are already in D. 0 0 1 1 3 5 and produce output a a b b. As we go along, we extend D: a:0, b:1, c:2, aa:3, ab:4, bb:5 For the rest we get a a b b a a b b

Done We have also added to D: ba:6, aab:7 But these entries are never used. Everything is easy, since there is already an entry in D for each code number when we encounter it.

Is this it? Unfortunately, no. It may happen that we run into a code word without having an appropriate entry in D. But, it can only happen in very special circumstances, and we can manufacture the missing entry.

A Bad Run Consider input a a b b b a a ==> 0 0 1 5 3 After reading 0 0 1, D looks like this: a:0, b:1, c:2, aa:3, ab:4

Disaster The next code is 5, but it’s not in D. a:0, b:1, c:2, aa:3, ab:4 How could this have happened? Can we recover?

… narrowly averted • This problem only arises when • the input contains a substring … s w s w s … • s w s was just added to the dictionary. • Here s is a single symbol, but w is a (possibly empty) word.

… narrowly averted But then the fix is to output x + first(x) where x is the last decompressed word, and first(x) the first symbol of x. And, we also update the dictionary to contain this new entry.

Example • In our example we had • s = b • w = empty The output and new dictionary word is bb.
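A decoder that includes this fix might look like the Python sketch below. The name `lzw_decompress` is illustrative, and the sketch assumes the codes came from a matching compressor, so a missing code can only be the very next free code number.

```python
def lzw_decompress(codes, alphabet):
    """Decode LZW codes, manufacturing the missing entry when needed."""
    table = list(alphabet)                # table[code] is the word for that code
    if not codes:
        return ""
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(table):
            word = table[code]            # normal case: the entry exists
        elif code == len(table):
            word = prev + prev[0]         # special case: x + first(x)
        else:
            raise ValueError("invalid code %d" % code)
        table.append(prev + word[0])      # same dictionary update in both cases
        out.append(word)
        prev = word
    return "".join(out)
```

On the bad run 0 0 1 5 3 over the alphabet "abc", code 5 is missing when it arrives, so the decoder manufactures bb from the previous word b, and the result is aabbbaa.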

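Another Example
aabbbaabbaaabaababb ==> 0 0 1 5 3 6 7 9 5
Decoding (dictionary size: initial 3, final 11)
(+ : the code was already in the dictionary; - : the code was missing and the entry had to be manufactured)

Output   In D?   Code   New entry
a                0
a        +       0      aa
b        +       1      ab
bb       -       5      bb
aa       +       3      bba
bba      +       6      aab
aab      +       7      bbaa
aaba     -       9      aaba
bb       +       5      aabab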
