15-211 Fundamental Data Structures and Algorithms

15-211 Fundamental Data Structures and Algorithms • LZW Compression • Aleks Nanevski • February 10, 2004 • Based on a lecture by Peter Lee

Last Time…

Problem: data compression • Convert a string into a shorter string. • Lossless – represents exactly the same information. • Lossy – approximates the original information. • Uses of compression: • Images over the web: JPEG • Music: MP3 • General-purpose: ZIP, GZIP, JAR, …

Huffman trees

Huffman’s algorithm • Huffman’s algorithm gives the optimal prefix code. • For a nice online demo, see • http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/huffman.html

Huffman compression • Huffman trees provide a straightforward method for file compression. • 1. Scan the file and compute frequencies • 2. Build the code tree • 3. Write code tree to the output file as a header • 4. Scan input, encode, and write into the output file
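
Steps 1 and 2 above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the function name huffman_codes is hypothetical, and ties between equal weights are broken by an arbitrary counter, so the exact codewords may differ from another equally good Huffman tree even though the total encoded length is the same.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each symbol of text to its Huffman codeword."""
    if not text:
        return {}
    freq = Counter(text)  # step 1: scan the input and count frequencies
    # Heap entries are (weight, tie_breaker, tree); a tree is either a
    # symbol or a (left, right) pair. The unique tie_breaker keeps the
    # heap from ever comparing two trees directly.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Assumption: a one-symbol input gets the single codeword "0".
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:  # step 2: repeatedly merge the two lightest trees
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, prefix):
        # Left edges contribute "0", right edges contribute "1".
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes
```

Encoding the input is then a matter of concatenating codes[ch] for each character ch; writing the tree as a header (step 3) is not shown.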

Huffman decompression • Read the header in the compressed file, and build the code tree • Read the rest of the file, decode using the tree • Write to output

Beating Huffman • How about doing better than Huffman! • Impossible! • Huffman’s algorithm gives the optimal prefix code! • Right. • But who says we have to use a prefix code?

Example • Suppose we have a file containing • abcdabcdabcdabcdabcdabcd… abcdabcd • This could be expressed very compactly as • abcd^1000

Dictionary-Based Compression

Dictionary-based methods • Here is a simple idea: • Keep track of “words” that we have seen, and replace them with a code number when we see them again. • The code is typically shorter than the word • We can maintain dictionary entries • (word, code) • and make additions to the dictionary as we read the input file.

Lempel & Ziv (1977/78)

Fred Hacker’s algorithm… • Fred now knows what to do… • Create the dictionary: ( <the-whole-file>, 1 ) • Transmit 1, done.

Right? • Fred’s algorithm provides excellent compression, but… • …the receiver does not know what is in the dictionary! • And sending the dictionary is the same as sending the entire uncompressed file • Thus, we can’t decompress the “1”.

Hence… • …we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.

LZW Compression: The Binary Version • LZW = a variant of Lempel-Ziv compression, by Terry Welch (1984)

Maintaining a dictionary • We need a way of incrementally building up a dictionary during compression in such a way that… • …someone who wants to uncompress can “rediscover” the very same dictionary • And we already know that a convenient way to build a dictionary incrementally is to use a trie

Binary LZW • In this method, we build up binary tries • In a binary trie, each internal node has two children • In addition, we will add the following: • each left edge is marked 0 • each right edge is marked 1 • each leaf has a label from the set {0,…,n}

A binary trie • Leaves, written as (path of edge marks from the root → label): 00 → 0, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2

Binary LZW: Compression • We start with a binary trie consisting of a root node and two children • left child labeled 0, and right labeled 1 • We read the bits of the input file, and follow the trie • When a leaf is reached, we emit the label at the leaf • Then, add two new children to that leaf (converting it into an internal node)

Binary LZW: Compression, pt.2 • The new left child takes the old label • The new right child takes a new label value that is one greater than the current maximum label value
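
The compression rule above can be sketched in Python. This is an illustrative sketch under assumptions that are not in the lecture: the trie is stored as a dictionary from each leaf's root-to-leaf bit path to its label, the input is a string of '0' and '1' characters, and the name binary_lzw_compress is hypothetical. Input that ends in the middle of the trie is reported as an error rather than handled.

```python
def binary_lzw_compress(bits):
    """Compress a string of '0'/'1' characters into a list of leaf labels."""
    # Only leaves are stored: key = path of edge marks, value = leaf label.
    leaves = {"0": 0, "1": 1}
    max_label = 1
    output = []
    path = ""  # edge marks followed since the last return to the root
    for bit in bits:
        path += bit
        if path in leaves:
            # Reached a leaf: emit its label, then split it in two.
            label = leaves.pop(path)
            output.append(label)
            leaves[path + "0"] = label      # new left child keeps the old label
            max_label += 1
            leaves[path + "1"] = max_label  # new right child gets a fresh label
            path = ""                       # start again from the root
    if path:
        raise ValueError("input ended in the middle of the trie")
    return output
```

On the input 10010110011 this sketch returns the labels 1, 0, 3, 4, 0, 2.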

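Binary LZW: Compression example • Input: 10010110011 (read so far: nothing) • Dictionary (leaf path → label): 0 → 0, 1 → 1 • Output: (empty)

Binary LZW: Compression example • Input: 10010110011 (read so far: 1) • Dictionary (leaf path → label): 0 → 0, 10 → 1, 11 → 2 • Output: 1

Binary LZW: Compression example • Input: 10010110011 (read so far: 10) • Dictionary (leaf path → label): 00 → 0, 01 → 3, 10 → 1, 11 → 2 • Output: 10

Binary LZW: Compression example • Input: 10010110011 (read so far: 1001) • Dictionary (leaf path → label): 00 → 0, 010 → 3, 011 → 4, 10 → 1, 11 → 2 • Output: 103

Binary LZW: Compression example • Input: 10010110011 (read so far: 1001011) • Dictionary (leaf path → label): 00 → 0, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 1034

Binary LZW: Compression example • Input: 10010110011 (read so far: 100101100) • Dictionary (leaf path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 10340

Binary LZW: Compression example • Input: 10010110011 (read so far: all of it) • Dictionary (leaf path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 110 → 2, 111 → 7 • Output: 103402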

Binary LZW output • So from the input • 10010110011 • we get output • 103402 • To represent this output we can keep track of the number of labels n each time we emit a code • and use ⌈log₂ n⌉ bits for that code

Binary LZW output • We started with the 11-bit input 10010110011 • Encoded it as 103402, which at a fixed 3 bits per code gives the 18-bit sequence 001 000 011 100 000 010 • This looks like an expansion instead of a compression • But what if we have a larger input, with more repeating sequences? • Try it!
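
To make the bit counts concrete, here is a small Python sketch that writes the codes 1 0 3 4 0 2 both ways: with a fixed 3 bits per code, and with a width that grows with the number of labels n in the trie at the moment each code is emitted. The function name pack_codes is hypothetical, and the rounding up to a whole number of bits is an assumption about how "log(n) bits" is meant.

```python
def pack_codes(codes, initial_labels=2):
    """Write each code using just enough bits for the labels that exist so far."""
    pieces = []
    n = initial_labels  # the starting trie has the two labels 0 and 1
    for code in codes:
        width = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 2
        pieces.append(format(code, "0{}b".format(width)))
        n += 1  # every emitted code splits a leaf and adds one new label
    return " ".join(pieces)


codes = [1, 0, 3, 4, 0, 2]
variable = pack_codes(codes)                       # "1 00 11 100 000 010"
fixed = " ".join(format(c, "03b") for c in codes)  # "001 000 011 100 000 010"
```

The variable-width form uses 14 bits and the fixed-width form 18 bits, so both are still longer than the 11-bit input. A decoder can follow the same widths because it rebuilds the same trie and therefore knows n at every step.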

Binary LZW output • One can also use Huffman compression on the output…

Binary LZW termination • Note that binary LZW has a serious problem, in that the input might end while we are in the middle of the trie (instead of at a leaf node) • This is a nasty problem • which is why we won’t use this binary method • But this is still good for illustration purposes…

Binary LZW: Uncompress • To uncompress, we need to read the compressed file and rebuild the same trie as we go along • To do this, we need to maintain the trie and also the maximum label value
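
A minimal Python sketch of this decoder follows. It assumes the compressed input is already available as a list of integer labels, and it stores the trie as a dictionary from each label to the bit path of the leaf that currently carries it; both choices, and the name binary_lzw_decompress, are illustrative and not taken from the lecture.

```python
def binary_lzw_decompress(codes):
    """Rebuild the original bit string from a list of binary LZW labels."""
    # label -> path of edge marks leading to the leaf with that label
    path_of = {0: "0", 1: "1"}
    max_label = 1
    pieces = []
    for code in codes:
        path = path_of[code]
        pieces.append(path)            # the leaf's path is the decoded bit run
        # Split the leaf exactly as the compressor did.
        path_of[code] = path + "0"     # left child keeps the old label
        max_label += 1
        path_of[max_label] = path + "1"  # right child gets the next label
    return "".join(pieces)
```

Given the labels 1, 0, 3, 4, 0, 2 it returns 10010110011. Because every label the compressor can emit already exists in the trie when it is emitted, the lookup path_of[code] succeeds for any valid compressed input.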

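Binary LZW: Uncompress example • Input: 103402 (codes read so far: none) • Dictionary (leaf path → label): 0 → 0, 1 → 1 • Output: (empty)

Binary LZW: Uncompress example • Input: 103402 (codes read so far: 1) • Dictionary (leaf path → label): 0 → 0, 10 → 1, 11 → 2 • Output: 1

Binary LZW: Uncompress example • Input: 103402 (codes read so far: 1 0) • Dictionary (leaf path → label): 00 → 0, 01 → 3, 10 → 1, 11 → 2 • Output: 10

Binary LZW: Uncompress example • Input: 103402 (codes read so far: 1 0 3) • Dictionary (leaf path → label): 00 → 0, 010 → 3, 011 → 4, 10 → 1, 11 → 2 • Output: 1001

Binary LZW: Uncompress example • Input: 103402 (codes read so far: 1 0 3 4) • Dictionary (leaf path → label): 00 → 0, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 1001011

Binary LZW: Uncompress example • Input: 103402 (codes read so far: 1 0 3 4 0) • Dictionary (leaf path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 11 → 2 • Output: 100101100

Binary LZW: Uncompress example • Input: 103402 (codes read so far: all of them) • Dictionary (leaf path → label): 000 → 0, 001 → 6, 010 → 3, 0110 → 4, 0111 → 5, 10 → 1, 110 → 2, 111 → 7 • Output: 10010110011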

LZW Compression: The Byte Version

Byte method • The binary LZW method doesn’t really work • we show it for illustrative purposes • Instead, we use a slightly more complicated version that works on bytes or characters • We can think of each byte as a “character” in the range {0…255}

Byte method trie • Instead of a binary trie, we use a more general trie in which • each node can have up to n children (where n is the size of the alphabet), one for each byte/character • every node (not just the leaves) has an integer label from the set {0…m}, for some m • except the root node, which has no label

Byte method LZW • We start with a trie that contains a root and n children • one child for each possible character • the children are labeled 0…n−1 • Then we compress as before, by walking down the trie • but, after emitting a code and growing the trie, we must start from the root’s child labeled c, where c is the character that caused us to grow the trie

LZW: Byte method example • Suppose our entire character set consists only of the four letters: • {a, b, c, d} • Let’s consider the compression of the string • baddad

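Byte LZW: Compress example • Input: baddad (encoded so far: nothing) • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3 • Output: (empty)

Byte LZW: Compress example • Input: baddad (encoded so far: b) • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3, ba → 4 • Output: 1

Byte LZW: Compress example • Input: baddad (encoded so far: ba) • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3, ba → 4, ad → 5 • Output: 10

Byte LZW: Compress example • Input: baddad (encoded so far: bad) • Dictionary (string → label): a → 0, b → 1, c → 2, d → 3, ba → 4, ad → 5, dd → 6 • Output: 103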
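
The byte method can be sketched in Python as follows. The sketch makes several assumptions that are not in the slides: the trie is stored as a dictionary from each string (a root-to-node path) to its label, every input character belongs to the given alphabet, and when the input ends at a node the label of that node is emitted. The name lzw_compress and its parameters are hypothetical.

```python
def lzw_compress(text, alphabet):
    """Compress text into a list of labels using the byte (character) method."""
    # Every trie node has a label; the key is the string spelled from the root.
    labels = {ch: i for i, ch in enumerate(alphabet)}
    next_label = len(alphabet)
    output = []
    current = ""  # string spelled by the walk from the root so far
    for ch in text:
        if current + ch in labels:
            current += ch  # keep walking down the trie
        else:
            output.append(labels[current])    # emit the label where we got stuck
            labels[current + ch] = next_label  # grow the trie by one node
            next_label += 1
            current = ch  # restart from the root's child for ch
    if current:
        # Assumption: flush the label of the node reached at end of input.
        output.append(labels[current])
    return output
```

For the alphabet abcd and the input baddad, the first three labels it produces are 1, 0, 3, matching the steps above, and the full result under these assumptions is 1, 0, 3, 3, 5.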
