Lossless Compression - II
Hao Jiang, Computer Science Department
Sept. 18, 2007
Properties of Huffman Coding
• Huffman coding assigns longer codewords to symbols with smaller probabilities and shorter codewords to symbols that occur more often.
• The two longest codewords differ only in the last bit.
• The codewords are prefix codes and therefore uniquely decodable.
• H ≤ Average Codeword Length < H + 1, where H is the source entropy.
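To make the tree construction concrete, here is a minimal Huffman-code builder (a Python sketch, not from the original slides; the function name and tie-breaking rule are my own choices):

    import heapq

    def huffman_code(probs):
        # Build a Huffman code from a {symbol: probability} map and
        # return a {symbol: bitstring} map.
        # Heap entries are (probability, tie-breaker, tree); a tree is
        # either a bare symbol or a (left, right) pair.
        heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)   # two least-probable subtrees
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, count, (t1, t2)))
            count += 1
        code = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):       # internal node: recurse
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                             # leaf: record the codeword
                code[tree] = prefix or "0"
        walk(heap[0][2], "")
        return code

    print(huffman_code({"a": 0.5, "b": 0.3, "c": 0.2}))
    # {'a': '0', 'c': '10', 'b': '11'}

Note how the two least-probable symbols (b and c) end up with the two longest codewords, differing only in the last bit.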
Extended Huffman Coding
• Huffman coding is not effective when there is a small number of symbols and the probabilities are highly skewed.
• Example: A source has 2 symbols a and b, with P(a) = 0.9 and P(b) = 0.1. The entropy is H = 0.4690 bits/symbol, but Huffman coding must spend at least 1 bit per symbol, so the average codeword length is 1 (far from optimal!).
Extended Huffman Coding (cont)
• We can encode groups of symbols together and get better performance.
• For the previous example, the extended source has symbols {aa, ab, ba, bb} with
  P(aa) = P(a)·P(a) = 0.81 => 1
  P(ab) = P(a)·P(b) = 0.09 => 00
  P(ba) = P(b)·P(a) = 0.09 => 011
  P(bb) = P(b)·P(b) = 0.01 => 010
  Now the average codeword length per symbol is (0.81·1 + 0.09·2 + 0.09·3 + 0.01·3)/2 = 0.6450 bits (much better!).
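A quick check of these numbers (a Python sketch, not part of the slides):

    from math import log2

    p = {"a": 0.9, "b": 0.1}
    H = -sum(pi * log2(pi) for pi in p.values())
    print(f"H = {H:.4f}")                            # 0.4690 bits/symbol

    # Pair probabilities and the codeword lengths assigned above.
    pairs   = {"aa": 0.81, "ab": 0.09, "ba": 0.09, "bb": 0.01}
    lengths = {"aa": 1,    "ab": 2,    "ba": 3,    "bb": 3}
    per_pair = sum(pairs[s] * lengths[s] for s in pairs)
    print(f"avg = {per_pair / 2:.4f} bits/symbol")   # 0.6450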
Extended Huffman Coding (cont)
Consider the sequence 1 2 2 3 2 3 1 2 1 2 with P(1) = 0.3, P(2) = 0.5, P(3) = 0.2.
• Single symbols, codewords 1 -> 10, 2 -> 0, 3 -> 11:
  Average codeword length = 2 · 0.3 + 1 · 0.5 + 2 · 0.2 = 1.5 bits/symbol.
• Pairs, with P(12) = 0.6, P(23) = 0.4 and codewords 12 -> 0, 23 -> 1:
  Average codeword length = (1 · 0.6 + 1 · 0.4)/2 = 0.5 bits/symbol.
In the second case, the average codeword length is smaller than the single-symbol entropy (about 1.49 bits). Is this right? Yes: the bound H ≤ L holds only for independent symbols. This sequence is built from the blocks 12 and 23, so consecutive symbols are highly correlated and the entropy per symbol of the pair source is much lower.
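The two entropies can be compared directly (again a sketch, not from the slides):

    from math import log2

    # First-order entropy vs. entropy of the pair source.
    H1 = -(0.3 * log2(0.3) + 0.5 * log2(0.5) + 0.2 * log2(0.2))
    H2 = -(0.6 * log2(0.6) + 0.4 * log2(0.4))
    print(H1)       # ~1.485 bits per symbol
    print(H2 / 2)   # ~0.486 bits per original symbol

The pair source carries only about 0.49 bits per original symbol, so the 0.5-bit average codeword length does not violate the entropy bound.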
Dictionary-Based Methods
• Dictionary-based methods are another way to capture the correlation between symbols.
• Static dictionary
  • Works well when the data to be compressed comes from a specific application.
  • For instance, in a student database the words “Name” and “Student ID” will appear often.
  • A static dictionary does not work well if the source characteristics change.
Adaptive Dictionary
• LZ77 (Jacob Ziv and Abraham Lempel, 1977) encoder
Step n (search buffer | look-ahead buffer):
  a b c a a c d a | a b c d a b b b a b
  Longest matching string length = 3, match position (offset) = 8.
  Codeword generated: <8, 3, c(d)>
  If there is no match, the codeword is <0, 0, c(x)>.
Step n+1: a b c a a c d a a b c d | a b b b a b
  Codeword generated: <4, 2, c(b)>
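A minimal LZ77 encoder sketch (my own Python, not the lecture's; the window size and the exact triples it emits depend on these choices):

    def lz77_encode(data, window=8):
        # Encode `data` as a list of (offset, length, next_char) triples.
        # `window` is the search-buffer size; no look-ahead limit or
        # bit packing, just the core greedy match.
        out = []
        i = 0
        while i < len(data):
            best_off, best_len = 0, 0
            # Try every starting point in the search buffer.
            for off in range(1, min(i, window) + 1):
                length = 0
                # The match may run past the coding position (overlap case).
                while (i + length < len(data) - 1 and
                       data[i - off + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_off, best_len = off, length
            nxt = data[i + best_len]          # literal following the match
            out.append((best_off, best_len, nxt))
            i += best_len + 1
        return out

    print(lz77_encode("abcaacdaabcdabbbab"))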
LZ77 Decoder
Received codeword: <8, 3, c(d)>, with the window currently holding
  a b c a a c d a
Copy 3 characters starting 8 positions back (a b c), then append the literal d:
  a b c a a c d a a b c d
Then move the window by 4 characters and repeat.
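A matching decoder sketch (uses the lz77_encode function from above for the round-trip check):

    def lz77_decode(triples):
        # Rebuild the string from (offset, length, next_char) triples.
        out = []
        for off, length, nxt in triples:
            for _ in range(length):
                out.append(out[-off])   # copy one char from `off` back;
                                        # safe even when length > off
            out.append(nxt)
        return "".join(out)

    s = "abcaacdaabcdabbbab"
    assert lz77_decode(lz77_encode(s)) == s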
A Special Case
  c d d c d c a b | a b a b a d b b a b
The match can extend past the coding position into the look-ahead buffer: starting 2 positions back, the pattern a b is copied repeatedly to produce a b a b a (length 5 > offset 2).
The output codeword is <2, 5, c(d)>.
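The one-character-at-a-time copy in the decoder sketch above handles this overlap automatically:

    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 5, "d")]))
    # abababad  (the "ab" two positions back is reused as it is written)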
LZ78
• LZ78 uses an explicit dictionary: the encoder outputs (dictionary index, next character) pairs and adds each new phrase to the dictionary.
• Encoding process example (traced in the sketch below):
  Input: a b c b a b a a a
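A minimal LZ78 encoder sketch (my own Python; index 0 denotes the empty phrase):

    def lz78_encode(data):
        # Encode `data` as (dictionary_index, char) pairs.
        dictionary = {}              # phrase -> index (1-based)
        out = []
        phrase = ""
        for ch in data:
            if phrase + ch in dictionary:
                phrase += ch         # keep extending the current match
            else:
                out.append((dictionary.get(phrase, 0), ch))
                dictionary[phrase + ch] = len(dictionary) + 1
                phrase = ""
        if phrase:                   # input ended in the middle of a phrase
            out.append((dictionary[phrase], ""))
        return out

    print(lz78_encode("abcbabaaa"))
    # [(0,'a'), (0,'b'), (0,'c'), (2,'a'), (4,'a'), (1,'')]
    # i.e. the input parses as a | b | c | ba | baa | a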
LZW
• Encoder (the dictionary is initialized with all single characters):
  s = next input character;
  while not EOF {
      c = next input character;
      if s + c is in the dictionary
          s = s + c;
      else {
          output the codeword for s;
          add s + c to the dictionary;
          s = c;
      }
  }
  output the codeword for s;
LZW encoding example
The input string: a b a b b a b c a b EOF
With the initial dictionary a = 1, b = 2, c = 3, the encoder emits 1 2 4 5 2 3 4, adding ab = 4, ba = 5, abb = 6, bab = 7, bc = 8, ca = 9 along the way.
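The same encoder in runnable form (a Python sketch of the pseudocode above; the alphabet argument is my own addition):

    def lzw_encode(data, alphabet="abc"):
        # Dictionary seeded with single characters: a=1, b=2, c=3.
        dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
        out = []
        s = data[0]
        for c in data[1:]:
            if s + c in dictionary:
                s = s + c                    # grow the current match
            else:
                out.append(dictionary[s])    # emit code for longest match
                dictionary[s + c] = len(dictionary) + 1
                s = c
        out.append(dictionary[s])            # flush the final match
        return out

    print(lzw_encode("ababbabcab"))          # [1, 2, 4, 5, 2, 3, 4]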
LZW Decoder
  s = empty string;
  while ((k = next input code) != EOF) {
      if (k is in the dictionary)
          entry = dictionary entry for k;
      else                          // special case: k was just assigned
          entry = s + s[0];         // by the encoder but not yet by us
      output entry;
      if (s is not empty)
          add string (s + entry[0]) to the dictionary;
      s = entry;
  }
LZW decoding example
The input string: 1 2 4 5 2 3 4 EOF
Decoding yields a, b, ab, ba, b, c, ab, i.e. a b a b b a b c a b, recovering the encoder's input.
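And a matching decoder sketch (pairs with the lzw_encode sketch above):

    def lzw_decode(codes, alphabet="abc"):
        # Dictionary seeded the same way as the encoder: 1=a, 2=b, 3=c.
        dictionary = {i + 1: ch for i, ch in enumerate(alphabet)}
        s, out = "", []
        for k in codes:
            if k in dictionary:
                entry = dictionary[k]
            else:                    # the special case from the pseudocode
                entry = s + s[0]
            out.append(entry)
            if s:
                dictionary[len(dictionary) + 1] = s + entry[0]
            s = entry
        return "".join(out)

    print(lzw_decode([1, 2, 4, 5, 2, 3, 4]))   # ababbabcab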