Lossless Compression - I Hao Jiang Computer Science Department Sept. 13, 2007
Introduction
• Compression methods are key enabling techniques for multimedia applications.
• Raw media requires a great deal of storage and bandwidth.
• Consider raw video at 30 frames/sec, a resolution of 640x480, and 24-bit color:
One second of video: 30 * 640 * 480 * 3 bytes = 27.648 MB
One hour of video: about 100 GB
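A quick arithmetic check of these numbers (a minimal Python sketch using the frame rate, resolution, and color depth from the slide):

```python
# Raw (uncompressed) data rate for 640x480, 24-bit color video at 30 frames/sec.
width, height = 640, 480
bytes_per_pixel = 3            # 24-bit color = 3 bytes per pixel
frames_per_second = 30

bytes_per_second = frames_per_second * width * height * bytes_per_pixel
bytes_per_hour = bytes_per_second * 3600

print(f"{bytes_per_second / 1e6:.3f} MB per second")   # 27.648 MB
print(f"{bytes_per_hour / 1e9:.1f} GB per hour")       # ~99.5 GB, i.e. about 100 GB
```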
Some Terms
Information source → (input data: a sequence of symbols from an alphabet) → Encoder (compression) → (code: a sequence of codewords) → Storage or networks → Decoder (decompression) → Recovered data sequence
• Lossless compression: the recovered data is exactly the same as the input.
• Lossy compression: the recovered data approximates the input data.
• Compression ratio = (bits used to represent the input data) / (bits of the code)
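For example, the compression ratio follows directly from the definition (a toy sketch; the file sizes are made-up illustrative values, not from the slides):

```python
# Compression ratio = bits of the input data / bits of the code.
original_bits   = 8 * 1_000_000   # hypothetical 1 MB input
compressed_bits = 8 * 250_000     # hypothetical 250 KB code

print(original_bits / compressed_bits)   # 4.0, i.e. a 4:1 compression ratio
```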
Entropy
• The number of bits needed to encode a media source is lower-bounded by its "entropy".
• The self-information of an event A is defined as -log_b P(A), where P(A) is the probability of event A.
If b is 2, the unit is "bits"; if b is e, the unit is "nats"; if b is 10, the unit is "hartleys".
Example
• A source outputs two symbols (the alphabet has 2 symbols), 0 or 1, with P(0) = 0.25 and P(1) = 0.75.
The information we get when receiving a 0 is log_2(1/0.25) = 2 bits; when receiving a 1, it is log_2(1/0.75) = 0.415 bits.
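The values above can be reproduced in a few lines of Python (a minimal sketch written straight from the definition; the function name self_information is only for illustration):

```python
import math

def self_information(p, base=2):
    # -log_b P(A): base 2 gives bits, base e gives nats, base 10 gives hartleys.
    return -math.log(p, base)

print(self_information(0.25))   # 2.0 bits   (receiving a 0)
print(self_information(0.75))   # 0.415 bits (receiving a 1)
```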
Properties of Self-Information
• A letter with smaller probability carries higher self-information.
• The information we get when receiving two independent letters is the sum of their individual self-information:
-log_2 P(sa, sb) = -log_2 [P(sa)P(sb)] = [-log_2 P(sa)] + [-log_2 P(sb)]
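A small numerical check of this additivity, using the probabilities from the earlier example (a sketch; nothing here beyond the definition):

```python
import math

# For independent letters, the self-information of the pair equals the sum
# of the individual self-informations: -log2 P(sa)P(sb) = -log2 P(sa) - log2 P(sb).
p_a, p_b = 0.25, 0.75

print(-math.log2(p_a * p_b))                # 2.415... bits
print(-math.log2(p_a) + -math.log2(p_b))    # 2.415... bits (the same)
```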
Entropy
• For a source with symbols {s1, s2, …, sn} that are independent, the average self-information is
H = Σ_{i=1}^{n} P(si) log_2(1/P(si)) bits
• H is called the entropy of the source.
• The number of bits per symbol needed to encode a media source is lower-bounded by its entropy.
Entropy (cont.)
• Example: A source outputs two symbols (the alphabet has 2 letters), 0 or 1, with P(0) = 0.25 and P(1) = 0.75.
H = 0.25 * log_2(1/0.25) + 0.75 * log_2(1/0.75) = 0.8113 bits
We need, on average, at least 0.8113 bits per symbol to encode this source.
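The entropy calculation can be written directly from the formula (a minimal sketch; the helper name entropy is only for illustration):

```python
import math

def entropy(probs):
    # H = sum over symbols of P(s) * log2(1 / P(s)); zero-probability symbols contribute 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.25, 0.75]))   # 0.8112781244591328 bits per symbol
```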
The Entropy of an Image
• Consider a grayscale image with 256 possible levels, A = {0, 1, 2, …, 255}. Assuming the pixels are independent and all gray levels are equally probable,
H = 256 * (1/256) * log_2(256) = 8 bits
• What about an image with only 2 levels, 0 and 255? Assuming P(0) = 0.5 and P(255) = 0.5,
H = 1 bit
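Both image examples follow from the same formula (a sketch reusing the entropy computation above, with the slide's assumption of independent, equally likely pixel values):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([1 / 256] * 256))   # 8.0 bits per pixel (256 equally likely levels)
print(entropy([0.5, 0.5]))        # 1.0 bit per pixel  (two equally likely levels)
```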
Estimate the Entropy
Given the sequence a a a b b b b c c c c d d and assuming the symbols are independent:
P(a) = 3/13, P(b) = 4/13, P(c) = 4/13, P(d) = 2/13
H = [-P(a) log_2 P(a)] + [-P(b) log_2 P(b)] + [-P(c) log_2 P(c)] + [-P(d) log_2 P(d)] ≈ 1.95 bits
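The same estimate can be computed from the raw symbol sequence (a sketch; like the slide, it assumes independent symbols and uses the empirical frequencies):

```python
from collections import Counter
import math

def estimated_entropy(symbols):
    # Estimate per-symbol entropy from empirical symbol frequencies.
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(estimated_entropy("aaabbbbccccdd"))   # ~1.95 bits per symbol
```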
Coding Schemes
A = {s1, s2, s3, s4}, with P(s1) = 0.125, P(s2) = 0.125, P(s3) = 0.25, P(s4) = 0.5. Its entropy is H = 1.75 bits/symbol.
• Code 1: s1 → 0, s2 → 1, s3 → 10, s4 → 11. Not uniquely decodable (for example, "10" could be s3, or s2 followed by s1).
• Code 2: s1 → 110, s2 → 111, s3 → 10, s4 → 0. Good codewords: a prefix code whose average length achieves the lower bound H.
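The claim that the second code reaches the lower bound is just a weighted sum of codeword lengths (a quick check; the lengths are those of the prefix code listed above):

```python
# Average codeword length of the prefix code vs. the source entropy.
probs   = {"s1": 0.125, "s2": 0.125, "s3": 0.25, "s4": 0.5}
lengths = {"s1": 3, "s2": 3, "s3": 2, "s4": 1}   # lengths of 110, 111, 10, 0

avg_length = sum(probs[s] * lengths[s] for s in probs)
print(avg_length)   # 1.75 bits per symbol, equal to the entropy H
```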
Huffman Coding
Build the tree by repeatedly merging the two least probable nodes: s1 (0.125) and s2 (0.125) merge into a node of probability 0.25, which merges with s3 (0.25) into 0.5, which merges with s4 (0.5) into the root. Reading the branch labels from the root gives the codewords:
s4 → 1, s3 → 01, s2 → 001, s1 → 000
Another Example
A source with five symbols: P(a1) = 0.4, P(a2) = 0.2, P(a3) = 0.2, P(a4) = 0.1, P(a5) = 0.1. The Huffman tree yields the codewords:
a1 → 0, a2 → 10, a3 → 111, a4 → 1101, a5 → 1100
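The codes in the last two slides can be reproduced with a standard heap-based Huffman construction (a minimal sketch, not the lecture's own code; the exact bit labels, and even individual codeword lengths, may differ from the slides because Huffman codes are not unique, but the average length comes out the same):

```python
import heapq
import itertools

def huffman_code(probs):
    """Build a binary Huffman code for a {symbol: probability} dict."""
    tie = itertools.count()   # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # least probable subtree
        p1, _, code1 = heapq.heappop(heap)   # next least probable subtree
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

# First source: average length equals the entropy, 1.75 bits per symbol.
print(huffman_code({"s1": 0.125, "s2": 0.125, "s3": 0.25, "s4": 0.5}))

# Second source: average length 2.2 bits per symbol vs. entropy of about 2.12.
probs = {"a1": 0.4, "a2": 0.2, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
print(code, sum(p * len(code[s]) for s, p in probs.items()))
```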