From a theoretical viewpoint...
• block Huffman codes achieve the best efficiency.

for one symbol:
  symbol  A    B
  prob.   0.8  0.2
  cdwd    0    1

for three symbols (compress!):
  pattern  AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  prob.    0.512  0.128  0.128  0.032  0.128  0.032  0.032  0.008
  cdwd     0      100    101    11100  110    11101  11110  11111
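As a quick sanity check, here is a minimal Python sketch (mine, not from the slides) that recomputes the average codeword length of the 3-symbol block code above, assuming a memoryless source with p(A) = 0.8, p(B) = 0.2:

```python
from itertools import product

p = {'A': 0.8, 'B': 0.2}
codewords = {'AAA': '0',   'AAB': '100',   'ABA': '101',   'ABB': '11100',
             'BAA': '110', 'BAB': '11101', 'BBA': '11110', 'BBB': '11111'}

avg = 0.0
for block in product('AB', repeat=3):
    block = ''.join(block)
    prob = p[block[0]] * p[block[1]] * p[block[2]]   # memoryless source
    avg += prob * len(codewords[block])

print(avg / 3)   # average bits per symbol: about 0.728
```

The result, about 0.728 bit per symbol, is close to the entropy H(X) ≈ 0.722 bit, while the 1-symbol code spends a full bit per symbol.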
problem of block Huffman codes

From a practical viewpoint...
• block Huffman codes have some problems:
  • a large table is needed for the encoding/decoding
    → run-length Huffman code, arithmetic code
  • probabilities must be known in advance
    → Lempel-Ziv codes
three coding techniques, studied in the rest of this class
1/3: run-length Huffman code

a coding scheme which is good for “biased” sequences
• we focus on a binary information source
  • alphabet = {A, B}, with p(A) much larger than p(B)
• used for data compression in facsimile systems
run and run-length

run = a sequence of consecutive identical symbols

A B B A A A A A B A A A B
runs of A: length 1, length 0, length 5, length 3

The message is recovered if the lengths of the runs are given.
⇒ encode the lengths of runs, not the pattern itself
upper-bound the run-length

small problem? ... there can be very, very, very long runs
⇒ put an upper-bound limit: run-length limited (RLL) coding

upper-bound = 3:
ABBAAAAABAAAB =
• one “A” followed by B
• zero “A”s followed by B
• three or more “A”s
• two “A”s followed by B
• three or more “A”s
• zero “A”s followed by B

run length      0  1  2  3    4    5    6      7      ...
representation  0  1  2  3+0  3+1  3+2  3+3+0  3+3+1  ...
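A minimal sketch of this capping step (the function name run_lengths and the string representation of the lengths are my own choices):

```python
# Turn an A/B string into the capped run-length representation
# above (upper bound = 3 by default).
def run_lengths(data, bound=3):
    """Return the run lengths of 'A', each terminated by a 'B';
    a run reaching `bound` is emitted as 'bound+' and restarted."""
    out, run = [], 0
    for ch in data:
        if ch == 'A':
            run += 1
            if run == bound:          # cap reached: emit "bound or more"
                out.append(f'{bound}+')
                run = 0
        else:                         # 'B' ends the current run
            out.append(str(run))
            run = 0
    return out

print(run_lengths('ABBAAAAABAAAB'))   # ['1', '0', '3+', '2', '3+', '0']
```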
run-length Huffman code

... is a Huffman code defined to encode the lengths of runs
• effective when the symbol probabilities are strongly biased

p(A) = 0.9, p(B) = 0.1:
  run length     0    1     2      3 or more
  block pattern  B    AB    AAB    AAA
  prob.          0.1  0.09  0.081  0.729
  codeword       10   110   111    0

• ABBAAAAABAAAB:  1, 0, 3+, 2, 3+, 0  ⇒ 110 10 0 111 0 10
• AAAABAAAAABAAB: 3+, 1, 3+, 2, 2     ⇒ 0 110 0 111 111
• AAABAAAAAAAAB:  3+, 0, 3+, 3+, 2    ⇒ 0 10 0 0 111
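The full encoder is then a table lookup over the run lengths; a sketch (mine, not from the slides) reusing run_lengths() from the previous sketch:

```python
# Run-length Huffman encoder for p(A) = 0.9, p(B) = 0.1,
# using the codeword table of the slide above.
table = {'0': '10', '1': '110', '2': '111', '3+': '0'}

def rl_huffman_encode(data):
    return ''.join(table[r] for r in run_lengths(data, bound=3))

print(rl_huffman_encode('ABBAAAAABAAAB'))    # '110100111010'
print(rl_huffman_encode('AAAABAAAAABAAB'))   # '01100111111'
```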
comparison

p(A) = 0.9, p(B) = 0.1
the entropy of X: H(X) = –0.9 log2 0.9 – 0.1 log2 0.1 = 0.469 bit

• code 1: a naive Huffman code
  symbol    A    B
  prob.     0.9  0.1
  codeword  0    1
  average codeword length = 1 bit / symbol

• code 2: block Huffman code (3-symbol blocks)
  pattern   AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  prob.     0.729  0.081  0.081  0.009  0.081  0.009  0.009  0.001
  codeword  0      100    110    1010   1110   1011   11110  11111
  average codeword length = 1.661 bits / 3 symbols = 0.55 bit / symbol
comparison (cnt’d)

• code 3: run-length Huffman code (upper-bound = 8)
  length    0    1     2     3     4     5     6     7+
  prob.     0.1  0.09  0.081 0.073 0.066 0.059 0.053 0.478
  codeword  110  1000  1001  1010  1011  1110  1111  0

consider typical runs...
• before encoding: 5.215 symbols (A’s or B’s) per run, on average
• after encoding:  2.466 bits (0’s or 1’s) per run, on average
⇒ the average codeword length per symbol = 2.466 / 5.215 = 0.47 bit

RLL is a small trick, but it fully utilizes the Huffman coding technique.
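The averages 2.466 and 5.215 can be checked with a short computation; a sketch assuming runs generated by a memoryless source with p(A) = 0.9 (tiny differences from the slide’s figures come from rounding in its table):

```python
pA, pB = 0.9, 0.1
cw_len = [3, 4, 4, 4, 4, 4, 4]      # codeword lengths for run lengths 0..6

bits = symbols = 0.0
for k in range(7):
    prob = pA**k * pB               # run of k A's ended by one B
    bits += prob * cw_len[k]
    symbols += prob * (k + 1)       # k A's plus the terminating B
bits += pA**7 * 1                   # run '7+': codeword '0' (1 bit)...
symbols += pA**7 * 7                # ...consuming 7 symbols

print(bits, symbols, bits / symbols)   # ≈ 2.47, ≈ 5.22, ≈ 0.47
```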
2/3: arithmetic code

a coding scheme which does not use the translation table
• table-lookup is replaced by “on-the-fly” computation
  • translation table is not needed
  • slightly complicated computation is needed
• it is proved that its average codeword length is almost optimum
  (see the performance slide below)
preliminary

• the third-order extended source of a memoryless binary source
  with p(A) = 0.7, p(B) = 0.3
• we encode one of the 2^3 = 8 data patterns x, listed in the dictionary order
• P(x): prob. that x occurs
• A(x): accumulation of the probs. of the patterns before x

  #     0      1      2      3      4      5      6      7
  x     AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  P(x)  0.343  0.147  0.147  0.063  0.147  0.063  0.063  0.027
  A(x)  0      0.343  0.490  0.637  0.700  0.847  0.910  0.973
illustration of probabilities

• the 8 data patterns define a partition of the interval [0, 1):
  pattern x occupies the interval of size P(x) with left end A(x),
  i.e. [A(x), A(x) + P(x))
  e.g. A(BAA) = A(ABB) + P(ABB) = 0.637 + 0.063 = 0.700

basic idea:
• represent x by a value v(x) in the interval of x

problem to solve:
• we need a translation between x and v(x)
about the translation

two directions of the translation:
• [encode] the translation from x to v(x)
• [decode] the translation from v(x) to x
... use recursive computation instead of a static table

“the land of a parent is divided & inherited by its two children”
• a node w owns the interval with left end A(w) and size P(w)
• its children wA and wB inherit it:
  A(wA) = A(w),          P(wA) = P(w) p(A)
  A(wB) = A(w) + P(wA),  P(wB) = P(w) p(B)
[encode] the translation from x to v(x)

recursively determine A(w) and P(w) for prefixes w of x:
• A(λ) = 0, P(λ) = 1 (λ is a null string)
• for w = uA: A(w) = A(u), P(w) = P(u) p(A)
• for w = uB: A(w) = A(u) + P(u) p(A), P(w) = P(u) p(B)

the interval of ABB?
  A   → [0, 0.700)
  AB  → [0.490, 0.700)
  ABB inherits [0.637, 0.637 + 0.063)
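A minimal sketch of this recursion, assuming the memoryless source p(A) = 0.7, p(B) = 0.3 of the running example:

```python
p = {'A': 0.7, 'B': 0.3}

def interval(pattern):
    """Return (A(x), P(x)): the left end and the size of x's interval."""
    left, size = 0.0, 1.0            # the null string owns [0, 1)
    for ch in pattern:
        if ch == 'B':                # skip over the 'A' child's share
            left += size * p['A']
        size *= p[ch]
    return left, size

print(interval('ABB'))   # ≈ (0.637, 0.063): the interval [0.637, 0.700)
```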
[encode] the translation from x to v(x) (cnt’d)

We know the interval [A(x), A(x) + P(x)); which value v(x) in it should we choose?
• v(x) should have the shortest binary representation
• choose A(x) + P(x), but trim it after about ⌈–log2 P(x)⌉ places:

    0.aa...aaa...a     (A(x))
  + 0.00...01b...b     (P(x))
    0.aa...acc...c     (A(x) + P(x))
  → 0.aa...ac0...0     (trimmed: this is v(x))

the length of v(x) ≈ –log2 P(x)
  = the most significant non-zero place of P(x)
⇒ almost ideal!
choice of v(x) (sketch in decimal notation)

Find the v(x) in [0.123456, 0.126543) that is the shortest in decimal.

    0.126543
  – 0.123456
    0.003087     ← P(x)

round off some digits of 0.126543, but not too many...
0.126543 → 0.12654 → 0.1265 → 0.126 (still OK) → 0.12 (too short!)

# of fraction places that v(x) must have
  = the most significant nonzero place of P(x) = 3
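The same search can be sketched in binary (my own formulation of the trimming rule, not verbatim from the slides): find the fraction with the fewest bits inside [A(x), A(x) + P(x)):

```python
import math

def shortest_codeword(left, size):
    k = 1
    while True:
        v = math.ceil(left * 2**k) / 2**k    # smallest k-bit fraction >= A(x)
        if v < left + size:                  # it lies inside the interval
            return format(round(v * 2**k), f'0{k}b')
        k += 1                               # need one more fraction place

print(shortest_codeword(0.637, 0.063))
# '1011': v = 0.6875 is in [0.637, 0.700), and 4 ≈ -log2(0.063)
```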
[decode] the translation from v(x) to x

given v(x), determine the leaf node whose interval contains v(x)
• almost the same as the first half of the encoding translation
• compute the threshold value A(wB), compare, and move to the left or right

v(x) = 0.600:
• A(B) = 0.700 > 0.600   → move left to A
• A(AB) = 0.490 ≤ 0.600  → move right to AB
• A(ABB) = 0.637 > 0.600 → move left to ABA
0.600 is contained in the interval of ABA ... decoding completed
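A sketch of this decoding walk for the same source (p(A) = 0.7, p(B) = 0.3):

```python
p = {'A': 0.7, 'B': 0.3}

def decode(v, n=3):
    left, size, pattern = 0.0, 1.0, ''
    for _ in range(n):
        threshold = left + size * p['A']   # left end of the 'B' child
        if v < threshold:                  # v(x) lies in the 'A' child
            pattern += 'A'
            size *= p['A']
        else:                              # v(x) lies in the 'B' child
            pattern += 'B'
            left, size = threshold, size * p['B']
    return pattern

print(decode(0.600))   # 'ABA', as in the slide
```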
performance, summary

an n-symbol pattern x with probability P(x) is encoded to a codeword
with length ≈ –log2 P(x)
• the average codeword length per symbol is then ≈ H(X)
• almost optimum coding without using a translation table

however...
• we need much computation with good precision (→ use approximation?)
3/3: Lempel-Ziv codes

a coding scheme which does not need the probability distribution
• the encoder learns the statistical behavior of the source
• the translation table is constructed in an adaptive manner
• works fine even for information sources with memory
probability in advance?

so far, we assumed that the probabilities of symbols are known...

in the real world...
• the symbol probabilities are often not known in advance
• scan the data twice?
  • first scan ... count the number of symbol occurrences
  • second scan ... Huffman coding
  → delay of the encoding operation...
  → overhead to transmit the translation table...
Lempel-Ziv algorithms

for information sources whose symbol probabilities are not known...
• LZ77: lha, gzip, zip, zoo, etc.
• LZ78: compress, arc, stuffit, etc.
• LZW: GIF, TIFF, etc.

they work fine for any information source → universal coding
LZ77

• proposed by A. Lempel and J. Ziv in 1977
• represent a data substring by using a substring which has occurred previously

algorithm overview:
• process the data from the beginning
• partition the data into blocks in a dynamic manner
• represent a block by a three-tuple (u, v, c):
  “rewind u symbols, copy v symbols, and append c”
encoding example of LZ77

• consider encoding ABCBCDBDCBCD

  block      history                          codeword
  A          first time                       (0, 0, A)
  B          first time                       (0, 0, B)
  C          first time                       (0, 0, C)
  B C D      BC = the 2 symbols 2 back        (2, 2, D)
  B D        B = the symbol 3 back            (3, 1, D)
  C B C D    CBCD = the 4 symbols 6 back      (6, 4, *)
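A toy Python encoder along these lines (my own sketch: unbounded window, greedy longest match, and ‘*’ as an explicit end mark as in the example):

```python
def lz77_encode(data):
    """Emit (rewind, copy, append) three-tuples for `data`."""
    data = data + '*'                           # explicit end mark
    out, pos = [], 0
    while pos < len(data):
        best_len, best_back = 0, 0
        for back in range(1, pos + 1):          # candidate rewind distances
            length = 0
            while (pos + length < len(data) - 1 and
                   data[pos + length] == data[pos - back + length]):
                length += 1                     # a match may overlap 'here'
            if length > best_len:
                best_len, best_back = length, back
        out.append((best_back, best_len, data[pos + best_len]))
        pos += best_len + 1
    return out

print(lz77_encode('ABCBCDBDCBCD'))
# [(0,0,'A'), (0,0,'B'), (0,0,'C'), (2,2,'D'), (3,1,'D'), (6,4,'*')]
```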
decoding example of LZ77

• decode (0, 0, A), (0, 0, B), (0, 0, C), (2, 2, D), (3, 1, D), (6, 4, *)
  → A, B, C, BCD, BD, CBCD → ABCBCDBDCBCD

possible problem:
• a large block is good, because we can copy more symbols
• a large block is bad, because a codeword then contains large integers
... this trade-off degrades the performance.
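The matching decoder is even simpler (again a sketch, not a production implementation):

```python
def lz77_decode(tuples):
    out = []
    for back, copy, ch in tuples:
        for _ in range(copy):
            out.append(out[-back])   # symbol-by-symbol copy lets the
        out.append(ch)               # copied region overlap its own output
    return ''.join(out)

print(lz77_decode([(0, 0, 'A'), (0, 0, 'B'), (0, 0, 'C'),
                   (2, 2, 'D'), (3, 1, 'D'), (6, 4, '*')]))
# 'ABCBCDBDCBCD*'
```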
LZ78

• proposed by A. Lempel and J. Ziv in 1978
• represent a block by a two-tuple (u, c):
  “copy the block u blocks before, and append c”
encoding example of LZ78

• consider encoding ABCBCBCDBCDE

  block #  block    history                    codeword
  1        A        first time                 (0, A)
  2        B        first time                 (0, B)
  3        C        first time                 (0, C)
  4        B C      = block 2 before (B)       (2, C)
  5        B C D    = block 1 before (BC)      (1, D)
  6        B C D E  = block 1 before (BCD)     (1, E)
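A sketch of this LZ78 variant (the relative block index follows the slide’s convention; the code assumes, as in the example, that the data ends with a symbol completing the last block):

```python
def lz78_encode(data):
    """Emit (u, c) two-tuples: copy the block u blocks before, append c."""
    blocks, out, pos = [], [], 0
    while pos < len(data):
        best = None                          # longest previously seen block
        for i, b in enumerate(blocks):       # that is a prefix of the rest
            if data.startswith(b, pos) and (best is None or len(b) > len(best[1])):
                best = (i, b)
        if best is None:                     # first time: emit (0, symbol)
            out.append((0, data[pos]))
            blocks.append(data[pos])
            pos += 1
        else:
            i, b = best
            ch = data[pos + len(b)]          # the symbol that extends b
            out.append((len(blocks) - i, ch))
            blocks.append(b + ch)
            pos += len(b) + 1
    return out

print(lz78_encode('ABCBCBCDBCDE'))
# [(0,'A'), (0,'B'), (0,'C'), (2,'C'), (1,'D'), (1,'E')]
```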
decoding example of LZ78

• decode (0, A), (0, B), (0, C), (2, C), (1, D), (1, E)
  → A, B, C, BC, BCD, BCDE → ABCBCBCDBCDE

advantage over LZ77:
• a large block is good, because we can copy more symbols
• is there anything wrong with large blocks? ... not here:
  the integer u in a codeword counts blocks, not symbols
⇒ the performance is slightly better than LZ77
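The decoder rebuilds the same block list as it reads (a sketch matching the encoder above):

```python
def lz78_decode(tuples):
    blocks = []
    for u, ch in tuples:
        base = '' if u == 0 else blocks[len(blocks) - u]
        blocks.append(base + ch)     # reconstruct block = copied part + c
    return ''.join(blocks)

print(lz78_decode([(0, 'A'), (0, 'B'), (0, 'C'),
                   (2, 'C'), (1, 'D'), (1, 'E')]))   # 'ABCBCBCDBCDE'
```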
summary of LZ algorithms

in LZ algorithms, the translation table is constructed adaptively
→ applicable to:
• information sources with unknown symbol probabilities
• information sources with memory

• LZW: good material to learn about intellectual property (知的財産)
  • UNISYS, CompuServe, the GIF format, ...
summary of today’s class

Huffman codes are good, but sometimes not practical...
• run-length Huffman code
  • simple, but effective for certain types of sources
• arithmetic code
  • not so practical, but has strong back-up from theory
• LZ codes
  • practical, practical, practical