From a theoretical viewpoint...
• block Huffman codes achieve the best efficiency.

for one symbol:
  symbol  A    B
  prob.   0.8  0.2
  cdwd    0    1

for three symbols (compress!):
  pattern  AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  prob.    0.512  0.128  0.128  0.032  0.128  0.032  0.032  0.008
  cdwd     0      100    101    11100  110    11101  11110  11111
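As a quick sanity check, here is a minimal Python sketch (mine, not from the slides) that recomputes the average codeword length of the 3-symbol block code above, assuming a memoryless source with p(A) = 0.8, p(B) = 0.2:

```python
from itertools import product

p = {'A': 0.8, 'B': 0.2}
codewords = {'AAA': '0',   'AAB': '100',   'ABA': '101',   'ABB': '11100',
             'BAA': '110', 'BAB': '11101', 'BBA': '11110', 'BBB': '11111'}

avg = 0.0
for block in product('AB', repeat=3):
    block = ''.join(block)
    prob = p[block[0]] * p[block[1]] * p[block[2]]   # memoryless source
    avg += prob * len(codewords[block])

print(avg / 3)   # average bits per symbol: about 0.728
```

The result, about 0.728 bit per symbol, is close to the entropy H(X) ≈ 0.722 bit, while the 1-symbol code spends a full bit per symbol.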
problem of block Huffman codes

From a practical viewpoint...
• block Huffman codes have some problems:
  • a large table is needed for the encoding/decoding
    → run-length Huffman code, arithmetic code
  • probabilities must be known in advance
    → Lempel-Ziv codes
three coding techniques, studied in the rest of this class
1/3: run-length Huffman code

a coding scheme which is good for “biased” sequences
• we focus on a binary information source
  • alphabet = {A, B}, with p(A) much larger than p(B)
• used for data compression in facsimile systems
run and run-length

run = a sequence of consecutive identical symbols

A B B A A A A A B A A A B
runs of A: length 1, length 0, length 5, length 3

The message is recovered if the lengths of the runs are given.
⇒ encode the lengths of runs, not the pattern itself
upper-bound the run-length

small problem? ... there can be very, very, very long runs
⇒ put an upper-bound limit: run-length limited (RLL) coding

upper-bound = 3:
ABBAAAAABAAAB =
• one “A” followed by B
• zero “A”s followed by B
• three or more “A”s
• two “A”s followed by B
• three or more “A”s
• zero “A”s followed by B

run length      0  1  2  3    4    5    6      7      ...
representation  0  1  2  3+0  3+1  3+2  3+3+0  3+3+1  ...
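A minimal sketch of this capping step (the function name run_lengths and the string representation of the lengths are my own choices):

```python
# Turn an A/B string into the capped run-length representation
# above (upper bound = 3 by default).
def run_lengths(data, bound=3):
    """Return the run lengths of 'A', each terminated by a 'B';
    a run reaching `bound` is emitted as 'bound+' and restarted."""
    out, run = [], 0
    for ch in data:
        if ch == 'A':
            run += 1
            if run == bound:          # cap reached: emit "bound or more"
                out.append(f'{bound}+')
                run = 0
        else:                         # 'B' ends the current run
            out.append(str(run))
            run = 0
    return out

print(run_lengths('ABBAAAAABAAAB'))   # ['1', '0', '3+', '2', '3+', '0']
```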
run-length Huffman code

... is a Huffman code defined to encode the lengths of runs
• effective when the symbol probabilities are strongly biased

p(A) = 0.9, p(B) = 0.1:
  run length     0    1     2      3 or more
  block pattern  B    AB    AAB    AAA
  prob.          0.1  0.09  0.081  0.729
  codeword       10   110   111    0

• ABBAAAAABAAAB:  1, 0, 3+, 2, 3+, 0  ⇒ 110 10 0 111 0 10
• AAAABAAAAABAAB: 3+, 1, 3+, 2, 2     ⇒ 0 110 0 111 111
• AAABAAAAAAAAB:  3+, 0, 3+, 3+, 2    ⇒ 0 10 0 0 111
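The full encoder is then a table lookup over the run lengths; a sketch (mine, not from the slides) reusing run_lengths() from the previous sketch:

```python
# Run-length Huffman encoder for p(A) = 0.9, p(B) = 0.1,
# using the codeword table of the slide above.
table = {'0': '10', '1': '110', '2': '111', '3+': '0'}

def rl_huffman_encode(data):
    return ''.join(table[r] for r in run_lengths(data, bound=3))

print(rl_huffman_encode('ABBAAAAABAAAB'))    # '110100111010'
print(rl_huffman_encode('AAAABAAAAABAAB'))   # '01100111111'
```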
comparison

p(A) = 0.9, p(B) = 0.1
the entropy of X: H(X) = –0.9 log2 0.9 – 0.1 log2 0.1 = 0.469 bit

• code 1: a naive Huffman code
  symbol    A    B
  prob.     0.9  0.1
  codeword  0    1
  average codeword length = 1 bit / symbol

• code 2: block Huffman code (3-symbol blocks)
  pattern   AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  prob.     0.729  0.081  0.081  0.009  0.081  0.009  0.009  0.001
  codeword  0      100    110    1010   1110   1011   11110  11111
  average codeword length = 1.661 bits / 3 symbols = 0.55 bit / symbol
comparison (cnt’d)

• code 3: run-length Huffman code (upper-bound = 8)
  length    0    1     2     3     4     5     6     7+
  prob.     0.1  0.09  0.081 0.073 0.066 0.059 0.053 0.478
  codeword  110  1000  1001  1010  1011  1110  1111  0

consider typical runs...
• before encoding: 5.215 symbols (A’s or B’s) per run, on average
• after encoding:  2.466 bits (0’s or 1’s) per run, on average
⇒ the average codeword length per symbol = 2.466 / 5.215 = 0.47 bit

RLL is a small trick, but it fully utilizes the Huffman coding technique.
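The averages 2.466 and 5.215 can be checked with a short computation; a sketch assuming runs generated by a memoryless source with p(A) = 0.9 (tiny differences from the slide’s figures come from rounding in its table):

```python
pA, pB = 0.9, 0.1
cw_len = [3, 4, 4, 4, 4, 4, 4]      # codeword lengths for run lengths 0..6

bits = symbols = 0.0
for k in range(7):
    prob = pA**k * pB               # run of k A's ended by one B
    bits += prob * cw_len[k]
    symbols += prob * (k + 1)       # k A's plus the terminating B
bits += pA**7 * 1                   # run '7+': codeword '0' (1 bit)...
symbols += pA**7 * 7                # ...consuming 7 symbols

print(bits, symbols, bits / symbols)   # ≈ 2.47, ≈ 5.22, ≈ 0.47
```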
2/3: arithmetic code

a coding scheme which does not use the translation table
• table-lookup is replaced by “on-the-fly” computation
  • translation table is not needed
  • slightly complicated computation is needed
• it is proved that its average codeword length is almost optimum
  (see the performance slide below)
preliminary

• the third-order extended source of a memoryless binary source
  with p(A) = 0.7, p(B) = 0.3
• we encode one of the 2^3 = 8 data patterns x, listed in the dictionary order
• P(x): prob. that x occurs
• A(x): accumulation of the probs. of the patterns before x

  #     0      1      2      3      4      5      6      7
  x     AAA    AAB    ABA    ABB    BAA    BAB    BBA    BBB
  P(x)  0.343  0.147  0.147  0.063  0.147  0.063  0.063  0.027
  A(x)  0      0.343  0.490  0.637  0.700  0.847  0.910  0.973
illustration of probabilities

• the 8 data patterns define a partition of the interval [0, 1):
  pattern x occupies the interval of size P(x) with left end A(x),
  i.e. [A(x), A(x) + P(x))
  e.g. A(BAA) = A(ABB) + P(ABB) = 0.637 + 0.063 = 0.700

basic idea:
• represent x by a value v(x) in the interval of x

problem to solve:
• we need a translation between x and v(x)
about the translation

two directions of the translation:
• [encode] the translation from x to v(x)
• [decode] the translation from v(x) to x
... use recursive computation instead of a static table

“the land of a parent is divided & inherited by its two children”
• a node w owns the interval with left end A(w) and size P(w)
• its children wA and wB inherit it:
  A(wA) = A(w),          P(wA) = P(w) p(A)
  A(wB) = A(w) + P(wA),  P(wB) = P(w) p(B)
[encode] the translation from x to v(x)

recursively determine A(w) and P(w) for prefixes w of x:
• A(λ) = 0, P(λ) = 1 (λ is a null string)
• for w = uA: A(w) = A(u), P(w) = P(u) p(A)
• for w = uB: A(w) = A(u) + P(u) p(A), P(w) = P(u) p(B)

the interval of ABB?
  A   → [0, 0.700)
  AB  → [0.490, 0.700)
  ABB inherits [0.637, 0.637 + 0.063)
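A minimal sketch of this recursion, assuming the memoryless source p(A) = 0.7, p(B) = 0.3 of the running example:

```python
p = {'A': 0.7, 'B': 0.3}

def interval(pattern):
    """Return (A(x), P(x)): the left end and the size of x's interval."""
    left, size = 0.0, 1.0            # the null string owns [0, 1)
    for ch in pattern:
        if ch == 'B':                # skip over the 'A' child's share
            left += size * p['A']
        size *= p[ch]
    return left, size

print(interval('ABB'))   # ≈ (0.637, 0.063): the interval [0.637, 0.700)
```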
[encode] the translation from x to v(x) (cnt’d)

We know the interval [A(x), A(x) + P(x)); which value v(x) in it should we choose?
• v(x) should have the shortest binary representation
• choose A(x) + P(x), but trim it after about ⌈–log2 P(x)⌉ places:

    0.aa...aaa...a     (A(x))
  + 0.00...01b...b     (P(x))
    0.aa...acc...c     (A(x) + P(x))
  → 0.aa...ac0...0     (trimmed: this is v(x))

the length of v(x) ≈ –log2 P(x)
  = the most significant non-zero place of P(x)
⇒ almost ideal!
choice of v(x) (sketch in decimal notation)

Find the v(x) in [0.123456, 0.126543) that is the shortest in decimal.

    0.126543
  – 0.123456
    0.003087     ← P(x)

round off some digits of 0.126543, but not too many...
0.126543 → 0.12654 → 0.1265 → 0.126 (still OK) → 0.12 (too short!)

# of fraction places that v(x) must have
  = the most significant nonzero place of P(x) = 3
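The same search can be sketched in binary (my own formulation of the trimming rule, not verbatim from the slides): find the fraction with the fewest bits inside [A(x), A(x) + P(x)):

```python
import math

def shortest_codeword(left, size):
    k = 1
    while True:
        v = math.ceil(left * 2**k) / 2**k    # smallest k-bit fraction >= A(x)
        if v < left + size:                  # it lies inside the interval
            return format(round(v * 2**k), f'0{k}b')
        k += 1                               # need one more fraction place

print(shortest_codeword(0.637, 0.063))
# '1011': v = 0.6875 is in [0.637, 0.700), and 4 ≈ -log2(0.063)
```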
[decode] the translation from v(x) to x

given v(x), determine the leaf node whose interval contains v(x)
• almost the same as the first half of the encoding translation
• compute the threshold value A(wB), compare, and move to the left or right

v(x) = 0.600:
• A(B) = 0.700 > 0.600   → move left to A
• A(AB) = 0.490 ≤ 0.600  → move right to AB
• A(ABB) = 0.637 > 0.600 → move left to ABA
0.600 is contained in the interval of ABA ... decoding completed
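A sketch of this decoding walk for the same source (p(A) = 0.7, p(B) = 0.3):

```python
p = {'A': 0.7, 'B': 0.3}

def decode(v, n=3):
    left, size, pattern = 0.0, 1.0, ''
    for _ in range(n):
        threshold = left + size * p['A']   # left end of the 'B' child
        if v < threshold:                  # v(x) lies in the 'A' child
            pattern += 'A'
            size *= p['A']
        else:                              # v(x) lies in the 'B' child
            pattern += 'B'
            left, size = threshold, size * p['B']
    return pattern

print(decode(0.600))   # 'ABA', as in the slide
```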
performance, summary

an n-symbol pattern x with probability P(x) is encoded to a codeword
with length ≈ –log2 P(x)
• the average codeword length per symbol is then ≈ H(X)
• almost optimum coding without using a translation table

however...
• we need much computation with good precision (→ use approximation?)
3/3: Lempel-Ziv codes

a coding scheme which does not need the probability distribution
• the encoder learns the statistical behavior of the source
• the translation table is constructed in an adaptive manner
• works fine even for information sources with memory
probability in advance?

so far, we assumed that the probabilities of symbols are known...

in the real world...
• the symbol probabilities are often not known in advance
• scan the data twice?
  • first scan ... count the number of symbol occurrences
  • second scan ... Huffman coding
  → delay of the encoding operation...
  → overhead to transmit the translation table...
Lempel-Ziv algorithms

for information sources whose symbol probabilities are not known...
• LZ77: lha, gzip, zip, zoo, etc.
• LZ78: compress, arc, stuffit, etc.
• LZW: GIF, TIFF, etc.

they work fine for any information source → universal coding
LZ77

• proposed by A. Lempel and J. Ziv in 1977
• represent a data substring by using a substring which has occurred previously

algorithm overview:
• process the data from the beginning
• partition the data into blocks in a dynamic manner
• represent a block by a three-tuple (u, v, c):
  “rewind u symbols, copy v symbols, and append c”
encoding example of LZ77

• consider encoding ABCBCDBDCBCD

  block      history                          codeword
  A          first time                       (0, 0, A)
  B          first time                       (0, 0, B)
  C          first time                       (0, 0, C)
  B C D      BC = the 2 symbols 2 back        (2, 2, D)
  B D        B = the symbol 3 back            (3, 1, D)
  C B C D    CBCD = the 4 symbols 6 back      (6, 4, *)
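A toy Python encoder along these lines (my own sketch: unbounded window, greedy longest match, and ‘*’ as an explicit end mark as in the example):

```python
def lz77_encode(data):
    """Emit (rewind, copy, append) three-tuples for `data`."""
    data = data + '*'                           # explicit end mark
    out, pos = [], 0
    while pos < len(data):
        best_len, best_back = 0, 0
        for back in range(1, pos + 1):          # candidate rewind distances
            length = 0
            while (pos + length < len(data) - 1 and
                   data[pos + length] == data[pos - back + length]):
                length += 1                     # a match may overlap 'here'
            if length > best_len:
                best_len, best_back = length, back
        out.append((best_back, best_len, data[pos + best_len]))
        pos += best_len + 1
    return out

print(lz77_encode('ABCBCDBDCBCD'))
# [(0,0,'A'), (0,0,'B'), (0,0,'C'), (2,2,'D'), (3,1,'D'), (6,4,'*')]
```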
decoding example of LZ77

• decode (0, 0, A), (0, 0, B), (0, 0, C), (2, 2, D), (3, 1, D), (6, 4, *)
  → A, B, C, BCD, BD, CBCD → ABCBCDBDCBCD

possible problem:
• a large block is good, because we can copy more symbols
• a large block is bad, because a codeword then contains large integers
... this trade-off degrades the performance.
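The matching decoder is even simpler (again a sketch, not a production implementation):

```python
def lz77_decode(tuples):
    out = []
    for back, copy, ch in tuples:
        for _ in range(copy):
            out.append(out[-back])   # symbol-by-symbol copy lets the
        out.append(ch)               # copied region overlap its own output
    return ''.join(out)

print(lz77_decode([(0, 0, 'A'), (0, 0, 'B'), (0, 0, 'C'),
                   (2, 2, 'D'), (3, 1, 'D'), (6, 4, '*')]))
# 'ABCBCDBDCBCD*'
```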
LZ78

• proposed by A. Lempel and J. Ziv in 1978
• represent a block by a two-tuple (u, c):
  “copy the block u blocks before, and append c”
encoding example of LZ78

• consider encoding ABCBCBCDBCDE

  block #  block    history                    codeword
  1        A        first time                 (0, A)
  2        B        first time                 (0, B)
  3        C        first time                 (0, C)
  4        B C      = block 2 before (B)       (2, C)
  5        B C D    = block 1 before (BC)      (1, D)
  6        B C D E  = block 1 before (BCD)     (1, E)
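A sketch of this LZ78 variant (the relative block index follows the slide’s convention; the code assumes, as in the example, that the data ends with a symbol completing the last block):

```python
def lz78_encode(data):
    """Emit (u, c) two-tuples: copy the block u blocks before, append c."""
    blocks, out, pos = [], [], 0
    while pos < len(data):
        best = None                          # longest previously seen block
        for i, b in enumerate(blocks):       # that is a prefix of the rest
            if data.startswith(b, pos) and (best is None or len(b) > len(best[1])):
                best = (i, b)
        if best is None:                     # first time: emit (0, symbol)
            out.append((0, data[pos]))
            blocks.append(data[pos])
            pos += 1
        else:
            i, b = best
            ch = data[pos + len(b)]          # the symbol that extends b
            out.append((len(blocks) - i, ch))
            blocks.append(b + ch)
            pos += len(b) + 1
    return out

print(lz78_encode('ABCBCBCDBCDE'))
# [(0,'A'), (0,'B'), (0,'C'), (2,'C'), (1,'D'), (1,'E')]
```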
decoding example of LZ78

• decode (0, A), (0, B), (0, C), (2, C), (1, D), (1, E)
  → A, B, C, BC, BCD, BCDE → ABCBCBCDBCDE

advantage over LZ77:
• a large block is good, because we can copy more symbols
• is there anything wrong with large blocks? ... not here:
  the integer u in a codeword counts blocks, not symbols
⇒ the performance is slightly better than LZ77
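The decoder rebuilds the same block list as it reads (a sketch matching the encoder above):

```python
def lz78_decode(tuples):
    blocks = []
    for u, ch in tuples:
        base = '' if u == 0 else blocks[len(blocks) - u]
        blocks.append(base + ch)     # reconstruct block = copied part + c
    return ''.join(blocks)

print(lz78_decode([(0, 'A'), (0, 'B'), (0, 'C'),
                   (2, 'C'), (1, 'D'), (1, 'E')]))   # 'ABCBCBCDBCDE'
```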
summary of LZ algorithms

in LZ algorithms, the translation table is constructed adaptively
→ applicable to:
• information sources with unknown symbol probabilities
• information sources with memory

• LZW: good material to learn about intellectual property (知的財産)
  • UNISYS, CompuServe, the GIF format, ...
summary of today’s class

Huffman codes are good, but sometimes not practical...
• run-length Huffman code
  • simple, but effective for certain types of sources
• arithmetic code
  • not so practical, but has strong back-up from theory
• LZ codes
  • practical, practical, practical