Optimal codes - I • A code is optimal if it has the shortest expected codeword length $L = \sum_i p_i l_i$ • Finding it can be seen as an optimization problem: minimize $L$ over the codeword lengths $l_i$
Optimal codes - II • Let's make two simplifying assumptions • no integer constraint on the codelengths • the Kraft inequality holds with equality: $\sum_i D^{-l_i} = 1$ • This becomes a Lagrange-multiplier problem: minimize $J = \sum_i p_i l_i + \lambda \left( \sum_i D^{-l_i} - 1 \right)$
Optimal codes - III • Setting $\partial J / \partial l_i = 0$ gives $D^{-l_i} = p_i / (\lambda \ln D)$ • Substituting into the Kraft inequality forces $\lambda = 1/\ln D$, that is $D^{-l_i} = p_i$ and $l_i^* = -\log_D p_i$ • The resulting expected length is $L^* = \sum_i p_i l_i^* = -\sum_i p_i \log_D p_i = H_D(X)$ • Note that this is exactly the entropy, when we use base D for logarithms
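As a quick numerical check, here is a minimal Python sketch; the distribution is borrowed from the Huffman example later in these slides, and the variable names are mine:

```python
import math

# Source distribution borrowed from the Huffman example later in the slides
probs = [0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1]
D = 2  # size of the code alphabet

# Optimal real-valued codeword lengths: l_i = -log_D p_i
lengths = [-math.log(p, D) for p in probs]

# Without the integer constraint, the expected length equals the base-D entropy
L_star = sum(p * l for p, l in zip(probs, lengths))
H_D = -sum(p * math.log(p, D) for p in probs)
print(L_star, H_D)  # both print ~2.5464
```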
Optimal codes - IV • In practice the codeword lengths must be integer values, so the result just obtained is a lower bound • Theorem: the expected length of any instantaneous D-ary code for a r.v. X satisfies $L \ge H_D(X)$ • This fundamental result derives from the work of Shannon
Optimal codes - V • What about the upper bound? • Theorem: given a source alphabet (i.e. a r.v.) of entropy $H_D(X)$, it is possible to find an instantaneous D-ary code whose length satisfies $H_D(X) \le L < H_D(X) + 1$ (take the Shannon code, $l_i = \lceil -\log_D p_i \rceil$) • A similar theorem can be stated if we use the wrong probabilities $q_i$ instead of the true ones $p_i$; the only difference is a term which accounts for the relative entropy: $H(p) + D(p\|q) \le L < H(p) + D(p\|q) + 1$
The redundancy • It is defined as the average codeword length minus the entropy: $R = L - H$ • Note that $0 \le R < 1$ (why?)
Compression ratio • It is the ratio between the average number of bits per symbol in the original message and the same quantity for the coded message, i.e. $C = \bar{L}_{\text{orig}} / \bar{L}_{\text{coded}}$
Uniquely decodable codes • The set of instantaneous codes is a small subset of the uniquely decodable codes • Is it possible to obtain a lower average code length L using a uniquely decodable code that is not instantaneous? NO • So we use instantaneous codes, which are easier to decode
Summary • Average codeword length: $L \ge H_D(X)$ for uniquely decodable codes (and hence for instantaneous codes) • In practice, for each r.v. with entropy $H_D(X)$ we can build an instantaneous code whose average codeword length satisfies $H_D(X) \le L < H_D(X) + 1$
Shannon-Fano coding • The main advantage of the Shannon-Fano technique is its simplicity • Source symbols are listed in order of non-increasing probability • The list is divided so as to form two groups with as nearly equal total probabilities as possible • Each symbol in the first group receives 0 as the first digit of its codeword, while the others receive 1 • Each of these groups is then divided according to the same criterion, and additional code digits are appended • The process continues until each group contains only one message (a Python sketch follows)
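A minimal sketch of the procedure above for a binary code; the function name and the exact split rule (minimize the difference between the two groups' totals) are my own choices, not prescribed by the slides:

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability), sorted by non-increasing probability.
    Returns a dict symbol -> codeword string."""
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        # Find the split point that makes the two groups' total
        # probabilities as nearly equal as possible.
        total = sum(p for _, p in group)
        acc, best_i, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):
            acc += group[i - 1][1]
            diff = abs(2 * acc - total)
            if diff < best_diff:
                best_diff, best_i = diff, i
        first, second = group[:best_i], group[best_i:]
        for s, _ in first:
            codes[s] += "0"   # first group gets digit 0
        for s, _ in second:
            codes[s] += "1"   # second group gets digit 1
        split(first)
        split(second)

    split(symbols)
    return codes

print(shannon_fano([("a", 0.5), ("b", 0.25), ("c", 0.125), ("d", 0.125)]))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```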
example • [the slide's probability table is omitted here] • H = 1.9375 bits, L = 1.9375 bits: for this source the Shannon-Fano code achieves the entropy
Shannon-Fano coding - exercise • Encode, using the Shannon-Fano algorithm [the slide's source table is omitted here]
Is Shannon-Fano coding optimal? • [the slide's worked example is omitted here] • H = 2.2328 bits, L = 2.31 bits, while an alternative code reaches L1 = 2.3 bits: Shannon-Fano coding is not always optimal
Huffman coding - I • There is another algorithm whose performance is slightly better than Shannon-Fano's: the famous Huffman coding • It works by constructing a tree bottom-up, with the symbols at the leaves • The two leaves with the smallest probabilities become siblings under a parent node whose probability is equal to the sum of its two children's probabilities
Huffman coding - II • The operation is then repeated, considering the new parent node as well and ignoring its children • The process continues until only one parent node, with probability 1, is left: it is the root of the tree • Then the two branches of every non-leaf node are labeled 0 and 1 (typically 0 on the left branch, but the order is not important); a Python sketch follows
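A minimal sketch of this bottom-up construction, using Python's heapq as the priority queue; merging codeword dictionaries instead of keeping explicit tree nodes is my simplification:

```python
import heapq
import itertools

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    tie = itertools.count()  # tie-breaker so the heap never compares the dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # The two subtrees with the smallest probabilities become siblings
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        for s in left:
            left[s] = "0" + left[s]    # left branch labeled 0
        for s in right:
            right[s] = "1" + right[s]  # right branch labeled 1
        heapq.heappush(heap, (p1 + p2, next(tie), {**left, **right}))
    return heap[0][2]

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman(probs)
print(sum(probs[s] * len(c) for s, c in codes.items()))  # L = 2.6 bits
```

On the distribution of the example below this gives an expected length of 2.6 bits, matching the slides.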
Huffman coding - example • [the slide's Huffman tree figure is omitted here: leaves a 0.05, b 0.05, c 0.1, d 0.2, e 0.3, f 0.2, g 0.1; internal nodes 0.1, 0.2, 0.3, 0.4, 0.6, 1.0; each branch pair labeled 0/1]
Huffman coding - example • Exercise: evaluate H(X) and L(X) • H(X) = 2.5464 bits, L(X) = 2.6 bits!
Huffman coding - exercise • Code the sequence aeebcddegfced and calculate the compression ratio • Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01 • Average original symbol length = 3 bits • Average compressed symbol length = 34/13 bits • C = 3/(34/13) = 39/34 ≈ 1.15
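A quick check of this exercise; the code table below is read off the solution string above (a = 0000, b = 0001, c = 001, d = 01, e = 10, f = 110, g = 111):

```python
# Code table read off the solution string of the exercise above
codes = {"a": "0000", "b": "0001", "c": "001", "d": "01",
         "e": "10", "f": "110", "g": "111"}
msg = "aeebcddegfced"
encoded = "".join(codes[s] for s in msg)
print(len(encoded))                    # 34 bits
print(3 / (len(encoded) / len(msg)))   # C = 39/34, about 1.15
```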
Huffman coding - exercise • Decode the sequence 0111001001000001111110 • Sol: dfdcadgf
Huffman coding - exercise • Encode with Huffman the sequence 01$cc0a02ba10 and evaluate entropy, average codeword length and compression ratio
Huffman coding - exercise • Decode (if possible) the Huffman coded bit stream 01001011010011110101...
Huffman coding - notes • In Huffman coding, if at any time there is more than one way to choose the two smallest probabilities, any such pair may be chosen • Sometimes the list of probabilities is initialized to be non-increasing and reordered after each node creation; this detail doesn't affect the correctness of the algorithm, but it allows a more efficient implementation
Huffman coding - notes • There are cases in which Huffman coding does not uniquely determine the codeword lengths, due to the arbitrary choice among equal minimum probabilities • For example, a source with probabilities (0.4, 0.2, 0.2, 0.1, 0.1) can yield codeword lengths (1, 3, 3, 3, 3) or (2, 2, 2, 3, 3), both with L = 2.2 bits • It would be better to have the code whose codelengths have minimum variance, as this solution needs the minimum buffer space in the transmitter and in the receiver
Huffman coding - notes • Schwarz defines a variant of the Huffman algorithm that allows one to build the code with minimum codelength variance • There are several other variants; we will explain the most important ones in a while
Optimality of Huffman coding - I • It is possible to prove that, in the case of character coding (one symbol, one codeword), Huffman coding is optimal • In other terms, the Huffman code has minimum redundancy • An upper bound for the redundancy has been found: $R \le p_1 + 0.086$, where $p_1$ is the probability of the most likely symbol
Optimality of Huffman coding - II • Why does the Huffman code "suffer" when there is one symbol with very high probability? • Remember the notion of uncertainty... the main problem is given by the integer constraint on codelengths!! • This consideration opens the way to a more powerful coding... we will see it later
Huffman coding - implementation • A Huffman code can be generated in O(n) time, where n is the number of source symbols, provided that the probabilities have been presorted (however, this sort costs O(n log n)...); a sketch of the O(n) construction follows • In any case, encoding is very fast
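A sketch of the classic two-queue trick that achieves O(n) once the probabilities are sorted; for brevity it computes only the expected codeword length (the sum of the internal-node weights), and all names are mine:

```python
from collections import deque

def huffman_sorted(probs_sorted):
    """probs_sorted: probabilities in non-decreasing order.
    Returns the expected codeword length of an optimal code."""
    leaves = deque(probs_sorted)
    internal = deque()  # internal-node weights are produced in non-decreasing order

    def pop_smallest():
        # Compare the heads of the two queues; no heap needed.
        if not internal or (leaves and leaves[0] <= internal[0]):
            return leaves.popleft()
        return internal.popleft()

    total = 0.0
    while len(leaves) + len(internal) > 1:
        w = pop_smallest() + pop_smallest()
        internal.append(w)
        total += w  # expected length = sum of internal-node weights
    return total

print(huffman_sorted(sorted([0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1])))  # 2.6
```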
Huffman coding - implementation • However, the spatial and temporal complexity of the decoding phase is far more important, because, on average, decoding happens more frequently • Consider a Huffman tree with n symbols • it has n leaves and n-1 internal nodes • each leaf stores a pointer to its symbol and a flag marking it as a leaf • each internal node stores two child pointers
Huffman coding - implementation • 1 million symbols → about 16 MB of memory! (2n-1 ≈ 2 million nodes, each holding two 4-byte pointers) • Moreover, traversing a tree from root to leaf involves following a lot of pointers, with little locality of reference; this causes several page faults or cache misses • To solve this problem a variant of Huffman coding has been proposed: canonical Huffman coding
canonical Huffman coding - I • [the slide's tree figure is omitted here: leaves a 0.11, b 0.12, c 0.13, d 0.14, e 0.24, f 0.26; internal nodes 0.23, 0.27, 0.47, 0.53, 1.0; each branch carries both a 0/1 label and, in parentheses, the flipped label]
canonical Huffman coding - II • This code cannot be obtained through a Huffman tree! • We still call it a Huffman code because it is instantaneous and the codeword lengths are the same as those of a valid Huffman code • numerical sequence property • codewords with the same length are ordered lexicographically • when the codewords are sorted in lexical order they are also in order from the longest to the shortest codeword
canonical Huffman coding - III • The main advantage is that it is not necessary to store a tree in order to decode • We need • a list of the symbols, ordered according to the lexical order of the codewords • an array with the first codeword of each distinct length
canonical Huffman coding - IV • Encoding. Suppose there are n distinct symbols and that for symbol i we have calculated the Huffman codelength l_i • numl[k] = number of codewords with length k • firstcode[k] = integer value of the first code of length k • nextcode[k] = integer value of the next codeword of length k to be assigned • symbol[-,-] = table used for decoding • codeword[i] = integer whose rightmost l_i bits are the code for symbol i
canonical Huffman - example • 1. Evaluate the array numl • 2. Evaluate the array firstcode • 3. Construct the arrays codeword and symbol • [the slide's worked table is omitted here; a sketch of the construction follows]
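The three steps above, sketched in Python. The firstcode recurrence is the usual one for canonical codes; the codelengths in the demo call are an assumption of mine, read off the decoding example two slides ahead (a, e, h of length 2, d of length 3, b and c of length 5):

```python
def canonical_code(lengths):
    """lengths: dict symbol -> Huffman codelength.
    Returns (codeword: symbol -> bitstring, firstcode, symbol table)."""
    maxlen = max(lengths.values())

    # Step 1: numl[k] = number of codewords with length k
    numl = [0] * (maxlen + 2)
    for l in lengths.values():
        numl[l] += 1

    # Step 2: firstcode[k] = integer value of the first code of length k
    firstcode = [0] * (maxlen + 2)          # firstcode[maxlen] = 0
    for k in range(maxlen - 1, 0, -1):
        firstcode[k] = (firstcode[k + 1] + numl[k + 1] + 1) // 2  # ceiling

    # Step 3: assign consecutive integers per length; fill the decoding table
    nextcode = firstcode[:]
    codeword, symbol = {}, {}
    for s in sorted(lengths):               # fixed symbol order within a length
        k = lengths[s]
        codeword[s] = format(nextcode[k], f"0{k}b")
        symbol[(k, nextcode[k] - firstcode[k])] = s
        nextcode[k] += 1
    return codeword, firstcode, symbol

cw, fc, sym = canonical_code({"a": 2, "e": 2, "h": 2, "d": 3, "b": 5, "c": 5})
print(cw)  # a:01 e:10 h:11 d:001 b:00000 c:00001
```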
canonical Huffman coding - V • Decoding. We have the arrays firstcode and symbol • nextinputbit() = function that returns the next input bit • firstcode[k] = integer value of the first code of length k • symbol[k,n] = the symbol whose codeword is the n-th among those of length k
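A sketch of this decoding loop in Python, reading from a bit string instead of nextinputbit() and reusing the firstcode and symbol structures built by the canonical_code sketch above; the test bit stream is the one decoded on the next slide:

```python
def decode(bits, firstcode, symbol):
    """bits: string of '0'/'1' characters. Returns the decoded string."""
    out, i = [], 0
    while i < len(bits):
        v, k = int(bits[i]), 1
        i += 1
        # Extend the code one bit at a time; a codeword of length k is
        # complete as soon as its integer value reaches firstcode[k].
        while v < firstcode[k]:
            v = 2 * v + int(bits[i])
            i += 1
            k += 1
        out.append(symbol[(k, v - firstcode[k])])
    return "".join(out)

cw, fc, sym = canonical_code({"a": 2, "e": 2, "h": 2, "d": 3, "b": 5, "c": 5})
print(decode("00111100000001001", fc, sym))  # dhebad
```

Note that no tree is traversed: one integer comparison per input bit plus one table lookup per symbol.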
canonical Huffman - example • Decode the bit stream 00111100000001001 • symbol[3,0] = d • symbol[2,2] = h • symbol[2,1] = e • symbol[5,0] = b • symbol[2,0] = a • symbol[3,0] = d • Decoded: dhebad