240 likes | 475 Views
Compression. Compression. Compression ratio: how much is the size reduced? Symmetric/asymmetric: time difference to compress, decompress? Lossless; lossy: any information lost in the process to compress and decompress? Adaptive/static: does the compression dictionary change during processing?.
E N D
Compression • Compression ratio: how much is the size reduced? • Symmetric/asymmetric: time difference to compress, decompress? • Lossless; lossy: any information lost in the process to compress and decompress? • Adaptive/static: does the compression dictionary change during processing?
Run-length encoding • Noticing long runs of repeated data • Lossless, completely reversible • represent the run with a count and a value • Example: • SSSSSVVVVVVVVVVTTTTTTTTTURRRRR • 4S9V8T0U4R • Use count - 1 to maximize use of the range of values available. • Alternative: • Only encode where there is repetition. Use a non-occurring (escape) character to indicate compression begins.
Example for Run Length Encode Assume: the background color is represented by 00000000, face color by 00000001, eye color by 00000010, smile by 00000011 and there are 50 pixels across each row. What would be the encoding with/without run length coding?
Huffman encoding • Statistical encoding • Requires knowledge of relative frequency of elements in the string • Sender and receiver must both know encoding chosen • Create a tree structure that assigns longest representations to most rarely used symbols
Huffman example • First, start with statistics about occurrence of symbols in the text to be compressed. • That assumption might not be right for every message. • Sometimes expressed as percentage, sometimes as relative frequencies • A(5) E(7) S(5) I(4) D(4) G(3) N(2) P(2) R(1) W(1) • We want shorter codes for A, E, longer codes for R, W to minimize the overall message lengths • We are saying that in analysis of a large body of typical text, we find that the occurrence of E is 7 times more common than the occurrence of W, for example
Constructing the Code First, combine the least frequently used symbols The weight (frequency) of the pair (R,W) is 2, of the pair (N,P) is 4 2 4 R(1) W(1) N(2) P(2) Insert the next least frequently used symbol (G, with weight=3) Choose the place to insert the new symbol to minimize the total weights produced. Choices: add 3 to 2 = 5, add 3 to 4 = 7, or add existing 2 to 4 = 6 and make a new subtree for the 3. A(5) E(7) S(5) I(4) D(4) G(3) N(2) P(2) R(1) W(1)
34 19 15 10 5 A(5) E(7) 9 8 2 I(4) D(4) G(3) 4 S(5) R(1) W(1) N(2) P(2) A(5) E(7) S(5) I(4) D(4) G(3) N(2) P(2) R(1) W(1)
34 0 1 19 0 1 15 10 1 0 0 1 5 A(5) E(7) 9 8 0 1 1 0 1 0 2 I(4) D(4) G(3) 4 1 0 S(5) 0 1 R(1) W(1) N(2) P(2) A(5) E(7) S(5) I(4) D(4) G(3) N(2) P(2) R(1) W(1)
E = 11 A = 001 S = 011 I = 101 D = 100 G = 0000 N = 0100 P = 0101 R = 00010 W = 00011 Average code length: A has weight 5 and length 3, etc. 7*2 + 5*3 + 5*3 +4*3 +4*3 + 3*4 + 2*4 + 2*4 +1*5 + 1*5 = 106/34 =3.117 Completed code A(5) E(7) S(5) I(4) D(4) G(3) N(2) P(2) R(1) W(1)
In class exercise • Working in pairs, encode a message of at least 15 letters using the code we just generated. • Do not leave any spaces between the letters in your message. • Pass the message to some other team. • Make sure you give and get a message. • Decode the message you received.
Entropy per symbol Entropy, E, is information content Entropy is inversely proportional to the probability of occurrence E = -∑pi log2 pi i=1,n where n is the number of symbols and pi is the probability of occurrence of the ith symbol This is the lower bound on weighted compression -- the goal to shoot for. How well did we do in our code? 3.098 to our 3.117
Properties of the Huffman code • Variable length code • Prefix property • Average bits per symbol (entropy) • Huffman codes approach the theoretical limit for amount of information per symbol • Static coding. Code must be known by sender and receiver and used consistently
Dynamic Huffman Code • Build the code as the message is transmitted. • The code will be the best for this particular message. • Sender and receiver use the same rules for building the code.
Constructing the tree • Sender and receiver begin with an initial tree consisting of a root node and a left child with a null character and weight = 0 • First character is sent uncompressed and is added to the tree as the right branch from the root. The new node is labeled with the character, its weight is 1 and the tree branch is labeled 1 also. • A list shows the tree entries in order
Example • banana: r Initial tree *(0) r Weight (1) = number of times that character has occurred so far Transmit b *(0) b(1) *(0) b(1) List version of the tree
A new character seen • Whenever a new character appears in the message, it is sent as follows: • send the path to the empty node • send the uncompressed representation of the new character. • Place the new character into the tree and update the list representation. r *(0) a(1) 1 b(1) Null node moves down to make room for the new node as its sibling List is formed by reading the tree left to right, bottom level to top level 1 b(1) *(0) a (1) ba
Another character r 2 b(1) • n 1 a (1) *(0) n(1) 1 a(1) 2 b(1) *(0) n(1) List entries are not in non decreasing order. Adjust the list and show the corresponding tree. r b(1) 2 *(0) n(1) 1 a(1) b(1) 2 1 a (1) *(0) n(1) (Note all left branches are coded as 1, all right branches as 0) ban
Our first repeated character r • a b(1) 3 *(0) n(1) 1 a(2) b(1) 3 1 a (2) *(0) n(1) Again there is a problem. The numbers in the list do not obey the requirement of non decreasing order r a(2) 2 Adjust the list and make the tree match *(0) n(1) 1 b(1) a(2) 2 1 b(1) Note that the 3 changed to a 2 as a result of the tree restructuring. *(0) n(1) bana
Another repeat Code sent for this n will be 101 corresponding to the original position of n. Then the restructuring will be done. r a(2) 3 • n 2 b(1) *(0) n(2) 2 b(1) a(2) 3 *(0) n(2) Another misfit. r b and n must trade places a(2) 3 1 n(2) *(0) b(1) 1 n(2) a(2) 3 *(0) b(1) banan
One more letter • a This a is encoded as 0. No restructuring of the tree is needed. r a(3) 3 1 n(2) *(0) b(1) 1 n(2) a(3) 3 *(0) b(1) banana
In class exercise • Create the dynamic Huffman code for the “message” = Tennessee
Summary • Compression seeks to minimize the amount of transmission by making efficient representations for the data. • Static compression keeps the same codes and depends on consistency in the distribution of characters to code • Dynamic compression adjusts as it works to allow the most efficient compression for the current message.
Some extra resources • Huffman coding resources: http://www.dogma.net/DataCompression/Huffman.shtml • Final note David Huffman died October 7, 1999 at age 74 “Huffman is probably best known for the development of the Huffman Coding Procedure, the result of a term paper he wrote while a graduate student at the Massachusetts Institute of Technology (MIT). "Huffman Codes" are used in nearly every application that involves the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television.” http://www.ucsc.edu/currents/99-00/10-11/huffman.html