Data Compression Basics & Huffman Coding • Motivation of Data Compression • Lossless and Lossy Compression Techniques • Static Lossless Compression: Huffman Coding • Correctness of Huffman Coding: the Prefix Property
Why Data Compression? • Data storage and transmission cost money, and this cost grows with the amount of data. • The cost can be reduced by processing the data so that it takes less memory and less transmission time. • Data transmission can be made faster by using better transmission media or by compressing the data. • Data compression algorithms reduce the size of the data without affecting its content. Examples: Huffman coding, Run-Length coding, Lempel-Ziv coding.
Lossless and Lossy Compression Techniques • Data compression techniques are broadly classified into lossless and lossy. • Lossless techniques allow exact reconstruction of the original document from the compressed information, while lossy techniques do not. • Run-Length, Huffman and Lempel-Ziv coding are lossless, while JPEG and MPEG are lossy techniques. • Lossy techniques usually achieve higher compression ratios than lossless ones, but only lossless techniques reproduce the original data exactly.
Lossless and Lossy Compression Techniques (cont'd) • Lempel-Ziv reads variable-sized input strings and outputs fixed-length codes, while Huffman coding is the exact opposite: it reads fixed-size symbols and outputs variable-length codes. • Lossless techniques are classified into static and adaptive. • In a static scheme, like Huffman coding, the data is first scanned to obtain statistical information before compression begins. • Adaptive schemes like Lempel-Ziv begin with an initial statistical distribution of the text symbols and modify this distribution as each character or word is encoded. • Adaptive schemes fit the text more closely, but static schemes involve less computation and are faster.
Introduction to Huffman Coding • What is the likelihood that all symbols in a message to be transmitted occur the same number of times? • Huffman coding assigns codewords of different lengths to characters based on their frequency of occurrence in the given message: the more frequent a character, the shorter its codeword. • The string to be transmitted is first analysed to find the relative frequencies of its constituent characters. • The coding process generates a binary tree, the Huffman code tree, with branches labeled with the bits 0 and 1. • The Huffman tree must be sent with the compressed information to enable the receiver to decode the message.
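The frequency analysis itself is a single pass over the message. A minimal sketch in Java (the class and method names here are illustrative, not part of the slides):

```java
import java.util.HashMap;
import java.util.Map;

public class FrequencyCounter {
    // Count how many times each character occurs in the message.
    public static Map<Character, Integer> countFrequencies(String message) {
        Map<Character, Integer> frequencies = new HashMap<>();
        for (char c : message.toCharArray()) {
            frequencies.merge(c, 1, Integer::sum);   // start at 1, or add 1 to the current count
        }
        return frequencies;
    }
}
```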
Example 1: Huffman Coding • Example 1: Information to be transmitted over the internet contains the following characters with their associated frequencies: a: 45, e: 65, l: 13, n: 45, o: 18, s: 22, t: 53. Use the Huffman technique to answer the following questions: • Build the Huffman code tree for the message. • Use the Huffman tree to find the codeword for each character. • If the data consists of only these characters, what is the total number of bits to be transmitted? What is the percentage saving compared with sending the data as uncompressed 8-bit ASCII values? • Verify that your computed Huffman codewords are correct.
Example 1: Huffman Coding (Solution) • Solution: The Huffman coding process builds binary trees using a priority queue keyed on the frequencies. • We begin by filling the priority queue with one-node binary trees, each containing a frequency count and the symbol with that frequency. • The initial priority queue is built by arranging the one-node binary trees in increasing order of frequency. • The tree with the lowest frequency is at the front of the queue. • At each step, the priority queue is manipulated as outlined next:
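A minimal sketch of this initialization, assuming a simple tree-node class; the names HuffmanNode, HuffmanQueue and initialQueue are illustrative, not from the slides. Java's PriorityQueue keeps its smallest element at the front, which matches the queue described above:

```java
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative node type: a leaf stores a symbol, an internal node only a combined frequency.
class HuffmanNode implements Comparable<HuffmanNode> {
    final Character symbol;        // null for internal nodes
    final int frequency;           // the priority of this (sub)tree
    final HuffmanNode left, right;

    HuffmanNode(Character symbol, int frequency, HuffmanNode left, HuffmanNode right) {
        this.symbol = symbol;
        this.frequency = frequency;
        this.left = left;
        this.right = right;
    }

    // Lower frequency means earlier removal, so the smallest tree sits at the front of the queue.
    @Override
    public int compareTo(HuffmanNode other) {
        return Integer.compare(this.frequency, other.frequency);
    }
}

class HuffmanQueue {
    // Fill a min-priority queue with one-node trees, one per (symbol, frequency) pair.
    static PriorityQueue<HuffmanNode> initialQueue(Map<Character, Integer> frequencies) {
        PriorityQueue<HuffmanNode> queue = new PriorityQueue<>();
        frequencies.forEach((symbol, count) ->
                queue.add(new HuffmanNode(symbol, count, null, null)));
        return queue;
    }
}
```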
Example 1: Huffman Coding (Solution) • The priority queue is manipulated as follows: • 1. Dequeue the two trees at the front of the queue. • 2. Construct a new binary tree using the two dequeued trees as its left and right subtrees. • 3. Enqueue the new tree, using as its priority the sum of the priorities of the two trees used to construct it. • 4. Continue this process until only one tree remains in the priority queue.
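A sketch of this loop, reusing the HuffmanNode type from the previous sketch (again, the names are illustrative, not part of the slides):

```java
import java.util.PriorityQueue;

class HuffmanBuilder {
    // Repeatedly merge the two lowest-frequency trees until a single tree remains.
    static HuffmanNode buildTree(PriorityQueue<HuffmanNode> queue) {
        while (queue.size() > 1) {
            HuffmanNode first = queue.poll();    // step 1: dequeue the two front trees
            HuffmanNode second = queue.poll();
            // step 2: make them the subtrees of a new tree whose priority is the sum
            HuffmanNode merged = new HuffmanNode(null, first.frequency + second.frequency,
                                                 first, second);
            queue.add(merged);                   // step 3: enqueue the new tree
        }
        return queue.poll();                     // step 4: the remaining tree is the Huffman tree
    }
}
```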
Example 1: Huffman Coding • Step 1 (initial queue, front first): l: 13, o: 18, s: 22, n: 45, a: 45, t: 53, e: 65
Example 1 Solution (cont'd) • Step 2: dequeue l and o and merge them into a tree of weight 13 + 18 = 31; queue: s: 22, 31, n: 45, a: 45, t: 53, e: 65
Example 1 Solution (cont'd) • Step 3: dequeue s and the 31 tree and merge them into a tree of weight 53; queue: n: 45, a: 45, 53, t: 53, e: 65
Example 1 Solution (cont'd) • Step 4: dequeue n and a and merge them into a tree of weight 90; queue: 53, t: 53, e: 65, 90
Example 1 Solution (cont'd) • Step 5: dequeue the 53 tree and t and merge them into a tree of weight 106; queue: e: 65, 90, 106
Example 1 Solution (cont'd) • Step 6: dequeue e and the 90 tree and merge them into a tree of weight 155; queue: 106, 155
Example 1 Solution (cont'd) • Step 7: merge the 106 and 155 trees into the final Huffman tree of weight 261.
Example 1 Solution (cont'd) • The final tree: the root (261) has subtrees 106 and 155; the 106 subtree joins the 53 tree (s together with the l-o pair of weight 31) and t, while the 155 subtree joins e and the 90 tree (n and a). • Finally, label the two branches leaving every internal node with 0 and 1; the codeword of a character is the sequence of bit labels on the path from the root down to its leaf.
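Running the earlier sketches on the frequencies of Example 1 reproduces this sequence of merges. The hypothetical driver below only checks that the final tree has the expected total weight of 261; ties between equal frequencies may be broken differently by the priority queue, but the total bit count of the encoding is unaffected:

```java
import java.util.Map;
import java.util.PriorityQueue;

public class Example1 {
    public static void main(String[] args) {
        // The frequencies from Example 1.
        Map<Character, Integer> frequencies = Map.of(
                'a', 45, 'e', 65, 'l', 13, 'n', 45, 'o', 18, 's', 22, 't', 53);
        PriorityQueue<HuffmanNode> queue = HuffmanQueue.initialQueue(frequencies);
        HuffmanNode root = HuffmanBuilder.buildTree(queue);
        System.out.println("Weight of the final tree: " + root.frequency);   // prints 261
    }
}
```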
Example 1 Solution (cont'd) • The sequence of 0s and 1s on the arcs along the path from the root to each leaf is the codeword of that character (a, e, l, n, o, s, t). Reading the tree, e and t get 2-bit codewords, a, n and s get 3-bit codewords, and l and o get 4-bit codewords. • If we assume the message consists of only the characters a, e, l, n, o, s and t, then the number of bits transmitted is: 2*65 + 2*53 + 3*45 + 3*45 + 3*22 + 4*18 + 4*13 = 696 bits. • If the message is sent uncompressed with an 8-bit ASCII representation for each character, we need 261*8 = 2088 bits, i.e. compression saves about 67% of the transmission time.
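A sketch of reading the codewords off the tree and re-deriving the 696-bit total, assuming (as one possible convention) that left branches are labeled 0 and right branches 1; the class and method names are again illustrative:

```java
import java.util.HashMap;
import java.util.Map;

class CodewordTable {
    // Collect the codeword of every character by accumulating branch labels from the root.
    static Map<Character, String> codewords(HuffmanNode root) {
        Map<Character, String> table = new HashMap<>();
        collect(root, "", table);
        return table;
    }

    private static void collect(HuffmanNode node, String path, Map<Character, String> table) {
        if (node.symbol != null) {                 // leaf: the accumulated path is the codeword
            table.put(node.symbol, path);
        } else {                                   // internal node: label left 0, right 1
            collect(node.left, path + "0", table);
            collect(node.right, path + "1", table);
        }
    }

    // Total bits for the whole message: sum over characters of codeword length * frequency.
    static int totalBits(Map<Character, String> codes, Map<Character, Integer> frequencies) {
        int bits = 0;
        for (Map.Entry<Character, Integer> entry : frequencies.entrySet()) {
            bits += codes.get(entry.getKey()).length() * entry.getValue();
        }
        return bits;                               // 696 for Example 1, versus 261 * 8 = 2088
    }
}
```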
Example 1 Solution: The Prefix Property • Data encoded using Huffman coding is uniquely decodable. This is because Huffman codes satisfy an important property called the prefix property. • This property guarantees that no codeword is a prefix of another Huffman codeword. • For example, 10 and 101 cannot simultaneously be valid Huffman codewords, because the first is a prefix of the second. • Thus, a bit stream produced by a Huffman encoder can be decoded in only one way. • The codewords we generated from the preceding tree are valid Huffman codewords: every character sits at a leaf, so no codeword can be the beginning of another.
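Because no codeword is a prefix of another, a decoder never needs to look ahead or backtrack: it walks down the tree one bit at a time and emits a character whenever it reaches a leaf. A sketch, reusing the HuffmanNode type and the same assumed 0/1 labeling:

```java
class HuffmanDecoder {
    // Decode a string of '0'/'1' characters by walking the tree; every leaf reached emits one character.
    static String decode(HuffmanNode root, String bits) {
        StringBuilder message = new StringBuilder();
        HuffmanNode node = root;
        for (int i = 0; i < bits.length(); i++) {
            node = (bits.charAt(i) == '0') ? node.left : node.right;
            if (node.symbol != null) {     // reached a leaf: emit its character and restart
                message.append(node.symbol);
                node = root;
            }
        }
        if (node != root) {                // the bit stream ended in the middle of a codeword
            throw new IllegalArgumentException("Decoding failed: incomplete final codeword");
        }
        return message.toString();
    }
}
```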
Exercises • Using the Huffman tree constructed in this session, decode the following sequence of bits, if possible. Otherwise, indicate where the decoding fails: 10100010111010001000010011 • Using the Huffman tree constructed in this session, write the bit sequences that encode the messages: test, state, telnet, notes • Mention one disadvantage of a lossless compression scheme and one disadvantage of a lossy compression scheme. • Write a Java program that implements the Huffman coding algorithm.