Data Compression and Huffman Trees (HW 4)
Data Structures, Fall 2008
Modified by Eugene Weinstein
Representing Text (ASCII)
• A way of representing characters as bits
• Characters are ‘a’, ‘b’, ‘1’, ‘%’, ‘@’, ‘\n’, ‘\t’, …
• Each character is represented by a unique 7-bit code; there are 128 possible characters.
• STATIC LENGTH ENCODING
• To encode a long text, we encode it character by character.
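For example, the 7-bit ASCII code for ‘a’ is 1100001 and for ‘b’ is 1100010, so the two-character text “ab” becomes the 14 bits 1100001 1100010.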
Inefficiency of ASCII
• Realization: in many natural files we are much more likely to see the letter ‘e’ than the character ‘&’, yet both are encoded using 7 bits!
• Solution: use variable-length encoding! The encoding of ‘e’ should be shorter than the encoding of ‘&’.
Variable-Length Coding
• Assume we know the distribution of characters (e.g., ‘e’ appears 1000 times, ‘&’ appears 1 time).
• Each character will be encoded using a number of bits that is inversely related to its frequency (made precise later).
• We need a ‘prefix free’ encoding: if ‘e’ = 001, then we cannot assign ‘&’ to be 0011. Since the encoding is variable length, we need to know when each codeword stops.
Encoding Trees
• Think of an encoding as an (unbalanced) binary tree.
• Characters sit in leaf nodes only (this is what makes the code prefix free).
• ‘e’ = 0, ‘a’ = 10, ‘b’ = 11
• How would you decode ‘01110’?
[Tree figure: the root’s 0-edge leads to the leaf ‘e’; its 1-edge leads to an internal node whose 0-edge is the leaf ‘a’ and whose 1-edge is the leaf ‘b’.]
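A minimal decoding sketch, assuming a simple node type with left/right children (the class and field names below are illustrative, not part of the homework API): follow 0 to the left child and 1 to the right child, and emit a character each time a leaf is reached.

```java
// Minimal sketch of decoding a bit string by walking an encoding tree.
class Node {
    char letter;        // meaningful only at leaves
    Node left, right;   // left = bit 0, right = bit 1
    Node(char letter) { this.letter = letter; }
    Node(Node left, Node right) { this.left = left; this.right = right; }
    boolean isLeaf() { return left == null && right == null; }
}

class Decoder {
    // Walk from the root, following 0 = left and 1 = right; emit a letter
    // and restart at the root each time a leaf is reached.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur.isLeaf()) {
                out.append(cur.letter);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tree from the slide: 'e' = 0, 'a' = 10, 'b' = 11
        Node root = new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
        System.out.println(decode(root, "01110"));   // prints "eba"
    }
}
```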
Cost of a Tree
• For each character c_i, let f_i be its frequency in the file.
• Given an encoding tree T, let d_i be the depth of c_i in the tree (the number of bits needed to encode that character).
• The length of the file after encoding it with the coding scheme defined by T is C(T) = Σ_i d_i · f_i.
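A quick worked example (the frequencies here are invented purely for illustration): in the tree above, ‘e’ has depth 1 while ‘a’ and ‘b’ have depth 2. If ‘e’ appears 5 times, ‘a’ twice, and ‘b’ once, then C(T) = 1·5 + 2·2 + 2·1 = 11 bits, compared with 2·(5 + 2 + 1) = 16 bits for a fixed 2-bit code over the same three characters.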
Creating an Optimal T
• Problem: find the tree T with minimal C(T).
• Solution (Huffman 1952):
  • Create a one-node tree for each character; the weight W(T) of that tree is the character’s frequency.
  • Repeat n-1 times (n = number of characters):
    • Select the two trees T', T'' with the lowest weights and merge them together to form a new tree T.
    • Set W(T) = W(T') + W(T'').
• Implement using a min-heap (a sketch follows below).
• What is the running time?
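A minimal sketch of the construction in Java. It uses java.util.PriorityQueue as a stand-in for the course’s BinaryHeap (add/poll play the roles of insert/deleteMin), and the HuffNode class here is illustrative rather than the required HuffmanNode API.

```java
import java.util.PriorityQueue;

// Sketch of Huffman's algorithm with a min-heap of partial trees.
class HuffNode implements Comparable<HuffNode> {
    char letter;                 // meaningful only for leaves
    int weight;                  // frequency, or sum of children's weights
    HuffNode left, right;

    HuffNode(char letter, int weight) { this.letter = letter; this.weight = weight; }
    HuffNode(HuffNode left, HuffNode right) {
        this.left = left; this.right = right;
        this.weight = left.weight + right.weight;    // W(T) = W(T') + W(T'')
    }
    public int compareTo(HuffNode other) {           // min-heap orders trees by weight
        return Integer.compare(this.weight, other.weight);
    }
}

class HuffmanBuilder {
    // Start with one leaf per character, then merge the two lightest trees
    // n-1 times; the last remaining tree is the Huffman tree.
    static HuffNode build(char[] letters, int[] freqs) {
        PriorityQueue<HuffNode> heap = new PriorityQueue<>();
        for (int i = 0; i < letters.length; i++)
            heap.add(new HuffNode(letters[i], freqs[i]));
        while (heap.size() > 1) {
            HuffNode t1 = heap.poll();               // lowest weight
            HuffNode t2 = heap.poll();               // second lowest
            heap.add(new HuffNode(t1, t2));          // merged tree
        }
        return heap.poll();                          // root of the Huffman tree
    }
}
```

Since each of the n-1 merges performs two deleteMin operations and one insert, each costing O(log n) on a binary heap, the whole construction runs in O(n log n).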
Optimality Intuition
• We need to show that Huffman’s algorithm indeed produces a tree T with optimal (minimal) C(T) = Σ_i d_i · f_i.
• The two least-weight letters should be siblings at the bottom of the tree (otherwise we could improve the cost by swapping).
• Intuitively, when we combine two trees we can think of the result as a new letter with their combined weight.
Homework
• Implement:
  • public class HuffmanTree
    • Has a traversal / code-printing method
  • public class HuffmanNode (Comparable)
    • Contains a letter and an integer frequency
    • Has accessor (getter) methods
  • public class BinaryHeap (given in class)
• Read a file ‘huff.txt’ which contains letters and their frequencies:
  A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2
• Create a Huffman tree (algorithm: book, pages 389-395)
• Print a “legend”: the code of each character (a sketch follows below)
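One way the file-reading and legend-printing steps might look, reusing the illustrative HuffNode and HuffmanBuilder classes from the sketch above (again, these names are stand-ins, not the required HuffmanTree/HuffmanNode API):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Sketch: read 'huff.txt' (letter/frequency pairs such as "A 20 E 24 ..."),
// build the tree with the merge loop from the previous sketch, and print the
// legend with a recursive traversal ('0' on the left branch, '1' on the right).
class HuffmanLegend {
    static void printLegend(HuffNode node, String code) {
        if (node == null) return;
        if (node.left == null && node.right == null) {      // leaf: print its code
            System.out.println(node.letter + " = " + code);
            return;
        }
        printLegend(node.left, code + "0");
        printLegend(node.right, code + "1");
    }

    public static void main(String[] args) throws Exception {
        List<Character> letters = new ArrayList<>();
        List<Integer> freqs = new ArrayList<>();
        try (Scanner in = new Scanner(new File("huff.txt"))) {
            while (in.hasNext()) {
                letters.add(in.next().charAt(0));            // e.g. 'A'
                freqs.add(in.nextInt());                     // e.g. 20
            }
        }
        char[] cs = new char[letters.size()];
        int[] fs = new int[freqs.size()];
        for (int i = 0; i < cs.length; i++) {
            cs[i] = letters.get(i);
            fs[i] = freqs.get(i);
        }
        printLegend(HuffmanBuilder.build(cs, fs), "");       // the "legend": one code per character
    }
}
```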
Tips and Implementation Notes
• HuffmanNode must be Comparable to work with BinaryHeap
• How do you implement the compareTo method? (One possibility is sketched below.)
• Implement a toString method in BinaryHeap
• Print the heap after every rearrangement
• Understand the binary heap operations:
  • insert
  • deleteMin
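One reasonable compareTo, assuming the node stores its frequency in an int field (the class and field names below are stand-ins for your HuffmanNode):

```java
// Sketch of a frequency-ordered Comparable node. With this ordering, the
// heap's deleteMin returns the least frequent node first, which is exactly
// what the merge loop needs.
class FreqNode implements Comparable<FreqNode> {
    char letter;
    int frequency;

    FreqNode(char letter, int frequency) {
        this.letter = letter;
        this.frequency = frequency;
    }

    public int compareTo(FreqNode other) {
        // Smaller frequency compares as "smaller", so it sits nearer the heap's root.
        return Integer.compare(this.frequency, other.frequency);
    }

    public String toString() {                  // handy when printing the heap after each operation
        return letter + ":" + frequency;
    }

    public static void main(String[] args) {
        FreqNode v = new FreqNode('V', 1), a = new FreqNode('A', 20);
        System.out.println(v.compareTo(a) < 0); // true: 'V' (frequency 1) leaves a min-heap first
    }
}
```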