Learn about Huffman trees and data compression, optimize encoding, implement Huffman algorithm using Min-Heap, and create an optimal encoding tree.
Data Compression and Huffman Trees (HW 4) • Data Structures, Fall 2008 • Modified by Eugene Weinstein
Representing Text (ASCII) • ASCII is a way of representing characters as bits. • Characters include ‘a’, ‘b’, ‘1’, ‘%’, ‘@’, ‘\n’, ‘\t’… • Each character is represented by a unique 7-bit code, so there are 2^7 = 128 possible characters. • This is a FIXED-LENGTH (static-length) ENCODING: every character costs the same number of bits. • To encode a long text, we encode it character by character.
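The fixed-length scheme above can be sketched in a few lines of Java. The class name `FixedLengthAscii` and the method `encode` are made up for illustration; the sketch assumes the input contains only 7-bit ASCII characters and rejects anything else.

```java
// Hypothetical helper, not part of the assignment: fixed-length 7-bit encoding.
class FixedLengthAscii {
    // Encodes each character as exactly 7 bits, most significant bit first.
    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c > 127) {
                throw new IllegalArgumentException("not a 7-bit ASCII character: " + c);
            }
            String code = Integer.toBinaryString(c);
            // Left-pad with zeros so every code is 7 bits long.
            for (int pad = code.length(); pad < 7; pad++) {
                bits.append('0');
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // 'e' is 101 and '&' is 38 in ASCII.
        System.out.println(encode("e&")); // prints 11001010100110
    }
}
```

Every character costs 7 bits here regardless of how often it occurs, so a text of k characters always takes 7k bits.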
Inefficiency of ASCII • Realization: In many natural files, we are much more likely to see the letter ‘e’ than the character ‘&’, yet they are both encoded using 7 bits! • Solution: Use variable length encoding! The encoding for ‘e’ should be shorter than the encoding for ‘&’.
Variable Length Coding • Assume we know the distribution of characters (‘e’ appears 1000 times, ‘&’ appears 1 time). • Frequent characters get short codes and rare characters get long codes (made precise later). • Need a ‘prefix-free’ encoding: if ‘e’ = 001 then we cannot assign ‘&’ to be 0011, because 001 is a prefix of 0011. Since the encoding is variable length, the decoder needs to know when a code word stops, and after reading 001 it could not tell whether it has seen ‘e’ or the start of ‘&’.
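The prefix-free condition can be checked mechanically. The following is a minimal sketch, assuming code words are given as strings of ‘0’ and ‘1’; the class name `PrefixFreeCheck` and method `isPrefixFree` are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the assignment.
class PrefixFreeCheck {
    // Returns true if no code word is a prefix of a different code word.
    // Two identical code words also count as a violation.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixFree(Arrays.asList("001", "0011")));  // prints false
        System.out.println(isPrefixFree(Arrays.asList("0", "10", "11"))); // prints true
    }
}
```

The pairwise check is quadratic in the number of code words, which is fine for an alphabet of at most 128 characters.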
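Encoding Trees • Think of an encoding as an (unbalanced) binary tree. • Data is in leaf nodes only (this makes the code prefix free). • ‘e’ = 0, ‘a’ = 10, ‘b’ = 11 • How to decode ‘01110’? • [Figure: encoding tree. From the root, edge 0 leads to leaf ‘e’ and edge 1 leads to an internal node, whose edge 0 leads to leaf ‘a’ and whose edge 1 leads to leaf ‘b’.]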
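Decoding walks the tree from the root, following one edge per bit and emitting a character each time a leaf is reached. Below is a minimal sketch of that walk for the three-letter tree on this slide; the names `DecodeDemo`, `Node`, `slideTree`, and `decode` are invented for the example, and the input is assumed to be a string of ‘0’ and ‘1’ characters.

```java
// Hypothetical sketch, not the assignment's HuffmanTree class.
class DecodeDemo {
    static class Node {
        final char symbol;      // meaningful only for leaves
        final Node zero, one;   // children reached by bit 0 and bit 1

        Node(char symbol) { this.symbol = symbol; this.zero = null; this.one = null; }
        Node(Node zero, Node one) { this.symbol = '\0'; this.zero = zero; this.one = one; }
        boolean isLeaf() { return zero == null && one == null; }
    }

    // The tree for 'e' = 0, 'a' = 10, 'b' = 11.
    static Node slideTree() {
        return new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
    }

    // Assumes the root is an internal node (at least two symbols).
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            if (bit != '0' && bit != '1') {
                throw new IllegalArgumentException("not a bit: " + bit);
            }
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur.isLeaf()) {          // a complete code word has been read
                out.append(cur.symbol);
                cur = root;              // start over for the next code word
            }
        }
        if (cur != root) {
            throw new IllegalArgumentException("input ends in the middle of a code word");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(slideTree(), "01110")); // prints eba
    }
}
```

Because characters sit only in leaves, the decoder never has to look ahead: ‘01110’ splits uniquely as 0 | 11 | 10.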
Cost of a Tree • For each character c_i, let f_i be its frequency in the file. • Given an encoding tree T, let d_i be the depth of c_i in the tree (the number of bits needed to encode the character). • The length of the file after encoding it with the coding scheme defined by T will be C(T) = Σ_i d_i f_i
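As a small worked example of the cost formula, the sketch below evaluates C(T) for the code ‘e’ = 0, ‘a’ = 10, ‘b’ = 11 (depths 1, 2, 2). The frequencies 5, 3, 2 are made up for illustration, as are the names `TreeCost` and `cost`.

```java
// Hypothetical helper, not part of the assignment.
class TreeCost {
    // C(T) = sum over all characters of depth * frequency.
    static long cost(int[] depths, long[] freqs) {
        if (depths.length != freqs.length) {
            throw new IllegalArgumentException("need one depth per frequency");
        }
        long total = 0;
        for (int i = 0; i < depths.length; i++) {
            total += depths[i] * freqs[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] depths = {1, 2, 2};   // e, a, b
        long[] freqs = {5, 3, 2};   // made-up frequencies
        System.out.println(cost(depths, freqs)); // prints 15 (1*5 + 2*3 + 2*2)
        System.out.println(7 * (5 + 3 + 2));     // prints 70, the 7-bit fixed-length cost
    }
}
```

With these numbers the variable-length code needs 15 bits where 7-bit ASCII would need 70.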
Creating an Optimal T • Problem: Find a tree T with C(T) minimal. • Solution (Huffman 1952): • Create a single-node tree for each character. The weight of the tree W(T) is the frequency of the character. • Repeat n-1 times (n = number of chars): • Select and remove the two trees T’, T’’ with the lowest weights. Merge them under a new root to form T. • Set W(T) = W(T’) + W(T’’) and put T back among the remaining trees. • Implement using a min-heap. • What is the running time?
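The merging loop can be sketched as follows. This is not the course's reference solution: it uses `java.util.PriorityQueue` as a stand-in for the `BinaryHeap` class given in class, and the names `HuffmanSketch`, `Node`, `build`, and `codes` are invented for the example. Ties between equal weights may be broken either way, so the exact bit patterns can differ between runs of different implementations, while the code lengths' total cost stays the same.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical sketch; PriorityQueue stands in for the course's BinaryHeap.
class HuffmanSketch {
    static class Node implements Comparable<Node> {
        final char letter;       // meaningful only for leaves
        final long weight;
        final Node left, right;

        Node(char letter, long weight) {
            this.letter = letter; this.weight = weight; this.left = null; this.right = null;
        }
        Node(Node left, Node right) {   // merged tree: W(T) = W(T') + W(T'')
            this.letter = '\0'; this.weight = left.weight + right.weight;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        @Override public int compareTo(Node other) { return Long.compare(weight, other.weight); }
    }

    static Node build(Map<Character, Long> freqs) {
        if (freqs.isEmpty()) throw new IllegalArgumentException("no characters");
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : freqs.entrySet()) {
            heap.add(new Node(e.getKey(), e.getValue()));
        }
        while (heap.size() > 1) {          // runs n-1 times
            Node a = heap.poll();          // lowest weight
            Node b = heap.poll();          // second lowest weight
            heap.add(new Node(a, b));
        }
        return heap.poll();
    }

    static Map<Character, String> codes(Map<Character, Long> freqs) {
        Map<Character, String> out = new TreeMap<>();
        collect(build(freqs), "", out);
        return out;
    }

    private static void collect(Node n, String prefix, Map<Character, String> out) {
        if (n.isLeaf()) {
            // A one-character alphabet still needs one bit per character.
            out.put(n.letter, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        collect(n.left, prefix + "0", out);
        collect(n.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, Long> freqs = new HashMap<>();
        freqs.put('e', 5L); freqs.put('a', 3L); freqs.put('b', 2L); // made-up frequencies
        System.out.println(codes(freqs)); // 'e' gets 1 bit, 'a' and 'b' get 2 bits each
    }
}
```

With a binary heap, each of the n initial insertions and each of the n-1 iterations (two removals and one insertion) takes O(log n) time, so the whole construction runs in O(n log n).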
Optimality Intuition • Need to show that Huffman’s algorithm indeed results in a tree T with optimal C(T) = Σ_i d_i f_i. • The two least-weight letters should be at the deepest level as siblings (otherwise we could lower the cost by swapping them with whichever letters are there). • Intuitively, when we combine two trees we can think of the result as a new letter with the combined weight.
Homework • Implement: • public class HuffmanTree • Has traversal/code printing method • public class HuffmanNode (Comparable) • Contains letter, integer frequency • Has accessor (getter) methods • public class BinaryHeap (given in class) • Read a file ‘huff.txt’ which includes letters and frequencies: • A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2 • Create a Huffman Tree, algorithm: book 389-395 • Print “legend”: the code of each character
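Reading the letter/frequency pairs can be sketched with `java.util.Scanner`. The sketch assumes the file consists of whitespace-separated tokens alternating between a single letter and an integer, as in the sample data above; the names `FrequencyReader` and `parse` are hypothetical. It parses from a `Scanner`, so the same method works on a string (as in `main` below) or on a file via `new Scanner(new java.io.File("huff.txt"))`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper; assumes "letter integer letter integer ..." input.
class FrequencyReader {
    static Map<Character, Integer> parse(Scanner in) {
        // LinkedHashMap keeps the letters in the order they appear in the input.
        Map<Character, Integer> freqs = new LinkedHashMap<>();
        while (in.hasNext()) {
            String letter = in.next();
            if (letter.length() != 1 || !in.hasNextInt()) {
                throw new IllegalArgumentException("expected a single letter followed by an integer");
            }
            freqs.put(letter.charAt(0), in.nextInt());
        }
        return freqs;
    }

    public static void main(String[] args) {
        String sample = "A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2";
        Map<Character, Integer> freqs = parse(new Scanner(sample));
        System.out.println(freqs.size() + " letters, E appears " + freqs.get('E') + " times");
    }
}
```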
Tips and Implementation Notes • HuffmanNode should be Comparable to work with BinaryHeap • How to implement compareTo method? • Implement toString method in BinaryHeap • Print heap after every rearrangement • Understand binary heap operations: • insert • deleteMin
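One possible shape for the comparison, offered as a sketch and not as the required solution: order nodes by frequency so that a min-heap hands back the lightest tree first. The field names, the constructor, and `toString` format are assumptions; a full solution would also need child links for internal nodes.

```java
// Sketch only: field names and constructor are assumptions, child links omitted.
class HuffmanNode implements Comparable<HuffmanNode> {
    private final char letter;
    private final int frequency;

    HuffmanNode(char letter, int frequency) {
        this.letter = letter;
        this.frequency = frequency;
    }

    public char getLetter() { return letter; }
    public int getFrequency() { return frequency; }

    // Negative if this node is lighter, zero if equal, positive if heavier.
    // Integer.compare avoids the overflow risk of "frequency - other.frequency".
    @Override
    public int compareTo(HuffmanNode other) {
        return Integer.compare(frequency, other.frequency);
    }

    @Override
    public String toString() {
        return letter + ":" + frequency;
    }

    public static void main(String[] args) {
        HuffmanNode v = new HuffmanNode('V', 1);
        HuffmanNode w = new HuffmanNode('W', 2);
        System.out.println(v.compareTo(w) < 0); // prints true: V is lighter than W
        System.out.println(v);                  // prints V:1
    }
}
```

A readable `toString` on the node also makes the suggested "print heap after every rearrangement" output easier to follow.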