Data Compression and Huffman Trees (HW 4) Data Structures Fall 2008 Modified by Eugene Weinstein

Representing Text (ASCII)
• A way of representing characters as bits.
• Characters are ‘a’, ‘b’, ‘1’, ‘%’, ‘@’, ‘\n’, ‘\t’, …
• Each character is represented by a unique 7-bit code, so there are 128 possible characters.
• This is a fixed-length (static-length) encoding.
• To encode a long text, we encode it character by character.
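As a small illustration of fixed-length encoding, the sketch below writes each character as exactly 7 bits. The class and method names (FixedLengthDemo, encode) are made up for this example and are not part of the assignment.

```java
class FixedLengthDemo {
    // Encode each character as exactly 7 bits: its ASCII code,
    // left-padded with zeros. Rejects anything outside 0..127.
    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c > 127) {
                throw new IllegalArgumentException("not an ASCII character: " + c);
            }
            String b = Integer.toBinaryString(c);
            for (int i = b.length(); i < 7; i++) {
                bits.append('0');
            }
            bits.append(b);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // 'e' and '&' both cost 7 bits, however often each one occurs.
        System.out.println(encode("e&"));
    }
}
```

Every character costs the same 7 bits, so a text of n characters always takes 7n bits.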

Inefficiency of ASCII
• Realization: in many natural files we are much more likely to see the letter ‘e’ than the character ‘&’, yet both are encoded using 7 bits!
• Solution: use variable-length encoding. The encoding for ‘e’ should be shorter than the encoding for ‘&’.

Variable Length Coding
• Assume we know the distribution of characters (say ‘e’ appears 1000 times and ‘&’ appears once).
• The more frequent a character is, the fewer bits its code should use (made precise later).
• We need a ‘prefix-free’ encoding: no code may be a prefix of another. If ‘e’ = 001, then we cannot assign ‘&’ = 0011. Because the encoding is variable length, the decoder needs to know when each code stops.
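The prefix-free condition can be checked mechanically. The following is a minimal sketch; PrefixFreeCheck and isPrefixFree are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

class PrefixFreeCheck {
    // Returns true when no code word is a prefix of a different code word.
    // Two identical code words also count as a violation.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 001 is a prefix of 0011, so this assignment is not allowed.
        System.out.println(isPrefixFree(Arrays.asList("001", "0011")));
        // No code here is a prefix of another.
        System.out.println(isPrefixFree(Arrays.asList("0", "10", "11")));
    }
}
```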

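Encoding Trees
• Think of the encoding as an (unbalanced) tree.
• Data is in leaf nodes only (prefix free).
• ‘e’ = 0, ‘a’ = 10, ‘b’ = 11
• How to decode ‘01110’?
(Tree diagram: from the root, edge 0 leads to leaf ‘e’ and edge 1 leads to an internal node; from that node, edge 0 leads to leaf ‘a’ and edge 1 leads to leaf ‘b’.)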
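Decoding walks the tree from the root, following one edge per bit and restarting at the root whenever a leaf is reached. The sketch below builds the three-letter tree from this slide by hand; DecodeDemo and its Node class are illustrative names, not the HuffmanNode class from the homework.

```java
class DecodeDemo {
    static class Node {
        final char symbol;      // meaningful only in leaves
        final Node zero, one;   // children reached by bit 0 and bit 1

        Node(char symbol) { this.symbol = symbol; this.zero = null; this.one = null; }
        Node(Node zero, Node one) { this.symbol = '\0'; this.zero = zero; this.one = one; }
        boolean isLeaf() { return zero == null && one == null; }
    }

    // Follow one edge per bit; emit a symbol and restart at each leaf.
    // Assumes the root is an internal node (at least two symbols).
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root;
            }
        }
        if (cur != root) {
            throw new IllegalArgumentException("input ends in the middle of a code");
        }
        return out.toString();
    }

    // The tree from the slide: e = 0, a = 10, b = 11.
    static Node exampleTree() {
        return new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
    }

    public static void main(String[] args) {
        System.out.println(decode(exampleTree(), "01110")); // prints "eba"
    }
}
```

Because data sits only in leaves, the decoder never has to look ahead: 0 gives ‘e’, 11 gives ‘b’, 10 gives ‘a’.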

Cost of a Tree
• For each character c_i, let f_i be its frequency in the file.
• Given an encoding tree T, let d_i be the depth of c_i in the tree (the number of bits needed to encode the character).
• The length of the file after encoding it with the coding scheme defined by T will be C(T) = Σ_i d_i f_i.
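A quick worked example of the cost formula, using the tree ‘e’ = 0, ‘a’ = 10, ‘b’ = 11 (depths 1, 2, 2) and made-up frequencies of 5, 2 and 1. The frequencies and the names TreeCost and cost are assumptions for illustration only.

```java
class TreeCost {
    // C(T) = sum over all characters of depth * frequency.
    static long cost(int[] depths, int[] freqs) {
        if (depths.length != freqs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        long total = 0;
        for (int i = 0; i < depths.length; i++) {
            total += (long) depths[i] * freqs[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] depths = {1, 2, 2};   // e, a, b
        int[] freqs  = {5, 2, 1};   // hypothetical counts
        // 1*5 + 2*2 + 2*1 = 11 bits, versus 7 * 8 = 56 bits in 7-bit ASCII.
        System.out.println(cost(depths, freqs)); // prints 11
    }
}
```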

Creating an Optimal T
• Problem: find a tree T with minimal C(T).
• Solution (Huffman 1952):
  • Create a tree for each character. The weight W(T) of the tree is the frequency of the character.
  • Repeat n-1 times (n = number of characters):
    • Select the two trees T’, T’’ with the lowest weights and merge them to form T.
    • Set W(T) = W(T’) + W(T’’).
• Implement using a min-heap.
• What is the running time?
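The merge loop above can be sketched as follows. This is only one possible shape, not the assignment's required design: java.util.PriorityQueue stands in for the course's BinaryHeap (whose API is not shown here), and HuffmanSketch, Node, build and codes are hypothetical names.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class HuffmanSketch {
    static class Node implements Comparable<Node> {
        final char symbol;       // meaningful only in leaves
        final int weight;        // frequency, or sum of the children's weights
        final Node left, right;

        Node(char symbol, int weight) {
            this.symbol = symbol; this.weight = weight; this.left = null; this.right = null;
        }
        Node(Node left, Node right) {
            this.symbol = '\0'; this.weight = left.weight + right.weight;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }

        // Order nodes by weight so the heap hands back the lightest tree first.
        @Override public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    // One single-node tree per character, then n-1 merges of the two lightest trees.
    static Node build(Map<Character, Integer> freqs) {
        if (freqs.isEmpty()) throw new IllegalArgumentException("no characters given");
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freqs.entrySet()) {
            heap.add(new Node(e.getKey(), e.getValue()));
        }
        while (heap.size() > 1) {
            Node a = heap.poll();
            Node b = heap.poll();
            heap.add(new Node(a, b));   // W(T) = W(T') + W(T'')
        }
        return heap.poll();
    }

    // Collect the code of every leaf: 0 for a left edge, 1 for a right edge.
    static Map<Character, String> codes(Node root) {
        Map<Character, String> out = new TreeMap<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Node n, String prefix, Map<Character, String> out) {
        if (n.isLeaf()) {
            out.put(n.symbol, prefix.isEmpty() ? "0" : prefix); // lone-symbol edge case
            return;
        }
        collect(n.left, prefix + "0", out);
        collect(n.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freqs = new TreeMap<>();
        freqs.put('e', 5); freqs.put('a', 2); freqs.put('b', 1); // made-up counts
        System.out.println(codes(build(freqs)));
    }
}
```

Which of two equal-weight trees is picked first, and which child goes left, can change the individual codes but not the total cost C(T).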

Optimality Intuition
• We need to show that Huffman’s algorithm indeed results in a tree T with optimal cost C(T) = Σ_i d_i f_i.
• The two least-weight letters should be at the bottom of the tree as siblings (otherwise we could improve the cost by swapping).
• Intuitively, when we combine two trees we can think of the result as a new letter with the combined weight.

Homework
• Implement:
  • public class HuffmanTree
    • Has a traversal/code-printing method
  • public class HuffmanNode (Comparable)
    • Contains a letter and an integer frequency
    • Has accessor (getter) methods
  • public class BinaryHeap (given in class)
• Read a file ‘huff.txt’ which includes letters and frequencies:
  A 20  E 24  G 3  H 4  I 17  L 6  N 5  O 10  S 8  V 1  W 2
• Create a Huffman tree (algorithm: book 389-395)
• Print the “legend”: the code of each character
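One way to read the letter/frequency pairs is sketched below. It assumes the file is a whitespace-separated sequence of alternating letters and integers, which the slide does not actually specify; FrequencyParser and parse are hypothetical names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

class FrequencyParser {
    // Reads alternating "letter count" tokens separated by any whitespace.
    // Assumed layout; adjust if huff.txt is formatted differently.
    static Map<Character, Integer> parse(Readable source) {
        Map<Character, Integer> freqs = new LinkedHashMap<>();
        Scanner in = new Scanner(source);
        while (in.hasNext()) {
            String letter = in.next();
            if (!in.hasNextInt()) {
                throw new IllegalArgumentException("missing frequency after " + letter);
            }
            freqs.put(letter.charAt(0), in.nextInt());
        }
        return freqs;
    }

    public static void main(String[] args) {
        // A StringReader keeps the demo self-contained; for the real file,
        // pass new java.io.FileReader("huff.txt") instead.
        String sample = "A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2";
        System.out.println(parse(new StringReader(sample)));
    }
}
```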

Tips and Implementation Notes
• HuffmanNode should be Comparable to work with BinaryHeap
• How to implement the compareTo method?
• Implement a toString method in BinaryHeap
• Print the heap after every rearrangement
• Understand binary heap operations:
  • insert
  • deleteMin
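To get a feel for insert and deleteMin, here is a minimal array-based min-heap over plain ints. It is a sketch only and is not the BinaryHeap class given in class; MinHeapSketch and its layout (1-based array, slot 0 unused) are assumptions made for this example.

```java
import java.util.Arrays;

class MinHeapSketch {
    private int[] a = new int[16];  // a[0] is unused; the root lives at a[1]
    private int size = 0;

    // Percolate up: slide parents down until x fits in the hole.
    void insert(int x) {
        if (size + 1 == a.length) {
            a = Arrays.copyOf(a, a.length * 2);
        }
        int hole = ++size;
        while (hole > 1 && x < a[hole / 2]) {
            a[hole] = a[hole / 2];
            hole /= 2;
        }
        a[hole] = x;
    }

    // Percolate down: remove the root, then sink the last element into place.
    int deleteMin() {
        if (size == 0) {
            throw new IllegalStateException("heap is empty");
        }
        int min = a[1];
        int last = a[size--];
        int hole = 1;
        while (hole * 2 <= size) {
            int child = hole * 2;
            if (child < size && a[child + 1] < a[child]) {
                child++;                 // pick the smaller child
            }
            if (a[child] >= last) {
                break;
            }
            a[hole] = a[child];
            hole = child;
        }
        a[hole] = last;
        return min;
    }

    boolean isEmpty() { return size == 0; }

    // Shows the heap in array (level) order, handy for printing after each change.
    @Override public String toString() {
        return Arrays.toString(Arrays.copyOfRange(a, 1, size + 1));
    }

    public static void main(String[] args) {
        MinHeapSketch heap = new MinHeapSketch();
        for (int x : new int[]{5, 3, 8, 1}) {
            heap.insert(x);
            System.out.println("after insert " + x + ": " + heap);
        }
        while (!heap.isEmpty()) {
            System.out.println("deleteMin -> " + heap.deleteMin() + ", heap: " + heap);
        }
    }
}
```

For the homework, the same idea applies with HuffmanNode objects in place of ints, compared through compareTo rather than the < operator.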
