Greedy Algorithms (Huffman Coding)

Huffman Coding
• A technique to compress data effectively
  • Usually achieves between 20% and 90% compression
• Lossless compression
  • No information is lost
  • Decompressing the compressed file yields the original file exactly
[Figure: Huffman coding turns the original file into a compressed file]

Huffman Coding: Applications
• Saving space
  • Store compressed files instead of the original files
• Transmitting files or data
  • Send compressed data to save transmission time and power
• Encryption and decryption
  • The compressed file cannot be read without knowing the “key” (the coding tree); this is obfuscation rather than cryptographically secure encryption

Main Idea: Frequency-Based Encoding
• Assume only 6 characters appear in this file
  • E, A, C, T, K, N
• The frequencies are: [frequency table; 14,700 characters in total]
• Option 1 (No compression)
  • Each character = 1 byte (8 bits)
  • Total file size = 14,700 × 8 = 117,600 bits
• Option 2 (Fixed-size compression)
  • We have 6 characters, so we need 3 bits to encode each one (2^3 = 8 ≥ 6)
  • Total file size = 14,700 × 3 = 44,100 bits

Main Idea: Frequency-Based Encoding (Cont’d)
• Same 6 characters (E, A, C, T, K, N) and the same frequencies
• Option 3 (Huffman compression)
  • Variable-length compression
  • Assign shorter codes to more frequent characters and longer codes to less frequent characters
  • Total file size: (10,000 × 1) + (4,000 × 2) + (300 × 3) + (200 × 4) + (100 × 5) + (100 × 5) = 20,700 bits
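The three options can be compared with a few lines of arithmetic. This is a minimal sketch: the (frequency, code length) pairs are read off the Option 3 sum, and the variable names are illustrative only.

```python
# (frequency, Huffman code length in bits) pairs, taken from the Option 3 sum.
freq_and_bits = [(10_000, 1), (4_000, 2), (300, 3), (200, 4), (100, 5), (100, 5)]

total_chars = sum(freq for freq, _ in freq_and_bits)

uncompressed_bits = total_chars * 8   # Option 1: 8 bits per character
fixed_bits = total_chars * 3          # Option 2: 3 bits per character
huffman_bits = sum(freq * bits for freq, bits in freq_and_bits)  # Option 3

print(total_chars, uncompressed_bits, fixed_bits, huffman_bits)
print(f"Huffman size relative to uncompressed: {huffman_bits / uncompressed_bits:.1%}")
```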

Huffman Coding
• A variable-length coding for characters
  • More frequent characters → shorter codes
  • Less frequent characters → longer codes
• Unlike ASCII coding, where all characters have the same code length (8 bits)
• Two main questions
  • How to assign codes (encoding process)?
  • How to decode, i.e., regenerate the original file from the compressed file (decoding process)?

Decoding for fixed-length codes is much easier
• Encoded bits: 010001100110111000
• Divide into groups of 3: 010 001 100 110 111 000
• Decode: C A T K N E
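A short sketch of fixed-length decoding. The code table below is inferred from the example on this slide (010 → C, 001 → A, and so on); the function name is hypothetical.

```python
# 3-bit code table inferred from the slide's example (an assumption).
FIXED_CODES = {"000": "E", "001": "A", "010": "C", "100": "T", "110": "K", "111": "N"}

def decode_fixed(bits, table, width=3):
    """Split the bit string into equal-width chunks and look each one up."""
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    return "".join(table[chunk] for chunk in chunks)

print(decode_fixed("010001100110111000", FIXED_CODES))
```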

Decoding for variable-length codes is not that easy…
• Encoded bits: 000001
• What does it mean? It could be AEC, EAC, or EEEC
• Huffman encoding is guaranteed to avoid this ambiguity: every encoded file has exactly one decoding
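The ambiguity can be reproduced by enumerating every way to split the bit string into codes. The slide does not show the code table, so the one below (E = 0, A = 00, C = 001) is an assumption chosen because it yields exactly the three readings listed; the function name is hypothetical.

```python
# Assumed (not prefix-free) code table: E is a prefix of A, and A is a prefix of C.
AMBIGUOUS_CODES = {"E": "0", "A": "00", "C": "001"}

def all_decodings(bits, codes):
    """Return every string whose encoding is exactly `bits`."""
    if not bits:
        return [""]
    results = []
    for symbol, code in codes.items():
        if bits.startswith(code):
            for rest in all_decodings(bits[len(code):], codes):
                results.append(symbol + rest)
    return results

print(sorted(all_decodings("000001", AMBIGUOUS_CODES)))
```

With a prefix-free table, the same function returns at most one decoding for any input.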

Huffman Algorithm
• Step 1: Get Frequencies
  • Scan the file to be compressed and count the occurrences of each character
  • Sort the characters based on their frequency
• Step 2: Build Tree & Assign Codes
  • Build a Huffman-code tree (binary tree)
  • Traverse the tree to assign codes
• Step 3: Encode (Compress)
  • Scan the file again and replace each character by its code
• Step 4: Decode (Decompress)
  • The Huffman tree is the key to decompressing the file

Step 1: Get Frequencies
• Input file: Eerie eyes seen near lake.
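Step 1 can be sketched with Python's `collections.Counter`. The variable names are illustrative; note that uppercase "E" and lowercase "e" are counted as different characters, as are the space and the period.

```python
from collections import Counter

text = "Eerie eyes seen near lake."

# Count the occurrences of every character in the input.
freqs = Counter(text)

# List the characters from least to most frequent.
for char, count in sorted(freqs.items(), key=lambda item: item[1]):
    print(repr(char), count)
```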

Step 2: Build Huffman Tree & Assign Codes
• It is a binary tree in which each character is a leaf node
• Initially, each node is a separate root
• At each step
  • Select the two roots with the smallest frequencies and connect them to a new parent (break ties arbitrarily) [the greedy choice]
  • The parent gets the sum of the frequencies of its two children
• Repeat until only one root remains
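The merging loop above could be sketched with a min-heap. This is one possible implementation, not necessarily the one behind the slides: the tree representation (a leaf is a one-character string, an internal node is a `(left, right)` tuple) and the function name are assumptions. A running counter breaks ties so the heap never has to compare two subtrees.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_tree(freqs):
    """Return (total_frequency, tree) for a non-empty frequency table."""
    tiebreak = count()
    # Initially each character is its own root.
    heap = [(freq, next(tiebreak), symbol) for symbol, freq in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Greedy choice: take the two roots with the smallest frequencies.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # The new parent carries the sum of its children's frequencies.
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    total, _, tree = heap[0]
    return total, tree

total, tree = build_huffman_tree(Counter("Eerie eyes seen near lake."))
print(total)
```

Because ties are broken arbitrarily, the exact tree shape may differ from the one drawn on the slides, but the total encoded length is the same for every valid Huffman tree.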

Example: Each character has a leaf node with its frequency

Find the two smallest frequencies… Replace them with their parent

Now we have a single root… This is the Huffman tree

Let's Analyze the Huffman Tree
• All characters are at the leaf nodes
• The number at the root = number of characters in the file
• High-frequency characters (e.g., “e”) are near the root
• Low-frequency characters are far from the root

Let's Assign Codes
• Traverse the tree
  • Any left edge → add label 0
  • Any right edge → add label 1
• The code for each character is its root-to-leaf label sequence
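The traversal can be sketched recursively. The tree shape is an assumption for illustration: a leaf is a one-character string and an internal node is a `(left, right)` tuple. The small tree below is a made-up example, not the tree from the slides.

```python
def assign_codes(node, prefix="", table=None):
    """Collect the root-to-leaf label sequence of every leaf."""
    if table is None:
        table = {}
    if isinstance(node, str):
        # Leaf: the labels gathered on the way down are this character's code.
        # A tree with a single leaf gets the one-bit code "0".
        table[node] = prefix or "0"
        return table
    left, right = node
    assign_codes(left, prefix + "0", table)   # left edge: label 0
    assign_codes(right, prefix + "1", table)  # right edge: label 1
    return table

example_tree = (("a", "b"), "c")  # hypothetical tree
print(assign_codes(example_tree))
```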

Step 3: Encode (Compress)
• Input file: Eerie eyes seen near lake.
• Input file + coding table → generate the encoded file: 0000 10 1100 0001 10 ….
• Notice that no code is a prefix of any other code → this ensures the decoding is unique (unlike the ambiguous 000001 example)
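A minimal sketch of encoding and of the prefix check. The table below is only the part of the coding table that can be read off the encoded bits shown above (E, e, r, i); the function names are hypothetical.

```python
# Partial coding table, inferred from "0000 10 1100 0001 10" for "Eerie".
CODES = {"E": "0000", "e": "10", "r": "1100", "i": "0001"}

def encode(text, codes):
    """Replace each character by its code and concatenate the results."""
    return "".join(codes[char] for char in text)

def is_prefix_free(codes):
    """True if no code is a prefix of a different code."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(encode("Eerie", CODES))
print(is_prefix_free(CODES))
```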

Step 4: Decode (Decompress)
• Must have the encoded file + the coding tree
• Scan the encoded file
  • For each 0 → move left in the tree
  • For each 1 → move right
  • On reaching a leaf node → emit that character and go back to the root
• Encoded file 0000 10 1100 0001 10 …. + coding tree → generate the original file: Eerie …
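The decoding walk could look like the sketch below. As an assumption for illustration, the tree is rebuilt from the partial coding table (E, e, r, i) as nested dictionaries keyed by "0" (left) and "1" (right); the function names are hypothetical.

```python
# Partial coding table, inferred from "0000 10 1100 0001 10" for "Eerie".
CODES = {"E": "0000", "e": "10", "r": "1100", "i": "0001"}

def build_tree(codes):
    """Rebuild the coding tree: inner nodes are dicts, leaves are characters."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol
    return root

def decode(bits, root):
    """Follow one edge per bit; at a leaf, emit the character and restart."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]            # "0" moves left, "1" moves right
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = root
    return "".join(out)

print(decode("0000101100000110", build_tree(CODES)))
```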
