690 likes | 942 Views
Data Compression. Compression is a high-profile application .zip, .mp3, .aac, .jpg, .gif, .gz, .mpg, … What property of MP3 was a significant factor in what made P2P file sharing work? Why do we care? Secondary storage capacity doubles every year
E N D
Data Compression • Compression is a high-profile application • .zip, .mp3, .aac, .jpg, .gif, .gz, .mpg, … • What property of MP3 was a significant factor in what made P2P file sharing work? • Why do we care? • Secondary storage capacity doubles every year • Disk space fills up quickly on every computer system • More data to compress than ever before • Data compression facilitated by priority queue • All-time best assignment in a Compsci 100(e) course? • Subject to debate, of course
More on Compression • What’s the difference between compression techniques? • .mp3 files and .zip files? • .gif and .jpg? • Lossless and lossy • Is it possible to compress (lossless) every file? Why? • Lossy methods • Good for pictures, video, and audio (JPEG, MPEG, etc.) • Lossless methods • Run-length encoding, Huffman, LZW, … 11 3 5 3 2 6 2 6 5 3 5 3 5 3 10
Text Compression • Input: String S • Output: String S • Shorter • S can be reconstructed from S
1 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 a a b b 0 1 e e c c d d Text Compression: Examples • “abcde” in the different formats • ASCII: 01100001011000100110001101100100… • Fixed: • 000001010011100 • Var: • 000110100110 • Encodings ASCII: 8 bits/character Unicode: 16 bits/character
13 7 3 3 s s * * 1 1 2 2 6 6 g g o o 4 4 3 3 3 3 2 2 2 2 p p e e h h r r 1 1 1 1 1 1 1 1 Huffman coding: go go gophers ASCII 3 bits Huffman g 103 1100111 000 00 o 111 1101111 001 01 p 112 1110000 010 1100 h 104 1101000 011 1101 e 101 1100101 100 1110 r 114 1110010 101 1111 s 115 1110011 110 101 sp. 32 1000000 111 101 • Encoding uses tree: • 0 left/1 right • How many bits? 37!! • Savings? Worth it?
Huffman Coding • D.A Huffman in early 1950’s • Before compressing data, analyze the input stream • Represent data using variable length codes • Variable length codes though Prefix codes • Each letter is assigned a codeword • Codeword is for a given letter is produced by traversing the Huffman tree • Property: No codeword produced is the prefix of another • Letters appearing frequently have short codewords, while those that appear rarely have longer ones • Huffman coding is optimal per-character coding method
Building a Huffman tree • Begin with a forest of single-node trees (leaves) • Each node/tree/leaf is weighted with character count • Node stores two values: character and count • There are n nodes in forest, n is size of alphabet? • Repeat until there is only one node left: root of tree • Remove two minimally weighted trees from forest • Create new tree with minimal trees as children, • New tree root's weight: sum of children (character ignored) • Does this process terminate? How do we get minimal trees? • Remove minimal trees, hummm……
R N C E F P U L S G T O D B A M I 11 1 2 1 2 4 2 2 5 4 1 3 5 3 3 6 3 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
F A S E N G C B M U O R L P T D I 11 2 6 5 5 1 1 1 2 2 3 2 2 2 4 3 3 4 3 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
L G M E N C F T S P U B R O D A I 11 2 1 1 1 3 5 6 2 5 2 3 2 2 3 2 3 3 2 4 4 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
M E S F T C O P N A R L B D G U I 11 2 2 1 1 3 5 5 6 2 1 4 3 3 4 2 4 3 3 3 2 2 4 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
D O N F S C T P E M R L A B G U I 11 2 2 2 1 1 6 5 5 3 2 1 4 4 3 4 2 3 4 3 4 3 3 4 2 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
D N E F P C S U T M L A G B O R I 11 2 2 2 2 1 6 1 5 5 3 5 1 2 3 5 4 3 4 4 3 4 2 4 4 3 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
P N F B C T U S E L M D G A O R I 11 2 2 2 2 1 6 1 5 5 3 6 1 3 4 6 2 4 2 4 4 4 4 5 5 3 3 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
G F E B C P U R N S D M O A T L I 11 5 1 1 3 1 2 6 2 2 3 2 5 2 6 5 6 4 5 3 4 3 2 4 4 4 4 6 6 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
U B M T O S G D L R P C A F N E I 11 2 3 8 2 2 2 2 1 3 1 1 5 5 6 3 3 3 4 8 6 4 6 6 6 5 4 5 4 4 4 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
U B M T O S G D L R P C A F N E I 11 2 3 8 2 2 2 2 1 3 1 1 5 5 6 3 3 3 5 6 8 4 8 6 8 6 4 6 4 4 5 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O R T S M A B N E U F G C D P L I 11 10 10 4 2 2 1 2 1 3 2 3 5 3 5 3 6 1 6 2 3 4 4 5 5 6 6 4 8 8 6 8 8 2 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O L E S M N A R B T C P G U D F I 11 11 11 10 10 4 4 2 2 2 3 1 3 1 3 1 3 5 5 8 2 2 3 4 4 5 6 6 8 6 8 8 2 6 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O D M N S E F A L B P T U R G C I 12 11 11 10 12 10 11 1 4 1 3 1 3 2 2 3 2 2 4 3 5 5 2 4 4 5 6 8 8 6 8 8 6 2 3 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O G F N E S C M D A U B R T L P I 12 11 16 12 10 11 10 16 11 4 1 4 2 3 3 2 3 2 3 1 2 2 3 2 4 5 6 8 6 8 1 5 5 6 4 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
O G F N E S C M D A U B R T L P I 16 11 21 12 21 11 10 16 12 4 1 4 2 3 3 2 3 2 3 1 2 2 3 2 4 5 6 8 6 8 1 5 5 6 4 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
S M U A B T O E N F D C L P R G I 21 23 11 11 16 12 16 23 21 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
S M U A B T O E N F D C L P R G I 21 23 11 11 37 12 16 37 23 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
S M U A B T O E N F D C L P R G I 21 23 11 11 37 12 16 60 60 10 1 1 4 2 2 2 3 3 3 2 1 2 5 2 6 4 3 6 4 6 5 4 8 8 3 5 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
Encoding • Count occurrence of all occurring character O( ) • Build priority queueO( ) • Build Huffman treeO( ) • Create Table of codes from treeO( ) • Write Huffman tree and coded data to fileO( )
Properties of Huffman coding • Want to minimize weighted path length L(T)of tree T • wi is the weight or count of each codeword i • di is the leaf corresponding to codeword i • How do we calculate character (codeword) frequencies? • Huffman coding creates pretty full bushy trees? • When would it produce a “bad” tree? • How do we produce coded compressed data from input efficiently?
Writing code out to file • How do we go from characters to encodings? • Build Huffman tree • Root-to-leaf path generates encoding • Need way of writing bits out to file • Platform dependent? • Complicated to write bits and read in same ordering • See BitInputStream and BitOutputStream classes • Depend on each other, bit ordering preserved • How do we know bits come from compressed file? • Store a magic number
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 01100000100001001101
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 1100000100001001101
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 100000100001001101
O G S N E F M P C U B R T L D A I 12 11 37 60 10 21 16 23 11 4 1 4 2 3 3 2 2 3 3 1 2 2 2 3 4 5 6 8 6 8 1 5 5 6 4 Decoding a message 00000100001001101
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0000100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 000100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 00100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 100001001101 G
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 00001001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 0001001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 001001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1001101 GO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 001101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 101 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01 GOO
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 1 GOOD
T N C E F O U R P D G B A M S L I 60 37 23 11 21 10 11 12 16 2 4 6 4 5 5 3 2 3 1 2 3 3 1 4 5 6 8 6 8 2 1 2 2 3 4 Decoding a message 01100000100001001101 GOOD
Decoding • Read in tree data O( ) • Decode bit string with tree O( )