Squishin’ Stuff: Huffman Compression
Data Compression • Begin with a computer file (text, picture, movie, sound, executable, etc.) • Most files contain extra information or redundancy • Goal: Reorganize the file to remove the excess information and redundancy • Lossless compression: Compress the file in such a way that none of the information is lost (good for text files and executables) • Lossy compression: Allow some information to be thrown away in order to get a better level of compression (good for pictures, movies, or sounds) • There are many, many, many algorithms out there to compress files • Different types of files work best with different algorithms (you need to consider the structure of the file and how things are connected) • We’re going to focus on Huffman compression, which is used in many compression programs, most notably WinZip • We’re just going to play with text files
Text Files • Each character is represented by one byte: a sequence of 8 bits (1s and 0s) called its ASCII code. • ASCII is an international standard for how each character is represented. • A 01000001 • B 01000010 • ~ 01111110 • 3 00110011 • Most text files use fewer than 128 characters; this code has room for 256. Extra information!! • Goal: Use shorter codes to represent more frequent characters. • You have seen this before…
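To see these byte codes for yourself, here is a minimal Python sketch (not from the slides; the four characters are just the examples above) that prints the 8-bit ASCII code for each character:

```python
# Print the 8-bit ASCII code for a few example characters.
for ch in "AB~3":
    print(ch, format(ord(ch), "08b"))
# A 01000001
# B 01000010
# ~ 01111110
# 3 00110011
```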
RAWA AWIS RINBABBE • That didn’t work. • If we simply use shorter codes, we need a way to know where one letter’s code stops and the next begins. • Huffman coding provides this, though we’ll lose some compression. • Huffman Coding • Named after some guy called Huffman (David Huffman, 1952). • Use a tree to construct the code, and then use the same tree to interpret (decode) the code, as sketched below.
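Here is a small Python sketch of the idea (my own illustration, not code from the slides; the function names `huffman` and `decode` and the sample string are made up). It builds the tree by repeatedly merging the two least frequent subtrees, reads each character’s code off its root-to-leaf path, and shows how reaching a leaf tells the decoder exactly where a character’s code ends:

```python
import heapq
from collections import Counter

def huffman(text):
    """Build a Huffman tree by repeatedly merging the two least
    frequent subtrees, then read each character's code off its
    root-to-leaf path (left = 0, right = 1)."""
    # Heap entries: (frequency, tie-breaker, tree). A tree is
    # either a single character (leaf) or a (left, right) pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tick, (a, b)))
        tick += 1
    tree = heap[0][2]
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):          # internal node
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                                # leaf: a character
            codes[node] = path or "0"
    walk(tree, "")
    return codes, tree

def decode(bits, tree):
    """Walk the tree bit by bit; reaching a leaf tells you exactly
    where one character's code stops and the next begins."""
    out, node = [], tree
    for b in bits:
        node = node[0] if b == "0" else node[1]
        if not isinstance(node, tuple):
            out.append(node)
            node = tree
    return "".join(out)

codes, tree = huffman("huffman example")
encoded = "".join(codes[ch] for ch in "huffman example")
assert decode(encoded, tree) == "huffman example"
print(codes)
```

Because no character’s code is a prefix of another’s (every character sits at a leaf), the decoder never needs a separator between codes.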
What’s the best you can do? • Obviously, there is a limit to how far down you can compress a file. • Assume your file has n different characters in it, say a1, …, an, with probabilities p1, …, pn (so p1 + p2 + … + pn = 1). • The entropy of the file is defined to be H = −Σᵢ pᵢ·log₂(pᵢ). • Entropy measures the smallest number of bits, on average, needed to represent a character. • For my name, the entropy is 3.12 (it takes at least 3.12 bits per character, on average, to represent my name); Huffman coding gave an average of 3.19 bits per character. • Huffman compression will always give an average code length within one bit of the entropy.
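A quick Python sketch of the entropy computation (again my own illustration; the sample string is arbitrary, and the 3.12/3.19 figures in the slide were for the speaker’s name, which isn’t given here):

```python
from collections import Counter
from math import log2

def entropy(text):
    """Average bits per character required by any lossless code:
    H = -sum(p_i * log2(p_i)) over the character probabilities."""
    n = len(text)
    return -sum((f / n) * log2(f / n) for f in Counter(text).values())

sample = "huffman compression"
print(round(entropy(sample), 2))
```

Comparing this value against the average code length from the `huffman` sketch above (weight each code’s length by its character’s probability) shows the within-one-bit guarantee on any input you try.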