Compression • For sending and storing information • Text, audio, images, videos
Common Applications • Text compression • lossless; gzip uses Lempel-Ziv coding, 3:1 compression • better than Huffman • Audio compression • lossy; MPEG, 3:1 to 24:1 compression • MPEG = Moving Picture Experts Group • Image compression • lossy; JPEG, 3:1 compression • JPEG = Joint Photographic Experts Group • Video compression • lossy; MPEG, 27:1 compression
Text Compression • Prefix code: one, of many, approaches • no code is a prefix of any other code • constraint: lossless • tasks • encode: text (string) -> code • decode: code -> text • main goal: maximally reduce storage, measured by compression ratio • minor goals: • simplicity • efficiency: time and space • some require a code dictionary or 2 passes over the data
Simplest Text Encoding • Run-length encoding • Requires a special character, say @ • Example Source: • ACCCTGGGGGAAAACCCCCC • Encoding: • A@C3T@G5@A4@C6 • Method • any run of 3 or more identical characters is replaced by @char# • +: simple • -: special characters, non-optimal
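A minimal sketch of this scheme in Python (the @ escape and the three-repeat threshold follow the slide; the function name is my own):

```python
def rle_encode(text, escape="@"):
    """Replace any run of 3 or more identical characters by escape + char + count."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                      # extend the current run
        run = j - i
        if run >= 3:
            out.append(f"{escape}{text[i]}{run}")
        else:
            out.append(text[i] * run)   # short runs are copied as-is
        i = j
    return "".join(out)

print(rle_encode("ACCCTGGGGGAAAACCCCCC"))   # A@C3T@G5@A4@C6
```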
Shannon's Information Theory (1948): How well can we encode? • Shannon's goal: reduce the size of messages for improved communication • What messages would be easiest/hardest to send? • Random bits hardest - no redundancy or pattern • Formal definition: S, a set of symbols si with probabilities pi • Information content of S = -sum pi*log2(pi) • measure of randomness • more random, less predictable, higher information content! • Theorem: this is the only measure with several natural properties • Information is not knowledge • Compression relies on finding regularities or redundancies.
Example • Send ACTG, each occurring 1/4 of the time • Code: A--00, C--01, T--10, G--11 • 2 bits per letter: no surprise • Average message length: • prob(A)*codelength(A) + prob(C)*codelength(C) + … • 1/4*2 + … = 2 bits • Now suppose: • prob(A) = 13/16 and the others 1/16 each • Codes: A--1, C--00, G--010, T--011 (a prefix code) • 13/16*1 + 1/16*2 + 1/16*3 + 1/16*3 = 21/16 ≈ 1.31 bits • What is the best possible result? Part of the answer: • The information content! But how to get it?
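A quick check of these numbers for the skewed distribution (entropy versus average code length, using the prefix code from the slide):

```python
from math import log2

# Skewed source: A occurs 13/16 of the time; C, G, T each 1/16.
probs = {"A": 13/16, "C": 1/16, "G": 1/16, "T": 1/16}
code_len = {"A": 1, "C": 2, "G": 3, "T": 3}   # A=1, C=00, G=010, T=011

entropy = -sum(p * log2(p) for p in probs.values())
avg_len = sum(probs[s] * code_len[s] for s in probs)

print(f"entropy    = {entropy:.3f} bits/symbol")   # about 0.993
print(f"avg length = {avg_len:.4f} bits/symbol")   # 21/16 = 1.3125
```

The entropy (about 0.99 bits) is the lower bound that the prefix code (1.31 bits) is trying to approach.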
The Shannon-Fano Algorithm • Earliest algorithm: heuristic divide and conquer • Illustration: source text with only letters ABCDE • Symbol A B C D E • Count 15 7 6 6 5 • Intuition: frequent letters get short codes • 1. Sort symbols according to their frequencies/probabilities, i.e. ABCDE here. • 2. Recursively divide into two parts, each with approximately the same total count (a code sketch follows the result below). • This is an instance of "balancing" (partition), which is NP-complete. • Note: variable-length codes.
Result for this distribution
Symbol   Count   -log2(p)   Code   # of bits
A        15      1.38       00     30
B        7       2.48       01     14
C        6       2.70       10     12
D        6       2.70       110    18
E        5       2.96       111    15
TOTAL (# of bits): 89; average message length = 89/39 ≈ 2.28 bits/symbol
Note: the prefix property makes decoding unambiguous.
Can you do better? Theoretical optimum = -sum pi*log2(pi) = entropy ≈ 2.19 bits/symbol here
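A sketch of the recursive split in Python that reproduces this table (the split point minimizes the count imbalance; the helper names are mine):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs sorted by decreasing count.
    Returns a dict mapping symbol -> binary code string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(count for _, count in symbols)
    # Choose the split whose two halves have counts as close as possible.
    best_i, best_diff, running = 1, float("inf"), 0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "0" + code           # left part gets prefix 0
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "1" + code           # right part gets prefix 1
    return codes

print(shannon_fano([("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```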
Code Tree Method/Analysis • Binary tree method • Internal nodes have left/right references: • 0 means go to the left • 1 means go to the right • Leaf nodes store the character • Decode time-cost is O(logN) per character • Decode space-cost is O(N) • quick argument: number of leaves > number of internal nodes • Proof: induction on the number of internal nodes • Prefix property: each complete code uniquely identifies a character.
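A minimal decoder over such a code tree, following the 0 = left, 1 = right convention (the node representation is my own):

```python
class Node:
    """Internal nodes hold left/right children; leaves hold a character."""
    def __init__(self, char=None, left=None, right=None):
        self.char, self.left, self.right = char, left, right

def decode(root, bits):
    out, node = [], root
    for b in bits:
        node = node.left if b == "0" else node.right
        if node.char is not None:   # reached a leaf: emit the character
            out.append(node.char)
            node = root             # prefix property: restart at the root
    return "".join(out)

# Tree for the prefix code A=1, C=00, G=010, T=011
root = Node(left=Node(left=Node("C"),
                      right=Node(left=Node("G"), right=Node("T"))),
            right=Node("A"))
print(decode(root, "100010011"))    # ACGT
```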
Code Encode(character) • Again can use a binary prefix tree • For encode and decode could use hashing • yields O(1) encode/decode time • O(N) space cost (N is the size of the alphabet) • For compression, the main goal is reducing storage size • in the example it's the total number of bits • code size for a single character = depth of that character's leaf in the tree • code size for a document = sum of (frequency of char * depth of char) • different trees yield different storage efficiency • What's the best tree?
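Encoding with a hash table (a Python dict) gives the O(1) per-character lookup mentioned above, and the storage cost is the weighted sum of code lengths; the prefix code here is the one from the earlier example:

```python
# Prefix code from the running example: A=1, C=00, G=010, T=011
code = {"A": "1", "C": "00", "G": "010", "T": "011"}

def encode(text):
    return "".join(code[ch] for ch in text)

def storage_bits(freq):
    """Total bits = sum over chars of (frequency * code length), i.e. leaf depth."""
    return sum(count * len(code[ch]) for ch, count in freq.items())

print(encode("ACGT"))                                    # 100010011
print(storage_bits({"A": 13, "C": 1, "G": 1, "T": 1}))   # 13*1 + 2 + 3 + 3 = 21
```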
Huffman Code • Provably optimal: i.e. yields minimum storage cost • Algorithm: CodeTree huff(document) 1. Compute the frequency and a leaf node for each char • leaf node has a count field and a character 2. Remove the 2 nodes with the least counts and create a new node whose count is the sum of their counts and whose sons are the removed nodes • internal node has 2 node ptrs and a count field 3. Repeat 2 until only 1 node is left. 4. That's it!
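A sketch of the algorithm using a binary heap for "remove the 2 nodes with least counts" (Python's heapq; the tie-break counter is my addition so heap comparisons stay well defined):

```python
import heapq
from collections import Counter

def huffman_codes(document):
    freq = Counter(document)
    # Heap entries: (count, tie-break, tree); a tree is a char (leaf) or a (left, right) pair.
    heap = [(count, i, ch) for i, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)                  # the two least counts...
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tie, (t1, t2)))   # ...become one internal node
        tie += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"                  # edge case: single-symbol alphabet
    walk(heap[0][2])
    return codes

print(huffman_codes("AAAAAAAAAAAAACGT"))                 # A gets the shortest code
```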
Analysis • Intuition: least frequent chars get the longest codes; most frequent chars get the shortest codes. • Let T be a minimal code tree. (Induction) • All internal nodes have 2 sons. (by construction) • Lemma: if c1 and c2 are the least frequently used chars, then they are at the deepest depth • Proof: • if they are not at the deepest nodes, exchange them with deeper (more frequent) chars and the total cost (number of bits) cannot increase
Analysis (continued) • Sk: Huffman algorithm on k chars produces an optimal code. • S2: obvious • Sk => Sk+1 • Let T be an optimal code on k+1 chars • By the lemma, the two least frequent chars are deepest • Replace the two least frequent chars by a new char with frequency equal to their sum • Now have a tree on k chars • By induction, Huffman yields an optimal tree on k chars; expanding the merged char back into its two leaves gives an optimal tree on k+1 chars.
Lempel-Ziv • Input: string of characters • Internal: dictionary of (codeword, word) pairs • Output: string of codewords and characters. • Codewords are distinct from characters. • In the algorithm, w is a string, c is a character, and w+c means concatenation. • When adding a new word to the dictionary, a new codeword is assigned to it.
Lempel-Ziv Algorithm
w = NIL;
while ( read a character c ) {
    if w+c exists in the dictionary
        w = w+c;
    else {
        add w+c to the dictionary;
        output the code for w;
        w = c;
    }
}
output the code for w;
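A runnable version of the compressor in Python (the dictionary is seeded with the single characters and codes are integers; those conventions, and the test string, are my choices):

```python
def lzw_compress(text):
    # Seed the dictionary with every single character that appears.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    w, output = "", []
    for c in text:
        if w + c in dictionary:
            w = w + c                        # keep growing the current match
        else:
            output.append(dictionary[w])     # emit the code for the longest known prefix
            dictionary[w + c] = next_code    # learn the new word
            next_code += 1
            w = c
    if w:
        output.append(dictionary[w])         # emit the final match
    return output, dictionary

codes, d = lzw_compress("ABABABA")
print(codes)   # [0, 1, 2, 4] with A=0, B=1 and learned words AB=2, BA=3, ABA=4
```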
Adaptive Encoding • Webster has 157,000 entries: could encode each word index in about 18 bits (2^18 > 157,000) • but only works for this dictionary • Don't want to do two passes • Adaptive Huffman • modifies the model on the fly • Lempel-Ziv 1977 (LZ77) • LZW: Lempel-Ziv-Welch • 1984, used in compress (UNIX) • uses a dictionary method • maps a variable number of symbols to a fixed-length code • better with large documents: finds repetitive patterns
Audio Compression • Sounds can be represented as a vector-valued function • At any point in time, a sound is a combination of different frequencies at different strengths • For example, each note on a piano yields a specific frequency. • Also, our ears, like pianos, have cilia that respond to specific frequencies. • Just as sin(x) can be approximated by a small number of terms, e.g. x - x^3/6 + x^5/120 - …, so can a sound. • Transforming a sound into its "spectrum" is done mathematically by a Fourier transform. • The spectrum can be played back, as on a computer with a sound card.
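A small numerical illustration of "transforming a sound into its spectrum", using NumPy's FFT as the Fourier transform (the 440 Hz tone plus a weaker 880 Hz overtone is a made-up test signal):

```python
import numpy as np

rate = 8000                                # samples per second
t = np.arange(rate) / rate                 # one second of audio
# A synthetic "sound": a 440 Hz tone plus a weaker 880 Hz overtone.
sound = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(sound)              # Fourier transform: time -> frequency domain
freqs = np.fft.rfftfreq(len(sound), d=1 / rate)
strongest = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(strongest))                   # [440.0, 880.0]: the two frequencies present
```

Storing just the few strong spectral coefficients, instead of all 8000 samples, is the compression idea.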
Audio • Using many frequencies, as in CDs, yields a good approximation • Using few frequencies, as in telephones, a poor approximation • Sampling frequencies yields compression ratios between 6:1 and 24:1, depending on the sound and the desired quality • High-priced electronic pianos store and reuse "samples" of concert pianos • High filter: removes/reduces high frequencies (loss of high frequencies is also a common effect of aging) • Low filter: removes/reduces low frequencies • Can use differential methods: • only report the change in the sound
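A sketch of the differential idea: store the first sample and then only the change at each step, which is typically a small number that needs fewer bits (the sample values are made up):

```python
samples = [100, 102, 103, 103, 101, 98, 98]

# Differential (delta) encoding: first value, then the change at each step.
deltas = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
print(deltas)               # [100, 2, 1, 0, -2, -3, 0] -- small numbers, cheap to encode

# Decoding reverses the process by accumulating the changes.
decoded, total = [], 0
for d in deltas:
    total += d
    decoded.append(total)
print(decoded == samples)   # True
```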
Image Compression • with or without loss, mostly with • who cares about what the eye can't see • Black-and-white images can be regarded as functions from the plane (R^2) into the reals (R), as in old TVs • positions vary continuously, but our eyes can't see the discreteness beyond around 100 pixels per inch • Color images can be regarded as functions from the plane into R^3, the RGB space • Colors vary continuously, but our eyes sample color with only 3 different receptors (RGB) • Mathematical theories yield close approximations • there are spatial analogues of Fourier transforms
Image Compression • faces can be done with eigenfaces • images can be regarded as points in R^(big) • choose good bases and use the most important vectors • i.e. approximate with fewer dimensions: • JPEG, MPEG, GIF are compressed image (and video) formats
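A sketch of "approximate with fewer dimensions": keep only the most important basis vectors of an image via the SVD (the smooth synthetic image is a stand-in; eigenfaces apply the same idea to a set of face images):

```python
import numpy as np

x = np.linspace(0, 1, 64)
# A smooth synthetic 64x64 "image" (well approximated by few components).
image = np.outer(np.sin(3 * x), np.cos(2 * x)) + 0.5 * np.outer(x, x)

U, s, Vt = np.linalg.svd(image, full_matrices=False)
k = 8                                          # keep only the 8 most important components
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

stored = k * (64 + 64 + 1)                     # k left vectors + k right vectors + k weights
print(stored, "numbers instead of", image.size)                        # 1032 vs 4096
print("relative error:", np.linalg.norm(image - approx) / np.linalg.norm(image))
```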
Video Compression • Uses the DCT (discrete cosine transform) • Note: nice functions can be approximated by • a sum of x, x^2, … with appropriate coefficients • a sum of sin(x), sin(2x), … with the right coefficients • almost any infinite sum of functions • The DCT is good because few terms give good results on images. • Differential methods are used: • only report changes between frames
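A tiny DCT-II sketch written directly from the cosine-sum definition with NumPy (real codecs work on 8x8 blocks with a fast transform; this only shows that a few terms can carry a smooth block):

```python
import numpy as np

def dct(x):
    """Naive 1-D DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 1/2) * k)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi / N * (n + 0.5) * k)) for k in range(N)])

def idct(X):
    """Inverse: x[n] = X[0]/N + (2/N) * sum_{k>=1} X[k] * cos(pi/N * (n + 1/2) * k)."""
    N = len(X)
    k = np.arange(1, N)
    return np.array([X[0] / N + 2 / N * np.sum(X[1:] * np.cos(np.pi / N * (m + 0.5) * k))
                     for m in range(N)])

N = 16
n = np.arange(N)
# A smooth 16-sample block built from low-frequency cosines (plus a constant).
signal = 1.0 + 0.8 * np.cos(np.pi / N * (n + 0.5) * 1) + 0.3 * np.cos(np.pi / N * (n + 0.5) * 3)

coeffs = dct(signal)
coeffs[4:] = 0                                # keep only the first 4 DCT terms
print(np.max(np.abs(signal - idct(coeffs))))  # ~0: four terms reproduce this block
```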
Summary • Issues: • Context: what problem are you solving and what is an acceptable solution? • evaluation: compression ratios • fidelity, if lossy • approximation, quantization, transforms, differential • adaptive, if on-the-fly, e.g. movies, TV • Different sources yield different best approaches • cartoons versus cities versus outdoors • code book separate or not • fixed- or variable-length codes