CS336: Intelligent Information Retrieval Lecture 6: Compression
Compression • Change the representation of a file so that it • takes less space to store • takes less time to transmit • Text compression: the original file can be reconstructed exactly (lossless) • Data compression: small changes or noise can be tolerated (lossy) • typically sound or images • already a digital approximation of an analog waveform
Text Compression: Two classes • Symbol-wise Methods • estimate the probability of occurrence of symbols • More accurate estimates => Greater compression • Dictionary Methods • represent several symbols as one codeword • compression performance based on length of codewords • replace words and other fragments of text with an index to an entry in a “dictionary”
[Diagram: text → encoder (model) → compressed text → decoder (model) → text] Compression: Based on a text model • Determine the output code from the probability distribution given by the model • the number of bits used to encode a symbol s should equal its information content, I(s) = −log₂ Pr(s) • Recall Information Theory and Entropy (H) • average amount of information per symbol over an alphabet • H = Σs Pr(s) · I(s) = −Σs Pr(s) log₂ Pr(s) • The more probable a symbol is, the less information it provides (easier to predict) • The less probable, the more information it provides
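A quick Python sketch of these two quantities; the symbol probabilities below are made up for illustration and are not from the lecture:

```python
import math

def entropy(probs):
    """Average information per symbol: H = -sum(p * log2 p)."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical symbol distribution (for illustration only):
model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

for s, p in model.items():
    print(f"I({s}) = {-math.log2(p):.2f} bits")   # information content of each symbol
print(f"H = {entropy(model):.2f} bits/symbol")     # 1.75 for this distribution
```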
Compression: Based on a text model • Symbol-wise methods (statistical) • the simplest models assume symbols occur independently • considering context achieves the best compression rates • e.g. the probability of seeing ‘u’ next is higher if we’ve just seen ‘q’ • the model must yield a probability distribution in which the symbols that actually occur get very high probability • Dictionary methods • string representations are typically based on the text seen so far • most characters are coded as part of a string that has already occurred • space is saved if the reference or pointer takes fewer bits than the string it replaces
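A minimal sketch of a context-based (order-1) estimate in the spirit of the ‘u after q’ example above; the sample string is invented for illustration:

```python
from collections import defaultdict

# Estimate Pr(next_char | previous_char) from counts in the text seen so far.
def build_order1_model(text):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    model = {}
    for prev, following in counts.items():
        total = sum(following.values())
        model[prev] = {c: n / total for c, n in following.items()}
    return model

model = build_order1_model("queen quiz quota the queue")
print(model["q"])   # {'u': 1.0} -- 'u' always follows 'q' in this sample
```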
Static Statistical Models • Static: same model regardless of the text processed • but won’t be good for all inputs • Recall entropy: measures information content over a collection of symbols • high pr(s) => low information content; low pr(s) => high information content • pr(s) = 1 => only s is possible, no encoding needed • pr(s) = 0 => s won’t occur, so s cannot be encoded • What is the implication for models in which some pr(s) = 0, but s does in fact occur? s cannot be encoded! Therefore in practice, all symbols must be assigned pr(s) > 0
Semi-static Statistical Models • Generate the model on the fly for the file being processed • a first pass estimates the probabilities • the probabilities are transmitted to the decoder before the encoded transmission • the disadvantage is having to transmit the model first
Adaptive Statistical Models • Model constructed from the text just encoded • begin with general model, modify gradually as more text is seen • best models take into account context • decoder can generate same model at each time step since it has seen all characters up to that point • advantages: effective w/out fine-tuning to particular text and only 1 pass needed
Adaptive Statistical Models • What about the pr(s) = 0 problem? • allow 1 extra count, divided evenly amongst the unseen symbols • recall that probabilities are estimated via occurrence counts • assume 72,300 total occurrences and 5 unseen characters: • each unseen character u gets pr(u) = pr(u is that particular unseen char) × pr(some unseen char occurs) = 1/5 × 1/72,301 • Not good for full-text retrieval • the text must be decoded from the beginning • therefore not good for random access to files
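A tiny sketch of the “one extra count” adjustment, using the counts from the slide:

```python
# One extra (phantom) count is shared by all unseen symbols, so the model
# behaves as if there were total_occurrences + 1 counts overall.
total_occurrences = 72_300
unseen_symbols = 5

pr_any_unseen = 1 / (total_occurrences + 1)       # pr(next symbol is some unseen symbol)
pr_each_unseen = pr_any_unseen / unseen_symbols   # split evenly: 1/5 * 1/72301

print(pr_each_unseen)   # ≈ 2.77e-06
```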
Symbol-wise Methods • Morse Code • common symbols assigned short codes (few bits) • rare symbols have longer codes • Huffman coding (prefix-free code) • no codeword is a prefix of another symbol’s codeword
Huffman coding • What is the string for 1010110111111110? • The decoder identifies 10 as the first codeword => e • Decoding proceeds left to right on the remainder of the string, yielding eefggf • Good when the probability distribution is static • Best when the model is word-based • Typically used for compressing text in IR
[Huffman tree diagram: leaves a, b, c, d, e, f, g with probabilities 0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1; internal nodes 0.1, 0.2, 0.3, 0.4, 0.6, 1.0; left branches labeled 0, right branches labeled 1] • Create the decoding tree bottom-up 1. Each symbol and its probability is a leaf 2. The 2 nodes with the smallest probabilities are joined under the same parent (p = p(s1) + p(s2)) 3. Repeat, ignoring children, until all nodes are connected 4. Label the branches: the left branch gets 0, the right branch 1
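A short Python sketch of the bottom-up construction and prefix-free decoding described on these two slides. Exact codewords depend on how probability ties are broken, so the builder below may assign different codes than the tree drawn on the slide (the expected code length is the same); the decode call at the end therefore uses the codewords implied by the slide’s example (e=10, f=110, g=111):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Bottom-up Huffman construction: repeatedly join the two lowest-probability nodes."""
    tiebreak = count()  # avoids comparing dicts when probabilities are equal
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two smallest-probability nodes...
        p2, _, right = heapq.heappop(heap)   # ...are joined under one parent
        merged = {s: "0" + c for s, c in left.items()}        # left branch gets 0
        merged.update({s: "1" + c for s, c in right.items()}) # right branch gets 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
print(huffman_codes(probs))   # more probable symbols get shorter codewords

def decode(bits, codes):
    """Prefix-free decoding: read left to right, emit a symbol as soon as a codeword matches."""
    inverse = {c: s for s, c in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)

# With the codewords from the slide's example (e=10, f=110, g=111):
print(decode("1010110111111110", {"e": "10", "f": "110", "g": "111"}))  # eefggf
```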
Huffman coding • Requires 2 passes over the document • gather statistics and build the coding table • encode the document • Must explicitly store the coding table with the document • eats into the space savings on short documents • Exploits only non-uniformity in the symbol distribution • adaptive algorithms can also recognize higher-order redundancy in strings, e.g. 0101010101....
Dictionary Methods • Braille • special codes are used to represent whole words • Dictionary: a list of sub-strings with corresponding codewords • we do this naturally, e.g. replace “december” with “12” • a codebook for ASCII might contain • the 128 characters and 128 common letter pairs • at best each character is encoded in 4 bits rather than 7 (an 8-bit codeword covering a letter pair averages 4 bits per character) • can be static, semi-static, or adaptive (best)
Dictionary Methods • Ziv-Lempel • Built on the fly • Replace strings w/ reference to a previous occurrence • codebook • all words seen prior to current position • codewords represented by “pointers” • Compression: pointer stored in fewer bits than string • Especially useful when resources used must be minimized (e.g. 1 machine distributes data to many) • Relatively easy to implement • Decoding is fast • Requires only small amount of memory
Ziv-Lempel • Coded output: a series of triples <x,y,z> • x: how far back to look in the previously decoded text to find the next string • y: how long that string is • z: the next character from the input • only necessary if the character to be coded did not occur previously • Example output: <0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <1,10,a> • the pointer in the last triple refers back into the text just decoded: 10 b’s followed by a
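A minimal sketch of decoding these triples; the triple format is the one described above, and self-referencing pointers (as in <1,10,a>) are handled by copying one character at a time:

```python
def lz_decode(triples):
    """Decode <x, y, z> triples: go back x chars, copy y chars, then append literal z."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])  # self-referencing copies work because
        out.append(ch)                  # out grows while we copy
    return "".join(out)

triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"),
           (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
print(lz_decode(triples))   # ends with ten b's followed by a
```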
Accessing Compressed Text • Store information identifying a document’s location relative to the beginning of the file • either store the bit offset • variable-length codes mean documents may not begin/end on byte boundaries • or insist that documents begin on byte boundaries • some documents will have wasted bits at the end • need a method to explicitly identify the end of the coded text
Text Compression • Key difference between symbol-wise & dictionary methods • symbol-wise methods base the coding of a symbol on its context • dictionary methods group symbols together, creating an implicit context • Present techniques give compression of ~2 bits/character for general English text • In order to do better, techniques must leverage • semantic content • external world knowledge • Rule of thumb: the greater the compression, • the slower the program runs, or • the more memory it uses • Inverted lists contain skewed data, so text compression methods are not appropriate
Inverted File Compression • Note: each inverted list can be stored as an ascending sequence of integers • <8; 2, 4, 9, 45, 57, 76, 78, 80> • Replace each position after the first with its d-gap (the difference from the previous position) • <8; 2, 2, 5, 36, 12, 19, 2, 2> • on average, gaps are much smaller than the largest document number • Compression models describe the probability distribution of d-gap sizes
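A small sketch of the d-gap transformation and its inverse, using the list from the slide (the leading 8 is the list length and is left out here):

```python
def to_gaps(doc_ids):
    """Keep the first document number, replace the rest with differences."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Prefix-sum the gaps to recover the original document numbers."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [2, 4, 9, 45, 57, 76, 78, 80]
print(to_gaps(postings))             # [2, 2, 5, 36, 12, 19, 2, 2]
print(from_gaps(to_gaps(postings)))  # round-trips to the original list
```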
Fixed-Length Compression • Typically, an integer is stored in 4 bytes • To compress the d-gaps for <5; 1, 3, 7, 70, 250>, first compute the gaps: <5; 1, 2, 4, 63, 180> • Then use fewer bytes per gap: • the two leftmost bits store the number of bytes used (1-4) • the d-gap is stored in the remaining 6, 14, 22, or 30 bits
Example • The inverted list is: 1, 3, 7, 70, 250 • After computing gaps: 1, 2, 4, 63, 180 • 1, 2, 4, and 63 each fit in 6 bits (1 byte each); 180 needs a 14-bit field (2 bytes) • Number of bytes reduced from 5 × 4 = 20 to 6
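A sketch of the byte scheme described above (two length bits plus 6, 14, 22, or 30 payload bits); the function names are my own:

```python
def encode_gap(x):
    """Store x in 1-4 bytes; the top two bits of the first byte give the byte count."""
    for nbytes in range(1, 5):
        payload_bits = 8 * nbytes - 2
        if x < (1 << payload_bits):
            prefix = (nbytes - 1) << payload_bits      # 00, 01, 10, or 11
            return (prefix | x).to_bytes(nbytes, "big")
    raise ValueError("gap too large for 30 bits")

def decode_gap(data):
    """Read the 2-bit length prefix, then mask it off to recover the gap."""
    nbytes = (data[0] >> 6) + 1
    value = int.from_bytes(data[:nbytes], "big")
    return value & ((1 << (8 * nbytes - 2)) - 1), nbytes

gaps = [1, 2, 4, 63, 180]
encoded = b"".join(encode_gap(g) for g in gaps)
print(len(encoded))   # 6 bytes instead of 5 * 4 = 20

pos, decoded = 0, []
while pos < len(encoded):
    g, used = decode_gap(encoded[pos:])
    decoded.append(g)
    pos += used
print(decoded)        # [1, 2, 4, 63, 180]
```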
Gamma (γ) Code • This method is superior to a fixed binary encoding • Represent the number x in two parts: • a unary code specifying how many bits are needed to code x: 1 + ⌊log₂ x⌋, followed by • a binary code for x in that many bits: x − 2^⌊log₂ x⌋ coded in ⌊log₂ x⌋ bits • Let x = 9: ⌊log₂ x⌋ = 3 (8 = 2^3 < 9 < 2^4 = 16) • part 1: 1 + 3 = 4 => 1110 (unary) • part 2: 9 − 8 = 1 => 001 • code: 1110 001 • Unary code: (n−1) 1-bits followed by a 0, therefore 1 is represented by 0, 2 by 10, 3 by 110, etc.
Example • The inverted list is: 1, 3, 7, 70, 250 • After computing gaps: 1, 2, 4, 63, 180 • Code each gap x as unary(1 + ⌊log₂ x⌋) followed by x − 2^⌊log₂ x⌋ in ⌊log₂ x⌋ bits: • 1 → 0 • 2 → 10 0 • 4 → 110 00 • 63 → 111110 11111 • 180 → 11111110 0110100 • Number of bits reduced from 5 × 32 = 160 to 35 • Decode: let u = the value of the unary part and b = the value of the binary part; then x = 2^(u−1) + b
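A short sketch of the γ code as described above; bit strings are built as Python text strings for clarity rather than packed into bytes:

```python
def gamma_encode(x):
    """Unary part: 1 + floor(log2 x) as (n-1) ones and a zero; binary part: x - 2**floor(log2 x)."""
    k = x.bit_length() - 1                   # floor(log2 x)
    unary = "1" * k + "0"                    # the slide's unary code for k + 1
    binary = format(x - (1 << k), f"0{k}b") if k else ""
    return unary + binary

def gamma_decode(bits):
    """Count leading ones to get the length, then apply x = 2**(u-1) + b."""
    k = bits.index("0")                      # number of leading ones
    b = int(bits[k + 1:2 * k + 1], 2) if k else 0
    return (1 << k) + b

print(gamma_encode(9))                        # 1110001, as in the slide's example
print([gamma_encode(g) for g in [1, 2, 4, 63, 180]])
print(gamma_decode(gamma_encode(180)))        # 180
```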