Index construction: Compression of documents
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79
Various Approaches
Statistical coding
• Huffman codes
• Arithmetic codes
Dictionary coding
• LZ77, LZ78, LZSS, …
• gzip, Zippy, Snappy, …
Text transforms
• Burrows-Wheeler Transform
• bzip
Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa
Uniquely Decodable Codes A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It could be parsed as a·d (1|011) or as c·a (101|1), so this code is ambiguous. A uniquely decodable code can always be uniquely decomposed into its codewords.
Prefix Codes A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie: 0 on left edges, 1 on right edges, symbols at the leaves.
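The prefix property makes left-to-right decoding unambiguous: as soon as the bits read so far match a codeword, that codeword is complete. A minimal sketch, using the code from the slide (a table lookup stands in for walking the binary trie):

```python
# Prefix code from the slide: a=0, b=100, c=101, d=11.
# Because no codeword prefixes another, we can scan left to right
# and emit a symbol the moment the buffer equals a codeword.
CODE = {"a": "0", "b": "100", "c": "101", "d": "11"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(bitstring):
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in DECODE:              # a full codeword has been read
            out.append(DECODE[buf])
            buf = ""                   # back to the trie root
    assert buf == "", "dangling bits: not a valid codeword sequence"
    return "".join(out)

print(decode("0100101110"))  # abcda
```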
Average Length For a code C with codeword lengths L[s], the average length is defined as La(C) = ∑s p(s) · L[s]. Example: p(A) = .7 with a 1-bit codeword, p(B) = p(C) = p(D) = .1 with 3-bit codewords: La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits). We say that a prefix code C is optimal if La(C) ≤ La(C’) for all prefix codes C’.
Entropy (Shannon, 1948) For a source S emitting symbol s with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits: the lower the probability, the higher the information. Entropy is the weighted average of the self-information: H(S) = ∑s p(s) · log2(1/p(s)). It satisfies 0 ≤ H ≤ log2 |S|: H → 0 for a skewed distribution, H is maximal for the uniform one. Replacing p(s) with the empirical frequency occ(s)/n of s in a string T gives the 0-th order empirical entropy of T.
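The two boundary cases above can be checked numerically; a small sketch computing H for the skewed distribution used later in the deck and for the uniform one:

```python
import math

def entropy(probs):
    """0-th order entropy H = sum of p(s) * log2(1/p(s)), in bits/symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Skewed distribution from the slides: p(A)=.7, p(B)=p(C)=p(D)=.1
print(round(entropy([0.7, 0.1, 0.1, 0.1]), 2))   # 1.36
# Uniform over 4 symbols reaches the maximum log2|S| = 2
print(entropy([0.25] * 4))                        # 2.0
```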
Performance: Compression ratio Compression ratio = #bits in output / #bits in input. To evaluate compression performance we relate the achieved average codeword length to the entropy: Shannon’s bound says no code can use fewer than H bits per symbol on average; in practice we compare the empirical H against the average codeword length. Example: p(A) = .7, p(B) = p(C) = p(D) = .1 gives H ≈ 1.36 bits, while Huffman achieves ≈ 1.5 bits per symbol. An optimal code is surely one that…
Document Compression Huffman coding
Huffman Codes Invented by Huffman as a class assignment in the early ’50s. Used in most compression algorithms: • gzip, bzip, jpeg (as option), fax compression, … Properties: • Generates optimal prefix codes • Cheap to encode and decode • La(Huff) = H if all probabilities are powers of 2 • Otherwise, La(Huff) < H + 1: less than 1 extra bit per symbol on average!
Running Example p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5. Repeatedly merging the two least-probable trees gives a+b (.3), then (a+b)+c (.5), then the root (1), yielding a = 000, b = 001, c = 01, d = 1. There are 2^(n-1) “equivalent” Huffman trees, obtained by flipping the 0/1 labels at the internal nodes. What about ties (and thus, tree depth)?
Encoding and Decoding Encoding: emit the root-to-leaf path leading to the symbol to be encoded, e.g. abc… → 00000101. Decoding: start at the root and take the branch indicated by each bit received; when at a leaf, output its symbol and return to the root, e.g. 101001… → dcb…
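The greedy construction above (always merge the two least-probable trees) can be sketched with a heap; a minimal version that returns the codeword table directly, using the running example’s probabilities:

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code: repeatedly merge the two least-probable
    subtrees, prefixing 0 to one side and 1 to the other.
    freqs: dict symbol -> probability. Returns dict symbol -> bitstring."""
    tiebreak = count()   # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # least probable
        p1, _, c1 = heapq.heappop(heap)   # second least probable
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({"a": .1, "b": .2, "c": .2, "d": .5})
# Codeword lengths match the slide (3, 3, 2, 1), so La = 1.8 bits,
# though the exact 0/1 labels depend on how ties are broken.
print(sorted((s, len(b)) for s, b in code.items()))
```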
Huffman in practice The compressed file of n symbols consists of: • Preamble: tree encoding + symbols in the leaves • Body: compressed text of n symbols. Preamble = Θ(|S| log |S|) bits. Body is at least nH bits and at most nH + n bits. The extra +n is bad for very skewed distributions, namely ones for which H → 0. Example: p(a) = 1/n, p(b) = (n-1)/n.
There are better choices T = aaaaaaaaab • Huffman = {a,b}-encoding + 10 bits • RLE = <a,9><b,1> = γ(9) + γ(1) + {a,b}-encoding = 0001001 + 1 + {a,b}-encoding. So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map each symbol to a fixed bit string, as Huffman does, but the mapping may change along the text and, moreover, it can use fractions of a bit per symbol. Fax, bzip, … use RLE.
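The γ above is the Elias gamma code for the run lengths; a minimal sketch showing where the 7 + 1 = 8 bits of the RLE example come from:

```python
def gamma(n):
    """Elias gamma code for n >= 1: write n in binary (len L bits, no
    leading zeros) preceded by L-1 zeros, so the decoder knows L."""
    b = bin(n)[2:]                 # e.g. 9 -> '1001'
    return "0" * (len(b) - 1) + b

print(gamma(9))  # 0001001  (7 bits)
print(gamma(1))  # 1        (1 bit)
# Total for <a,9><b,1>: 7 + 1 = 8 bits, vs 10 bits with Huffman.
```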
Idea on Huffman? Goal: reduce the impact of the +1 bit. Solution: • Divide the text into blocks of k symbols • The +1 is spread over k symbols • So the loss is 1/k per symbol. Caution: the alphabet becomes S^k, so the tree gets larger, and so does the preamble. At the limit k = n, the preamble stores 1 k-gram, i.e. the whole input text, and the compressed text is 1 bit only. This means no compression at all!
Document Compression Arithmetic coding
Introduction Arithmetic coding uses “fractional” parts of bits!! It gets < nH(T) + 2 bits vs. < nH(T) + n (Huffman), which is close to the ideal performance; in practice the overhead is about 0.02 * n bits. Used in JPEG/MPEG (as option) and bzip. More time-costly than Huffman, but the integer implementation is not too bad.
Symbol interval Assign each symbol an interval in [0, 1) (0 inclusive, 1 exclusive) of length equal to its probability, e.g. with p(a) = .2, p(b) = .5, p(c) = .3: a = [0.0, 0.2), b = [0.2, 0.7), c = [0.7, 1.0), where cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7. The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2, .7)).
Sequence interval Coding the message sequence bac: start from [0, 1); symbol b narrows it to [.2, .7), of length .5; symbol a narrows it to [.2, .3), of length (0.7-0.2)*0.2 = .1; symbol c narrows it to [.27, .3), of length (0.3-0.2)*0.3 = .03. The final sequence interval is [.27, .3).
The algorithm To code a sequence T[1..n] of symbols with probabilities p(s), maintain the interval [li, li + si), starting from l0 = 0, s0 = 1 and updating, for each symbol T[i]: li = li-1 + si-1 * cum[T[i]] and si = si-1 * p(T[i]). For bac this produces [.2, .7), then [.2, .3), then [.27, .3).
The algorithm Each symbol narrows the interval by a factor p(T[i]), so the final interval size is sn = ∏i=1..n p(T[i]). The sequence interval is [ln, ln + sn): take a number inside it.
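The interval-narrowing loop above is short enough to run directly; a sketch using the deck’s distribution p(a) = .2, p(b) = .5, p(c) = .3 and the message bac:

```python
PROB = {"a": .2, "b": .5, "c": .3}
CUM  = {"a": .0, "b": .2, "c": .7}   # cum[s] = sum of p over symbols before s

def sequence_interval(msg):
    """Narrow [l, l+s) once per symbol: l += s*cum[x]; s *= p(x)."""
    l, s = 0.0, 1.0
    for x in msg:
        l, s = l + s * CUM[x], s * PROB[x]
    return l, l + s

lo, hi = sequence_interval("bac")
print(round(lo, 4), round(hi, 4))   # 0.27 0.3
```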
Decoding Example Decoding the number .49, knowing the message is of length 3: .49 falls in b’s interval [.2, .7), output b and rescale to (.49 - .2)/.5 = .58; .58 falls again in [.2, .7), output b and rescale to (.58 - .2)/.5 = .76; .76 falls in c’s interval [.7, 1.0), output c. The message is bbc.
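The decoder just inverts the encoder step by step; a sketch of the .49 example, again assuming p(a) = .2, p(b) = .5, p(c) = .3:

```python
PROB = {"a": .2, "b": .5, "c": .3}
CUM  = {"a": .0, "b": .2, "c": .7}

def decode(x, n):
    """Repeat n times: find the symbol interval containing x, emit the
    symbol, then rescale x back into [0,1) to undo that narrowing."""
    out = []
    for _ in range(n):
        # the symbol s with the largest cum[s] that is still <= x
        sym = max((s for s in PROB if CUM[s] <= x), key=lambda s: CUM[s])
        out.append(sym)
        x = (x - CUM[sym]) / PROB[sym]   # zoom into the symbol interval
    return "".join(out)

print(decode(0.49, 3))  # bbc
```

Note that the message length n must be known (or transmitted), since every prefix of the number decodes to some sequence.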
How do we encode that number? If x = v/2^k (a dyadic fraction), then the encoding is bin(v) over k digits (possibly padded with 0s in front).
How do we encode that number? Binary fractional representation, generated incrementally: FractionalEncode(x): • x = 2 * x • if x < 1, output 0 • else output 1 and set x = x - 1. Example for x = 1/3: 2 * (1/3) = 2/3 < 1, output 0; 2 * (2/3) = 4/3 ≥ 1, output 1 and x = 4/3 - 1 = 1/3; and so on, producing 010101…
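The doubling loop above stops after a chosen number of bits (the truncation discussed next); a direct sketch:

```python
def fractional_bits(x, nbits):
    """First nbits binary digits of x in [0,1): each doubling shifts
    one bit of the fractional expansion past the binary point."""
    out = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            out.append("0")
        else:
            out.append("1")
            x -= 1
    return "".join(out)

print(fractional_bits(1 / 3, 6))   # 010101  (non-dyadic: never terminates)
print(fractional_bits(5 / 8, 3))   # 101     (dyadic: terminates exactly)
```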
Which number do we encode? Take the midpoint ln + sn/2 of the sequence interval and truncate its encoding to the first d = ⌈log2(2/sn)⌉ bits. Truncation gets a smaller number… how much smaller? By at most 2^-d ≤ sn/2, so the truncated number still falls inside [ln, ln + sn).
Bound on code length Theorem: for a text T of length n, the arithmetic encoder generates at most
⌈log2(2/sn)⌉ < 1 + log2(2/sn) = 2 - log2 sn
= 2 - log2(∏i=1..n p(T[i]))
= 2 - log2(∏s p(s)^occ(s))
= 2 - ∑s occ(s) * log2 p(s)
≈ 2 + ∑s (n * p(s)) * log2(1/p(s))
= 2 + n H(T) bits.
Example: T = acabc gives sn = p(a) * p(c) * p(a) * p(b) * p(c) = p(a)^2 * p(b) * p(c)^2.
Document Compression Dictionary-based compressors
LZ77 Algorithm’s step: • Output <dist, len, next-char> • Advance by len + 1. A buffer “window” of fixed length slides over the text; the dictionary is the set of all substrings starting inside the window. E.g., scanning a a c a a c a b c a a a a a a a c, one step finds a 3-char match 6 positions back and emits <6,3,a>; a later step emits <3,4,c>.
LZ77 Decoding The decoder keeps the same dictionary window as the encoder: • it finds the referenced substring and inserts a copy of it. What if len > dist? (overlap with the text still being written) • E.g. seen = abcd, next codeword is (2,9,e) • Simply copy one character at a time, starting dist back from the cursor: for (i = 0; i < len; i++) out[cursor+i] = out[cursor-dist+i]; • Output is correct: abcdcdcdcdcdce
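The overlap case works precisely because the copy proceeds one character at a time, so the source region grows as the destination is written. A sketch of the decoder (here a literal character is emitted as a (0, 0, char) triple, a representation convention of this example, not fixed by LZ77 itself):

```python
def lz77_decode(triples):
    """Decode a list of <dist, len, next-char> triples. The
    char-by-char copy handles len > dist (self-overlap) for free."""
    out = []
    for dist, length, ch in triples:
        start = len(out) - dist
        for i in range(length):           # must copy one char at a time:
            out.append(out[start + i])    # the source grows as we write
        out.append(ch)
    return "".join(out)

# Overlap example from the slide: seen = "abcd", codeword (2, 9, 'e')
msg = [(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]
print(lz77_decode(msg))   # abcdcdcdcdcdce
```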
LZ77 Optimizations used by gzip LZSS: output one of the two formats (0, position, length) or (1, char), typically using the second format when length < 3. Special greedy: possibly use a shorter match so that the next match is better. A hash table speeds up the search for matches on triplets of characters. The emitted triples are then coded with Huffman’s code.