Inverted File Compression
In Managing Gigabytes
Course: Information Retrieval (정보검색론)
Lecturer: Hyuk-Chul Kwon (권혁철), Pusan National University (부산대학교)
Inverted File Compression
• Inverted file entry
• $\langle t;\ f_t;\ [d_1, d_2, \ldots, d_{f_t}] \rangle$
• $t$: term, $f_t$: number of documents containing $t$
• $d_k$: document number, where $d_k < d_{k+1}$
• < elephant; 8; [3, 5, 20, 21, 23, 76, 77, 78] > => < elephant; 8; [3, 2, 15, 1, 2, 53, 1, 1] >
• gap = $d_{k+1} - d_k$
• Two compression classes
• Global methods vs. local methods
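The gap transformation on this slide can be sketched in a few lines of Python. The function names `to_gaps` and `from_gaps` are chosen for this illustration only, and the sketch assumes document numbers start at 1, so the first gap is simply the first document number.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Convert a strictly increasing list of document numbers into gaps."""
    gaps = []
    previous = 0  # assumption: document numbers start at 1
    for d in doc_ids:
        gaps.append(d - previous)
        previous = d
    return gaps

def from_gaps(gaps):
    """Recover the document numbers as the running sum of the gaps."""
    return list(accumulate(gaps))

postings = [3, 5, 20, 21, 23, 76, 77, 78]   # the "elephant" entry from the slide
gaps = to_gaps(postings)
print(gaps)              # [3, 2, 15, 1, 2, 53, 1, 1]
print(from_gaps(gaps))   # [3, 5, 20, 21, 23, 76, 77, 78]
```

The gaps are small when a term is frequent, which is what the codes on the following slides exploit.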
Unary code
• Simple method
• fixed-length binary representation of the positive integer
• $\lceil \log N \rceil$ bits
• Unary code
• a gap $x$ is represented by $x - 1$ one-bits followed by a single zero-bit
• $l_x = (x - 1) + 1 = x$, $\Pr[x] = 2^{-x}$
• e.g. $x = 9$ => 111111110
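A minimal sketch of the unary code, using strings of "0" and "1" characters for readability rather than packed bits. The names `unary_encode` and `unary_decode` are illustrative, not taken from the slides.

```python
def unary_encode(x):
    """Encode a gap x >= 1 as (x - 1) one-bits followed by one zero-bit."""
    if x < 1:
        raise ValueError("unary code is defined for x >= 1")
    return "1" * (x - 1) + "0"

def unary_decode(bits, pos=0):
    """Decode one unary value starting at pos; return (value, next position)."""
    end = bits.index("0", pos)      # position of the terminating zero-bit
    return end - pos + 1, end + 1

print(unary_encode(9))              # 111111110
print(unary_decode("111111110"))    # (9, 9)
```

The code length equals the gap itself, so unary is only competitive when almost all gaps are very small.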
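γ code
• $1 + \lfloor \log x \rfloor$ is coded in unary, followed by $x - 2^{\lfloor \log x \rfloor}$ coded in binary using $\lfloor \log x \rfloor$ bits
• $l_x = 1 + \lfloor \log x \rfloor + \lfloor \log x \rfloor = 1 + 2\lfloor \log x \rfloor$, $\Pr[x] \approx 1/(2x^2)$
• e.g. $x = 9$: $\lfloor \log x \rfloor = 3$, $x - 2^{\lfloor \log x \rfloor} = 1$ => 1110001
• V = <1, 2, 4, 8, 16, …> or V = <1, 2, 2, 4, 4, 4, 8, …> or …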
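The construction above is the Elias γ code. The following is a small sketch of it over bit strings; `gamma_encode` and `gamma_decode` are names chosen for this example, and logarithms are base 2.

```python
def gamma_encode(x):
    """Elias gamma code: unary(1 + floor(log2 x)) then floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    n = x.bit_length() - 1                  # floor(log2 x)
    prefix = "1" * n + "0"                  # unary code for 1 + n
    suffix = format(x - (1 << n), "b").zfill(n) if n else ""
    return prefix + suffix

def gamma_decode(bits, pos=0):
    """Decode one gamma-coded value at pos; return (value, next position)."""
    n = 0
    while bits[pos] == "1":                 # count the unary one-bits
        n += 1
        pos += 1
    pos += 1                                # skip the terminating zero-bit
    remainder = int(bits[pos:pos + n], 2) if n else 0
    return (1 << n) + remainder, pos + n

print(gamma_encode(9))            # 1110001
print(gamma_decode("1110001"))    # (9, 7)
```

For x = 9 the prefix 1110 says that three binary bits follow, and those bits, 001, give the offset 1 from 8.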
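δ code
• similar in form to the γ code
• $1 + \lfloor \log x \rfloor$ is coded with the γ code instead of unary, followed by $x - 2^{\lfloor \log x \rfloor}$ coded in binary using $\lfloor \log x \rfloor$ bits
• $l_x = 1 + 2\lfloor \log(1 + \lfloor \log x \rfloor) \rfloor + \lfloor \log x \rfloor$, $\Pr[x] \approx 1/(2x(\log x)^2)$
• e.g. $x = 9$ => 11000001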
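This construction is the Elias δ code. A sketch under the same conventions as before (bit strings, base-2 logarithms) follows; the block repeats a small γ helper so that it runs on its own, and all function names are illustrative.

```python
def gamma_encode(x):
    """Elias gamma code, needed for the length prefix of the delta code."""
    n = x.bit_length() - 1
    return "1" * n + "0" + (format(x - (1 << n), "b").zfill(n) if n else "")

def gamma_decode(bits, pos=0):
    n = 0
    while bits[pos] == "1":
        n += 1
        pos += 1
    pos += 1
    remainder = int(bits[pos:pos + n], 2) if n else 0
    return (1 << n) + remainder, pos + n

def delta_encode(x):
    """Elias delta code: gamma(1 + floor(log2 x)) then floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("delta code is defined for x >= 1")
    n = x.bit_length() - 1                  # floor(log2 x)
    suffix = format(x - (1 << n), "b").zfill(n) if n else ""
    return gamma_encode(n + 1) + suffix

def delta_decode(bits, pos=0):
    """Decode one delta-coded value at pos; return (value, next position)."""
    length, pos = gamma_decode(bits, pos)   # length = 1 + floor(log2 x)
    n = length - 1
    remainder = int(bits[pos:pos + n], 2) if n else 0
    return (1 << n) + remainder, pos + n

print(delta_encode(9))            # 11000001
print(delta_decode("11000001"))   # (9, 8)
```

For x = 9 the prefix 11000 is the γ code of 4, and 001 is the same three-bit offset used by the γ code, giving 8 bits in total as the length formula predicts.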
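Global Bernoulli model
• $\Pr[x] = (1 - p)^{x-1} p$, where $p$ is the probability that any given document contains the term, so $\Pr[x]$ is the probability of a gap of size $x$
• Golomb code
• $q + 1$ is coded in unary, followed by $r$ coded in binary using $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits
• $q = \lfloor (x - 1) / b \rfloor$, $r = x - q b - 1$
• $b^A = \lceil \log(2 - p) / (-\log(1 - p)) \rceil \approx 0.69\,(N n / f)$, with $N$ documents, $n$ distinct terms and $f$ pointers in total
• e.g. $b = 3$: r = 0 (0), 1 (10), 2 (11); $b = 6$: r = 0 (00), 1 (01), 2 (100), 3 (101), 4 (110), 5 (111)
• for $x = 9$ and $b = 3$: $q = 2$, $r = 2$, so the code is 11011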
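A sketch of the Golomb code described on this slide, with the remainder written in the variable-length binary form shown in the examples (shorter codewords for the smallest remainders). The name `golomb_encode` and the bit-string representation are assumptions of this example.

```python
def golomb_encode(x, b):
    """Golomb code of x >= 1 with parameter b >= 1, as a bit string."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive")
    q, r = divmod(x - 1, b)             # q = floor((x - 1) / b), r = x - q*b - 1
    unary = "1" * q + "0"               # q + 1 coded in unary
    if b == 1:
        return unary                    # no remainder bits: plain unary code
    k = (b - 1).bit_length()            # ceil(log2 b)
    short = (1 << k) - b                # how many remainders get k - 1 bits
    if r < short:
        tail = format(r, "b").zfill(k - 1)
    else:
        tail = format(r + short, "b").zfill(k)
    return unary + tail

# Remainder codewords from the slide
print([golomb_encode(r + 1, 3)[1:] for r in range(3)])  # ['0', '10', '11']
print([golomb_encode(r + 1, 6)[1:] for r in range(6)])  # ['00', '01', '100', '101', '110', '111']

print(golomb_encode(9, 3))   # 11011
print(golomb_encode(9, 6))   # 10100
```

With b = 1 the code degenerates to unary, which is the case the local Bernoulli model uses for very common words.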
Global “observed frequency” model
• Based on the observed frequencies of the gap sizes
• Uses arithmetic or Huffman coding
• In theory
• a better compression method
• In practice
• only slightly better than the γ and δ codes
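As a rough illustration of an observed-frequency model, the sketch below counts gap sizes and derives Huffman codeword lengths from the counts. It only computes lengths, not the codewords themselves, and the helper name `huffman_code_lengths` is hypothetical. A real global model would count gaps over the whole inverted file, not over a single entry as in this toy input.

```python
import heapq
from collections import Counter

def huffman_code_lengths(frequencies):
    """Map each symbol to the length of its Huffman codeword."""
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    # Heap entries: (total count, tie-breaker, symbols in this subtree)
    heap = [(count, i, [symbol])
            for i, (symbol, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(frequencies, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for symbol in s1 + s2:          # every merge adds one bit to these symbols
            lengths[symbol] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths

gaps = [3, 2, 15, 1, 2, 53, 1, 1]       # toy input: the "elephant" gaps
counts = Counter(gaps)
lengths = huffman_code_lengths(counts)
total_bits = sum(counts[g] * lengths[g] for g in counts)
print(lengths)
print(total_bits / len(gaps), "bits per pointer")
```

The cost of storing the code table itself is ignored here; in practice that overhead is part of the trade-off against parameter-free codes such as γ and δ.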
Local Bernoulli model
• The frequency of term $t$, $f_t$, is known
• A Bernoulli model can be used on each individual inverted file entry
• Very common words are coded with $b = 1$
• This is equivalent to a bitvector
• thus, an inverted file can never be worse than a bitvector
• The parameter $f_t$ must be stored
• so that $b$ can be derived during decoding
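One way to derive a per-term Golomb parameter is to reuse the formula for b from the global Bernoulli model with a local probability. The sketch below assumes p = f_t / N for a collection of N documents; that choice, and the function name `local_golomb_parameter`, are assumptions of this example.

```python
import math

def local_golomb_parameter(f_t, num_docs):
    """Golomb parameter b for one term, assuming p = f_t / num_docs."""
    if not 1 <= f_t <= num_docs:
        raise ValueError("f_t must be between 1 and num_docs")
    p = f_t / num_docs
    if p == 1.0:
        return 1                        # term occurs in every document
    # b = ceil(log(2 - p) / -log(1 - p)), roughly 0.69 / p for small p
    return max(1, math.ceil(math.log(2 - p) / -math.log(1 - p)))

print(local_golomb_parameter(500, 1000))   # 1: a very common word
print(local_golomb_parameter(1, 1000))     # a rare word gets a large b
```

Because the decoder can recompute b from the stored f_t, no extra parameter has to be written to the entry.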
[Figure: word positions in the Bible: (a) bridegroom; (b) Jezebel; (c) twelfth]
Skewed Bernoulli model
• The Bernoulli model uses the vector $V_G = \langle b, b, b, \ldots \rangle$
• The skewed model uses $V_T = \langle b, 2b, 4b, \ldots, 2^i b, \ldots \rangle$
• slightly worse than the Golomb code
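The vector view can be sketched as follows: a unary prefix selects a bucket, and a binary suffix gives the position inside that bucket. This example uses bucket sizes b, 2b, 4b, … and, as a simplification, a fixed-width suffix of ⌈log2(size)⌉ bits per bucket instead of a minimal binary code. The function names are hypothetical.

```python
def skewed_encode(x, b):
    """Encode x >= 1 with bucket sizes b, 2b, 4b, ... (fixed-width suffix)."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive")
    size, base, index = b, 0, 0
    while x > base + size:              # find the bucket that contains x
        base += size
        size *= 2
        index += 1
    width = (size - 1).bit_length()     # ceil(log2 size) suffix bits
    offset = x - base - 1               # position inside the bucket
    suffix = format(offset, "b").zfill(width) if width else ""
    return "1" * index + "0" + suffix

def skewed_decode(bits, b, pos=0):
    """Decode one value at pos; return (value, next position)."""
    size, base = b, 0
    while bits[pos] == "1":
        base += size
        size *= 2
        pos += 1
    pos += 1
    width = (size - 1).bit_length()
    offset = int(bits[pos:pos + width], 2) if width else 0
    return base + offset + 1, pos + width

print(skewed_encode(9, 1))   # 1110001, the same bits as the gamma code of 9
print(skewed_encode(9, 3))   # 10101
```

With b = 1 the bucket sizes are 1, 2, 4, 8, …, which reproduces the γ code; larger b shifts the short codewords toward the typical gap size of the term.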
Local hyperbolic model
• $\Pr[x] = \alpha / x$, $x = 1, 2, \ldots, m$
• $\alpha = 1 / (\log_e(m + 1) + 0.5772)$
• $m$ is the largest gap
• Better performance
• more complex to implement
• requires the use of arithmetic coding
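The hyperbolic probabilities and the cost an ideal arithmetic coder would pay for each gap can be checked numerically. In this sketch the normalizing constant is called `alpha`, a name chosen for the example, and 0.5772 is the rounded constant from the slide. The sketch does not implement arithmetic coding itself.

```python
import math

EULER_CONSTANT = 0.5772                 # rounded value used on the slide

def hyperbolic_probability(x, m):
    """Pr[x] = alpha / x for 1 <= x <= m under the hyperbolic model."""
    if not 1 <= x <= m:
        raise ValueError("x must lie between 1 and m")
    alpha = 1.0 / (math.log(m + 1) + EULER_CONSTANT)
    return alpha / x

def ideal_code_length(x, m):
    """Bits an ideal arithmetic coder would spend on gap x: -log2 Pr[x]."""
    return -math.log2(hyperbolic_probability(x, m))

m = 1000
print(sum(hyperbolic_probability(x, m) for x in range(1, m + 1)))  # close to 1
print(ideal_code_length(1, m), ideal_code_length(9, m))
```

The probabilities sum to approximately 1 because the harmonic sum 1 + 1/2 + … + 1/m is close to log_e(m) + 0.5772; the approximation is loose for very small m.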
Local “observed frequency” model
• The ultimate in local modeling
• batched frequency
• requires more memory space
• best compression method
Performance of Index Compression Methods
(bits per pointer)

Method                Bible    GNUbib   Comact   TREC
Global methods
  Unary               264      920      490      1719
  Binary              15.00    16.00    18.00    20.00
  Bernoulli           9.67     11.65    10.58    12.61
  γ                   6.55     5.69     4.48     6.43
  δ                   6.26     5.08     4.36     6.19
  Observed frequency  5.92     4.83     4.21     5.83
Local methods
  Bernoulli           6.13     6.17     5.40     5.73
  Hyperbolic          5.77     5.17     4.65     5.74
  Skewed Bernoulli    5.68     4.71     4.24     5.28
  Batched frequency   5.61     4.65     4.03     5.27
Compression of bitmaps
• Bitmaps: compressed with hierarchical bitvector compression
[Figure: (a) original bitvector; (b) hierarchical structure; (c) flattened tree as a string of bits]
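A rough sketch of the hierarchical idea: each block of the bitvector is summarized by one bit at the level above (1 if the block contains any 1), and all-zero blocks are dropped when the tree is flattened. The block size of 4, the top-down level-by-level flattening order, and the function name `hierarchical_compress` are assumptions of this example and may differ from the layout in the figure.

```python
def hierarchical_compress(bits, block=4):
    """Flatten a hierarchical summary of a bitvector into a list of bits."""
    levels = [list(bits)]
    while len(levels[-1]) > block:
        current = levels[-1]
        # Pad with zeros so the level splits into whole blocks
        padded = current + [0] * (-len(current) % block)
        summary = [int(any(padded[i:i + block]))
                   for i in range(0, len(padded), block)]
        levels[-1] = padded
        levels.append(summary)
    out = list(levels[-1])                       # top level is stored in full
    for depth in range(len(levels) - 2, -1, -1):
        level, parent = levels[depth], levels[depth + 1]
        for i, flag in enumerate(parent):
            if flag:                             # keep only non-zero blocks
                out.extend(level[i * block:(i + 1) * block])
    return out

vector = [0] * 16
vector[5] = 1                                    # a sparse 16-bit bitvector
print(hierarchical_compress(vector))             # [0, 1, 0, 0, 0, 1, 0, 0]
```

A sparse 16-bit vector shrinks to 8 bits here, while a dense vector would grow slightly because the summary bits are added on top of the original blocks.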