Indexing
Overview of the Talk
• Inverted File Indexing
• Compression of inverted files
• Signature files and bitmaps
• Comparison of indexing methods
• Conclusion
Inverted File Indexing
• An inverted file index contains a list of the terms that appear in the document collection (called the lexicon or vocabulary) and, for each term in the lexicon, a list of pointers to all occurrences of that term in the collection. This list is called an inverted list.
• The granularity of the index determines how precisely it records the location of a word.
• A coarse-grained index requires less storage but more query processing to eliminate false matches.
• A word-level index enables queries involving adjacency and proximity, but has higher space requirements.
• The usual granularity is document-level, unless a significant fraction of the queries are expected to be proximity-based.
Inverted File Index: Example

Doc  Text
1    Pease porridge hot, pease porridge cold,
2    Pease porridge in the pot,
3    Nine days old.
4    Some like it hot, some like it cold,
5    Some like it in the pot,
6    Nine days old.

Term       Inverted list
cold       <2; 1, 4>
days       <2; 3, 6>
hot        <2; 1, 4>
in         <2; 2, 5>
it         <2; 4, 5>
like       <2; 4, 5>
nine       <2; 3, 6>
old        <2; 3, 6>
pease      <2; 1, 2>
porridge   <2; 1, 2>
pot        <2; 2, 5>
some       <2; 4, 5>
the        <2; 2, 5>

Notation: N: number of documents (= 6); n: number of distinct terms (= 13); f: number of index pointers (= 26)
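To make the structure concrete, here is a minimal Python sketch (not from the talk) that builds the document-level index above; punctuation and case have been stripped by hand for simplicity:

```python
from collections import defaultdict

docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}

index = defaultdict(list)              # term -> ascending list of doc IDs
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):     # each term once per document
        index[term].append(doc_id)

for term in sorted(index):
    postings = index[term]
    print(f"{term:<10} <{len(postings)}; {', '.join(map(str, postings))}>")
```

Running this reproduces the thirteen inverted lists shown in the table.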
Inverted File Compression
• Each inverted list has the form <f_t; d_1, d_2, ..., d_f_t>, where f_t is the number of documents containing term t and the d_i are document numbers in increasing order.
• A naïve representation stores each pointer in ceil(log N) bits, for a storage overhead of f * ceil(log N) bits in total.
• The list can also be stored as <f_t; d_1, d_2 - d_1, ..., d_f_t - d_(f_t - 1)>. Each difference is called a d-gap.
• Since the d-gaps in a list sum to at most N, most gaps are small, and each pointer can on average be coded in fewer than ceil(log N) bits.
• Assume the d-gap representation for the rest of the talk, unless stated otherwise.
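A small illustrative sketch of the d-gap transformation, assuming ascending document numbers:

```python
def to_dgaps(postings):
    """<d1, d2, ...> -> <d1, d2-d1, d3-d2, ...>"""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_dgaps(gaps):
    """Invert the transformation by accumulating a running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

assert to_dgaps([3, 5, 20, 21, 23, 76]) == [3, 2, 15, 1, 2, 53]
assert from_dgaps([3, 2, 15, 1, 2, 53]) == [3, 5, 20, 21, 23, 76]
```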
Text Compression
Two classes of text compression methods:
• Symbolwise (or statistical) methods
  • Estimate probabilities of symbols (the modeling step)
  • Code one symbol at a time (the coding step)
  • Use shorter codes for the most likely symbols
  • Usually based on either arithmetic or Huffman coding
• Dictionary methods
  • Replace fragments of text with a single code word (typically an index to an entry in the dictionary)
  • e.g., Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string
  • No probability estimates needed
Symbolwise methods are more suited for coding d-gaps.
Models

[Figure: a model feeds both the encoder and the decoder; the encoder turns text into compressed text, and the decoder reverses the process.]

• Models can be static, semi-static, or adaptive.
• The information content of a symbol s, denoted I(s), is given by Shannon's formula: I(s) = -log2 Pr[s].
• The entropy, or average amount of information per symbol over the whole alphabet, denoted H, is given by H = sum over s of Pr[s] * I(s) = -sum over s of Pr[s] * log2 Pr[s].
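As a quick illustration of these formulas, here is a short sketch that computes the entropy of the symbol distribution used in the Huffman example below:

```python
import math

def information(p):
    return -math.log2(p)                 # I(s) = -log2 Pr[s]

def entropy(probs):
    return sum(p * information(p) for p in probs)   # H = sum Pr[s] * I(s)

# Probabilities from the Huffman example that follows
probs = [0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1]
print(f"H = {entropy(probs):.3f} bits/symbol")      # H ~ 2.546
```

The Huffman code constructed next averages 2.6 bits per symbol, close to this entropy bound.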
Huffman Coding: Example

Symbols and probabilities: A 0.05, B 0.05, C 0.1, D 0.2, E 0.3, F 0.2, G 0.1.

The two least-probable subtrees are repeatedly merged (e.g., A and B combine into a node of weight 0.1) until a single tree of weight 1.0 remains; labeling each left branch 0 and each right branch 1 yields the codewords.

Symbol  Code  Probability
A       0000  0.05
B       0001  0.05
C       001   0.1
D       01    0.2
E       10    0.3
F       110   0.2
G       111   0.1
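A compact sketch of Huffman's algorithm (illustrative; different tie-breaking can yield different codeword lengths than the table above, but the expected length is the same):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Repeatedly merge the two least-likely subtrees; return {symbol: code}."""
    ticket = count()                    # tie-breaker so the heap never compares dicts
    heap = [(p, next(ticket), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}   # prefix one subtree with 0
        merged.update({s: "1" + w for s, w in c1.items()})  # the other with 1
        heapq.heappush(heap, (p0 + p1, next(ticket), merged))
    return heap[0][2]

probs = {"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2,
         "E": 0.3, "F": 0.2, "G": 0.1}
code = huffman_code(probs)
avg = sum(probs[s] * len(w) for s, w in code.items())
print(code, f"{avg:.2f} bits/symbol")   # 2.60, close to the entropy of ~2.546
```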
Arithmetic Coding: Example

String = bccb, alphabet = {a, b, c}, adaptive model with all symbol counts initially 1.

[Figure: three panels showing the interval narrowing.]
• A) Start with [0.0000, 1.0000] and Pr[a] = Pr[b] = Pr[c] = 1/3; coding b selects the middle third, the interval [0.3333, 0.6667].
• B) With b's count incremented (Pr[a] = 1/4, Pr[b] = 2/4, Pr[c] = 1/4), coding c selects the top quarter, [0.5834, 0.6667]; after another update (Pr[a] = 1/5, Pr[b] = 2/5, Pr[c] = 2/5), coding c again selects [0.6334, 0.6667].
• C) With Pr[a] = 1/6, Pr[b] = 2/6, Pr[c] = 3/6, coding the final b yields the final interval [0.6390, 0.6501], which represents the whole output.
• Code = 0.64, a number guaranteed to lie in the final interval.
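An illustrative sketch that reproduces this interval narrowing for bccb (the slide's printed endpoints differ in the last digit only because of rounding):

```python
def arithmetic_intervals(message, alphabet=("a", "b", "c")):
    """Adaptive model: all counts start at 1; a occupies the bottom of each interval."""
    counts = {s: 1 for s in alphabet}
    low, high = 0.0, 1.0
    for sym in message:
        total = sum(counts.values())
        width = high - low
        below = sum(counts[s] for s in alphabet if s < sym)
        low, high = (low + width * below / total,
                     low + width * (below + counts[sym]) / total)
        counts[sym] += 1                         # adapt the model
        print(f"{sym}: [{low:.4f}, {high:.4f})")
    return low, high

arithmetic_intervals("bccb")
# b: [0.3333, 0.6667)   c: [0.5833, 0.6667)
# c: [0.6333, 0.6667)   b: [0.6389, 0.6500)  -> 0.64 identifies the interval
```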
Arithmetic Coding: Conclusions
• High-probability events do not reduce the size of the interval in the next step very much, whereas low-probability events do.
• A small final interval requires many digits to specify a number guaranteed to be in the interval.
• The number of bits required is proportional to the negative logarithm of the size of the interval.
• A symbol s of probability Pr[s] contributes -log Pr[s] bits to the output.
• Arithmetic coding therefore produces near-optimal codes, given an accurate model.
Methods for Inverted File Compression
• Methods for compressing d-gap sizes can be classified into:
  • global: each list is compressed using the same model
  • local: the model for compressing an inverted list is adjusted according to some parameter, like the frequency of the term
• Global methods can be divided into:
  • non-parameterized: the probability distribution of d-gap sizes is predetermined
  • parameterized: the probability distribution is adjusted according to certain parameters of the collection
• By definition, local methods are parameterized.
Non-parameterized models
• Unary code: an integer x > 0 is coded as (x - 1) '1' bits followed by a '0' bit.
• γ code: x is coded as a unary code for 1 + floor(log2 x), followed by a code of floor(log2 x) bits that represents x - 2^floor(log2 x) in binary.
• δ code: like the γ code, except that the length component 1 + floor(log2 x) is itself represented using the γ code rather than unary.
• For small integers, δ codes are longer than γ codes, but for large integers the situation reverses.
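An illustrative sketch of these three codes as bit strings:

```python
def unary(x):
    return "1" * (x - 1) + "0"

def gamma(x):
    # unary code for 1 + floor(log2 x), then floor(log2 x) binary bits
    return unary(x.bit_length()) + format(x, "b")[1:]

def delta(x):
    # like gamma, but the length component is itself gamma-coded
    return gamma(x.bit_length()) + format(x, "b")[1:]

assert unary(3) == "110"
assert gamma(9) == "1110" + "001"    # 7 bits
assert delta(9) == "11000" + "001"   # 8 bits here, but shorter than gamma for large x
```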
Non-parameterized models
• Each code has an underlying probability distribution, which can be derived using Shannon's formula: a codeword of length L implies a probability of 2^-L.
• The probability assumed by unary, Pr[x] = 2^-x, is far too small for typical d-gap distributions; the γ and δ codes imply gentler distributions, decaying roughly as 1/(2x^2) and 1/(2x (log x)^2) respectively.
Global parameterized models
• Probability that a random document contains a random term: p = f / (N * n).
• Assuming a Bernoulli process (pointers scattered randomly), d-gap sizes follow a geometric distribution: Pr[x] = (1 - p)^(x - 1) * p.
• Arithmetic coding: code each gap directly against this geometric distribution.
• Huffman-style coding (Golomb coding): choose b so that (1 - p)^b is approximately 1/2 (b ~ 0.69 / p); code x as the unary code for q + 1, where q = floor((x - 1) / b), followed by the remainder x - q*b - 1 in minimal binary.
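A sketch of a Golomb coder under this Bernoulli model (parameter choices illustrative; p is computed from the pease-porridge example):

```python
import math

def minimal_binary(r, b):
    """Minimal binary code for r in [0, b): some values get one bit fewer."""
    if b == 1:
        return ""
    k = (b - 1).bit_length()        # ceil(log2 b)
    short = (1 << k) - b            # this many values fit in k-1 bits
    if r < short:
        return format(r, "b").zfill(k - 1)
    return format(r + short, "b").zfill(k)

def golomb(x, b):
    """Code d-gap x >= 1: unary quotient, then minimal binary remainder."""
    q, r = divmod(x - 1, b)
    return "1" * q + "0" + minimal_binary(r, b)

p = 26 / (6 * 13)                                    # p = f / (N * n) = 1/3
b = max(1, round(math.log(2) / -math.log(1 - p)))    # (1 - p)^b ~ 1/2  -> b = 2
print(b, golomb(7, b))                               # 2 11100
```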
Global observed frequency model
• Use the exact observed distribution of d-gap values, coded with arithmetic or Huffman coding
• Only slightly better than the γ or δ code
• Reason: pointers are not scattered randomly in the inverted file
• Need local methods for any improvement
Local methods
• Local Bernoulli
  • Use a different p for each inverted list, derived from the term's frequency f_t
  • Use the γ code for storing f_t
• Skewed Bernoulli
  • The local Bernoulli model is bad for clusters
  • Use a cross between γ and Golomb, with b = median gap size
  • Need to store b (use the γ representation)
  • This is still a static model
• What is really needed is an adaptive model that is good for clusters.
Interpolative code
• Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster.
• Code the pointers from the middle outward: the middle pointer is bounded by its already-coded neighbors, so it can be coded with a minimal binary code for that narrowed range; then recurse on each half (see the sketch below).
• Within a tight cluster the ranges shrink rapidly, so we can do better than gap-based codes with a minimal binary code; a range containing a single possible value costs zero bits.
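A sketch of the recursion; the list values are illustrative, chosen to contain the cluster 8, 9, 11, 12, 13, with N assumed to be 20:

```python
def interpolative(pointers, lo, hi, emit):
    """Code the middle pointer against known bounds, then recurse on each half."""
    if not pointers:
        return
    mid = len(pointers) // 2
    x = pointers[mid]
    # x is at least lo + mid (room for the left half) and at most
    # hi - (number of pointers to its right)
    emit(x, lo + mid, hi - (len(pointers) - 1 - mid))
    interpolative(pointers[:mid], lo, x - 1, emit)
    interpolative(pointers[mid + 1:], x + 1, hi, emit)

interpolative([3, 8, 9, 11, 12, 13, 17], 1, 20,
              lambda x, a, b: print(f"code {x} in [{a}, {b}]"))
# 12 ends up in the one-value range [12, 12] and costs zero bits
```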
Performance of index compression methods

[Table: compression of inverted files, in bits per pointer, for the methods above; not reproduced here.]
Signature Files
• Each document is given a signature that captures its content:
  • Hash each document term to get several hash values
  • Set the bits corresponding to those values to 1
• Query processing:
  • Hash each query term to get several hash values
  • If a document has all the bits corresponding to those values set to 1, it may contain the query term
• To control false matches:
  • set several bits for each term
  • make the signatures sufficiently long
• A naïve representation may have to read the entire signature file for each query term; use bitslicing to save on disk transfer time.
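A sketch of the idea; the signature width W and the number of bits per term K are illustrative choices:

```python
import hashlib

W, K = 64, 3   # signature width in bits, hash bits set per term (assumed values)

def term_bits(term):
    """K bit positions for a term, derived from a hash of the term."""
    h = hashlib.md5(term.encode()).digest()
    return {h[i] % W for i in range(K)}

def signature(terms):
    sig = 0
    for t in terms:
        for b in term_bits(t):
            sig |= 1 << b
    return sig

def maybe_contains(sig, term):
    # True may be a false match; False is always correct.
    return all((sig >> b) & 1 for b in term_bits(term))

sig = signature("pease porridge hot".split())
assert maybe_contains(sig, "porridge")
```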
Signature files: Conclusion
• Design involves many tradeoffs:
  • wide, sparse signatures reduce the number of false matches
  • short, dense signatures save space but require more disk accesses to eliminate false matches
• For reasonable query times, signature files require more space than a compressed inverted file.
• They are inefficient for documents of varying sizes, and blocking makes simple queries difficult to answer.
• The underlying assumption is shaky: text is not random.
Bitmaps
• Simple representation: for each term in the lexicon, store a bitvector of length N; a bit is set if and only if the corresponding document contains the term.
• Efficient for boolean queries.
• Enormous storage requirement, even after removing stop words.
• Bitmaps have been used to represent common words.
Compression of signature files and bitmaps
• Signature files are already in compressed form:
  • decompression affects query time substantially
  • lossy compression results in false matches
• Bitmaps can be compressed by a significant amount. For example, split the bitmap into 4-bit blocks and group the blocks into fours: a top-level code marks which groups contain any 1-bits, a mid-level code marks the nonzero blocks within each such group, and only the nonzero blocks themselves are stored (a sketch follows).

Uncompressed bitmap:
0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000

Compressed code: 1100 : 0101, 1010 : 0010, 0011, 1000, 0100
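A sketch reproducing the two-level compression above (block and group sizes of 4 are the example's choice):

```python
def compress(bits, blk=4):
    """Two-level hierarchical bitmap compression: keep only nonzero pieces."""
    blocks = [bits[i:i + blk] for i in range(0, len(bits), blk)]
    groups = [blocks[i:i + blk] for i in range(0, len(blocks), blk)]
    # top level: which groups contain any 1-bits
    top = "".join("1" if any("1" in b for b in g) else "0" for g in groups)
    # mid level: within each nonzero group, which blocks are nonzero
    mids = ["".join("1" if "1" in b else "0" for b in g)
            for g in groups if any("1" in b for b in g)]
    # leaves: the nonzero blocks themselves
    leaves = [b for g in groups for b in g if "1" in b]
    return top, mids, leaves

bitmap = ("0000 0010 0000 0011 1000 0000 0100 0000 "
          "0000 0000 0000 0000 0000 0000 0000 0000").replace(" ", "")
print(compress(bitmap))
# -> ('1100', ['0101', '1010'], ['0010', '0011', '1000', '0100'])
```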
Comparison of indexing methods
• All indexing methods are variations of the same basic idea!
• Signature files and inverted files require an order of magnitude less secondary storage than bitmaps.
• Signature files cause unnecessary accesses to the document collection unless the signature width is large.
• Signature files are disastrous when record lengths vary a lot.
• Advantages of signature files:
  • no need to keep the lexicon in memory
  • better for conjunctive queries involving common terms
• Compressed inverted files are the most useful method for indexing a collection of variable-length text documents.
Conclusion
• For practical purposes, the best index compression algorithm is the local Bernoulli method (using Golomb coding).
• In virtually all practical situations, compressed inverted indices are better than signature files and bitmaps, in terms of both space and query response time.