
Indexing



Presentation Transcript


  1. Indexing

  2. Overview of the Talk • Inverted File Indexing • Compression of inverted files • Signature files and bitmaps • Comparison of indexing methods • Conclusion

  3. Inverted File Indexing • An inverted file index contains a list of the terms that appear in the document collection (called a lexicon or vocabulary) and, for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection; this list is called an inverted list. • The granularity of an index determines how accurately it represents the location of a word. • A coarse-grained index requires less storage but more query processing to eliminate false matches. • A word-level index enables queries involving adjacency and proximity, but has higher space requirements. • The usual granularity is document-level, unless a significant fraction of the queries are expected to be proximity-based.

  4. Inverted File Index: Example

Doc  Text
1    Pease porridge hot, pease porridge cold,
2    Pease porridge in the pot,
3    Nine days old.
4    Some like it hot, some like it cold,
5    Some like it in the pot,
6    Nine days old.

Terms      Documents
cold       <2; 1, 4>
days       <2; 3, 6>
hot        <2; 1, 4>
in         <2; 2, 5>
it         <2; 4, 5>
like       <2; 4, 5>
nine       <2; 3, 6>
old        <2; 3, 6>
pease      <2; 1, 2>
porridge   <2; 1, 2>
pot        <2; 2, 5>
some       <2; 4, 5>
the        <2; 2, 5>

Notation: N = number of documents (= 6); n = number of distinct terms (= 13); f = number of index pointers (= 26)
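As a concrete illustration of the slide above, here is a minimal Python sketch that builds a document-level inverted index for the pease-porridge collection. The tokenization (lowercasing and stripping punctuation) is an assumption made for illustration, not something specified in the talk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Document-level inverted index: term -> <f_t; d_1, ..., d_{f_t}>."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().replace(",", " ").replace(".", " ").split():
            index[term].add(doc_id)
    return {term: (len(ids), sorted(ids)) for term, ids in index.items()}

docs = [
    (1, "Pease porridge hot, pease porridge cold,"),
    (2, "Pease porridge in the pot,"),
    (3, "Nine days old."),
    (4, "Some like it hot, some like it cold,"),
    (5, "Some like it in the pot,"),
    (6, "Nine days old."),
]

index = build_inverted_index(docs)
print(index["pease"])     # (2, [1, 2])
print(index["porridge"])  # (2, [1, 2])
```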

  5. Inverted File Compression • Each inverted list has the form <f_t; d_1, d_2, ..., d_{f_t}>. • A naïve representation, using ⌈log₂ N⌉ bits per pointer, results in a storage overhead of roughly f · ⌈log₂ N⌉ bits. • This can also be stored as <f_t; d_1, d_2 − d_1, ..., d_{f_t} − d_{f_t−1}>; each difference is called a d-gap. • Since the d-gaps in a list sum to at most N, each pointer can be coded in fewer than ⌈log₂ N⌉ bits on average. • Assume the d-gap representation for the rest of the talk, unless stated otherwise.
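A small sketch of the d-gap transformation described above; the sample list is arbitrary and not taken from the talk.

```python
def to_dgaps(doc_ids):
    """Convert a sorted list of document numbers into d-gaps."""
    prev, gaps = 0, []
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_dgaps(gaps):
    """Recover the document numbers from the d-gaps (prefix sums)."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

assert to_dgaps([3, 8, 9, 11, 12, 13, 17]) == [3, 5, 1, 2, 1, 1, 4]
assert from_dgaps([3, 5, 1, 2, 1, 1, 4]) == [3, 8, 9, 11, 12, 13, 17]
```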

  6. Text Compression • Two classes of text compression methods: • Symbolwise (or statistical) methods: estimate probabilities of symbols (the modelling step) and code one symbol at a time (the coding step), using a shorter code for the most likely symbols; usually based on either arithmetic or Huffman coding. • Dictionary methods: replace fragments of text with a single code word (typically an index to an entry in the dictionary); no probability estimates needed. E.g. Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string. • Symbolwise methods are better suited for coding d-gaps.

  7. Models • [Diagram: the encoder and the decoder each use the same model to turn text into compressed text and back.] • Models can be static, semi-static or adaptive. • The information content of a symbol s, denoted I(s), is given by Shannon's formula: I(s) = −log₂ Pr[s]. • The entropy, or average amount of information per symbol over the whole alphabet, denoted H, is H = Σ_s Pr[s] · I(s) = −Σ_s Pr[s] · log₂ Pr[s].

  8. Huffman Coding: Example • Symbol probabilities: A 0.05, B 0.05, C 0.1, D 0.2, E 0.3, F 0.2, G 0.1

  9. Huffman Coding: Example • First merge: the two least likely symbols, A (0.05) and B (0.05), are combined into a node of probability 0.1.

  10. Huffman Coding: Example • Completed code tree (internal node probabilities 0.1, 0.2, 0.3, 0.4, 0.6, 1.0):

Symbol  Code  Probability
A       0000  0.05
B       0001  0.05
C       001   0.1
D       01    0.2
E       10    0.3
F       110   0.2
G       111   0.1
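The merging process sketched on slides 8-10 can be written compactly in Python. This is only a sketch: depending on how ties between equal probabilities are broken, the individual codes (and even some code lengths) may differ from the table above, but any such Huffman code has the same average code length.

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code by repeatedly merging the two least likely nodes."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}   # left subtree gets prefix 0
        merged.update({s: "1" + code for s, code in c2.items()})  # right subtree gets prefix 1
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

probs = {"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2, "E": 0.3, "F": 0.2, "G": 0.1}
codes = huffman_codes(probs)
for sym in sorted(codes):
    print(sym, codes[sym])
```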

  11. Arithmetic Coding: Example • String = bccb, alphabet = {a, b, c} • [Diagram: the unit interval is divided among a, b and c in proportion to their (adaptive) probabilities; coding b selects the interval [0.3333, 0.6667), and coding the remaining symbols c, c, b repeatedly subdivides it, leaving a small final interval that represents the whole output.] • Code = 0.64
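Below is a simplified Python trace of the interval narrowing, assuming the example uses an adaptive model that starts every symbol with a count of 1 (an assumption, but one consistent with the probabilities 1/3, 1/4, 2/4, ... visible in the figure). A real arithmetic coder works with integer arithmetic and emits bits incrementally rather than returning a floating-point interval.

```python
def arithmetic_intervals(message, alphabet):
    """Trace the interval narrowing for an adaptive (add-one count) model."""
    low, high = 0.0, 1.0
    counts = {s: 1 for s in alphabet}      # adaptive model: every symbol starts with count 1
    for sym in message:
        total = sum(counts.values())
        width = high - low
        cum = 0.0
        for s in sorted(alphabet):         # fixed symbol order: a, b, c
            p = counts[s] / total
            if s == sym:
                low, high = low + cum * width, low + (cum + p) * width
                break
            cum += p
        counts[sym] += 1                   # update the model after coding the symbol
    return low, high

low, high = arithmetic_intervals("bccb", "abc")
print(low, high)   # any short decimal inside [low, high) identifies the whole message
```

For "bccb" this yields a final interval of roughly [0.639, 0.650), which contains the code 0.64 quoted on the slide.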

  12. Arithmetic Coding: Conclusions • High-probability events do not reduce the size of the interval in the next step very much, whereas low-probability events do. • A small final interval requires many digits to specify a number guaranteed to be in the interval. • The number of bits required is proportional to the negative logarithm of the size of the interval. • A symbol s of probability Pr[s] contributes −log₂ Pr[s] bits to the output. • Arithmetic coding produces near-optimal codes, given an accurate model.

  13. Methods for Inverted File Compression • Methods for compressing d-gap sizes can be classified into • global: each list is compressed using the same model • local: the model for compressing an inverted list is adjusted according to some parameter, like the frequency of the term • Global methods can be divided into • non-parameterized: probability distribution for d-gap sizes is predetermined. • parameterized: probability distribution is adjusted according to certain parameters of the collection. • By definition, local methods are parameterized.

  14. Non-parameterized models • Unary code: an integer x > 0 is coded as (x − 1) '1' bits followed by a '0' bit. • γ code: a number x is coded as a unary code for 1 + ⌊log₂ x⌋, followed by a code of ⌊log₂ x⌋ bits that represents x − 2^⌊log₂ x⌋ in binary. • δ code: the number of bits, 1 + ⌊log₂ x⌋, is itself represented using the γ code, followed by the same ⌊log₂ x⌋-bit binary part. • For small integers, δ codes are longer than γ codes, but for large integers the situation reverses.
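The three codes can be sketched in a few lines of Python; bit strings are used instead of packed bits, purely for illustration.

```python
def unary(x):
    """Unary code: (x - 1) '1' bits followed by a '0' bit."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma code: unary code for 1 + floor(log2 x), then the low-order bits of x."""
    b = x.bit_length() - 1                                   # floor(log2 x)
    suffix = format(x - (1 << b), "b").zfill(b) if b else ""
    return unary(b + 1) + suffix

def delta(x):
    """Elias delta code: the length part is gamma-coded instead of unary-coded."""
    b = x.bit_length() - 1
    suffix = format(x - (1 << b), "b").zfill(b) if b else ""
    return gamma(b + 1) + suffix

for x in (1, 2, 5, 9):
    print(x, unary(x), gamma(x), delta(x))
# e.g. gamma(9) = '1110' + '001'; delta(9) = gamma(4) + '001' = '11000' + '001'
```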

  15. Non-parameterized models • Each code has an underlying probability distribution, which can be derived using Shannon's formula: for example, the unary code implies Pr[x] = 2^−x, and the γ code implies Pr[x] ≈ 1/(2x²). • The probability assumed by the unary code is too small: it decays far too quickly for typical d-gaps.

  16. Global parameterized models • The probability that a random document contains a random term is p = f / (N · n). • Assuming a Bernoulli process, d-gaps follow the geometric distribution Pr[x] = (1 − p)^(x−1) · p. • Arithmetic coding: code each d-gap directly against this distribution. • Huffman-style coding (Golomb coding): choose b ≈ 0.69 / p; code a gap x as the quotient q = ⌊(x − 1)/b⌋ in unary, followed by the remainder r = x − q·b − 1 in minimal binary.
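A sketch of Golomb coding under the Bernoulli model described above. The parameter choice b ≈ 0.69/p is the usual approximation, and the minimal-binary handling of the remainder follows the standard construction; both are stated here as assumptions rather than details taken from the talk.

```python
import math

def golomb(x, b):
    """Golomb code for x > 0: quotient in unary, remainder in minimal (truncated) binary."""
    q, r = divmod(x - 1, b)
    code = "1" * q + "0"                         # quotient q in unary
    if b == 1:
        return code                              # no remainder bits needed
    k = math.ceil(math.log2(b))
    threshold = (1 << k) - b                     # the first `threshold` remainders use k-1 bits
    if r < threshold:
        return code + format(r, "b").zfill(k - 1)
    return code + format(r + threshold, "b").zfill(k)

def golomb_parameter(N, n, f):
    """Approximate optimal b for a Bernoulli model with p = f / (N * n)."""
    p = f / (N * n)
    return max(1, round(0.69 / p))

# With N = 6, n = 13, f = 26 (the pease-porridge example), p = 1/3 and b = 2.
b = golomb_parameter(6, 13, 26)
print(b, [golomb(x, b) for x in (1, 2, 3, 4, 5)])
```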

  17. Global observed frequency model • Use exact d-gap values and then use arithmetic or Huffman coding • Only slightly better than the γ or δ code • Reason: pointers are not scattered randomly in the inverted file • Need local methods for any improvement

  18. Local methods • Local Bernoulli: use a different p for each inverted list; the per-list parameter (the term frequency f_t) is stored using the γ code. • Skewed Bernoulli: the local Bernoulli model is bad for clusters; use a cross between the γ and Golomb codes, with b = median gap size; b must also be stored (use the γ representation). • This is still a static model; what is needed is an adaptive model that is good for clusters.

  19. Interpolative code • Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster, for example <7; 3, 8, 9, 11, 12, 13, 17> in a collection of N = 20 documents. • Code the middle pointer first, within the range allowed by the list bounds; then recursively code the left and right halves, each within the range left open by the pointers already coded. • Inside a tight cluster the remaining ranges are small, so pointers need very few bits, and sometimes none at all. • Can do better with a minimal binary code.
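A recursive sketch of the idea, using the example list and N = 20 assumed above. Instead of emitting bits it just reports, for each pointer, the range it would be coded in and the width of the corresponding minimal binary code.

```python
def interpolative(pointers, lo, hi):
    """Binary interpolative coding (sketch): return (value, lo, hi) triples showing
    the range each pointer is coded in; a real coder writes the value in minimal
    binary within that range instead of recording it explicitly."""
    if not pointers:
        return []
    mid = len(pointers) // 2
    d = pointers[mid]
    # `mid` pointers lie below d and len(pointers)-mid-1 lie above it, so d is
    # constrained to [lo + mid, hi - (len(pointers) - mid - 1)].
    entry = (d, lo + mid, hi - (len(pointers) - mid - 1))
    return ([entry]
            + interpolative(pointers[:mid], lo, d - 1)
            + interpolative(pointers[mid + 1:], d + 1, hi))

# Assumed example: documents 3, 8, 9, 11, 12, 13, 17 in a collection of N = 20.
for value, low, high in interpolative([3, 8, 9, 11, 12, 13, 17], 1, 20):
    bits = (high - low).bit_length()   # width of the minimal binary code for this range
    print(f"{value:2d} coded within [{low}, {high}] -> at most {bits} bits")
```

Note that the pointer 12 is coded within the range [12, 12] and therefore needs no bits at all, which is the advantage the slide points to for clustered lists.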

  20. Performance of index compression methods • [Table: compression of inverted files, in bits per pointer, for the global and local methods discussed.]

  21. Signature Files • Each document is given a signature that captures its content: hash each document term to get several hash values and set the bits corresponding to those values to 1. • Query processing: hash each query term to get several hash values; if a document has all bits corresponding to those values set to 1, it may contain the query term. • To limit false matches, set several bits for each term and make the signatures sufficiently long. • A naïve representation may have to read the entire signature file for each query term; use bitslicing to save on disk transfer time.
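A toy Python sketch of the superimposed-coding signatures described above. The signature width (64 bits), the number of bits set per term (3) and the use of SHA-1 as the hash function are arbitrary choices made for illustration, not parameters from the talk.

```python
import hashlib

WIDTH = 64     # signature width in bits (illustrative)
NBITS = 3      # number of bits set per term (illustrative)

def term_bits(term):
    """Hash a term to NBITS bit positions within the signature."""
    return [int.from_bytes(hashlib.sha1(f"{i}:{term}".encode()).digest()[:4], "big") % WIDTH
            for i in range(NBITS)]

def signature(text):
    """Superimpose (OR together) the bits of every term in the document."""
    sig = 0
    for term in text.lower().split():
        for pos in term_bits(term):
            sig |= 1 << pos
    return sig

def may_contain(sig, query_terms):
    """True if every query term's bits are set -- this may still be a false match."""
    needed = 0
    for term in query_terms:
        for pos in term_bits(term):
            needed |= 1 << pos
    return sig & needed == needed

sig = signature("pease porridge in the pot")
print(may_contain(sig, ["porridge"]))   # True
print(may_contain(sig, ["days"]))       # probably False, but a false match is possible
```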

  22. Signature files: Conclusion • Design involves many tradeoffs: wide, sparse signatures reduce the number of false matches; short, dense signatures require more disk accesses. • For reasonable query times, signature files require more space than a compressed inverted file. • They are inefficient for documents of varying sizes, and blocking makes simple queries difficult to answer. • Text is not random, which undermines the assumptions behind the method.

  23. Bitmaps • Simple representation: for each term in the lexicon, store a bitvector of length N; a bit is set if and only if the corresponding document contains the term. • Efficient for Boolean queries. • Enormous storage requirement, even after removing stop words. • Have been used to represent common words.
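A minimal sketch of the bitmap representation and a conjunctive (Boolean AND) query, using Python integers as bitvectors; the tokenization is again an illustrative assumption.

```python
def build_bitmaps(docs, N):
    """One bitvector of length N per term; bit d-1 is set iff document d contains the term."""
    bitmaps = {}
    for doc_id, text in docs:
        for term in text.lower().replace(",", " ").replace(".", " ").split():
            bitmaps[term] = bitmaps.get(term, 0) | (1 << (doc_id - 1))
    return bitmaps

def conjunctive_query(bitmaps, terms, N):
    """AND the bitvectors together and report the matching document numbers."""
    result = (1 << N) - 1
    for term in terms:
        result &= bitmaps.get(term, 0)
    return [d + 1 for d in range(N) if result >> d & 1]

docs = [(1, "Pease porridge hot, pease porridge cold,"),
        (2, "Pease porridge in the pot,"),
        (4, "Some like it hot, some like it cold,")]
print(conjunctive_query(build_bitmaps(docs, 6), ["porridge", "hot"], 6))  # [1]
```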

  24. Compression of signature files and bitmaps • Signature files are already in compressed form; decompression affects query time substantially, and lossy compression results in false matches. • Bitmaps can be compressed by a significant amount. Example (a two-level hierarchical scheme):

Uncompressed bitmap (16 blocks of 4 bits):
0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000

Compressed code: 1100 : 0101, 1010 : 0010, 0011, 1000, 0100

The top-level vector 1100 marks which quarters of the bitmap contain set bits, the second-level vectors 0101 and 1010 mark the nonzero 4-bit blocks within those quarters, and only the nonzero blocks 0010, 0011, 1000, 0100 are stored.
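The example above can be reproduced with a short sketch of a two-level hierarchical compressor. The block size of 4 bits and the fan-out of 4 are assumptions chosen to match the example; the actual scheme used in the talk may differ in detail.

```python
def compress_bitmap(bits, block=4, fanout=4):
    """Two-level hierarchical compression: a top-level vector marks which groups of
    `fanout` blocks contain set bits, a second-level vector per nonzero group marks
    its nonzero blocks, and only the nonzero blocks themselves are stored."""
    blocks = [bits[i:i + block] for i in range(0, len(bits), block)]
    groups = [blocks[i:i + fanout] for i in range(0, len(blocks), fanout)]
    top = "".join("1" if any("1" in b for b in g) else "0" for g in groups)
    second, leaves = [], []
    for g in groups:
        if any("1" in b for b in g):
            second.append("".join("1" if "1" in b else "0" for b in g))
            leaves.extend(b for b in g if "1" in b)
    return top, second, leaves

bitmap = ("0000 0010 0000 0011 1000 0000 0100 0000 "
          "0000 0000 0000 0000 0000 0000 0000 0000").replace(" ", "")
print(compress_bitmap(bitmap))
# ('1100', ['0101', '1010'], ['0010', '0011', '1000', '0100'])
```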

  25. Comparison of indexing methods • All indexing methods are variations of the same basic idea! • Signature files and inverted files require an order of magnitude less secondary storage than bitmaps. • Signature files cause unnecessary accesses to the document collection unless the signature width is large, and they are disastrous when record lengths vary a lot. • Advantages of signature files: no need to keep the lexicon in memory; better for conjunctive queries involving common terms. • Compressed inverted files are the most useful for indexing a collection of variable-length text documents.

  26. Conclusion • For practical purposes, the best index compression algorithm is the local Bernoulli method (using Golomb coding). • In most practical situations, compressed inverted indices are better than signature files and bitmaps, in terms of both space and query response time.
