1 / 76

Indexing and Searching

Indexing and Searching. J. H. Wang Feb. 20, 2008. Text. User Interface. 4, 10. user need. Text. Text Operations. 6, 7. logical view. logical view. Query Operations. DB Manager Module. Indexing. user feedback. 5. 8. inverted file. query. Searching. Index. 8.

mikasi
Download Presentation

Indexing and Searching

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Indexing and Searching J. H. Wang Feb. 20, 2008

  2. Text User Interface 4, 10 user need Text Text Operations 6, 7 logical view logical view Query Operations DB Manager Module Indexing user feedback 5 8 inverted file query Searching Index 8 retrieved docs Text Database Ranking ranked docs 2 The Retrieval Process

  3. Outline • Conventional text retrieval systems (8.1-8.3, Salton) • File Structures for Indexing and Searching (Chap. 8) • Inverted files • Suffix trees and suffix arrays • Signature files • Sequential searching • Pattern matching

  4. Conventional Text Retrieval Systems • Database management, e.g. employee DB • Structured records • Precise meaning for attribute values • Exact match • Text retrieval, e.g. bibliographic systems • Structured attributes and unstructured content • Index terms • Imprecise representation of the text • Approximate or partial matching

  5. Conceptual Information Retrieval Queries Similaritycomputation Documents Retrieval of similar terms

  6. Expanded Text Retrieval System Queries Formalstatements Similaritycomputation Indexeddocuments Documents Negotiationand analysis(Query formulation) Text indexing(Content Analysis) Retrieval of similar terms Taipei city government Taipei travel guide Wiki page on Taipei Taipei 101 Taipei times … Taipei

  7. Representation • Documents • Indexed terms (or term vectors) • Unweighted or weighted • Queries • Unweighted or weighted terms • Boolean operators: or, and, not • E.g. Taiwan AND NOT Taipei • Efficiency

  8. Data Structure • Requirement • Fast access to documents • Very large number of index terms • For each term a separate index is constructed that stores the document identifiers for all documents identified by that term • Inverted index (or inverted file)

  9. Inverted Index • The complete file is represented as an array of indexed documents.

  10. Inverted-file Process • The document-term array is inverted (actually transposed).

  11. Inverted-file Process • The rows are manipulated according to query specification. (list-merging) • Ex: Query = (term 2 and term 3) 1 1 0 0 0 1 1 1-------------------------------------------- 0 1 0 0 • Ex: Query = ((T1 or T2) and not T3)

  12. Extensions of Inverted Index • Distance Constraints • Term Weights • Synonym Specification • Term Truncation

  13. Distance Constraints • Nearness parameters • Within sentence: terms cooccur in a common sentence • Adjacency: terms occur adjacently in the text

  14. Implementation • To include term-location information in the inverted index information: {D345, D348, D350, …}retrieval: {D123, D128, D345, …} • Cost: size of the indexes • To include sentence numbers for all term occurrences in the inverted index information: {D345, 25; D345, 37; D348, 10; D350, 8;…}retrieval: {D123, 5; D128, 25; D345, 37; D345, 40;…}

  15. To include paragraph numbers, sentence numbers within paragraphs, word numbers within sentences in the inverted index information: {D345, 2, 3, 5}retrieval: {D345, 2, 3, 6} • Ex: (information adjacent retrieval)(information within five words retrieval)

  16. Term Weights • Term-importance weights • Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6} • Issues • How to generate term weights? (more on this later) • How to apply term weights? • Vector queries: the sum of the weights of all document terms that match the given query • Boolean queries: (more complex)

  17. Term Weights (for Boolean Queries) • Transforming each query into sum-of-products form (or disjunctive normal form) • The weight of each conjunct is the minimum term weight of any document term in the conjunct • The document weight is the maximum of all the conjunct weights

  18. An Example • Example: Q=(T1 and T2) or T3Document Conjunct QueryVectors Weights Weight(T1 and T2) (T3) (T1 and T2) or T3D1=(T1,0.2;T2,0.5;T3,0.6) 0.2 0.6 0.6D2=(T1,0.7;T2,0.2;T3,0.1) 0.2 0.1 0.2D1 is preferred.

  19. Synonym Specification • (T1 and T2) or T3 • ((T1 or S1) and T2) or (T3 or S3) • Term Truncation (or stemming) • Removing suffixes and/or prefixes • Ex:PSYCH*: psychiatrist, psychiatry, psychiatric,psychology, psychological, …

  20. File Structures for Indexing and Searching

  21. Introduction • How to retrieval information? • A simple alternative is to search the whole text sequentially (online search) • Another option is to build data structures over the text (called indices) to speed up the search

  22. Introduction • Indexing techniques • Inverted files • Suffix arrays • Signature files

  23. Notation • n: the size of the text • m: the length of the pattern (m << n) • v: the size of the vocabulary • M: the amount of main memory available

  24. Inverted Files • Definition:an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. • Structure of inverted file: • Vocabulary: is the set of all distinct words in the text • Occurrences: lists containing all information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)

  25. Example • Text: • Inverted file 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful Vocabulary Occurrences beautiful flowers garden house 70 45, 58 18, 29 6

  26. Space Requirements • The space required for the vocabulary is rather small. According to Heaps’ law the vocabulary grows as O(n), where  is a constant between 0.4 and 0.6 in practice (sublinear) • On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n) • To reduce space requirements, a technique called block addressing is used

  27. Block Addressing • The text is divided in blocks • The occurrences point to the blocks where the word appears • Advantages: • the number of pointers is smaller than positions • all the occurrences of a word inside a single block are collapsed to one reference • Disadvantages: • online search over the qualifying blocks if exact positions are required

  28. Example • Text: • Inverted file: Block 1 Block 2 Block 3 Block 4 That house has a garden. The garden has many flowers. The flowers are beautiful Vocabulary Occurrences beautiful flowers garden house 4 3 2 1

  29. Searching • The search algorithm on an inverted index follows three steps • Vocabulary search: the words present in the query are searched in the vocabulary • Retrieval of occurrences: the lists of the occurrences of all words found are retrieved • Manipulation of occurrences: the occurrences are processed to solve the query

  30. Searching • Searching task on an inverted file always starts in the vocabulary (It is better to store the vocabulary in a separate file) • The structures most used to store the vocabulary are hashing, tries or B-trees • Hashing, tries: O(m) • An alternative is simply storing the words in lexicographical order (cheaper in space and very competitive with O(log v) cost)

  31. Construction • All the vocabulary is kept in a suitable data structure storing for each word a list of its occurrences • Each word of the text is read and searched in the vocabulary • If it is not found, it is added to the vocabulary with a empty list of occurrences and the new position is added to the end of its list of occurrences

  32. Example • Text: • Vocabulary trie 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful beautiful: 70 ‘b’ ‘f’ flower: 45, 58 ‘g’ garden: 18, 29 ‘h’ house: 6

  33. Construction • Once the text is exhausted, the vocabulary is written to disk with the list of occurrences. Two files are created: • in the first file, the list of occurrences are stored contiguously (posting file) • in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time • The overall process is O(n) worst-case time • Not practical for large texts

  34. Construction • An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased the main memory before continuing with the rest of the text • Once the text is exhausted, a number of partial indices Ii exist on disk • The partial indices are merged to obtain the final index

  35. Example I 1...8 final index 7 level 3 I 1...4 I 5...8 3 6 level 2 I 1...2 I 3...4 I 5...6 I 7...8 level 1 1 2 4 5 I 1 I 2 I 3 I 4 I 5 I 6 I 7 I 8 initial dumps

  36. Construction • The total time to generate partial indices is O(n) • The number of partial indices is O(n/M) • To merge the O(n/M) partial indices,log2(n/M) merging levelsare necessary • The total cost of this algorithm is O(n log(n/M))

  37. Summary on Inverted File • Inverted file is probably the most adequate indexing technique for database text • The indices are appropriate when the text collection is large and semi-static • Otherwise, if the text collection is volatile online searching is the only option • Some techniques combine online and indexed searching

  38. Suffix Trees and Suffix Arrays • Each position in the text is considered as a text suffix • Index points are selected form the text, which point to the beginning of the text positions which will be retrievable • The problem with suffix trees is its space overhead

  39. Example • Text: • Suffixes • house has a garden. The garden has many flowers. The flowers are beautiful • garden. The garden has many flowers. The flowers are beautiful • garden has many flowers. The flowers are beautiful • flowers. The flowers are beautiful • flowers are beautiful • beautiful 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful

  40. Example • Text: • Suffix Trie 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful 70 ‘ ’ 58 ‘b’ ‘e’ ‘r’ ‘s’ ‘l’ ‘o’ ‘w’ ‘f’ 45 ‘.’ ‘g’ ‘ ’ 29 ‘e’ ‘n’ ‘a’ ‘r’ ‘d’ ‘h’ 18 ‘.’ 6

  41. Example • Text: • Suffix Tree 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful 70 45 ‘b’ ‘.’ ‘f’ 8 58 ‘ ’ 1 ‘g’ ‘.’ 18 ‘h’ 7 29 ‘ ’ 6

  42. Suffix Arrays • An array containing all the pointers to the text suffixes listed in lexicographical order • The space requirements are almost the same as those for inverted indices • The main drawbacks of suffix array are its costlyconstruction process • Allow binary searches done by comparing the contents of each pointer • Supra-indices (for large suffix array) • The space requirements of suffix array with vocabulary supra-index are exactly the same as for inverted indices

  43. Example • Text: • Suffix Array • Supra Index (l=4, b=2) 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful

  44. Example • Text: • Vocabulary Supra-Index • Suffix Array • Inverted List 1 6 12 16 18 25 29 36 40 45 54 58 66 70 That house has a garden. The garden has many flowers. The flowers are beautiful

  45. Construction of Suffix Arrays for Large Texts Small text 1 2 Small suffix array Long text 2 3 Long suffix array Counters 3 3 Final suffix array

  46. Signature Files • Characteristics • Word-oriented index structures based on hashing • Low overhead (10%~20% over the text size) at the cost of forcing a sequential search over the index • Suitable for not very large texts • Inverted files outperform signature files for most applications

  47. Construction and Search • Word-oriented index structures base on hashing • Maps words to bit masks of B bits • Divides the text in blocks of b words each • The mask is obtained by bitwise ORing the signatures of all the words in the text block. • Search • Hash the query to a bit mask W • If W & Bi = W, the text block may contain the word • For all candidate blocks, an online traversal must be performed to verify if the word is actually there

  48. Block 4: 001100 OR 100001 101101 Example • Four blocks: • This is a text. A text has many words. Words are made from letters. 000101 110101 100100 101101 • Hash(text) = 000101 • Hash(many)= 110000 • Hash(words)= 100100 • Hash(made)= 001100 • Hash(letters)= 100001

  49. False Drop • Assumes that l bits are randomly set in the mask • Let a=l/B • For b words, the probability that a given bit of the mask is set is 1-(1-1/B)bl1-e-ba • Hence, the probability that the l random bits are also set is Fd =(1-e-ba)aB  False alarm • Fd is minimized for a=ln(2)/b • Fd = 2-ll = B ln2/b

  50. Comparisons • Signature files • Use hashing techniques to produce an index • advantage • storage overhead is small (10%-20%) • disadvantages • the search time on the index is linear • some answers may not match the query, thus filtering must be done

More Related