Indexing and Searching

Indexing and Searching J. H. Wang Feb. 20, 2008

The Retrieval Process [Figure: block diagram with components User Interface (4, 10), Text Operations (6, 7), Query Operations (5), Indexing (8), Searching (8), Ranking, DB Manager Module, Index, and Text Database; arrows labeled user need, logical view, user feedback, query, inverted file, retrieved docs, and ranked docs.]

Outline • Conventional text retrieval systems (8.1-8.3, Salton) • File Structures for Indexing and Searching (Chap. 8) • Inverted files • Suffix trees and suffix arrays • Signature files • Sequential searching • Pattern matching

Conventional Text Retrieval Systems • Database management, e.g. employee DB • Structured records • Precise meaning for attribute values • Exact match • Text retrieval, e.g. bibliographic systems • Structured attributes and unstructured content • Index terms • Imprecise representation of the text • Approximate or partial matching

Conceptual Information Retrieval [Figure: Queries and Documents feed a Similarity computation, which leads to Retrieval of similar terms.]

Expanded Text Retrieval System [Figure: Queries pass through Negotiation and analysis (Query formulation) to become Formal statements; Documents pass through Text indexing (Content Analysis) to become Indexed documents; both feed a Similarity computation, which leads to Retrieval of similar terms.] • Ex: query "Taipei"; documents: Taipei city government, Taipei travel guide, Wiki page on Taipei, Taipei 101, Taipei times, …

Representation • Documents • Indexed terms (or term vectors) • Unweighted or weighted • Queries • Unweighted or weighted terms • Boolean operators: or, and, not • E.g. Taiwan AND NOT Taipei • Efficiency

Data Structure • Requirement • Fast access to documents • Very large number of index terms • For each term a separate index is constructed that stores the document identifiers for all documents identified by that term • Inverted index (or inverted file)

Inverted Index • The complete file is represented as an array of indexed documents.

Inverted-file Process • The document-term array is inverted (actually transposed).

Inverted-file Process • The rows are manipulated according to the query specification (list merging).
• Ex: Query = (term 2 and term 3)
    term 2:              1 1 0 0
    term 3:              0 1 1 1
    ----------------------------
    term 2 and term 3:   0 1 0 0
• Ex: Query = ((T1 or T2) and not T3)
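A minimal Python sketch of the inversion and row manipulation described above. The T2 and T3 rows reproduce the example (1 1 0 0 and 0 1 1 1); the T1 column and the document labels D1 to D4 are assumptions added for illustration.

```python
# Document-term array: one row per document, one column per term (T1, T2, T3).
# The T1 column is made up; T2 and T3 follow the example rows.
doc_term = [
    [1, 1, 0],  # D1
    [0, 1, 1],  # D2
    [1, 0, 1],  # D3
    [0, 0, 1],  # D4
]

# Inverting (transposing) gives one row per term, one column per document.
t1, t2, t3 = (list(row) for row in zip(*doc_term))


def row_and(a, b):
    """Documents that contain both terms."""
    return [x & y for x, y in zip(a, b)]


def row_or(a, b):
    """Documents that contain at least one of the terms."""
    return [x | y for x, y in zip(a, b)]


def row_not(a):
    """Documents that do not contain the term."""
    return [1 - x for x in a]


q1 = row_and(t2, t3)                       # term 2 and term 3
q2 = row_and(row_or(t1, t2), row_not(t3))  # (T1 or T2) and not T3
```

With these rows, `q1` selects only D2 and `q2` selects only D1.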

Extensions of Inverted Index • Distance Constraints • Term Weights • Synonym Specification • Term Truncation

Distance Constraints • Nearness parameters • Within sentence: terms cooccur in a common sentence • Adjacency: terms occur adjacently in the text

Implementation • To include term-location information in the inverted index • information: {D345, D348, D350, …} • retrieval: {D123, D128, D345, …} • Cost: size of the indexes • To include sentence numbers for all term occurrences in the inverted index • information: {D345, 25; D345, 37; D348, 10; D350, 8; …} • retrieval: {D123, 5; D128, 25; D345, 37; D345, 40; …}

To include paragraph numbers, sentence numbers within paragraphs, and word numbers within sentences in the inverted index • information: {D345, 2, 3, 5} • retrieval: {D345, 2, 3, 6} • Ex: (information adjacent retrieval) • Ex: (information within five words retrieval)
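One way to evaluate the two example queries is sketched below, assuming a simplified posting format that maps each document to a sorted list of word numbers (paragraph and sentence numbers are dropped). The postings, the function name `within`, and the choice to require that the second term follows the first are all assumptions for illustration.

```python
def within(postings_a, postings_b, k):
    """Return ids of documents where term b occurs 1..k words after term a."""
    hits = []
    # Only documents that contain both terms can satisfy the constraint.
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        if any(0 < pb - pa <= k
               for pa in postings_a[doc]
               for pb in postings_b[doc]):
            hits.append(doc)
    return hits


# Hypothetical postings: document id -> word numbers where the term occurs.
information = {"D345": [12, 40], "D348": [3]}
retrieval = {"D123": [5], "D345": [13], "D348": [20]}

adjacent = within(information, retrieval, 1)    # information adjacent retrieval
near_five = within(information, retrieval, 5)   # within five words
```

Adjacency is simply the special case k = 1, which is why finer location information in the index makes both queries answerable without touching the text.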

Term Weights • Term-importance weights • Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6} • Issues • How to generate term weights? (more on this later) • How to apply term weights? • Vector queries: the sum of the weights of all document terms that match the given query • Boolean queries: (more complex)

Term Weights (for Boolean Queries) • Transforming each query into sum-of-products form (or disjunctive normal form) • The weight of each conjunct is the minimum term weight of any document term in the conjunct • The document weight is the maximum of all the conjunct weights

An Example • Q = (T1 and T2) or T3
Document vector                      Conjunct weights        Query weight
                                     (T1 and T2)    (T3)     (T1 and T2) or T3
D1 = (T1, 0.2; T2, 0.5; T3, 0.6)     0.2            0.6      0.6
D2 = (T1, 0.7; T2, 0.2; T3, 0.1)     0.2            0.1      0.2
• D1 is preferred.
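The min/max rule can be checked with a short Python sketch. The query is assumed to be already in disjunctive normal form, represented as a list of conjuncts; the function name `document_weight` and the convention that a missing term has weight 0 are assumptions.

```python
def document_weight(doc, dnf_query):
    """Weight of a document for a query in disjunctive normal form.

    Each conjunct scores the minimum weight of its terms in the document;
    the document scores the maximum over all conjuncts.
    """
    return max(min(doc.get(term, 0.0) for term in conjunct)
               for conjunct in dnf_query)


query = [("T1", "T2"), ("T3",)]           # (T1 and T2) or T3
d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

w1 = document_weight(d1, query)  # 0.6
w2 = document_weight(d2, query)  # 0.2
```

Since `w1 > w2`, D1 is ranked ahead of D2, in agreement with the example.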

Synonym Specification • (T1 and T2) or T3 is expanded to ((T1 or S1) and T2) or (T3 or S3), where S1 and S3 are synonyms of T1 and T3 • Term Truncation (or stemming) • Removing suffixes and/or prefixes • Ex: PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …
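A truncated term such as PSYCH* can be expanded against a sorted vocabulary with two binary searches. The sketch below uses Python's standard `bisect` module; the small vocabulary, the function name `expand_prefix`, and the use of "\uffff" as an upper sentinel are assumptions for illustration.

```python
import bisect

# Hypothetical vocabulary, kept in lexicographical order.
vocabulary = sorted([
    "physics", "psychiatric", "psychiatrist", "psychiatry",
    "psychological", "psychology", "public",
])


def expand_prefix(vocab, prefix):
    """Return every vocabulary word that starts with the given prefix."""
    lo = bisect.bisect_left(vocab, prefix)
    # Any word starting with the prefix sorts before prefix + a very high character.
    hi = bisect.bisect_left(vocab, prefix + "\uffff")
    return vocab[lo:hi]


matches = expand_prefix(vocabulary, "psych")
```

The expanded terms can then be combined with OR, in the same way as synonyms.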

File Structures for Indexing and Searching

Introduction • How to retrieve information? • A simple alternative is to search the whole text sequentially (online search) • Another option is to build data structures over the text (called indices) to speed up the search

Introduction • Indexing techniques • Inverted files • Suffix arrays • Signature files

Notation • n: the size of the text • m: the length of the pattern (m << n) • v: the size of the vocabulary • M: the amount of main memory available

Inverted Files • Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. • Structure of an inverted file: • Vocabulary: the set of all distinct words in the text • Occurrences: lists containing all the information necessary for each word of the vocabulary (text position, frequency, documents where the word appears, etc.)

Example • Text (the numbers mark the starting positions of the words):
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
• Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6

Space Requirements • The space required for the vocabulary is rather small. According to Heaps’ law, the vocabulary grows as $O(n^\beta)$, where $\beta$ is a constant between 0.4 and 0.6 in practice (sublinear) • On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n) • To reduce space requirements, a technique called block addressing is used
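To get a feel for the sublinear growth, the sketch below evaluates a Heaps-style estimate v = K · n^β. The constant K = 30 and the choice β = 0.5 are assumptions picked only for illustration; real values depend on the collection.

```python
def heaps_vocabulary(n, k=30.0, beta=0.5):
    """Estimated vocabulary size for a text of n words (K and beta are assumed)."""
    return k * n ** beta


small = heaps_vocabulary(10**6)   # text of one million words
large = heaps_vocabulary(10**8)   # text one hundred times larger
growth = large / small            # vocabulary grows only about tenfold
```

With β = 0.5, a hundredfold increase in text size multiplies the vocabulary by about ten, while the occurrence lists grow by the full factor of one hundred.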

Block Addressing • The text is divided into blocks • The occurrences point to the blocks where the word appears • Advantages: • there are fewer blocks than text positions, so the pointers are smaller • all the occurrences of a word inside a single block are collapsed to one reference • Disadvantages: • an online search over the qualifying blocks is needed if exact positions are required

Example • Text (divided into Block 1, Block 2, Block 3, Block 4):
That house has a garden. The garden has many flowers. The flowers are beautiful
• Inverted file:
Vocabulary    Occurrences
beautiful     4
flowers       3
garden        2
house         1
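A small Python sketch of block addressing on the example text. The block boundaries are not stated on the slide, so blocks of four words each are an assumption; the function name `build_block_index` is also hypothetical. No stopwords are removed here, so the index holds more words than the example's vocabulary.

```python
import re


def build_block_index(text, words_per_block):
    """Map each word to the sorted list of blocks (1-based) in which it appears."""
    index = {}
    for i, word in enumerate(re.findall(r"[A-Za-z]+", text)):
        block = i // words_per_block + 1
        blocks = index.setdefault(word.lower(), [])
        # Repeated occurrences inside one block collapse to a single reference.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return index


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
block_index = build_block_index(text, words_per_block=4)
```

Under this assumption both occurrences of "garden" fall in block 2 and both occurrences of "flowers" in block 3, so each word keeps a single pointer, which agrees with the occurrences listed in the example.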

Searching • The search algorithm on an inverted index follows three steps • Vocabulary search: the words present in the query are searched in the vocabulary • Retrieval of occurrences: the lists of the occurrences of all words found are retrieved • Manipulation of occurrences: the occurrences are processed to solve the query

Searching • The search on an inverted file always starts in the vocabulary (it is better to store the vocabulary in a separate file) • The structures most used to store the vocabulary are hashing, tries, or B-trees • Hashing, tries: O(m) search cost • An alternative is simply storing the words in lexicographical order (cheaper in space and very competitive, with O(log v) cost)
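The lexicographical-order alternative can be sketched with Python's standard `bisect` module. The vocabulary and occurrence lists are taken from the earlier example; storing them as two parallel lists and the function name `lookup` are assumptions.

```python
import bisect

# Vocabulary in lexicographical order, with a parallel list of occurrence lists.
vocabulary = ["beautiful", "flowers", "garden", "house"]
occurrences = [[70], [45, 58], [18, 29], [6]]


def lookup(word):
    """Binary search in the sorted vocabulary: O(log v) comparisons."""
    i = bisect.bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return occurrences[i]
    return []  # word is not in the vocabulary


garden_positions = lookup("garden")
missing = lookup("tree")
```

A sorted array needs no pointers beyond the words themselves, which is where the space saving over a trie or B-tree comes from.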

Construction • All the vocabulary is kept in a suitable data structure (e.g., a trie), storing for each word a list of its occurrences • Each word of the text is read and searched in the vocabulary • If it is not found, it is added to the vocabulary with an empty list of occurrences • Once the word is in the vocabulary, the new position is added to the end of its list of occurrences
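The construction loop can be sketched in a few lines of Python. A plain `dict` stands in for the vocabulary data structure here (this is a substitution, not a trie), positions are 1-based character offsets computed from the string as typed below, and no stopwords are removed; all of these are assumptions, so the offsets need not coincide exactly with the numbers printed in the slides' figures.

```python
import re


def build_inverted_file(text):
    """Read the text once and record, for every word, where it starts."""
    vocabulary = {}  # word -> list of starting positions
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in vocabulary:
            vocabulary[word] = []  # new word: start with an empty list
        # In both cases the new position goes to the end of the word's list.
        vocabulary[word].append(match.start() + 1)
    return vocabulary


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_inverted_file(text)
```

Because the text is scanned left to right, every occurrence list comes out sorted by position without an explicit sort.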

Example • Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
• Vocabulary trie [Figure: the root branches on the first letter of each word]
‘b’ → beautiful: 70
‘f’ → flowers: 45, 58
‘g’ → garden: 18, 29
‘h’ → house: 6

Construction • Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences. Two files are created: • in the first file, the lists of occurrences are stored contiguously (posting file) • in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included. This allows the vocabulary to be kept in memory at search time • The overall process is O(n) worst-case time • Not practical for large texts

Construction • An option is to use the previous algorithm until the main memory is exhausted. When no more memory is available, the partial index Ii obtained up to now is written to disk and erased from main memory before continuing with the rest of the text • Once the text is exhausted, a number of partial indices Ii exist on disk • The partial indices are merged to obtain the final index

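Example [Figure: binary merging of partial indices; the merge steps are numbered 1 to 7]
initial dumps: I 1, I 2, I 3, I 4, I 5, I 6, I 7, I 8
level 1: I 1...2 (step 1), I 3...4 (step 2), I 5...6 (step 4), I 7...8 (step 5)
level 2: I 1...4 (step 3), I 5...8 (step 6)
level 3: I 1...8 (step 7), the final index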

Construction • The total time to generate partial indices is O(n) • The number of partial indices is O(n/M) • To merge the O(n/M) partial indices, $\log_2(n/M)$ merging levels are necessary • The total cost of this algorithm is O(n log(n/M))
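The partial-index scheme can be sketched as follows. Memory is simulated by a cap on the number of postings held at once (`max_postings`), partial indices are kept in a Python list instead of on disk, and positions are word numbers; these simplifications and all function names are assumptions.

```python
from collections import defaultdict


def build_partial_indices(words, max_postings):
    """Index the words in order, dumping a partial index whenever memory is full."""
    partials, current, used = [], defaultdict(list), 0
    for position, word in enumerate(words, start=1):
        current[word].append(position)
        used += 1
        if used == max_postings:  # simulated memory exhausted: dump and reset
            partials.append(dict(current))
            current, used = defaultdict(list), 0
    if used:
        partials.append(dict(current))
    return partials


def merge_two(a, b):
    """Merge two partial indices; b covers later text, so lists stay sorted."""
    merged = {word: list(postings) for word, postings in a.items()}
    for word, postings in b.items():
        merged.setdefault(word, []).extend(postings)
    return merged


def merge_all(partials):
    """Merge adjacent pairs level by level; return the final index and level count."""
    levels = 0
    while len(partials) > 1:
        partials = [merge_two(partials[i], partials[i + 1])
                    if i + 1 < len(partials) else partials[i]
                    for i in range(0, len(partials), 2)]
        levels += 1
    return (partials[0] if partials else {}), levels


words = ("that house has a garden the garden has many flowers "
         "the flowers are beautiful").split()
partials = build_partial_indices(words, max_postings=2)
final_index, levels = merge_all(partials)
```

With 14 words and room for 2 postings, 7 partial indices are dumped and merged in 3 levels (7, then 4, then 2, then 1 index).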

Summary on Inverted File • The inverted file is probably the most adequate indexing technique for database text • The indices are appropriate when the text collection is large and semi-static • Otherwise, if the text collection is volatile, online searching is the only option • Some techniques combine online and indexed searching

Suffix Trees and Suffix Arrays • Each position in the text is considered as a text suffix • Index points are selected from the text; they point to the beginnings of the text positions that will be retrievable • The problem with suffix trees is their space overhead

Example • Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
• Suffixes (one per index point):
6: house has a garden. The garden has many flowers. The flowers are beautiful
18: garden. The garden has many flowers. The flowers are beautiful
29: garden has many flowers. The flowers are beautiful
45: flowers. The flowers are beautiful
58: flowers are beautiful
70: beautiful

Example • Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
• Suffix Trie [Figure: trie over the indexed suffixes; edges are labeled with single characters and the leaves hold the suffix positions 6, 18, 29, 45, 58, 70]

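Example • Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
• Suffix Tree [Figure: compacted form of the suffix trie; internal nodes are labeled 1, 7, and 8, edges are labeled with single characters, and the leaves hold the suffix positions 6, 18, 29, 45, 58, 70]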

Suffix Arrays • An array containing all the pointers to the text suffixes listed in lexicographical order • The space requirements are almost the same as those for inverted indices • The main drawback of suffix arrays is their costly construction process • They allow binary searches done by comparing the contents of each pointer • Supra-indices (for large suffix arrays) • The space requirements of a suffix array with a vocabulary supra-index are exactly the same as for inverted indices
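A naive Python sketch of a suffix array over word-start index points, with a binary search for a pattern. It is meant only to show the structure: the construction sorts full suffixes (far from the efficient algorithms used in practice), the example text is lowercased to keep the ordering simple, offsets are 0-based, and the function names are hypothetical.

```python
import re

text = ("that house has a garden. the garden has many flowers. "
        "the flowers are beautiful")


def build_suffix_array(text):
    """Sort the word-start offsets by the suffix that begins at each one."""
    starts = [m.start() for m in re.finditer(r"[a-z]+", text)]
    return sorted(starts, key=lambda i: text[i:])


def search(text, suffix_array, pattern):
    """Return the 0-based offsets of the indexed suffixes that start with pattern."""
    def prefix(k):
        start = suffix_array[k]
        return text[start:start + len(pattern)]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # find the first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # find the first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


suffix_array = build_suffix_array(text)
garden_hits = search(text, suffix_array, "garden")
```

Because all suffixes sharing a prefix are contiguous in the array, the same search answers prefix queries such as "ha" as well as whole words.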

Example • Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
• Suffix Array • Supra-Index (l=4, b=2)

Example • Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
• Vocabulary Supra-Index • Suffix Array • Inverted List

Construction of Suffix Arrays for Large Texts [Figure: boxes labeled Small text, Small suffix array, Long text, Long suffix array, Counters, and Final suffix array, connected by arrows numbered 1 to 3]

Signature Files • Characteristics • Word-oriented index structures based on hashing • Low overhead (10%~20% over the text size) at the cost of forcing a sequential search over the index • Suitable for not very large texts • Inverted files outperform signature files for most applications

Construction and Search • Word-oriented index structures based on hashing • Each word is mapped to a bit mask (signature) of B bits • The text is divided into blocks of b words each • The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block • Search • Hash the query word to a bit mask W • If W & Bi = W, where Bi is the mask of block i, the text block may contain the word • For all candidate blocks, an online traversal must be performed to verify whether the word is actually there

Example • Four blocks, with their signatures:
Block 1            Block 2            Block 3            Block 4
This is a text.    A text has many    words. Words are   made from letters.
000101             110101             100100             101101
• Hash(text) = 000101 • Hash(many) = 110000 • Hash(words) = 100100 • Hash(made) = 001100 • Hash(letters) = 100001
• Block 4: 001100 OR 100001 = 101101
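The example can be reproduced with a short Python sketch. The hash values are copied from the slide as a lookup table, since the actual hash function is not given; words outside the table are treated as stopwords with an empty mask. The block boundaries are an assumption, chosen so that the computed signatures reproduce the four masks shown.

```python
import re

# Signatures from the example; any other word is treated as a stopword (mask 0).
HASH = {
    "text": 0b000101, "many": 0b110000, "words": 0b100100,
    "made": 0b001100, "letters": 0b100001,
}

# Assumed block boundaries.
blocks = ["This is a text.", "A text has many",
          "words. Words are", "made from letters."]


def tokens(block):
    return re.findall(r"[a-z]+", block.lower())


def signature(block):
    """Bitwise OR of the signatures of all the words in the block."""
    mask = 0
    for word in tokens(block):
        mask |= HASH.get(word, 0)
    return mask


signatures = [signature(b) for b in blocks]


def search(word):
    """Return (candidate blocks, confirmed blocks) as 0-based block numbers."""
    w = HASH[word]
    candidates = [i for i, s in enumerate(signatures) if (s & w) == w]
    # Candidates must be verified by scanning the block text itself.
    confirmed = [i for i in candidates if word in tokens(blocks[i])]
    return candidates, confirmed


text_candidates, text_confirmed = search("text")
```

For the query "text", the last block is reported as a candidate although it does not contain the word: its mask 101101 happens to cover 000101. This is a false drop, which the verification step filters out.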

False Drop • Assume that l bits are randomly set in the mask of each word • Let $a = l/B$ • For b words, the probability that a given bit of the mask is set is $1 - (1 - 1/B)^{bl} \approx 1 - e^{-ba}$ • Hence, the probability that the l random bits are also set (a false alarm) is $F_d = (1 - e^{-ba})^{aB}$ • $F_d$ is minimized for $a = \ln(2)/b$ • At the minimum, $F_d = 2^{-l}$ with $l = B \ln 2 / b$
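The formulas above can be checked numerically. The sketch below evaluates the approximate false-drop probability and the optimal number of bits per word; the parameter values B = 64 and b = 4 are assumptions chosen only for illustration.

```python
import math


def false_drop(B, b, l):
    """Approximate false-drop probability F_d = (1 - e^{-ba})^{aB}, with a = l/B."""
    a = l / B
    return (1 - math.exp(-b * a)) ** (a * B)


def optimal_l(B, b):
    """Number of bits per word that minimizes F_d: l = B ln 2 / b."""
    return B * math.log(2) / b


B, b = 64, 4                      # assumed mask size and words per block
best_l = optimal_l(B, b)          # about 11 bits
best_fd = false_drop(B, b, best_l)
```

At the optimum, half of the bits in a block mask are expected to be set, so each of the l query bits matches by chance with probability 1/2, which gives the value 2^{-l}.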

Comparisons • Signature files • Use hashing techniques to produce an index • advantage • storage overhead is small (10%-20%) • disadvantages • the search time on the index is linear • some answers may not match the query, thus filtering must be done
