Modern Information Retrieval

Modern Information Retrieval Chapter 8 Indexing and Searching

It is worthwhile building and maintaining an index when the text collection is large and semi-static • semi-static: not often updated • consider search cost, space overhead, construction cost, and maintenance cost

Inverted file • a word-oriented index • vocabulary: the set of all different words in the text • occurrences: lists of the text positions where the words appear • the positions can refer to words or characters
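
As a rough illustration of the vocabulary and occurrence lists described above, the following Python sketch builds a word-addressed inverted index in memory. The function name `build_inverted_index`, the regular-expression tokenizer, and the lowercasing step are assumptions made for this example; they are not prescribed by the chapter.

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each distinct word to the sorted list of its word positions."""
    index = defaultdict(list)
    # Assumption: a "word" is a maximal run of \w characters, case-folded.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)


text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)

vocabulary = sorted(index)   # the set of all different words
print(vocabulary)
print(index["text"])         # word positions of "text": [3, 5]
print(index["words"])        # word positions of "words": [8, 9]
```

Storing character offsets instead of word positions only requires appending `match.start()` rather than `position`.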

the space required for the vocabulary is rather small while the occurrences demand much more space • between 30% and 40% of the text size • block addressing reduces space overhead to 5%

if the exact occurrence positions are required, an online search over the qualifying blocks has to be performed
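
A minimal sketch of block addressing, assuming fixed-size blocks of `block_size` words (the slides do not fix a block size, and all names here are hypothetical). The index stores only block numbers; exact positions are recovered by scanning the qualifying blocks.

```python
import re
from collections import defaultdict


def build_block_index(text, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    words = re.findall(r"\w+", text.lower())
    blocks_of = defaultdict(set)
    for position, word in enumerate(words):
        blocks_of[word].add(position // block_size)
    return words, {w: sorted(b) for w, b in blocks_of.items()}


def exact_positions(word, words, block_index, block_size):
    """Scan only the qualifying blocks to recover exact word positions."""
    hits = []
    for block in block_index.get(word, []):
        start = block * block_size
        for offset, candidate in enumerate(words[start:start + block_size]):
            if candidate == word:
                hits.append(start + offset)
    return hits


text = "This is a text. A text has many words. Words are made from letters."
words, block_index = build_block_index(text, block_size=4)
print(block_index["text"])                               # [0, 1]
print(exact_positions("text", words, block_index, 4))    # [3, 5]
```

The pointer lists are shorter because repeated occurrences inside one block collapse to a single entry, at the price of the extra scan.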

searching the inverted file • vocabulary search: the words present in the query are separately searched in the vocabulary • retrieval of occurrences: the lists of the occurrences of all the words found are retrieved

manipulation of occurrences: the lists are traversed in synchronization to find places where all the words appear in sequence for a phrase query or appear close enough for a proximity query • how to efficiently manipulate the occurrences when block addressing is used?
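
The synchronized traversal can be sketched as follows for a word-addressed index. This is one possible approach, not necessarily the one the chapter has in mind; `phrase_positions` and `proximity_pairs` are hypothetical names, and the small `index` dictionary is made-up sample data.

```python
def phrase_positions(index, words):
    """Start positions where `words` occur at consecutive word positions."""
    lists = [index.get(w, []) for w in words]
    if not lists or not all(lists):
        return []
    cursors = [0] * len(lists)
    hits = []
    while cursors[0] < len(lists[0]):
        start = lists[0][cursors[0]]
        matched = True
        for k in range(1, len(lists)):
            target = start + k
            # Cursors only move forward, so each list is read at most once.
            while cursors[k] < len(lists[k]) and lists[k][cursors[k]] < target:
                cursors[k] += 1
            if cursors[k] == len(lists[k]):
                return hits          # list k is exhausted: no more matches
            if lists[k][cursors[k]] != target:
                matched = False
                break
        if matched:
            hits.append(start)
        cursors[0] += 1
    return hits


def proximity_pairs(index, first, second, max_gap):
    """Pairs (p, q) of positions of the two words with |p - q| <= max_gap."""
    a, b = index.get(first, []), index.get(second, [])
    pairs, low = [], 0
    for p in a:
        while low < len(b) and b[low] < p - max_gap:
            low += 1
        j = low
        while j < len(b) and b[j] <= p + max_gap:
            pairs.append((p, b[j]))
            j += 1
    return pairs


index = {"a": [2, 4], "text": [3, 5], "has": [6], "many": [7], "words": [8, 9]}
print(phrase_positions(index, ["a", "text"]))        # [2, 4]
print(proximity_pairs(index, "text", "words", 3))    # [(5, 8)]
```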

constructing the inverted file

once constructed, it is written to disk in two files • the lists of occurrences are stored contiguously in the first file • in the second file, the vocabulary is stored in lexicographical order with a pointer for each word to its list in the first file
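
The two-file layout might look like the sketch below. To keep the example self-contained, an in-memory byte string stands in for the occurrences file and a Python list stands in for the vocabulary file; the 32-bit little-endian encoding and the `(word, offset, count)` record are assumptions for illustration only.

```python
import bisect
import io
import struct


def write_index(index):
    """Return (occurrences bytes, vocabulary) for an in-memory index."""
    occurrences = io.BytesIO()   # stands in for the first file
    vocabulary = []              # stands in for the second file
    for word in sorted(index):   # lexicographical order
        positions = index[word]
        # The "pointer" is the byte offset of the word's list in the first file.
        vocabulary.append((word, occurrences.tell(), len(positions)))
        occurrences.write(struct.pack(f"<{len(positions)}I", *positions))
    return occurrences.getvalue(), vocabulary


def read_occurrences(occurrences, vocabulary, word):
    """Binary-search the sorted vocabulary, then follow the pointer."""
    words = [entry[0] for entry in vocabulary]
    i = bisect.bisect_left(words, word)
    if i == len(words) or words[i] != word:
        return []
    _, offset, count = vocabulary[i]
    return list(struct.unpack_from(f"<{count}I", occurrences, offset))


occurrences, vocabulary = write_index({"text": [3, 5], "a": [2, 4], "has": [6]})
print(vocabulary)
print(read_occurrences(occurrences, vocabulary, "text"))   # [3, 5]
```

Keeping the lists contiguous means each query word costs one vocabulary lookup plus one sequential read.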

Suffix tree and suffix array • can be used to index any text character • allow more complex queries to be answered efficiently • index points are selected from the text; they mark the beginnings of the text positions that will be retrievable • each position is considered as a text suffix • each suffix is uniquely identified by its position

a suffix tree is a trie data structure built over all the suffixes of the text • the pointers to the suffixes are stored at the leaf nodes • the trie is compacted into a Patricia tree where unary paths are compressed • an indication of the next character position to consider is stored at the nodes which root a compressed path

space overhead: 120% to 240% over the text size

suffix arrays provide the same functionality with much lower space requirements • an array containing all the pointers to the suffixes in lexicographical order • space requirements close to 40% overhead

allow binary search, done by comparing the text that each pointer refers to • a supra-index over the suffix array is used to reduce the number of disk accesses • compare with an inverted file
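
A small sketch of a suffix array with binary search, assuming index points are placed at word beginnings and that the whole text fits in memory. Sorting with `sorted` and a slicing key is a simple but inefficient construction chosen for brevity; the supra-index is not modelled. The names `build_suffix_array` and `search_prefix` are hypothetical.

```python
import re


def build_suffix_array(text):
    """Index points at word beginnings, sorted by the suffix each one starts."""
    index_points = [m.start() for m in re.finditer(r"\w+", text)]
    return sorted(index_points, key=lambda p: text[p:])


def search_prefix(text, suffix_array, pattern):
    """Text positions of all indexed suffixes that start with `pattern`."""
    m = len(pattern)

    def prefix(i):
        start = suffix_array[i]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "this is a text. a text has many words. words are made from letters."
suffix_array = build_suffix_array(text)
print(search_prefix(text, suffix_array, "text"))     # [10, 18]
print(search_prefix(text, suffix_array, "a text"))   # phrase query: [8, 16]
print(search_prefix(text, suffix_array, "ma"))       # prefix query: [27, 49]
```

Because a suffix extends to the end of the text, a phrase is found with the same two binary searches as a single word.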

processing phrase queries by searching the first words of the phrases • processing proximity queries by searching all the words in the queries • post-processing needed

Signature files • use a hash function to map words to bit masks of B bits • the text is divided into blocks of b words each • a bit mask of size B is assigned to each block by bitwise ORing the signatures of all the words in the block

if a word is present in a block, all the bits set in its signature are also set in the bit mask of the block • when a bit is set in the mask of the query word but not in the mask of the block, the word is not present in the block

false drop: all the corresponding bits are set while the word is not in the block • signature file design principle: make the probability of a false drop low while keeping the signature file as short as possible • searching a single word by hashing it to a bit mask W, checking whether W & Bi = W for the bit mask Bi of each block (& is bitwise AND), and verifying that the word is actually present in each candidate block
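
The single-word search can be sketched as below. The parameters `MASK_BITS` (B), `BLOCK_WORDS` (b) and `BITS_PER_WORD`, as well as the use of SHA-256 as the hash function, are assumptions chosen for a deterministic toy example; they are not values taken from the chapter.

```python
import hashlib

MASK_BITS = 64      # B: bits per signature (assumed)
BLOCK_WORDS = 4     # b: words per block (assumed)
BITS_PER_WORD = 3   # up to this many bits are set per word (assumed)


def word_signature(word):
    """Hash a word to a bit mask of MASK_BITS bits."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % MASK_BITS)
    return mask


def build_signature_file(words):
    """Split the text into blocks and OR the word signatures of each block."""
    blocks = [words[i:i + BLOCK_WORDS] for i in range(0, len(words), BLOCK_WORDS)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= word_signature(word)
        masks.append(mask)
    return blocks, masks


def search_word(word, blocks, masks):
    """Filter blocks by signature, then verify to discard false drops."""
    w = word_signature(word)
    candidates = [i for i, mask in enumerate(masks) if mask & w == w]
    return [i for i in candidates if word in blocks[i]]


words = "this is a text a text has many words words are made from letters".split()
blocks, masks = build_signature_file(words)
print(search_word("text", blocks, masks))    # [0, 1]
print(search_word("zebra", blocks, masks))   # []
```

The verification step is what makes the result exact: the signature test can only rule blocks out, never confirm them.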

a phrase search is processed by bitwise ORing the signatures of all the words in the query • the probability of false drops is reduced • care has to be exercised at block boundaries by overlapping words in consecutive blocks
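
A hedged sketch of phrase search over overlapping blocks. It assumes consecutive blocks share `overlap` words, so any phrase of up to `overlap + 1` words lies entirely inside at least one block; the hash function and all parameter values are illustrative assumptions, and the names are hypothetical.

```python
import hashlib

MASK_BITS = 64      # B: bits per signature (assumed)
BITS_PER_WORD = 3   # up to this many bits are set per word (assumed)


def word_signature(word):
    """Hash a word to a bit mask of MASK_BITS bits."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % MASK_BITS)
    return mask


def mask_of(words):
    """Bitwise OR of the signatures of all the given words."""
    mask = 0
    for word in words:
        mask |= word_signature(word)
    return mask


def build_blocks(words, block_words, overlap):
    """Blocks of `block_words` words; consecutive blocks share `overlap` words."""
    step = block_words - overlap   # assumes block_words > overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [(s, words[s:s + block_words], mask_of(words[s:s + block_words]))
            for s in starts]


def search_phrase(phrase, blocks):
    """Word positions where `phrase` (a list of words) starts."""
    query = mask_of(phrase)        # more bits must match than for one word
    hits = set()                   # overlapping blocks may report a hit twice
    for start, block, mask in blocks:
        if mask & query != query:
            continue
        # Verification: look for the words in sequence inside the block.
        for i in range(len(block) - len(phrase) + 1):
            if block[i:i + len(phrase)] == phrase:
                hits.add(start + i)
    return sorted(hits)


words = "this is a text a text has many words words are made from letters".split()
blocks = build_blocks(words, block_words=4, overlap=1)
print(search_phrase(["a", "text"], blocks))       # [2, 4]
print(search_phrase(["many", "words"], blocks))   # [7]
```

Without the overlap, a phrase whose words straddle two blocks would pass neither block's signature test as a whole and would be missed.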
