
Modern Information Retrieval

Modern Information Retrieval, Chapter 8: Indexing and Searching. Building and maintaining an index is worthwhile when the text collection is large and semi-static, that is, not often updated. The costs to consider are search cost, space overhead, construction cost, and maintenance cost.

deepak

Presentation Transcript


  1. Modern Information Retrieval Chapter 8 Indexing and Searching

  2. It is worthwhile building and maintaining an index when the text collection is large and semi-static • semi-static: not often updated • consider search cost, space overhead, construction cost, and maintenance cost

  3. Inverted file • a word-oriented index • vocabulary: the set of all different words in the text • occurrences: lists of the text positions where the words appear • the positions can refer to words or characters
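The vocabulary and occurrence lists described in slide 3 can be illustrated with a minimal sketch. The function name `build_inverted_index`, the `by_char` flag, and the tokenization rule (lowercased runs of word characters) are assumptions made for this example, not something the slides specify.

```python
import re
from collections import defaultdict

def build_inverted_index(text, by_char=False):
    """Map each distinct word to the sorted list of positions where it occurs.

    Positions are word numbers by default, or character offsets if by_char is True.
    Tokenization (lowercased \\w+ runs) is an assumption of this sketch.
    """
    index = defaultdict(list)
    for word_no, match in enumerate(re.finditer(r"\w+", text.lower())):
        position = match.start() if by_char else word_no
        index[match.group()].append(position)
    return dict(index)

index = build_inverted_index("to be or not to be")
vocabulary = sorted(index)          # the set of all different words
char_index = build_inverted_index("to be or not to be", by_char=True)
```

Here `vocabulary` is the sorted list of distinct words, and `index[word]` is the occurrence list for that word, addressed either by word number or by character offset.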

  4. the space required for the vocabulary is rather small while the occurrences demand much more space • between 30% and 40% of the text size • block addressing reduces space overhead to 5%

  5. if the exact occurrence positions are required, an online search over the qualifying blocks has to be performed
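Block addressing, and the online search it requires for exact positions, might look like the following sketch. It assumes fixed-size blocks measured in words; the names `build_block_index` and `exact_positions` are hypothetical.

```python
import re
from collections import defaultdict

def build_block_index(text, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    words = re.findall(r"\w+", text.lower())
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    index = defaultdict(list)
    for block_no, block in enumerate(blocks):
        for word in set(block):          # one entry per block, however often the word repeats
            index[word].append(block_no)
    return blocks, dict(index)

def exact_positions(word, blocks, index, block_size):
    """Scan only the qualifying blocks to recover exact word positions."""
    positions = []
    for block_no in index.get(word, []):
        for offset, candidate in enumerate(blocks[block_no]):
            if candidate == word:
                positions.append(block_no * block_size + offset)
    return positions

blocks, block_index = build_block_index("to be or not to be that is the question", 4)
be_positions = exact_positions("be", blocks, block_index, 4)
```

The index stores one block number per word per block, which is where the space saving comes from; the price is the sequential scan inside `exact_positions`.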

  6. searching the inverted file • vocabulary search: the words present in the query are separately searched in the vocabulary • retrieval of occurrences: the lists of the occurrences of all the words found are retrieved

  7. manipulation of occurrences: the lists are traversed in synchronization to find places where all the words appear in sequence for a phrase query or appear close enough for a proximity query • how to efficiently manipulate the occurrences when block addressing is used?
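For a word-addressed index, the synchronized traversal of occurrence lists can be sketched as below. This is one possible formulation, not the one from the chapter: `next_to` advances two cursors over sorted lists to find adjacent words, `phrase_positions` chains it across a phrase, and `near` handles a two-word proximity query within `k` words. All names and the sample index are assumptions.

```python
def next_to(left, right):
    """Positions q in `right` such that q - 1 is in `left`; both lists are sorted."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] + 1 == right[j]:
            out.append(right[j])
            i += 1
            j += 1
        elif left[i] + 1 < right[j]:
            i += 1
        else:
            j += 1
    return out

def phrase_positions(index, phrase):
    """Start positions at which the words of `phrase` appear in sequence."""
    words = phrase.lower().split()
    if not words:
        return []
    ends = index.get(words[0], [])
    for word in words[1:]:
        ends = next_to(ends, index.get(word, []))
    return [end - (len(words) - 1) for end in ends]

def near(left, right, k):
    """Pairs (p, q) with p in `left`, q in `right` and |p - q| <= k."""
    out, start = [], 0
    for p in left:
        while start < len(right) and right[start] < p - k:
            start += 1                   # never moves backwards, since left is sorted
        j = start
        while j < len(right) and right[j] <= p + k:
            out.append((p, right[j]))
            j += 1
    return out

# Word positions for the text "to be or not to be"
index = {"to": [0, 4], "be": [1, 5], "or": [2], "not": [3]}
```

With block addressing the lists hold block numbers rather than positions, so the same merge can only narrow the search down to candidate blocks, which then have to be scanned.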

  8. constructing the inverted file

  9. once constructed, it is written to disk in two files • the lists of occurrences are stored contiguously in the first file • in the second file, the vocabulary is stored in lexicographical order with a pointer for each word to its list in the first file
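The two-file layout can be sketched as follows. The on-disk format here is purely an assumption for illustration (little-endian 4-byte integers for occurrences, and one tab-separated `word, byte offset, list length` line per vocabulary entry); the slides do not specify one. In-memory `io.BytesIO` objects stand in for the two files.

```python
import io
import struct

def write_index(index, occ_file, vocab_file):
    """Write occurrence lists contiguously to occ_file, and the vocabulary in
    lexicographical order, with a pointer to each list, to vocab_file."""
    for word in sorted(index):
        positions = index[word]
        offset = occ_file.tell()         # pointer into the occurrences file
        occ_file.write(struct.pack(f"<{len(positions)}I", *positions))
        vocab_file.write(f"{word}\t{offset}\t{len(positions)}\n".encode("utf-8"))

def read_occurrences(word, occ_file, vocab_file):
    """Look the word up in the vocabulary file, then follow its pointer."""
    vocab_file.seek(0)
    for line in vocab_file:              # linear scan for brevity only
        entry, offset, count = line.decode("utf-8").rstrip("\n").split("\t")
        if entry == word:
            occ_file.seek(int(offset))
            data = occ_file.read(4 * int(count))
            return list(struct.unpack(f"<{count}I", data))
    return []

index = {"to": [0, 4], "be": [1, 5], "or": [2], "not": [3]}
occ_file, vocab_file = io.BytesIO(), io.BytesIO()
write_index(index, occ_file, vocab_file)
vocab_lines = vocab_file.getvalue().decode("utf-8").splitlines()
```

Because the vocabulary is stored in sorted order, the linear scan in `read_occurrences` could be replaced by a binary search, or the (small) vocabulary could simply be kept in memory.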

  10. Suffix tree and suffix array • can be used to index any text character • allow more complex queries to be answered efficiently • index points are selected from the text; they mark the beginnings of the text positions that will be retrievable • each position is considered as a text suffix • each suffix is uniquely identified by its position

  11. a suffix tree is a trie data structure built over all the suffixes of the text • the pointers to the suffixes are stored at the leaf nodes • the trie is compacted into a Patricia tree where unary paths are compressed • an indication of the next character position to consider is stored at the nodes which root a compressed path

  12. space overhead: 120% to 240% over the text size

  13. suffix arrays provide the same functionality with much lower space requirements • an array containing all the pointers to the suffixes in lexicographical order • space requirements close to 40% overhead

  14. allow binary searches, performed by comparing the contents of the suffix each pointer refers to • a supra-index over the suffix array is used to reduce the number of disk accesses • compare with an inverted file
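A suffix array and its binary search can be sketched in a few lines. This example makes two simplifying assumptions: every character is an index point (the slides allow selecting only some, such as word beginnings), and construction is done by naively sorting the suffixes, which is far too slow for large texts. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Pointers to all suffixes, sorted by the suffix text (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary search for the range of suffixes that start with `pattern`."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                       # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                       # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
```

Each comparison reads the text at the position a pointer refers to, which is why a supra-index that samples the array helps when the text lives on disk.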

  15. processing phrase queries by searching the first words of the phrases • processing proximity queries by searching all the words in the queries • post-processing needed

  16. Signature files • use a hash function to map words to bit masks of B bits • the text is divided into blocks of b words each • a bit mask of size B is assigned to each block by bitwise ORing the signatures of all the words in the block

  17. if a word is present in a block, all the bits set in its signature are also set in the bit mask of the block • when a bit is set in the mask of the query word but not in the mask of the block, the word is not present in the block

  18. false drop: all the corresponding bits are set while the word is not in the block • signature file design principle: make the probability of a false drop low while keeping the signature file as short as possible • searching a single word by hashing it to a bit mask W, checking whether W & B_i = W for the bit mask B_i of each block (that is, every bit set in W is also set in B_i), and verifying if the word is actually there
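The signature scheme of slides 16 to 18 can be sketched as follows. The hash construction (SHA-256 with a seed prefix, setting up to `bits_per_word` bits) is an assumption chosen only to make the example deterministic; the slides do not name a hash function. `num_bits` plays the role of B and `words_per_block` the role of b.

```python
import hashlib
import re

def word_signature(word, num_bits=64, bits_per_word=3):
    """Hash a word to a num_bits-wide mask with up to bits_per_word bits set."""
    mask = 0
    for seed in range(bits_per_word):
        digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % num_bits)
    return mask

def build_signature_file(text, words_per_block, num_bits=64):
    """Split the text into blocks and OR the word signatures of each block."""
    words = re.findall(r"\w+", text.lower())
    blocks = [words[i:i + words_per_block]
              for i in range(0, len(words), words_per_block)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= word_signature(word, num_bits)
        masks.append(mask)
    return blocks, masks

def search_word(word, blocks, masks, num_bits=64):
    """Return the blocks that really contain the word."""
    w = word_signature(word.lower(), num_bits)
    candidates = [i for i, b in enumerate(masks) if (w & b) == w]
    # Verification step: discard false drops by scanning the candidate blocks.
    return [i for i in candidates if word.lower() in blocks[i]]

blocks, masks = build_signature_file("to be or not to be that is the question", 4)
```

The mask test can only rule blocks out; a block that passes it may still be a false drop, which is why `search_word` checks the block text before reporting a match.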

  19. process a phrase search by bitwise ORing the signatures of all the words in the query • the probability of false drops is reduced • care has to be exercised at block boundaries by overlapping words in consecutive blocks
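Phrase search over overlapping blocks might be sketched as below. The overlap rule used here is an assumption derived from a simple argument, not taken from the slides: if consecutive blocks share j - 1 words, any phrase of up to j consecutive words lies entirely inside at least one block. The hash construction and all names are likewise hypothetical.

```python
import hashlib

def signature(word, num_bits=64, bits_per_word=3):
    """Hash a word to a num_bits-wide mask (hash choice is an assumption)."""
    mask = 0
    for seed in range(bits_per_word):
        digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % num_bits)
    return mask

def overlapping_blocks(words, block_size, overlap):
    """Blocks of block_size words; consecutive blocks share `overlap` words."""
    step = block_size - overlap
    return [words[i:i + block_size] for i in range(0, len(words), step)]

def block_mask(block):
    """OR together the signatures of all words in the block."""
    mask = 0
    for word in block:
        mask |= signature(word)
    return mask

def phrase_search(phrase, blocks, masks):
    """Blocks whose mask covers the ORed query signature and that, on
    verification, really contain the phrase as consecutive words."""
    query = phrase.lower().split()
    q = 0
    for word in query:
        q |= signature(word)
    hits = []
    for i, (block, mask) in enumerate(zip(blocks, masks)):
        if (q & mask) != q:
            continue                     # some query bit is missing: phrase cannot be here
        n = len(query)
        if any(block[k:k + n] == query for k in range(len(block) - n + 1)):
            hits.append(i)
    return hits

words = "to be or not to be that is the question".split()
blocks = overlapping_blocks(words, block_size=4, overlap=1)   # supports 2-word phrases
masks = [block_mask(block) for block in blocks]
```

In this sample, "not to" would straddle the boundary between the first two blocks if they did not overlap; with one shared word it falls inside the second block.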
