
Presentation Transcript


  1. Special Topics in Computer Science. Advanced Topics in Information Retrieval. Lecture 4 (book chapter 8): Indexing and Searching. Alexander Gelbukh, www.Gelbukh.com

  2. Previous Chapter: Conclusions • Main measures: Precision & Recall • Defined for sets; rankings are evaluated through initial subsets • There are measures that combine them into one • Involve user-defined preferences • Many (other) characteristics • An algorithm can be good at some and bad at others • Averages are used, but are not always meaningful • Reference collections with known answers exist to evaluate new algorithms

  3. Previous Chapter: Research topics • Different types of interfaces • Interactive systems: what measures to use? • For example, informativeness

  4. Types of searching • Indexed • Semi-static text • Space overhead • Sequential • Small texts • Volatile text, or limited space • Combined • Index large portions, then search sequentially inside a portion • Best combination of speed / overhead

  5. Inverted files • Vocabulary: O(sqrt(n)), Heaps’ law; 1 GB of text → ~5 MB vocabulary • Occurrences: ~40% of n (with stopwords removed) • Addressing granularity: positions (word, char), files, sections...
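A minimal sketch of the idea in Python (naive whitespace tokenization, no stopword removal, illustrative names only): the vocabulary is a dictionary, and each entry points to its list of occurrences as (document, word position) pairs.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Word-level inverted file: term -> list of (doc_id, word_position).

    Illustration only: naive whitespace tokenization and no stopword removal,
    so the occurrence lists are larger than the ~40% of n quoted above.
    """
    index = defaultdict(list)              # vocabulary with occurrence lists
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

docs = ["to be or not to be", "to index is to search"]
idx = build_inverted_file(docs)
print(idx["to"])    # [(0, 0), (0, 4), (1, 0), (1, 3)]
```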

  6. Compression: Block addressing • Block addressing: ~5% space overhead • 256, 64K, ... blocks → block pointers of 1, 2, ... bytes • Blocks of equal size (faster search) or logical sections (retrieval units)
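The same index with block addressing, as a rough sketch (blocks of a fixed number of words here; a real system could use byte-sized blocks or logical sections): the index stores only block numbers, and exact positions are recovered by scanning the candidate blocks sequentially.

```python
def build_block_index(text, words_per_block=4):
    """Block addressing: store only the block numbers where a word occurs.

    The index is much smaller, but a candidate block must then be scanned
    sequentially to recover exact positions.
    """
    words = text.lower().split()
    index = {}
    for i, word in enumerate(words):
        index.setdefault(word, set()).add(i // words_per_block)
    return index

def search_block_index(index, text, word, words_per_block=4):
    """Return exact word positions by scanning only the candidate blocks."""
    words = text.lower().split()
    hits = []
    for block in sorted(index.get(word, ())):
        start = block * words_per_block
        for pos in range(start, min(start + words_per_block, len(words))):
            if words[pos] == word:
                hits.append(pos)
    return hits

text = "to be or not to be indexed now"
idx = build_block_index(text)
print(idx["to"])                              # {0, 1}  (block numbers only)
print(search_block_index(idx, text, "to"))    # [0, 4]  (exact word positions)
```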

  7. Searching in inverted files • Vocabulary search • Separate file • Many searching techniques • Lexicographic: log V (voc. size) = ½ log n (Heaps’ law) • Hashing is not good for prefix search • Retrieval of occurrences • Manipulating occurrences: ~sqrt(n) (Heaps, Zipf) • Boolean operations. Context search • Merging occurrence lists • For AND: one list is usually shorter (Zipf’s law) → sublinear! • Only inverted files allow both sublinear space & time • Suffix trees and signature files don’t

  8. Building inverted file: 1 • Infinite memory? Use a trie to store the vocabulary: O(n) • Append positions to each word’s occurrence list • Finite memory? Build in chunks, then merge: almost O(n) • Insertion: index the new text + merge. Deletion: O(n). Very fast.
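A sketch of the finite-memory strategy (in-memory partial indexes per chunk; a real implementation would write each partial index to disk before merging, which is elided here):

```python
from collections import defaultdict

def index_chunk(chunk, first_doc_id):
    """Index one chunk of documents that fits in memory."""
    partial = defaultdict(list)
    for offset, text in enumerate(chunk):
        for pos, word in enumerate(text.lower().split()):
            partial[word].append((first_doc_id + offset, pos))
    return partial

def merge_partial_indexes(partials):
    """Merge partial indexes; lists are already in doc-id order because
    chunks are processed in order, so concatenation suffices."""
    merged = defaultdict(list)
    for partial in partials:
        for word, occurrences in partial.items():
            merged[word].extend(occurrences)
    return merged

def build_in_chunks(docs, chunk_size=1000):
    partials = []
    for start in range(0, len(docs), chunk_size):
        # a real system would flush each partial index to disk here
        partials.append(index_chunk(docs[start:start + chunk_size], start))
    return merge_partial_indexes(partials)

docs = ["inverted files", "built in chunks", "then merged"]
print(build_in_chunks(docs, chunk_size=2)["in"])   # [(1, 1)]
```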

  9. Suffix trees • Text as one long string. No words. • Genetic databases • Complex queries • Compacted trie structure • Problem: space • For text retrieval, inverted files are better

  10. Info for tree comes from the text itself

  11. Suffix array • All suffixes (by position) in lexicographic order • Allows binary search • Much less space: 40% n • Supra-index: sampling, for better disk access

  12. Suffix tree and suffix array: searching, construction • Searching • Patterns, prefixes, phrases. Not only words • Suffix tree: O(m), but space is the problem (m = query size) • Suffix array: O(log n) (n = database size) • Construction of arrays: sorting • Large text: n² log(M)/M, more than for inverted files • (Skip details) • Addition: n n' log(M)/M (n' is the size of the new portion) • Deletion: n
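An illustrative suffix-array sketch with binary search (the construction below is the naive O(n² log n) sort, shown only to make the structure concrete; it is not the large-text construction discussed above):

```python
def build_suffix_array(text):
    """Starting positions of all suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search for the range of suffixes that start with `pattern`:
    O(m log n) character comparisons."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # leftmost suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                       # leftmost suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "abracadabra"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "abra"))   # [0, 7]
```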

  13. Signature files • Usually worse than inverted files • Words are mapped to bit patterns • Blocks are mapped to the OR of their words’ patterns • If a block contains a word, all bits of the word’s pattern are set in the block signature • Sequential search over block signatures • False drops! • Design of the hash function matters • Have to traverse the candidate block to verify • Good for AND or proximity queries • bit patterns are ORed
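A toy signature file to illustrate the idea (the hash function, the signature width and the bits-per-word value below are arbitrary choices, not a tuned design): word signatures are ORed into a block signature, and every candidate block is verified sequentially to discard false drops.

```python
def word_signature(word, bits=16, k=3):
    """Map a word to a bit pattern with (up to) k bits set."""
    sig = 0
    for i in range(k):
        sig |= 1 << (hash((word, i)) % bits)
    return sig

def block_signature(words, bits=16, k=3):
    """A block's signature is the OR of its words' signatures."""
    sig = 0
    for w in words:
        sig |= word_signature(w, bits, k)
    return sig

def search(blocks, sigs, word, bits=16, k=3):
    """Candidate blocks contain all bits of the word's signature; each one is
    scanned to verify, since other words may have set the same bits (false drop)."""
    wsig = word_signature(word, bits, k)
    hits = []
    for i, (block, bsig) in enumerate(zip(blocks, sigs)):
        if bsig & wsig == wsig:                     # candidate (maybe false drop)
            if word in block.lower().split():       # verify by sequential scan
                hits.append(i)
    return hits

blocks = ["signature files map words to bits", "false drops require verification"]
sigs = [block_signature(b.lower().split()) for b in blocks]
print(search(blocks, sigs, "words"))   # [0]
```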

  14. False drop: letters in 2nd block

  15. Boolean operations • Merging occurrence lists • AND: to find repetitions • Processed according to the query syntax tree • Complexity linear in the intermediate results • Can be slow if they are huge • There are optimization techniques • E.g.: merge a small list with a big one by searching • This is the usual case (Zipf’s law)
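A sketch of that optimization, assuming both occurrence lists are sorted: each element of the short list is located in the long list by binary search, which is sublinear in the size of the long list.

```python
from bisect import bisect_left

def intersect_and(short_list, long_list):
    """AND of two sorted occurrence lists: O(s log L) instead of a linear
    merge - a win when one query word is much rarer than the other,
    which is the usual case under Zipf's law."""
    result = []
    for item in short_list:
        i = bisect_left(long_list, item)
        if i < len(long_list) and long_list[i] == item:
            result.append(item)
    return result

# docs containing a rare word AND a frequent word
print(intersect_and([3, 17, 42], list(range(0, 100, 2))))   # [42]
```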

  16. Sequential search • Necessary part of many algorithms (e.g., block addressing) • Brute force: O(nm) worst case, O(n) on average • MANY faster algorithms, but more complicated • See the book

  17. Approximate string matching • Match with k errors; select the match with minimum k • Levenshtein distance between strings s1 and s2 • The minimum number of editing operations to make one from the other • Symmetric for standard sets of operations • Operations: deletion, addition, change • Sometimes weighted • Solution: dynamic programming. O(mn), O(kn) • m, n are the lengths of the two strings
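A standard dynamic-programming computation of the Levenshtein distance with unit costs (one possible implementation, keeping only one row of the O(mn) matrix):

```python
def levenshtein(s1, s2):
    """Edit distance with unit costs for deletion, addition and change:
    O(mn) time, O(min(m, n)) extra space."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1                      # keep the shorter string as the row
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # addition
                            prev[j - 1] + (c1 != c2)))   # change (or match)
        prev = curr
    return prev[-1]

print(levenshtein("survey", "surgery"))   # 2
```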

  18. Regular expressions • Automaton: O(m·2^m) + O(n) – bad for long patterns • There are better methods, see the book • Using indices to search for words with errors • Inverted files: search in the vocabulary • Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path

  19. Search over compression • Improves both space AND time (fewer disk operations) • Compress the query and search • Huffman compression, words as symbols, byte-oriented codes • (frequencies: the most frequent words get shorter codes) • Search each word in the vocabulary → its code • More sophisticated algorithms exist • Compressed inverted files: less disk → less time • Text and index compression can be combined
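A highly simplified stand-in for word-based compression plus search (frequency-ranked integer codes instead of real byte-oriented Huffman codes, which are more involved): the query word is translated to its code via the vocabulary, and the compressed text is scanned for that code without decompressing.

```python
from collections import Counter

def build_codes(words):
    """Shorter (smaller) codes for more frequent words: each word gets its
    frequency rank. A stand-in for word-based Huffman coding, not the real thing."""
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: i for i, w in enumerate(ranked)}

def compress(words, codes):
    return [codes[w] for w in words]

def search_compressed(compressed, codes, query_word):
    """Look the query word up in the vocabulary, then scan for its code."""
    code = codes.get(query_word)
    if code is None:
        return []
    return [i for i, c in enumerate(compressed) if c == code]

words = "to be or not to be".split()
codes = build_codes(words)
print(search_compressed(compress(words, codes), codes, "be"))   # [1, 5]
```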

  20. ...compression • Suffix trees can be compressed almost to the size of suffix arrays • Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text • Instead of Huffman, use a code that respects alphabetic order • Almost the same compression ratio • Signature files are sparse, so they can be compressed • Ratios up to 70%

  21. Research topics • Perhaps new details in the integration of compression and search • “Linguistic” indexing: allowing linguistic variations • Search in plural or only in singular • Search with or without synonyms

  22. Conclusions • Inverted files seem to be the best option • Other structures are good for specific cases • Genetic databases • Sequential searching is an integral part of many indexing-based search techniques • Many methods to improve sequential searching • Compression can be integrated with search

  23. Thank you! Till April 26, 6 pm
