
Alexander Gelbukh Gelbukh

Special Topics in Computer Science: The Art of Information Retrieval. Chapter 8: Indexing and Searching. Alexander Gelbukh, www.Gelbukh.com. Previous Chapter: Conclusions. Text transformation: meaning instead of strings; lexical analysis; stopwords; stemming; POS, WSD, syntax, semantics


Presentation Transcript


  1. Special Topics in Computer Science: The Art of Information Retrieval. Chapter 8: Indexing and Searching. Alexander Gelbukh, www.Gelbukh.com

  2. Previous Chapter: Conclusions • Text transformation: meaning instead of strings • Lexical analysis • Stopwords • Stemming • POS, WSD, syntax, semantics • Ontologies to collate similar stems • Text compression • Searchable (compress the query, then search) • Random access • Word-based statistical methods (Huffman) • Index compression

  3. Previous Chapter: Research topics • All computational linguistics • Improved POS tagging • Improved WSD • Uses of thesaurus • for user navigation • for collating similar terms • Better compression methods • Searchable compression • Random access

  4. Types of searching • Sequential • Small texts • Volatile, or space limited • Indexed • Semi-static • Space overhead • First, we discuss indexed searching, then sequential

  5. Inverted files • Vocabulary: sqrt(n) (Heaps' law). 1 GB → 5M • Occurrences: n * 40% (stopwords) • positions (word, char), files, sections...
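A minimal sketch of the structure described on this slide, assuming character offsets as the stored positions: the vocabulary is the set of dictionary keys and the occurrences are the lists of offsets. The function name `build_inverted_index` and the stopword set are illustrative choices, not part of the lecture.

```python
import re
from collections import defaultdict

def build_inverted_index(text, stopwords=frozenset()):
    """Map each non-stopword to the ascending list of character offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        word = match.group()
        if word not in stopwords:
            index[word].append(match.start())
    return dict(index)

text = "to be or not to be that is the question"
index = build_inverted_index(text, stopwords={"to", "or", "the", "is"})
print(sorted(index))   # the vocabulary
print(index["be"])     # the occurrences of "be"
```

Dropping stopwords shrinks the occurrence lists, which is where most of the index space goes.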

  6. Compression: Block addressing • Block addressing: 5% overhead • 256, 64K, ..., blocks (1, 2, ..., bytes) • Equal size (faster search) or logical sections (retrieval units)
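A rough sketch of block addressing with equal-size blocks, assuming lowercase ASCII text: the index stores only block numbers, so exact positions are recovered by a sequential scan inside the candidate blocks. The helper names (`build_block_index`, `is_whole_word`, `find_positions`) are hypothetical.

```python
import re
from collections import defaultdict

def build_block_index(text, block_size):
    """Map each word to the sorted list of block numbers in which it starts."""
    index = defaultdict(set)
    for match in re.finditer(r"[a-z0-9]+", text):
        index[match.group()].add(match.start() // block_size)
    return {word: sorted(blocks) for word, blocks in index.items()}

def is_whole_word(text, pos, word):
    """True if the occurrence of `word` at `pos` is not part of a longer word."""
    end = pos + len(word)
    return ((pos == 0 or not text[pos - 1].isalnum())
            and (end == len(text) or not text[end].isalnum()))

def find_positions(text, index, word, block_size):
    """Use the index to select candidate blocks, then scan only those blocks."""
    positions = []
    for block in index.get(word, []):
        start = block * block_size
        # Scan len(word) - 1 characters past the block so that a word
        # starting near the end of the block is still found.
        limit = start + block_size + len(word) - 1
        pos = text.find(word, start, limit)
        while pos != -1:
            if is_whole_word(text, pos, word):
                positions.append(pos)
            pos = text.find(word, pos + 1, limit)
    return positions

text = "the cat sat on the mat and the cat ran"
block_index = build_block_index(text, block_size=8)
print(block_index["cat"])
print(find_positions(text, block_index, "cat", block_size=8))
```

Larger blocks mean fewer distinct block numbers per word (a smaller index) at the cost of more sequential scanning per query.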

  7. Searching in inverted files • Vocabulary search • Separate file • Many searching techniques • Lexicographic: log V (voc. size) = ½ log n (Heaps) • Hashing is not good for prefix search • Retrieval of occurrences • Manipulation with occurrences: ~sqrt(n) (Heaps, Zipf) • Boolean operations. Context search • Merging • One list is shorter (Zipf law) • Only inverted files allow both space and time to be sublinear; suffix trees and signature files don't
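To illustrate why a lexicographically sorted vocabulary supports prefix queries while hashing does not, here is a small sketch using binary search. The function name `prefix_range` is an assumption, and the upper bound relies on no word continuing the prefix with the highest Unicode code point.

```python
from bisect import bisect_left

def prefix_range(vocabulary, prefix):
    """Return all words of a sorted vocabulary that start with `prefix`.

    Two binary searches, O(log V) each; words sharing a prefix are
    contiguous in lexicographic order.
    """
    lo = bisect_left(vocabulary, prefix)
    hi = bisect_left(vocabulary, prefix + "\U0010ffff")
    return vocabulary[lo:hi]

vocabulary = sorted(["index", "indexing", "inverted", "search",
                     "searching", "sequential", "suffix"])
print(prefix_range(vocabulary, "search"))
print(prefix_range(vocabulary, "in"))
```

A hash table scatters "search" and "searching" to unrelated buckets, so the same query would need a scan of the whole vocabulary.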

  8. Building inverted file: 1 • Infinite memory? • Use trie to store vocabulary • append positions • O(n)

  9. Building inverted file: 2 • Finite memory? • Fill the memory • Write partial index; n/M pieces • Merge partial indices (hierarchically): n log(n/M) • Insertion: index, merge. n + n' log(n'/M) • Deleting: eliminate every occurrence. n • Very fast creation/maintenance
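The finite-memory procedure can be sketched as follows, assuming memory holds at most `memory_limit` (word, position) entries and that partial indices are plain in-memory lists rather than files on disk. All function names are hypothetical; pairwise merging over successive levels stands in for the hierarchical merge.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def partial_indices(words, memory_limit):
    """Yield sorted runs of (word, position) pairs, each with at most `memory_limit` entries."""
    run = []
    for position, word in enumerate(words):
        run.append((word, position))
        if len(run) == memory_limit:   # memory is full: emit a partial index
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)

def merge_hierarchically(runs):
    """Merge runs two at a time, level by level, until one sorted run remains."""
    runs = list(runs)
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
    return runs[0] if runs else []

def to_index(sorted_pairs):
    """Group a sorted run of (word, position) pairs into word -> positions."""
    return {word: [pos for _, pos in group]
            for word, group in groupby(sorted_pairs, key=itemgetter(0))}

words = "a b a c b a d".split()
index = to_index(merge_hierarchically(partial_indices(words, memory_limit=3)))
print(index)
```

With n/M runs, the pairwise merge needs about log(n/M) levels, each touching all n entries, which matches the n log(n/M) cost on the slide.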

  10. Suffix trees • Text as one long string. No words. • Genetic databases • Complex queries • Compacted trie structure • Problem: space • For text retrieval, inverted files are better

  11. Suffix array • All suffixes (by position) in lexicographic order • Allows binary search • Much less space: 40% n • Supra-index: sampling, for better disk access
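As an illustration, a suffix array can be built by sorting suffix start positions and searched with two binary searches. This is a naive sketch: the construction below compares whole suffixes and is far slower than the methods meant for large texts, and the names `build_suffix_array` and `find_occurrences` are assumptions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return all positions where `pattern` occurs, using binary search on the suffix array."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
print(find_occurrences(text, suffix_array, "abra"))
```

Because the pattern is matched against suffixes of the raw text, it need not align with word boundaries, which is what makes prefix and phrase queries possible.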

  12. Searching. Construction • Searching • Patterns, prefixes, phrases. Not only words • Suffix tree: O(m), but: space (m = query size) • Suffix array: O(log n) (n = database size) • Construction of arrays: sorting • Large text: n^2 log(M)/M, more than for inverted files • Skip details • Addition: n n' log(M)/M • Deletion: n

  13. Signature files • Usually worse than inverted files • Words are mapped to bit patterns • Blocks are mapped to ORs of their word patterns • If a block contains a word, all its bits are set • Sequential search for blocks • False drops! • Design of the hash function • Have to traverse the block • Good to search ANDs or proximity queries • bit patterns are ORed
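A small sketch of the signature-file idea, under assumed parameters (64-bit signatures, 3 bits set per word, SHA-256 as the hash): each block's signature is the OR of its word signatures, and a candidate block is traversed to rule out false drops. None of these parameter values or function names come from the lecture.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word

def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        bit = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIGNATURE_BITS
        signature |= 1 << bit
    return signature

def block_signature(words):
    """OR together the signatures of all words in a block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, word):
    """Return numbers of blocks that really contain `word`."""
    query = word_signature(word)
    hits = []
    for number, (block, signature) in enumerate(zip(blocks, signatures)):
        if signature & query == query:   # all query bits set: candidate block
            if word in block:            # traverse the block to reject false drops
                hits.append(number)
    return hits

blocks = [["inverted", "files", "index"],
          ["suffix", "trees", "arrays"],
          ["signature", "files", "hash"]]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "files"))
```

The filter never misses a block that contains the word; it can only let through extra blocks, which is why the verification pass is needed.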

  14. Boolean operations • Merging file (occurrences) lists • AND: to find repetitions • According to query syntax tree • Complexity linear in intermediate results • Can be slow if they are huge • There are optimization techniques • E.g.: merge small list with a big one by searching • This is a usual case (Zipf)
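The two ways of computing AND over sorted occurrence lists can be sketched like this: a linear merge, and the optimization of searching each element of the short list in the long one. Function names are illustrative.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """AND of two sorted lists by a linear merge: O(len(a) + len(b))."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_by_search(short, long):
    """AND by binary-searching each item of the short list in the long one: O(len(short) * log len(long))."""
    result = []
    lo = 0
    for item in short:
        lo = bisect_left(long, item, lo)   # never look left of the previous hit
        if lo == len(long):
            break
        if long[lo] == item:
            result.append(item)
    return result

rare = [2, 5, 9, 14, 30]
frequent = list(range(0, 100, 3))
print(intersect_merge(rare, frequent))
print(intersect_by_search(rare, frequent))
```

When one list is much shorter than the other, as Zipf's law makes common, the second variant touches only a small part of the long list.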

  15. Sequential search • Necessary part of many algorithms (e.g., block addressing) • Brute force: O(nm) worst-case, O(n) on average • Knuth-Morris-Pratt: linear worst case, but the same average • Boyer-Moore: n log(m) / m. Not all chars are examined! • If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once • Shift-Or: uses logical operations on all 32 bits in parallel • BDM: automaton. Complexity same as Boyer-Moore • Combination of BDM with bit parallelism
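As one concrete instance of the bit-parallel approach, here is a sketch of Shift-Or. Python integers are unbounded, so the state is masked to m bits explicitly; on a 32-bit machine word the same code would limit the pattern to 32 characters. The function name is an assumption.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of `pattern` in `text`."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has a 0 at bit i exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[:i + 1] matches the text ending at the current character.
    state = all_ones
    match_bit = 1 << (m - 1)
    positions = []
    for end, ch in enumerate(text):
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & match_bit == 0:
            positions.append(end - m + 1)
    return positions

print(shift_or_search("abracadabra", "abra"))
```

Each text character costs one shift and one OR, regardless of how many pattern prefixes are currently alive.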

  16. Approximate string matching • Match with k errors • Levenshtein distance • Dynamic programming: O(mn), O(kn) • Automaton: non-deterministic • Convert to deterministic: O(n), but huge structure • Bit-parallel: O(n), the fastest known • Filtering: sublinear! • k errors cannot alter all of k+1 segments • multipattern exact search; detect suspicious places • uses approximate algorithm only when needed
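The O(mn) dynamic-programming approach can be sketched as below, keeping one column of the Levenshtein table per text character. This is the plain O(mn) variant, not the O(kn) refinement; the function name and the convention of reporting exclusive end positions are assumptions.

```python
def approximate_search(text, pattern, k):
    """Return end positions (exclusive) where some substring of `text`
    matches `pattern` with at most k errors (Levenshtein distance)."""
    m = len(pattern)
    # column[i] = fewest errors to match pattern[:i] against a substring
    # ending at the current text position; column[0] stays 0 because a
    # match may start anywhere.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        diagonal = column[0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            best = min(diagonal + cost,    # match or substitution
                       column[i] + 1,      # extra character in the text
                       column[i - 1] + 1)  # pattern character missing from the text
            diagonal = column[i]
            column[i] = best
        if column[m] <= k:
            ends.append(j + 1)
    return ends

print(approximate_search("abcdefg", "cxe", 1))
```

With k = 0 the function reduces to exact matching, which gives a quick sanity check.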

  17. Regular expressions • Regular expressions • Automaton: O(m 2^m) + O(n), bad for long patterns • Bit-parallel (simulates non-deterministic) • Using indices to search for words with errors • Inverted files: search in vocabulary, then each word • Suffix trees and suffix arrays: the same algorithms!

  18. Structural queries • Ad-hoc index for structure • Indexing tags as words • Inverted files are good since they store occurrences in order

  19. Search over compression • Improves both space AND time (fewer disk operations) • Compress query and search • Huffman compression, words as symbols, bytes • (frequencies: most frequent shorter) • Search each word in the vocabulary → its code • More sophisticated algorithms • Compressed inverted files: less disk → less time • Text and index compression can be combined

  20. ...compression • Suffix trees can be compressed almost to size of suffix arrays • Suffix arrays can't be compressed (almost random), but can be constructed over compressed text • instead of Huffman, use a code that respects alphabetic order • almost the same compression • Signature files are sparse, so can be compressed • ratios up to 70%

  21. Research topics • Perhaps, new details in integration of compression and search • “Linguistic” indexing: allowing linguistic variations • Search in plural or only singular • Search with or without synonyms

  22. Conclusions • Inverted files seem to be the best option • Other structures are good for specific cases • Genetic databases • Sequential searching is an integral part of many indexing-based search techniques • Many methods to improve sequential searching • Compression can be integrated with search

  23. Thank you! Till compensation lecture?
