Alexander Gelbukh Gelbukh

Special Topics in Computer Science: The Art of Information Retrieval. Chapter 8: Indexing and Searching. Alexander Gelbukh, www.Gelbukh.com

Previous Chapter: Conclusions • Text transformation: meaning instead of strings • Lexical analysis • Stopwords • Stemming • POS, WSD, syntax, semantics • Ontologies to collate similar stems • Text compression • Searchable (compress the query, then search) • Random access • Word-based statistical methods (Huffman) • Index compression

Previous Chapter: Research topics • All computational linguistics • Improved POS tagging • Improved WSD • Uses of thesaurus • for user navigation • for collating similar terms • Better compression methods • Searchable compression • Random access

Types of searching • Sequential • Small texts • Volatile, or space limited • Indexed • Semi-static • Space overhead • First, we discuss indexed searching, then sequential

Inverted files • Vocabulary: O(sqrt(n)) by Heaps' law; 1 GB of text → 5 MB • Occurrences: about 40% of n (without stopwords) • positions (word, char), files, sections...

Compression: Block addressing • Block addressing: 5% overhead • 256, 64K, ... blocks (pointers of 1, 2, ... bytes) • Equal size (faster search) or logical sections (retrieval units)

Searching in inverted files • Vocabulary search • Separate file • Many searching techniques • Lexicographic: log V (voc. size) = ½ log n (Heaps) • Hashing is not good for prefix search • Retrieval of occurrences • Manipulation of occurrences: ~sqrt (n) (Heaps, Zipf) • Boolean operations. Context search • Merging • One list is shorter (Zipf law) • Only inverted files allow both sublinear space and sublinear time; suffix trees and signature files don't

Building inverted file: 1 • Infinite memory? • Use trie to store vocabulary • append positions • O(n)
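A minimal sketch of the in-memory case, assuming the whole text fits in memory. A Python dict stands in for the trie here (an assumption made for brevity, not the structure named on the slide), and the name `build_inverted_index` is hypothetical. Each word is visited once and its position appended, so the pass is linear in the text size.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each lowercased word to the character positions where it starts."""
    index = defaultdict(list)  # a dict replaces the trie in this sketch
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())  # append position
    return dict(index)

index = build_inverted_index("to be or not to be")
print(index["be"])
```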

Building inverted file: 2 • Finite memory? • Fill the memory • Write partial index; n/M pieces • Merge partial indices (hierarchically): n log (n/M) • Insertion: index, merge. n + n'log(n'/M) • Deleting: eliminate every occurrence. n • Very fast creating/maintenance
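A rough sketch of the finite-memory case, under simplifying assumptions: "memory" holds at most `memory_limit` occurrences, partial indices are kept as sorted in-memory lists instead of files on disk, and a single k-way merge (`heapq.merge`) replaces the hierarchical pairwise merging. The names `partial_indices` and `merge_indices` are hypothetical.

```python
import heapq
from collections import defaultdict

def partial_indices(words, memory_limit):
    """Yield sorted (word, position) runs of at most memory_limit occurrences."""
    run = []
    for position, word in enumerate(words):
        run.append((word, position))
        if len(run) == memory_limit:  # memory is full: write a partial index
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)

def merge_indices(runs):
    """Merge sorted runs into one index: word -> ascending list of positions."""
    merged = defaultdict(list)
    for word, position in heapq.merge(*runs):
        merged[word].append(position)
    return dict(merged)

words = "a b a c b a".split()
print(merge_indices(partial_indices(words, memory_limit=2)))
```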

Suffix trees • Text as one long string. No words. • Genetic databases • Complex queries • Compacted trie structure • Problem: space • For text retrieval, inverted files are better

Suffix array • All suffixes (by position) in lexicographic order • Allows binary search • Much less space: 40% n • Supra-index: sampling, for better disk access
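As an illustration, a suffix array and its binary search can be sketched as follows. The construction sorts the suffixes naively, which is fine for a toy example but not how large texts would be handled; the function names are hypothetical. Two binary searches delimit the run of suffixes that start with the pattern.

```python
def build_suffix_array(text):
    """Positions of all suffixes, sorted by the suffix they start (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return the sorted text positions where pattern occurs."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "banana"
print(find_occurrences(text, build_suffix_array(text), "ana"))
```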

Searching. Construction • Searching • Patterns, prefixes, phrases. Not only words • Suffix tree: O(m), but: space (m = query size) • Suffix array: O(log n) (n = database size) • Construction of arrays: sorting • Large text: n^2 log(M)/M, more than for inverted files • Skip details • Addition: n n' log(M)/M • Deletion: n

Signature files • Usually worse than inverted files • Words are mapped to bit patterns • Blocks are mapped to ORs of their word patterns • If a block contains a word, all its bits are set • Sequential search for blocks • False drops! • Design of the hash function • Have to traverse the block • Good to search ANDs or proximity queries • bit patterns are ORed
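A toy sketch of the idea, with arbitrary illustrative parameters: 64-bit signatures, 3 bits per word, and SHA-256 as the hash function (none of these choices come from the slides). A block is a candidate when all bits of the query word are set in its signature; the block is then traversed to rule out false drops.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word

def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def block_signature(words):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def search_blocks(blocks, signatures, word):
    """Scan signatures sequentially; verify candidates to discard false drops."""
    wanted = word_signature(word)
    hits = []
    for number, (block, signature) in enumerate(zip(blocks, signatures)):
        if signature & wanted == wanted and word in block:
            hits.append(number)
    return hits

blocks = [["inverted", "files"], ["suffix", "trees"]]
signatures = [block_signature(block) for block in blocks]
print(search_blocks(blocks, signatures, "suffix"))
```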

Boolean operations • Merging file (occurrences) lists • AND: to find repetitions • According to query syntax tree • Complexity linear in intermediate results • Can be slow if they are huge • There are optimization techniques • E.g.: merge small list with a big one by searching • This is a usual case (Zipf)

Sequential search • Necessary part of many algorithms (e.g., block addressing) • Brute force: O(nm) worst-case, O(n) on average • Knuth-Morris-Pratt: linear in the worst case, but the same on average • Boyer-Moore: n log(m) / m. Not all chars are examined! • If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once • Shift-Or: uses logical operations on all 32 bits in parallel • BDM: automaton. Complexity same as Boyer-Moore • Combination of BDM with bit parallelism
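A minimal Shift-Or sketch. Python integers are unbounded, so the 32-bit word limit does not apply here; in a language with fixed-size words the pattern would have to fit in one machine word. Bit i of the state is 0 exactly when the first i+1 pattern characters match the text ending at the current position. The function name is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared where pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for pos, ch in enumerate(text):
        # One shift and one OR update all prefix states in parallel.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(pos - m + 1)
    return matches

print(shift_or_search("abracadabra", "abra"))
```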

Approximate string matching • Match with k errors • Levenshtein distance • Dynamic programming: O(mn), O(kn) • Automaton: non-deterministic • Convert to deterministic: O(n), but huge structure • Bit-parallel: O(n), the fastest known • Filtering: sublinear! • k errors cannot alter all k+1 segments of the pattern • multipattern exact search; detect suspicious places • uses approximate algorithm only when needed
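The O(mn) dynamic programming variant can be sketched as below, keeping a single column of the edit-distance matrix. Entry `column[i]` holds the smallest Levenshtein distance between the first i pattern characters and any text substring ending at the current position; `column[0]` stays 0 because a match may start anywhere. The function name is hypothetical, and the O(kn) refinement is not shown.

```python
def approximate_matches(text, pattern, k):
    """Return text positions where a match with at most k errors ends."""
    m = len(pattern)
    column = list(range(m + 1))  # distances against the empty text
    ends = []
    for pos, ch in enumerate(text):
        previous_diagonal = column[0]
        column[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(previous_diagonal + cost,  # match or substitution
                            column[i] + 1,             # extra text character
                            column[i - 1] + 1)         # missing pattern character
            previous_diagonal = column[i]
            column[i] = new_value
        if column[m] <= k:
            ends.append(pos)
    return ends

print(approximate_matches("abcdef", "bxd", 1))
```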

Regular expressions • Regular expressions • Automaton: O(m 2^m) + O(n), bad for long patterns • Bit-parallel (simulates non-deterministic) • Using indices to search for words with errors • Inverted files: search in vocabulary, then each word • Suffix trees and suffix arrays: the same algorithms!

Structural queries • Ad-hoc index for structure • Indexing tags as words • Inverted files are good since they store occurrences in order

Search over compression • Improves both space AND time (fewer disk operations) • Compress query and search • Huffman compression, words as symbols, bytes • (frequencies: most frequent shorter) • Search each word in the vocabulary → its code • More sophisticated algorithms • Compressed inverted files: less disk → less time • Text and index compression can be combined

...compression • Suffix trees can be compressed almost to size of suffix arrays • Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text • instead of Huffman, use a code that respects alphabetic order • almost the same compression • Signature files are sparse, so can be compressed • ratios up to 70%

Research topics • Perhaps, new details in integration of compression and search • “Linguistic” indexing: allowing linguistic variations • Search in plural or only singular • Search with or without synonyms

Conclusions • Inverted files seem to be the best option • Other structures are good for specific cases • Genetic databases • Sequential searching is an integral part of many indexing-based search techniques • Many methods to improve sequential searching • Compression can be integrated with search

Thank you! Till compensation lecture?
