A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo
The Problem • Initial Problem • Text searching: Finding occurrences of a pattern string in a large (static) document • Solution • Text indexing: Trading space for time • New Problem • Succinct Text indexes: Reducing the space cost
Pattern Searching • Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T. • Three types of queries • Existential queries: Does P occur in T? • Cardinality queries: How many times does P occur in T? • Listing queries: Where does P occur in T?
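The three query types can be pinned down with a naive, index-free baseline. This is only a minimal sketch for stating what each query returns; the function names are hypothetical, positions are 1-based as on the slides, and the pattern is assumed to be non-empty.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Listing query: 1-based start positions of pattern in text (overlaps included)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start + 1)                 # convert Python's 0-based index to 1-based
        start = text.find(pattern, start + 1)  # step by one so overlapping matches are found
    return hits


def count(text: str, pattern: str) -> int:
    """Cardinality query: how many times pattern occurs in text."""
    return len(occurrences(text, pattern))


def exists(text: str, pattern: str) -> bool:
    """Existential query: does pattern occur in text at all?"""
    return pattern in text
```

Each call scans the whole text, which is the cost a text index is meant to avoid on a large static document.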
Text Indexing • Inverted files • Word index • Need to store the text as well as the index • Suffix trees • Efficient full-text index • 4n lg n to 6n lg n bits! • Suffix arrays • n lg n bits in basic form, but • 3n lg n bits (with LCP data)
Applications • Text databases • electronic encyclopedias, dictionaries, books, etc. • Web search engines • Google, Altavista, etc. • Bioinformatics • gene databases • More…
Related Work • Compressed Suffix Arrays • Grossi & Vitter 2000 • Sadakane 2000 • Grossi, Gupta & Vitter 2003 • FM-index • Ferragina & Manzini 2000 & 2001
Assumptions & Notation • Alphabet: Σ = {a, b} • Text: T[1..n] • T[n] = #, where a < # < b • Pattern: P[1..m]
Permutations and Suffix Arrays • An observation • Permutations: n! • Suffix arrays: 2^(n-1) • Not all permutations are suffix arrays • An example • A suffix array: 4, 7, 5, 1, 8, 3, 6, 2 • Text: abbaaba# • A permutation: 4, 7, 1, 5, 8, 2, 3, 6 • Not a suffix array of any binary text
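The example and the counting observation can be reproduced by brute force for small n. The sketch below is an assumption-laden illustration rather than anything from the talk: it sorts suffixes directly under the ordering a < # < b, and the names suffix_array and all_suffix_arrays are hypothetical.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b from the notation slide


def suffix_array(text: str) -> list[int]:
    """1-based suffix array of text, built by sorting the suffixes directly."""
    def key(i: int) -> list[int]:
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


def all_suffix_arrays(n: int) -> set[tuple[int, ...]]:
    """Suffix arrays of every text of length n over {a, b} that ends in '#'."""
    return {
        tuple(suffix_array("".join(body) + "#"))
        for body in product("ab", repeat=n - 1)
    }
```

For n = 8, suffix_array("abbaaba#") gives the slide's array, all_suffix_arrays(8) contains 2^7 = 128 distinct arrays out of 8! = 40320 permutations, and the permutation 4, 7, 1, 5, 8, 2, 3, 6 is not among them.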
Two Features of Suffix Arrays • Ascending-to-max • Non-nesting • Suffix array: 4, 7, 5, 1, 8, 3, 6, 2 • Another permutation: 4, 7, 1, 5, 8, 2, 3, 6
A Categorization Theorem • A permutation is a suffix array iff it is: • Ascending-to-max • Non-nesting • An immediate application: • Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
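For experimenting with the question the theorem answers, a naive checker is enough. The sketch below is not the O(n) algorithm mentioned above and does not test the ascending-to-max or non-nesting properties, whose definitions are not spelled out on these slides. It relies only on the ordering a < # < b: suffixes ranked before the suffix "#" must start with a, and those ranked after it must start with b, so a permutation determines the only text it could belong to. All function names are hypothetical.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b from the notation slide


def suffix_array(text: str) -> list[int]:
    """1-based suffix array of text, built by sorting the suffixes directly."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def candidate_text(perm: list[int]) -> str:
    """The only text whose suffix array could be perm (perm must contain n)."""
    n = len(perm)
    k = perm.index(n)                  # 0-based rank of the suffix "#"
    text = [""] * n
    for rank, pos in enumerate(perm):
        text[pos - 1] = "a" if rank < k else ("#" if rank == k else "b")
    return "".join(text)


def is_suffix_array(perm: list[int]) -> bool:
    """Naive check: rebuild the candidate text and compare suffix arrays."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return False                   # not a permutation of 1..n
    return suffix_array(candidate_text(perm)) == list(perm)
```

Both permutations from the earlier example map to the same candidate text, abbaaba#, so only the one that equals that text's suffix array passes.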
Application: Space Efficient Suffix Array • Text: abaaabbaaabaabb# • SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14 • Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1 • Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
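The slide does not define Ba and Bb in words. One reading that reproduces every bit shown is that entry i of Ba (respectively Bb) is 1 exactly when the character preceding suffix SA[i] in the text is a (respectively b), with both bits 0 for the suffix starting at position 1. The sketch below assumes that reading; preceding_char_bits is a hypothetical helper name.

```python
TEXT = "abaaabbaaabaabb#"
SA = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]  # 1-based, from the slide


def preceding_char_bits(text: str, sa: list[int], symbol: str) -> list[int]:
    """Bit i is 1 iff the character just before suffix sa[i] equals symbol."""
    # sa holds 1-based positions, so the preceding character is text[pos - 2];
    # the suffix starting at position 1 has no preceding character.
    return [1 if pos > 1 and text[pos - 2] == symbol else 0 for pos in sa]


BA = preceding_char_bits(TEXT, SA, "a")
BB = preceding_char_bits(TEXT, SA, "b")
```

Under this assumption the two vectors together take 2n bits and mark, for each suffix in sorted order, which symbol would extend it one position to the left.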
Basic Searching Algorithm: Answering Cardinality Queries • Basic idea: backward search • Start from the end of the pattern P • For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i, m] • Example: P = aba • SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
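Backward search for P = aba can be traced with the two bit vectors from the 16-character example. This is a sketch under assumptions: it uses the textbook backward-search update with prefix-sum rank tables, which may differ in detail from the structure presented in the talk, and it reads Ba and Bb as marking the character that precedes each suffix. The function names are hypothetical, and patterns are assumed to be non-empty strings over {a, b}.

```python
from itertools import accumulate

# Bit vectors copied from the example slide (entry i corresponds to SA[i]).
BA = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
BB = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0]


def backward_search(pattern: str, ba=BA, bb=BB):
    """Return the 1-based interval (s, e) of SA for pattern, or None if absent."""
    # rank[c][i] = number of 1s among the first i bits of the vector for c.
    # Rebuilt on every call for brevity; a real index would precompute it.
    rank = {"a": [0, *accumulate(ba)], "b": [0, *accumulate(bb)]}
    # With a < # < b, suffixes starting with 'a' come first, then "#",
    # then suffixes starting with 'b'; first[c] counts suffixes before block c.
    first = {"a": 0, "b": rank["a"][-1] + 1}
    s, e = 1, len(ba)
    for c in reversed(pattern):        # process P[m], P[m-1], ..., P[1]
        s = first[c] + rank[c][s - 1] + 1
        e = first[c] + rank[c][e]
        if s > e:
            return None                # no suffix is prefixed by P[i, m]
    return s, e


def count(pattern: str) -> int:
    """Cardinality query answered from the interval width."""
    interval = backward_search(pattern)
    return 0 if interval is None else interval[1] - interval[0] + 1
```

For P = aba the interval evolves as [1, 9] for a, [11, 13] for ba, and [6, 7] for aba, so the pattern occurs twice; SA[6] = 1 and SA[7] = 10 are the two starting positions.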
More Algorithms and Tradeoffs • Answering listing queries • Speeding up the reporting of Occurrences of Long Patterns • Self-indexing • Time-space tradeoff: multi-level structure
Putting it all together Three index structures:
Conclusion • Summary • A theorem that characterizes a permutation as the suffix array of a binary string • An efficient algorithm checking whether a permutation is a suffix array • Three space efficient text indexing methods
Conclusions (Continued) • Related subsequent work • Generalization to larger alphabets • Open problem • O(n)-bits text index supporting searching in O(m+occ) time.