
A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

Meng He, J. Ian Munro, and S. Srinivasa Rao, University of Waterloo.


Presentation Transcript


  1. A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo

  2. The Problem • Initial Problem • Text searching: Finding occurrences of a pattern string in a large (static) document • Solution • Text indexing: Trading space for time • New Problem • Succinct Text indexes: Reducing the space cost

  3. Pattern Searching • Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T. • Three types of queries • Existential queries: Does P occur in T? • Cardinality queries: How many times does P occur in T? • Listing queries: Where does P occur in T?
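
The three query types can be illustrated with a naive scan that uses no index at all. This is only a sketch to pin down what each query returns; the function name `find_occurrences` is hypothetical, and the sample text and pattern are borrowed from slides 11 and 12.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based starting positions of pattern in text (naive scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


text = "abaaabbaaabaabb#"   # sample text from slide 11
pattern = "aba"             # sample pattern from slide 12

positions = find_occurrences(text, pattern)  # listing query: where does P occur?
count = len(positions)                       # cardinality query: how many times?
exists = count > 0                           # existential query: does P occur at all?
```

The scan costs time proportional to n·m per query, which is the cost a text index is meant to avoid on a large static document.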

  4. Text Indexing • Inverted files • Word index • Need to store the text as well as the index • Suffix trees • Efficient full-text index • 4n lg n to 6n lg n bits! • Suffix arrays • n lg n bits in basic form, but • 3n lg n bits (with LCP data)
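
To make the space figures concrete, the following sketch evaluates them for a given text length. The helper `index_bits` and its dictionary keys are hypothetical; the last entry (n bits for the raw text) assumes the two-letter alphabet introduced on slide 7 and is included only as a point of comparison.

```python
import math


def index_bits(n: int) -> dict[str, float]:
    """Space in bits for each structure on the slide, for a text of length n."""
    lg_n = math.log2(n)
    return {
        "suffix tree (low)": 4 * n * lg_n,
        "suffix tree (high)": 6 * n * lg_n,
        "suffix array (basic)": n * lg_n,
        "suffix array (with LCP)": 3 * n * lg_n,
        # Assumption: a binary text needs about one bit per character.
        "binary text itself": float(n),
    }


sizes = index_bits(2 ** 20)
for name, bits in sizes.items():
    print(f"{name}: {bits / 8 / 2 ** 20:.2f} MiB")
```

For a binary text, every structure listed is a factor of lg n or more larger than the text itself, which is the motivation for succinct indexes.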

  5. Applications • Text databases • electronic encyclopedias, dictionaries, books, etc. • Web search engines • Google, Altavista, etc. • Bioinformatics • gene databases • More…

  6. Related Work • Compressed Suffix Arrays • Grossi & Vitter 2000 • Sadakane 2000 • Grossi, Gupta & Vitter 2003 • FM-index • Ferragina & Manzini 2000 & 2001

  7. Assumptions & Notation • Alphabet: Σ = {a, b} • Text: T[1..n] • T[n] = #, where a < # < b • Pattern: P[1..m]

  8. Permutations and Suffix Arrays • An observation • Permutations: n! • Suffix arrays: 2^(n-1) • Not all permutations are suffix arrays • An example • A suffix array: 4, 7, 5, 1, 8, 3, 6, 2 • Text: abbaaba# • A permutation: 4, 7, 1, 5, 8, 2, 3, 6 • Not a suffix array of any binary text
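
The counts and the example on this slide can be checked by brute force for small n. The sketch below sorts suffixes directly under the order a < # < b from slide 7; the names `ORDER`, `suffix_array`, and `all_suffix_arrays` are hypothetical, and the quadratic-or-worse sorting is for illustration only.

```python
from itertools import product

# Character order from slide 7: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """1-based suffix array of text, comparing suffixes under ORDER."""
    def key(i: int) -> list[int]:
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


def all_suffix_arrays(n: int) -> set[tuple[int, ...]]:
    """Suffix arrays of every text of length n over {a, b} that ends in '#'."""
    return {
        tuple(suffix_array("".join(body) + "#"))
        for body in product("ab", repeat=n - 1)
    }


arrays = all_suffix_arrays(8)
print(suffix_array("abbaaba#"))            # the slide's example text
print(len(arrays))                         # number of distinct suffix arrays for n = 8
print((4, 7, 1, 5, 8, 2, 3, 6) in arrays)  # the slide's second permutation
```

Each of the 2^(n-1) texts yields a different suffix array, so for n = 8 only 128 of the 8! = 40320 permutations are suffix arrays.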

  9. Two Features of Suffix Arrays • Ascending-to-max • Non-nesting • Suffix array: 4 7 5 1 8 3 6 2 • Another permutation: 4 7 1 5 8 2 3 6

  10. A Categorization Theorem • A permutation is a suffix array iff it is: • Ascending-to-max • Non-nesting • An immediate application: • Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
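
The transcript does not spell out the two properties, so the sketch below is not the O(n) check from the talk. It is a slower stand-in that relies only on the order a < # < b: every suffix ranked before the suffix "#" must start with a and every suffix ranked after it must start with b, so a permutation determines at most one candidate text. The check rebuilds that text and compares suffix arrays. All function names are hypothetical.

```python
# Character order from slide 7: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """1-based suffix array of text, built by direct sorting (not linear time)."""
    def key(i: int) -> list[int]:
        return [ORDER[c] for c in text[i - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


def candidate_text(perm: list[int]) -> str:
    """The only text whose suffix array could be perm, given a < # < b."""
    n = len(perm)
    rank = {pos: r for r, pos in enumerate(perm)}
    # Positions ranked before the suffix '#' (position n) hold 'a', the rest 'b'.
    chars = ["a" if rank[pos] < rank[n] else "b" for pos in range(1, n)]
    return "".join(chars) + "#"


def is_suffix_array(perm: list[int]) -> bool:
    """True iff perm is the suffix array of some binary text ending in '#'."""
    if sorted(perm) != list(range(1, len(perm) + 1)):
        return False  # not a permutation of 1..n
    return suffix_array(candidate_text(perm)) == list(perm)


print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))
print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))
```

Both permutations from slide 8 map to the same candidate text, abbaaba#, but only the first one equals that text's suffix array.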

  11. Application: Space Efficient Suffix Array • Text: abaaabbaaabaabb# • SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14 • Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1 • Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
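
The transcript gives the two bit vectors without a definition. The values on the slide are consistent with the following reading, which is an inference from the example data rather than a statement taken from the talk: Ba[i] = 1 when the character just before suffix SA[i] is a, and Bb[i] = 1 when it is b (both are 0 for the suffix starting at position 1). The function name `preceding_char_bits` is hypothetical.

```python
def preceding_char_bits(text: str, sa: list[int]) -> tuple[list[int], list[int]]:
    """Mark, for each suffix in SA order, whether the preceding character is a or b."""
    ba, bb = [], []
    for pos in sa:
        # The suffix starting at position 1 has no preceding character.
        prev = text[pos - 2] if pos > 1 else None
        ba.append(1 if prev == "a" else 0)
        bb.append(1 if prev == "b" else 0)
    return ba, bb


TEXT = "abaaabbaaabaabb#"
SA = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]

ba, bb = preceding_char_bits(TEXT, SA)
print("Ba:", *ba)
print("Bb:", *bb)
```

Under this reading the two vectors take 2n bits in total, compared with n lg n bits for the suffix array itself.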

  12. Basic Searching Algorithm: Answering Cardinality Queries • Basic idea: backward search • Start from the end of the pattern P • For i = m, m-1, …, 1, compute the interval [s,e] of SA whose corresponding suffixes are prefixed with P[i..m] • Example: P = aba • SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
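
A minimal sketch of backward search on the slide's example follows. It assumes the bit vectors mark the character preceding each suffix in SA order (an inference from the data on slide 11) and uses the standard backward-search update of the interval; the talk's own data structures are not reproduced here. Plain prefix-sum lists stand in for succinct rank structures, and all function names are hypothetical.

```python
from itertools import accumulate

TEXT = "abaaabbaaabaabb#"
SA = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]


def build_rank_tables(text: str, sa: list[int]) -> dict[str, list[int]]:
    """tables[c][k] = how many of the first k suffixes in SA order are preceded by c."""
    tables = {}
    for c in "ab":
        bits = [1 if pos > 1 and text[pos - 2] == c else 0 for pos in sa]
        tables[c] = [0] + list(accumulate(bits))
    return tables


def backward_search(pattern: str, text: str, sa: list[int]):
    """Return the 1-based interval (s, e) of SA whose suffixes start with pattern,
    or None if the pattern does not occur. The pattern must be over {a, b}."""
    rank = build_rank_tables(text, sa)
    # Number of suffixes that sort before the block starting with c (a < # < b).
    first = {"a": 0, "b": text.count("a") + 1}
    s, e = 1, len(sa)
    for c in reversed(pattern):          # i = m, m-1, ..., 1
        s = first[c] + rank[c][s - 1] + 1
        e = first[c] + rank[c][e]
        if s > e:
            return None
    return s, e


interval = backward_search("aba", TEXT, SA)
if interval is not None:
    s, e = interval
    print("interval:", interval)
    print("cardinality:", e - s + 1)
    print("positions:", sorted(SA[s - 1:e]))
```

The cardinality query needs only the interval length e - s + 1; reading SA[s..e] answers the listing query, which in this sketch requires the explicit suffix array.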

  13. More Algorithms and Tradeoffs • Answering listing queries • Speeding up the reporting of Occurrences of Long Patterns • Self-indexing • Time-space tradeoff: multi-level structure

  14. Putting it all together Three index structures:

  15. Conclusion • Summary • A theorem that characterizes a permutation as the suffix array of a binary string • An efficient algorithm checking whether a permutation is a suffix array • Three space efficient text indexing methods

  16. Conclusions (Continued) • Related subsequent work • Generalization to larger alphabets • Open problem • An O(n)-bit text index supporting searching in O(m+occ) time.

  17. Thank You.
