Suffix Array: Data structures and applications
Zhifeng Liu, School of Computer Science, Univ. of Waterloo
Dec. 06, 2004
Outline
• Introduction
• Suffix array and enhanced suffix array
• An example: is P a substring of S?
• Conclusions
• References
Introduction: Why suffix array?
• Suffix tree’s drawbacks:
  • Space consumption: 20n bytes (n=|S|, string length) [Kur99]
  • Memory locality: poor locality of reference costs efficiency
• Suffix array (PAT array):
  • Manber & Myers [Man93]
  • also Gonnet & Baeza-Yates [Gon92]
Suffix array: Definition & an example
• Informal definition:
  • Same information as a suffix tree, but more compact.
  • The suffixes in alphabetical order
• Example: the suffix array for banana# is:
  #
  a#
  ana#
  anana#
  banana#
  na#
  nana#
From Prof. Brown’s Assignment 2 handout
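The banana# listing can be reproduced with a few lines of Python. This is a minimal sketch, not the construction used in the cited papers: it sorts whole suffixes directly, which is simple but far from optimal. The function name `naive_suffix_array` is made up for illustration, and the sketch relies on `#` sorting before the letters in ASCII.

```python
def naive_suffix_array(s):
    """Return the start positions of all suffixes of s, in sorted order.

    Illustration only: sorting whole suffixes is much slower than the
    dedicated construction algorithms.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "banana#"
sa = naive_suffix_array(s)
for pos in sa:
    # Each row: start position in s, then the suffix itself.
    print(pos, s[pos:])
```

The array stores only the start positions (one integer per suffix); the suffixes themselves are read from S when needed.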
Suffix array isn’t perfect either
• Less space: 4n bytes, but
  • direct construction time: O(n log n)
  • linear construction time via a suffix tree, but this sacrifices space
  • binary search for a substring P takes O(m log n) (m=|P|)
• Hence the enhanced suffix array!
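The O(m log n) bound comes from doing O(log n) binary-search steps, each comparing at most m characters. A minimal sketch of that search, assuming a naively built suffix array (the helper names `naive_suffix_array` and `contains` are hypothetical):

```python
def naive_suffix_array(s):
    # Illustration only: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])


def contains(s, sa, p):
    """Return True if p occurs in s, using binary search over sa.

    Every step compares p with the first len(p) characters of one suffix,
    so the search costs O(m log n) character comparisons.
    """
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    # lo is now the first suffix whose length-m prefix is >= p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + m] == p


s = "banana#"
sa = naive_suffix_array(s)
print(contains(s, sa, "nan"), contains(s, sa, "nab"))
```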
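Enhanced Suffix Array = suffix array + additional tables
Fig1 The enhanced suffix array for S=acaaacatat$ and its lcp-interval tree. Adapted from [Abo04]
lcp-intervals shown in the figure (lcp value, then interval): 0-[0..10] at the root, with children 1-[0..5], 2-[6..7] and 1-[8..9]; 1-[0..5] in turn has children 2-[0..1], 3-[2..3] and 2-[4..5]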
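The two basic tables behind Fig1 can be recomputed for S=acaaacatat$ with a short Python sketch. Two assumptions are made here that the slide does not state: the terminator $ is treated as larger than every letter (this ordering is what makes the intervals in the figure come out), and both tables are built naively rather than with a linear-time method. The name `build_tables` is hypothetical.

```python
def build_tables(s):
    """Return (suftab, lcptab) for s, which must end in a unique '$'.

    Assumption: '$' sorts after all letters, so it is mapped to '~'
    in the sort key. lcptab[k] is the length of the longest common
    prefix of the suffixes at suftab[k-1] and suftab[k]; lcptab[0] = 0.
    """
    n = len(s)
    suftab = sorted(range(n), key=lambda i: s[i:].replace("$", "~"))
    lcptab = [0] * n
    for k in range(1, n):
        a, b = s[suftab[k - 1]:], s[suftab[k]:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcptab[k] = h
    return suftab, lcptab


S = "acaaacatat$"
suftab, lcptab = build_tables(S)
for k in range(len(S)):
    print(k, suftab[k], lcptab[k], S[suftab[k]:])
```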
Enhanced Suffix Array (2)
[Figure] Fig2 The lcp-interval tree vs suffix tree for S=acaaacatat$
Enhanced Suffix Array (3): the more tables, the more like a suffix tree?
• ChildTab: up, down and next fields record the parent-child and sibling relationships.
• The lcp-interval tree is like a suffix tree. It is only virtual, but it can simulate suffix tree traversal efficiently.
Fig3 ChildTab records the linked relationships in the lcp-interval tree
Enhanced suffix array replaces suffix tree
• Every algorithm that uses a suffix tree can be systematically replaced by one that uses an (enhanced) suffix array, with the same time complexity
• Bottom-up traversal of suffix tree -> suffix array with lcptab and lcp-interval tree
• Top-down traversal of suffix tree -> suffix array with childtab
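The bottom-up case can be illustrated by enumerating the lcp-intervals with one left-to-right pass over lcptab and a stack. This is a sketch in the spirit of the stack-based traversal described in [Abo04], not a transcription of it; the function names are hypothetical, the tables are built naively, and $ is assumed to sort after all letters.

```python
def build_tables(s):
    # Naive construction, assuming '$' sorts after every letter.
    n = len(s)
    suftab = sorted(range(n), key=lambda i: s[i:].replace("$", "~"))
    lcptab = [0] * n
    for k in range(1, n):
        a, b = s[suftab[k - 1]:], s[suftab[k]:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcptab[k] = h
    return suftab, lcptab


def lcp_intervals(lcptab):
    """Yield (lcp, left, right) for every lcp-interval, children first."""
    n = len(lcptab)
    stack = [(0, 0)]  # open intervals as (lcp value, left boundary)
    for k in range(1, n + 1):
        cur = lcptab[k] if k < n else -1  # -1 closes everything at the end
        left = k - 1
        while stack and cur < stack[-1][0]:
            lcp, left = stack.pop()
            yield (lcp, left, k - 1)
        if not stack or cur > stack[-1][0]:
            stack.append((cur, left))


suftab, lcptab = build_tables("acaaacatat$")
intervals = list(lcp_intervals(lcptab))
for lcp, i, j in intervals:
    print(f"{lcp}-[{i}..{j}]")
```

Each interval is reported only after all intervals nested inside it, which is the order a bottom-up suffix tree traversal would visit the corresponding internal nodes.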
Answer Decision Queries
Algorithm: Answering decision queries
c := 0
queryFound := true
(i, j) := getInterval(0, n, P[c])
while (i, j) <> ⊥ and c < m and queryFound = true
    if i <> j then
        l := getlcp(i, j)
        min := min{l, m}
        queryFound := S[suftab[i]+c..suftab[i]+min−1] = P[c..min−1]
        c := min
        if c < m then (i, j) := getInterval(i, j, P[c])
    else
        queryFound := S[suftab[i]+c..suftab[i]+m−1] = P[c..m−1]
        c := m
if queryFound and (i, j) <> ⊥ then
    report [i, j] as the interval of occurrences of P
else
    print(P is not found in S)
Answer Decision Queries (cont’d)
[Figure] Example queries P=cb and P=caaa; figure label: longest common string
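The decision-query loop can be tried out on the two example patterns with a Python sketch. Several things here are assumptions rather than the slides' method: S is taken to be acaaacatat$ from the earlier figures, $ is assumed to sort after all letters, and getInterval and getlcp are simulated by scanning lcptab instead of using childtab, so this version does not reach the time bound of the real enhanced suffix array. All function names are hypothetical.

```python
def build_tables(s):
    # Naive construction, assuming '$' sorts after every letter.
    n = len(s)
    suftab = sorted(range(n), key=lambda i: s[i:].replace("$", "~"))
    lcptab = [0] * n
    for k in range(1, n):
        a, b = s[suftab[k - 1]:], s[suftab[k]:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcptab[k] = h
    return suftab, lcptab


def get_interval(s, suftab, lcptab, i, j, depth, ch):
    """Child interval of [i..j] (lcp value = depth) whose suffixes have
    ch at offset depth, or None. Linear scan stands in for childtab."""
    start = i
    for k in range(i + 1, j + 2):
        if k == j + 1 or lcptab[k] == depth:  # child boundary
            pos = suftab[start] + depth
            if pos < len(s) and s[pos] == ch:
                return (start, k - 1)
            start = k
    return None


def find(s, p, suftab, lcptab):
    """Return the suffix-array interval of all occurrences of p, or None."""
    m = len(p)
    if m == 0:
        return (0, len(s) - 1)
    c = 0
    iv = get_interval(s, suftab, lcptab, 0, len(s) - 1, 0, p[0])
    while iv is not None and c < m:
        i, j = iv
        start = suftab[i]
        if i != j:
            l = min(lcptab[i + 1:j + 1])  # getlcp(i, j) by scanning
            mn = min(l, m)
            if s[start + c:start + mn] != p[c:mn]:
                return None
            c = mn
            if c < m:
                iv = get_interval(s, suftab, lcptab, i, j, c, p[c])
        else:
            if s[start + c:start + m] != p[c:m]:
                return None
            c = m
    return iv


S = "acaaacatat$"
suftab, lcptab = build_tables(S)
for P in ("cb", "caaa"):
    print(P, find(S, P, suftab, lcptab))
```

For P=caaa the result is a single-suffix interval whose suftab entry gives the start position of the match in S; for P=cb the comparison against the common prefix "ca" fails and the query is rejected.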
Additional tables eat too much space?
There are tricks to reduce space requirements.
• If the string length n=|S| < 2^32, each integer index needs 4 bytes.
• suftab needs 4n bytes; does lcptab also need 4n?
• No! Usually only a few entries in lcptab are larger than 255. So:
  • Store each lcptab entry in 1 byte and allocate a separate table for the long lcp values
  • Space is saved and practical speed is preserved, though the worst-case time complexity may be affected
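One way the 1-byte trick could look in Python is sketched below. The layout is an assumption: the value 255 is used as a marker in the byte table, and the real values are kept in a side table sorted by index and found by binary search. The class name `CompactLcp` and its fields are hypothetical.

```python
from bisect import bisect_left


class CompactLcp:
    """lcptab stored as one byte per entry plus an overflow table.

    Assumed layout: entries below 255 are stored directly; larger ones
    are stored as the marker 255, with the true value in big_val at the
    position of the index in the sorted list big_idx.
    """

    def __init__(self, lcptab):
        self.small = bytearray(min(v, 255) for v in lcptab)
        self.big_idx = [k for k, v in enumerate(lcptab) if v >= 255]
        self.big_val = [lcptab[k] for k in self.big_idx]

    def __getitem__(self, k):
        v = self.small[k]
        if v < 255:
            return v  # common case: one byte read
        # Rare case: binary search in the overflow table.
        return self.big_val[bisect_left(self.big_idx, k)]


lcp = CompactLcp([0, 3, 255, 1000, 254])
print([lcp[k] for k in range(5)])
```

The binary search in the overflow table is where the worst-case lookup cost grows: a text with many long repeats would send most lookups down the slow path.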
Conclusions
• Suffix array: there is always a tension between space and speed. Research tries to relieve this tension;
• Suffix arrays can replace suffix trees;
• Suffix arrays are practical: faster and easier to implement
References
• [Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1 (March 2004), p.54-86
• [Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 17-21, 2002, p.449-463
• [Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, September 11-13, 2002, p.31-43
• [Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT Trees and PAT arrays, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
• [Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience 29 (13) (1999), p.1149-1171
• [Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, p.935-948, Oct. 1993