Suffix tree and suffix array techniques for pattern analysis in strings
Esko Ukkonen, Univ Helsinki
Erice School, 30 Oct 2005
Modified by Alon Itai, 2006
Pattern finding & synthesis problems
• T = t_1 t_2 … t_n and P = p_1 p_2 … p_m are strings of symbols over a finite alphabet
• Indexing problem: preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast
  - static text, any given pattern P
• Pattern synthesis problem: learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …
Suffix tree • Suffix array • Some applications • Finding motifs
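Suffix array: example
Text: hattivatti (positions 1 to 10; position 11 is the empty suffix ε)
Suffixes in text order (positions 1 to 11): hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε
Suffixes in lexicographic order, with start positions:
  start  suffix
  11     ε
  7      atti
  2      attivatti
  1      hattivatti
  10     i
  5      ivatti
  9      ti
  4      tivatti
  8      tti
  3      ttivatti
  6      vatti
• suffix array = lexicographic order of the suffixes = (11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6)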
Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T| (note: why 5?)
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
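The definition above can be sketched directly by sorting the suffixes. This is a naive illustration, not one of the constructions discussed later: the function name `naive_suffix_array` is hypothetical, positions are 0-based (the slides use 1-based positions), and the empty suffix ε is left out.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Returns the 0-based start positions of the non-empty suffixes of `text`
// in lexicographic order. Naive sketch: a comparison sort over suffixes,
// so the worst case is far from linear time.
std::vector<int> naive_suffix_array(const std::string& text) {
    std::vector<int> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&text](int a, int b) {
        // Compare suffix text[a..] with suffix text[b..] without copying.
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}
```

For hattivatti this yields the slide's array shifted by one (7 2 1 10 5 … becomes 6 1 0 9 4 …), without the leading entry for ε.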
Pattern search from suffix array
• Example: binary search for the pattern att over the sorted suffixes of hattivatti (ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti; suffix array 11 7 2 1 10 5 9 4 8 3 6)
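[Figure: three binary search steps comparing the pattern pat with the suffix at the midpoint m of the current interval [l, u]; depending on the outcome, either u = m or l = m.]
• The search time is O(m log n), where m = length of the search string and n = length of the text (and size of the suffix array).
• With additional LCP (longest common prefix) information, the time is O(m + log n).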
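A minimal sketch of the O(m log n) search, assuming a 0-based suffix array without the empty suffix. The name `find_occurrences` is hypothetical, and `std::partition_point` stands in for the hand-written l/m/u loop on the slide; the LCP-accelerated O(m + log n) variant is not shown.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the sorted 0-based start positions of `pattern` in `text`,
// given the suffix array `sa` of `text`.
std::vector<int> find_occurrences(const std::string& text,
                                  const std::vector<int>& sa,
                                  const std::string& pattern) {
    const std::size_t m = pattern.size();
    // First suffix whose length-m prefix is >= pattern.
    auto lo = std::partition_point(sa.begin(), sa.end(), [&](int i) {
        return text.compare(i, m, pattern) < 0;
    });
    // First suffix after lo whose length-m prefix is > pattern.
    auto hi = std::partition_point(lo, sa.end(), [&](int i) {
        return text.compare(i, m, pattern) <= 0;
    });
    std::vector<int> hits(lo, hi);  // all suffixes that start with pattern
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

Each of the two binary searches makes O(log n) comparisons of at most m characters, which gives the stated bound.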
Recent suffix array constructions
• Manber & Myers (1990): O(|T| log |T|)
• linear time via suffix tree
• January / June 2003: direct linear time construction of suffix array
  - Kim, Sim, Park, Park (CPM03)
  - Kärkkäinen & Sanders (ICALP03)
  - Ko & Aluru (CPM03)
Kärkkäinen-Sanders algorithm
• Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
• Construct the suffix array of the remaining suffixes using the result of the first step.
• Merge the two suffix arrays into one.
Notation
• string T = T[0,n) = t_0 t_1 … t_{n-1}
• suffix S_i = T[i,n) = t_i t_{i+1} … t_{n-1}
• for C ⊆ [0,n]: S_C = {S_i | i ∈ C}
• suffix array SA[0,n] of T is a permutation of [0,n] satisfying S_{SA[0]} < S_{SA[1]} < … < S_{SA[n]}
Running example
  i:     0  1  2  3  4  5  6  7  8  9 10 11 12 13
  T[i]:  y  a  b  b  a  d  a  b  b  a  d  o  0  0
• T[0,n) = yabbadabbado, n = 12, followed by two padding symbols 0
• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
Step 0: Construct a sample
• for k = 0,1,2: B_k = {i ∈ [0,n] | i mod 3 = k}
• C = B_1 ∪ B_2: sample positions
• S_C: sample suffixes
• Example: B_1 = {1,4,7,10}, B_2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
Step 1: Sort sample suffixes
• for k = 1,2, construct R_k = [t_k t_{k+1} t_{k+2}] [t_{k+3} t_{k+4} t_{k+5}] … [t_{max B_k} t_{max B_k + 1} t_{max B_k + 2}]
• R = R_1 ∘ R_2 (concatenation of R_1 and R_2)
• Suffixes of R correspond to S_C: the suffix [t_i t_{i+1} t_{i+2}] … corresponds to S_i
• The correspondence is order preserving: let R'_i and R'_j be the suffixes of R corresponding to S_i and S_j. Then R'_i < R'_j iff S_i < S_j
Sort the suffixes of R
• Radix sort the characters (triples) of R and rename them with their ranks to obtain R´.
• Example:
  R = R_1 ∘ R_2 = [abb][ada][bba][do0] [bba][dab][bad][o00]
  rank:    1     2     3     4     5     6     7
  triple:  [abb] [ada] [bad] [bba] [dab] [do0] [o00]
  R´ = (1,2,4,6,4,5,3,7)
• If all characters of R´ are different, their order directly gives the order of the suffixes. Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders.
• Note: |R´| = 2n/3.
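Steps 0 and 1 up to the renaming can be sketched as follows. This is an illustration under assumptions, not the authors' code: `rename_sample_triples` is a hypothetical name, an ordered `std::map` replaces the radix sort (so this step is not linear time), and the text is assumed not to contain the byte 0, which serves as padding.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Builds R = R_1 . R_2 from the triples at positions i mod 3 = 1 and
// i mod 3 = 2 (i in [0, n]) and returns R', i.e. R with every triple
// replaced by its rank among the distinct triples (ranks start at 1).
std::vector<int> rename_sample_triples(const std::string& t) {
    using Triple = std::array<unsigned char, 3>;
    const std::size_t n = t.size();
    auto triple_at = [&](std::size_t i) {
        Triple tr{};  // zero-initialised: positions past the end stay 0
        for (std::size_t d = 0; d < 3 && i + d < n; ++d)
            tr[d] = static_cast<unsigned char>(t[i + d]);
        return tr;
    };
    std::vector<std::size_t> sample;  // B_1 followed by B_2
    for (std::size_t k = 1; k <= 2; ++k)
        for (std::size_t i = k; i <= n; i += 3) sample.push_back(i);

    std::map<Triple, int> name;  // ordered map: stand-in for radix sort
    for (std::size_t i : sample) name[triple_at(i)] = 0;
    int next = 0;
    for (auto& entry : name) entry.second = ++next;  // ranks in sorted order

    std::vector<int> renamed;
    for (std::size_t i : sample) renamed.push_back(name[triple_at(i)]);
    return renamed;
}
```

On the running example yabbadabbado this reproduces R´ = (1,2,4,6,4,5,3,7).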
Step 1 (cont.)
• Once the sample suffixes are sorted, assign a rank to each: rank(S_i) = the rank of S_i in S_C; rank(S_{n+1}) = rank(S_{n+2}) = 0
• Example: R´ = (1,2,4,6,4,5,3,7). Suffixes of R´ in sorted order:
  0: ε
  1: 12464537
  2: 2464537
  3: 37
  4: 4537
  5: 464537
  6: 537
  7: 64537
  8: 7
• SA_R´ = (8,0,1,6,4,2,5,3,7) (the suffix array of R´)
• SA_R´^-1 = (1,2,5,7,4,6,3,8)
• rank(S_i) for i = 0, …, 14:
  i:     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
  rank:  –  1  4  –  2  6  –  5  3  –  7  8  –  0  0
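The mapping from sorted suffixes of R´ back to rank(S_i) can be sketched like this. The function name `sample_ranks` is hypothetical, and a naive comparison sort of the suffixes of R´ stands in for the recursive call; undefined ranks (shown as – on the slide) are stored as 0.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Given R' and the text length n, returns rank[0 .. n+2] where rank[i] is
// the rank of sample suffix S_i (1-based) and 0 for all other positions.
std::vector<int> sample_ranks(const std::vector<int>& r, int n) {
    std::vector<int> order(r.size());
    std::iota(order.begin(), order.end(), 0);
    // Naive stand-in for the recursion: sort the suffixes of R' directly.
    std::sort(order.begin(), order.end(), [&r](int a, int b) {
        return std::lexicographical_compare(r.begin() + a, r.end(),
                                            r.begin() + b, r.end());
    });
    const int n1 = (n + 2) / 3;  // |B_1|: positions i in [0, n] with i mod 3 = 1
    std::vector<int> rank(n + 3, 0);
    for (int k = 0; k < static_cast<int>(order.size()); ++k) {
        const int j = order[k];  // position in R'
        // Map the position in R' back to a text position in B_1 or B_2.
        const int i = j < n1 ? 1 + 3 * j : 2 + 3 * (j - n1);
        rank[i] = k + 1;
    }
    return rank;
}
```

With R´ = (1,2,4,6,4,5,3,7) and n = 12 the result matches the rank row above, with 0 in place of –.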
Step 2: Sort nonsample suffixes
• for each nonsample S_i ∈ S_{B_0} (note that rank(S_{i+1}) is always defined for i ∈ B_0): S_i ≤ S_j ↔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
• radix sort the pairs (t_i, rank(S_{i+1}))
• Example: S_12 < S_6 < S_9 < S_3 < S_0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
(Note: needs more detail.)
Example: S_12 < S_6 < S_9 < S_3 < S_0 because
  S_0  = yabbadabbado = y S_1  → (y,1)
  S_3  = badabbado    = b S_4  → (b,2)
  S_6  = abbado       = a S_7  → (a,5)
  S_9  = ado          = a S_10 → (a,7)
  S_12 = ε            = 0 ε    → (0,0)
and (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
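A sketch of Step 2, assuming the rank array has length n + 3 with 0 for undefined entries (as in the slide's table, with 0 for –). The name `sort_nonsample` is hypothetical and `std::sort` on tuples replaces the radix sort, so the sketch is not linear time.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Sorts the nonsample suffixes S_i, i in B_0 (including i = n), by the
// pair (t_i, rank(S_{i+1})) and returns their positions in sorted order.
std::vector<std::size_t> sort_nonsample(const std::string& t,
                                        const std::vector<int>& rank) {
    const std::size_t n = t.size();
    std::vector<std::tuple<unsigned char, int, std::size_t>> keyed;
    for (std::size_t i = 0; i <= n; i += 3) {
        // Past the end of the text the character is the padding symbol 0.
        const unsigned char c = i < n ? static_cast<unsigned char>(t[i]) : 0;
        keyed.emplace_back(c, rank[i + 1], i);
    }
    std::sort(keyed.begin(), keyed.end());  // stand-in for radix sort
    std::vector<std::size_t> order;
    for (const auto& k : keyed) order.push_back(std::get<2>(k));
    return order;
}
```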
Step 3: Merge
• merge the two sorted sets of suffixes using a standard comparison-based merging
• to compare S_i ∈ S_C with S_j ∈ S_{B_0}, distinguish two cases:
  - i ∈ B_1: S_i ≤ S_j ↔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
  - i ∈ B_2: S_i ≤ S_j ↔ (t_i, t_{i+1}, rank(S_{i+2})) ≤ (t_j, t_{j+1}, rank(S_{j+2}))
• note that the ranks are defined in all cases!
• Example: S_1 < S_6 as (a,4) < (a,5) (case B_1), and S_3 < S_8 as (b,a,6) < (b,a,7) (case B_2)
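The merge can be sketched as below. Everything named here is an assumption for illustration: `merge_suffixes` takes the sorted sample and nonsample position lists and the rank array of length n + 3 (0 for undefined entries), and the text must not contain the byte 0.

```cpp
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Merges sorted sample suffixes (positions in B_1 and B_2) with sorted
// nonsample suffixes (positions in B_0) into the full suffix array.
std::vector<std::size_t> merge_suffixes(const std::string& t,
                                        const std::vector<int>& rank,
                                        const std::vector<std::size_t>& sample,
                                        const std::vector<std::size_t>& nonsample) {
    const std::size_t n = t.size();
    auto ch = [&](std::size_t i) -> int {  // character, or padding 0 past the end
        return i < n ? static_cast<unsigned char>(t[i]) : 0;
    };
    // True if sample suffix S_i sorts before nonsample suffix S_j.
    auto sample_first = [&](std::size_t i, std::size_t j) -> bool {
        if (i % 3 == 1)  // i in B_1: one character, then a rank
            return std::make_tuple(ch(i), rank[i + 1]) <
                   std::make_tuple(ch(j), rank[j + 1]);
        // i in B_2: two characters, then a rank
        return std::make_tuple(ch(i), ch(i + 1), rank[i + 2]) <
               std::make_tuple(ch(j), ch(j + 1), rank[j + 2]);
    };
    std::vector<std::size_t> sa;
    std::size_t a = 0, b = 0;
    while (a < sample.size() && b < nonsample.size())
        sa.push_back(sample_first(sample[a], nonsample[b]) ? sample[a++]
                                                           : nonsample[b++]);
    while (a < sample.size()) sa.push_back(sample[a++]);
    while (b < nonsample.size()) sa.push_back(nonsample[b++]);
    return sa;
}
```

On the running example, merging the sample order (1,4,8,2,7,5,10,11) with the nonsample order (12,6,9,3,0) gives the suffix array SA = (12,1,6,4,9,3,8,2,7,5,10,11,0).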
Running time O(n)
• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)
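The last step follows by unrolling the recurrence into a geometric series. In the sketch below, c is an assumed constant bounding the non-recursive work per character; it does not appear on the slide.

```latex
\begin{align*}
T(n) &\le T\!\left(\tfrac{2n}{3}\right) + cn \\
     &\le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k} \\
     &= \frac{cn}{1 - 2/3} = 3cn = O(n).
\end{align*}
```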
Implementation
• about 50 lines of C++
• code available e.g. via Juha Kärkkäinen’s home page
LCP table
• Longest Common Prefix of successive elements of the suffix array:
• LCP[i] = length of the longest common prefix of suffixes S_{SA[i]} and S_{SA[i+1]}
• Naive algorithm:
  - Enter the suffixes in a trie
  - For each pair of successive suffixes, find the lca of their leaves; its depth is the LCP value
• Complexity = O(n^2)
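The definition of LCP[i] can be checked with a direct character-by-character comparison. Note that this swaps the trie of the slide for plain comparisons (still O(n^2) in the worst case); `naive_lcp` is a hypothetical name, and the suffix array is assumed to be 0-based without the empty suffix.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// lcp[i] = length of the longest common prefix of the suffixes starting
// at sa[i] and sa[i + 1]; the result has sa.size() - 1 entries.
std::vector<int> naive_lcp(const std::string& text, const std::vector<int>& sa) {
    std::vector<int> lcp;
    for (std::size_t i = 0; i + 1 < sa.size(); ++i) {
        const std::size_t a = sa[i], b = sa[i + 1];
        std::size_t h = 0;
        // Extend the match while both suffixes still have characters.
        while (a + h < text.size() && b + h < text.size() &&
               text[a + h] == text[b + h])
            ++h;
        lcp.push_back(static_cast<int>(h));
    }
    return lcp;
}
```

A table computed this way is a convenient cross-check for a linear-time implementation on small inputs.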
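Kasai et al., CPM 2001
Key observation: Let LCP[q] = h > 1 with SA[q] = i and SA[q+1] = k, i.e.,
  S_{SA[q]}   = t_i t_{i+1} … t_{i+h-1} t_{i+h} …
  S_{SA[q+1]} = t_k t_{k+1} … t_{k+h-1} t_{k+h} … = t_i t_{i+1} … t_{i+h-1} t_{k+h} …   (t_{k+h} ≠ t_{i+h})
• Then t_{i+1} … t_{i+h-1} = t_{k+1} … t_{k+h-1}.
• Define p by SA[p] = i+1, so S_{SA[p]} = t_{i+1} … t_{i+h-1} …; since S_{i+1} < S_{k+1} and both start with t_{i+1} … t_{i+h-1}, also S_{SA[p+1]} = t_{i+1} … t_{i+h-1} …
• i.e., LCP[p] ≥ h-1
• When computing LCP[p] we can skip the first h-1 characters and start the comparisons at offset h-1 of the two suffixes.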
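The algorithm

    /* T has length n; SA[0..n-1] lists the suffixes in sorted order;
       LCP[i] compares S_SA[i] with its successor S_SA[i+1] */
    for (i = 0; i < n; i++)        /* compute SAinv, the inverse of SA */
        SAinv[SA[i]] = i;
    h = 0;
    for (p = 0; p < n; p++) {      /* suffixes in text order */
        if (SAinv[p] < n - 1) {    /* S_p has a successor in SA */
            r = SA[SAinv[p] + 1];
            while (r + h < n && p + h < n && T[r + h] == T[p + h])
                h++;               /* innermost statement */
            LCP[SAinv[p]] = h;
            if (h > 0) h--;
        }
    }

Complexity: h is decreased at most n times and h ≤ n, so h can be increased at most 2n times; i.e., the innermost statement is executed ≤ 2n times. Total time = O(n).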
Suffix tree vs suffix array
• suffix tree ⇔ suffix array + LCP table
• First step: [figure: tree with the single leaf S_{SA[0]}]
• Step i: [figure: leaves S_{SA[0]}, …, S_{SA[i-1]}, with the new leaf S_{SA[i]} inserted to the right of S_{SA[i-1]}]
• Which edge to split?
• Complexity: the final trie has ≤ 2n vertices. Each edge is traversed ≤ twice. Time = O(n).
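One common way to answer "which edge to split?" is to keep the rightmost root-to-leaf path on a stack and climb it until the string depth drops to LCP[i-1]. The sketch below is an assumption-laden illustration, not the slides' code: `Node` and `tree_from_sa_lcp` are hypothetical names, the suffix array is 0-based without the empty suffix, and each leaf is given one extra end marker so that no suffix is a prefix of another.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int parent;  // index of the parent node, -1 for the root
    int depth;   // string depth: length of the path label from the root
    int suffix;  // start position for a leaf, -1 for an internal node
};

// Builds the suffix tree shape from the suffix array and the LCP table
// (lcp[i] = LCP of suffixes sa[i] and sa[i + 1]); n is the text length.
std::vector<Node> tree_from_sa_lcp(const std::vector<int>& sa,
                                   const std::vector<int>& lcp, int n) {
    std::vector<Node> nodes{{-1, 0, -1}};  // node 0 is the root
    std::vector<int> path{0};              // rightmost root-to-leaf path
    for (std::size_t i = 0; i < sa.size(); ++i) {
        const int l = i == 0 ? 0 : lcp[i - 1];
        int last = -1;
        while (nodes[path.back()].depth > l) {  // climb from the previous leaf
            last = path.back();
            path.pop_back();
        }
        if (nodes[path.back()].depth < l) {  // split the edge above `last`
            nodes.push_back({path.back(), l, -1});
            nodes[last].parent = static_cast<int>(nodes.size()) - 1;
            path.push_back(nodes[last].parent);
        }
        // New leaf; its depth counts the suffix plus one end marker.
        nodes.push_back({path.back(), n - sa[i] + 1, sa[i]});
        path.push_back(static_cast<int>(nodes.size()) - 1);
    }
    return nodes;
}
```

Every node is pushed and popped at most once, which mirrors the "each edge is traversed ≤ twice" argument. For hattivatti the sketch produces 16 nodes (root, 5 internal nodes, 10 leaves), within the 2n bound.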