780 likes | 910 Views
Suffix tree and suffix array techniques for pattern analysis in strings. Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005.
E N D
Suffix tree and suffix array techniques for pattern analysis in strings Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgtttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
Algorithms for combinatorial string matching? • deep beauty? +- • shallow beauty? + • applications? ++ • intensive algorithmic miniatures • sources of new problems: text processing, DNA, music,…
Analysis of a string of symbols • T = h a t t i v a t t i ’text’ • P = a t t ’pattern’ • Find the occurrences of P in T: h a t t i v a t t i • Pattern synthesis: #(t) = 4 #(atti) = 2 #(t****t) = 2
Pattern finding & synthesis problems • T = t1t2 … tn, P = p1 p 2 … pn, strings of symbols in finite alphabet • Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast • static text, any given pattern P • Pattern synthesis problem: Learn from T new patterns that occur surprisingly often • What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …
Suffix tree • Suffix array • Some applications • Finding motifs
The suffix tree Tree(T) of T • data structure suffix tree, Tree(T), is compacted trie that represents all the suffixes of string T • linear size: |Tree(T)| = O(|T|) • can be constructed in linear time O(|T|) • has myriad virtues (A. Apostolico) • is well-known: 366 000 Google hits
Suffix trie and suffix tree abaab baab aab ab b Trie(abaab) Tree(abaab) a a b baab b a a ab baab a b a a b b
Trie(T) can be large • |Trie(T)| = O(|T|2) • bad example: T = anbn • Trie(T) can be seen as a DFA: language accepted = the suffixes of T • minimize the DFA => directed cyclic word graph (’DAWG’)
Tree(T) is of linear size • only the internal branching nodes and the leaves represented explicitly • edges labeled by substrings of T • v = node(α) if the path from root to v spells α • one-to-one correspondence of leaves and suffixes • |T| leaves, hence < |T| internal nodes • |Tree(T)| = O(|T| + size(edge labels))
Tree(hattivatti) hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i vatti i t vatti i ti atti i vatti hattivatti ivatti ti vatti vatti tti vatti atti tivatti hattivatti ttivatti attivatti
Tree(hattivatti) hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i vatti i t vatti i ti atti i vatti hattivatti ivatti ti vatti vatti tti vatti atti tivatti hattivatti ttivatti attivatti substring labels of edges represented as pairs of pointers hattivatti
Tree(hattivatti) hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i vatti i 6 3,3 10 4,5 2,5 i vatti hattivatti 5 9 vatti 6,10 8 6,10 7 4 1 3 2 hattivatti
Tree(T) is full text index Tree(T) P P occurs in T at locations 8, 31, … 31 8 P occurs in T P is a prefix of some suffix of T Path for P exists in Tree(T) All occurrences of P in time O(|P| + #occ)
Find att from Tree(hattivatti) hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i vatti t vatti i ti atti i vatti hattivatti ivatti ti vatti vatti tti vatti atti tivatti 7 hattivatti ttivatti attivatti 2
Linear time construction of Tree(T) hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i McCreight (1976) Weiner (1973), ’algorithm of the year’ ’on-line’ algorithm (Ukkonen 1992)
On-line construction of Trie(T) • T = t1t2 … tn$ • Pi = t1t2 … ti i:thprefix of T • on-line idea: update Trie(Pi) to Trie(Pi+1) • => very simple construction
Trie(abaab) Trie(a) Trie(ab) Trie(aba) abaa baa aa εa ε a a b a b b b a a chain of links connects the end points of current suffixes
Trie(abaab) Trie(abaa) a a b a b a b b b b a a a a a a a
Trie(abaab) Trie(abaa) a a b a b a b b b b a a a a a a a Add next symbol = b
Trie(abaab) Trie(abaa) a a b a b a b b b b a a a a a a a Add next symbol = b From here on b-arc already exists
Trie(abaab) a a b a b a b b b b a a a a a a a Trie(abaab) a b b a a a b a a b b
What happens in Trie(Pi) => Trie(Pi+1) ? Before From here on the ai-arc exists already => stop updating here ai After ai ai ai ai ai New suffix links New nodes
What happens in Trie(Pi) => Trie(Pi+1) ? • time: O(size of Trie(T)) • suffix links: slink(node(aα)) = node(α)
On-line procedure for suffix trie • Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root • fori := 2 to ndo begin • vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf) • v := vi-1; v´ := 0 • while node v has no outgoing arc for tido begin • Create a new node v´´ and an arc son(v,ti) = v´´ • ifv´ ≠ 0thenslink(v) := v´´ • v := slink(v); v´ := v´´end • forthe node v´´ such that v´´= son(v,ti) do if v´´ = v´ then slink(v’) := root else slink(v´) := v´´
Suffix trees on-line • ’compacted version’ of the on-line trie construction: simulate the construction on the linear size tree instead of the trie => time O(|T|) • all trie nodes are conceptually still needed => implicit and real nodes
Implicit and real nodes • Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree and αis a (proper) prefix of the label of some arc from v. If αis the empty string then (v, α) is a’real’node (= v). • Let v = node(α´) inTree(T).Thenimplicit node (v, α) represents node(α´α) of Trie(T)
Implicit node … α´ v … α (v, α)
Suffix links and open arcs root α … aα slink(v) v … label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T w
Big picture suffix link path traversed: total work O(n) … … … … … new arcs and nodes created: total work O(size(Tree(T))
On-line procedure for suffix tree Input: string T = t1t2 … tn$ Output: Tree(T) Notation: son(v,α) = w iff there is an arc from v to w with label α son(v,ε) = v Function Canonize(v, α): while son(v, α´) ≠ 0 where α = α´ α´´, | α´| > 0 do v := son(v, α´); α := α´´ return (v, α)
Suffix-tree on-line: main procedure Create Tree(t1); slink(root) := root (v, α) := (root, ε) /* (v, α) is the start node */ for i := 2 to n+1 do v´ := 0 whilethere is no arc from v with label prefixαtido ifα ≠ εthen /* divide the arc w = son(v, αη) into two */ son(v, α) := v´´; son(v´´,ti) := v´´´; son(v´´,η) := w else son(v,ti) := v´´´; v´´ := v if v´ ≠ 0 then slink(v´) := v´´ v´ := v´´; v := slink(v); (v, α) := Canonize(v, α) if v´ ≠ 0 then slink(v´) := v (v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */
The actual time and space • |Tree(T)| is about 20|T| in practice • brute-force construction is O(|T|log|T|) for random strings as the average depth of internal nodes is O(log|T|) • difference between linear and brute-force constructions not necessarily large (Giegerich & Kurtz) • truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003) • alphabet independent linear time (Farach 1997)
Suffix tree • Suffix array • Some applications • Finding motifs
Suffix array: example ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti 11 7 2 1 10 5 9 4 8 3 6 hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε • suffix array = lexicographic order of the suffixes
Suffix array • suffix array SA(T) = an array giving the lexicographic order of the suffixes of T • space requirement: 5|T| • practitioners like suffix arrays (simplicity, space efficiency) • theoreticians like suffix trees (explicit structure)
Pattern search from suffix array ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti 11 7 2 1 10 5 9 4 8 3 6 hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε att binary search
Recent suffix array constructions • Manber&Myers (1990): O(|T|log|T|) • linear time via suffix tree • January / June 2003: direct linear time construction of suffix array - Kim, Sim, Park, Park (CPM03) - Kärkkäinen & Sanders (ICALP03) - Ko & Aluru (CPM03)
Kärkkäinen-Sanders algorithm • Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively. • Construct the suffix array of the remaining suffixes using the result of the first step. • Merge the two suffix arrays into one.
Notation • string T = T[0,n) = t0t1 … tn-1 • suffix Si = T[i,0) = titi+1 … tn-1 • for C \subset [0,n]: SC = {Si | i in C} • suffix array SA[0,n] of T is a permutation of [0,n] satisfying SSA[0] < SSA[1] < … < SSA[n]
Running example • T[0,n) = y a b b a d a b b a d o 0 0 … • SA = (12,1,6,4,9,3,8,2,7,5,10,11,0) 0 1 2 3 4 5 6 7 8 9 10 11
Step 0: Construct a sample • for k = 0,1,2 Bk = {i є [0,n] | i mod 3 = k} • C = B1 U B2 sample positions • SC sample suffixes • Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
Step 1: Sort sample suffixes • for k = 1,2, construct Rk = [tktk+1tk+2] [tk+3tk+4tk+5]… [tmaxBktmaxBk+1tmaxBk+2] R = R1 ^ R2 concatenation of R1 and R2 Suffixes of R correspond to SC: suffix [titi+1ti+2]… corresponds to Si ; correspondence is order preserving. Sort the suffixes of R: radix sort the characters and rename with ranks to obtain R´. If all characters different, their order directly gives the order of suffixes. Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders. Note: |R´| = 2n/3.
Step 1 (cont.) • once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0 • Example: R = [abb][ada][bba][do0][bba][dab][bad][o00] R´ = (1,2,4,6,4,5,3,7) SAR´ = (8,0,1,6,4,2,5,3,7) rank(Si) - 1 4 - 2 6 - 5 3 – 7 8 – 0 0
Step 2: Sort nonsample suffixes • for each non-sample Siє SB0 (note that rank(Si+1) is always defined for i є B0): Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1)) • radix sort the pairs (ti,rank(Si+1)). • Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
Step 3: Merge • merge the two sorted sets of suffixes using a standard comparison-based merging: • to compare Siє SC with Sjє SB0, distinguish two cases: • i є B1: Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1)) • i є B2: Si ≤ Sj ↔ (ti,ti+1,rank(Si+2)) ≤ (tj,tj+1,rank(Sj+2)) • note that the ranks are defined in all cases! • S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)
Running time O(n) • excluding the recursive call, everything can be done in linear time • the recursion is on a string of length 2n/3 • thus the time is given by recurrence T(n) = T(2n/3) + O(n) • hence T(n) = O(n)
Implementation • about 50 lines of C++ • code available e.g. via Juha Kärkkäinen’s home page
LCP table • Longest Common Prefix of successive elements of suffix array: • LCP[i] = length of the longest common prefix of suffixes SSA[i] and SSA[i+1] • build inverse array SA-1 from SA in linear time • then LCP table from SA-1 in linear time (Kasai et al, CPM2001)
Suffix tree vs suffix array • suffix tree suffix array + LCP table