On the Sorting-Complexity of Suffix Tree Construction. Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan.
Fact From the Previous Talk
[Harel and Tarjan 1984; Bender and Farach-Colton 2000] A tree T with m nodes can be preprocessed in O(m) time so that, for any pair of its nodes u, v, lca(u, v) can be computed in constant time.
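To make the constant-time query concrete, here is a minimal sketch that uses an Euler tour plus a sparse table. Note the assumption: this simpler variant needs O(m log m) preprocessing, not the O(m) of the cited results. The function name `build_lca` and the `children` adjacency format are illustrative choices, not taken from the talk.

```python
def build_lca(children, root=0):
    """Preprocess a rooted tree (children[v] = list of child ids); return an O(1) lca(u, v)."""
    euler, depth, first = [root], [0], {root: 0}
    stack = [(root, 0, iter(children[root]))]
    while stack:                                  # iterative Euler tour
        v, d, it = stack[-1]
        child = next(it, None)
        if child is None:
            stack.pop()
            if stack:                             # back at the parent: record it again
                euler.append(stack[-1][0])
                depth.append(stack[-1][1])
        else:
            first[child] = len(euler)
            euler.append(child)
            depth.append(d + 1)
            stack.append((child, d + 1, iter(children[child])))

    # table[k][i] = tour index of the shallowest node in positions [i, i + 2^k)
    table = [list(range(len(euler)))]
    k = 1
    while (1 << k) <= len(euler):
        prev, half = table[-1], 1 << (k - 1)
        table.append([
            prev[i] if depth[prev[i]] <= depth[prev[i + half]] else prev[i + half]
            for i in range(len(euler) - (1 << k) + 1)
        ])
        k += 1

    def lca(u, v):
        lo, hi = sorted((first[u], first[v]))
        k = (hi - lo + 1).bit_length() - 1        # two overlapping blocks of size 2^k
        a, b = table[k][lo], table[k][hi - (1 << k) + 1]
        return euler[a if depth[a] <= depth[b] else b]

    return lca
```

Each query touches only two table entries, which is the constant-time behaviour the later slides rely on.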
What's in This Paper
• Bounds depend on the alphabet
• Constant-size alphabet: O(n) (Weiner 1973)
• Unbounded alphabet: Θ(n log n)
• Alphabet {1…n}: linear time
• RAM algorithm
• DAM algorithm (I/O optimal)
• The algorithm also works for PRAM and PDAM
Talk Outline • Suffix trees • Reminder • Tools • RAM algorithm for suffix tree construction • Conclusion
Suffix Trees
Running example: S = 121112212221$, n = 13; for instance, S[8,13] = 12221$.
[Figure: the suffix tree of S, with leaves labeled by suffix start positions]
Suffix Tree Representation
[Figure: representation of the suffix tree of the running example S]
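Properties of Suffix Trees
lcp(σ(v), σ(w)) = |σ(lca(v, w))|, where σ(v) is the string spelled on the path from the root to v and L(v) = |σ(v)|.
[Figure: the suffix tree of S with two nodes v, w and lca(v, w) marked; node annotations σ = 1, L = 1; σ = 11, L = 2; σ = 12, L = 2]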
Suffix Links
Lemma [Weiner 1973]: Let a ∈ Σ and α ∈ Σ*. If there is a node v in T_S such that σ(v) = aα, then there is a node w in T_S such that σ(w) = α. Define the suffix link as sl(v) = w.
Suffix Links
[Figure: the suffix tree of S with node annotations σ = 1, L = 1; σ = 2, L = 1; σ = 12, L = 2; σ = 122, L = 3]
Suffix Links Example
[Figure: the suffix tree of S with suffix links drawn between internal nodes]
Suffix Arrays
• Let Δ = {S_i | S_i ∈ Σ*, |S_i| = n_i}
• T = compacted trie of Δ
• An in-order traversal of the leaves gives the strings in lexicographical order: S_{p_1}, …, S_{p_|Δ|}
• Sort array: A_T[i] = p_i
• Longest common prefix array: LCP_T[i] = lcp(S_{p_i}, S_{p_{i+1}})
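The two arrays can be checked on the running example with a deliberately naive sketch. It sorts the suffixes directly, so it is quadratic or worse and only meant to illustrate the definitions. It assumes that `$` sorts before the digits (as it does in ASCII), which may differ from the convention in the figures; the function names are illustrative.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def sort_and_lcp_arrays(s):
    """Return (A, LCP): 1-based suffix starts in sorted order, and lcps of neighbours."""
    starts = sorted(range(len(s)), key=lambda i: s[i:])
    lcps = [lcp(s[starts[k]:], s[starts[k + 1]:]) for k in range(len(starts) - 1)]
    return [i + 1 for i in starts], lcps

A, LCP = sort_and_lcp_arrays("121112212221$")
print(A)    # [13, 12, 3, 4, 1, 5, 8, 11, 2, 7, 10, 6, 9]
print(LCP)  # [0, 1, 2, 1, 2, 3, 0, 2, 2, 1, 3, 2]
```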
Suffix Array Example
[Figure: the suffix tree of the running example S with its arrays A_T and LCP_T]
RAM Algorithm
Input: string S
Output: T_S
Divide and Conquer:
• Recursively compute T_o, the compacted trie of the suffixes beginning at odd positions
• Compute T_e, the compacted trie of the suffixes beginning at even positions, from T_o
• Merge T_e and T_o to get T_S
Divide and Conquer Scheme
[Figure: recursion tree. Divide: A(n) splits into A(n/2), A(n/2), then into four A(n/4). Conquer: solutions S(n/2), S(n/2). Merge: S(n)]
RAM Algorithm Scheme (steps 1 to 6)
• Input: |S| = n, Σ = [n]
• Divide: |S'| = n/2, Σ' = [n/2]; T_S' (n/2)
• Conquer: A_{T_S'} (n/2), LCP_{T_S'} (n/2); A_{T_o} (n/2), LCP_{T_o} (n/2); A_{T_e} (n/2), LCP_{T_e} (n/2)
• Merge: A_T (n/2), LCP_T (n/2); T_S (n)
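The linear bound for the alphabet {1…n} follows from a one-line recurrence, sketched here under the assumption that every non-recursive step of the scheme (compressing, decompressing, building the even arrays, merging, switching representations) costs O(n), with c a constant introduced only for this derivation:

```latex
T(n) = T\!\left(\tfrac{n}{2}\right) + O(n), \qquad T(1) = O(1)
\;\Longrightarrow\;
T(n) \le c\left(n + \tfrac{n}{2} + \tfrac{n}{4} + \cdots\right) \le 2cn = O(n).
```

The key point is that there is a single recursive call on a string of half the length.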
Switching Representations
[Figure: the RAM algorithm scheme, highlighting the steps that convert between the tree and the array representation]
Suffix Tree → Suffix Array
[Figure: the suffix tree of the running example and the arrays A_T and LCP_T read off from it]
Suffix Array → Suffix Tree
[Figure: the arrays A_T and LCP_T and the suffix tree rebuilt from them]
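Both directions can be sketched in a few lines. The construction below keeps the rightmost path of the tree on a stack and inserts the suffixes in sorted order; it is one standard way to do the conversion, not necessarily the one used in the talk. It assumes S ends with a unique terminator, so no suffix is a prefix of another. The `Node` class and function names are illustrative.

```python
class Node:
    def __init__(self, depth, leaf=None):
        self.depth = depth      # L(v): length of the string spelled from the root
        self.leaf = leaf        # suffix start position for a leaf, else None
        self.children = []

def tree_from_arrays(A, LCP, n):
    """Array -> tree. A: 1-based suffix starts in sorted order; n = |S|."""
    root = Node(0)
    stack = [root]                              # rightmost path, depths increasing
    for k, start in enumerate(A):
        h = LCP[k - 1] if k > 0 else 0          # depth at which the new leaf branches off
        last = None
        while stack[-1].depth > h:
            last = stack.pop()
        if stack[-1].depth < h:                 # split: new internal node of depth h
            mid = Node(h)
            mid.children.append(last)
            stack[-1].children[-1] = mid        # mid takes the place of `last`
            stack.append(mid)
        leaf = Node(n - start + 1, start)
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root

def leaves(node):
    """Tree -> array: an in-order traversal of the leaves yields A_T."""
    if node.leaf is not None:
        return [node.leaf]
    return [x for c in node.children for x in leaves(c)]
```

Every node is pushed and popped at most once, so the construction is linear in the length of the arrays.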
Compressing S
• Input: |S| = n, Σ = [n]
• Map character pairs into single characters:
• For i = 1 to n/2, form the pairs (S[2i-1], S[2i])
• Sort the pairs lexicographically by radix sort, O(n)
• Remove duplicates
• S'[i] = rank of (S[2i-1], S[2i])
• Now |S'| = n/2 and Σ' = [n/2]
Example
S = 121112212221$, Σ = [13]
• Pairs: (1,2) (1,1) (1,2) (2,1) (2,2) (2,1)
• Ordered pairs: (1,1) (1,2) (1,2) (2,1) (2,1) (2,2)
• Duplicates removed: (1,1) (1,2) (2,1) (2,2)
• S' = 212343$, Σ' = [4]
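The compression step can be reproduced with a short sketch. For brevity it uses Python's built-in comparison sort where the slides use a radix sort, so it shows the ranking but not the O(n) bound. It assumes the terminator is handled separately and the remaining string has even length; `compress` is an illustrative name.

```python
def compress(s):
    """Replace each character pair of s (a list of ints) by the rank of the pair."""
    pairs = [(s[2 * i], s[2 * i + 1]) for i in range(len(s) // 2)]
    # sorted(set(...)) sorts the pairs and removes duplicates in one go
    rank = {p: r for r, p in enumerate(sorted(set(pairs)), start=1)}
    return [rank[p] for p in pairs]

print(compress([1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1]))  # [2, 1, 2, 3, 4, 3]
```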
Decompressing S
• Input: A_{T_S'}, LCP_{T_S'}
• Notice: the odd suffix S[2i-1] ··· S[n]$ corresponds to S'[i] ··· S'[n/2]$
• A_{T_o}[i] = 2 · A_{T_S'}[i] − 1
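A minimal sketch of the decompression on the running example follows. The formula for the sort array is the one on the slide. The LCP adjustment is an assumption added here, not stated in the source: h matching characters of S' account for 2h matching characters of S, plus one more if the next characters of S also agree. The function name is illustrative.

```python
def odd_arrays(S, A_sp, LCP_sp):
    """S: the original string (positions are 1-based); A_sp, LCP_sp: arrays of S'."""
    A_o = [2 * p - 1 for p in A_sp]
    LCP_o = []
    for k, h in enumerate(LCP_sp):
        a = A_o[k] + 2 * h          # first position after the 2h known matches
        b = A_o[k + 1] + 2 * h
        extra = 1 if a <= len(S) and b <= len(S) and S[a - 1] == S[b - 1] else 0
        LCP_o.append(2 * h + extra)
    return A_o, LCP_o

# Arrays of S' = 212343$ (with $ smallest), computed by hand for this example
A_o, LCP_o = odd_arrays("121112212221$", [7, 2, 1, 3, 6, 4, 5], [0, 0, 1, 0, 1, 0])
print(A_o)    # [13, 3, 1, 5, 11, 7, 9]
print(LCP_o)  # [0, 1, 2, 0, 2, 1]
```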
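Building the Even Tree
• Input: A_{T_o}, LCP_{T_o}
• Observation: if P is an even suffix of S, then P = aP', where a ∈ Σ and P' is an odd suffix of S
• To get A_{T_e}, radix sort the even suffixes S[2i,n] using the key pairs (S[2i], S[2i+1,n]); the order of the odd suffixes S[2i+1,n] is already given by A_{T_o}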
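The two-key radix sort can be sketched as two passes. Walking A_{T_o} already lists the even suffixes in order of their second key, and one stable pass on the first character finishes the job. The sketch uses Python's stable `sorted` for that pass; a counting sort would be needed for a truly linear bound. The function name is illustrative.

```python
def even_array(S, A_o):
    """Sort the even suffixes of S (1-based positions) given the sorted odd suffixes."""
    # Pass 1: even suffix p-1 precedes odd suffix p, so this list is sorted by tail.
    by_tail = [p - 1 for p in A_o if p > 1]
    if len(S) % 2 == 0:
        by_tail.insert(0, len(S))   # last even suffix has an empty tail, which sorts first
    # Pass 2: stable sort on the first character keeps ties in tail order.
    return sorted(by_tail, key=lambda p: S[p - 1])

print(even_array("121112212221$", [13, 3, 1, 5, 11, 7, 9]))  # [12, 4, 8, 2, 10, 6]
```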
Merging T_o and T_e
• Input: A_{T_o}, LCP_{T_o} and A_{T_e}, LCP_{T_e}
• Trivial method: sort the suffixes lexicographically, Θ(n²)
• What if we have an oracle for lcp(S[2i,n], S[2j-1,n])?
• Merge A_{T_o} and A_{T_e} directly (like sorted lists)
• Compute LCP_T from previous results:
• lcp of adjacent odd suffixes from LCP_{T_o}
• lcp of adjacent even suffixes from LCP_{T_e}
• lcp of an odd suffix and an even suffix from the oracle
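Given an oracle, the merge is the usual merge of two sorted lists, where each comparison looks only at the single character after the common prefix. In the sketch below the oracle is a naive character-by-character stand-in, so the code shows the control flow, not the constant-time oracle built later in the talk. It assumes a unique terminator at the end of S; all names are illustrative.

```python
def naive_lcp(S, i, j):
    """Stand-in oracle: lcp of the suffixes starting at 1-based positions i and j."""
    h = 0
    while i + h <= len(S) and j + h <= len(S) and S[i + h - 1] == S[j + h - 1]:
        h += 1
    return h

def merge_arrays(S, A_o, A_e, oracle=naive_lcp):
    merged, i, j = [], 0, 0
    while i < len(A_o) and j < len(A_e):
        p, q = A_o[i], A_e[j]
        h = oracle(S, p, q)
        # The first character after the common prefix decides the order.
        cp = S[p + h - 1] if p + h <= len(S) else ""
        cq = S[q + h - 1] if q + h <= len(S) else ""
        if cp < cq:
            merged.append(p)
            i += 1
        else:
            merged.append(q)
            j += 1
    return merged + A_o[i:] + A_e[j:]

print(merge_arrays("121112212221$", [13, 3, 1, 5, 11, 7, 9], [12, 4, 8, 2, 10, 6]))
# [13, 12, 3, 4, 1, 5, 8, 11, 2, 7, 10, 6, 9]
```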
Coupled-DFS (the uncompacted case)
[Figure, six animation steps: two uncompacted tries T1 and T2, with subtrees labeled A to F, are traversed in parallel; the merged trie TM grows one edge at a time, and subtrees reached by the same edge labels are merged]
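For uncompacted tries the coupled traversal is short enough to sketch directly. The nested-dict representation (`{character: subtrie}`) and the function name are assumptions made for this example.

```python
def merge_tries(t1, t2):
    """Merge two uncompacted tries given as nested dicts {character: subtrie}."""
    merged = {}
    for c in sorted(set(t1) | set(t2)):
        if c in t1 and c in t2:
            # Same edge label in both tries: the two traversals descend together.
            merged[c] = merge_tries(t1[c], t2[c])
        else:
            # Edge present in only one trie: its whole subtree is taken as is.
            merged[c] = t1[c] if c in t1 else t2[c]
    return merged

t1 = {1: {2: {}}, 2: {}}
t2 = {1: {3: {}}, 3: {}}
print(merge_tries(t1, t2))  # {1: {2: {}, 3: {}}, 2: {}, 3: {}}
```

Since every edge carries a single character, comparing labels is a constant-time test; the compacted case loses exactly this property.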
Coupled-DFS (the compacted case)
[Figure, two animation steps: two compacted tries T1 and T2 with multi-character edge labels such as 1234 and 12, and the partially built merged trie TM]
Over-Merging T_o and T_e
• How do we merge compacted tries?
• An over-merge is like a merge, but:
• Compare only the first characters of edges
• If two edges have different lengths k < l, break the edge of length l into pieces of length k and l − k
• Identify edges by their first letter only
Over-Merge Example
[Figure: compacted tries T1 and T2 and their over-merge TM]
Over-Merge of Running Example: T_o
[Figure: the odd tree T_o for S = 121112212221$, with leaves 1, 3, 5, 7, 9, 11, 13]
Over-Merge of Running Example: T_e
[Figure: the even tree T_e for S = 121112212221$, with leaves 2, 4, 6, 8, 10, 12]
Over-Merge of Running Example: T_M
[Figure: the over-merged tree T_M for S = 121112212221$]
Building the lcp Oracle
• Definitions
• A node in both T_M and T_o is odd
• A node in both T_M and T_e is even
• A node with both odd and even descendants is odd/even
• For every odd/even node u, find leaves l_{2i} and l_{2j-1} such that u = lca(l_{2i}, l_{2j-1})
• Compute d(u) = lca(l_{2i+1}, l_{2j})
• Compute δ(u) = depth of u in the tree of d-pointers
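The last step, turning d-pointers into depths, is a plain depth computation in a parent-pointer tree. The sketch below assumes the d-pointers are already available as a dictionary and uses made-up node names; it does not build T_M or the pointers themselves.

```python
def d_tree_depths(d, root):
    """d maps every non-root odd/even node to its d-pointer target; returns all depths."""
    depth = {root: 0}
    for u in d:
        path = []
        while u not in depth:       # climb until a node of known depth is reached
            path.append(u)
            u = d[u]
        base = depth[u]
        for k, v in enumerate(reversed(path), start=1):
            depth[v] = base + k     # nodes nearer the known ancestor are shallower
    return depth

# Hypothetical d-pointers on five nodes
print(d_tree_depths({"e": "c", "c": "a", "b": "a", "a": "root"}, "root"))
# {'root': 0, 'a': 1, 'c': 2, 'e': 3, 'b': 2}
```

Each node is placed on a path exactly once, so all depths are obtained in time linear in the number of odd/even nodes.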
Over-Merge of Running Example: T_M
[Figure: the over-merged tree T_M for S = 121112212221$, shown again]
Main Theorem
The function d defines a tree on the odd/even nodes of T_M, and for any l_{2i} and l_{2j-1} we have
δ(lca(l_{2i}, l_{2j-1})) = lcp(S[2i,n], S[2j-1,n]),
where δ(u) is the depth of u in the tree of d-pointers.
Helpful Observations
Let u be an odd/even node in T_M. u is either even or odd, so L(u) is defined. Let u be an even node:
1. For l_{2i} and l_{2j} below u: lcp(S[2i,n], S[2j,n]) ≥ L(u)
2. For l_{2i'-1} and l_{2j'-1} below u: lcp(S[2i'-1,n], S[2j'-1,n]) ≥ L(u)
3. For l_{2i''} and l_{2j''-1} below u whose lca is u: lcp(S[2i'',n], S[2j''-1,n]) ≤ L(u)
The proof is symmetric if u is an odd node.
Lemma
The lcp value of any odd and even pair of leaves whose lca is u must be the same.
Proof: Suppose lca(l_{2i'}, l_{2j'-1}) = lca(l_{2i''}, l_{2j''-1}) = u. Then
lcp(S[2i',n], S[2j'-1,n]) = k ≤ L(u)
lcp(S[2i',n], S[2i'',n]) ≥ L(u) ≥ k
so lcp(S[2i'',n], S[2j'-1,n]) = k.
[Figure: the suffixes S[2i',n], S[2j'-1,n] and S[2i'',n] aligned, with the lengths k and L(u) marked]
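Induction on the lcp
Pick a pair of odd and even suffixes S[2i',n] and S[2j'-1,n].
Base: If S[2i'] ≠ S[2j'-1], then lca = root (recall the merge procedure), so lcp = 0.
Assumption: Suppose the theorem is true for lcp < k.
Induction step: lcp(S[2i,n], S[2j-1,n]) = k > 0, so u = lca(l_{2i}, l_{2j-1}) ≠ root.
Suppose d(u) = lca(l_{2i'+1}, l_{2j'}); then, writing δ(u) for the depth of u in the tree of d-pointers:
δ(u) = 1 + δ(d(u))   (1: definition of depth)
     = 1 + lcp(S[2i'+1,n], S[2j',n])   (2: induction assumption)
     = lcp(S[2i,n], S[2j-1,n])   (3: the first characters match, and by the Lemma all odd and even pairs with lca u have the same lcp)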
Done!
[Figure: the complete RAM algorithm scheme, steps 1 to 6]