Simple Linear Work Suffix Array Construction. J. Kärkkäinen, P. Sanders. Proc. 30th International Conference on Automata, Languages and Programming, 2003.
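Work
When analyzing a parallel algorithm, two complexity measures are commonly used: time complexity and work complexity.
• Time t(n): the number of steps that must be executed.
• Work w(n): t(n) × (the number of processors used).
The main contribution of this paper is that its method is also optimal in the External Memory and Cache Oblivious models, and that in the BSP and EREW-PRAM models it matches the work complexity of existing algorithms while achieving a better time complexity. The rest of this presentation analyzes only the time complexity in the RAM model.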
Today's Work
• Suffix Array
• Depth Array
• Suffix Tree
Model of Alphabet • Constant alphabet: The size of alphabet is constant. • Integer alphabet: Characters are integers in [1 … n], where n is the number of input characters.
Topic 1: Suffix Array
• A suffix array SA of s is the result of sorting the suffixes of s lexicographically.
Example: s = [ a b a ] with indices 0 1 2.
s0 = a b a
s1 = b a
s2 = a
=> SA = [ s2 s0 s1 ], stored as [ 2 0 1 ] in an implementation.
Some conventions:
• The suffix starting at index i is called the i-th suffix, written si.
• Suffixes with i ≠ 0 mod 3 = { i-th suffix | i ≠ 0 mod 3 }
• Suffixes with i = 0 mod 3 = { i-th suffix | i = 0 mod 3 }
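The definition above can be turned into code directly. The sketch below is only the definition made executable (the function name `naive_suffix_array` is made up for illustration); it is not the linear-time method of the paper, since sorting whole suffixes costs far more than O(n).

```python
def naive_suffix_array(s):
    """Return the suffix array of s by sorting all suffixes directly.

    Each comparison may scan O(n) characters, so this is roughly
    O(n^2 log n) in the worst case. It serves only as a reference.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("aba"))          # [2, 0, 1]
print(naive_suffix_array("mississippi"))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```

A reference like this is handy for checking the output of a faster construction on small inputs.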
Suffix Array Problem • Input: a string s with length n • Output: a suffix array SA of s • Time: O(n)
GetSA Algorithm Outline
• Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3.
• Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3.
• Step 3: SA = merge SA=0 and SA≠0.
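Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3.
• Choose representatives: pad s with $ characters and name every triple s[i..i+2] with i ≠ 0 mod 3 by its rank, obtained with a radix sort of the triples.
index: 0 1 2 3 4 5 6 7 8 9 10 11 12
s = m i s s i s s i p p i $ $
position i: 1 4 7 10 2 5 8
triple: iss iss ipp i$$ ssi ssi ppi
rank: 3 3 2 1 5 5 4
Let R = [ 3 3 2 1 5 5 4 ] be the string of representatives => getSA(R) = SA_R = [ 10 7 4 1 8 5 2 ] (entries given as positions in s), computed in T(2n/3).
Claim: SA≠0 = SA_R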
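The naming of triples in Step 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `representative_string` is hypothetical, a comparison sort stands in for the radix sort, and the pad character "$" is assumed to be smaller than every character of s.

```python
def representative_string(s, pad="$"):
    """Name each triple s[i:i+3] with i % 3 != 0 by its rank.

    Returns (positions, names). positions lists the starts with
    i % 3 == 1 followed by those with i % 3 == 2; names[k] is the rank
    (counted from 1) of the triple starting at positions[k].
    Assumption: pad sorts before every character of s.
    """
    n = len(s)
    t = s + pad * 3  # padding so that every triple has three characters
    positions = list(range(1, n, 3)) + list(range(2, n, 3))
    triples = [t[i:i + 3] for i in positions]
    # A plain sort replaces the radix sort used on the slide.
    rank = {tr: r for r, tr in enumerate(sorted(set(triples)), start=1)}
    return positions, [rank[tr] for tr in triples]


positions, names = representative_string("mississippi")
print(positions)  # [1, 4, 7, 10, 2, 5, 8]
print(names)      # [3, 3, 2, 1, 5, 5, 4]
```

Equal triples receive equal names, which is why the recursion on the shorter string is still needed to break ties.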
Why SA_R = SA≠0?
R = [ 3 3 2 1 5 5 4 ] is the string of representatives for positions 1 4 7 10 2 5 8 of s = m i s s i s s i p p i.
Write Ri for the suffix of R that starts at the entry for position i:
R1 = 3 3 2 1 5 5 4, s1 = i s s i s s i p p i
R4 = 3 2 1 5 5 4, s4 = i s s i p p i
R7 = 2 1 5 5 4, s7 = i p p i
R10 = 1 5 5 4, s10 = i
R2 = 5 5 4, s2 = s s i s s i p p i
R5 = 5 4, s5 = s s i p p i
R8 = 4, s8 = p p i
SA_R = SA≠0 = [ 10 7 4 1 8 5 2 ]. It suffices to show that Ri < Rj <=> si < sj.
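Ri < Rj <=> si < sj
Case 1: i = j mod 3
R = [ 3 3 2 1 5 5 4 ] is the string of representatives for positions 1 4 7 10 2 5 8, and s = m i s s i s s i p p i $ $ (indices 0 to 12).
Example:
R4 = [ 3 2 1 5 5 4 ] (positions 4 7 10 2 5 8), s4 = [ i s s i p p i $ $ ] (indices 4 to 12)
R1 = [ 3 3 2 1 5 5 4 ] (positions 1 4 7 10 2 5 8), s1 = [ i s s i s s i p p i $ $ ] (indices 1 to 12)
s4 < s1 <=> R4 < R1
Because i and j are congruent mod 3, both suffixes split into triples that R encodes, so comparing R4 with R1 compares s4 with s1 triple by triple.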
Ri < Rj <=> si < sj
Case 2: i ≠ j mod 3
R = [ 3 3 2 1 5 5 4 ] is the string of representatives for positions 1 4 7 10 2 5 8, and s = m i s s i s s i p p i $ $ (indices 0 to 12).
Example:
R4 = [ 3 2 1 5 5 4 ] (positions 4 7 10 2 5 8), s4 = [ i s s i p p i $ $ ] (indices 4 to 12)
R5 = [ 5 4 ] (positions 5 8), s5 = [ s s i p p i ] (indices 5 to 10)
R4 < R5 <=> s4 < s5
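The claim can be checked on the slide's example by sorting both ways and comparing the results. The sketch below is only a brute-force check on one input, not a proof and not part of the algorithm; `check_claim` is a made-up name, and "$" is assumed to sort before every character of s.

```python
def check_claim(s, pad="$"):
    """Sort the i % 3 != 0 suffixes via the representative string and directly.

    Returns (via_R, direct); the claim says the two lists agree.
    Assumption: pad sorts before every character of s.
    """
    n = len(s)
    t = s + pad * 3
    positions = list(range(1, n, 3)) + list(range(2, n, 3))
    triples = [t[i:i + 3] for i in positions]
    rank = {tr: r for r, tr in enumerate(sorted(set(triples)), start=1)}
    R = [rank[tr] for tr in triples]
    # Suffix order of R, reported as positions in s.
    via_R = [positions[k] for k in sorted(range(len(R)), key=lambda k: R[k:])]
    # The same suffixes of s, sorted by direct comparison.
    direct = sorted(positions, key=lambda i: s[i:])
    return via_R, direct


via_R, direct = check_claim("mississippi")
print(via_R)   # [10, 7, 4, 1, 8, 5, 2]
print(direct)  # [10, 7, 4, 1, 8, 5, 2]
```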
Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3.
Step 1 determined the rank of sj among { sk | k ≠ 0 mod 3 } for all j ≠ 0 mod 3.
Let rank≠0(sj) = the rank of sj among { sk | k ≠ 0 mod 3 }, for all j ≠ 0 mod 3.
Since i + 1 ≠ 0 mod 3 whenever i = 0 mod 3, the pair ( s[i], rank≠0(si+1) ) determines the order of si, so
SA=0 = radix sort { ( s[i], rank≠0(si+1) ) | i = 0 mod 3 }.
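Step 2 amounts to sorting pairs. The following sketch assumes the result of Step 1 is available as a list of positions in suffix order; the name `sort_mod0_suffixes` is hypothetical, and a comparison sort again stands in for the radix sort.

```python
def sort_mod0_suffixes(s, sa_non0):
    """Sort the suffixes starting at i % 3 == 0.

    sa_non0 lists the positions with i % 3 != 0 in suffix order (Step 1).
    Each suffix s_i is keyed by (s[i], rank of s_{i+1}).
    """
    n = len(s)
    rank = {pos: r for r, pos in enumerate(sa_non0, start=1)}
    rank[n] = 0  # the empty suffix past the end sorts before everything
    return sorted(range(0, n, 3), key=lambda i: (s[i], rank[i + 1]))


print(sort_mod0_suffixes("mississippi", [10, 7, 4, 1, 8, 5, 2]))  # [0, 9, 6, 3]
```

With a radix sort on the pairs, which have a bounded first component and a rank in [0 … n] as the second, this step takes O(n) time.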
Step 3: SA = merge SA=0 and SA≠0.
• SA=0 = [ s0 s9 s6 s3 ]
• SA≠0 = [ s11 s10 s7 s4 s1 s8 s5 s2 ]
• SA = merge SA=0 and SA≠0 = [ s11 s10 s7 s4 s1 s0 s9 s8 s6 s3 s5 s2 ]
The merge takes O(n) time if the relative order of any si ∈ SA=0 and sj ∈ SA≠0 can be determined in constant time.
Compare si and sj, where i = 0 mod 3 and j ≠ 0 mod 3:
Case 1: j = 1 mod 3. Since i + 1 = 1 mod 3 and j + 1 = 2 mod 3, both ranks are known, so compare ( s[i], rank≠0(si+1) ) with ( s[j], rank≠0(sj+1) ) in constant time.
Case 2: j = 2 mod 3. Since i + 2 = 2 mod 3 and j + 2 = 1 mod 3, both ranks are known, so compare ( s[i], s[i+1], rank≠0(si+2) ) with ( s[j], s[j+1], rank≠0(sj+2) ) in constant time.
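The two comparison cases plug into an ordinary two-way merge. The sketch below works on the unpadded string, so positions past the end are given rank 0 and an empty character; the names `merge_suffix_arrays`, `rk`, `ch`, and `less` are made up for illustration.

```python
def merge_suffix_arrays(s, sa0, sa_non0):
    """Merge the sorted i % 3 == 0 suffixes with the sorted i % 3 != 0 ones."""
    n = len(s)
    rank = {pos: r for r, pos in enumerate(sa_non0, start=1)}

    def rk(p):
        # Rank among the i % 3 != 0 suffixes; 0 for positions past the end.
        return rank.get(p, 0)

    def ch(p):
        # Character at p; the empty string sorts before any character.
        return s[p] if p < n else ""

    def less(i, j):
        # i % 3 == 0 and j % 3 != 0: is s_i < s_j? Constant time per call.
        if j % 3 == 1:
            return (ch(i), rk(i + 1)) < (ch(j), rk(j + 1))
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(j), ch(j + 1), rk(j + 2))

    out, a, b = [], 0, 0
    while a < len(sa0) and b < len(sa_non0):
        if less(sa0[a], sa_non0[b]):
            out.append(sa0[a])
            a += 1
        else:
            out.append(sa_non0[b])
            b += 1
    return out + sa0[a:] + sa_non0[b:]


print(merge_suffix_arrays("mississippi", [0, 9, 6, 3], [10, 7, 4, 1, 8, 5, 2]))
# [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```

Every loop iteration emits one suffix and performs one constant-time comparison, which gives the O(n) bound for the merge.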
Time complexity analysis • Step1: O(n) + T(2n/3) • Step2: O(n) • Step3: O(n) • T(n) = O(n) + T(2n/3) = O(n)
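The recurrence can be unrolled into a geometric series. In the derivation below, c is an assumed constant that bounds the non-recursive work per character; it does not appear on the slides.

```latex
T(n) \le cn + T\!\left(\tfrac{2n}{3}\right)
     \le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     = 3cn = O(n)
```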
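Topic 2: Depth array
• Definition of the depth array: let sk and sj be the suffixes at positions i - 1 and i of SA (SA is indexed from 0 to n - 1), so sk immediately precedes sj in sorted order. DA is indexed from 1 to n - 1, and
DA[i] = the length of the longest common prefix of sj and sk.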
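As with the suffix array, the definition can be computed directly for reference. This sketch is the quadratic baseline, not the linear-time algorithm; `naive_depth_array` is a hypothetical name, and entry 0 is filled with 0 only so that the list lines up with SA.

```python
def naive_depth_array(s, sa):
    """DA[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i].

    DA[0] is unused by the definition and set to 0 here.
    Direct character comparison: O(n^2) in the worst case.
    """
    def lcp(a, b):
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k

    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]


sa = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]    # suffix array of "mississippi"
print(naive_depth_array("mississippi", sa))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```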
Depth array problem • Input: a string s and its suffix array SA. • Output: a depth array DA of s. • Time: O(|s|) = O(n)
Lemma 1: di ≥ di-1 - 1
Notation: rank(i) is the position of si in SA, si' is the suffix immediately before si in SA (at position rank(i) - 1), and di = DA[ rank(i) ] is the length of the longest common prefix of si' and si.
Pf: Assume di-1 ≥ 2; otherwise the bound is trivial. The suffixes s(i-1)' < si-1 share their first di-1 characters. Dropping the first character of each gives s(i-1)'+1 and si, which share di-1 - 1 characters, and s(i-1)'+1 < si.
If di < di-1 - 1, then si' shares fewer characters with si than s(i-1)'+1 does, so si' < s(i-1)'+1 < si. This contradicts the choice of si' as the suffix immediately before si in SA. -><-
How to compute di when di-1 is given?
• By Lemma 1, di ≥ di-1 - 1, so the first di-1 - 1 characters of si and si' are known to match; it suffices to compare si and si' starting from the di-1-th character.
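Algorithm GetDepth
Input: a string s and its suffix array SA
1. d1 = computed by naively comparing s1 and s1';
2. For i := 2 to n-1 do
3. di = computed by comparing si and si' starting from the (di-1)-th character;
4. End for
Time complexity analysis:
Iteration i: ( di - di-1 + 1 ) + 1 = di - di-1 + 2 character comparisons.
Total = Σ_{i=2}^{n-1} ( di - di-1 + 2 ) = dn-1 - d1 + 2(n - 2) = O(n), since every di is at most n and computing d1 costs O(n).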
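GetDepth can be sketched in Python as below. This is one possible rendering under a few assumptions that are not on the slides: suffixes are visited from position 0, a suffix that is first in SA has no predecessor and resets the carried depth to 0, and the name `get_depth` and the use of index 0 as an unused entry are illustrative choices.

```python
def get_depth(s, sa):
    """Depth array of s from its suffix array sa, using Lemma 1.

    Suffixes are processed in text order; the depth found for s_{i-1},
    minus one, is a lower bound for s_i, so matching resumes there.
    """
    n = len(s)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    da = [0] * n
    d = 0                    # depth carried over from the previous suffix
    for i in range(n):
        if rank[i] == 0:     # s_i is first in SA: it has no predecessor
            d = 0
            continue
        j = sa[rank[i] - 1]  # i': the suffix just before s_i in SA
        d = max(d - 1, 0)    # Lemma 1: these characters already match
        while i + d < n and j + d < n and s[i + d] == s[j + d]:
            d += 1
        da[rank[i]] = d
    return da


sa = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  # suffix array of "mississippi"
print(get_depth("mississippi", sa))      # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```

The variable d decreases by at most one per iteration and never exceeds n, so the inner loop runs O(n) times in total, matching the telescoping argument in the analysis.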
Topic 3: Suffix Tree Problem • Input: a string s with length n. • Output: a suffix tree ST of s. • Time: O(|s|) = O(n)
GetST Algorithm Outline
Algorithm GetST(s)
1. SA = suffix array of s;
2. DA = depth array of s;
3. For i := 0 to n-1 do
STi = STi-1 with the SA[i]-th suffix added;
4. End for
5. Return STn-1;
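How to add the SA[i]-th suffix into STi-1?
Let sk and sj be the suffixes at positions i - 1 and i of SA.
Observation: the SA[i-1]-th suffix is the rightmost path RP of STi-1, so the longest common prefix of RP and the SA[i]-th suffix has length DA[ i ].
Walk up RP to string depth DA[ i ] and attach a new leaf there; its edge is labeled [ (SA[i] + DA[i]), - ], the remainder of the SA[i]-th suffix.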
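Each node is traversed at most once
When the new leaf with edge label [ (SA[i] + DA[i]), - ] is attached at depth DA[ i ], the nodes on the old rightmost path below that depth leave the rightmost path and are never traversed again.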
Time Complexity Analysis
• Because each node is traversed at most once and there are at most 2n nodes in the tree, the time complexity is O(n).
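The insertion along the rightmost path can be sketched with a stack. The code below is an illustration under stated assumptions, not the presenter's implementation: the string is assumed to end with a unique terminator such as "$" that is smaller than every other character (so no suffix is a prefix of another), nodes store their string depth instead of explicit edge labels, and the names `Node`, `build_suffix_tree`, and `leaf_order` are made up.

```python
class Node:
    def __init__(self, depth, parent=None):
        self.depth = depth    # string depth: length of the path from the root
        self.parent = parent
        self.children = []
        self.suffix = None    # start position for leaves, None otherwise


def build_suffix_tree(s, sa, da):
    """Insert suffixes in SA order, keeping the rightmost path on a stack.

    Assumption: s ends with a unique smallest terminator character.
    """
    n = len(s)
    root = Node(0)
    path = [root]             # rightmost path, root first
    for i in range(n):
        d = da[i] if i > 0 else 0
        last = None
        # Climb the rightmost path; popped nodes are never visited again.
        while path[-1].depth > d:
            last = path.pop()
        top = path[-1]
        if top.depth < d:
            # Depth d falls inside the edge top -> last: split it.
            mid = Node(d, top)
            top.children[-1] = mid
            mid.children.append(last)
            last.parent = mid
            path.append(mid)
            top = mid
        leaf = Node(n - sa[i], top)  # edge label is s[sa[i] + top.depth:]
        leaf.suffix = sa[i]
        top.children.append(leaf)
        path.append(leaf)
    return root


def leaf_order(node):
    """Leaves from left to right; for a correct build this equals sa."""
    if not node.children:
        return [node.suffix]
    return [p for c in node.children for p in leaf_order(c)]


root = build_suffix_tree("aba$", [3, 2, 0, 1], [0, 0, 1, 0])
print(leaf_order(root))  # [3, 2, 0, 1]
```

Each node is pushed onto the stack once and popped at most once, which is the amortized argument behind the O(n) bound.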
Conclusions
• Advantages:
• Weak restrictions on the alphabet
• Disk I/O
• Easy to show
• Disadvantages:
• Not incremental