Simple Linear Work Suffix Array Construction

J. Kärkkäinen, P. Sanders. Proc. 30th International Conference on Automata, Languages and Programming, 2003.

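Work
When analyzing a parallel algorithm, two complexity measures are commonly used: time and work.
• Time t(n): the number of steps that must be executed.
• Work w(n): t(n) * (number of processors used).
The main contribution of this paper is that its method is also optimal in the External Memory and Cache Oblivious models, and that in the BSP and EREW-PRAM models it matches the work complexity of existing algorithms while achieving better time complexity. The rest of this report, however, analyzes only the time complexity in the RAM model.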

Today’s Work • Suffix Array • Depth Array • Suffix Tree

Model of Alphabet • Constant alphabet: The size of the alphabet is constant. • Integer alphabet: Characters are integers in [1 … n], where n is the number of input characters.

Topic 1: Suffix Array
• A suffix array SA of s is the result of sorting the suffixes of s lexicographically.
Ex: s = [ a b a ] (indices 0 1 2)
  s_0 = a b a
  s_1 = b a
  s_2 = a
  => SA = [ s_2 s_0 s_1 ], stored as [ 2 0 1 ] in an implementation.
Some conventions:
• The suffix starting at index i is called the i-th suffix, written s_i.
• Suffixes whose index is not divisible by 3 = { i-th suffix | i ≠ 0 mod 3 }
• Suffixes whose index is divisible by 3 = { i-th suffix | i = 0 mod 3 }
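The definition can be checked directly with a naive construction. This is only a sketch of the definition, not the linear-time algorithm of the paper; the function name `naive_suffix_array` is made up for illustration, and `$` is assumed to sort below every letter.

```python
def naive_suffix_array(s):
    """Return the suffix array of s by directly sorting all suffixes.

    Worst-case O(n^2 log n) time; meant only to illustrate the definition.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("aba"))           # [2, 0, 1]
print(naive_suffix_array("mississippi$"))  # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```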

Suffix Array Problem • Input: a string s with length n • Output: a suffix array SA of s • Time: O(n)

GetSA Algorithm Outline • Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3. • Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3. • Step 3: SA = merge SA=0 and SA≠0.

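Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3.
• Choose representatives: pad s with $ $ and take the triple s[i..i+2] at every position i ≠ 0 mod 3, listing the positions i = 1 mod 3 first and then the positions i = 2 mod 3.
  s = m i s s i s s i p p i $ $ (indices 0 .. 12)
• Radix sort the triples and replace each triple by its rank:
  position i : 1   4   7   10  2   5   8
  triple     : iss iss ipp i$$ ssi ssi ppi
  rank       : 3   3   2   1   5   5   4
• Let R = [ 3 3 2 1 5 5 4 ] be the resulting string of representatives. Then getSA(R) = SA_R = [ 10 7 4 1 8 5 2 ] (written as positions of s) is computed recursively in T(2n/3).
Claim: SA≠0 = SA_R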
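The naming of triples in Step 1 can be sketched as follows. The helper `name_triples` and the variable `R` (the string of representatives) are illustrative names, not from the paper; `sorted()` stands in for the linear-time radix sort, the recursion is replaced by a direct sort of the suffixes of `R`, and the special handling some input lengths need is left out.

```python
def name_triples(s, pad="$"):
    """Rank the triples s[i..i+2] at positions i != 0 mod 3.

    Returns (positions, names): positions lists i = 1 mod 3 first, then
    i = 2 mod 3; names[k] is the 1-based rank of the triple at positions[k].
    Assumes pad sorts below every character of s.
    """
    n = len(s)
    padded = s + pad * 2
    positions = list(range(1, n, 3)) + list(range(2, n, 3))
    triples = [padded[i:i + 3] for i in positions]
    # sorted() stands in for the radix sort used on the slide.
    rank_of = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return positions, [rank_of[t] for t in triples]


positions, R = name_triples("mississippi")
print(positions)  # [1, 4, 7, 10, 2, 5, 8]
print(R)          # [3, 3, 2, 1, 5, 5, 4]

# Suffix array of R, translated back to positions of s.
sa_R = sorted(range(len(R)), key=lambda k: R[k:])
print([positions[k] for k in sa_R])  # [10, 7, 4, 1, 8, 5, 2]
```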

Why SA_R = SA≠0 ?
R = [ 3 3 2 1 5 5 4 ] is the string of representatives (positions 1 4 7 10 2 5 8), s = m i s s i s s i p p i (indices 0 .. 10).
Each suffix R_i of R corresponds to the suffix s_i of s:
  R_1  = 3 3 2 1 5 5 4    s_1  = i s s i s s i p p i
  R_4  = 3 2 1 5 5 4      s_4  = i s s i p p i
  R_7  = 2 1 5 5 4        s_7  = i p p i
  R_10 = 1 5 5 4          s_10 = i
  R_2  = 5 5 4            s_2  = s s i s s i p p i
  R_5  = 5 4              s_5  = s s i p p i
  R_8  = 4                s_8  = p p i
SA_R = SA≠0 = [ 10 7 4 1 8 5 2 ]. It suffices to show that R_i < R_j <=> s_i < s_j.

R_i < R_j <=> s_i < s_j, Case 1: i = j mod 3
R = [ 4 4 3 2 6 6 5 ] is the string of representatives (positions 1 4 7 10 2 5 8), s = m i s s i s s i p p i $ $ (indices 0 .. 12). The ranks on this slide are those of Step 1 shifted up by one; only their relative order matters.
Ex:
  R_4 = [ 4 3 2 6 6 5 ]      s_4 = [ i s s i p p i $ $ ]
  R_1 = [ 4 4 3 2 6 6 5 ]    s_1 = [ i s s i s s i p p i $ $ ]
  s_4 < s_1 <=> R_4 < R_1

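R_i < R_j <=> s_i < s_j, Case 2: i ≠ j mod 3
R = [ 4 4 3 2 6 6 5 ] is the string of representatives (positions 1 4 7 10 2 5 8), s = m i s s i s s i p p i $ $ (indices 0 .. 12).
Ex:
  R_4 = [ 4 3 2 6 6 5 ]    s_4 = [ i s s i p p i $ $ ]
  R_5 = [ 6 5 ]            s_5 = [ s s i p p i ]
  R_4 < R_5 <=> s_4 < s_5
In this example the triple at position 10 (i $ $) contains the padding character $, so no other triple equals it. A comparison of R_i and R_j is therefore decided no later than at that triple, before it could run from the i = 1 mod 3 block into the i = 2 mod 3 block.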
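For the example string, the equivalence between the order of the representative suffixes and the order of the text suffixes can be checked by brute force over all pairs. This is only a sanity check on one input, not a proof; the names `R`, `index_in_R` and `order_agrees` are illustrative, and `$` is assumed to sort below every letter.

```python
from itertools import combinations

s = "mississippi"
padded = s + "$$"                      # '$' sorts below every letter
positions = list(range(1, len(s), 3)) + list(range(2, len(s), 3))
triples = [padded[i:i + 3] for i in positions]
rank_of = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
R = [rank_of[t] for t in triples]      # string of representatives
index_in_R = {p: k for k, p in enumerate(positions)}


def order_agrees(i, j):
    """True if R_i < R_j holds exactly when s_i < s_j holds."""
    r_i, r_j = R[index_in_R[i]:], R[index_in_R[j]:]
    return (r_i < r_j) == (padded[i:] < padded[j:])


mismatches = [(i, j) for i, j in combinations(positions, 2)
              if not order_agrees(i, j)]
print(mismatches)  # [] : every pair, in both cases, agrees on this input
```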

Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3.
∵ The rank of s_j among { s_k | k ≠ 0 mod 3 } was determined in Step 1 for all j ≠ 0 mod 3.
∴ Let rank≠0(s_j) = rank of s_j among { s_k | k ≠ 0 mod 3 } for all j ≠ 0 mod 3. Since s_i is s[i] followed by s_{i+1}, and i + 1 = 1 mod 3 whenever i = 0 mod 3:
SA=0 = radix sort { ( s[i], rank≠0(s_{i+1}) ) | i = 0 mod 3 }.
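Step 2 can be sketched as a sort on pairs. The function name `sort_mod0_suffixes` and the parameter `sa12` (the sorted positions i ≠ 0 mod 3 from Step 1) are illustrative; `sorted()` stands in for the radix sort, and giving rank -1 to a suffix that starts past the end of s is an assumption of this sketch.

```python
def sort_mod0_suffixes(s, sa12):
    """Sort the suffixes at positions i = 0 mod 3.

    sa12 is the sorted list of suffix positions i != 0 mod 3. The key of
    s_i is (s[i], rank of s_{i+1}); a suffix starting past the end of s
    gets rank -1 so that it sorts first.
    """
    rank12 = {pos: r for r, pos in enumerate(sa12)}
    return sorted(range(0, len(s), 3),
                  key=lambda i: (s[i], rank12.get(i + 1, -1)))


s = "mississippi$"
sa12 = [11, 10, 7, 4, 1, 8, 5, 2]
print(sort_mod0_suffixes(s, sa12))  # [0, 9, 6, 3]
```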

Step 3: SA = merge SA=0 and SA≠0.
Ex: s = m i s s i s s i p p i $ (indices 0 .. 11)
• SA=0 = [ s_0 s_9 s_6 s_3 ]
• SA≠0 = [ s_11 s_10 s_7 s_4 s_1 s_8 s_5 s_2 ]
• SA = merge SA=0 and SA≠0 = [ s_11 s_10 s_7 s_4 s_1 s_0 s_9 s_8 s_6 s_3 s_5 s_2 ]
The merge takes O(n) time if we can determine the relative order of any s_i ∈ SA=0 and s_j ∈ SA≠0 in constant time.

Compare s_i and s_j, where i = 0 mod 3 and j ≠ 0 mod 3:
• Case 1: j = 1 mod 3. ∵ i + 1 = 1 mod 3 and j + 1 = 2 mod 3, both rank≠0(s_{i+1}) and rank≠0(s_{j+1}) are known. ∴ Compare ( s[i], rank≠0(s_{i+1}) ) with ( s[j], rank≠0(s_{j+1}) ) in constant time.
• Case 2: j = 2 mod 3. ∵ i + 2 = 2 mod 3 and j + 2 = 1 mod 3, both rank≠0(s_{i+2}) and rank≠0(s_{j+2}) are known. ∴ Compare ( s[i], s[i+1], rank≠0(s_{i+2}) ) with ( s[j], s[j+1], rank≠0(s_{j+2}) ) in constant time.
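The two comparison cases plug into an ordinary two-way merge. The sketch below uses illustrative names (`merge_suffix_arrays`, `sa0`, `sa12`, `mod0_is_smaller`) and assumes that positions past the end of s behave like an empty suffix with rank -1.

```python
def merge_suffix_arrays(s, sa0, sa12):
    """Merge the sorted suffixes i = 0 mod 3 (sa0) and i != 0 mod 3 (sa12)."""
    n = len(s)
    rank12 = {pos: r for r, pos in enumerate(sa12)}

    def rank(p):
        # Suffixes starting past the end of s are empty and sort first.
        return rank12.get(p, -1)

    def char(p):
        # A position past the end compares below every real character.
        return s[p] if p < n else ""

    def mod0_is_smaller(i, j):
        """Compare s_i (i = 0 mod 3) with s_j (j != 0 mod 3) in O(1)."""
        if j % 3 == 1:   # case 1: one character, then a known rank
            return (s[i], rank(i + 1)) < (s[j], rank(j + 1))
        # case 2: two characters, then a known rank
        return ((s[i], char(i + 1), rank(i + 2))
                < (s[j], char(j + 1), rank(j + 2)))

    merged, a, b = [], 0, 0
    while a < len(sa0) and b < len(sa12):
        if mod0_is_smaller(sa0[a], sa12[b]):
            merged.append(sa0[a])
            a += 1
        else:
            merged.append(sa12[b])
            b += 1
    return merged + sa0[a:] + sa12[b:]


s = "mississippi$"
print(merge_suffix_arrays(s, [0, 9, 6, 3], [11, 10, 7, 4, 1, 8, 5, 2]))
# [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```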

Time complexity analysis • Step1: O(n) + T(2n/3) • Step2: O(n) • Step3: O(n) • T(n) = O(n) + T(2n/3) = O(n)
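To see why the recurrence solves to O(n), it can be unrolled into a geometric series. Here c is an assumed constant bounding the non-recursive work per input character:

```latex
\begin{align*}
T(n) &\le c\,n + T\!\left(\tfrac{2n}{3}\right) \\
     &\le c\,n \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
      = c\,n \cdot \frac{1}{1 - \tfrac{2}{3}}
      = 3\,c\,n = O(n).
\end{align*}
```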

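Topic 2: Depth array
• Definition of Depth array: for 1 ≤ i ≤ n - 1, let s_k and s_j be the adjacent suffixes at SA[i-1] and SA[i]. Then DA[i] = length of the longest common prefix of s_j and s_k.
  index : 0 ... i-1  i  ... n-1
  SA    =   ... s_k  s_j ...
  DA    =       ... DA[i] ...    (defined for indices 1 .. n-1)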
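The definition translates directly into a quadratic-time computation, shown here only to make the indexing concrete. The names `lcp_length` and `naive_depth_array` are illustrative, and storing `None` at index 0 (which has no predecessor) is a choice of this sketch.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def naive_depth_array(s, sa):
    """DA[i] = LCP length of the suffixes at SA[i-1] and SA[i]; DA[0] is None."""
    return [None] + [lcp_length(s[sa[i - 1]:], s[sa[i]:])
                     for i in range(1, len(sa))]


s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])
print(sa)                        # [5, 3, 1, 0, 4, 2]
print(naive_depth_array(s, sa))  # [None, 1, 3, 0, 0, 2]
```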

Depth array problem • Input: a string s and its suffix array SA. • Output: a depth array DA of s. • Time: O(|s|) = O(n)

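Lemma 1: d_i ≥ d_{i-1} - 1
Notation: rank(i) is the index of s_i in SA, and s_i' is the suffix immediately preceding s_i in SA, i.e. the suffix at SA[rank(i) - 1].
  S  : positions 0 ... i ... i' ... n-1, where s_i starts at i and s_i' starts at i'
  SA : ... s_i' s_i ...    (s_i at index rank(i))
  DA : DA[ rank(i) ] = d_i = length of the longest common prefix of s_i' and s_i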

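Lemma 1: d_i ≥ d_{i-1} - 1
  S  : ... s_{i-1} s_i ...    (starting at adjacent positions i-1 and i)
  SA : ... s_i' s_i ... at index rank(i), and ... s_(i-1)' s_{i-1} ... at index rank(i-1)
  DA : DA[ rank(i) ] = d_i, DA[ rank(i-1) ] = d_{i-1}
s_(i-1)' and s_{i-1} agree on their first d_{i-1} characters, so after dropping the first character of each, the remaining suffixes agree on d_{i-1} - 1 characters; the second of these is s_i.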

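Lemma 1: d_i ≥ d_{i-1} - 1
Pf: Let s_(i-1)'+1 denote s_(i-1)' without its first character. Since s_(i-1)' < s_{i-1} and the two agree on their first d_{i-1} characters, we have s_(i-1)'+1 < s_i, and these two agree on their first d_{i-1} - 1 characters.
If d_i < d_{i-1} - 1, then s_i' differs from s_i earlier than s_(i-1)'+1 does, so s_i' < s_(i-1)'+1 < s_i. This contradicts the fact that s_i' immediately precedes s_i in SA. -><-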

How to compute d_i when d_{i-1} is given? • By Lemma 1, d_i ≥ d_{i-1} - 1, so the first d_{i-1} - 1 characters of s_i and s_i' are already known to match; it suffices to compare s_i and s_i' from the d_{i-1}-th character onward.

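Algorithm GetDepth
Input: A string s and its suffix array SA
1. d_1 = computed by naively comparing s_1 and s_1';
2. For i := 2 to n-1 do
3.   d_i = computed by comparing s_i and s_i' from the (d_{i-1})-th character;
4. End for
Time complexity analysis:
Iteration i: ( d_i - d_{i-1} + 1 ) + 1 = d_i - d_{i-1} + 2 character comparisons.
Total = Σ_{i=2..n-1} ( d_i - d_{i-1} + 2 ) = d_{n-1} - d_1 + 2(n - 2) = O(n), because the sum telescopes and d_{n-1} ≤ n.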
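A runnable sketch of this scheme is given below. It processes suffixes in text order and carries the matched length over, reduced by one, as Lemma 1 allows. The function name `get_depth`, the 0-based indexing, and the `None` entry for the smallest suffix (which has no predecessor) are assumptions of this sketch and not taken from the slides.

```python
def get_depth(s, sa):
    """Depth array via Lemma 1: DA[r] = LCP length of suffixes SA[r-1], SA[r]."""
    n = len(s)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    da = [None] * n          # DA[0] has no predecessor and stays None
    d = 0                    # matched length carried over from s_{i-1}
    for i in range(n):
        if rank[i] == 0:
            d = 0            # s_i is the smallest suffix: no s_i' exists
            continue
        j = sa[rank[i] - 1]  # start of s_i', the predecessor of s_i in SA
        # The first d characters are known to match; continue from there.
        while i + d < n and j + d < n and s[i + d] == s[j + d]:
            d += 1
        da[rank[i]] = d
        d = max(d - 1, 0)    # Lemma 1: the next value is at least d - 1
    return da


s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])
print(get_depth(s, sa))  # [None, 1, 3, 0, 0, 2]
```

The variable d decreases by at most one per iteration and never exceeds n, which gives the same telescoping bound on the total number of character comparisons.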

Topic 3: Suffix Tree Problem • Input: a string s with length n. • Output: a suffix tree ST of s. • Time: O(|s|) = O(n)

GetST Algorithm Outline
Algorithm GetST(s)
1. SA = suffix array of s;
2. DA = depth array of s;
3. For i := 0 to n-1: ST_i = add the SA[i]-th suffix into ST_{i-1}.
4. End for
5. Return ST_{n-1};

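How to add the SA[i]-th suffix into ST_{i-1}?
  SA = ... s_k s_j ...    (s_k at index i-1, s_j at index i)
  DA = ... DA[i] ...
Observation: The SA[i-1]-th suffix is the rightmost path RP of ST_{i-1}, so the length of the longest common prefix of RP and the SA[i]-th suffix is DA[i].
• Walk up RP from the leaf of the SA[i-1]-th suffix to string depth DA[i] and attach a new leaf there, with its edge labeled [ (SA[i] + DA[i]), - ], i.e. the characters of s from position SA[i] + DA[i] to the end.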

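Each node is visited at most once
• The new leaf with edge [ (SA[i] + DA[i]), - ] is attached at depth DA[i]. The nodes on the old rightmost path below depth DA[i] are no longer on the rightmost path, so they will not be visited again.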

Time Complexity Analysis • Because each node is visited at most once and there are at most 2n nodes in the tree, the time complexity is O(n).
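The insertion procedure can be sketched with an explicit stack holding the current rightmost path. The class `Node`, the functions `build_suffix_tree` and `leaves`, and the choice to store string depths instead of edge labels are all assumptions of this sketch. It also assumes s ends with a unique sentinel, so that no suffix is a prefix of another.

```python
class Node:
    def __init__(self, depth, suffix=None):
        self.depth = depth    # string depth: length of the path label from the root
        self.suffix = suffix  # start position for a leaf, None for an internal node
        self.children = []    # children in lexicographic order


def build_suffix_tree(s, sa, da):
    """Insert suffixes in SA order, walking up the rightmost path each time.

    da[i] is the LCP length of suffixes sa[i-1] and sa[i]; da[0] is ignored.
    """
    root = Node(0)
    path = [root]                  # current rightmost path, root first
    for i, start in enumerate(sa):
        h = da[i] if i > 0 else 0
        last = None
        while path[-1].depth > h:  # these nodes leave the rightmost path for good
            last = path.pop()
        if path[-1].depth < h:     # depth h falls inside an edge: split it
            mid = Node(h)
            path[-1].children[-1] = mid
            mid.children.append(last)
            path.append(mid)
        # New leaf; its edge label is s[start + h:], i.e. [(SA[i] + DA[i]), -].
        leaf = Node(len(s) - start, suffix=start)
        path[-1].children.append(leaf)
        path.append(leaf)
    return root


def leaves(node):
    """Leaf suffix positions in left-to-right order."""
    if node.suffix is not None:
        return [node.suffix]
    return [x for child in node.children for x in leaves(child)]


def lcp_length(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


s = "banana$"
sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive, for the demo only
da = [None] + [lcp_length(s[sa[i - 1]:], s[sa[i]:]) for i in range(1, len(s))]
root = build_suffix_tree(s, sa, da)
print(leaves(root) == sa)  # True: leaves appear in suffix array order
```

Each node is pushed onto `path` once and popped at most once, which mirrors the counting argument above.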

Conclusions • Advantages: • Restriction on the alphabet • Disk I/O • Easy to show • Disadvantages: • Not incremental
