Suffix arrays
Suffix array • Compared with a suffix tree, we lose some functionality but save space. • Let s = abab. • Sort the suffixes lexicographically: ab, abab, b, bab. • The suffix array gives the starting indices (0-based) of the suffixes in sorted order: 2 0 3 1
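A minimal sketch of this definition in Python, assuming 0-based indices. The function name `suffix_array` is illustrative, and sorting the suffixes directly is chosen for readability, not speed.

```python
def suffix_array(text):
    """Return the starting indices of the suffixes of `text` in sorted order."""
    # Direct sort of the suffixes: easy to read, but far from linear time.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(suffix_array("abab"))  # [2, 0, 3, 1]
```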
How do we build it? • Build a suffix tree. • Traverse the tree in DFS, picking the edges outgoing from each node in lexicographic order, and fill the suffix array as the leaves are reached. • O(n) time, given a linear-time suffix tree construction.
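The traversal idea can be sketched with an uncompressed suffix trie standing in for the suffix tree. That substitution is an assumption made to keep the example short: the trie needs O(n^2) time and space, so the sketch illustrates the lexicographic DFS only, not the O(n) bound. The name `suffix_array_via_trie` and the empty-string end marker are illustrative.

```python
def suffix_array_via_trie(text):
    """Fill a suffix array by a lexicographic DFS over a trie of all suffixes."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[""] = start  # end marker; "" sorts before every character

    order = []

    def dfs(node):
        # Visit outgoing edges in lexicographic order of their labels.
        for key in sorted(node):
            if key == "":
                order.append(node[key])  # a leaf: record where the suffix starts
            else:
                dfs(node[key])

    dfs(root)
    return order


print(suffix_array_via_trie("abab"))  # [2, 0, 3, 1]
```

The recursion depth grows with the text length, so this form is only suitable for short inputs.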
How do we search for a pattern? • If P occurs in T, then all its occurrences are consecutive in the suffix array. • Do a binary search on the suffix array. • Takes O(m log n) time, where m = |P| and n = |T|: O(log n) steps, each comparing up to m characters.
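A minimal sketch of the plain binary search, assuming the suffix array is given as a list of 0-based positions. The name `find_occurrences` is illustrative. Two binary searches locate the block of suffixes that start with P.

```python
def find_occurrences(text, sa, pattern):
    """Return all starting positions of `pattern` in `text`, in increasing order."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at this rank.
        return text[sa[rank]:sa[rank] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(sa)
    while lo < hi:  # first rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(find_occurrences(text, sa, "ssi"))   # [2, 5]
print(find_occurrences(text, sa, "issa"))  # []
```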
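Example • Let S = mississippi and P = issa. • The sorted suffixes with their suffix array entries, and the initial search bounds L, M, R:
10  i            (L)
 7  ippi
 4  issippi
 1  ississippi
 0  mississippi
 9  pi           (M)
 8  ppi
 6  sippi
 3  sissippi
 5  ssippi
 2  ssissippi    (R)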
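How do we accelerate the search? • Maintain l = LCP(P, L) and r = LCP(P, R), the lengths of the longest common prefixes of P with the suffixes at the current bounds L and R. • Assume l ≥ r (the case r > l is symmetric, using LCP(M, R)). • Someone whispers LCP(L, M), where M is the middle suffix.
Case LCP(L, M) > l • M agrees with L beyond position l, and P is already larger than L at position l + 1, so P > M. • Continue in the right half; l is unchanged.
Case LCP(L, M) < l • M is larger than L at position LCP(L, M) + 1, where P still agrees with L, so P < M. • Continue in the left half, with r = LCP(L, M).
Case LCP(L, M) = l • Start comparing M to P at position l + 1.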
Analysis • If we do more than a single character comparison in an iteration, then max(l, r) grows by 1 for each additional comparison, and max(l, r) never exceeds m. • O(m + log n) time
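The accelerated search can be sketched as follows, assuming a non-empty text and 0-based positions. All function names are illustrative. One simplification is worth stating loudly: the "whispered" values LCP(L, M) and LCP(M, R) are computed here as a minimum over a slice of adjacent LCPs, which costs linear time per step, so this sketch shows the case analysis but does not reach O(m + log n) as written. A real implementation would look these values up in a precomputed table.

```python
def extend(text, pos, pattern, k):
    """LCP of text[pos:] and pattern, given that their first k characters match."""
    while k < len(pattern) and pos + k < len(text) and text[pos + k] == pattern[k]:
        k += 1
    return k


def pattern_at_most(text, pos, pattern, k):
    """True if pattern <= text[pos:] on the first len(pattern) characters; k is their LCP."""
    if k == len(pattern):
        return True   # pattern is a prefix of the suffix
    if pos + k == len(text):
        return False  # the suffix ended first, so it is smaller
    return pattern[k] < text[pos + k]


def first_rank_at_least(text, sa, adj, pattern):
    """First rank whose suffix is >= pattern; adj[i] is the LCP of ranks i and i + 1."""
    n = len(sa)
    l = extend(text, sa[0], pattern, 0)
    if pattern_at_most(text, sa[0], pattern, l):
        return 0
    r = extend(text, sa[n - 1], pattern, 0)
    if not pattern_at_most(text, sa[n - 1], pattern, r):
        return n
    lo, hi = 0, n - 1  # invariant: suffix at lo < pattern <= suffix at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            known = min(adj[lo:mid])  # LCP(L, M); stand-in for a precomputed value
            k = extend(text, sa[mid], pattern, l) if known >= l else known
        else:
            known = min(adj[mid:hi])  # LCP(M, R); stand-in for a precomputed value
            k = extend(text, sa[mid], pattern, r) if known >= r else known
        if pattern_at_most(text, sa[mid], pattern, k):
            hi, r = mid, k
        else:
            lo, l = mid, k
    return hi


def occurrences(text, pattern):
    """All starting positions of pattern in a non-empty text, in increasing order."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    adj = [extend(text, sa[i], text[sa[i + 1]:], 0) for i in range(len(sa) - 1)]
    rank = first_rank_at_least(text, sa, adj, pattern)
    found = []
    while rank < len(sa) and text.startswith(pattern, sa[rank]):
        found.append(sa[rank])
        rank += 1
    return sorted(found)


print(occurrences("mississippi", "ssi"))   # [2, 5]
print(occurrences("mississippi", "issa"))  # []
```

When `known` is at least the current l (or r), the comparison with P resumes at that offset instead of at the first character, which is where the savings counted in the analysis come from.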
Linear time construction • Recursively? • Say we want to sort only the suffixes that start at even positions.
Change the alphabet • Every pair of characters is now a single character. • We are in fact sorting the suffixes of a string shorter by a factor of 2!
Change the alphabet a a b a a b $ 2 1 2
Divide into triples • S = y a b b a d a b b a d o $ (positions 0 to 12) • Triples starting at positions 1 mod 3 (1, 4, 7, 10): abb ada bba do$ • Triples starting at positions 2 mod 3 (2, 5, 8, 11): bba dab bad o$$
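Sort recursively 2/3 of the suffixes • Triples: abb ada bba do$ bba dab bad o$$ • Triple names (ranks among the distinct triples): 1 2 4 6 4 5 3 7 • Suffix array of the renamed string: 0 1 6 4 2 5 3 7 • Sorted sample suffixes, as positions in S: 1 4 8 2 7 5 10 11 • Ranks of the sample suffixes, written under their positions:
position   0  1  2  3  4  5  6  7  8  9 10 11 12
S          y  a  b  b  a  d  a  b  b  a  d  o  $
rank       -  1  4  -  2  6  -  5  3  -  7  8  -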
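The renaming step can be sketched as follows. The name `sort_sample_suffixes` is illustrative, and two simplifications are assumed: the recursive call is replaced by a direct sort of the reduced string, and the code has only been checked against the slide's example (a full implementation needs extra care at the boundary between the two groups of triples).

```python
def sort_sample_suffixes(text):
    """Order the suffixes starting at positions 1 and 2 (mod 3), via triple names."""
    padded = text + "$$$"  # '$' sorts before the letters used here
    starts = [i for i in range(len(text)) if i % 3 == 1]
    starts += [i for i in range(len(text)) if i % 3 == 2]
    triples = [padded[i:i + 3] for i in starts]
    # Name each triple by its rank among the distinct triples.
    names = {t: rank for rank, t in enumerate(sorted(set(triples)), start=1)}
    reduced = [names[t] for t in triples]
    # Stand-in for the recursive call: sort the suffixes of the reduced string directly.
    reduced_sa = sorted(range(len(reduced)), key=lambda k: reduced[k:])
    return reduced, [starts[k] for k in reduced_sa]


reduced, order = sort_sample_suffixes("yabbadabbado")
print(reduced)  # [1, 2, 4, 6, 4, 5, 3, 7]
print(order)    # [1, 4, 8, 2, 7, 5, 10, 11]
```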
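Sort the remaining third • For each position i with i mod 3 = 0, form the pair (S[i], rank of suffix i + 1): position 0 gives (y, 1), position 3 gives (b, 2), position 6 gives (a, 5), position 9 gives (a, 7). • Sorting the pairs gives (a, 5) (a, 7) (b, 2) (y, 1), so the remaining suffixes in sorted order are 6 9 3 0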
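A minimal sketch of this step, assuming the sorted sample suffixes are already available as a list of positions. The name `sort_remaining_third` is illustrative, and Python's `sorted` is used for brevity where a linear-time construction would use a radix sort.

```python
def sort_remaining_third(text, sample_order):
    """Sort the suffixes at positions 0 (mod 3) by (first character, rank of next suffix)."""
    rank = {pos: r for r, pos in enumerate(sample_order, start=1)}
    positions = range(0, len(text), 3)
    # rank.get(..., 0): a suffix starting past the end ranks before all others.
    return sorted(positions, key=lambda i: (text[i], rank.get(i + 1, 0)))


print(sort_remaining_third("yabbadabbado", [1, 4, 8, 2, 7, 5, 10, 11]))  # [6, 9, 3, 0]
```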
Merge • Sorted sample suffixes: 1 4 8 2 7 5 10 11 • Sorted remaining suffixes: 6 9 3 0 • Repeatedly compare the two front suffixes and output the smaller one:
compare  1 and 6: output 1
compare  4 and 6: output 6
compare  4 and 9: output 4
compare  8 and 9: output 9
compare  8 and 3: output 3
compare  8 and 0: output 8
compare  2 and 0: output 2
compare  7 and 0: output 7
compare  5 and 0: output 5
compare 10 and 0: output 10
compare 11 and 0: output 11
only 0 remains:   output 0
• Merged order: 1 6 4 9 3 8 2 7 5 10 11 0
Summary • Suffix array of S (positions 0 to 11): 1 6 4 9 3 8 2 7 5 10 11 0 • When comparing with a suffix whose index is 1 (mod 3), we compare the first character and break ties by the ranks of the following suffixes. • When comparing with a suffix whose index is 2 (mod 3), we compare the first character, then the next character if there is a tie, and finally the ranks of the suffixes that follow.
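The two comparison rules can be sketched as a merge routine. The names `merge_suffix_orders`, `sample` and `rest` are illustrative; the inputs are assumed to be the sorted sample suffixes (positions 1 and 2 mod 3) and the sorted remaining suffixes (positions 0 mod 3).

```python
def merge_suffix_orders(text, sample, rest):
    """Merge the sorted sample suffixes with the sorted suffixes at positions 0 (mod 3)."""
    n = len(text)
    rank = {pos: r for r, pos in enumerate(sample, start=1)}

    def rk(i):
        return rank.get(i, 0)  # a suffix starting past the end ranks lowest

    def ch(i):
        return text[i] if i < n else ""  # "" sorts before every character

    def sample_first(i, j):
        # i is a sample position, j is a position with j % 3 == 0.
        if i % 3 == 1:
            # One character, then ranks: i + 1 and j + 1 are both sample positions.
            return (ch(i), rk(i + 1)) < (ch(j), rk(j + 1))
        # Two characters, then ranks: i + 2 and j + 2 are both sample positions.
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(j), ch(j + 1), rk(j + 2))

    merged, a, b = [], 0, 0
    while a < len(sample) and b < len(rest):
        if sample_first(sample[a], rest[b]):
            merged.append(sample[a])
            a += 1
        else:
            merged.append(rest[b])
            b += 1
    return merged + sample[a:] + rest[b:]


print(merge_suffix_orders("yabbadabbado", [1, 4, 8, 2, 7, 5, 10, 11], [6, 9, 3, 0]))
# [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

Each comparison inspects at most two characters and two ranks, so the merge takes time linear in the number of suffixes.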
Compute LCPs • Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0 • The suffixes in sorted order, with their starting positions:
 1  abbadabbado$
 6  abbado$
 4  adabbado$
 9  ado$
 3  badabbado$
 8  bado$
 2  bbadabbado$
 7  bbado$
 5  dabbado$
10  do$
11  o$
 0  yabbadabbado$
Crucial observation • For two ranks i < j in the suffix array, LCP(i, j) = min {LCP(i, i+1), LCP(i+1, i+2), …, LCP(j-1, j)} • So the LCPs of consecutive suffixes determine the LCP of any pair.
Find LCPs of consecutive suffixes • Go over the suffixes in text order (0, 1, 2, …) and compute the LCP of each suffix with the suffix just before it in the suffix array. Suffix 1 is first in the suffix array, so it has no predecessor. • If suffix i shares h characters with its predecessor, then suffix i + 1 shares at least h - 1 characters with its own predecessor, so the comparison can start at offset h - 1. • The pairs below are written as starting positions, in the order they are computed:
LCP(11, 0)  = 0
LCP(8, 2)   = 1
LCP(9, 3)   = 0
LCP(6, 4)   = 1
LCP(7, 5)   = 0
LCP(1, 6)   = 5
LCP(2, 7)   = 4
LCP(3, 8)   = 3
LCP(4, 9)   = 2
LCP(5, 10)  = 1
LCP(10, 11) = 0
• In suffix array order, the LCPs of consecutive suffixes are: 5 1 2 0 3 1 4 0 1 0 0
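The text-order traversal can be sketched as follows; this procedure is commonly known as Kasai's algorithm. The name `consecutive_lcps` is illustrative, and the suffix array is built by a direct sort only to keep the example self-contained.

```python
def consecutive_lcps(text, sa):
    """lcp[r] is the LCP of the suffixes at ranks r and r + 1 of the suffix array."""
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):  # suffixes in text order
        if rank[i] == 0:
            h = 0       # first suffix in sorted order: no predecessor
            continue
        j = sa[rank[i] - 1]  # predecessor of suffix i in the suffix array
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i] - 1] = h
        if h > 0:
            h -= 1      # the next suffix keeps at least h - 1 matching characters
    return lcp


text = "yabbadabbado"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa)                          # [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
print(consecutive_lcps(text, sa))  # [5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
```

Because h drops by at most 1 per suffix and never exceeds n, the total number of character comparisons is linear.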
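We need more LCPs for search • LCPs of consecutive suffixes, in suffix array order: 5 1 2 0 3 1 4 0 1 0 0 • The binary search needs LCP(L, M) and LCP(M, R) for every interval it can visit. • There are only linearly many such intervals; calculate them all bottom up, each as the minimum of its two halves.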
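A minimal sketch of the bottom-up computation, assuming the binary search always splits an interval (lo, hi) at (lo + hi) // 2. The name `interval_lcps` and the dictionary layout are illustrative choices, not taken from the slides.

```python
def interval_lcps(adj):
    """Map each binary-search interval (lo, hi) of ranks to the LCP of its two end suffixes.

    adj[i] is the LCP of the suffixes at ranks i and i + 1.
    """
    table = {}

    def build(lo, hi):
        if hi - lo == 1:
            value = adj[lo]  # adjacent ranks: read off directly
        else:
            mid = (lo + hi) // 2
            # LCP(lo, hi) is the minimum over the two halves.
            value = min(build(lo, mid), build(mid, hi))
        table[(lo, hi)] = value
        return value

    if adj:
        build(0, len(adj))
    return table


table = interval_lcps([5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0])
print(table[(0, 2)])  # 1
print(len(table))     # 21
```

With 12 suffixes there are 11 adjacent pairs and 21 intervals in total, which matches the "linearly many" count.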
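Another example • S = a b c a b b c a $ (positions 1 to 9; in this example $ sorts after every letter) • Suffix array: 4 1 8 5 2 6 3 7 9 • The suffixes in sorted order, each with its LCP with the next suffix:
position  suffix      LCP with next
4         abbca$      2
1         abcabbca$   1
8         a$          0
5         bbca$       1
2         bcabbca$    3
6         bca$        0
3         cabbca$     2
7         ca$         0
9         $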