1. Suffix Sorting & Related Algorithmics Martin Farach-Colton
Rutgers University
USA
2. What is Suffix Sorting? Given a string, give the sorted order of its suffixes.
Why would we want that?
More on this later.
3. Example: Mississippi$
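As a baseline, the definition can be turned directly into code. This is a brute-force sketch, not the algorithm developed in these slides; the name `naive_suffix_order`, the lowercase input, and 0-indexed positions are illustrative choices.

```python
def naive_suffix_order(s):
    """Return the start positions of the suffixes of s in sorted order."""
    # Sorting explicit suffix copies can cost O(n^2 log n) character
    # comparisons; this only pins down what "suffix sorting" means.
    return sorted(range(len(s)), key=lambda i: s[i:])

text = "mississippi$"
order = naive_suffix_order(text)
for i in order:
    print(i, text[i:])
```

The `$` sentinel sorts before every letter in ASCII, so the suffix consisting of `$` alone comes first.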
4. How fast can we sort? Sorting suffixes of a string can be no faster than sorting the characters of a string.
Can we match the sorting lower bound?
5. Suffix Sorting We are sorting strings, so Radix Sort is natural.
This yields an algorithm with time O(n^2) to O(n^2 log n), depending on assumptions on sortability of characters.
O(n^2 log n) in the comparison model.
O(n^2) for small integers. In between in the general word model.
We can do better by combining Merge Sort with Radix Sort.
6. Building Blocks Range Reduction
Radix Step
Chunking
7. Range Reduction Observation: If we apply a monotone function to the characters, the sorted order doesn’t change.
8. Example: Mississippi$
9. Range Reduction Our only range reduction operation will be:
Replace every character by rank in sorted order of characters.
After RR, a length-n string will be in [n]^n.
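A minimal sketch of this operation, assuming 1-based ranks over the distinct characters (the function name `range_reduce` is made up for illustration):

```python
def range_reduce(chars):
    """Replace every character by its rank among the distinct characters."""
    rank = {c: r for r, c in enumerate(sorted(set(chars)), start=1)}
    return [rank[c] for c in chars]

reduced = range_reduce("mississippi$")
print(reduced)  # [3, 2, 5, 5, 2, 5, 5, 2, 4, 4, 2, 1]
```

Because the rank map is monotone, sorting the suffixes of `reduced` gives the same order as sorting the suffixes of the original string.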
10. RR helps running time Radix Sort on raw input might take O(n^2 log n) time.
RR takes at most O(n log n) time.
Radix Sort on small-integer inputs takes O(n^2) time.
Total time is O(n^2).
11. Radix Step Recall that Radix Sort proceeds in steps:
Lexicographically sort the last i characters of each string.
Stably sort by preceding character. Now strings are lexicographically sorted by last i+1 characters.
12. Using Radix Step If we have recursively sorted some subset of suffixes of a string:
1 step of radix will sort the preceding suffixes.
Ex: If you have sorted suffixes at odd positions, then you get suffixes at even positions.
13. Example: 214414413315
14. Example: 214414413315
15. Radix Step We normally think of Radix Sort as sorting one set of strings.
But with suffixes (which are interrelated), we can use a Radix Step to get sorted orders of one set from another.
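A sketch of one such Radix Step, under some assumptions: positions are 0-indexed (so Python's odd indices are the slides' even positions), `radix_step` is a hypothetical name, and the brute-force sort is used only to produce the input and a reference answer.

```python
def radix_step(s, sorted_starts):
    """Given start positions of suffixes in sorted order, return the
    positions one to the left (p - 1), sorted as suffixes."""
    buckets = {}
    for p in sorted_starts:                # visit suffixes in sorted order
        if p > 0:
            # suffix p-1 is the character s[p-1] followed by suffix p
            buckets.setdefault(s[p - 1], []).append(p - 1)
    # Concatenating the buckets in character order is a stable sort by s[p-1].
    return [q for c in sorted(buckets) for q in buckets[c]]

s = "mississippi$"
full = sorted(range(len(s)), key=lambda i: s[i:])   # brute-force reference
odd = [p for p in full if p % 2 == 1]
even = radix_step(s, odd)
print(even)
```

If the last position of `s` is among the targets, its successor is the empty suffix; a caller would need to pass `len(s)` at the front of `sorted_starts` to cover it. The demo avoids that case.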
16. Where are we now? Step 1: Recursively sort odd suffixes.
How? And how is it recursive? A recursive step must sort every suffix! We’ll get to that.
Step 2: Get even suffixes in linear time.
By Radix Step.
Step 3: Merge!
17. Merging is tricky... Farach ('97) gave the first linear-time solution.
This yields an optimal suffix sorting routine.
It’s a fun algorithm, but highly unintuitive.
Or so I’ve been told!
18. Chunking Let’s solve the recursion problem first.
Given two integers i and j, let <i,j> be their bit concatenation.
If i, j ∈ [n], then <i,j> ∈ [n^2].
Given a string S = (s_1, s_2, ..., s_n)
Let S’ = (<s_1,s_2>, <s_3,s_4>, ..., <s_{n-1},s_n>)
19. Chunking + Recursion: I Observation: The order of the odd suffixes of S = (s_1, s_2, ..., s_n) is the same as the order of all suffixes of S’ = (<s_1,s_2>, <s_3,s_4>, ..., <s_{n-1},s_n>)
Since bit concatenation preserves lexicographic ordering.
20. Example: 214414413315
21. Chunking + Recursion: II Chunking+Range Reduction = Recursion
Input is in [n]^n.
Chunked input is in [n^2]^(n/2).
Range-reduced chunked input is in [n/2]^(n/2).
So now problem instance is half the size and we can recurse.
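The chunk-then-reduce step can be sketched as follows. Assumptions: Python tuples stand in for bit concatenation (they compare lexicographically, which is the property needed), positions are 0-indexed, and all function names are illustrative.

```python
def range_reduce(seq):
    rank = {c: r for r, c in enumerate(sorted(set(seq)), start=1)}
    return [rank[c] for c in seq]

def chunk(seq, pad=0):
    """Pair up consecutive symbols: (s1,s2), (s3,s4), ...; pad an odd tail."""
    seq = list(seq)
    if len(seq) % 2:
        seq.append(pad)                  # 0 is below every rank
    return [(seq[i], seq[i + 1]) for i in range(0, len(seq), 2)]

def brute_order(seq):
    return sorted(range(len(seq)), key=lambda i: list(seq[i:]))

s = range_reduce("mississippi$")         # string over [n]
half = range_reduce(chunk(s))            # half the length, again over small integers
starts = [2 * i for i in brute_order(half)]
print(half, starts)
```

Here `starts` lists the slides' odd positions (Python indices 0, 2, 4, ...) in sorted suffix order, obtained from a string of half the length.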
22. Example: 214414413315
23. Recall Basic Operations Range Reduction
Radix Step
Chunking
Now we are ready for the whole algorithm.
24. Suffix Sorting Step 1: Chunk + Range Reduction.
Recurse on new string.
Get sorted order of odd suffixes.
Step 2: Radix Step.
Get sorted order of even suffixes.
Step 3: Merge!
We still don’t know how to do this.
25. The Trouble with Merging Let’s start merging. We need to see which is smaller: the smallest odd suffix s_{2i-1} or the smallest even suffix s_{2j}.
We can compare the first character.
If the first characters tie, we must then compare s_{2i} with s_{2j+1}.
But this is another odd/even comparison.
26. The difference between 3 and 2 It’s possible to merge the lists.
But Kärkkäinen & Sanders showed the cute way to merge.
They modified the recursion to make the merge easy.
27. Mod 3 Recursion Given a string S = (s_1, s_2, ..., s_n)
Let S1 = (<s_1,s_2,s_3>, <s_4,s_5,s_6>, ..., <s_{n-2},s_{n-1},s_n>)
Let S2 = (<s_2,s_3,s_4>, <s_5,s_6,s_7>, ..., <s_{n-4},s_{n-3},s_{n-2}>)
Let O12 be the sorted order of the suffixes at positions congruent to 1 or 2 mod 3.
You get this recursively by sorting the suffixes of the concatenation S1S2.
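A small check of this construction, with the recursive call replaced by a brute-force sort. Assumptions: Python indices 0 and 1 play the role of the slides' positions 1 and 2, tuples stand in for bit concatenation, short tails are zero-padded, and the string ends in a unique sentinel; the helper names are illustrative.

```python
def range_reduce(seq):
    rank = {c: r for r, c in enumerate(sorted(set(seq)), start=1)}
    return [rank[c] for c in seq]

def triples(s, offset):
    """Triples starting at indices offset, offset+3, ...; zero-padded at the end."""
    padded = list(s) + [0, 0]
    return [tuple(padded[i:i + 3]) for i in range(offset, len(s), 3)]

s = range_reduce("mississippi$")
# Start index in s of each symbol of the concatenation S1S2.
starts = list(range(0, len(s), 3)) + list(range(1, len(s), 3))
s12 = range_reduce(triples(s, 0) + triples(s, 1))    # length about 2n/3
order = sorted(range(len(s12)), key=lambda i: s12[i:])  # stand-in for recursion
o12 = [starts[i] for i in order]
print(s12, o12)
```

The sentinel makes the last triple of each half unique, so a comparison never runs past the end of S1 into S2.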
28. Radix Step x 2 We have O12 from the recursion.
One Radix Step gives us O01
Another Radix Step gives us O02
Each pair of suffixes now appears together in some list, so their relative order is known.
Each suffix appears in two lists.
29. Merging... at last! To merge O12, O01, and O02 note that:
The smallest suffix is the first on two lists.
Pop the smallest.
Now the next smallest is the first on two lists.
Etc.
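The merge described above can be sketched directly. Assumptions: the three input lists are built here by brute force purely for the demo, residue classes are taken on 0-indexed positions, and `merge_three` is a hypothetical name.

```python
from collections import deque

def merge_three(o12, o01, o02):
    """Merge three sorted lists in which every suffix appears exactly twice."""
    lists = [deque(o12), deque(o01), deque(o02)]
    merged = []
    while any(lists):
        heads = [l[0] for l in lists if l]
        # The smallest remaining suffix is at the head of both of its lists.
        smallest = next(h for h in heads if heads.count(h) == 2)
        merged.append(smallest)
        for l in lists:
            if l and l[0] == smallest:
                l.popleft()
    return merged

s = "mississippi$"
full = sorted(range(len(s)), key=lambda i: s[i:])    # reference order
o12 = [p for p in full if p % 3 != 0]
o01 = [p for p in full if p % 3 != 2]
o02 = [p for p in full if p % 3 != 1]
print(merge_three(o12, o01, o02) == full)
```

Each round inspects at most three heads and pops two entries, so the merge is linear in the total list length.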
30. Total time T(n) to sort the suffixes of a string in [n]^n
T(n) = recursion + 2*radix + merging
T(n) = T(2n/3) + O(n) + O(n)
T(n) = O(n)
So the initial Range Reduction step into the integer alphabet is the bottleneck.
So this algorithm is optimal for any alphabet.
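The step from the recurrence to T(n) = O(n) is a geometric series. Writing the linear work per level as at most cn, where the constant c is a symbol introduced here for illustration:

```latex
T(n) \;\le\; cn + T\!\left(\tfrac{2n}{3}\right)
     \;\le\; cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     \;=\; 3cn \;=\; O(n).
```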
31. Why did we want to sort suffixes anyway? It’s easy to go from Suffix Sorting to Suffix Arrays...
32. Suffix Arrays A suffix array is:
The sorted order of suffixes
Their pairwise adjacent lcp’s.
They are handy as space efficient indexes.
How do we compute the lcp’s from the suffix order?
33. Example: Mississippi$
34. Example: Mississippi$
35. Example: Mississippi$
36. Example: Mississippi$
37. Example: Mississippi$
38. Example: Mississippi$
40. What’s a suffix tree? Compacted trie of all suffixes of a string.
What’s it good for?
Too many things to enumerate...
See the stringology literature of the last 30 years!
41. Example: Mississippi$
42. Example: Mississippi$
43. Example: Mississippi$
44. Example: Mississippi$
45. Example: Mississippi$
46. Example: Mississippi$
47. Example: Mississippi$
48. Example: Mississippi$
49. Example: Mississippi$
50. Example: Mississippi$
51. Example: Mississippi$
52. Example: Mississippi$
53. Brute-force Algorithm We insert one suffix at a time.
Each suffix insertion takes O(n) time.
Total time is O(n^2)
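A minimal sketch of the brute-force construction, assuming the string ends in a unique sentinel and storing each edge label as a (start, end) pair of indices into the string. The class and function names are illustrative.

```python
class Node:
    def __init__(self, suffix=None):
        self.children = {}     # first character -> (start, end, child)
        self.suffix = suffix   # start position for leaves, None otherwise

def naive_suffix_tree(s):
    """Insert the suffixes of s one at a time, walking down from the root."""
    n, root = len(s), Node()
    for i in range(n):
        node, j = root, i
        while True:
            c = s[j]
            if c not in node.children:        # fell off at a node: new leaf
                node.children[c] = (j, n, Node(i))
                break
            start, end, child = node.children[c]
            k = 0
            while start + k < end and s[start + k] == s[j + k]:
                k += 1
            if start + k == end:              # whole edge matched: descend
                node, j = child, j + k
            else:                             # mismatch inside the edge: split
                mid = Node()
                mid.children[s[start + k]] = (start + k, end, child)
                mid.children[s[j + k]] = (j + k, n, Node(i))
                node.children[c] = (start, start + k, mid)
                break
    return root

def leaves_in_order(node):
    """Depth-first traversal, visiting children in character order."""
    if node.suffix is not None:
        return [node.suffix]
    return [p for c in sorted(node.children)
            for p in leaves_in_order(node.children[c][2])]

print(leaves_in_order(naive_suffix_tree("mississippi$")))
```

Reading the leaves left to right gives the sorted suffix order, which is a convenient sanity check on the tree.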
54. How low can we go? There is a leaf for each suffix
That’s why we put a $ at end.
Each internal node is branching
Because we have a compacted trie.
Each edge has a constant-size label.
Just a pointer to string + length.
So suffix tree has size O(n).
55. Weiner’s Algorithm Weiner showed a suffix-tree construction in O(n) time for binary alphabet.
Still adds suffixes one at a time.
But gets a speed-up on insertions by exploiting Suffix Links.
56. Defs: LCP Let l_i be the leaf for suffix S[i,n].
Let LCP(i,j) be the length of the longest common prefix of S[i,n] and S[j,n].
Ex: LCP(2,5) = 4 for Mississippi (the common prefix is "issi").
57. Suffix Links Claim: If some node in a suffix tree has string aα, for a a character and α a string, then some node in the suffix tree has string α.
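The claim can be checked by brute force on a small input. The sketch assumes a unique end sentinel, so that the internal nodes of the suffix tree spell exactly the substrings followed by at least two distinct characters; `branching_strings` is a hypothetical helper.

```python
def branching_strings(s):
    """Strings spelled by internal nodes: substrings of s that are
    followed by at least two distinct characters (root = empty string)."""
    followers = {}
    n = len(s)
    for i in range(n):
        for j in range(i, n):          # substring s[i:j] is followed by s[j]
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}

nodes = branching_strings("mississippi$")
# Dropping the first character of a node's string lands on another node.
print(all(w[1:] in nodes for w in nodes if w))
```

The reason is visible in the code: any two characters that follow aα also follow α, so α is branching whenever aα is.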
58. Example: Mississippi$
59. Example: Mississippi$
60. Example: Mississippi$
61. Example: Mississippi$
62. Suffix Links: Proof How do we know a suffix link always exists?
Maybe the link points into the middle of an edge...
63. Suffix Links: Proof Cartoon
64. Adding Suffix Links Adding each suffix link is just a least common ancestors computation
This takes O(n) preprocessing + O(1) per lca.
Adding all suffix links is O(n) time.
65. Suffix Links Uses The first use of suffix links was to speed up suffix tree construction.
If you keep suffix links on the partially constructed tree, you don’t have to start every insertion from the root.
Yields a linear time construction!
66. So we’re done! The data structure has linear size.
So linear lower bound.
The algorithm runs in linear time.
So linear upper bound.
Declare victory and go home!
67. Alphabets everywhere. This algorithm runs in linear time for binary alphabet.
For an alphabet of size s, it runs in O(n log s).
68. Lower Bounds Element uniqueness reduces to suffix tree construction, so in the algebraic decision tree model, the lower bound is Omega(n log n).
If we require edges to be sorted, then sorting is lower bound.
69. Open Problem Is there an interesting lower bound for suffix trees where each node may have an arbitrary order of children?
Such a tree is no good as an index, but just fine for LCA applications of suffix trees.
70. Back to Suffix Links Fun fact: The suffix links form a tree, the SL-tree. The length of the string at a node is its depth in the SL-tree.
71. Example: Mississippi$
72. Stripping a Suffix Tree: I If we are given a suffix tree with no edge labels, we can reconstruct the labels in linear time:
Compute Suffix Links (by LCAs).
Compute Depth in SL-tree.
Each node selects a leaf and computes offset within its descendant suffix.
73. Example: Mississippi$
74. Example: Mississippi$
75. Example: Mississippi$
76. Suffix Arrays A suffix array is:
The sorted order of suffixes
Their pairwise adjacent lcp’s.
77. Example: Mississippi$
78. Suffix Arrays They are much more space efficient than suffix trees.
You can still use them as indexes.
You can build them as fast as suffix trees...
By dfs of suffix tree. How about w/o suffix trees?
79. Building suffix trees from suffix arrays You can build the suffix tree left to right by keeping a stack of the right-most path.
This gives you the shape of the tree.
Then add link labels.
Total construction is linear for any alphabet.
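The stack-based construction can be sketched as below. Assumptions: nodes are plain dicts carrying a string depth, `lcp[r]` is the LCP of the suffixes ranked r-1 and r (with `lcp[0] = 0`), the string ends in a unique sentinel, and edge labels are left out because they follow from the depths. All names are illustrative.

```python
def tree_from_suffix_array(sa, lcp, n):
    """Build the suffix tree shape left to right from sorted order + LCPs."""
    root = {"depth": 0, "children": []}
    stack = [root]                             # right-most path, root first
    for r, p in enumerate(sa):
        last = None
        while stack[-1]["depth"] > lcp[r]:     # climb until depth <= lcp
            last = stack.pop()
        if stack[-1]["depth"] < lcp[r]:        # branch point is inside an edge
            mid = {"depth": lcp[r], "children": [last]}
            stack[-1]["children"][-1] = mid    # splice mid in above `last`
            stack.append(mid)
        leaf = {"depth": n - p, "children": [], "suffix": p}
        stack[-1]["children"].append(leaf)
        stack.append(leaf)
    return root

def internal_depths(node):
    """String depths of the internal nodes, in left-to-right preorder."""
    if not node["children"]:
        return []
    return [node["depth"]] + [d for ch in node["children"]
                              for d in internal_depths(ch)]

sa = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]    # "mississippi$", 0-indexed
lcp = [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
tree = tree_from_suffix_array(sa, lcp, 12)
print(internal_depths(tree))
```

Every node is pushed and popped at most once, which is where the linear bound comes from.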
80. Example: Mississippi$
81. Example: Mississippi$
82. Example: Mississippi$
83. Example: Mississippi$
84. Example: Mississippi$
85. Even Less Information! You don’t even need LCP information to build the suffix tree.
You can compute LCP array from suffix order array in linear time.
Big Conclusion: Given sorted order of suffixes, you can build the suffix array in linear time for any alphabet.
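One well-known linear-time way to get the LCP array from the suffix order is the method usually attributed to Kasai et al. The slides do not say which method is intended, so treat this as one possible sketch; positions are 0-indexed and `lcp[r]` compares the suffixes ranked r-1 and r.

```python
def lcp_from_order(s, sa):
    """Compute adjacent LCPs from the sorted suffix order in O(n) time."""
    n = len(s)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n
    h = 0                                  # LCP carried over from suffix p-1
    for p in range(n):                     # process suffixes in text order
        r = rank[p]
        if r == 0:
            h = 0
            continue
        q = sa[r - 1]                      # predecessor in sorted order
        while p + h < n and q + h < n and s[p + h] == s[q + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                         # next suffix loses one character
    return lcp

s = "mississippi$"
sa = sorted(range(len(s)), key=lambda i: s[i:])   # brute-force order for the demo
print(lcp_from_order(s, sa))
```

The counter `h` drops by at most one per step and never exceeds n, so the inner loop does O(n) work in total.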
86. How do we sort suffixes? Mergesort!
Careful choice of recursion.
Careful merging.
87. How do we sort suffixes? Key Operations:
Range Reduction.
Radix Sorting.
Chunking.