Linear-Time Search in Suffix Arrays • Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park • July 14, 2003
Suffix arrays • Suffix array of text T • The lexicographically sorted list of all suffixes of text T
Suffix arrays • Example for T = abbabaababbb# • The suffixes of T are abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13). • They are stored in lexicographical order. • # is a special character that is lexicographically smaller than every other character.
Suffix arrays • Example for T = abbabaababbb# • An actual suffix array stores only the starting positions of the suffixes in T, but for convenience we assume that the suffixes themselves are stored.
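The example above can be reproduced with a minimal sketch. It uses a naive sort rather than a linear-time construction, and the function name `build_suffix_array` is a hypothetical helper, not something from the slides. It relies on `#` sorting before the letters in ASCII, which matches the convention that `#` is the smallest character.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of all suffixes in sorted order."""
    # Naive construction by sorting the suffixes directly; fine for a
    # slide-sized example, but not the linear-time construction.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "abbabaababbb#"
SA = build_suffix_array(T)

# Print rank, starting position, and the suffix itself.
for rank, start in enumerate(SA, start=1):
    print(rank, start, T[start - 1:])
```

Only the list `SA` of starting positions would be kept in practice; the suffix strings are printed here for readability.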
Suffix arrays • Definition: s-suffixes • Suffixes starting with a string s • Examples: a-suffixes, ba-suffixes, …
Suffix arrays vs. Suffix trees • Construction time • Suffix Array = Suffix Tree • Space • Suffix Array = Suffix Tree • In practice, suffix arrays are more space efficient than suffix trees. • Search time (p = |P|, n = |T|) • Suffix Array: O(p log n), O(p + log n) • Suffix Tree: O(p log |Σ|)
Contribution • Construction time • Suffix Array = Suffix Tree • Space • Suffix Array = Suffix Tree • In practice, suffix arrays are more space efficient than suffix trees. • Search time (p = |P|, n = |T|) • Suffix Array: O(p log n), O(p + log n), O(p log |Σ|) • Suffix Tree: O(p log |Σ|)
The meaning of our contribution • Construction time • Suffix Array = Suffix Tree • Space • Suffix Array = Suffix Tree • In practice, suffix arrays are more space efficient than suffix trees. • Search time (p = |P|, n = |T|) • Suffix Array: O(p log n), O(p + log n), O(p log |Σ|) • Suffix Tree: O(p log |Σ|) • Search time: the suffix array can now match the suffix tree's bound. • Suffix arrays are more powerful than suffix trees.
Our search algorithm
Search in a suffix array • Definition: Search in a suffix array • Input • A pattern P • A suffix array of T • Output • All P-suffixes of T
Search in a suffix array • A search example: P = ab, T = abbabaababbb# • Find all ab-suffixes. • All ab-suffixes are neighbors in the suffix array.
Search in a suffix array • A search example: P = ab, T = abbabaababbb# • We only have to find the first and the last ab-suffixes, because all other ab-suffixes are stored between them.
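To illustrate why finding the first and last P-suffix is enough, here is a small sketch that locates both ends with the standard library's binary search. This is the classical binary-search approach, not the backward search developed in the following slides, and `p_suffix_range` is a hypothetical name. For clarity it truncates every suffix to |P| characters up front, which a real implementation would avoid.

```python
from bisect import bisect_left, bisect_right

T = "abbabaababbb#"
# Suffixes stored explicitly in sorted order, as the slides assume.
suffixes = sorted(T[i:] for i in range(len(T)))


def p_suffix_range(suffixes, P):
    """Return the 1-based (first, last) ranks of the P-suffixes, or None."""
    # Truncating sorted strings to len(P) characters keeps them sorted,
    # so the P-suffixes form one contiguous run of entries equal to P.
    prefixes = [s[:len(P)] for s in suffixes]
    lo = bisect_left(prefixes, P)
    hi = bisect_right(prefixes, P)
    return (lo + 1, hi) if lo < hi else None


first, last = p_suffix_range(suffixes, "ab")
print(first, last, suffixes[first - 1:last])
```

Every suffix between the two returned ranks starts with P, so reporting all occurrences is a single scan of that range.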
Related work • In developing our search algorithm, we adopt the idea suggested by Ferragina and Manzini (FOCS 2001). • Search a pattern in a file compressed by the Burrows-Wheeler compression algorithm • Search P from the last character to the first character of P • P = ababaaabb • We adopt this backward pattern searching idea.
Algorithm outline • Outline of our search algorithm: P = aba, T = abbabaababbb# • We find all aba-suffixes by searching P backward. • Our algorithm has p stages (in this case, 3 stages). • Stage 1: find all a-suffixes. • Stage 2: find all ba-suffixes. • Stage 3: find all aba-suffixes.
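The stages can be made concrete with a brute-force sketch that simply scans the sorted suffixes at every stage. It shows which range each stage is supposed to produce, not how the algorithm computes it efficiently; `stage_ranges` is a hypothetical helper.

```python
T = "abbabaababbb#"
P = "aba"
suffixes = sorted(T[i:] for i in range(len(T)))


def stage_ranges(suffixes, P):
    """For stage k, report the range of suffixes starting with the last k characters of P."""
    ranges = []
    for k in range(1, len(P) + 1):
        s = P[-k:]  # "a", then "ba", then "aba"
        ranks = [r for r, suf in enumerate(suffixes, start=1) if suf.startswith(s)]
        ranges.append((s, ranks[0], ranks[-1]) if ranks else (s, None, None))
    return ranges


for s, first, last in stage_ranges(suffixes, P):
    print(f"{s}-suffixes: ranks {first}..{last}")
```

For this example the a-suffixes occupy ranks 2 to 6, the ba-suffixes ranks 8 to 10, and the aba-suffixes ranks 3 to 4.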
Elaborate stage 2 • We explain a single stage by elaborating stage 2 (P = aba). • We find all ba-suffixes using the a-suffixes found in stage 1. • We find the first ba-suffix from the first a-suffix, and the last ba-suffix from the last a-suffix.
Elaborate stage 2 • We explain only how to find the first ba-suffix from the first a-suffix. • Finding the last ba-suffix is similar.
Elaborate stage 2 • P = aba • To find the first ba-suffix, we count the number of suffixes that precede the ba-suffixes in this suffix array.
Elaborate stage 2 • Suffixes preceding the ba-suffixes are divided into two categories. • A-type: suffixes starting with a character lexicographically smaller than b (#-suffixes, a-suffixes). • B-type: suffixes starting with the same character b and preceding the ba-suffixes. • We count A-type and B-type suffixes in different ways.
Count the number of A-type suffixes • The number of A-type suffixes = the number of #-suffixes and a-suffixes = the position of the last a-suffix.
Count the number of A-type suffixes • We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix. • With this array, we can count A-type suffixes in O(1) time.
Count the number of A-type suffixes • Array M • Space: O(|Σ|) • Time: O(n) (one scan)
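A minimal sketch of this array, built in one scan over the sorted suffixes. The summary slide calls the array M; the helper names `build_M` and `build_A_counts` and the derived table of A-type counts are assumptions made for illustration.

```python
T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))


def build_M(suffixes):
    """M[c] = 1-based position of the last c-suffix (one scan, O(n) time)."""
    M = {}
    for rank, suf in enumerate(suffixes, start=1):
        M[suf[0]] = rank  # later ranks overwrite earlier ones
    return M


def build_A_counts(M):
    """For each character c, the number of suffixes starting with a smaller character."""
    counts, last_of_previous = {}, 0
    for c in sorted(M):
        counts[c] = last_of_previous  # position of the last suffix before the c-suffixes
        last_of_previous = M[c]
    return counts


M = build_M(suffixes)
A_counts = build_A_counts(M)
print(M)         # positions of the last #-, a-, and b-suffix
print(A_counts)  # A-type counts, one O(1) lookup per character
```

With the counts precomputed per character, the A-type count for a stage is a single dictionary lookup, e.g. `A_counts["b"]` for the ba-suffixes.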
Count the number of B-type suffixes • B-type suffixes: the b-suffixes preceding the ba-suffixes.
Count the number of B-type suffixes • B-type suffixes: the b-suffixes preceding the ba-suffixes. • Removing the leftmost b from a B-type suffix yields a suffix that appears in the suffix subarray preceding the a-suffixes found in stage 1.
Count the number of B-type suffixes • The number of B-type suffixes is the number of suffixes • that lie in the suffix subarray preceding the a-suffixes, and • whose previous character is b. • We count this with array N. • Let U be the conceptual array of the previous characters of the suffixes.
Count the number of B-type suffixes • Array N • Entry N[i,b] stores the number of suffixes in the suffix subarray SA[1,i] whose previous character is b.
Count the number of B-type suffixes • We can count B-type suffixes in O(1) time by accessing one entry of N.
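Putting the A-type and B-type counts together gives one stage of the backward search. The sketch below uses a full table for N (one row per character) and is an illustration under assumptions: the names `smaller`, `N`, and `backward_search` are hypothetical, the previous character of the whole text is taken to be `#` (Python's `T[-1]` wraps around), and the pattern is assumed not to contain `#`.

```python
T = "abbabaababbb#"
n = len(T)
order = sorted(range(n), key=lambda i: T[i:])  # 0-based starting positions
U = [T[i - 1] for i in order]                  # previous characters; T[-1] is '#'
alphabet = sorted(set(T))

# N[c][i] = number of suffixes in SA[1..i] whose previous character is c.
N = {c: [0] * (n + 1) for c in alphabet}
for i, ch in enumerate(U, start=1):
    for c in alphabet:
        N[c][i] = N[c][i - 1] + (ch == c)

# smaller[c] = number of suffixes starting with a character smaller than c (A-type count).
smaller, total = {}, 0
for c in alphabet:
    smaller[c] = total
    total += T.count(c)


def backward_search(P):
    """Return the 1-based (first, last) ranks of the P-suffixes, or None."""
    first, last = 1, n  # before stage 1, every suffix qualifies
    for c in reversed(P):
        if c not in smaller:
            return None
        # A-type count plus B-type count gives the number of preceding suffixes.
        first, last = smaller[c] + N[c][first - 1] + 1, smaller[c] + N[c][last]
        if first > last:
            return None
    return first, last


print(backward_search("aba"))
```

For P = aba the three stages produce the ranges (2, 6), (8, 10), and (3, 4), so the aba-suffixes are the third and fourth entries of the suffix array. The full table costs one row per character, which motivates the alternatives on the next slides.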
Array N • Space: O(n|Σ|) • An alternative way • Space: O(n) • O(log |Σ|) time for counting B-type suffixes.
Query for N[i,b] • Counting B-type suffixes • O(log n) time • O(log |Σ|) time
Query for N[i,b]: O(log n) time • In the O(log n)-time algorithm, we generate an array whose i-th entry stores the location of the i-th b in U.
Query for N[i,b]: O(log n) time • Counting the suffixes in SA[1,8] whose previous character is b = counting the b's in U[1,8].
Query for N[i,b]: O(log n) time • Find the largest value not exceeding 8 in this array.
Query for N[i,b]: O(log n) time • To find 7 in this array, we perform a binary search: O(log n) time.
Query for N[i,b]: O(log n) time • The index of 7 in the array, which is 5, is the number of b's in U[1,8].
Query for N[i,b]: O(log n) time • In general, we need one such array for each character (#, a, b): O(n) space in total.
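The O(log n) query can be sketched with one sorted list of positions per character and a binary search from the standard library. The names `positions` and `N_query` are hypothetical, and U is rebuilt here from the example text under the assumption that the previous character of the whole text is `#`.

```python
from bisect import bisect_right

T = "abbabaababbb#"
order = sorted(range(len(T)), key=lambda i: T[i:])
U = [T[i - 1] for i in order]  # previous characters; T[-1] wraps to '#'

# positions[c][k-1] = location (1-based) of the k-th c in U.
positions = {}
for i, ch in enumerate(U, start=1):
    positions.setdefault(ch, []).append(i)


def N_query(i, c):
    """Number of c's in U[1..i], found by binary search in O(log n) time."""
    # bisect_right returns how many stored locations are <= i.
    return bisect_right(positions.get(c, []), i)


print(positions["b"])    # locations of the b's in U
print(N_query(8, "b"))   # b's in U[1,8]
```

For the slide's example, the largest location not exceeding 8 is 7, and it is the fifth entry of the list, so the query returns 5. Every position of U appears in exactly one list, which is why the lists take O(n) space in total.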
Query for N[i,b] • O(log n) time • O(log |Σ|) time
Query for N[i,b]: O(log |Σ|) time • Divide U into |Σ|-sized blocks. • For the last character of each block, we compute the entries of N.
Query for N[i,b]: O(log |Σ|) time • For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm. • Binary search within a block takes O(log |Σ|) time. • Still O(n) space in total.
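One way to realize the block scheme is sketched below. It is an illustration under assumptions rather than the authors' exact layout: `build_blocks` and `N_query` are hypothetical names, counts of N are stored at the end of every |Σ|-sized block, and each block keeps per-character sorted position lists for the binary search.

```python
from bisect import bisect_right


def build_blocks(U, alphabet):
    """Store N at every block boundary plus per-block position lists."""
    sigma = len(alphabet)
    boundary = [dict.fromkeys(alphabet, 0)]  # boundary[b] = counts before block b
    in_block = []                            # in_block[b][c] = sorted positions of c in block b
    for start in range(0, len(U), sigma):
        counts = dict(boundary[-1])
        occ = {}
        for i in range(start, min(start + sigma, len(U))):
            counts[U[i]] += 1
            occ.setdefault(U[i], []).append(i + 1)  # 1-based position in U
        boundary.append(counts)
        in_block.append(occ)
    return sigma, boundary, in_block


def N_query(blocks, i, c):
    """Number of c's in U[1..i]: one boundary lookup plus a search in one block."""
    sigma, boundary, in_block = blocks
    b = i // sigma                   # number of complete blocks inside U[1..i]
    count = boundary[b].get(c, 0)
    if i % sigma:                    # position i lies strictly inside block b
        count += bisect_right(in_block[b].get(c, []), i)
    return count


T = "abbabaababbb#"
order = sorted(range(len(T)), key=lambda i: T[i:])
U = [T[i - 1] for i in order]        # assumes the text is preceded by '#'
blocks = build_blocks(U, sorted(set(T)))
print(N_query(blocks, 8, "b"))
```

There are about n/|Σ| boundaries with |Σ| counters each, and every position of U appears in exactly one in-block list, so the space stays O(n). Each in-block list holds at most |Σ| positions, so the binary search costs O(log |Σ|).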
Summary • p stages • Each stage • Count A-type suffixes • Time: O(1) • Space: O(n) for array M • Count B-type suffixes • Time: O(log |Σ|) • Space: O(n) for computing the value of an entry of N • In total, O(p log |Σ|) time with O(n) space.
Conclusion • In a suffix array, one can choose the O(p + log n)-time or the O(p log |Σ|)-time search algorithm depending on the alphabet size. • Suffix arrays are more powerful than suffix trees.