
Linear-Time Search in Suffix Arrays

July 14, 2003. Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park. Linear-Time Search in Suffix Arrays. Suffix array of text T: the lexicographically sorted list of all suffixes of text T. Example for T = abbabaababbb#


Presentation Transcript


  1. July 14, 2003 • Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park • Linear-Time Search in Suffix Arrays

  2. Suffix arrays • Suffix array of text T • The lexicographically sorted list of all suffixes of text T

  3. Suffix arrays • Example for T = abbabaababbb# • The suffixes of T, abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13), are stored in lexicographical order. • # is a special character that is lexicographically smaller than every other character.

  4. Suffix arrays • Example for T = abbabaababbb# • The suffixes of T are abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13). • An actual suffix array stores only the starting positions of the suffixes in T, but for convenience we assume that the suffixes themselves are stored.
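The example above can be reproduced with a few lines of Python. This is only a minimal sketch: it sorts the suffixes naively rather than using a linear-time construction, and the function name `build_suffix_array` is made up for illustration. It relies on `#` comparing smaller than the letters, which holds for ASCII.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of all suffixes, in sorted order."""
    # Naive construction by sorting the suffixes themselves.
    # Fine for a 13-character example, far too slow for real inputs.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "abbabaababbb#"
SA = build_suffix_array(T)

# Print rank, starting position, and the suffix stored at that rank.
for rank, start in enumerate(SA, start=1):
    print(rank, start, T[start - 1:])
```

For this T the starting positions come out as 13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10.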

  5. Suffix arrays • Definition: s-suffixes • Suffixes starting with string s • a-suffixes, ba-suffixes, …

  6. Suffix arrays vs. Suffix trees • Construction time • Suffix Array = Suffix Tree • Space • Suffix Array = Suffix Tree • In practice, suffix arrays are more space-efficient than suffix trees. • Search time (p = |P|, n = |T|) • Suffix Array: [bounds missing from transcript] • Suffix Tree: [bound missing from transcript]

  7. Contribution • Construction time • Suffix Array = Suffix Tree • Space • Suffix Array = Suffix Tree • In practice, suffix arrays are more space-efficient than suffix trees. • Search time • Suffix Array: [bounds missing from transcript] • Suffix Tree: [bound missing from transcript]

  8. The meaning of our contribution • Construction time • Suffix Array = Suffix Tree • Space • Suffix Array = Suffix Tree • In practice, suffix arrays are more space-efficient than suffix trees. • Search time • Suffix Array: [bounds missing from transcript] • Suffix Tree: [bound missing from transcript] • Search time: SA [relation missing from transcript] ST

  9. The meaning of our contribution • Construction time • Suffix Array = Suffix Tree • Space • Suffix Array = Suffix Tree • In practice, suffix arrays are more space-efficient than suffix trees. • Search time • Suffix Array: [bounds missing from transcript] • Suffix Tree: [bound missing from transcript] • Search time: SA [relation missing from transcript] ST • Suffix arrays are more powerful than suffix trees.

  10. Our search algorithm

  11. Search in a suffix array • Definition: Search in a suffix array • Input • A pattern P • A suffix array of T • Output • All P-suffixes of T

  12. Search in a suffix array • A search example • P = ab, T = abbabaababbb# • Find all ab-suffixes. • All ab-suffixes are neighbors in the suffix array.

  13. Search in a suffix array • A search example • P = ab, T = abbabaababbb# • We only have to find the first and the last ab-suffixes, because the other ab-suffixes are stored between them.
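The observation that only the first and last ab-suffixes are needed can be illustrated with ordinary binary search. The sketch below is the classical approach, not the algorithm of these slides; `find_range` is a hypothetical helper, and it materializes the truncated suffixes only to keep the code short.

```python
from bisect import bisect_left, bisect_right


def find_range(text, sa, pattern):
    """Return (first, last) 1-based ranks of the pattern-suffixes, or None."""
    # Truncating sorted suffixes to len(pattern) characters keeps them sorted,
    # so ordinary binary search applies. Building this list costs O(n * p);
    # a real implementation would compare against the text in place.
    heads = [text[i - 1:i - 1 + len(pattern)] for i in sa]
    first = bisect_left(heads, pattern) + 1   # rank of the first match
    last = bisect_right(heads, pattern)       # rank of the last match
    return (first, last) if first <= last else None


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
print(find_range(T, SA, "ab"))  # (3, 6)
```

Every rank between the two returned values holds an ab-suffix, so the pair describes the whole answer.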

  14. Related work • In developing our search algorithm, we adopt an idea suggested by Ferragina and Manzini (FOCS 2001). • They search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm. • They scan P from its last character to its first, e.g. P = ababaaabb. • We adopt this backward pattern searching idea.

  15. Algorithm outline • Outline of our search algorithm • P = aba, T = abbabaababbb# • We find all aba-suffixes by searching P backward. • Our algorithm has p stages (in this case, 3 stages).

  16. Algorithm outline • Outline of our search algorithm • P = aba, T = abbabaababbb# • We find all aba-suffixes by searching P backward. • Stage 1: find all a-suffixes.

  17. Algorithm outline • Outline of our search algorithm • P = aba, T = abbabaababbb# • We find all aba-suffixes by searching P backward. • Stage 1: find all a-suffixes. • Stage 2: find all ba-suffixes.

  18. Algorithm outline • Outline of our search algorithm • P = aba, T = abbabaababbb# • We find all aba-suffixes by searching P backward. • Stage 1: find all a-suffixes. • Stage 2: find all ba-suffixes. • Stage 3: find all aba-suffixes.
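To see what each stage is expected to produce, the ranges can be computed by brute force. This sketch does not show how a stage is carried out efficiently; it only lists the target ranges for P = aba, and `stage_ranges` is a name invented here.

```python
def stage_ranges(text, pattern):
    """List (s, first, last) for s = each suffix of pattern, shortest first."""
    sa = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
    ranges = []
    for k in range(len(pattern) - 1, -1, -1):
        s = pattern[k:]  # 'a', then 'ba', then 'aba' for pattern 'aba'
        # Brute force: collect the ranks whose suffix starts with s.
        ranks = [r for r, i in enumerate(sa, start=1)
                 if text.startswith(s, i - 1)]
        ranges.append((s, ranks[0], ranks[-1]) if ranks else (s, None, None))
    return ranges


for s, first, last in stage_ranges("abbabaababbb#", "aba"):
    print(f"{s}-suffixes occupy ranks {first}..{last}")
```

For the running example the stages yield ranks 2..6, 8..10, and 3..4.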

  19. Elaborate stage 2 • A stage, explained by elaborating stage 2 • P = aba • We find all ba-suffixes using the a-suffixes found in stage 1. • We find the first ba-suffix from the first a-suffix and the last ba-suffix from the last a-suffix.

  20. Elaborate stage 2 • A stage, explained by elaborating stage 2 • P = aba • We explain only how to find the first ba-suffix from the first a-suffix. Finding the last ba-suffix is similar.

  21. Elaborate stage 2 • P = aba • To find the first ba-suffix, we count the number of suffixes that precede the ba-suffixes in this suffix array.

  22. Elaborate stage 2 • Suffixes preceding the ba-suffixes are divided into two categories. • A-type: suffixes starting with characters lexicographically smaller than b (#-suffixes, a-suffixes). • B-type: suffixes starting with the same character b and preceding the ba-suffixes. • We count A-type and B-type suffixes in different ways.

  23. Count the number of A-type suffixes • The number of A-type suffixes = the number of #-suffixes and a-suffixes = the position of the last a-suffix.

  24. Count the number of A-type suffixes • We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix. • With this array, we can count A-type suffixes in O(1) time.

  25. Count the number of A-type suffixes • Array M • Space: [bound missing from transcript] • Time: O(n) (one scan)
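The array of last positions can be built in one scan over the suffix array, as the slides state. The sketch below is one possible reading: it uses Python dictionaries instead of arrays, and the names `last` and `smaller` are assumptions made for illustration.

```python
def a_type_tables(text, sa):
    """Return (last, smaller) built with one scan over the suffix array.

    last[c]    = rank of the last c-suffix
    smaller[c] = number of suffixes starting with a character smaller than c
    """
    last = {}
    for rank, start in enumerate(sa, start=1):
        last[text[start - 1]] = rank       # later ranks overwrite earlier ones
    smaller, previous_last = {}, 0
    for c in sorted(last):                 # alphabet in increasing order
        smaller[c] = previous_last         # everything up to the previous char
        previous_last = last[c]
    return last, smaller


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
last, smaller = a_type_tables(T, SA)
print(last)          # {'#': 1, 'a': 6, 'b': 13}
print(smaller["b"])  # 6 A-type suffixes precede the ba-suffixes
```

With the tables in place, the A-type count for any character is a single lookup.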

  26. Count the number of B-type suffixes • B-type suffixes: b-suffixes preceding the ba-suffixes.

  27. Count the number of B-type suffixes • B-type suffixes: b-suffixes preceding the ba-suffixes. • The suffix generated by removing the leftmost b from a B-type suffix appears in the suffix subarray preceding the a-suffixes found in stage 1.

  28. Count the number of B-type suffixes • The number of B-type suffixes is the number of suffixes • that lie in the suffix subarray preceding the a-suffixes and • whose previous characters are b. • We count this with array N. • Let U be the conceptual array of the previous characters of the suffixes.

  29. Count the number of B-type suffixes • Array N • Entry N[i,b] stores the number of suffixes whose previous characters are b in the suffix subarray SA[1,i].

  30. Count the number of B-type suffixes • We can count B-type suffixes in O(1) time by accessing an entry of N.
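Stage 2 can be traced on the example with an explicit U and a full table N. This is a sketch under assumptions: the suffix starting at position 1 has no previous character, so `None` is stored for it, and N is held as a dictionary of prefix-count lists rather than a two-dimensional array.

```python
T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

# U[k-1] is the character preceding the suffix of rank k.
# Assumption: None marks the suffix at position 1, which has no predecessor.
U = [T[i - 2] if i > 1 else None for i in SA]

# N[c][i] = number of suffixes in SA[1..i] whose previous character is c.
alphabet = sorted(set(T))
N = {c: [0] for c in alphabet}
for u in U:
    for c in alphabet:
        N[c].append(N[c][-1] + (u == c))

first_a, last_a = 2, 6   # a-suffixes found in stage 1
a_type = 6               # #-suffixes and a-suffixes (rank of the last a-suffix)

# first ba-suffix: skip the A-type suffixes and the B-type suffixes
first_ba = a_type + N["b"][first_a - 1] + 1
# last ba-suffix: same idea, counting up to the last a-suffix
last_ba = a_type + N["b"][last_a]

print(first_ba, last_ba)                        # 8 10
print([T[i - 1:] for i in SA[first_ba - 1:last_ba]])
```

The table costs one entry per rank and per character, which is what motivates the more compact alternatives discussed next.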

  31. Array N • Array N • Space: O(n|Σ|) • An alternative way • Space: O(n) • O(log |Σ|) time for counting B-type suffixes.

  32. Query for N[i,b] • Counting B-type suffixes • O(log n) time • O(log |Σ|) time

  33. Query for N[i,b]: O(log n) time • In the O(log n)-time algorithm, we generate an array whose ith entry stores the location of the ith b in U.

  34. Query for N[i,b]: O(log n) time • Counting the suffixes whose previous characters are b in SA[1,8] = counting the b's in U[1,8].

  35. Query for N[i,b]: O(log n) time • Find the largest value not exceeding 8 in this array.

  36. Query for N[i,b]: O(log n) time • To find that value, 7, in this array, we perform a binary search: O(log n) time.

  37. Query for N[i,b]: O(log n) time • The index of 7 in the array, which is 5, is the number of b's in U[1,8].

  38. Query for N[i,b]: O(log n) time • In general, we require such an array for every character (#, a, b): O(n) space in total.
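The O(log n) query described on these slides maps directly onto `bisect_right` from Python's standard library. The sketch below rebuilds U for the running example (again storing `None` for the suffix with no previous character, an assumption) and answers the query N[8,b].

```python
from bisect import bisect_right

T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
U = [T[i - 2] if i > 1 else None for i in SA]

# One sorted position list per character: positions[c][k-1] is the location
# of the k-th c in U. All lists together hold at most n entries.
positions = {}
for k, c in enumerate(U, start=1):
    if c is not None:
        positions.setdefault(c, []).append(k)


def query_N(i, c):
    """Number of c's in U[1..i], found by binary search in O(log n) time."""
    return bisect_right(positions.get(c, []), i)


print(positions["b"])   # [1, 2, 3, 6, 7, 9, 11]
print(query_N(8, "b"))  # 5: the largest entry not exceeding 8 is 7, at index 5
```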

  39. Query for N[i,b] • O(log n) time • O(log |Σ|) time

  40. Query for N[i,b]: O(log |Σ|) time • Divide U into |Σ|-sized blocks. • For the last character of each block, we compute the entries of N.

  41. Query for N[i,b]: O(log |Σ|) time • For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm. • O(log |Σ|) time for binary search. • Still O(n) space in total.
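One way to realize the blocked structure is sketched below. It assumes the block size equals the alphabet size and stores the counts at the start of each block (equivalently, at the end of the previous one); the class name `BlockedCounts` and its layout are illustrative guesses, not the authors' data structure.

```python
from bisect import bisect_right


class BlockedCounts:
    """Answer N[i, c] = number of c's in U[1..i] with short binary searches."""

    def __init__(self, U, alphabet):
        self.s = max(1, len(alphabet))   # assumed block size: |alphabet|
        self.base = []     # base[j][c]: occurrences of c before block j
        self.inside = []   # inside[j][c]: sorted positions of c within block j
        running = {c: 0 for c in alphabet}
        for start in range(0, len(U), self.s):
            self.base.append(dict(running))   # |alphabet| counters per block
            block = {}
            for pos in range(start + 1, min(start + self.s, len(U)) + 1):
                c = U[pos - 1]
                if c in running:
                    running[c] += 1
                    block.setdefault(c, []).append(pos)
            self.inside.append(block)

    def count(self, i, c):
        if i == 0:
            return 0
        j = (i - 1) // self.s                 # block containing position i
        # Each in-block list has at most s entries, so this search is O(log s).
        return self.base[j][c] + bisect_right(self.inside[j].get(c, []), i)


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
U = [T[i - 2] if i > 1 else None for i in SA]
alphabet = sorted(set(T))
bc = BlockedCounts(U, alphabet)
print(bc.count(8, "b"))  # 5
```

There are about n/|Σ| blocks with |Σ| counters each, and the in-block lists hold at most n positions overall, so the total space stays linear in n.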

  42. Summary • p stages • Each stage • Count A-type suffixes • Time: O(1) • Space: O(n) for the M array • Count B-type suffixes • Time: O(log |Σ|) • Space: O(n) for computing the value of an entry of N • In total, O(p log |Σ|) time with O(n) space.
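Putting the stages together gives a complete backward search. The sketch below is a simplified illustration rather than the authors' implementation: it builds the suffix array naively, answers N queries with the O(log n) position lists, assumes P does not contain `#`, and uses made-up names throughout.

```python
from bisect import bisect_right


def backward_search(text, pattern):
    """Return the starting positions of all pattern-suffixes, in rank order."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    U = [text[i - 2] if i > 1 else None for i in sa]

    # smaller[c]: number of suffixes whose first character is smaller than c.
    smaller = {}
    for rank0, start in enumerate(sa):
        smaller.setdefault(text[start - 1], rank0)

    # occ[c]: sorted ranks k with U[k] == c, used to answer N[i, c].
    occ = {}
    for k, c in enumerate(U, start=1):
        occ.setdefault(c, []).append(k)

    first, last = 1, n                    # before stage 1: every suffix
    for c in reversed(pattern):           # one stage per pattern character
        if c not in smaller:
            return []
        hits = occ.get(c, [])
        first = smaller[c] + bisect_right(hits, first - 1) + 1  # A-type + B-type + 1
        last = smaller[c] + bisect_right(hits, last)
        if first > last:
            return []
    return sa[first - 1:last]


print(backward_search("abbabaababbb#", "aba"))  # [4, 7]
```

Each loop iteration performs two counting queries, so swapping in a faster query structure changes the per-stage cost without touching the loop.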

  43. Conclusion • In a suffix array, one can choose between search algorithms with different time bounds, depending on the alphabet size. • Suffix arrays are more powerful than suffix trees.
