
Seminar in advanced topics in data structures Presented by Kfir Amitai

Suffix Arrays. “A new method for on-line string searches” by U. Manber and G. Myers (1993); “Simple linear work suffix array construction” by J. Karkkainen and P. Sanders (2002).



  1. Suffix Arrays • “A new method for on-line string searches” by U. Manber and G. Myers (1993) • “Simple linear work suffix array construction” by J. Karkkainen and P. Sanders (2002) • Seminar in advanced topics in data structures • Presented by Kfir Amitai

  2. Contents • Introduction • Searching a suffix array • Building in O(n logn) - 1993 • Sorting • LCP information building • Some observations about linear time • Building in O(n) - 2002 • Results

  3. Introduction • Until now we have studied suffix trees. • The main problem with suffix trees is the large constant factor in their linear space complexity. • Suffix arrays are a much simpler data structure. • Suffix arrays let us find all occurrences of a string of length P in a string of length N in O(P + log N) time, using a kind of binary search.

  4. Introduction – What is a suffix array? • A suffix array is the sorted array of the suffixes of a string S, represented by an array of pointers to the suffixes of S. • Example: S = “nahariya”

  5. The sorted suffixes will be represented by an array of integers - POS
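As a minimal sketch of what POS contains, the suffixes of “nahariya” can be sorted directly. The function name suffix_array_naive is an illustrative choice, and this plain comparison sort is not the construction described later in the talk; it is only a convenient way to see the result.

```python
def suffix_array_naive(s):
    """Return POS: start indices of the suffixes of s in lexicographic order.

    Illustrative only: sorting whole suffixes costs far more than the
    O(N log N) construction presented later.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "nahariya"
pos = suffix_array_naive(s)
for k, i in enumerate(pos):
    # k is the rank, i = POS[k] is the start of the k-th smallest suffix
    print(k, i, s[i:])
```

The printed suffixes come out in the order a, ahariya, ariya, hariya, iya, nahariya, riya, ya.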

  6. Some definitions and observations • POS[k] = i ⇔ S_i is the kth smallest suffix in the set {S_0, S_1, S_2, …, S_{N-1}} (from here on the text is called A, and A_i is its suffix starting at position i). • For every string u and a prefix length p, we denote by “<_p” the lexicographic order on the first p characters: v <_p u ⇔ v_0 v_1 … v_{p-1} < u_0 u_1 … u_{p-1} • Note that for every choice of p < N: A_POS[0] ≤_p A_POS[1] ≤_p A_POS[2] ≤_p … ≤_p A_POS[N-1] • Let |W| = P. Note that W is a substring of A ⇔ there is an i such that W =_P A_POS[i]

  7. Contents • Introduction • Searching a suffix array • Building in O(n logn) - 1993 • Sorting • LCP information building • Some observations about linear time • Building in O(n) - 2002 • Results

  8. The Binary search • We define a search interval: L_W = min {k | W ≤_P A_POS[k] or k = N}, R_W = max {k | W ≥_P A_POS[k] or k = -1} • W matches a_i a_{i+1} … a_{i+P-1} ⇔ i = POS[k] for some k ∈ [L_W, R_W] • If L_W > R_W ⇒ W is not a substring of A. • Else: there are (R_W − L_W + 1) matches: A_POS[L_W], …, A_POS[R_W] [Figure: the POS array with L_W and R_W marked; W >_P A_POS[k] for k below L_W and W <_P A_POS[k] for k above R_W]

  9. Search example • P = |W| • L_W = min {k | W ≤_P A_POS[k] or k = N} • R_W = max {k | W ≥_P A_POS[k] or k = -1} • If found ⇒ (R_W − L_W + 1) matches [Figure: the POS array of A = “assassin” with columns Search 1 to Search 4]

  10. Naïve search – O(P logN) • We iterate over the POS array in an ordinary binary search. There will be log N iterations, each costing O(P). • Initialize: L = 0, R = N−1 • Step: set M = (L+R)/2, then set the new L, R bounds according to a comparison of W with A_POS[M]. • Stop when we have reached L_W = min {k | W ≤_P A_POS[k] or k = N} and R_W = max {k | W ≥_P A_POS[k] or k = -1} [Figure: the POS array with L, M and R marked, searching for W = “abcx”]
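The naïve search can be sketched as two ordinary binary searches, one for L_W and one for R_W. The function name search_interval and the use of Python slicing to compare only the first P characters are assumptions made for illustration; the slide itself describes a single loop over L, M and R.

```python
def search_interval(a, pos, w):
    """Return (L_W, R_W) for pattern w in text a with suffix array pos.

    Each comparison looks at no more than P = len(w) characters, and each
    binary search takes O(log N) steps, so the cost is O(P log N).
    """
    n, p = len(pos), len(w)

    # L_W = min{k | W <=_P A_POS[k]}, or N if there is no such k
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w <= a[pos[mid]:pos[mid] + p]:
            hi = mid
        else:
            lo = mid + 1
    lw = lo

    # R_W = max{k | W >=_P A_POS[k]}, or -1 if there is no such k
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w >= a[pos[mid]:pos[mid] + p]:
            lo = mid + 1
        else:
            hi = mid
    rw = lo - 1
    return lw, rw


a = "assassin"
pos = sorted(range(len(a)), key=lambda i: a[i:])  # naive POS, for the demo only
lw, rw = search_interval(a, pos, "ss")
matches = sorted(pos[lw:rw + 1])  # start positions of "ss" in a
```

When L_W > R_W, as happens for W = “abcx”, the pattern does not occur in A.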

  11. Stop to think… • What can we do better?

  12. Let’s do it better… • What we didn’t use is the fact that we are searching among suffixes of the same string… • Let’s assume we have information on the lcp’s of pairs of suffixes. • For each iteration we define: • l = lcp(A_POS[L], W) • r = lcp(W, A_POS[R]) • Llcp[M] = lcp(A_POS[L], A_POS[M]) • Rlcp[M] = lcp(A_POS[M], A_POS[R]) • An important point – we don’t need more than 2*N lcp pairs, because for each search midpoint M there are well-defined L and R!

  13. Search in O(P + logN) using lcp’s • Let’s look for W = “nahx”. • If l ≥ r we compare l with Llcp[M], and if l < r we compare r with Rlcp[M]. • I will show the case l ≥ r; the other case is symmetric. • Case 1: l < Llcp[M] [Figure: L, M and R on the POS array, with Llcp[M]=4, l=3, r=2, where l = lcp(A_POS[L], W), r = lcp(W, A_POS[R]), Llcp[M] = lcp(A_POS[L], A_POS[M])]

  14. Search in O(P + logN) using lcp’s • Case 1: l < Llcp[M] (W = “nahx”, Llcp[M]=4, l=3, r=2) • We know that W > A_POS[L]. • W > A_POS[M], because A_POS[M] agrees with A_POS[L] on more than l characters, so W first differs from it at the same position ⇒ we need to move L to M. • l is unchanged (again, because their lcp is bigger). • We did it with no string comparison, only integers. [Figure: L, M and R on the POS array]

  15. Search in O(P + logN) using lcp’s • Case 2: l > Llcp[M] (W = “nahx”, Llcp[M]=2, l=3, r=2) • W and A_POS[L] have more in common (a bigger lcp) than A_POS[L] and A_POS[M]. • Therefore, because we know that A_POS[L] < A_POS[M] ⇒ W < A_POS[M]. • We need to move R to M. • Now we assign r ← Llcp[M]. • Again – no string comparison operations. [Figure: L, M and R on the POS array]

  16. Search in O(P + logN) using lcp’s • Case 3: l = Llcp[M] (W = “nahx”, Llcp[M]=3, l=3, r=2) • Now we have reached the only case where we have to compare strings. The lcp information alone does not tell us whether to go left or right. • What we do know is that the first l characters of W and A_POS[M] are identical. • We compare the (l+1)st character, the (l+2)nd, and so on, until we find j such that the (l+j)th characters of W and A_POS[M] differ. • The (l+j)th character determines whether we go left or right. Either way, we know the new value of l or r. [Figure: L, M and R on the POS array]
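The three cases can be put together in a short sketch that finds L_W. The helper names (common, build_llcp_rlcp, find_lw) are hypothetical, and the Llcp/Rlcp tables are filled here by direct character comparison purely for illustration; the talk obtains them during the sort. Cases 1 and 3 are merged into one branch: when Llcp[M] > l the character comparison stops at its first step, so no extra work is done.

```python
def common(a, i, b, j):
    """Number of matching characters of a[i:] and b[j:], compared left to right."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def build_llcp_rlcp(a, pos):
    """Fill Llcp[M] and Rlcp[M] for every (L, M, R) the binary search can visit.

    Assumption for illustration: computed naively here, not during sorting.
    """
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n
    stack = [(0, n - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= 1:
            continue
        mid = (lo + hi) // 2
        llcp[mid] = common(a, pos[lo], a, pos[mid])
        rlcp[mid] = common(a, pos[mid], a, pos[hi])
        stack.append((lo, mid))
        stack.append((mid, hi))
    return llcp, rlcp


def find_lw(a, pos, llcp, rlcp, w):
    """Return L_W = min{k | W <=_P A_POS[k]}, or N if there is no such k."""
    n, p = len(pos), len(w)

    def le_at(m, k):
        # W <=_P A_POS[k], given that both agree on their first m characters.
        # An exhausted suffix yields '' and compares as smaller than any character.
        return m == p or w[m:m + 1] <= a[pos[k] + m:pos[k] + m + 1]

    l = common(a, pos[0], w, 0)
    r = common(a, pos[n - 1], w, 0)
    if le_at(l, 0):
        return 0
    if not le_at(r, n - 1):
        return n
    lo, hi = 0, n - 1  # invariant: A_POS[lo] <_P W <=_P A_POS[hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            if llcp[mid] >= l:   # cases 1 and 3: continue comparing from l
                m = l + common(a, pos[mid] + l, w, l)
            else:                # case 2: integers only
                m = llcp[mid]
        else:                    # symmetric cases, using r and Rlcp
            if rlcp[mid] >= r:
                m = r + common(a, pos[mid] + r, w, r)
            else:
                m = rlcp[mid]
        if le_at(m, mid):
            hi, r = mid, m
        else:
            lo, l = mid, m
    return hi
```

R_W can be found by a symmetric routine, which this sketch omits.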

  17. Search in O(P + logN) using lcp’s: Time complexity • If we analyze, in an amortized manner, the number of single-character comparisons we do in this step, we can say that it equals: • (max(l,r) after the step) – (max(l,r) before the step) + 1. • Altogether this is not bigger than P; together with the steps, we get O(P + log N).

  18. Search in O(P + logN) using lcp’s: Space complexity • The implementation uses three N-sized arrays of integers – POS, Llcp and Rlcp (the example did not show how Rlcp is used; it is used in the same way in the cases where r > l). • Now we move on to see how to prepare those 3 arrays, whilst sorting.

  19. Contents • Introduction • Searching a suffix array • Building in O(n logn) - 1993 • Sorting • LCP information building • Some observations about linear time • Building in O(n) - 2002 • Results

  20. Sorting the suffixes • We will see a variation of radix sort. • We will sort in O(log N) stages, and call the stages 1, 2, 4, 8, … • We call stage H = 2^i the H-stage. • In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters (the next stage is 2H). • If A_i and A_j are in the same H-bucket, we sort them by their next H symbols in stage 2H.

  21. The general idea • If A_i and A_j are in the same H-bucket, we sort them by their next H symbols in stage 2H; but their next H symbols are the first H symbols of A_{i+H} and A_{j+H}, which are already sorted in stage H. [Figure: four buckets for H=2, showing A_i, A_j, A_{j+H} and A_{i+H}]

  22. The sorting algorithm • We go over the semi-sorted suffix array: • The first stage involves only a bucket sort on the first character. • Assume the suffixes are now ordered in ≤_H order. • For each A_i, in that order: move A_{i-H} to the next available place in its H-bucket. • The suffixes are now sorted according to ≤_2H order. • Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later). • In this way, POS gets more and more sorted until every suffix is in a bucket of its own.
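The doubling idea behind the stages can be sketched compactly. Note that this is a simplified variant, not the bucket-moving procedure on the slide: each stage here re-sorts with Python's built-in sort on the pair (rank of A_i, rank of A_{i+H}), which costs O(N log N) per stage instead of O(N). The name suffix_array_doubling and the rank array are illustrative assumptions.

```python
def suffix_array_doubling(a):
    """Build POS by prefix doubling (simplified: comparison sort per stage)."""
    n = len(a)
    if n <= 1:
        return list(range(n))
    rank = [ord(c) for c in a]   # order by the first character
    pos = list(range(n))
    h = 1
    while True:
        def key(i, h=h, rank=rank):
            # First H symbols of A_i, then first H symbols of A_{i+H};
            # -1 stands for "past the end" and sorts before every real rank.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)        # POS is now in <=_2H order
        new_rank = [0] * n
        for k in range(1, n):
            # A suffix opens a new 2H-bucket when its key differs from its neighbor's
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        if rank[pos[-1]] == n - 1:   # every suffix is in a bucket of its own
            return pos
        h *= 2


pos = suffix_array_doubling("assassin")
```

The rank array plays the role of the bucket numbers: two suffixes share a rank exactly when they are in the same bucket of the current stage.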

  23. An example of A = “assassin” • H=1 • A_i sets A_{i-1}: A_3 sets A_2

  24. An example • A = assassin, H=1 • A_i sets A_{i-1}: for A_0 this is not possible because i=0

  25. An example • A = assassin, H=1 • A_i sets A_{i-1}: A_6 sets A_5

  26. An example • A = assassin, H=1 • A_i sets A_{i-1}: A_7 sets A_6, which is already the first in its bucket

  27. An example • A = assassin, H=1 • A_i sets A_{i-1}: A_2 sets A_1

  28. An example • A = assassin, H=1 • A_i sets A_{i-1}: A_5 sets A_4

  29. An example • A = assassin, H=1 • A_i sets A_{i-1}: A_1 sets A_0

  30. An example • A = assassin, H=1 • A_i sets A_{i-1}: A_4 sets A_3

  31. An example • A = assassin, H=1 • Go over the array to get the new 2-buckets: lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.

  32. An example • A = assassin, H=2 • A_i sets A_{i-2}: for A_0 this is not possible because i − 2 < 0

  33. An example • A = assassin, H=2 • A_i sets A_{i-2}: A_3 sets A_1

  34. An example • A = assassin, H=2 • A_i sets A_{i-2}: A_6 sets A_4

  35. An example • A = assassin, H=2 • A_i sets A_{i-2}: A_7 sets A_5

  36. An example • A = assassin, H=2 • A_i sets A_{i-2}: A_2 sets A_0, but A_0 is already the first in its bucket

  37. An example • A = assassin, H=2 • A_i sets A_{i-2}: A_5 sets A_3

  38. An example • A = assassin, H=2 • A_i sets A_{i-2}: for A_1 this is not possible because i − 2 < 0

  39. An example • A = assassin, H=2 • A_i sets A_{i-2}: A_4 sets A_2

  40. An example • A = assassin, H=2 • Go over the array to get the new 4-buckets: lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3, so “assin” opens a new 4-bucket. lcp(ssassin, ssin) = 2 + lcp(assin, in) = 2 + 0 = 2, so “ssin” opens a new 4-bucket.

  41. An example • A = assassin, H=4 • We are done.

  42. Complexity analysis • The first stage (bucket sort) was O(N). • We had log N stages, each in O(N): • One traversal for the sorting • One traversal to determine the new buckets. • Total time complexity is O(N log N). • Space complexity: • We hold 3 integer arrays: • POS • PRM, which is the inverse of POS: PRM[POS[i]] = i • Another array that tells us which suffix was moved last in every bucket • We hold 2 Boolean arrays that tell us where the buckets begin in this stage and in the previous stage. • Altogether – O(N). • We still have to show how we knew the lcp information.

  43. Take a break

  44. Contents • Introduction • Searching a suffix array • Building in O(n logn) - 1993 • Sorting • LCP information building • Some observations about linear time • Building in O(n) - 2002 • Results

  45. lcp information – general idea • We used the lcp information to determine where to split buckets for the next iteration. • That’s why we are only interested in pairs of suffixes A_p, A_q that are in the same H-bucket but will not be in the same 2H-bucket. • We would also like to compute it concurrently, while constructing the array.

  46. lcp information – general idea • Let’s see what we know of such A_p and A_q: • H ≤ lcp(A_p, A_q) < 2H • lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H}) • lcp(A_{p+H}, A_{q+H}) < H ⇒ that is why A_{p+H} and A_{q+H} are in different H-buckets. • So, along the algorithm, we keep track of the lcp value between neighbors in adjacent buckets. • What about suffixes that are not in adjacent buckets?
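The identity lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H}) can be checked on the two pairs from the “assassin” example at H=2. The helper lcp and the function name lcp_by_doubling are illustrative assumptions; the sketch only verifies the arithmetic shown on the example slides.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def lcp_by_doubling(a, p, q, h):
    """lcp(A_p, A_q) for two suffixes known to share their first h characters."""
    assert a[p:p + h] == a[q:q + h], "A_p and A_q must be in the same H-bucket"
    return h + lcp(a[p + h:], a[q + h:])


a = "assassin"
first = lcp_by_doubling(a, 0, 3, 2)    # "assassin" vs "assin": 2 + lcp("sassin", "sin")
second = lcp_by_doubling(a, 1, 4, 2)   # "ssassin" vs "ssin":  2 + lcp("assin", "in")
```

Both results are below 2H = 4, so each pair is split into different 4-buckets.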

  47. lcp information – general idea • Let’s notice something: if A_POS[i] < A_POS[j] (that is, i < j) then: • lcp(A_POS[i], A_POS[j]) = min over k ∈ [i, j−1] of {lcp(A_POS[k], A_POS[k+1])} • That means that their lcp is the minimum over all the adjacent pairs between them. [Figure: an example with H=2 where lcp(A_{p+H}, A_{q+H}) = min{1,1,2} = 1]
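A small sketch of this observation on “assassin”: the lcp of any two suffixes equals the minimum of the adjacent lcp values between their positions in POS. The array name hgt follows the Hgt array introduced on the next slide; the helper names and the naive construction of POS are assumptions for illustration.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


a = "assassin"
pos = sorted(range(len(a)), key=lambda i: a[i:])   # naive POS, for the demo only

# hgt[k] = lcp of the suffixes at positions k and k+1 of POS
hgt = [lcp(a[pos[k]:], a[pos[k + 1]:]) for k in range(len(pos) - 1)]


def lcp_via_hgt(i, j):
    """lcp(A_POS[i], A_POS[j]) for i < j, as the minimum over adjacent pairs."""
    return min(hgt[i:j])


# "sassin" is at position 4 of POS and "ssin" at position 7
example = lcp_via_hgt(4, 7)
```

For this pair the adjacent values are 1, 1 and 2, so the minimum is 1, which agrees with comparing “sassin” and “ssin” directly.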

  48. lcp information – general idea • So, let’s conclude: • We don’t need to hold the lcp of every pair; we can obtain it from the minimum over all adjacent pairs between them. • We will hold an array Hgt[N-1] for that purpose. • We will use interval trees. • Interval trees are balanced trees that can hold this information for us. Their space complexity is O(N). • We keep the lcp of adjacent pairs in the leaves, and each internal node holds the minimum of its children. • We will be able to obtain the lcp of any pair in O(log N).
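One way to realize such a tree is an array-based minimum tree, sketched below. The class name MinTree, its methods, and the flat array layout are illustrative assumptions and not necessarily the structure used in the paper; the sketch only shows leaves holding Hgt values, internal nodes holding the minimum of their children, and O(log N) updates and queries in O(N) space.

```python
class MinTree:
    """Array-based tree: leaves hold values, internal nodes the min of their children."""

    def __init__(self, values):
        self.n = len(values)
        # tree[n .. 2n-1] are the leaves; tree[1 .. n-1] are internal nodes
        self.tree = [float("inf")] * self.n + list(values)
        for k in range(self.n - 1, 0, -1):
            self.tree[k] = min(self.tree[2 * k], self.tree[2 * k + 1])

    def update(self, k, value):
        """Set leaf k to value and repair the minima on the path to the root."""
        k += self.n
        self.tree[k] = value
        while k > 1:
            k //= 2
            self.tree[k] = min(self.tree[2 * k], self.tree[2 * k + 1])

    def query(self, lo, hi):
        """Minimum of the leaves lo, lo+1, ..., hi-1 in O(log N)."""
        result = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:                 # lo is a right child: take it, move right
                result = min(result, self.tree[lo])
                lo += 1
            if hi & 1:                 # hi is exclusive: step left, take that node
                hi -= 1
                result = min(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result


hgt = [3, 0, 0, 0, 1, 1, 2]            # adjacent lcp values for "assassin"
tree = MinTree(hgt)
answer = tree.query(4, 7)              # lcp of the suffixes at POS positions 4 and 7
```

The update method matters during construction, because the adjacent lcp values become known stage by stage rather than all at once.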
