Suffix Arrays: A new method for on-line string searches

Suffix Arrays: A new method for on-line string searches. Udi Manber, Gene Myers, May 1989. Presented by: Oren Weimann. Introduction - Problem definition: “Is W a substring of A?”, where |A| = N and |W| = P, A = a_0 a_1 … a_{N-1}, and A_i = a_i a_{i+1} … a_{N-1} is the suffix beginning at index i. Example: W = badgfbb.

Presentation Transcript


  1. Suffix Arrays: A new method for on-line string searches • Udi Manber • Gene Myers • May 1989 • Presented by: Oren Weimann

  2. Introduction - Problem definition: “Is W a substring of A?” • |A| = N and |W| = P • A = a_0 a_1 … a_{N-1} • A_i = suffix beginning at index i = a_i a_{i+1} … a_{N-1} • Example: W = badgfbb occurs in A = abccbbadgfbbcahgjf, starting at index 5

  3. Introduction – what is a suffix array? Example: A = assassin • Pos[2] = 6 (A_6 = in) [Figure: the Pos array for assassin]

  4. Introduction – what is a suffix array? A lexicographically sorted array, Pos[N], of all the suffixes of A: Pos[k] = i ⇔ A_i is the kth smallest suffix (counting from k = 0) in the set {A_0, A_1, A_2, …, A_{N-1}}
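
To make the definition of Pos concrete, here is a minimal Python sketch that builds it the naive way, by sorting the start indices by the suffix they begin. The function name `naive_suffix_array` is an illustrative choice, not something from the slides, and this is not the O(N log N) construction described later in the talk.

```python
def naive_suffix_array(a):
    """Return Pos, where Pos[k] is the start index of the k-th smallest suffix of a.

    Naive illustration only: each comparison may look at O(N) characters.
    """
    return sorted(range(len(a)), key=lambda i: a[i:])


pos = naive_suffix_array("assassin")
# Suffixes in sorted order: assassin, assin, in, n, sassin, sin, ssassin, ssin
print(pos)      # [0, 3, 6, 7, 2, 5, 1, 4]
print(pos[2])   # 6, i.e. A_6 = "in", as in the slide's example
```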

  5. Introduction – what is a suffix tree? • A trie that contains all suffixes of A. Example: A = assassin [Figure: the suffix tree of assassin, with each leaf labelled by the start index of its suffix]

  6. The Article Overview • A search algorithm in O(P + log N) time (assuming Pos[ ] and the longest common prefix (lcp) information are already computed). • How to construct Pos[ ] in O(N log N) time and O(N) space (assuming lcp info is known). • An algorithm for computing the lcp information in O(N log N) time. • Algorithms for expected-time improvement.

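  7. The Search algorithm - Definitions • For any string u, u^p = u_1 u_2 u_3 … u_p (or u if |u| ≤ p) • Let “≤” denote lexicographic order. We say u ≤_p v ⇔ u^p ≤ v^p (and similarly for <_p, =_p, ≥_p, >_p) • Note that for any choice of p: A_Pos[0] ≤_p A_Pos[1] ≤_p … ≤_p A_Pos[N-1], because u ≤ v implies u ≤_p v • Note that W is a substring of A ⇔ there is an i such that W =_P A_i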

  8. The Search algorithm – how does the array help us know if W is a substring of A? • We define a search interval: L_W = min {k | W ≤_P A_Pos[k] or k = N}, R_W = max {k | W ≥_P A_Pos[k] or k = -1} • W matches a_i a_{i+1} … a_{i+P-1} ⇔ i = Pos[k] for some k ∈ [L_W, R_W]

  9. Example: A = assassin [Figure: the Pos array with three example search intervals, labelled Option 1, Option 2 and Option 3]

  10. Why finding L_W, R_W == finding the matches: • If L_W > R_W ⇒ W is not a substring of A. • Else: there are (R_W - L_W + 1) matches: A_Pos[L_W], …, A_Pos[R_W]. [Figure: the Pos array, with W >_P A_Pos[k] for k < L_W and W <_P A_Pos[k] for k > R_W]

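  11. The Search algorithm – the easy way – O(P log N). Example: W = “abcx” [Figure: the Pos array with pointers L, M, R]. There are log N iterations; each iteration sets new L, R bounds (initially L = 0, R = N-1) according to a comparison of W with A_Pos[M], where M = (L+R)/2. Each comparison looks at up to P symbols. In the end L_W = R.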
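
The simple O(P log N) search can be sketched in Python as two plain binary searches over Pos, one for L_W and one for R_W. This is a minimal sketch under some assumptions: the names `search_interval` and `occurrences` are invented here, the loops use half-open bounds rather than the slide's L = 0, R = N-1 convention, and Python's built-in string comparison stands in for the ≤_P test on P-symbol prefixes.

```python
def search_interval(a, pos, w):
    """Return (L_W, R_W) for pattern w in text a with suffix array pos.

    L_W = min{k : W <=_P A_Pos[k]} (or N), R_W = max{k : A_Pos[k] <=_P W} (or -1).
    """
    n, p = len(a), len(w)

    def prefix(k):
        # First P symbols of the suffix A_Pos[k] (shorter if the suffix ends early).
        return a[pos[k]:pos[k] + p]

    lo, hi = 0, n
    while lo < hi:                      # smallest k with W <=_P A_Pos[k]
        mid = (lo + hi) // 2
        if w <= prefix(mid):
            hi = mid
        else:
            lo = mid + 1
    l_w = lo

    lo, hi = 0, n
    while lo < hi:                      # one past the largest k with A_Pos[k] <=_P W
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    r_w = lo - 1
    return l_w, r_w


def occurrences(a, pos, w):
    """Start positions of w in a; empty when L_W > R_W."""
    l_w, r_w = search_interval(a, pos, w)
    return sorted(pos[k] for k in range(l_w, r_w + 1))


a = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]          # suffix array of "assassin"
print(search_interval(a, pos, "ss"))    # (6, 7)
print(occurrences(a, pos, "ss"))        # [1, 4]
print(search_interval(a, pos, "b"))     # (2, 1): L_W > R_W, so no match
```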

  12. The Search algorithm using lcp values in O(P + log N) – Definitions: a speedup using precomputed lcp values (for now we assume the lcp information is known). In each iteration we define: • l = lcp(A_Pos[L], W) • r = lcp(W, A_Pos[R]) • Llcp[M] = lcp(A_Pos[L], A_Pos[M]) • Rlcp[M] = lcp(A_Pos[M], A_Pos[R])

  13. The Search algorithm using lcp values in O(P + log N). Example: W = “abcx”, l = 3, r = 2, Llcp[M] = 4, Rlcp[M] = 2 [Figure: the Pos array with pointers L, M, R]. Note that Llcp[M] is well defined because every midpoint M has exactly one left endpoint L_M and one right endpoint R_M.

  14. So how do we use l, r, Llcp[M]? Example: W = abcx, Llcp[M] = 4, l = 3, r = 2 [Figure: pointers L, M, R]. Case 1: Llcp[M] > l (Llcp[M] = 4 and l = 3) • W > A_Pos[L], and A_Pos[M] agrees with A_Pos[L] beyond offset l ⇒ W > A_Pos[M] • Go right • l is unchanged = 3

  15. Example: W = abcx (cont.), Llcp[M] = 2, l = 3, r = 2 [Figure: pointers L, M, R]. Case 2: Llcp[M] < l (Llcp[M] = 2 and l = 3) • A_Pos[L] < A_Pos[M], and W agrees with A_Pos[L] beyond offset Llcp[M] ⇒ W < A_Pos[M] • Go left • r = Llcp[M] = 2

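  16. Example: W = abcx (cont.), Llcp[M] = 3, l = 3, r = 2 [Figure: pointers L, M, R]. Case 3: Llcp[M] = l (Llcp[M] = 3 and l = 3) • Compare W and A_Pos[M] symbol by symbol, starting at offset l, until the first mismatch at offset l+j (w_{l+j} ≠ a_{Pos[M]+l+j}) • Go right or left according to the order of w_{l+j} and a_{Pos[M]+l+j} • New l or r = l+j • Number of comparisons = j+1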

  17. The Search algorithm using lcp values – complexity • In each iteration there are at most j+1 comparisons, and the j values sum to at most P over all iterations • Total comparisons ≤ P + #iterations • O(P + log N) running time • Requires only three N-sized arrays
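
The three cases above can be put together in a short Python sketch that finds L_W. Several things here are assumptions rather than content from the slides: the names `lcp`, `find_lw`, `suf` and `w_le` are invented, the values that would come from the precomputed Llcp[M] and Rlcp[M] arrays are recomputed directly on every iteration (so this sketch does not achieve the O(P + log N) bound), and the branch that uses r and Rlcp[M] when r > l is the symmetric counterpart of the slides' cases, which are stated for l only.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def find_lw(a, pos, w):
    """Return L_W = min{k : W <=_P A_Pos[k]}, or N if there is no such k.

    Assumes a is non-empty and pos is its suffix array.
    """
    n, p = len(a), len(w)

    def suf(k):
        return a[pos[k]:]

    def w_le(k, m):
        # Given m = lcp(W, A_Pos[k]), decide whether W <=_P A_Pos[k].
        s = pos[k] + m
        return m == p or (s < n and w[m] <= a[s])

    l, r = lcp(suf(0), w), lcp(suf(n - 1), w)
    if w_le(0, l):
        return 0
    if not w_le(n - 1, r):
        return n
    L, R = 0, n - 1
    while R - L > 1:
        M = (L + R) // 2
        if l >= r:
            c = lcp(suf(L), suf(M))      # stands in for precomputed Llcp[M]
            # Cases 1 and 3: compare onward from offset l. Case 2: m = Llcp[M].
            m = l + lcp(a[pos[M] + l:], w[l:]) if c >= l else c
        else:
            c = lcp(suf(M), suf(R))      # stands in for precomputed Rlcp[M]
            m = r + lcp(a[pos[M] + r:], w[r:]) if c >= r else c
        if w_le(M, m):
            R, r = M, m                  # go left
        else:
            L, l = M, m                  # go right
    return R


a = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]
print(find_lw(a, pos, "ss"))   # 6
print(find_lw(a, pos, "z"))    # 8 (= N: W is larger than every suffix)
```

In a full implementation the two `lcp(suf(...), suf(...))` calls would be replaced by lookups in Llcp and Rlcp, which is what removes the repeated symbol comparisons.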

  18. The Article Overview • A search algorithm in O(P + log N) time (assuming Pos[ ] and the longest common prefix (lcp) information are already computed). • How to construct Pos[ ] in O(N log N) time and O(N) space (assuming lcp info is known). • An algorithm for computing the lcp information in O(N log N) time. • Algorithms for expected-time improvement.

  19. Construction of suffix array in O(N log N): sorting the suffixes with a special kind of radix sort. We will have O(log N) stages (numbered 1, 2, 4, 8, 16, …). In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters. (The next stage is 2H; thus, in stage H the suffixes are sorted by ≤_H.)

  20. Construction of suffix array – The general idea. If A_i and A_j are in the same H-bucket, we sort them by their next H symbols. But their next H symbols are the first H symbols of A_{i+H} and A_{j+H}, which are already sorted in stage H. [Figure, H = 2: the array divided into first, second, third and fourth buckets, showing A_i, A_j, A_{j+H} and A_{i+H}]

  21. Construction of suffix array – The general idea (cont.) • Let A_i be in the first H-bucket after stage H • A_i starts with the smallest H-symbol string • So A_{i-H} should be first in its H-bucket [Figure, H = 2: A_i and A_{i-H}]

  22. Construction of suffix array – The algorithm • Go over the suffix array in its current (≤_H) order: • For each A_i: move A_{i-H} to the next available place in its H-bucket • The suffixes are now sorted according to ≤_2H order • Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later)
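
The doubling idea (a suffix's order by 2H symbols is determined by its H-rank and the H-rank of the suffix H positions later) can be sketched in Python as follows. This is a deliberately simplified stand-in, not the bucket-moving algorithm of the slides: it re-sorts with Python's comparison sort at every stage, so it runs in O(N log² N) rather than O(N log N). The name `suffix_array_doubling` and the `-1` sentinel for "past the end of A" are assumptions made for the illustration.

```python
def suffix_array_doubling(a):
    """Build the suffix array of a by prefix doubling (simplified sketch).

    After the stage with step h, rank[i] is the index of the 2h-bucket of A_i.
    """
    n = len(a)
    rank = [ord(c) for c in a]          # stage 1: bucket by first character
    pos = list(range(n))
    h = 1
    while True:
        def key(i):
            # (H-rank of A_i, H-rank of A_{i+H}); -1 means A_{i+H} is empty.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)               # now sorted by the first 2h symbols
        new_rank = [0] * n
        for k in range(1, n):
            # A suffix opens a new 2h-bucket when its key differs from its neighbor's.
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        if n == 0 or rank[pos[-1]] == n - 1:
            break                       # every suffix is alone in its bucket
        h *= 2
    return pos


print(suffix_array_doubling("assassin"))   # [0, 3, 6, 7, 2, 5, 1, 4]
```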

  23. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_3 sets A_2]

  24. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_0 is scanned; there is no A_{-1} to set]

  25. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_6 sets A_5]

  26. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_7 sets A_6]

  27. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_2 sets A_1]

  28. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_5 sets A_4]

  29. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_1 sets A_0]

  30. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). [Figure: A_4 sets A_3]

  31. Construction of suffix array – The algorithm. Example: A = assassin, H = 1 (A_i sets A_{i-1}). Go over the array to get the new 2-buckets: lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.

  32. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_0 is scanned; there is no A_{-2} to set]

  33. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_3 sets A_1]

  34. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_6 sets A_4]

  35. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_7 sets A_5]

  36. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_2 sets A_0]

  37. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_5 sets A_3]

  38. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_1 is scanned; there is no A_{-1} to set]

  39. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). [Figure: A_4 sets A_2]

  40. Construction of suffix array – The algorithm. Example: A = assassin, H = 2 (A_i sets A_{i-2}). Go over the array to get the new 4-buckets.

  41. Construction of suffix array – The algorithm. Example: A = assassin, H = 4. That’s it, we are sorted!

  42. Construction of suffix array – Complexity summary • Sorting by first character: O(N) • O(log N) stages of O(N) operations each = O(N log N) • Total time: O(N log N); space: 2 integer arrays of size N

  43. The Article Overview • A search algorithm in O(P + log N) time (assuming Pos[ ] and the longest common prefix (lcp) information are already computed). • How to construct Pos[ ] in O(N log N) time and O(N) space. • An algorithm for computing the lcp information in O(N log N) time. • Algorithms for expected-time improvement.

  44. How to find Longest Common Prefixes – the general idea • We don’t care what the lcp is between suffixes in the same H-bucket. • For A_p, A_q in the same H-bucket but in different 2H-buckets: • H ≤ lcp(A_p, A_q) < 2H • lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H}) • lcp(A_{p+H}, A_{q+H}) < H ⇒ A_{p+H} and A_{q+H} are in different H-buckets, but which ones?

  45. How to find Longest Common Prefixes – the general idea • If A_{p+H} and A_{q+H} are in adjacent H-buckets, then their lcp is already known. How? • If not, then for i < j: lcp(A_Pos[i], A_Pos[j]) = min over i ≤ k < j of lcp(A_Pos[k], A_Pos[k+1])

  46. How to find Longest Common Prefixes – the general idea. Example, H = 2: the adjacent lcp values between A_{p+H} and A_{q+H} are 1, 1, 2, so lcp(A_{p+H}, A_{q+H}) = min{1, 1, 2} = 1. Notice that if two neighbors are in the same H-bucket we can consider their lcp to be H, since lcp(A_{p+H}, A_{q+H}) < H.
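
The range-minimum identity used above can be checked with a small Python sketch. The helper names `lcp`, `height_array` and `lcp_between` are illustrative, and the adjacent values are computed by direct comparison here rather than during construction as the talk describes.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def height_array(a, pos):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i >= 1; hgt[0] is an unused 0."""
    return [0] + [lcp(a[pos[i - 1]:], a[pos[i]:]) for i in range(1, len(pos))]


def lcp_between(hgt, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j, as the minimum of the adjacent lcps."""
    return min(hgt[i + 1:j + 1])


a = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]
hgt = height_array(a, pos)
print(hgt)                      # [0, 3, 0, 0, 0, 1, 1, 2]
# A_Pos[4] = sassin and A_Pos[7] = ssin share only the leading "s":
print(lcp_between(hgt, 4, 7))   # 1 = min{1, 1, 2}
```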

  47. How to find lcp – algorithm and data structures – Hgt[ ]. During the construction stage we build an array called Hgt[N], where Hgt[i] = lcp(A_Pos[i-1], A_Pos[i]), initialized so that Hgt[i] = N+1 for every i. • In stage H = 1: Hgt[i] = 0 for every A_Pos[i] that is first in its bucket. • In stage 2H: we update Hgt[i] for every A_Pos[i] that is first in a newly created 2H-bucket.

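  48. How to find lcp – Hgt[ ] example (A = assassin, N+1 = 9): • H = 1: order assin, assassin, in, n, sin, ssin, sassin, ssassin; Hgt[1..7] = 9, 0, 0, 0, 9, 9, 9 • H = 2: order assin, assassin, in, n, sassin, sin, ssin, ssassin; Hgt[1..7] = 9, 0, 0, 0, 1, 1, 9 • lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(n, sin)} = 1 + 0 = 1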

  49. How to find lcp – Hgt[ ] example (cont.): • H = 4: order assassin, assin, in, n, sassin, sin, ssassin, ssin; Hgt[1..7] = 3, 0, 0, 0, 1, 1, 2 • lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3 • lcp(ssassin, ssin) = 2 + lcp(assin, in) = 2 + 0 = 2

  50. How to find lcp – data structures. We need a data structure that provides lcp(A_Pos[i], A_Pos[j]) for any i and j (not just for i and i+1, which Hgt[ ] supplies). Hgt[ ] will become the leaves of a balanced binary tree called the interval tree.
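
One way to picture the interval tree is as the tree of (L, R) intervals that the binary search can visit, with the Hgt values at the leaves and each internal node holding the minimum of its two children. The Python sketch below fills Llcp and Rlcp that way. It is an interpretation of the slide's description, not the talk's own code: the name `build_llcp_rlcp`, the recursive formulation and the convention that hgt[0] is unused are all assumptions.

```python
def build_llcp_rlcp(hgt):
    """Compute Llcp[M] and Rlcp[M] for every binary-search midpoint M.

    hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i >= 1 (hgt[0] is unused).
    Entries for indices that are never a midpoint stay 0.
    """
    n = len(hgt)
    llcp = [0] * n
    rlcp = [0] * n

    def interval_lcp(L, R):
        # Returns lcp(A_Pos[L], A_Pos[R]) for the search interval (L, R).
        if R - L == 1:
            return hgt[R]                    # leaf: adjacent suffixes
        M = (L + R) // 2
        llcp[M] = interval_lcp(L, M)         # lcp(A_Pos[L], A_Pos[M])
        rlcp[M] = interval_lcp(M, R)         # lcp(A_Pos[M], A_Pos[R])
        return min(llcp[M], rlcp[M])

    if n >= 2:
        interval_lcp(0, n - 1)
    return llcp, rlcp


# Hgt values for A = assassin from the H = 4 example (hgt[0] is a placeholder).
hgt = [0, 3, 0, 0, 0, 1, 1, 2]
llcp, rlcp = build_llcp_rlcp(hgt)
print(llcp[6], rlcp[6])   # 1 2: lcp(sin, ssassin) = 1, lcp(ssassin, ssin) = 2
print(rlcp[5])            # 1: lcp(sin, ssin) = 1
```

Because each index is the midpoint of exactly one interval, two N-sized arrays are enough to store every value the search needs.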
