580 likes | 805 Views
Suffix Arrays. A New Method for Online String Searches U.Manber and G.Myers. Introduction - String matching. Let A = a 0 a 1 ... a N- 1 be a large text of length N Let W = w 0 w 1 ... w p- 1 be a word of length P Is W a substring of A?. Introduction - Suffix Trees. Build time
E N D
Suffix Arrays A New Method for Online String Searches U.Manber and G.Myers
Introduction - String matching • Let A = a0a1...aN-1 be a large text of length N • Let W = w0w1...wp-1 be a word of length P • Is W a substring of A?
Introduction - Suffix Trees • Build time • O(N) • Search time • O(P) • Structure space • O(N) • Big constant • Dependent of |Σ|
Suffix Arrays • An array of all the suffixes of A • Sorted by lexicographical order A = aababa
Suffix Arrays • Ai = aiai+1...aN-1 • The suffix of A that starts at position i. • Position array (Pos) • Pos[k] is the start position of kth smallest suffix • APos[k] is the suffix pointed from Pos[k] • APos[k] is the kth smallest suffix Pos A = aababa
Searching • “Is W a substring of A?” • W is a substring of A Some suffix Ai starts with W • i is W’s location • All the instances of W must match consecutive suffixes in the array • Find the array interval that contains those suffixes
Searching - Definitions • For a string u • up = u0u1...up-1 • For strings u,v • u ≤p v up ≤ vp • Same for ≠, =, >… • For any p, Pos is ordered according to ≤p
W =p APos[k] LW RW W >p APos[k] W < p APos[k] Searching - Definitions • W = w0w1…wP-1 • LW= min (k : W ≤pAPos[k]or k = N) • First suffix ≥p from W • RW= max (k : APos[k]≤p W or k = 1) • Last suffix ≤p from W
Search Algorithm • k [LW, RW] W =p APos[k] • To find W’s instances - find [LW, RW] • Number of W’s occurrences is (RW-LW+1) • Matches are APos[LW],…, APos[RW] • Suffix array is sorted - use binary search
Binary Search • Search interval [L,R] • Midpoint M • Compare W to APos[M] • Decide where to search next • W ≤pAPos[M] - search in left half (R = M) • W >pAPos[M] - search in right half (L = M) • O(PlogN) W = abc L M R
Search Algorithm • Observation: • We can use information from one comparison to speedup the next comparisons • Use additional information • lcp = longest common prefix
Search Algorithm - lcp • lcp(v,w) = the length of the longest common prefix of v and w • Obtained by comparing v and w and stopping at the first unequal symbol • Use precomputed lcp information to reduce the number of comparisons to O(P + logN)
Search Algorithm • Consider all possible midpoints • M = 1…N-2 • Every midpoint corresponds to a triplet [LM,M,RM] • Suppose we precomputed two arrays: • Llcp[M] = lcp (APos[LM], APos[M]) • Rlcp[M] = lcp (APos[M], APos[RM])
l = 2 r = 1 LM M RM Llcp[M] = 1 Rlcp[M] = 1 Search Algorithm • Maintain two more variables • l = lcp(APos[L], W) • r = lcp(W, APos[R]) W = abcd
Go Right!l remains unchanged W = abcd l = 2 r = 1 LM M RM Llcp[M] = 3 Rlcp[M] = 1 Search Algorithm • Assume l≥r • Compare l with Llcp • If l < Llcp[M] • W >l+1 APos[LM] • APos[LM] =l+1 APos[M] • W >l+1 APos[M]
Go Left!r = Llcp[M] l = 2 r = 1 LM M RM Llcp[M] = 1 Rlcp[M] = 1 Search Algorithm If l > Llcp[M] • APos[LM] <l APos[M] • W =l APos[LM] • W <l APos[M] W = abcd
l = 2 r = 1 LM M RM Llcp[M] = 2 Rlcp[M] = 1 Search Algorithm If l = Llcp[M] • W can be in either half • Start comparing A and APos[M] from the (l+1) symbol • First unequal symbol determines whether to go right or left • r/l will be updated to l+j • j+1 comparisons W = abcd
Search Algorithm - Complexity • In each Iteration: • Let h=max(l,r) • We start comparing from the hth symbol to the h+j+1 • j+1 symbol comparisons • Next time we will start from the h+j symbol • j symbols out of the j+1 will not be compared again
Search Algorithm - Complexity • Every symbol in W will be successfully matched at most once • O(P) successful comparisons • At most one symbol will be unsuccessfully matched in each iteration • O(logN) unsuccessful comaprsions • Total: O(P + logN) comparisons
Build Suffix Array So far… • A O(P + logN) search algorithm • Given a sorted suffix array • Given lcp information (Llcp, Rlcp) Next… • Sort the suffix array in O(NlogN) • Compute the lcp’s while sorting the array
Sort Algorithm • First stage • Sort the suffixes into buckets, according to first symbol • Inductive stage • Assume array is bucket sorted according to first H symbols • Every H-bucket holds suffixes with the same H first symbols • Buckets are ordered according to the ≤H relation • Sort according to 2H first symbols
Sort Algorithm – Intuition • Let Ai, Aj be two suffixes in the same H-bucket • Ai =H Aj • Next H symbols of Ai and Aj are the first H symbols of Ai+H and Aj+H • In order to determine the ≤2H order of Ai and Aj, look at the ≤H order of Ai+H and Aj+H A = aababaa H = 2 Aj+H Ai Aj Ai+H
Sort Algorithm – Main Idea • Let Ai be a suffix in the first H-bucket • Ai starts with the smallest H-symbol string • Ai-H should be the first in its 2H-bucket A = aababa H = 1
Sort Algorithm • In stage H • Go over all the suffixes in the ≤H order • For each Ai move Ai-H to the next available place in its H-bucket • The suffixes are now sorted according to the ≤2H order • Go on to stage 2H to produce ≤4H order
assin assassin in n ssassin sin ssin sassin H = 1 Sort Algorithm - Example A = assassin assassin assin in sassin sin ssin ssassin H = 2
A = assassin 0 1 2 3 4 5 6 7 assassin assin in n sassin sin ssin ssassin H = 2 assassin assin in n sassin sin ssassin ssin H = 4 Sort Algorithm - Example
Sort Algorithm - Complexity • First Stage • Bucket sort according to first symbol • O(NlogN) • Inductive Stages • O(logN) stages • O(N) per stage • Total O(NlogN) • Space • Can be implemented using two N-sized integer arrays
Finding Longest Common Prefixes • The search algorithm uses lcp information: • Llcp[M] = lcp (APos[LM], APos[M]) • Rlcp[M] = lcp (APos[M], APos[RM]) • We want to compute this information while we are sorting the array
Finding Longest Common Prefixes • Show how to compute lcp’s for suffixes in adjacent H-buckets during the sort algorithm • Use that to compute the lcp’s of all the suffixes that are consecutive in the sorted suffix array • Show how to compute lcps for all the necessary suffixes
Finding LCP for adjacent buckets • After the first sort stage, lcp’s of suffixes in adjacent buckets is 0 • Assume after stage H we know the lcps between suffixes in adjacent H-buckets • Suppose Ap and Aq are in the same H-bucket but not in the same 2H bucket • H ≤ lcp(Ap, Aq) < 2H • lcp(Ap, Aq) = H + lcp(Ap+H, Aq+H) • lcp(Ap+H, Aq+H) < H
k [i+1,j] H = 1 i j 2 1 0 Finding LCP for adjacent buckets • Let i,j be Ap+H, Aq+H’spositions in the suffix array • Assume i<j • Array is ordered according to the <H order • lcp(APos[i], APos[j]) = min(lcp(APos[k-1], APos[k]))
LCP Data Structures – Hgt[] • We need a data structure that will allow us: • get the lcp’s of consecutive suffixes • get their minimum • Hgt[] – an N-1 sized array • Hgt[i] = lcp(APos[i-1], APos[i])
k [a+1,b] LCP Data Structures – Hgt[] • Hgt will be computed inductively throughout the sort • Initialized to N+1 • Hgt[i] is updated in stage 2H APos[i] started a new 2H-bucket • To update Hgt[i]: • Let a,b be the array positions of APos[i-1]+H and APos[i] +H • Assume a≤b • Hgt[i] = H + min(Hgt[k])
9 0 0 0 9 9 9 9 0 0 0 9 3 0 0 0 1 1 2 Finding LCP - Example H = 1 H = 2 1 1 lcp (sin, ssin) = 1+ lcp(in, sin) = 1 + min(lcp(in,n), lcp(n,sassin), lcp(sassin,sin) = 1 + 0 = 1 lcp(sassin,sin) = 1 + lcp(assin, in) = 1 H = 4
k [i,j] LCP Data Structures - Interval Tree • We need the following operations for Hgt[]: • Set(i, h) – sets Hgt[i] to h • Min_height(i,j) – determines min(Hgt[k]) • We need to find a way to find the lcp’s for all the necessary suffixes – not just the ones in consecutive positions
LCP Data Structures - Interval Tree • A full and balanced binary tree • N-1 leaves, correspond to Hgt[] • O(logN) height, N-2 interior vertices • Keep a Hgt value for each interior vertex as well: • Hgt[v] = min(Hgt[left(v)], Hgt[right(v)])
LCP Data Structures - Interval Tree • Operations implementation: • Set(i,h) • Set Hgt[i] to h and update the Hgt values on the path from i to the root • Min-height(i,j) • Finds the minimal Hgt value by scanning O(logN) vertices in the tree • Operations complexity – O(logN)
1 0 9 0 0 0 1 9 (0,1) (1,2) (2,3) (3,4) (4,5) (5,6) (6,7) 1 9 0 0 0 9 9 9 Finding LCP – Interval Tree
Finding LCP - Complexity • In stage 2H we update Hgt[i] for all the leaves that started new buckets • Each update is one set operation and one Min_height - O(logN) • Throughout the algorithm every leaf is updated exactly once - O(N) updates • Updates complexity: O(NlogN) • In each stage we scan the array to see which suffixes opened new buckets • Scans complexity: O(NlogN) • Total LCP complexity O(NlogN)
Finding LCP - Llcp[] and Rlcp[] • We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort • Use an interval tree that represents a binary search • Each interior node corresponds to (LM, RM) for some M • For each interior node (LM, RM) • Left(LM, RM) = (LM,M) • Right(LM, RM) = (M, RM) • N-2 interior nodes • Leaves correspond to (i-1,i) • Leaf(i-1,i) = Hgt[i]
Finding LCP - Llcp[] and Rlcp[] • According to interval tree structure: • Hgt[(L,R)] = min(Hgt[k]) • Hgt[(L,R)] = lcp (APos[L], APos[R]) • Llcp[M] = Hgt[(LM,M)] • Rlcp[M] = Hgt[(M,RM)] k [L+1,R]
Suffix Array Build time O(NlogN) Search time O(P+logN) Structure space O(N) 2N - 3N integers Independent of |Σ| Suffix Tree Build time O(N) Search time O(P) Structure space O(N) Big constant Dependent of |Σ| Worst Case Complexity
Expected Time Improvements • Improve the expected case time of • Search Algorithm • Sort Algorithm • LCP computation • Use the following assumptions • All N-symbol strings are equally likely • Under this assumption: • Expected length of longest repeated substring of A is O(log|Σ|N)
Expected Case Improvements - Main Idea • Let T = • Let IntT(u) = integer encoding in base |Σ| of the T-symbol prefix of u • Example: • T = 3 • Σ = a,b • u = abaa • IntT(u) = 010 = 2 • There are |Σ|T ≤ N possible T-symbol prefixes • IntT(u) is a number in [0,N-1] • Map each suffix Ap to IntT(Ap) • Can be done in O(N) time
Expected Case Improvements - Search Algorithm • Use an additional array Buck[] • Think of the sorted array as buckets, based on the IntT encoding • Buck[k] = min{ i | IntT (APos[i]) = k} • The first position that contains a suffix that’s mapped to k • Compute Buck[] • at the end of the sort algorithm • O(N) additional time
Expected Case Improvements - Search Algorithm • Given a word W • We need to find Lw and Rw • Let k = IntT(W) • Lw and Rw must be in k’s bucket • (Buck[k], Buck[k+1]) • We only need to search one bucket
Expected Case Improvements - Search Algorithm • Number of buckets = |Σ|T ≤ N • Average number of elements in a bucket = O(1) • In the binary search for W • Expected size of bucket to search = O(1) • Expected number of search steps: O(1) • Expected case time: O(P)
Expected Case Improvements - Sort Algorithm • First stage of sort • Sort according to first symbol • Replace first stage with sort according to IntT • Equivalent to sort according to first T symbols • Can be done in O(N) time • We changed the base case of the sort from H=1 to H=T
Expected Case Improvements - Sort Algorithm Observation: • Let C be the length of the longest repeated substring of A • Sort is in fact complete once we have reached (C+1)-buckets • Suppose some (C+1)-bucket contains more than one suffix • Then we have two suffixes with lcp > C • This prefix is a repeated substring longer than C - contradiction
Expected Case Improvements - Sort Algorithm • Expected case: • C = O(log|Σ|N) = O(T) • Number of stages: O(1) • Expected case time: O(N)