Suffix Arrays

Suffix Arrays A New Method for Online String Searches U.Manber and G.Myers

Introduction - String matching • Let A = a0a1...aN-1 be a large text of length N • Let W = w0w1...wp-1 be a word of length P • Is W a substring of A?

Introduction - Suffix Trees • Build time • O(N) • Search time • O(P) • Structure space • O(N) • Big constant • Dependent of |Σ|

Suffix Arrays • An array of all the suffixes of A • Sorted by lexicographical order A = aababa

Suffix Arrays • Ai = aiai+1...aN-1 • The suffix of A that starts at position i. • Position array (Pos) • Pos[k] is the start position of kth smallest suffix • APos[k] is the suffix pointed from Pos[k] • APos[k] is the kth smallest suffix Pos A = aababa

Searching • “Is W a substring of A?” • W is a substring of A Some suffix Ai starts with W • i is W’s location • All the instances of W must match consecutive suffixes in the array • Find the array interval that contains those suffixes

Searching - Definitions • For a string u • up = u0u1...up-1 • For strings u,v • u ≤p v up ≤ vp • Same for ≠, =, >… • For any p, Pos is ordered according to ≤p

W =p APos[k] LW RW W >p APos[k] W < p APos[k] Searching - Definitions • W = w0w1…wP-1 • LW= min (k : W ≤pAPos[k]or k = N) • First suffix ≥p from W • RW= max (k : APos[k]≤p W or k = 1) • Last suffix ≤p from W

Search Algorithm • k [LW, RW] W =p APos[k] • To find W’s instances - find [LW, RW] • Number of W’s occurrences is (RW-LW+1) • Matches are APos[LW],…, APos[RW] • Suffix array is sorted - use binary search

Binary Search • Search interval [L,R] • Midpoint M • Compare W to APos[M] • Decide where to search next • W ≤pAPos[M] - search in left half (R = M) • W >pAPos[M] - search in right half (L = M) • O(PlogN) W = abc L M R

Search Algorithm • Observation: • We can use information from one comparison to speedup the next comparisons • Use additional information • lcp = longest common prefix

Search Algorithm - lcp • lcp(v,w) = the length of the longest common prefix of v and w • Obtained by comparing v and w and stopping at the first unequal symbol • Use precomputed lcp information to reduce the number of comparisons to O(P + logN)

Search Algorithm • Consider all possible midpoints • M = 1…N-2 • Every midpoint corresponds to a triplet [LM,M,RM] • Suppose we precomputed two arrays: • Llcp[M] = lcp (APos[LM], APos[M]) • Rlcp[M] = lcp (APos[M], APos[RM])

l = 2 r = 1 LM M RM Llcp[M] = 1 Rlcp[M] = 1 Search Algorithm • Maintain two more variables • l = lcp(APos[L], W) • r = lcp(W, APos[R]) W = abcd

Go Right!l remains unchanged W = abcd l = 2 r = 1 LM M RM Llcp[M] = 3 Rlcp[M] = 1 Search Algorithm • Assume l≥r • Compare l with Llcp • If l < Llcp[M] • W >l+1 APos[LM] • APos[LM] =l+1 APos[M] • W >l+1 APos[M]

Go Left!r = Llcp[M] l = 2 r = 1 LM M RM Llcp[M] = 1 Rlcp[M] = 1 Search Algorithm If l > Llcp[M] • APos[LM] <l APos[M] • W =l APos[LM] • W <l APos[M] W = abcd

l = 2 r = 1 LM M RM Llcp[M] = 2 Rlcp[M] = 1 Search Algorithm If l = Llcp[M] • W can be in either half • Start comparing A and APos[M] from the (l+1) symbol • First unequal symbol determines whether to go right or left • r/l will be updated to l+j • j+1 comparisons W = abcd

Search Algorithm - Complexity • In each Iteration: • Let h=max(l,r) • We start comparing from the hth symbol to the h+j+1 • j+1 symbol comparisons • Next time we will start from the h+j symbol • j symbols out of the j+1 will not be compared again

Search Algorithm - Complexity • Every symbol in W will be successfully matched at most once • O(P) successful comparisons • At most one symbol will be unsuccessfully matched in each iteration • O(logN) unsuccessful comaprsions • Total: O(P + logN) comparisons

Build Suffix Array So far… • A O(P + logN) search algorithm • Given a sorted suffix array • Given lcp information (Llcp, Rlcp) Next… • Sort the suffix array in O(NlogN) • Compute the lcp’s while sorting the array

Sort Algorithm • First stage • Sort the suffixes into buckets, according to first symbol • Inductive stage • Assume array is bucket sorted according to first H symbols • Every H-bucket holds suffixes with the same H first symbols • Buckets are ordered according to the ≤H relation • Sort according to 2H first symbols

Sort Algorithm – Intuition • Let Ai, Aj be two suffixes in the same H-bucket • Ai =H Aj • Next H symbols of Ai and Aj are the first H symbols of Ai+H and Aj+H • In order to determine the ≤2H order of Ai and Aj, look at the ≤H order of Ai+H and Aj+H A = aababaa H = 2 Aj+H Ai Aj Ai+H

Sort Algorithm – Main Idea • Let Ai be a suffix in the first H-bucket • Ai starts with the smallest H-symbol string • Ai-H should be the first in its 2H-bucket A = aababa H = 1

Sort Algorithm • In stage H • Go over all the suffixes in the ≤H order • For each Ai move Ai-H to the next available place in its H-bucket • The suffixes are now sorted according to the ≤2H order • Go on to stage 2H to produce ≤4H order

assin assassin in n ssassin sin ssin sassin H = 1 Sort Algorithm - Example A = assassin assassin assin in sassin sin ssin ssassin H = 2

A = assassin 0 1 2 3 4 5 6 7 assassin assin in n sassin sin ssin ssassin H = 2 assassin assin in n sassin sin ssassin ssin H = 4 Sort Algorithm - Example

Sort Algorithm - Complexity • First Stage • Bucket sort according to first symbol • O(NlogN) • Inductive Stages • O(logN) stages • O(N) per stage • Total O(NlogN) • Space • Can be implemented using two N-sized integer arrays

Finding Longest Common Prefixes • The search algorithm uses lcp information: • Llcp[M] = lcp (APos[LM], APos[M]) • Rlcp[M] = lcp (APos[M], APos[RM]) • We want to compute this information while we are sorting the array

Finding Longest Common Prefixes • Show how to compute lcp’s for suffixes in adjacent H-buckets during the sort algorithm • Use that to compute the lcp’s of all the suffixes that are consecutive in the sorted suffix array • Show how to compute lcps for all the necessary suffixes

Finding LCP for adjacent buckets • After the first sort stage, lcp’s of suffixes in adjacent buckets is 0 • Assume after stage H we know the lcps between suffixes in adjacent H-buckets • Suppose Ap and Aq are in the same H-bucket but not in the same 2H bucket • H ≤ lcp(Ap, Aq) < 2H • lcp(Ap, Aq) = H + lcp(Ap+H, Aq+H) • lcp(Ap+H, Aq+H) < H

k [i+1,j] H = 1 i j 2 1 0 Finding LCP for adjacent buckets • Let i,j be Ap+H, Aq+H’spositions in the suffix array • Assume i<j • Array is ordered according to the <H order • lcp(APos[i], APos[j]) = min(lcp(APos[k-1], APos[k]))

LCP Data Structures – Hgt[] • We need a data structure that will allow us: • get the lcp’s of consecutive suffixes • get their minimum • Hgt[] – an N-1 sized array • Hgt[i] = lcp(APos[i-1], APos[i])

k [a+1,b] LCP Data Structures – Hgt[] • Hgt will be computed inductively throughout the sort • Initialized to N+1 • Hgt[i] is updated in stage 2H APos[i] started a new 2H-bucket • To update Hgt[i]: • Let a,b be the array positions of APos[i-1]+H and APos[i] +H • Assume a≤b • Hgt[i] = H + min(Hgt[k])

9 0 0 0 9 9 9 9 0 0 0 9 3 0 0 0 1 1 2 Finding LCP - Example H = 1 H = 2 1 1 lcp (sin, ssin) = 1+ lcp(in, sin) = 1 + min(lcp(in,n), lcp(n,sassin), lcp(sassin,sin) = 1 + 0 = 1 lcp(sassin,sin) = 1 + lcp(assin, in) = 1 H = 4

k [i,j] LCP Data Structures - Interval Tree • We need the following operations for Hgt[]: • Set(i, h) – sets Hgt[i] to h • Min_height(i,j) – determines min(Hgt[k]) • We need to find a way to find the lcp’s for all the necessary suffixes – not just the ones in consecutive positions

LCP Data Structures - Interval Tree • A full and balanced binary tree • N-1 leaves, correspond to Hgt[] • O(logN) height, N-2 interior vertices • Keep a Hgt value for each interior vertex as well: • Hgt[v] = min(Hgt[left(v)], Hgt[right(v)])

LCP Data Structures - Interval Tree • Operations implementation: • Set(i,h) • Set Hgt[i] to h and update the Hgt values on the path from i to the root • Min-height(i,j) • Finds the minimal Hgt value by scanning O(logN) vertices in the tree • Operations complexity – O(logN)

1 0 9 0 0 0 1 9 (0,1) (1,2) (2,3) (3,4) (4,5) (5,6) (6,7) 1 9 0 0 0 9 9 9 Finding LCP – Interval Tree

Finding LCP - Complexity • In stage 2H we update Hgt[i] for all the leaves that started new buckets • Each update is one set operation and one Min_height - O(logN) • Throughout the algorithm every leaf is updated exactly once - O(N) updates • Updates complexity: O(NlogN) • In each stage we scan the array to see which suffixes opened new buckets • Scans complexity: O(NlogN) • Total LCP complexity O(NlogN)

Finding LCP - Llcp[] and Rlcp[] • We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort • Use an interval tree that represents a binary search • Each interior node corresponds to (LM, RM) for some M • For each interior node (LM, RM) • Left(LM, RM) = (LM,M) • Right(LM, RM) = (M, RM) • N-2 interior nodes • Leaves correspond to (i-1,i) • Leaf(i-1,i) = Hgt[i]

Finding LCP - Llcp[] and Rlcp[] • According to interval tree structure: • Hgt[(L,R)] = min(Hgt[k]) • Hgt[(L,R)] = lcp (APos[L], APos[R]) • Llcp[M] = Hgt[(LM,M)] • Rlcp[M] = Hgt[(M,RM)] k [L+1,R]

Suffix Array Build time O(NlogN) Search time O(P+logN) Structure space O(N) 2N - 3N integers Independent of |Σ| Suffix Tree Build time O(N) Search time O(P) Structure space O(N) Big constant Dependent of |Σ| Worst Case Complexity

Expected Time Improvements • Improve the expected case time of • Search Algorithm • Sort Algorithm • LCP computation • Use the following assumptions • All N-symbol strings are equally likely • Under this assumption: • Expected length of longest repeated substring of A is O(log|Σ|N)

Expected Case Improvements - Main Idea • Let T = • Let IntT(u) = integer encoding in base |Σ| of the T-symbol prefix of u • Example: • T = 3 • Σ = a,b • u = abaa • IntT(u) = 010 = 2 • There are |Σ|T ≤ N possible T-symbol prefixes • IntT(u) is a number in [0,N-1] • Map each suffix Ap to IntT(Ap) • Can be done in O(N) time

Expected Case Improvements - Search Algorithm • Use an additional array Buck[] • Think of the sorted array as buckets, based on the IntT encoding • Buck[k] = min{ i | IntT (APos[i]) = k} • The first position that contains a suffix that’s mapped to k • Compute Buck[] • at the end of the sort algorithm • O(N) additional time

Expected Case Improvements - Search Algorithm • Given a word W • We need to find Lw and Rw • Let k = IntT(W) • Lw and Rw must be in k’s bucket • (Buck[k], Buck[k+1]) • We only need to search one bucket

Expected Case Improvements - Search Algorithm • Number of buckets = |Σ|T ≤ N • Average number of elements in a bucket = O(1) • In the binary search for W • Expected size of bucket to search = O(1) • Expected number of search steps: O(1) • Expected case time: O(P)

Expected Case Improvements - Sort Algorithm • First stage of sort • Sort according to first symbol • Replace first stage with sort according to IntT • Equivalent to sort according to first T symbols • Can be done in O(N) time • We changed the base case of the sort from H=1 to H=T

Expected Case Improvements - Sort Algorithm Observation: • Let C be the length of the longest repeated substring of A • Sort is in fact complete once we have reached (C+1)-buckets • Suppose some (C+1)-bucket contains more than one suffix • Then we have two suffixes with lcp > C • This prefix is a repeated substring longer than C - contradiction

Expected Case Improvements - Sort Algorithm • Expected case: • C = O(log|Σ|N) = O(T) • Number of stages: O(1) • Expected case time: O(N)

Suffix Arrays

Suffix Arrays

Presentation Transcript

Suffix trees and suffix arrays

Compressed suffix arrays and suffix trees with applications to text indexing and string matching

Approximate String Matching using Compressed Suffix Arrays

Compressed Compact Suffix Arrays

Suffix Trees and Suffix Arrays

Compressed Suffix Arrays based on Run-Length Encoding

Suffix Trees, Suffix Arrays and Suffix Trays

Optimizing multi-pattern searches for compressed suffix arrays

Fine Tuning the Enhanced Suffix Arrays

Suffix Trees and Suffix Arrays

Lecture 17: Suffix Arrays and Burrows Wheeler Transforms

Counting Suffix Arrays and Strings

Suffix arrays

Compressed Suffix Arrays and Suffix Trees

Genomic Repeat Visualisation Using Suffix Arrays

Suffix Trees and Suffix Arrays

Linear-Time Search in Suffix Arrays

Compressed Suffix Arrays

Trie/Suffix Trie/Suffix Tree

Efficient Computation of Substring Equivalence Classes with Suffix Arrays

Trie/Suffix Trie/Suffix Tree