Suffix Arrays: A new method for on-line string searches

Suffix Arrays:A new method for on-linestring searches • Udi Manber • Gene Myers May 1989 Presented by: Oren Weimann

Introduction - Problem definition “Is W a substring of A?” • |A|=N and |W|=P • A = a0a1…aN-1 • Ai = suffix beginning at index i = aiai+1…aN-1 W= badgfbb A= abccbbadgfbbcahgjf A= abccbbadgfbbcahgjf

Introduction – what is a suffix array? Example: A = assassin Pos[2] = 6 (A6 = in) Pos

Introduction – what is a suffix array? A lexicographically sorted array- Pos[N], of all the suffixes of A: Pos[k] = i  Ai is the kth smallest suffix in the set {A0, A1, A2…… AN-1}

Introduction – what is a suffix tree? Example: • A trie that contains all suffixes of A: A = assassin s a s s a s s i n s 1 i i a n i n s a n i s s 6 s 5 4 i n i n n 0 3 2

The Article Overview • A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). • How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) • An Algorithm for computing the lcp information in O(NlogN). • Algorithms for Expected-time improvement.

The Search algorithm - Definitions • For any string u, up = u1u2u3…….up (or u if |u| p) • Let “ “ denote a Lexicographical order, We say u v  up vp • Note that for any choice of p: • Note that W is a substring of A  there is an isuch that W

The Search algorithm – how does the array help us know if W is a substring of A? • We define a search interval: LW = min {k | W APos[k] or k = N} RW = max {k | W APos[k] or k = -1} • W matches ai ai+1 ...ai+P-1 i=Pos[k] for some k [LW, RW]

Example: A = assassin Pos Option 1 Option 2 Option 3

Why finding LW,RW == Finding the matches: • If LW > RW => W is not a substring of A. • Else: there are (RW-LW+1) matches - APos[LW],…, APos[RW] Pos W>APos[k] W<APos[k] LW RW

The Search algorithm –The easy way - O(PlogN) W=“abcx” Pos M R L Log(N) iterations, each iteration sets new L,R bonds (initially L=0, R=N-1) according to a comparison of W with APos[M] , where M=(L+R)/2. In the end LWR

The Search algorithm using lcp values in O(P+logN) – Definitions: Speedup using precomputed lcp Values, for now We assume lcp is known. Each iteration We define: • l = lcp(APos[L], W) • r=lcp(W, APos[R]) • Llcp[M] = lcp(APos[L] APos[M]) • Rlcp[M] = lcp(APos[M], APos[R])

The Search algorithm using lcp values in O(P+logN) Example: A=“abcx” l = 3 r = 2 Pos Llcp[M]=4 Rlcp[M]=2 M R L Note that Llcp[M] is well defined because every midpoint M has one LM and one RM

So how do we use l,r,Llcp[M] ?Example: W=abcx Llcp[M]=4 l=3 r=2 R L M Case 1: Llcp[M] > l (Llcp[M]=4 and l=3 ) W>APos[L] • W>APos[M] • Go right • l is unchanged = 3

Example: W=abcx (cont.) Case 2: Llcp[M] < l (Llcp[M]=2 and l=3 ) APos[L] <APos[M] • W<APos[M] • Go left • r = Llcp[M] = 2 Llcp[M]=2 l=3 r=2 M L R

Example: W=abcx (cont.) Llcp[M]=3 r=2 l=3 L M R Case 3: Llcp[M] = l (Llcp[M]=3 and l=3 ) Compare Wl and APos[M]l until Wl+j APos[M]l+j • Go right or left according to Wl+j, APos[M]l+j • new l or r = (l+j) • Number of comparisons = j+1

The Search algorithm using lcp values-complexity In each iteration there are maximum j+1 comparisons, when in total • Total comparisons (P + #Iterations) • O(P+logN) running time • Requires only 3N-sized arrays

The Article Overview • A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). • How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) • An Algorithm for computing the lcp information in O(NlogN). • Algorithms for Expected-time improvement.

Construction of suffix array in O(NlogN) Sorting the suffixes in a unique Radix sort – We Will have O(logN) stages (numbered 1,2,4,8,16…) In stage H the suffixes are sorted in buckets called H Buckets, according to the first H characters. (next stage is 2H– thus, in stage H the suffixes are sorted by )

Construction of suffix array –The general idea If Ai, Aj H-bucket we Sort them by the Next H symbols, but: Their next H symbols = first H symbols of Ai+H and Aj+H which are already sorted in phase H. first bucket second bucket third bucket fourth bucket Ai Aj Aj+H Ai+H H=2:

Construction of suffix array –The general idea (cont.) • Let Ai be in first H-bucket after stage H • Ai starts with smallest H-symbol string • Ai-H should be first in its H-bucket H=2: Ai Ai-H

Construction of suffix array –The algorithm • Go over the suffix array: • For each Ai: Move Ai-H to next available place in its H-bucket • The suffixes are now sorted according to -order • Go over the array again, and decide which suffix opens a new 2H-bucket, use lcs knowledge (described later)

Construction of suffix array –The algorithm Example: A = assassin A2 A3 H=1 Ai sets Ai-1

Construction of suffix array –The algorithm Example: A = assassin A0 H=1 Ai sets Ai-1

Construction of suffix array –The algorithm Example: A = assassin Go over array to get new 2-buckets lcs(sassin,sin)= 1+ lcs(assin,in)= 1+0=1 so “sin” opens a new 2-bucket H=1 Ai sets Ai-1 back

Construction of suffix array –The algorithm Example: A = assassin Go over array to get new 4-buckets H=2 Ai sets Ai-2

Construction of suffix array –The algorithm Example: A = assassin That’s it, we are sorted! H=4

Construction of suffix array –Complexity Summary • Sorting by first char – O(N) • O(logN) stages of O(N) operations = O(NlogN) • Total - time: O(NlogN) - space: 2 integer arrays of size N back

The Article Overview • A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). • How to construct Pos[ ] in O(NlogN) time and O(N) space. • An Algorithm for computing the lcp information in O(NlogN). • Algorithms for Expected-time improvement.

How to find Longest Common Prefixes – the general idea • We don’t care what is the lcp between suffixes in the same H-bucket. • For Ap, Aq in the same H-bucket but different 2H-buckets: • H lcp(Ap, Aq) < 2H • lcp(Ap, Aq) = H + lcp(Ap+H, Aq+H) • lcp(Ap+H, Aq+H) < H  that is why Ap+H,Aq+H Are in different H-buckets, but which ones?

How to find Longest Common Prefixes – the general idea • If Ap+H and Aq+H were in adjacent H-buckets then lcp is known. how? • If not, Then: lcp(APos[i], APos[j]) = {lcp(APos[k],APos[k+1])}

How to find Longest Common Prefixes – the general idea lcp(Ap+H, Aq+H) = min{1,1,2} = 1 H=2 1 1 2 Ap+h Aq+h Notice that if 2 neighbors are in the same H-bucket we can consider there lcp to be H, since lcp(Ap+H, Aq+H) < H

How to find lcp – algorithm and data structures – Hgt[] During the construction stage, we build an array Called Hgt[N]: Hgt(i)=lcp(APos[i-1], APos[i]), initialized so that Hgt[i]=N+1 for every i. • In stage H=1: Hgt(i)=0 for APos[i] that are first in their buckets. • In stage 2H: we update every Hgt(i) that APos[i] is the first in a newly created 2H bucket

H=1 assin assassin in n sin ssin sassin ssassin 9 0 0 0 9 9 9 H=2 assin assassin in n sassin sin ssin ssassin 9 0 0 0 1 1 9 How to find lcp – Hgt[] example: lcp(ssin,sin)=1+lcp(sin,in)=1+min{lcp(in,n),lcp(sin, n)}=1

H=4 ssin assassin assin in n sassin sin ssassin 3 0 0 0 1 1 2 How to find lcp – Hgt[] example (cont.) lcp(assassin,assin)=2+lcp(sin, sassin)=2+1=3 lcp(ssin, ssassin)=2+lcp(in, assin)=2+0=2

How to find lcp –data structures We need a data structure that will contain lcp(APos[j], APos[i]) between any i and j (not just i and i+1 which Hgt[] supplies) Hgt[] will become the leaves of a binary balanced tree called the Interval tree.

Suffix Arrays: A new method for on-line string searches

Suffix Arrays: A new method for on-line string searches

Presentation Transcript

Lecture 15

String Theory (HEP2005)

Combinatorial Pattern Matching

Arrays

Chapter 6

Chapter 7 – Arrays

Combinatorial Pattern Matching

Similarity Searches in Sequence Databases

String/Gauge Duality: (re) discovering the QCD String in AdS Space

Supersymmetry searches at colliders

string comments[5]

Class string and String Stream Processing

Arrays

C Arrays

Review of Important Topics in CS1600

Arrays

Arrays and Pointers

Workbook 8 String Processing Tools

Arrays

Chapter 6 - Arrays

Languages

The Next Best String