940 likes | 1.01k Views
Explore exact matching solutions with Suffix Trees and Suffix Arrays for efficient text search, especially for large DNA or protein sequences. Learn advanced algorithms like KMP and BM for string matching.
E N D
Outline • Motivation • Exact Matching Problem • Suffix Tree • Building issues • Suffix Array • Build • Search • Longest common prefixes • Extra topics discussion • Suffix Tree VS. Suffix Array
Motivation • Text search • Need fast searching algorithm(with low space cost) • DNA sequences and protein sequences are too large to search by traditional algorithms • Some improved algorithms perform efficiently • KMP, BM algorithms for string matching • Suffix Tree with linear construction and searching time • Suffix Array with Suffix Tree based construction
Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ poulin at cs_ualberta_ca http://www.cs.ualberta.ca/~poulin/
Exact Matching Problem • Find ‘ssi’ in ‘mississippi’
Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ si s
Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ ssippi si s Every leaf below this point in the tree marks the starting location of ‘ssi’ in ‘mississippi’. (ie. ‘ssissippi’ and ‘ssippi’)
Exact Matching Problem • Find ‘sissy’ in ‘mississippi’
Exact Matching Problem • Find ‘sissy’ in ‘mississippi’
Exact Matching Problem • Find ‘sissy’ in ‘mississippi’ s i ss
Exact Matching Problem • Find ‘sissy’ in ‘mississippi’ s i ss
Exact Matching Problem • So what? Knuth-Morris-Pratt and Boyer-Moore both achieve this worst case bound. • O(m+n) when the text and pattern are presented together. • Suffix trees are much faster when the text is fixed and known first while the patterns vary. • O(m) for single time processing the text, then only O(n) for each new pattern. • Aho-Corasick is faster for searching a number of patterns at one time against a single text.
Boyer-Moore Algorithm • For string matching(exact matching problem) • Time complexity O(m+n) for worst case and O(n/m) for absense • Method: backward matching with 2 jumping arrays(bad character table and good suffix table)
What are suffix arrays and trees? • Text indexing data structures • not word based • allow search for patterns or • computation of statistics • Important Properties • Size • Speed of exact matching • Space required for construction • Time required for construction
Properties of a Suffix Tree • Each tree edge is labeled by a substring of S. • Each internal node has at least 2 children. • Each S(i) has its corresponding labeled path from root to a leaf, for 1in . • There are n leaves. • No edges branching out from the same internal node can start with the same character.
Building the Suffix Tree • How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree
Building the Suffix Tree • papua
Building the Suffix Tree • papua papua
Building the Suffix Tree • papua apua papua
Building the Suffix Tree • papua apua p apua ua
Building the Suffix Tree • papua apua p apua ua ua
Building the Suffix Tree • papua pua a p apua ua ua
Building the Suffix Tree • papua pua a p apua ua ua
Building the Suffix Tree • How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree Naïve method - O(m2) (m = text size)
Building the Suffix Tree in O(m) Time • In the previous example, we assumed that the tree can be built in O(m) time. • Weiner showed original O(m) algorithm (Knuth is claimed to have called it “the algorithm of 1973”) • More space efficient algorithm by McCreight in 1976 • Simpler ‘on-line’ algorithm by Ukkonen in 1995
Ukkonen’s Algorithm • Build suffix tree T for string S[1..m] • Build the tree in m phases, one for each character. At the end of phase i, we will have tree Ti, which is the tree representing the prefix S[1..i]. • In each phase i, we have i extensions, one for each character in the current prefix. At the end of extension j, we will have ensured that S[j..i] is in the tree Ti. NTHU Make Lab http://make.cs.nthu.edu.tw
Ukkonen’s Algorithm • 3 possible ways to extend S[j..i] with character i+1. • S[j..i] ends at a leaf. Add the character i+1 to the end of the leaf edge. • There is a path through S[j..i], but no match for the i+1 character. Split the edge and create a new node if necessary, then add a new leaf with character i+1. • There is already a path through S[j..i+1]. Do nothing.
Ukkonen’s Algorithm • In the form just presented, this is an O(m3) time, O(m2) space algorithm. • We need a few implementation speed-ups to achieve the O(m) time and O(m) space bounds.
The Suffix Array Definition:Given a string D thesuffix arraySA for this string is the sorted list of pointers to all suffixes of D. (Manber, Myers 1990)
中山大學資工系--楊昌彪教授 The Suffix Array http://par.cse.nsysu.edu.tw/~cbyang/ • In a suffix array, all suffixes of S are in the non-decreasing lexical order. • For example, S=“ATCACATCATCA”
How do we build it ? • Build a suffix tree • Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. • O(n) time • Suffix tree construction loses some of the advantage that the suffix array has over the suffix tree
Direct suffix array construction algorithm • Unfortunately, it is difficult to solve this problem with the suffix array Pos alone because Pos has lost the information on tree topology. In direct algorithm, the array Height (saving lcp information) has the information on the tree topology which is lost in the suffix array P “Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications”
Skew-algorithm • Step 1: SA≠ 0 = sort the suffixes starting at position i ≠ 0 mod 3. • Step 2: SA= 0= sort the suffixes starting at position i = 0 mod 3. • Step 3: SA = merge SA= 0and SA≠ 0. 0 12 3 45 6 78 9 10 s = m i s s i s s i p p i
Step 1: SA≠ 0 = sort the suffixes starting at position i ≠ 0 mod 3. 11 12 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 s = m i s s i s s i p p i $ $ m i s s i s s i p p i Radixsort 3 3 2 1 5 5 4 1 4 7 10 2 5 8 Let S12 = [ 3 3 2 1 5 5 4 ] => SA≠0= [ 10 7 4 1 8 5 2 ] in T(2n/3)
1 4 7 102 5 8 s12= [ 3 3 2 1 5 5 4 ] s121= 3 3 2 1 5 5 4 s124 = 3 2 1 5 5 4 s127 = 2 1 5 5 4 s1210 = 1 5 5 4 s122 = 5 5 4 S125= 5 4 s128 = 4 s = m i s s i s s i p p i s1 = i s s i s s i p p i s4 = i s s i p p i s7= i p p i s10 = i s2= s s i s s i p p i s5= s s i p p i s8= p p i SA≠ 0 = [ 107418 5 2 ], It suffices to show that S12i < S12j <=> si < sj.