Suffix Tree and Array: Exact Matching Solutions

Suffix Tree and Suffix Array

Outline • Motivation • Exact Matching Problem • Suffix Tree • Building issues • Suffix Array • Build • Search • Longest common prefixes • Extra topics discussion • Suffix Tree VS. Suffix Array

Motivation • Text search • Need fast searching algorithm(with low space cost) • DNA sequences and protein sequences are too large to search by traditional algorithms • Some improved algorithms perform efficiently • KMP, BM algorithms for string matching • Suffix Tree with linear construction and searching time • Suffix Array with Suffix Tree based construction

Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ poulin at cs_ualberta_ca http://www.cs.ualberta.ca/~poulin/

Exact Matching Problem • Find ‘ssi’ in ‘mississippi’

Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ si s

Exact Matching Problem • Find ‘ssi’ in ‘mississippi’ ssippi si s Every leaf below this point in the tree marks the starting location of ‘ssi’ in ‘mississippi’. (ie. ‘ssissippi’ and ‘ssippi’)

Exact Matching Problem • Find ‘sissy’ in ‘mississippi’

Exact Matching Problem • Find ‘sissy’ in ‘mississippi’ s i ss

Exact Matching Problem • So what? Knuth-Morris-Pratt and Boyer-Moore both achieve this worst case bound. • O(m+n) when the text and pattern are presented together. • Suffix trees are much faster when the text is fixed and known first while the patterns vary. • O(m) for single time processing the text, then only O(n) for each new pattern. • Aho-Corasick is faster for searching a number of patterns at one time against a single text.

Boyer-Moore Algorithm • For string matching(exact matching problem) • Time complexity O(m+n) for worst case and O(n/m) for absense • Method: backward matching with 2 jumping arrays(bad character table and good suffix table)

What are suffix arrays and trees? • Text indexing data structures • not word based • allow search for patterns or • computation of statistics • Important Properties • Size • Speed of exact matching • Space required for construction • Time required for construction

Suffix Tree

Properties of a Suffix Tree • Each tree edge is labeled by a substring of S. • Each internal node has at least 2 children. • Each S(i) has its corresponding labeled path from root to a leaf, for 1in . • There are n leaves. • No edges branching out from the same internal node can start with the same character.

Building the Suffix Tree • How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree

Building the Suffix Tree • papua

Building the Suffix Tree • papua papua

Building the Suffix Tree • papua apua papua

Building the Suffix Tree • papua apua p apua ua

Building the Suffix Tree • papua apua p apua ua ua

Building the Suffix Tree • papua pua a p apua ua ua

Building the Suffix Tree • How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree Naïve method - O(m2) (m = text size)

Building the Suffix Tree in O(m) Time • In the previous example, we assumed that the tree can be built in O(m) time. • Weiner showed original O(m) algorithm (Knuth is claimed to have called it “the algorithm of 1973”) • More space efficient algorithm by McCreight in 1976 • Simpler ‘on-line’ algorithm by Ukkonen in 1995

Ukkonen’s Algorithm • Build suffix tree T for string S[1..m] • Build the tree in m phases, one for each character. At the end of phase i, we will have tree Ti, which is the tree representing the prefix S[1..i]. • In each phase i, we have i extensions, one for each character in the current prefix. At the end of extension j, we will have ensured that S[j..i] is in the tree Ti. NTHU Make Lab http://make.cs.nthu.edu.tw

Ukkonen’s Algorithm • 3 possible ways to extend S[j..i] with character i+1. • S[j..i] ends at a leaf. Add the character i+1 to the end of the leaf edge. • There is a path through S[j..i], but no match for the i+1 character. Split the edge and create a new node if necessary, then add a new leaf with character i+1. • There is already a path through S[j..i+1]. Do nothing.

Ukkonen’s Algorithm - mississippi

Ukkonen’s Algorithm • In the form just presented, this is an O(m3) time, O(m2) space algorithm. • We need a few implementation speed-ups to achieve the O(m) time and O(m) space bounds.

Suffix Array

The Suffix Array Definition:Given a string D thesuffix arraySA for this string is the sorted list of pointers to all suffixes of D. (Manber, Myers 1990)

中山大學資工系--楊昌彪教授 The Suffix Array http://par.cse.nsysu.edu.tw/~cbyang/ • In a suffix array, all suffixes of S are in the non-decreasing lexical order. • For example, S=“ATCACATCATCA”

fin

How do we build it ? • Build a suffix tree • Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. • O(n) time • Suffix tree construction loses some of the advantage that the suffix array has over the suffix tree

Direct suffix array construction algorithm • Unfortunately, it is difficult to solve this problem with the suffix array Pos alone because Pos has lost the information on tree topology. In direct algorithm, the array Height (saving lcp information) has the information on the tree topology which is lost in the suffix array P “Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications”

Skew-algorithm • Step 1: SA≠ 0 = sort the suffixes starting at position i ≠ 0 mod 3. • Step 2: SA= 0= sort the suffixes starting at position i = 0 mod 3. • Step 3: SA = merge SA= 0and SA≠ 0. 0 12 3 45 6 78 9 10 s = m i s s i s s i p p i

Step 1: SA≠ 0 = sort the suffixes starting at position i ≠ 0 mod 3. 11 12 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 s = m i s s i s s i p p i $ $ m i s s i s s i p p i Radixsort 3 3 2 1 5 5 4 1 4 7 10 2 5 8 Let S12 = [ 3 3 2 1 5 5 4 ] => SA≠0= [ 10 7 4 1 8 5 2 ] in T(2n/3)

1 4 7 102 5 8 s12= [ 3 3 2 1 5 5 4 ] s121= 3 3 2 1 5 5 4 s124 = 3 2 1 5 5 4 s127 = 2 1 5 5 4 s1210 = 1 5 5 4 s122 = 5 5 4 S125= 5 4 s128 = 4 s = m i s s i s s i p p i s1 = i s s i s s i p p i s4 = i s s i p p i s7= i p p i s10 = i s2= s s i s s i p p i s5= s s i p p i s8= p p i SA≠ 0 = [ 107418 5 2 ], It suffices to show that S12i < S12j <=> si < sj.

Suffix Tree and Array: Exact Matching Solutions

Suffix Tree and Array: Exact Matching Solutions

Presentation Transcript

Suffix trees and suffix arrays

Suffix Trees

Suffix tree and suffix array techniques for pattern analysis in strings

Pattern Matching: Suffix Tree Applications

Suffix Trees and Suffix Arrays

Suffix Trees, Suffix Arrays and Suffix Trays

Suffix - ly

Suffix Trees and Suffix Arrays

Suffix Trees

Suffix Tree

Suffix tree and suffix array techniques for pattern analysis in strings

Suffix arrays

Suffix Trees

Compressed Suffix Arrays and Suffix Trees

SUFFIX TREES

Simple Linear Work Suffix Array Construction

Suffix Trees and Suffix Arrays

Faster Suffix Tree Construction With Missing Suffix Links

Trie/Suffix Trie/Suffix Tree

Suffix Trees

Suffix Trees

Trie/Suffix Trie/Suffix Tree