Explore exact and approximate string matching techniques, suffix trees, and generalized suffix trees. Understand algorithms for pattern matching and sequence assembly using modern information retrieval methods.
Information Retrieval
• Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
• Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
• Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq
• http://www-igm.univ-mlv.fr/~lecroq/string/index.html
String Matching
String matching: the definition of the problem (text, pattern) depends on what we are given in advance, the text or the patterns.
• Exact matching:
• The patterns are given ---> data structures for the patterns
• 1 pattern ---> the algorithm depends on |p| and |Σ| (Σ is the alphabet)
• k patterns ---> the algorithm depends on k, |p| and |Σ|
• Extensions: regular expressions
• The text is given ---> data structure for the text (suffix tree, ...)
• Approximate matching:
• Dynamic programming
• Sequence alignment (pairwise and multiple)
• Sequence assembly: hash algorithm
• Probabilistic search: Hidden Markov Models
Index
• Part 1: Suffix trees. Algorithms on Strings, Trees and Sequences, Dan Gusfield, Cambridge University Press
• Part 2: Suffix arrays. Suffix arrays: a new method for on-line string searches, U. Manber and G. Myers
Suffix trees
Given the string ababaas, its suffixes are:
• 1: ababaas
• 2: babaas
• 3: abaas
• 4: baas
• 5: aas
• 6: as
• 7: s
[Figure: suffix tree of ababaas; each leaf is labelled with the text of its edge and the number of its suffix]
What kind of queries?
Applications of Suffix trees
[Figure: suffix tree of ababaas]
1. Exact string matching
• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
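The query on this slide can be tried out with a minimal sketch. It assumes an uncompressed suffix trie (nested dictionaries) in place of the compressed suffix tree of the figure; the answers are the same, only the space usage is worse. The names `build_suffix_trie` and `occurs` are illustrative and do not come from the slides.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """A pattern occurs in the text iff it can be spelt starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, occurs(trie, pattern))
```

Each query costs time proportional to the pattern length, independent of the text length, which is the point of indexing the text.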
Quadratic insertion algorithm and the suffix-tree
Invariant Properties: Given the string …
• P1: the leaves of the suffixes from … have been inserted
Quadratic insertion algorithm
Given the string ababaabbs
[Figure sequence: the suffix tree is built one suffix at a time. The first two steps add the leaves ababaabbs,1 and babaabbs,2 directly below the root. Inserting the third suffix splits the first edge into aba followed by baabbs,1 and adds the new leaf abbs,3. Each later step likewise either hangs a new leaf from an existing node or splits an edge to create the node the new leaf hangs from.]
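The insertion steps shown for ababaabbs can be sketched as follows. This is a minimal version under stated assumptions: edge labels are stored as explicit strings rather than index pairs, the final character (here s) is taken to be a terminator that occurs nowhere else, and the names `Node`, `build_suffix_tree` and `leaves` are illustrative, not taken from the slides.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first char of edge label -> (edge label, child node)
        self.leaf = leaf    # 1-based start of the suffix, for leaves only


def build_suffix_tree(text):
    """Naive quadratic construction: insert suffixes 1..n one at a time."""
    assert text.count(text[-1]) == 1, "text must end with a unique terminator"
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node.children:
                # No edge starts with this character: hang a new leaf here.
                node.children[rest[0]] = (rest, Node(start + 1))
                break
            label, child = node.children[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: keep descending.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it and add the new leaf.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            mid.children[rest[k]] = (rest[k:], Node(start + 1))
            node.children[rest[0]] = (label[:k], mid)
            break
    return root


def leaves(node, prefix=""):
    """Yield (suffix number, spelt suffix) for every leaf below `node`."""
    if node.leaf is not None:
        yield node.leaf, prefix
    for label, child in node.children.values():
        yield from leaves(child, prefix + label)


tree = build_suffix_tree("ababaabbs")
for number, suffix in sorted(leaves(tree)):
    print(number, suffix)
```

Each insertion walks down from the root comparing characters, so inserting all n suffixes takes O(n^2) character comparisons in the worst case, hence the name of the algorithm.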
Generalized suffix tree
The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings, each followed by its own terminator symbol. For instance, the generalized suffix tree of ababaabb and aabaat is the suffix tree of ababaabbαaabaatβ.
Generalized suffix tree
Construction of the suffix tree of ababaabbαaabaaβ: given the suffix tree of ababaabbα, insert the suffixes of aabaaβ into it one at a time.
[Figure sequence: starting from the suffix tree of ababaabbα, each step adds one leaf for a suffix of aabaaβ, beginning with aaβ,1 and followed by leaves labelled β, splitting an edge whenever the new suffix leaves the tree in the middle of an edge label. The last frame shows the generalized suffix tree of ababaabbαaabaaβ.]
Applications of Generalized Suffix trees
[Figure: generalized suffix tree of ababaabbαaabaaβ]
1. The substring problem for a database of strings DB
• Does the DB contain any occurrence of the patterns abab, aab, and ab?
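A minimal sketch of the substring problem, assuming the database holds the two strings of the running example, ababaabb and aabaa. Instead of the terminators α and β, every node of an uncompressed trie records the set of strings whose suffixes pass through it; this bookkeeping and the names `build_generalized_trie` and `strings_containing` are assumptions made for the illustration.

```python
def build_generalized_trie(strings):
    """Insert all suffixes of all strings; each node keeps the ids of its strings."""
    root = {"ids": set(range(len(strings))), "next": {}}
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    return root


def strings_containing(root, pattern):
    """Return the ids of the database strings in which `pattern` occurs."""
    node = root
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return set()
    return node["ids"]


db = ["ababaabb", "aabaa"]
index = build_generalized_trie(db)
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(index, pattern)))
```

The index is built once for the whole database, and each query then costs time proportional to the pattern length.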
Applications of Generalized Suffix trees
[Figure: generalized suffix tree of ababaabbαaabaaβ]
2. The longest common substring of two strings
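In a generalized suffix tree, the longest common substring is spelt by the deepest node that has suffixes of both strings below it. The sketch below follows that idea on an uncompressed trie, which is an assumption made to keep the code short; `TrieNode` and `longest_common_substring` are illustrative names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.ids = set()    # which of the two strings reach this node


def longest_common_substring(s, t):
    # Build a generalized suffix trie of s (id 0) and t (id 1).
    root = TrieNode()
    for sid, text in enumerate((s, t)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node.children.setdefault(ch, TrieNode())
                node.ids.add(sid)
    # Depth-first search restricted to nodes reached by both strings.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            if child.ids == {0, 1}:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("ababaabb", "aabaa"))
```

For the two example strings the search ends at the path abaa, the only common substring of length four.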
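Definition of MUM
MUM: Maximal Unique Matching. A MUM is a match that occurs exactly once in each sequence and cannot be extended to the left or to the right.
[Figure: two sequences, … a a t g … c t g ... and … c g t g … c c c ..., with the MUM marked between them]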
Applications of Generalized Suffix trees
[Figure: generalized suffix tree of ababaabbαaabaaβ]
3. Finding MUMs.
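To make the MUM definition concrete, here is a brute-force sketch that checks it directly: a substring is reported when it occurs exactly once in each string and the characters flanking the two occurrences differ (or a string boundary is reached) on both sides. It deliberately does not use the suffix tree, so it is far slower than the tree-based search the slide refers to; `count_occurrences` and `find_mums` are illustrative names.

```python
def count_occurrences(text, w):
    """Number of (possibly overlapping) occurrences of w in text."""
    return sum(text.startswith(w, i) for i in range(len(text) - len(w) + 1))


def find_mums(s, t, min_len=1):
    """Return (position in s, position in t, match) for every MUM, 0-based."""
    mums = []
    for i in range(len(s)):
        for j in range(i + min_len, len(s) + 1):
            w = s[i:j]
            # Unique: exactly one occurrence in each string.
            if count_occurrences(s, w) != 1 or count_occurrences(t, w) != 1:
                continue
            p = t.find(w)
            # Maximal: the match cannot grow to the left or to the right.
            left_ext = i > 0 and p > 0 and s[i - 1] == t[p - 1]
            right_ext = j < len(s) and p + len(w) < len(t) and s[j] == t[p + len(w)]
            if not left_ext and not right_ext:
                mums.append((i, p, w))
    return mums


print(find_mums("ababaabb", "aabaa"))
```

On the running example this reports abaa and aab; baa is also unique in both strings but is rejected because it can be extended to the left into abaa.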
Linear insertion algorithm and the suffix-tree
Invariant Properties: Given the string …
• P1: the leaves of the suffixes from … have been inserted
• P2: the string … is the longest string that can be spelt through the tree.
Linear insertion algorithm: example
Given the string ababaababb...
[Figure sequence: the tree already holds leaves 1 to 5. Positions 6, 7, 8 and then 9 of the string are marked as the part that can be spelt through the tree. Leaf 6 is then inserted with edge label b..., which splits an existing edge, and the marked positions shrink to 7, 8, 9 for the next suffix.]
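One way to read the example is that, for each suffix, the algorithm tracks the longest prefix that can already be spelt through the tree of the earlier suffixes. The sketch below computes those lengths by brute force; it is not the linear algorithm itself, only an illustration of the quantity it maintains. It assumes the string is just the ten visible characters ababaababb (the slide continues with "..."), and `head_lengths` is an illustrative name.

```python
def head_lengths(text):
    """For each suffix i, the longest prefix shared with some earlier suffix."""
    heads = []
    for i in range(len(text)):
        best = 0
        for j in range(i):
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            best = max(best, k)
        heads.append(best)
    return heads


heads = head_lengths("ababaababb")
for number, length in enumerate(heads, start=1):
    print(number, length)
```

For suffix 6 the length is 4, matching positions 6 to 9 in the example, and for suffix 7 it is 3, matching positions 7 to 9. The length never drops by more than one from one suffix to the next, so a linear-time construction can resume from the previous match instead of restarting at the root.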