This article provides an overview of algorithms for exact string matching and sequence alignment in the context of genome analysis. It covers topics such as dealing with long sequences, comparing and analyzing genomes, and using suffix data structures. The article also discusses the construction and applications of suffix trees in genome analysis.
Contents
• First week: algorithms for exact string matching.
  • One pattern: the algorithm depends on |p| and |…|.
  • k patterns: the algorithm depends on k, |p| and |…|.
• Second week: alignment of sequences.
  • Edit distance between two strings: dynamic programming.
  • Alignment of sequences: 2 sequences; 3 or more sequences.
• Third week: dealing with long sequences.
Dealing with genomes What can be done with a genome or with a chromosome? • Compare it with other genomes. • The distribution of patterns of a given length. • The most frequent patterns of a given length. • Look for the repeats (short and long)
Comparison of genomes What's the meaning?
Comparison of genomes 15 microbial genomes:
Comparison of genomes 2 Pyrococcus genomes:
MUM: Maximal Unique Matching. [Figure: two sequences, … a a t g … c t g … and … c g t g … c c c …, with the matched regions between them marked as MUMs.] Parallel MUMs form a cluster.
Suffix data structures
• Part 1: Suffix trees. Reference: Dan Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press.
• Part 2: Suffix arrays. Reference: U. Manber and G. Myers, Suffix arrays: a new method for on-line string searches.
Suffix trees Given the string ababaas, its suffixes are: 1: ababaas, 2: babaas, 3: abaas, 4: baas, 5: aas, 6: as, 7: s. [Figure: suffix tree of ababaas: root → a → (as,5 | ba → (as,3 | baas,1) | s,6); root → ba → (as,4 | baas,2); root → s,7.] What kind of queries?
Applications of suffix trees 1. Exact string matching: does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
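The question on this slide can be answered by walking each pattern down from the root of the suffix tree. The following is a minimal Python sketch under stated assumptions, not the course's implementation: the tree is built by plain suffix-by-suffix insertion, the last symbol of the text (here s) is assumed to be a unique terminator, and the names build_suffix_tree, leaf_positions and occurrences are hypothetical.

```python
def build_suffix_tree(text):
    """Naive construction: insert every suffix of text, one at a time.

    A node is a dict mapping the first character of an outgoing edge to
    [edge label, child]; a child is another dict or, for a leaf, the
    1-based start position of the suffix.
    """
    # Assumption: the last symbol is a unique terminator ('s' in 'ababaas').
    assert text.count(text[-1]) == 1
    root = {}
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node:            # no edge to follow: new leaf
                node[rest[0]] = [rest, start + 1]
                break
            label, child = node[rest[0]]
            k = 0                              # length of the common prefix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):                # whole edge matched: go down
                node, rest = child, rest[k:]
            else:                              # mismatch inside the edge: split it
                node[rest[0]] = [label[:k], {label[k]: [label[k:], child],
                                             rest[k]: [rest[k:], start + 1]}]
                break
    return root


def leaf_positions(node):
    """All suffix start positions below a node (or of the leaf itself)."""
    if not isinstance(node, dict):
        return [node]
    return [p for _, child in node.values() for p in leaf_positions(child)]


def occurrences(root, pattern):
    """Sorted 1-based positions where pattern occurs; empty list if none."""
    node, rest = root, pattern
    while rest:
        if not isinstance(node, dict) or rest[0] not in node:
            return []
        label, child = node[rest[0]]
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    return sorted(leaf_positions(node))


tree = build_suffix_tree("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, occurrences(tree, pattern))
# abab [1]
# aab []
# ab [1, 3]
```

So abab and ab occur in ababaas, and aab does not. Once the tree exists, each query costs time proportional to the pattern length plus the number of occurrences reported.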
Quadratic insertion algorithm and the suffix tree Invariant properties, given the string …: P1: the leaves of the suffixes from … have been inserted.
Quadratic insertion algorithm: example Given the string ababaabbs, the suffixes are inserted one at a time:
• 1 (ababaabbs): new leaf ababaabbs,1 below the root.
• 2 (babaabbs): new leaf babaabbs,2 below the root.
• 3 (abaabbs): the edge ababaabbs is split after aba, giving the leaves baabbs,1 and abbs,3.
• 4 (baabbs): the edge babaabbs is split after ba, giving the leaves baabbs,2 and abbs,4.
• 5 (aabbs): the edge aba is split after a; new leaf abbs,5.
• 6 (abbs): the edge ba below a is split after b; new leaf bs,6.
• 7 (bbs): the root edge ba is split after b; new leaf bs,7.
• 8 (bs): new leaf s,8 below b.
• 9 (s): new leaf s,9 below the root.
Resulting suffix tree: root → a → (abbs,5 | b → (a → (abbs,3 | baabbs,1) | bs,6)); root → b → (a → (abbs,4 | baabbs,2) | bs,7 | s,8); root → s,9.
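The insertion steps of this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the slides' own code: the last symbol of the string is assumed to be a unique terminator, nodes are plain dicts, and the names insert_suffixes and spelled_suffixes are hypothetical. Each suffix is matched from the root symbol by symbol, which is what makes the construction quadratic in the worst case.

```python
def insert_suffixes(text):
    """Quadratic insertion: add the suffixes of text one at a time.

    A node is a dict: first character -> [edge label, child]. A child is a
    dict (internal node) or an int (leaf holding the 1-based suffix start).
    Returns the root and a log with one line per inserted suffix.
    """
    # Assumption: the last symbol occurs only once (unique terminator).
    assert text.count(text[-1]) == 1
    root, log = {}, []
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node:
                node[rest[0]] = [rest, start + 1]
                log.append(f"{start + 1}: new leaf {rest}")
                break
            label, child = node[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge matches: continue from the child node.
                node, rest = child, rest[k:]
            else:
                # Mismatch inside the edge: split it and hang the new leaf.
                node[rest[0]] = [label[:k], {label[k]: [label[k:], child],
                                             rest[k]: [rest[k:], start + 1]}]
                log.append(f"{start + 1}: split {label} after {label[:k]}, "
                           f"new leaf {rest[k:]}")
                break
    return root, log


def spelled_suffixes(node, prefix=""):
    """Yield (root-to-leaf string, leaf number) for every leaf."""
    for label, child in node.values():
        if isinstance(child, dict):
            yield from spelled_suffixes(child, prefix + label)
        else:
            yield prefix + label, child


tree, log = insert_suffixes("ababaabbs")
print("\n".join(log))
```

Every root-to-leaf path spells the suffix whose number is stored in the leaf, which is a convenient sanity check on the result.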
Generalized suffix tree The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings, each followed by its own end marker. For instance, the generalized suffix tree of ababaabb and aabaa is the suffix tree of ababaabbαaabaaβ.
Generalized suffix tree: construction of the suffix tree of ababaabbαaabaaβ Given the suffix tree of ababaabbα, the suffixes of aabaaβ are inserted one at a time:
• 1 (aabaaβ): the edge abbα below a is split after ab; new leaf aaβ,1.
• 2 (abaaβ): the edge abbα below aba is split after a; new leaf β,2.
• 3 (baaβ): the edge abbα below ba is split after a; new leaf β,3.
• 4 (aaβ): the edge ab below a is split after a; new leaf β,4.
• 5 (aβ): new leaf β,5 below a.
• 6 (β): new leaf β,6 below the root.
Resulting generalized suffix tree: root → a → (a → (b → (bα,5 | aaβ,1) | β,4) | b → (a → (a → (bbα,3 | β,2) | baabbα,1) | bα,6) | β,5); root → b → (a → (a → (bbα,4 | β,3) | baabbα,2) | bα,7 | α,8); root → α,9; root → β,6.
Applications of generalized suffix trees 1. The substring problem for a database of strings DB: does the DB contain any occurrence of the patterns abab, aab, and ab?
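A possible way to answer such queries is sketched below in Python. It assumes the database holds the two example strings ababaabb and aabaa, marks their ends with α and β as on the slides, and stores a (string number, position) pair in every leaf. The names add_string, build_generalized_tree and strings_containing are hypothetical, and the naive quadratic insertion is used only to keep the sketch short.

```python
def add_string(root, text, sid):
    """Insert all suffixes of text (which ends in a unique end marker).

    A node is a dict: first character -> [edge label, child]. A leaf is a
    tuple (string id, 1-based start position).
    """
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node:
                node[rest[0]] = [rest, (sid, start + 1)]
                break
            label, child = node[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]
            else:
                node[rest[0]] = [label[:k], {label[k]: [label[k:], child],
                                             rest[k]: [rest[k:], (sid, start + 1)]}]
                break


def build_generalized_tree(strings, markers):
    """One tree for all strings; markers[i] ends strings[i]."""
    # Assumption: markers are distinct and occur in none of the strings.
    assert len(set(markers)) == len(strings) == len(markers)
    assert all(m not in s for m in markers for s in strings)
    root = {}
    for sid, (s, m) in enumerate(zip(strings, markers), start=1):
        add_string(root, s + m, sid)
    return root


def strings_containing(root, pattern):
    """Map string id -> sorted start positions of pattern in that string."""
    node, rest = root, pattern
    while rest:
        if not isinstance(node, dict) or rest[0] not in node:
            return {}
        label, child = node[rest[0]]
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return {}
        node, rest = child, rest[n:]
    hits, stack = {}, [node]
    while stack:                      # collect every leaf below the match
        cur = stack.pop()
        if isinstance(cur, dict):
            stack.extend(child for _, child in cur.values())
        else:
            hits.setdefault(cur[0], []).append(cur[1])
    return {sid: sorted(pos) for sid, pos in hits.items()}


db = build_generalized_tree(["ababaabb", "aabaa"], "αβ")
print(strings_containing(db, "abab"))  # {1: [1]}
print(strings_containing(db, "aab"))   # string 1 at position 5, string 2 at position 1
print(strings_containing(db, "ab"))    # string 1 at 1, 3, 6; string 2 at 2
```

In this example all three patterns occur in the database: abab only in the first string, aab and ab in both.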
Applications of generalized suffix trees 2. The longest common substring of two strings.
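With both strings in one generalized suffix tree, the longest common substring is spelt by the deepest node that has leaves of both strings below it. The Python sketch below illustrates this idea; it uses the naive quadratic construction for brevity, assumes the end markers α and β do not occur in the inputs, and the names add_string and longest_common_substring are hypothetical.

```python
def add_string(root, text, sid):
    """Insert all suffixes of text (ending in a unique marker) into root.

    A node is a dict: first character -> [edge label, child]; a leaf is a
    tuple (string id, 1-based start position).
    """
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node:
                node[rest[0]] = [rest, (sid, start + 1)]
                break
            label, child = node[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]
            else:
                node[rest[0]] = [label[:k], {label[k]: [label[k:], child],
                                             rest[k]: [rest[k:], (sid, start + 1)]}]
                break


def longest_common_substring(s, t, end_s="α", end_t="β"):
    """Return one longest substring shared by s and t ('' if none)."""
    # Assumption: the two end markers are distinct and unused in s and t.
    assert end_s != end_t and end_s not in s + t and end_t not in s + t
    root = {}
    add_string(root, s + end_s, 0)
    add_string(root, t + end_t, 1)
    best = ""

    def visit(node, path):
        """Return the set of string ids below node; update best on the way."""
        nonlocal best
        ids = set()
        for label, child in node.values():
            if isinstance(child, dict):
                ids |= visit(child, path + label)
            else:
                ids.add(child[0])
        if ids == {0, 1} and len(path) > len(best):
            best = path
        return ids

    visit(root, "")
    return best


print(longest_common_substring("ababaabb", "aabaa"))  # abaa
```

For the two example strings the answer is abaa, which starts at position 3 of ababaabb and at position 2 of aabaa. The recursive traversal is fine for short strings; long genomes would need an iterative traversal and a linear-time construction.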
Applications of generalized suffix trees 3. Finding MUMs.
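One way to read MUMs off a generalized suffix tree is sketched below in Python. It assumes the usual reading of a MUM as a match that occurs exactly once in each sequence and cannot be extended to the left or to the right. In the tree, such a match is an internal node with exactly two children, both leaves, one from each string, whose two occurrences are preceded by different symbols. The names add_string and find_mums, the min_len parameter and the naive construction are assumptions made for illustration.

```python
def add_string(root, text, sid):
    """Insert all suffixes of text (ending in a unique marker) into root.

    A node is a dict: first character -> [edge label, child]; a leaf is a
    tuple (string id, 1-based start position).
    """
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node:
                node[rest[0]] = [rest, (sid, start + 1)]
                break
            label, child = node[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]
            else:
                node[rest[0]] = [label[:k], {label[k]: [label[k:], child],
                                             rest[k]: [rest[k:], (sid, start + 1)]}]
                break


def find_mums(s, t, min_len=1, end_s="α", end_t="β"):
    """Return sorted (match, start in s, start in t) triples, 1-based."""
    # Assumption: the two end markers are distinct and unused in s and t.
    assert end_s != end_t and end_s not in s + t and end_t not in s + t
    root = {}
    add_string(root, s + end_s, 0)
    add_string(root, t + end_t, 1)
    mums = []

    def visit(node, path):
        kids = [child for _, child in node.values()]
        for label, child in node.values():
            if isinstance(child, dict):
                visit(child, path + label)
        # Unique and right-maximal: exactly two children, both leaves.
        if len(kids) == 2 and all(isinstance(k, tuple) for k in kids):
            (id_a, i), (id_b, j) = sorted(kids)
            if id_a != id_b and len(path) >= min_len:
                # Left-maximal: at a string start, or different symbols before.
                if i == 1 or j == 1 or s[i - 2] != t[j - 2]:
                    mums.append((path, i, j))

    visit(root, "")
    return sorted(mums)


print(find_mums("ababaabb", "aabaa"))
# [('aab', 5, 1), ('abaa', 3, 2)]
```

For the example strings this reports aab (position 5 in the first string, position 1 in the second) and abaa (positions 3 and 2). The node baa also has one leaf from each string, but both occurrences are preceded by a, so it is contained in abaa and is not reported.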
Linear insertion algorithm and the suffix tree Invariant properties, given the string …: P1: the leaves of the suffixes from … have been inserted. P2: the string … is the longest string that can be spelt through the tree.
Linear insertion algorithm: example Given the string ababaababb..., suffixes 1 to 5 have been inserted: root → a → (ababb...,5 | ba → (ababb...,3 | baababb...,1)); root → ba → (ababb...,4 | baababb...,2). To insert suffix 6, the string is read from position 6: positions 6, 7, 8 and 9 spell abab through the tree, and the symbol at position 10 (b) does not match. The edge baababb...,1 is split after b and the new leaf b...,6 is added.
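The string that can be spelt through the tree in this example can be checked with a brute-force Python sketch. It is not the linear algorithm: it only computes, for suffix i, the longest prefix that is also a prefix of an earlier suffix, which is the string that the tree of suffixes 1 to i-1 can spell. The function name spelt_prefix is hypothetical, and only the visible part of the slide's string, ababaababb, is used.

```python
def spelt_prefix(text, i):
    """Longest prefix of suffix i (1-based) that starts some earlier suffix.

    This is the part of suffix i that can be spelt through the suffix tree
    holding suffixes 1 .. i-1. Brute force, for illustration only.
    """
    best = ""
    for j in range(1, i):
        k = 0
        # j < i, so text[j - 1 + k] is always in range when text[i - 1 + k] is.
        while i - 1 + k < len(text) and text[j - 1 + k] == text[i - 1 + k]:
            k += 1
        if k > len(best):
            best = text[i - 1:i - 1 + k]
    return best


text = "ababaababb"  # visible prefix of the slide's string
for i in range(1, 8):
    print(i, repr(spelt_prefix(text, i)))
# suffix 6 gives 'abab' (positions 6 to 9), suffix 7 gives 'bab'
```

For suffix 6 the result is abab, matching the four positions read on the slide, and for suffix 7 it is bab. The spelt string shrinks by at most one symbol from one suffix to the next, and linear-time constructions rely on this to avoid restarting from the root at every step.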