190 likes | 308 Views
BNFO 602 Multiple sequence alignment. Usman Roshan. Optimal pairwise alignment.
E N D
BNFO 602Multiple sequence alignment Usman Roshan
Optimal pairwise alignment • Sum of pairs (SP) optimization: find the alignment of two sequences that maximizes the similarity score given an arbitrary cost matrix. We can find the optimal alignment in O(mn) time and space using the Needleman-Wunsch algorithm. • Recursion: Traceback: where M(i,j) is the score of the optimal alignment of x1..i and y1..j, s(xi,yj) is a substitution scoring matrix, and g is the gap penalty
Affine gap penalties • Affine gap model allows for long insertions in distant proteins by charging a lower penalty for extension gaps. We define g as the gap open penalty (first gap) and e as the gap extension penalty (additional gaps) • Alignment: • ACACCCT ACACCCC • ACCT TAC CTT • Score = 0 Score = 0.9 • Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
Affine penalty recursion M(i,j) denotes alignments of x1..i and y1..j ending with a match/mismatch. E(i,j) denotes alignments of x1..i and y1..j such that yj is paired with a gap. F(i,j) defined similarly. Recursion takes O(mn) time where m and n are lengths of x and y respectively.
Multiple sequence alignment • “Two sequences whisper, multiple sequences shout out loud”---Arthur Lesk • Computationally very hard---NP-hard
Unaligned sequences GGCTT TAGGCCTT TAGCCCTTA ACACTTC ACTT Aligned sequences _G_ _ GCTT_ TAGGCCTT_ TAGCCCTTA A_ _CACTTC A_ _C_ CTT_ Conserved regions help us to identify functionality Multiple sequence alignment
What is the sum of pairs score of this alignment? Sum of pairs score
Profile • A profile can be described by a set of vectors of nucleotide/residue frequencies. • For each position i of the alignment, we we compute the normalized frequency of nucleotides A, C, G, and T
Aligning a profile vector to a nucleotide • ClustalW/MUSCLE • Let f be the profile vector • Score(f,j)= • where S(i,j) is substitution scoring matrix
Iterative alignment(heuristic for sum-of-pairs) • Pick a random sequence from input set S • Do (n-1) pairwise alignments and align to closest one t in S • Remove t from S and compute profile of alignment • While sequences remaining in S • Do |S| pairwise alignments and align to closest one t • Remove t from S
Iterative alignment • Once alignment is computed randomly divide it into two parts • Compute profile of each sub-alignment and realign the profiles • If sum-of-pairs of the new alignment is better than the previous then keep, otherwise continue with a different division until specified iteration limit
Progressive alignment • Idea: perform profile alignments in the order dictated by a tree • Given a guide-tree do a post-order search and align sequences in that order • Widely used heuristic
Expected accuracy alignment • The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative. • We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.
Posterior probability of xi aligned to yj • Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*. • We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as Do et. al., Genome Research, 2005
Expected accuracy of alignment • We can define the expected accuracy of an alignment a as • The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm Do et. al., Genome Research, 2005
Example for expected accuracy • True alignment • AC_CG • ACCCA • Expected accuracy=(1+1+0+1+1)/4=1 • Estimated alignment • ACC_G • ACCCA • Expected accuracy=(1+1+0.1+0+1) ~ 0.75
Estimating posterior probabilities • If correct posterior probabilities can be computed then we can compute the correct alignment. Now it remains to estimate these probabilities from the data • PROBCONS (Do et. al., Genome Research 2006): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et. al. 1998) • Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices
Benchmarking alignment programs • http://nar.oxfordjournals.org/content/38/15/4917.abstract