Multiple sequence alignment methods
Corné Hoogendoorn, Denis Miretskiy
Overview
• What a multiple alignment means
• Scoring a multiple alignment
• Break
• Multidimensional dynamic programming
• Progressive alignment methods
What a multiple alignment means
• Homologous residues are aligned in columns
• Structurally homologous: similar 3D structural positions
• Evolutionarily homologous: diverging from a common ancestral residue
Multiple alignment - issues
• Unambiguously identifying homologous positions is generally not possible
• A need to identify which alignment is best
• Protein structures and sequences evolve
• Diverged structures are not entirely superposable
Multiple alignment - issues
• In principle there is always an unambiguously correct evolutionary alignment
• Common ancestral sequence
• In practice it is nearly impossible to infer the evolutionary history
• Usually easier to construct a structural alignment
Multiple alignment - issues
• Sequence diverges even faster than structure
• Structurally unalignable protein parts cannot be aligned by sequence either
• Some parts are very well alignable
• Use these parts to align whatever can be aligned
• Disregard the rest when assessing alignment quality
• Supposedly meaningless biases are omitted
Scoring an alignment
• Some positions are more conserved than others
• Position-specific scoring
• Sequences are not independent
• Related to each other by a phylogenetic tree
• Specify a complete probabilistic model of molecular sequence evolution
Complete probabilistic model
• Probabilities of all evolutionary events
• Prior probability of root ancestral sequence
• Probabilities of evolutionary change depend on evolutionary time
• Position-specific structural and functional constraints
• We just don’t have all the necessary data
Workable approximations
• Assume that all columns are statistically independent
$S(m) = G + \sum_i S(m_i)$
• $S(m)$: score for multiple alignment $m$
• $G$: gap score/penalty
• $S(m_i)$: score for column $i$ in the multiple alignment $m$
Scoring an alignment
• Notations
Minimum entropy: further simplification
• We already assumed independence between columns
• Complex statistical dependence between sequences (within columns) if their phylogenetic tree has many intermediate ancestors
• We assume independence between and within columns
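Minimum entropy
• Probability of column $m_i$: $P(m_i) = \prod_a p_{ia}^{c_{ia}}$, where $c_{ia}$ is the number of times residue $a$ occurs in column $i$ and $p_{ia}$ is the probability of residue $a$ in column $i$
• Score of column $m_i$ can be defined as the negative logarithm: $S(m_i) = -\sum_a c_{ia} \log p_{ia}$
• $p_{ia}$: a regularized probability estimate as used in chapter 5
• $S(m_i)$: an entropy measure directly related to the Shannon entropy (chapter 11)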
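A minimal sketch of the minimum entropy column score, written here as an illustration rather than as the authors' implementation. It assumes natural logarithms, ignores gap symbols, and estimates the residue probabilities from the column counts with an optional pseudocount; the function name, the `pseudocount` parameter and the gap handling are all assumptions.

```python
import math
from collections import Counter

def min_entropy_column_score(column, alphabet_size=20, pseudocount=0.0):
    """Return -sum_a c_a * log(p_a) for one alignment column.

    column        -- string with one symbol per sequence
    pseudocount   -- hypothetical regularizer added to every residue count
    Gaps ('-') are ignored; that is an assumption of this sketch.
    """
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts.values()) + pseudocount * alphabet_size
    score = 0.0
    for residue, c in counts.items():
        p = (c + pseudocount) / total   # estimated probability of this residue
        score -= c * math.log(p)        # accumulate -c_a * log(p_a)
    return score

conserved = min_entropy_column_score("LLLL")   # fully conserved column
mixed = min_entropy_column_score("LLGG")       # two residues, equally frequent
```

With plain frequency estimates a fully conserved column scores exactly 0 and more variable columns score higher, so a good alignment minimizes the total. With a positive pseudocount the conserved column no longer reaches 0.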
Example (1)
Example (2)
Example (3)
• Will this ever be 0 in reality? Why (not)?
Example (4)
Minimum entropy
• Very near to the HMM formulation
• Choose the sequences carefully
• Usually the sample of sequences is biased
• Weighting schemes as discussed in chapter 5 are necessary
• This partially compensates for the defects of the assumption of sequence independence
Sum of pairs
• Also assumes statistical independence between columns
• Uses substitution matrices: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$, where $m_i^k$ is the symbol of sequence $k$ in column $i$
• Scores $s(a,b)$ come from substitution matrices like PAM or BLOSUM
• For simple linear gap costs, $s(a,-)$, $s(-,a)$ and $s(-,-)$ are defined, with $s(-,-) = 0$
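The sum-of-pairs score of a single column can be sketched as follows. This is only an illustration: the dictionary `toy_scores` holds three hand-entered values rather than a full substitution matrix, and the gap score of -8 is a hypothetical choice, not a value from the slides.

```python
from itertools import combinations

def sp_column_score(column, s, gap_score=-8):
    """Sum-of-pairs score for one column.

    column    -- string with one symbol per sequence ('-' for a gap)
    s         -- dict mapping frozenset({a, b}) to a substitution score
    gap_score -- hypothetical score for a residue paired with a gap
    """
    total = 0
    for a, b in combinations(column, 2):   # every unordered pair of rows
        if a == "-" and b == "-":
            continue                       # s(-,-) = 0
        if a == "-" or b == "-":
            total += gap_score             # s(a,-) = s(-,a)
        else:
            total += s[frozenset((a, b))]
    return total

# Toy scores for illustration only (L-L = 5, G-L = -4, G-G = 8).
toy_scores = {
    frozenset(("L",)): 5,
    frozenset(("G", "L")): -4,
    frozenset(("G",)): 8,
}

all_leucine = sp_column_score("LLL", toy_scores)
one_glycine = sp_column_score("LLG", toy_scores)
```

Keying the dictionary by `frozenset` makes the lookup symmetric, so `s(a,b)` and `s(b,a)` share one entry.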
Sum of pairs
• Substitution scores are usually log-odds scores for pairwise comparisons
• For three sequences the SP score is $\log(p_{ab}/q_a q_b) + \log(p_{bc}/q_b q_c) + \log(p_{ac}/q_a q_c)$
• A proper three-way log-odds score would be $\log(p_{abc}/q_a q_b q_c)$
• Each sequence is scored as if it descended from the N-1 other sequences instead of from a single ancestor
• Evolutionary events are over-counted
Problem with SP scores
• Consider an alignment of N sequences
• All have leucine (L) at position i
• Number of symbol pairs in the column: $N(N-1)/2$
• Score for an L-L alignment according to the BLOSUM50 matrix: 5
• Column score: $5 \cdot N(N-1)/2$
Problem with SP scores
• What if one sequence has glycine (G) at i?
• G-L pair scores -4, difference with L-L is 9
• N-1 pairs are affected, so the column score drops by $9(N-1)$
• The score is worse than the all-leucine column by a fraction $\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$
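A quick numerical check of this fraction, as a sketch. The function name is an assumption; the default scores 5 and -4 are the L-L and G-L values quoted on the slide.

```python
def sp_relative_drop(n, match=5, mismatch=-4):
    """Relative SP score loss when one of n identical residues is replaced."""
    pairs = n * (n - 1) // 2                    # symbol pairs in the column
    conserved = match * pairs                   # all-leucine column score
    # Replacing one residue changes n - 1 pairs from match to mismatch.
    one_outlier = conserved - (match - mismatch) * (n - 1)
    return (conserved - one_outlier) / conserved

for n in (4, 10, 100):
    print(n, sp_relative_drop(n))
```

The relative drop shrinks as N grows, although more leucines arguably make the lone glycine look more, not less, out of place. That mismatch is the problem with SP scores that this slide points at.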
What a multiple alignment means
Scoring a multiple alignment
Questions?
Break
Multidimensional dynamic programming
• We assume that columns of an alignment are statistically independent
• Gaps are scored with a linear gap cost
• Now we can calculate the overall score $S(m) = \sum_i S(m_i)$, where $S(m_i)$ is a score for column $i$
Calculating the overall score
• Define $\alpha_{i_1, i_2, \ldots, i_N}$ as the maximum score of an alignment up to the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$
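Simple notation
• Introduce $\Delta_i$, which is 0 or 1, and define the “product” $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$
• Now the recursion can be written as follows, with $\alpha$ the maximum score defined above:
$\alpha_{i_1, \ldots, i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \{ \alpha_{i_1 - \Delta_1, \ldots, i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \}$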
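The recursion over all non-zero choices of the 0/1 indicators can be sketched in a few lines. This is a toy illustration under stated assumptions: columns are scored by sum of pairs with a linear gap score, `pair_score` is a caller-supplied function, and the default gap score of -2 is hypothetical. It returns only the optimal score, with no traceback.

```python
from itertools import combinations, product

def msa_dp_score(seqs, pair_score, gap=-2):
    """Optimal global multiple-alignment score by full N-dimensional DP."""
    n = len(seqs)
    lengths = [len(s) for s in seqs]
    # All 2^N - 1 indicator vectors with at least one sequence advancing.
    deltas = [d for d in product((0, 1), repeat=n) if any(d)]

    def column_score(col):
        total = 0
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue                     # s(-,-) = 0
            total += gap if "-" in (a, b) else pair_score(a, b)
        return total

    alpha = {tuple([0] * n): 0}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is reached.
    for cell in product(*(range(length + 1) for length in lengths)):
        if not any(cell):
            continue                         # origin is already initialized
        best = None
        for d in deltas:
            prev = tuple(i - di for i, di in zip(cell, d))
            if min(prev) < 0:
                continue                     # would step off the lattice
            col = [seqs[k][cell[k] - 1] if d[k] else "-" for k in range(n)]
            cand = alpha[prev] + column_score(col)
            if best is None or cand > best:
                best = cand
        alpha[cell] = best
    return alpha[tuple(lengths)]

def identity(a, b):
    return 1 if a == b else -1               # toy substitution score

score = msa_dp_score(["AC", "AC", "A"], identity)
```

Storing the lattice in a dictionary keeps the sketch independent of N; a real implementation would use a dense array and keep traceback pointers.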
Complexity of algorithm
• The algorithm requires the computation of the whole dynamic programming matrix with $L_1 \cdot L_2 \cdots L_N$ entries
• We have to view $2^N - 1$ combinations of gaps in a column
• Assume all sequences have roughly the same length $L$
• Memory complexity of the algorithm is $O(L^N)$
• Time complexity is $O(2^N L^N)$
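To get a feel for these growth rates, the counts can be tabulated directly. The helper below is only an illustration; its name is an assumption and it ignores boundary cells and constant factors.

```python
def msa_dp_cost(n_seqs, length):
    """Return (matrix entries, entries times gap combinations) for N sequences."""
    cells = length ** n_seqs        # entries of the N-dimensional matrix
    moves = 2 ** n_seqs - 1         # gap combinations examined per entry
    return cells, cells * moves

for n in (2, 3, 5, 10):
    print(n, msa_dp_cost(n, 100))
```

For sequences of length 100, three sequences already need a million entries, and each extra sequence multiplies that by 100.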
MSA
• Let $a^{kl}$ denote the pairwise alignment between sequences $k$ and $l$ induced by a multiple alignment $a$
• The score of the complete alignment is given by $S(a) = \sum_{k<l} S(a^{kl})$
• Let $\hat{a}^{kl}$ be the optimal pairwise alignment of $k$, $l$
• Obviously $S(a^{kl}) \le S(\hat{a}^{kl})$
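Lower bound
• Assume that we have a lower bound $\sigma(a)$ on the score of the optimal multiple alignment, so $\sigma(a) \le S(a) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$
• In other words $S(a^{kl}) \ge \beta^{kl}$
• Where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$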
Lower bound
• Now we need to look only at pairwise alignments of $k$ and $l$ that score better than $\beta^{kl}$
• We need to obtain $\sigma(a)$, and this can be done by using a progressive alignment algorithm
Restricted algorithm
• For each pair $k$, $l$ we can find the complete set $B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores more than $\beta^{kl}$
• Now we only have to look at cells $(i_1, i_2, \ldots, i_N)$ which meet the following condition: $(i_k, i_l)$ is in $B^{kl}$ for all $k$, $l$
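One way to find such a set for a single pair is to combine a forward and a backward pairwise dynamic programming matrix. The sketch below is an illustration, not the MSA program's code. It assumes a linear gap score (default -2, hypothetical) and a caller-supplied `score` function, and it keeps cells whose best score is at least the bound, which is the conservative reading of the inequality. All function names are assumptions.

```python
def forward(x, y, score, gap):
    """F[i][j] = best global score for aligning x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = i * gap
    for j in range(1, len(y) + 1):
        F[0][j] = j * gap
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F

def pair_bound(sigma, best_pair_scores, k, l):
    """Bound for pair (k, l): sigma + S(best kl) - sum of all best pair scores."""
    return sigma + best_pair_scores[(k, l)] - sum(best_pair_scores.values())

def cells_above_bound(x, y, beta, score, gap=-2):
    """Cells (i, j) through which some alignment of x and y scores >= beta."""
    n, m = len(x), len(y)
    F = forward(x, y, score, gap)
    # Backward scores come from forward DP on the reversed sequences:
    # R[n - i][m - j] is the best score for aligning x[i:] with y[j:].
    R = forward(x[::-1], y[::-1], score, gap)
    return {(i, j)
            for i in range(n + 1)
            for j in range(m + 1)
            if F[i][j] + R[n - i][m - j] >= beta}

def identity(a, b):
    return 1 if a == b else -1       # toy substitution score

tight = cells_above_bound("AC", "AC", beta=2, score=identity)
loose = cells_above_bound("AC", "AC", beta=-3, score=identity)
```

With a tight bound only the cells on the optimal path survive; a weaker bound admits more of the lattice. The multidimensional recursion then visits only cells whose projections fall inside the surviving set for every pair.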
Progressive alignment methods
• The algorithms differ in several ways
• Choice of order to do the alignment
• Whether the progression involves only alignment of sequences to a single growing alignment or whether subfamilies are built up on a tree structure
Feng-Doolittle progressive multiple alignment
• Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment
• Construct a guide tree from the distance matrix using the Fitch & Margoliash clustering algorithm
• Starting from the first node added to the tree, align the child nodes. Repeat until all sequences have been aligned.
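Converting scores to distances
$D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}$
Where
• $S_{\mathrm{max}}$ is the maximum score
• $S_{\mathrm{obs}}$ is the observed pairwise alignment score
• $S_{\mathrm{rand}}$ is the expected score for aligning two random sequences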
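The conversion from scores to distances is a one-liner; the sketch below adds only a guard for scores at or below the random expectation, where the logarithm is undefined. The function name and the error handling are assumptions.

```python
import math

def score_to_distance(s_obs, s_max, s_rand):
    """Distance D = -log((s_obs - s_rand) / (s_max - s_rand))."""
    if s_max <= s_rand or s_obs <= s_rand:
        raise ValueError("observed and maximum scores must exceed s_rand")
    s_eff = (s_obs - s_rand) / (s_max - s_rand)   # normalized score
    return -math.log(s_eff)

identical = score_to_distance(100, 100, 0)   # sequences that score the maximum
halfway = score_to_distance(50, 100, 0)      # halfway between random and maximum
```

Sequences that reach the maximum score get distance 0, and the distance grows without bound as the observed score approaches the random expectation.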
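Profile alignment
• Linear gap scores can be included in the SP score: $s(-,a) = s(a,-) = -g$ and $s(-,-) = 0$
• Global alignment score for aligning a profile of sequences $1, \ldots, n$ to a profile of sequences $n+1, \ldots, N$:
$\sum_i S(m_i) = \sum_i \sum_{k<l \le N} s(m_i^k, m_i^l) = \sum_i \sum_{k<l \le n} s(m_i^k, m_i^l) + \sum_i \sum_{n<k<l \le N} s(m_i^k, m_i^l) + \sum_i \sum_{k \le n,\, n<l \le N} s(m_i^k, m_i^l)$
• The first two sums depend only on the two existing profiles, so only the cross term has to be optimized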
CLUSTALW progressive alignment
• Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment
• Construct a guide tree by a neighbor-joining clustering algorithm (Saitou & Nei)
• Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile and profile-profile alignment
CLUSTALW properties
• Sequences are weighted to compensate for biased representation
• The substitution matrix used to score an alignment is chosen based on the expected similarity of the sequences
• Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position
CLUSTALW properties
• Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues
• Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment
• In the progressive alignment stage, if the score of an alignment is low, that alignment can be deferred until more profile information has been accumulated
Iterative refinement methods: Barton-Sternberg multiple alignment
• Find two sequences with the highest pairwise similarity and align them using standard pairwise dynamic programming alignment
• Find the sequence that is most similar to a profile of the alignment of the first two and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.
Iterative refinement methods: Barton-Sternberg multiple alignment
• Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \ldots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \ldots, x^N$.
• Repeat the previous realignment step a fixed number of times or until the alignment score converges