Multiple Sequence Alignment Dynamic Programming
Multiple Sequence Alignment
VTISCTGSSSNIGAGNHVKWYQQLPG
VTISCTGTSSNIGSITVNWYQQLPG
LRLSCSSSGFIFSSYAMYWVRQAPG
LSLTCTVSGTSFDDYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGAVTVAWKADS
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
VSLTCLVKGFYPSDIAVEWESNG-
• Goal: Bring the greatest number of similar characters into the same column of the alignment
• Similar to alignment of two sequences
CLUSTALW MSA
[Figure: MSA of four oxidoreductase NAD binding domain protein sequences. Red: AVFPMILW. Blue: DE. Magenta: RHK. Green: STYHCNGQ. Grey: all others. Residue ranges are shown after sequence names.]
Chenna et al., Nucleic Acids Research, 2003, Vol. 31, No. 13, 3497-3500
Multiple Sequence Alignment: Motivation
• Correspondence: find out which parts “do the same thing”
  • Similar genes are conserved across widely divergent species, often performing similar functions
• Structure prediction
  • Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  • Structure is more conserved than sequence
• Create “profiles” for protein families
  • Allow us to search for other members of the family
• Genome assembly: automated reconstruction of “contig” maps of genomic fragments such as ESTs
• MSA is the starting point for phylogenetic analysis
Multiple Sequence Alignment: Approaches
• Optimal global alignments: dynamic programming
  • Generalization of Needleman-Wunsch
  • Find the alignment that maximizes a score function
  • Computationally expensive: time grows as the product of the sequence lengths
• Global progressive alignments: match closely related sequences first, using a guide tree
• Global iterative alignments: multiple re-building attempts to find the best alignment
• Local alignments
  • Profiles, Blocks, Patterns
Scoring a multiple alignment
[Figure: one alignment column (A, A, A, C, C) scored under three schemes: Sum of pairs, Star, Tree]
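Sum of pairs for five sequences and three columns (α is scored for each matching pair, β is subtracted for each mismatching pair):
A A A
A A A
A A A
A A C
A C C
Column 1 (A A A A A): 10 matching pairs = 10α
Column 2 (A A A A C): 6 matching and 4 mismatching pairs = 6α - 4β
Column 3 (A A A C C): 4 matching and 6 mismatching pairs = 4α - 6β
Sum of Pairs = 10α + (6α - 4β) + (4α - 6β) = 20α - 10β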
Sum-of-Pairs Scoring Function
Score of multiple alignment = ∑_{i<j} score(S_i, S_j), where score(S_i, S_j) is the score of the induced pairwise alignment of S_i and S_j
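The sum-of-pairs definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sum_of_pairs` and the parameters `alpha` (reward per matching pair) and `beta` (penalty per mismatching pair) are assumptions made here, and gap characters are not treated specially.

```python
from itertools import combinations

def sum_of_pairs(rows, alpha=1, beta=1):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    alpha: reward for each matching pair (assumed parameter name).
    beta:  penalty for each mismatching pair (assumed parameter name).
    Gaps are treated like any other character in this sketch.
    """
    total = 0
    for column in zip(*rows):                 # walk the alignment column by column
        for a, b in combinations(column, 2):  # every unordered pair of rows
            total += alpha if a == b else -beta
    return total

# Five sequences, three columns: 20 matching pairs and 10 mismatching pairs.
rows = ["AAA", "AAA", "AAA", "AAC", "ACC"]
print(sum_of_pairs(rows, alpha=1, beta=1))
```

Summing column by column gives the same total as summing the scores of all induced pairwise alignments, as long as the score is additive over columns.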
Induced Pairwise Alignment
S1  S - T I S C T G - S - N I
S2  L - T I - C N G S S - N I
S3  L R T I S C S G F S Q N I
Induced pairwise alignment of S1, S2 (columns in which both rows have a gap are removed):
S1  S T I S C T G - S N I
S2  L T I - C N G S S N I
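Inducing a pairwise alignment amounts to taking two rows of the multiple alignment and dropping every column in which both rows hold a gap. A small Python sketch, using a hypothetical helper name `induced_pairwise` and the rows S1 and S2 from the slide written without spaces:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns where both rows contain the gap character are removed;
    all other columns are kept unchanged.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

s1 = "S-TISCTG-S-NI"
s2 = "L-TI-CNGSS-NI"
print(induced_pairwise(s1, s2))  # ('STISCTG-SNI', 'LTI-CNGSSNI')
```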
MSA: Dynamic Programming
• The two-sequence alignment algorithm can be generalized to any number of sequences.
• E.g., for three sequences X, Y, W define C[i,j,k] = score of the optimum alignment among X[1..i], Y[1..j], W[1..k]
• As for two sequences, divide the possible alignments into different classes, depending on how they end.
• Use these classes to devise recurrence relations for C[i,j,k]
• C[i,j,k] is the maximum over all possibilities
MSA: 7 ways an alignment can end for 3 sequences
An alignment of X1 . . . Xi, Y1 . . . Yj, W1 . . . Wk ends in one of these seven columns:
X:  Xi  -   Xi  Xi  Xi  -   -
Y:  Yj  Yj  -   -   Yj  Yj  -
W:  Wk  Wk  Wk  -   -   -   Wk
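Dynamic programming for three sequences
Each alignment is a path through the dynamic programming matrix. Example alignment of VSNS, SNA and AS:
V S N - S
- S N A -
- - - A S
[Figure: three-dimensional dynamic programming matrix with one sequence along each axis; the alignment is drawn as a path leading away from the corner marked Start]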
Dynamic Programming for Three Sequences
• For 3 sequences of length n, time is proportional to n^3
• There are 7 ways to get to C[i,j,k]: from C[i-1,j-1,k-1], C[i-1,j-1,k], C[i-1,j,k-1], C[i,j-1,k-1], C[i-1,j,k], C[i,j-1,k] or C[i,j,k-1]
• Enumerate all possibilities and choose the best one
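A compact Python sketch of the three-sequence recurrence follows. It is one possible instantiation, not the method of any particular tool: it assumes sum-of-pairs column scoring with illustrative values `match=1`, `mismatch=-1`, `gap=-1`, scores a gap-gap pair as 0, and returns only the optimal score (no traceback). The names `MOVES` and `align3_score` are hypothetical.

```python
from itertools import product

# The 7 ways a column can end: each sequence contributes a residue (1) or a gap (0),
# excluding the all-gap column.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def align3_score(x, y, w, match=1, mismatch=-1, gap=-1):
    """Optimal sum-of-pairs score for aligning three sequences (score only)."""
    def pair(a, b):
        if a == "-" and b == "-":
            return 0                      # assumption: gap-gap pairs cost nothing
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    def column(a, b, c):
        return pair(a, b) + pair(a, c) + pair(b, c)

    neg_inf = float("-inf")
    C = [[[neg_inf] * (len(w) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    C[0][0][0] = 0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(w) + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in MOVES:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue          # predecessor lies outside the table
                    a = x[i - 1] if di else "-"
                    b = y[j - 1] if dj else "-"
                    c = w[k - 1] if dk else "-"
                    best = max(best, C[pi][pj][pk] + column(a, b, c))
                C[i][j][k] = best
    return C[len(x)][len(y)][len(w)]

print(align3_score("VSNS", "SNA", "AS"))
```

The three nested loops visit every cell of the table once and each cell inspects at most 7 predecessors, which is where the cubic running time comes from.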
Dynamic Programming MSA: General Case
• For k sequences of length n, the dynamic programming algorithm does (2^k - 1)·n^k operations
• Example: 6 sequences of length 100 require about 6.3 × 10^13 calculations
• Space for the table is n^k
• Implementations (e.g., WashU MSA 2.1) use tricks and search only a subset of the dynamic programming table
• Even this is expensive. E.g., the Baylor College of Medicine (BCM) Search Launcher limits MSA to 8 sequences of 800 characters and 10 minutes of processing time
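The cost formulas are easy to evaluate directly. The short Python sketch below (the helper name `msa_dp_cost` is hypothetical) computes the operation count (2^k - 1)·n^k and the table size n^k; for k = 6 and n = 100 it yields 63 × 10^12 operations and 10^12 table cells.

```python
def msa_dp_cost(k, n):
    """Return (operations, table_cells) for exact DP on k sequences of length n."""
    cells = n ** k                    # one table entry per combination of prefixes
    operations = (2 ** k - 1) * cells  # each entry inspects 2^k - 1 predecessors
    return operations, cells

ops, cells = msa_dp_cost(6, 100)
print(f"{ops:.2e} operations, {cells:.0e} table cells")
```

Adding a single sequence of the same length multiplies the work by roughly 2n, which is why exact dynamic programming is limited to a handful of sequences.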
Problems with SP scoring
• Pair-wise comparisons can over-score evolutionarily distant pairs.
• Reason: for 3 or more sequences, SP scoring does not correspond to any evolutionary tree.
Overcoming problems with SP scoring
• Use weights to incorporate evolution in sum-of-pairs scoring:
  • Some pair-wise alignments are more important than others
  • E.g., it is more important to have a good alignment between mouse and human sequences than between mouse and bird
  • Assign different weights to different pair-wise alignments
  • Weight decreases with evolutionary distance
• Use the star tree approach
  • One sequence is designated as the ancestor and all others are compared against it
Star Alignments
• Construct multiple alignments using pair-wise alignment relative to a fixed sequence
• Out of a set S = {S1, S2, . . . , Sr} of sequences, pick the sequence Sc that maximizes
  star_score(c) = ∑ {sim(Sc, Si) : 1 ≤ i ≤ r, i ≠ c}
  where sim(Si, Sj) is the optimal score of a pair-wise alignment between Si and Sj
Algorithm
1. Compute sim(Si, Sj) for every pair (i,j)
2. Compute star_score(i) for every i
3. Choose the index c that maximizes star_score(c) and make it the center of the star
4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of Sc and Si is the same as the optimum alignment of Sc and Si
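The first three steps (pairwise scores, star scores, choice of center) can be sketched as follows. The sketch assumes that sim is a Needleman-Wunsch global alignment score with a linear gap penalty and illustrative values `match=1`, `mismatch=-1`, `gap=-1`; the slides do not fix a scoring scheme, and the function names are hypothetical.

```python
def sim(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]      # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def star_center(seqs):
    """Return (index of the center sequence, list of star scores)."""
    scores = [sum(sim(s, t) for j, t in enumerate(seqs) if j != i)
              for i, s in enumerate(seqs)]
    center = max(range(len(seqs)), key=scores.__getitem__)
    return center, scores

print(star_center(["AAAA", "AAAT", "TTTT"]))  # (1, [-2, 0, -6])
```

Because sim is a similarity, the center is the sequence with the largest total; with a distance measure the choice would be the minimum instead.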
Step 4: Detail
Optimal pairwise alignments with the center Sc:
Sc  A-ACC-TT
S2  AGACCGT-

Sc  AA--CCTT
S1  AATGCC--

Resulting multiple alignment:
Sc  A-A--CC-TT
S1  A-ATGCC---
S2  AGA--CCGT-
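One common way to carry out this step is to add the pairwise alignments one at a time and never remove a gap once it has been placed in the center row. The Python sketch below follows that idea; the function name `merge_star` and its input format (a list of `(center_aligned, other_aligned)` string pairs, all containing the same ungapped center sequence) are assumptions made for illustration.

```python
def merge_star(pairwise, gap="-"):
    """Merge pairwise alignments against a common center into one alignment.

    pairwise: list of (center_aligned, other_aligned) strings.
    Returns the rows [center, other_1, other_2, ...] as strings.
    """
    center, first = pairwise[0]
    rows = [list(center), list(first)]           # rows[0] is always the center
    for c_aln, o_aln in pairwise[1:]:
        merged = [[] for _ in rows]
        new_row = []
        i = j = 0                                # i: merged center, j: new center
        while i < len(rows[0]) or j < len(c_aln):
            if i < len(rows[0]) and rows[0][i] == gap:
                # Gap already present in the center: keep the column, pad the new row.
                for r, m in zip(rows, merged):
                    m.append(r[i])
                new_row.append(gap)
                i += 1
            elif j < len(c_aln) and c_aln[j] == gap:
                # New gap in the center: open a gap column in every existing row.
                for m in merged:
                    m.append(gap)
                new_row.append(o_aln[j])
                j += 1
            else:
                # Same center residue on both sides: copy the column as is.
                for r, m in zip(rows, merged):
                    m.append(r[i])
                new_row.append(o_aln[j])
                i += 1
                j += 1
        rows = merged + [new_row]
    return ["".join(r) for r in rows]

pairs = [("AA--CCTT", "AATGCC--"),   # Sc aligned with S1
         ("A-ACC-TT", "AGACCGT-")]   # Sc aligned with S2
for name, row in zip(("Sc", "S1", "S2"), merge_star(pairs)):
    print(name, row)
```

For the two pairwise alignments on the slide this reproduces the rows A-A--CC-TT, A-ATGCC--- and AGA--CCGT-, and removing gap-gap columns from Sc and any Si gives back the original pairwise alignment.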