
Multiple sequence alignment Arthur W. Chou Fall, 2005



  1. Multiple sequence alignment Arthur W. Chou Fall, 2005

  2. Multiple sequence alignment: definition Given: • Set of sequences • Similarity score matrix • Gap penalties Find: Alignment of sequences such that optimal score is achieved. Result: a collection of three or more protein or nucleic acid sequences that are partially or completely aligned, such that homologous residues are aligned in columns across the length of the sequences.

  3. Why do we care about protein MA? 1. Useful way to summarize the sequences of related proteins. What do globin sequences look like?
     4mbn .  ----------VLSEGEWQLVLHVWAKVE--ADVAGH
     1myt .  --------------ADFDAVLKCWGPVE--ADYTTM
     2hhb A  ----------VLSPADKTNVKAAWGKVG--AHAGEY
     2mhb A  ----------VLSAADKTNVKAAWSKVG--GHAGEY
     1pbx A  ----------SLSDKDKAAVRALWSKIG--KSADAI
     2hhb B  ---------VHLTPEEKSAVTALWGKV----NVDEV
     2mhb B  ---------VQLSGEEKAAVLALWDKV----NEEEV
     2lhb .  -PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETS
     1mba .  ----------SLSAAEADLAGKSWAPVFA--NKNAN
     1sdh A  --PSVYDAAAQLTADVKKDLRDSWKVIGS--DKKGN
     1lh1 .  ---------GALTESQAALVKSSWEEFN--ANIPKH
     1hlb .  GGTLAIQAQGDLTLAQKKIVRKTWHQLMRN--KTSF
     1ith A  ----------GLTAAQIKAIQDHWFLNI-KGCLQAA
     1ecd .  -----------LSADQISTVQASFDKVK------GD
     2hbg .  ----------GLSAAQRQVIAATWKDIAGADNGAGV

  4. Why do we care about protein MA? 2. Useful way to find important functional amino acids by assessing conservation over many sequences. What is conserved?
     DRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKG
     PKFAGI-AQADIAGNAAISAHGATVLKKLGELLKAKG
     PHF-DLSH-----GSAQVKGHGKKVADALTNAVAHVD
     PHF-DLSH-----GSAQVKAHGKKVGDALTLAVGHLD
     SHWPDVTP-----GSPHIKAHGKKVMGGIALAVSKID
     ESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLD
     DSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLD
     PKFKGLTTADELKKSADVRWHAERIINAVDDAVASMD
     ADFKGKSVAD-IKASPKLRDVSSRIFTRLNEFVNNAA
     KRLGNVS---QGMANDKLRGHSITLMYALQNFIDQLD
     SFLKGT--SEVPQNNPELQAHAGKVFKLVYEAAIQLE
     PQMAGM-SASQLRSSRQMQAHAIRVSSIMSEYVEELD
     HKFS-SVPLYGLRSNPAYKAQTLTVINYLDKVVDALG
     TQFAG-KDLESIKGTAPFETHANRIVGFFSKIIGELP
     GFSGA--------SDPGVAALGAKVLAQIGVAVSHLG

  5. Why do we care about protein MA? 3. Establish evolutionary relationships between sequences. What was the sequence of events leading to the current species?
     4mbn .  EAIIHVLHSRHPGDFGADAQGAMNKA
     1myt .  EVLVKVMHEKAGLD--AGGQTALRNV
     2hhb A  HCLLVTLAAHLPAEFTPAVHASLDKF
     2mhb A  HCLLSTLAVHLPNDFTPAVHASLDKF
     1pbx A  HCILVVISTMFPKEFTPEAHVSLDKF
     2hhb B  NVLVCVLAHHFGKEFTPPVQAAYQKV
     2mhb B  NVLVVVLARHFGKDFTPELQASYQKV
     2lhb .  AVIADTVAAG---------DAGFEKL
     1mba .  SMFPGFVASVAA--PPAGADAAWTKL
     1sdh A  GPIKKVLASK---NFGDKYANAWAKL
     1lh1 .  EAILKTIKEVVGAKWSEELNSAWTIA
     1hlb .  MEALQAELGSD---FNEKTRDAWAKA
     1ith A  KLVGGVFQEE--FSADPTTVAAWGDA
     1ecd .  AGFVSYMKAHT--DF-AGAEAAWGAT
     2hbg .  ASLLSAMEHRIGGKMNAAAKDAWAAA

  6. Why do we care about protein MA? 4. More precisely understand how to model 3D structures. What other amino acids are acceptable in this structure?
     4mbn .  EAIIHVLHSRHPGDFGADAQGAMNKA
     1myt .  EVLVKVMHEKAGLD--AGGQTALRNV
     2hhb A  HCLLVTLAAHLPAEFTPAVHASLDKF
     2mhb A  HCLLSTLAVHLPNDFTPAVHASLDKF
     1pbx A  HCILVVISTMFPKEFTPEAHVSLDKF
     2hhb B  NVLVCVLAHHFGKEFTPPVQAAYQKV
     2mhb B  NVLVVVLARHFGKDFTPELQASYQKV
     2lhb .  AVIADTVAAG---------DAGFEKL
     1mba .  SMFPGFVASVAA--PPAGADAAWTKL
     1sdh A  GPIKKVLASK---NFGDKYANAWAKL
     1lh1 .  EAILKTIKEVVGAKWSEELNSAWTIA
     1hlb .  MEALQAELGSD---FNEKTRDAWAKA
     1ith A  KLVGGVFQEE--FSADPTTVAAWGDA
     1ecd .  AGFVSYMKAHT--DF-AGAEAAWGAT
     2hbg .  ASLLSAMEHRIGGKMNAAAKDAWAAA

  7. What is the protein MA Gold Standard? Structural alignment. If sequences can be aligned, the alignment should reflect structural similarities. Thus, the alignment should lead to a “match” of common structural and functional elements.

  8. Aligning non-coding DNA sequences • Conserved signals in DNA for control of expression • Can infer evolutionary relationships • Can identify important functional regions • A much harder problem!

  9. Methods for Multiple Alignment 1. Exhaustive search: extension of DP to multiple dimensions. E.g. MSA algorithm 2. Progressive alignment: compute tree of sequences, based on hierarchical clustering, and then merge closest first, greedily. E.g. ClustalW 3. Anchor on locally conserved blocks: find highly conserved regions and then grow alignment around these regions. E.g. BLAST 4. Iterative search: based on genetic algorithm search 5. Probabilistic/statistical: E.g. Gibbs Sampling, HMM

  10. How to score a Multiple Alignment? Sum of Pairs (SP): for each aligned column of the sequences, compute the score of every pair of characters in the column and sum them up. For example: SP-score(I, -, I, V) = s(I, -) + s(I, I) + s(I, V) + s(-, I) + s(-, V) + s(I, V). Note that s(-, -) = 0. Gap penalty: can be constant or linear; the MSA algorithm uses constant.
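The sum-of-pairs score on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular program: the function names and the toy values `match=1`, `mismatch=-1`, `gap=-2` are assumptions standing in for a real similarity matrix and gap penalty.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy s(a, b); the numeric values are illustrative assumptions."""
    if a == "-" and b == "-":
        return 0                      # s(-, -) = 0, as stated on the slide
    if a == "-" or b == "-":
        return gap                    # residue against a gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum s(a, b) over every unordered pair of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_alignment(rows):
    """SP score of a whole alignment: sum of the column scores."""
    return sum(sp_column(col) for col in zip(*rows))

print(sp_column(("I", "-", "I", "V")))
print(sp_alignment(["AGT", "A-T", "-GT"]))
```

A column of N characters contributes N(N-1)/2 pair terms, which is why the four-character example on the slide has six terms.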

  11. Multidimensional Dynamic Programming Why not just use the same technique as for pairwise alignment? Instead of a 2-dimensional matrix, use an N-dimensional one, where N = the number of sequences. Complexity grows exponentially with the number of sequences (for sequences of length L the matrix has on the order of L^N cells), so only about N < 10 and lengths ~ 200 can be accommodated.

  12. Dynamic Programming with scores and penalties • S[i, j, k] is the best score from the i-th position in A, the j-th position in B, and the k-th position in C onward:
\[
S[i,j,k] = \max
\begin{cases}
\text{SP-score}(A[i], B[j], C[k]) + S[i+1,\, j+1,\, k+1] \\
\max_{x} \{\, S[i+x,\, j,\, k] - w(x) \,\} \\
\max_{y} \{\, S[i,\, j+y,\, k] - w(y) \,\} \\
\max_{z} \{\, S[i,\, j,\, k+z] - w(z) \,\} \\
\max_{x,y} \{\, S[i+x,\, j+y,\, k] - w(x) - w(y) \,\} \\
\quad \vdots
\end{cases}
\]
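For three sequences, the idea behind this recurrence can be sketched as below. Note the simplifying assumptions, which are not from the slides: the sketch fills the matrix over prefixes rather than suffixes, scores each column with a toy sum-of-pairs function, and charges a linear per-character gap cost instead of a general w(x). The names `sp_column` and `align3_score` are hypothetical.

```python
from itertools import combinations, product

def sp_column(col, match=1, mismatch=-1, gap=-2):
    """Toy sum-of-pairs score of one column (values are assumptions)."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue                          # s(-, -) = 0
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def align3_score(A, B, C):
    """Best SP score of a global alignment of three sequences."""
    # Each move consumes a residue (1) or places a gap (0) in each sequence;
    # the all-gap move (0, 0, 0) is excluded, leaving 2^3 - 1 = 7 moves.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    neg_inf = float("-inf")
    S = [[[neg_inf] * (len(C) + 1) for _ in range(len(B) + 1)]
         for _ in range(len(A) + 1)]
    S[0][0][0] = 0
    for i in range(len(A) + 1):
        for j in range(len(B) + 1):
            for k in range(len(C) + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (A[pi] if di else "-",
                           B[pj] if dj else "-",
                           C[pk] if dk else "-")
                    cand = S[pi][pj][pk] + sp_column(col)
                    if cand > S[i][j][k]:
                        S[i][j][k] = cand
    return S[len(A)][len(B)][len(C)]

print(align3_score("AGT", "AT", "GT"))
```

The three nested loops visit every cell of the cube and each cell looks at 7 predecessors; with N sequences this becomes L^N cells and 2^N - 1 predecessors per cell, which is the exponential growth that limits the exhaustive approach.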

  13. MSA Algorithm Based on the dynamic programming concept, using some bounds: 1. Compute optimal pairwise alignments to get an upper bound on the score of each pair of sequences: MSA can’t do any better than the sum of the optimal pairwise alignment scores. 2. Create a heuristic multiple alignment in ad hoc fashion (using a guide tree) to obtain a lower bound on the MA score. 3. Search the N-dimensional scoring matrix for the best score including the i-th element of sequence 1, the j-th of sequence 2, the k-th of sequence 3, etc.

  14. AGT A-T -GT A-T -GT

  15. Problem of Sequence Weights The available sequences are not randomly sampled, but reflect biases in how we collect sequences. If we weight everything equally, then closely related sequences will be allowed to dominate the multiple alignment. As a result, conclusions about 1) conservation, 2) evolutionary distance, and 3) reliability of predictions will be wrong.

  16. Sequence Weighting Example
     CYEGNGHF  Human-1
     CYEGNGDF  Human-2
     CYHGNGDS  Mouse
     CYHGNGQS  Rat
     CFNGNGHS  Fruitfly
     Solution: don’t weight the two humans equally with the others. Use a measure of similarity to down-weight their influence on the multiple alignment.
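One very simple way to turn "a measure of similarity" into weights is sketched below. It is only an illustrative assumption, not the weighting scheme of any specific program: each sequence's raw weight is the reciprocal of its summed fractional identity to all sequences (itself included), and the weights are then normalized to sum to 1. The function names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length aligned rows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def similarity_weights(aligned):
    """Down-weight sequences that resemble many others in the set."""
    raw = {
        name: 1.0 / sum(identity(seq, other) for other in aligned.values())
        for name, seq in aligned.items()
    }
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

seqs = {
    "Human-1":  "CYEGNGHF",
    "Human-2":  "CYEGNGDF",
    "Mouse":    "CYHGNGDS",
    "Rat":      "CYHGNGQS",
    "Fruitfly": "CFNGNGHS",
}
for name, w in similarity_weights(seqs).items():
    print(f"{name:9s} {w:.3f}")
```

On these five short sequences the most divergent one (Fruitfly) receives the largest weight. Because this scheme reacts to overall similarity across the whole set, it does not necessarily single out the two human sequences; tree-based weights, as used in progressive aligners, are a common alternative.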

  17. Feng-Doolittle Progressive MSA 1. Do global pairwise alignments (Needleman and Wunsch) for every pair of sequences 2. Create a guide tree based on them (e.g., neighbor joining) 3. Progressively align the sequences with weights from the guide tree

  18. Progressive MSA stage 1 of 3: generate global pairwise alignments five distantly related lipocalins best score

  19. Number of pairwise alignments needed: for N sequences, (N-1)(N)/2 ~ N^2 / 2. For 5 sequences, (4)(5)/2 = 10.
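Stage 1 can be sketched as a loop over all unordered pairs, each scored by a global (Needleman-Wunsch style) alignment. The scoring values and the function names `nw_score` and `all_pairwise` are assumptions for illustration; a real run would use a substitution matrix and the gap penalties of the chosen program. The five example sequences are the ones from the sequence weighting slide.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty (toy parameters)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def all_pairwise(seqs):
    """Score every unordered pair: N(N-1)/2 alignments for N sequences."""
    return {(x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}

seqs = {
    "Human-1":  "CYEGNGHF",
    "Human-2":  "CYEGNGDF",
    "Mouse":    "CYHGNGDS",
    "Rat":      "CYHGNGQS",
    "Fruitfly": "CFNGNGHS",
}
scores = all_pairwise(seqs)
print(len(scores))                       # 10 pairs for 5 sequences
print(scores[("Human-1", "Human-2")])
```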

  20. Feng-Doolittle stage 2: guide tree • Convert similarity scores to distance scores • Use some clustering algorithm to construct the guide tree (UPGMA) • A tree shows the distance between objects • A guide tree is not a phylogenetic tree

  21. Progressive MSA stage 2 of 3: generate a guide tree calculated from the distance matrix 4 1 2 3 5

  22. Feng-Doolittle stage 3: progressive alignment • Make successive alignment based on the order in the guide tree • Start with the two most closely related sequences • Then add the next closest sequence (or cluster) • Continue until all sequences are added • Rule: “once a gap, always a gap.”
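The step of adding one more sequence to an existing alignment can be sketched as follows. This is an illustration under stated assumptions, not the Feng-Doolittle implementation: a residue is scored against a column by averaging toy pairwise scores over the column, gaps cost a fixed linear penalty, and the function name `add_sequence` is hypothetical. The key property is that existing columns are only ever copied or padded with a new all-gap column, so gaps already present are never removed.

```python
def add_sequence(alignment, seq, match=1, mismatch=-1, gap=-2):
    """Align seq to an existing alignment (list of equal-length rows)."""
    n, m = len(alignment[0]), len(seq)

    def col_score(c, ch):
        # Average toy score of residue ch against every character in column c.
        column = [row[c] for row in alignment]
        return sum(
            gap if x == "-" else (match if x == ch else mismatch) for x in column
        ) / len(column)

    # Fill the score matrix (rows: alignment columns, columns: residues of seq).
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + col_score(i - 1, seq[j - 1]),
                S[i - 1][j] + gap,      # gap in the new sequence
                S[i][j - 1] + gap,      # new all-gap column in the old rows
            )

    # Trace back, building all rows from right to left.
    old_rows = [[] for _ in alignment]
    new_row = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + col_score(i - 1, seq[j - 1]):
            for out, row in zip(old_rows, alignment):
                out.append(row[i - 1])
            new_row.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            for out, row in zip(old_rows, alignment):
                out.append(row[i - 1])
            new_row.append("-")
            i -= 1
        else:
            for out in old_rows:
                out.append("-")         # old gaps stay; a new gap column is added
            new_row.append(seq[j - 1])
            j -= 1
    return ["".join(reversed(r)) for r in old_rows] + ["".join(reversed(new_row))]

for row in add_sequence(["AGT", "A-T"], "GT"):
    print(row)
```

Calling this repeatedly in the order given by the guide tree gives a simple progressive alignment; merging two clusters (rather than a cluster and a single sequence) would need a column-against-column score, which is left out of this sketch.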

  23. Progressive MSA stage 3 of 3: progressively align the sequences

  24. Why “once a gap, always a gap”? • Where gaps are added is a critical question • Gaps are often added to the first two (closest) sequences • To change the initial gap choices later on would be to give more weight to distantly related sequences • To maintain the initial gap choices is to trust that those gaps are most believable

  25. Problems with progressive algorithms • The ultimate MSA depends on the initial pairwise sequence alignment with the highest score • Errors in initial alignments are propagated • Gaps can proliferate, if not careful • Gap penalties can be made amino-acid specific, so that you penalize the introduction of gaps into segments that are less likely to have gaps (e.g. the hydrophobic core)

  26. Multiple sequence alignment to profile HMMs • Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each position arranged in a column of a multiple sequence alignment • HMMs are probabilistic models • Just as a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignment
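The per-column probabilities mentioned on this slide can be estimated from an alignment as sketched below. This covers only the emission probabilities of the column ("match") states; a full profile HMM also has insert and delete states and transition probabilities, which are not modeled here. The add-one pseudocount and the function name `column_emissions` are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_emissions(rows, pseudocount=1.0):
    """Estimate P(residue | column) for each column of an alignment.

    Gaps are ignored; the pseudocount keeps unseen residues above zero.
    """
    profiles = []
    for col in zip(*rows):
        residues = [c for c in col if c != "-"]
        denom = len(residues) + pseudocount * len(AMINO_ACIDS)
        profiles.append(
            {aa: (residues.count(aa) + pseudocount) / denom for aa in AMINO_ACIDS}
        )
    return profiles

rows = ["CYEGNGHF", "CYEGNGDF", "CYHGNGDS", "CYHGNGQS", "CFNGNGHS"]
profile = column_emissions(rows)
print(round(profile[0]["C"], 3))   # fully conserved column
print(round(profile[1]["Y"], 3), round(profile[1]["F"], 3))
```

A conserved column concentrates probability on one residue, while a variable column spreads it out, which is how a profile captures position-specific information that a single pairwise score matrix cannot.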
