Multiple Sequence Alignment Based on Compact Set
Department of Computer Science, National Tsing Hua University
Chuan Yi Tang
Multiple Sequence Alignment
• Given a set of sequences, the MSA problem is to find an alignment of the sequences such that some objective function is minimized
• e.g. the Sum-of-Pairs (SP) score
Input sequences:
S1: ATTCG
S2: AGTCG
S3: ATCAG
An MSA of the three sequences:
S’1: A T – T C – G
S’2: A – G T C – G
S’3: A T – – C A G
Pairwise costs: 2, 2 and 4; Cost = 8
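The Cost = 8 on this slide can be reproduced with a small sketch of the sum-of-pairs cost. It assumes a unit cost model that the slide does not state explicitly (0 for a match or a gap-gap column, 1 for a mismatch or a residue against a gap); the function names are illustrative only, and the gap symbol is written as an ASCII `-`.

```python
from itertools import combinations

def pair_cost(x, y):
    """Assumed unit cost: 0 for identical symbols (including gap-gap), 1 otherwise."""
    return 0 if x == y else 1

def sp_cost(alignment):
    """Sum-of-pairs cost: add the column costs over every pair of aligned rows."""
    assert len({len(row) for row in alignment}) == 1, "rows must have equal length"
    return sum(
        pair_cost(x, y)
        for row_a, row_b in combinations(alignment, 2)
        for x, y in zip(row_a, row_b)
    )

# The alignment S'1, S'2, S'3 from the slide.
alignment = ["AT-TC-G", "A-GTC-G", "AT--CAG"]
print(sp_cost(alignment))  # 2 + 2 + 4 = 8
```

Under this cost model the three pairwise costs are 2, 2 and 4, which matches the numbers shown with the example.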
MSA with SP-Score: Exact Algorithm and Heuristics
• k: number of sequences; n: length of the sequences
• Exact (using dynamic programming)
• O((2n)^k): D. Sankoff, Simultaneous solution of the RNA folding, alignment and protosequence problems, SIAM J. Appl. Math. (1985)
• Heuristics
• D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25, 351-360 (1987)
• S.F. Altschul, D.J. Lipman, Trees, stars, and multiple biological sequence alignment, SIAM J. Appl. Math. (1989)
• D.J. Lipman, S.F. Altschul, A tool for multiple sequence alignment, Proc. Nat. Acad. Sci. U.S.A. (1989)
• S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of multiple sequence comparison methods, Bull. Math. Biol. (1992)
MSA with SP-Score: Complexity
• L. Wang, T. Jiang (McMaster University, Hamilton, Ontario, Canada), On the complexity of multiple sequence alignment, J. Comput. Biol. 1(4):337-48 (Winter 1994). The paper studies the computational complexity of two popular problems in multiple sequence alignment:
1. multiple alignment with SP-score => NP-complete (non-metric)
2. multiple tree alignment => MAX SNP-hard
• P. Bonizzoni, G. Della Vedova, The complexity of multiple sequence alignment with SP-score that is a metric, Theoretical Computer Science 259 (2001) 63-79:
1. multiple alignment with SP-score => NP-complete (metric)
MSA with SP-Score: Approximation
• Approximation algorithms:
• Performance ratio of 2-2/k: D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol. (1993)
• Performance ratio of 2-3/k: P. Pevzner, Multiple alignment, communication cost, and graph matching, SIAM J. Appl. Math. (1992)
• Performance ratio of 2-l/k (assembling l-way alignments, l ≤ k): V. Bafna, E.L. Lawler and P. Pevzner, Approximation algorithms for multiple sequence alignment, Theor. Comput. Sci. (1997)
• Polynomial Time Approximation Scheme (PTAS):
• MSA within a constant band, allowing only a constant number of insertion and deletion gaps of arbitrary length per sequence on average: M. Li, B. Ma and L. Wang, Near optimal alignment within a band in polynomial time, STOC 2000
Compact Set Definition
• Let S be the set of n objects {S1, S2, S3, …, Sn} and let D(Si, Sj) denote the distance between Si and Sj in the distance matrix D.
• Consider any subset C of S. If every distance between an element in C and an element not in C is larger than the longest distance between two elements of C, then C is called a compact set.
• Properties:
• The entire set S is a compact set.
• Each set consisting of a single object is also a compact set.
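The definition above can be checked directly on a distance matrix. The following is a minimal sketch, not the authors' code: `is_compact_set` and the 5-object matrix are illustrative assumptions, with objects indexed from 0.

```python
def is_compact_set(dist, members):
    """Return True if `members` is a compact set under the symmetric matrix `dist`.

    A set is compact when its smallest border distance (inside to outside)
    is strictly larger than its largest inside distance.
    """
    inside = set(members)
    outside = [v for v in range(len(dist)) if v not in inside]
    # Singletons and the entire set are compact by the properties listed above.
    if len(inside) <= 1 or not outside:
        return True
    max_inside = max(dist[a][b] for a in inside for b in inside if a != b)
    min_border = min(dist[a][b] for a in inside for b in outside)
    return min_border > max_inside

# Hypothetical distance matrix for five objects (not taken from the slides).
D = [
    [0, 1, 3, 8, 9],
    [1, 0, 3, 8, 9],
    [3, 3, 0, 8, 9],
    [8, 8, 8, 0, 2],
    [9, 9, 9, 2, 0],
]
print(is_compact_set(D, {0, 1}))     # inside max 1, border min 3
print(is_compact_set(D, {0, 2}))     # inside max 3, border min 1
```

In this made-up matrix, {0, 1}, {0, 1, 2} and {3, 4} are compact, while {0, 2} is not because object 1 lies closer to object 0 than object 2 does.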
Compact Set Example
[Figure: distance matrix and graph over S1 to S6 with Compact Set 1, Compact Set 2 and Compact Set 3 marked. For Compact Set 3, the maximal inside edge is 10 and the minimal border edge is 11.]
Compact Set Example (cont'd)
• Compact sets are hierarchical: they nest inside one another and therefore form a tree
MSA & Compact Set
• Consider an example of 12 protein sequences (the original sequences):
• S1: MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSNDANKYQIKDAVHKLYNVQALKVNTLITPLQQKKAYVRLTADYDALDVANKIGVI
• S2: SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDVTVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF
• S3: SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEVADAVEEQYDVTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF
• S4: MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
• S5: MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKYQIKKAVKELYEVDVLSVNTLVRPNGTKKAYVRLTADFDALDIANRIGYI
• S6: MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQLFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY
• S7: MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNTLVFIVDIKADKKKIKDAVKKMYDIQTKKVNTLIRPDGTKKAYVRLTPDYDALDVANKIGII
• S8: MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKHQIKQAVKDLYEVDVLAVNTLIRPNGTKKAYVRLTADHDALDIANKIGYI
• S9: MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIEDNNTLLFIVDLKANKRQIADAVKKLYDVTPLRVNTLIRPDGKKKAFVRLTPEVDALDIANKIGFI
• S10: MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
• S11: APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNILVFQVSMKANKYQIKKAVKELYEVDVLKVNTLVRPNGTKKAYVRLTADYDALDIANRIGYI
• S12: MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTDKAMKKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAYIRLSASYDALDTANKMGLV
MSA & Compact Set (cont'd)
[Figure: original distance matrix and original compact set tree]
• A good MSA should preserve the compact sets as well
MSA & Compact Set(con’t) • S1’ :-----------------MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA… • S2’ :---------------------------------------------------------------------------------SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDV… • S3’ :--------------------------------------------------------------------------------SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEV… • S4’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES… • S5’ :----------------------MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKK… • S6’ :------------------------------------------------------------------------------MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEA… • S7’ : ----------MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE… • S8’ :----------------------MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKK… • S9’ :------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN… • S10’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE… • S11’ : -----------------------APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKK… • S12’ :MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP… MSA by MSA1
MSA & Compact Set(con’t) • S1’ : ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQHPL… • S2’ : ---------------------------------------------------------------------------------------S--SIIDYPLVTEKAMDEMDFQNKLQFIVDID- AAK… • S3’ : ---------------------------------------------------------------------------------------SW-DVIKHPHVTEKAMNDMDFQNKLQFAVD-DRA… • S4’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKFP… • S5’ : -----------------MAPST-KATAAKKAVVKGT-NG--K--KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIEQPITSET… • S6’ : -------------------------------------------------------------------------------------MDAF-DVIKTPIVSEKTMKLIEEENRLVFYVER-KATK… • S7’ : MAP-A--KAD-PS-KKSDPK-A-QAAKVAKAVKSG--STLKK--KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILKYP… • S8’ : -----------------MAPST-KAASAKKAVVKGS-NG--S--KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIVAPIASET… • S9’ : MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR… • S10’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKF… • S11’ : ------------------APSA-KATAAKKAVVKGT-NG--K--KALKV--R---TSATFRLPKTLKLARAPKYASKAVPH-YNRLDSYK-VIEQPITSET… • S12’ : ------MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR… MSA by MSA2
MSA & Compact Set (cont'd)
[Figure: distance matrix by MSA1 and compact set tree by MSA1]
MSA & Compact Set (cont'd)
[Figure: distance matrix by MSA2 and compact set tree by MSA2]
Measure of Compact Set Preservation
• How can we measure compact set preservation quantitatively?
N1: number of original compact set relations
N2: number of relations preserved after MSA
Estimate: Compact Set Preservation = N2 / N1
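Measure of Compact Set Preservation (cont'd)
[Figure: distance matrix and compact set tree over objects 1 to 5]
Original compact set relations (a triple "a b c" is read here as: a and b share a compact set that excludes c):
1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3, 4 5 1, 4 5 2, 4 5 3
N1 = 10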
Measure of Compact Set Preservation (cont'd)
Original relations:
1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3, 4 5 1, 4 5 2, 4 5 3
[Figure: distance matrix after MSA and compact set tree after MSA]
Relations after MSA (× marks a relation that differs from the original):
1 2 4, 1 2 5, 1 4 3 ×, 3 5 1 ×, 2 4 3 ×, 3 5 2 ×, 1 2 3, 1 4 5 ×, 2 4 5 ×, 3 5 4 ×
N2 = 10 - 7 = 3
Estimate: Compact Set Preservation = 3/10
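The 3/10 figure can be recomputed with a short sketch. It rests on two assumptions that the slides do not spell out: a relation (a, b, c) is generated whenever a and b share a compact set that excludes c, and the non-trivial compact sets are {1,2}, {1,2,3}, {4,5} before MSA and {1,2}, {1,2,4}, {3,5} after MSA (inferred from the listed triples, since the tree figures are not available). The function names are illustrative.

```python
from itertools import combinations

def relations(compact_sets, universe):
    """Collect triples (a, b, c): a and b lie in a compact set that excludes c."""
    found = set()
    for members in compact_sets:
        outsiders = set(universe) - set(members)
        for a, b in combinations(sorted(members), 2):
            for c in outsiders:
                found.add((a, b, c))
    return found

def preservation(original_sets, msa_sets, universe):
    """Return (N2, N1): relations preserved after MSA, and original relations."""
    original = relations(original_sets, universe)
    after = relations(msa_sets, universe)
    return len(original & after), len(original)

universe = {1, 2, 3, 4, 5}
original_sets = [{1, 2}, {1, 2, 3}, {4, 5}]   # assumed tree before MSA
msa_sets = [{1, 2}, {1, 2, 4}, {3, 5}]        # assumed tree after MSA
n2, n1 = preservation(original_sets, msa_sets, universe)
print(n2, n1)  # 3 10
```

Storing the relations in a Python set plays the role of the hash table: membership tests are constant time on average, so the comparison is dominated by enumerating the triples.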
Why Pairwise Compact Sets?
• The evolutionary tree is the real judge
• An evolutionary tree tends to minimize the total length of its evolutionary edges (the tree size) derived from the pairwise distances, which appears to give a compact structure
• This holds in our experiments
Compact Set Relation Preserved Rate for Evolutionary Tree
Rate = (# of relations preserved in the evolutionary tree) / (# of compact set relations of the pairwise distance)
The larger, the better
Compact Set Evaluation Algorithm
• Step 1: Construct the original compact set tree T and the compact set tree after MSA T’ [1].
• Step 2: Traverse T’ in preorder to generate the compact set relations after MSA, R’, and mark the entries of the hash table H’ according to R’.
• Step 3: Traverse T in preorder to generate the original compact set relations R, and check which entries of R are marked in the hash table H’.
• Total time complexity = O( ), where n is the number of sequences
• Reference:
• 1. E. Dekel, J. Hu and W. Ouyang, An optimal algorithm for finding compact sets, Inform. Process. Lett. 44 (1992) 285-289
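For Step 1, a naive way to enumerate the compact sets is sketched below. This is not the optimal algorithm of reference [1]; it relies on the observation that every compact set shows up as a connected component at some stage of Kruskal-style merging, so it is enough to test each newly merged component. The function name and the sample matrix are hypothetical.

```python
def find_compact_sets(dist):
    """Naive compact-set enumeration by checking each Kruskal merge (illustrative)."""
    n = len(dist)
    parent = list(range(n))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    compact = [frozenset([v]) for v in range(n)]  # singletons are always compact
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue
        parent[root_i] = root_j
        component = frozenset(v for v in range(n) if find(v) == root_j)
        outside = [v for v in range(n) if v not in component]
        max_inside = max(dist[a][b] for a in component for b in component if a != b)
        min_border = min(
            (dist[a][b] for a in component for b in outside), default=float("inf")
        )
        if min_border > max_inside:
            compact.append(component)
    return compact

# Hypothetical distance matrix for five objects (not taken from the slides).
D = [
    [0, 1, 3, 8, 9],
    [1, 0, 3, 8, 9],
    [3, 3, 0, 8, 9],
    [8, 8, 8, 0, 2],
    [9, 9, 9, 2, 0],
]
for members in find_compact_sets(D):
    if len(members) > 1:
        print(sorted(members))
```

The nested compact sets returned here give the internal nodes of the compact set tree; this sketch favors clarity over speed and rescans the component after every merge.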
Our Strategy for MSA
• Progressive alignment (Feng and Doolittle, 1987) with neighbor first (using the Kruskal merging order of the Minimum Spanning Tree (MST))
• Set-to-set alignment. Once a gap, always a gap.
[Figure: Kruskal merging order tree over S1, S2, S3, S4 with merges numbered 1, 2, 3; the two pairwise alignments (set1, set2) are then aligned set-to-set.]
S1 and S2 aligned:
S1:AACAGACTTA-
S2:-ACAGACTTAA
S3 and S4 aligned:
S3:----ACAGACTCCA
S4:TTTAAAAGTC----
Result of the set-to-set alignment:
S1:---AACAGACTT-A-
S2:----ACAGACTT-AA
S3:----ACAGACTCCA-
S4:TTTAAAAGTC-----
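The Kruskal merging order can be sketched as follows. This is an illustration only: the function name and the 4-sequence distance matrix are assumptions, chosen so that the merges come out in the order suggested by the example (S1 with S2, then S3 with S4, then the two sets).

```python
def kruskal_merge_order(dist):
    """Return the list of (set_a, set_b) merges in Kruskal order (illustrative)."""
    n = len(dist)
    group = {v: frozenset([v]) for v in range(n)}  # current set of each sequence
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    merges = []
    for _, i, j in edges:
        set_i, set_j = group[i], group[j]
        if set_i == set_j:
            continue  # already in the same set: this edge would close a cycle
        merges.append((set_i, set_j))
        merged = set_i | set_j
        for v in merged:
            group[v] = merged
    return merges

# Hypothetical pairwise distances for S1..S4 (indices 0..3).
D = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 2],
    [5, 5, 2, 0],
]
for step, (a, b) in enumerate(kruskal_merge_order(D), start=1):
    print(step, sorted(a), sorted(b))
```

Each merge in the returned list corresponds to one set-to-set alignment, so n sequences always lead to n - 1 alignments.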
Q: Why do we use the MST Kruskal order?
A1: Its structure is similar to that of the compact set tree
[Figure: MST order merge tree and compact set tree]
A2: The MST Kruskal order is easy to obtain
Score function
[Figure: the alignment below annotated with the scored events: match, mismatch, gap-open, gap-extended, begin-gap and end-gap]
---AACAGACTT-A-
----ACAGAC---AA
----ACAGACTCCA-
TTTAAAAGTC-C---
Strategy of set-to-set alignment
Score(8, 8) = Max{
Score(7, 7) + (α8:β8),
Score(7, 8) + (α8:G3),
Score(8, 7) + (G2:β8) }
(G2 and G3 denote all-gap columns of 2 and 3 rows.)
*(α8:β8) = (G,C)+(G,-)+(G,G)+(-,C)+(-,-)+(-,G) = (-10)+(-15)+(10)+(-15)+(0)+(-15) = -45
Time complexity of the setα to setβ alignment = O(sα*sβ*lα*lβ), which is 2*3*8*8 in this example, where sα, sβ are the numbers of sequences in setα and setβ respectively, and lα, lβ are the lengths of the aligned sequences in setα and setβ respectively.
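The column score and the recurrence can be sketched as below. The score values (match 10, mismatch -10, residue against gap -15, gap against gap 0) are read off the worked example; treating every gap alike, without separate gap-open, gap-extended, begin-gap or end-gap scores, is a simplifying assumption of this sketch, and the function names are illustrative.

```python
from itertools import product

MATCH, MISMATCH, GAP = 10, -10, -15  # values taken from the worked example

def pair_score(a, b):
    """Score one symbol pair; a gap against a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def column_score(col_a, col_b):
    """Score a column of set alpha against a column of set beta (all cross pairs)."""
    return sum(pair_score(a, b) for a, b in product(col_a, col_b))

def set_to_set_score(set_a, set_b):
    """Best score of aligning two already-aligned sets, column against column."""
    len_a, len_b = len(set_a[0]), len(set_b[0])
    cols_a = [[row[i] for row in set_a] for i in range(len_a)]
    cols_b = [[row[j] for row in set_b] for j in range(len_b)]
    gaps_a, gaps_b = ["-"] * len(set_a), ["-"] * len(set_b)
    score = [[0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gaps_b)
    for j in range(1, len_b + 1):
        score[0][j] = score[0][j - 1] + column_score(gaps_a, cols_b[j - 1])
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gaps_b),
                score[i][j - 1] + column_score(gaps_a, cols_b[j - 1]),
            )
    return score[len_a][len_b]

# Column alpha8 = (G, -) against column beta8 = (C, -, G) from the slide.
print(column_score(["G", "-"], ["C", "-", "G"]))  # -45
```

Each DP cell costs sα*sβ pair scores and there are about lα*lβ cells, which is where the stated sα*sβ*lα*lβ bound comes from. Only cross-set pairs are scored because the columns inside each set are frozen ("once a gap, always a gap").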
Time Complexity of our strategy
• The worst case occurs when the binary merge tree is balanced.
• The total set-to-set time complexity is bounded in terms of l, the length of the resulting sequences, and n, the number of sequences.
• The worst-case time complexity = O(n²l²)
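One way to arrive at the O(n²l²) bound for a balanced merge tree is sketched below. It is a reconstruction under stated assumptions, not the formula from the original slide: n is taken to be a power of two, n = 2^h, every intermediate alignment has length at most l, and level i of the tree (counted from the root) contains 2^{i-1} merges of two sets with n/2^i sequences each.

```latex
\begin{align*}
T(n, l) &\le \sum_{i=1}^{h} 2^{\,i-1} \left(\frac{n}{2^{i}}\right)^{2} l^{2}
         = n^{2} l^{2} \sum_{i=1}^{h} 2^{-(i+1)} \\
        &< n^{2} l^{2} \left(\tfrac{1}{4} + \tfrac{1}{8} + \cdots\right)
         = \tfrac{1}{2}\, n^{2} l^{2}
         = O(n^{2} l^{2})
\end{align*}
```

Each term uses the per-merge cost sα*sβ*lα*lβ with sα = sβ = n/2^i and lα, lβ ≤ l.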
Useful MSA tools
• GCG (Genetics Computer Group): PileUp
• http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi
• ClustalW
• http://clustalw.genome.ad.jp/
Clustal W
• Pairwise alignment
• Calculate the distance matrix
• Construct the unrooted Neighbor-Joining (NJ) tree
• Construct the rooted NJ tree
• rooted at the “mid-point”
• Progressive alignment
• Align following the rooted NJ tree
• set-to-set alignment
SP Score Result
[Figure: SP scores of the compared methods]
• The results of ClustalW and of our method are better than GCG’s
• The larger, the better
Compact Set Relation Failure Rate Result
Failure rate = (# of relations not preserved) / (# of source compact set relations)
The smaller, the better
Three-point Relative Scale Preserved Rate
For every three species A, B, C, we check whether their relative distance relation is the same in the original distance matrix and in the MSA distance matrix.
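A possible reading of this measure is sketched below. It assumes that the "relative distance relation" of a triple is the ranking of its three pairwise distances d(A,B), d(A,C), d(B,C); the slide does not define it further, so both this interpretation and the function names are assumptions.

```python
from itertools import combinations

def triple_order(dist, a, b, c):
    """Rank the three pairwise distances of a triple (ties keep index order)."""
    pairs = [(a, b), (a, c), (b, c)]
    return tuple(sorted(range(3), key=lambda k: dist[pairs[k][0]][pairs[k][1]]))

def three_point_preserved_rate(original, after_msa):
    """Fraction of triples whose distance ranking is identical in both matrices."""
    triples = list(combinations(range(len(original)), 3))
    same = sum(
        triple_order(original, *t) == triple_order(after_msa, *t) for t in triples
    )
    return same / len(triples)

# Hypothetical 3-species matrices: scaling every distance keeps the ranking.
original = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
scaled = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
print(three_point_preserved_rate(original, scaled))  # 1.0
```

Because only rankings are compared, the measure does not change when all distances of one matrix are rescaled by the same positive factor.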
I Believe Tree Only
• One might still not believe that the original pairwise distance is a good judge
• Such a reader trusts only the true evolutionary tree
Compact Set Relation Failure Rate
Take the 12-protein data set as an example
Failure rate = (# of relations not preserved) / (# of source compact set relations)
[Figure: failure rate by distance and MSA method]
The smaller, the better
Future Work
• Are our measure and algorithms really good? Simulations and a Web service
• Does our MSA by set-to-set alignment satisfy some approximation property? Theoretical proof
• How can we reduce the running time? Hardwired dynamic programming, e.g. PARACEL http://www.paracel.com/