
CrossWA: Improving Multiple Sequence Alignment Accuracy with Three-Sequence Alignment

CrossWA is an algorithm that combines three-sequence alignment with pairwise alignment to improve the accuracy of multiple sequence alignment. It reduces the impact of the initial branching order and introduces a position-specific gap penalty into three-sequence alignment. The algorithm is compared with existing methods on BAliBASE 2.0 using SP and TC scores.



  1. Outline • Introduction • Motivation • Algorithm • Experiments • Conclusions SSLAB, Department of Computer Science, National Tsing Hua University

  2. Introduction • Multiple sequence alignment (MSA) • NP-hard problem • Heuristic methods for MSA • Progressive methods • ClustalW, T-Coffee, POA, etc. • Iterative methods • Muscle, DIALIGN, etc. • Probabilistic methods • Probcons, Hmmt, Muscle, etc. • Anchor-based methods • MAFFT, Align-m, etc.

  3. Introduction (cont’) • Pairwise alignment • Uses dynamic programming to find the optimal alignment. [Needleman, J Mol Biol 1970; Smith, J Mol Biol 1981] • Three-sequence alignment • Can be more accurate than pairwise alignment. [Murata, PNAS 1985] • Introduced a linear gap penalty. [Gotoh, J Theor Biol 1986] • Space reduced from O(N^3) to O(N^2) with an affine gap penalty. [Huang, ACM 1993] • Useful for MSA. [Makoto, Bioinformatics 1993; CY Lin, CMCT 2006, ICPP 2007]
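
The dynamic programming idea behind three-sequence alignment can be sketched as follows. This is a minimal illustration, not the algorithm of any of the cited papers: it uses a linear gap penalty, sum-of-pairs column scoring, and the full O(N^3) table rather than Huang's quadratic-space scheme. The scoring values (`match`, `mismatch`, `gap`) and all function names are assumptions made for the example.

```python
from itertools import product

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols (hypothetical scoring scheme)."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def column_score(x, y, z):
    """Sum-of-pairs score of one three-symbol alignment column."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)

def align3_score(a, b, c):
    """Optimal sum-of-pairs score for three sequences, cubic time and space."""
    neg_inf = float("-inf")
    la, lb, lc = len(a), len(b), len(c)
    dp = [[[neg_inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    dp[0][0][0] = 0
    # Seven moves: each sequence either consumes a symbol (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    x = a[pi] if di else GAP
                    y = b[pj] if dj else GAP
                    z = c[pk] if dk else GAP
                    cand = dp[pi][pj][pk] + column_score(x, y, z)
                    if cand > dp[i][j][k]:
                        dp[i][j][k] = cand
    return dp[la][lb][lc]
```

Pairwise alignment is the same recurrence with two indices and three moves; the third index is what raises the cost from O(N^2) to O(N^3).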

  4. Introduction (cont’) • Progressive multiple sequence alignment (progressive pairwise MSA) • Aligns pairs of sequences (or groups) following the branching order of the guide tree until all sequences are aligned. • The resulting alignment is affected by the initial branching order. • Gap problems • Once introduced, a gap is never removed. • An insertion gap may be counted multiple times. [Loytynoja, PNAS 2005]

  5. Introduction (cont’) • Progressive triple MSA - aln3nn • Published in [Matthias, BMC Bioinformatics July, 2007]. • Every alignment step is a three-sequence alignment. • The three-sequence alignment uses Huang’s algorithm with the same affine gap penalty as [Huang, ACM 1993].

  6. Motivation • CrossWA - combines three-sequence and pairwise alignments • Minimize the problems of progressive pairwise MSA • Use three-sequence alignment to reduce the effect of the initial branching order. • Increase alignment accuracy • Three-sequence alignment may produce more accurate alignments. • Keep pairwise alignment, because three-sequence alignment is not always better than pairwise alignment. • For pairwise alignment, a position-specific gap penalty is more accurate than an affine gap penalty. [Thompson, Bioinformatics 1995] • Introduce a position-specific gap penalty into three-sequence alignment, unlike aln3nn. • Avoid increasing the computing time

  7. Motivation (cont’) • Comparison of three protein sequences among different methods

  8. Motivation (cont’) • Three-sequence alignment vs. progressive pairwise MSA with three sequences (430 test sets, randomly selected from BAliBASE 2.0 Ref 1-5) • Three-sequence alignment with position-specific gap penalty and sequence weighting

  9. Motivation (cont’) • Progressive pairwise MSA (ClustalW) vs. progressive triple MSA (aln3nn) on reference set 1, BAliBASE 2.0 [Matthias, BMC Bioinformatics 2007, 7]

  10. General Process of Progressive Multiple Sequence Alignment • Unaligned sequences • Step 1. Calculating distance matrix • Step 2. Constructing guide tree • Step 3. Alignment: aligning pairs of sequences or groups along the branching order • Aligned sequences
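
Steps 1 and 2 of the general process can be sketched as below. This is an illustration under stated assumptions, not the procedure of ClustalW or CrossWA: the distance is a hypothetical k-mer Jaccard distance (no alignment is computed), and the branching order comes from plain average-linkage clustering rather than Neighbour-Joining. All function names are invented for the example.

```python
def kmer_distance(a, b, k=2):
    """Hypothetical distance: 1 minus the Jaccard similarity of k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)

def distance_matrix(seqs):
    """Step 1: all-against-all distances between the unaligned sequences."""
    n = len(seqs)
    return [[kmer_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

def merge_order(dist):
    """Step 2: average-linkage clustering; returns the branching order as a
    list of (cluster_a, cluster_b) merges, each cluster a tuple of indices."""
    clusters = [(i,) for i in range(len(dist))]
    order = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[x] for j in clusters[y])
                d = total / (len(clusters[x]) * len(clusters[y]))
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        order.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return order

seqs = ["ACGT", "ACGT", "TTTT"]
order = merge_order(distance_matrix(seqs))
```

Step 3 would then walk `order` and align the two sequences or groups named by each merge, which is why an early wrong merge propagates into the final alignment.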

  11. Algorithm • Process of CrossWA • Step 1. Construct the distance matrix. • Step 2. Build the guide tree with Neighbour-Joining. • Sequence weights are calculated. • Step 3. Build a new guide tree modified from the guide tree of step 2. • Branches are changed to allow three-sequence and pairwise alignments. • Sequence weights are recalculated. • Step 4. Alignment. • Pairwise alignment • Three-sequence alignment • Compare with the alignment produced by progressive pairwise alignment of the same three sequences and select the better one.

  12. Algorithm (cont’) • Unaligned sequences • Step 1. Calculating distance matrix • Step 2. Constructing guide tree • Step 3. Constructing new tree modified from the guide tree in step 2 • Step 4. Alignment: aligning pairs or triples of sequences (or groups) along the branching order of the new tree; three-sequence alignment vs. progressive pairwise MSA • Aligned sequences

  13. Algorithm (cont’) • The branch changing rule • Type I • Type II • Type III

  14. Algorithm (cont’) • The evaluation of three-sequence alignment • Progressive pairwise: S’ = Align(B, C), S’’ = Align(A, S’) • Three-sequence: T’ = Align(A, B, C) • If SP(S’’) > SP(T’), keep S’’ • If SP(T’) > SP(S’’), keep T’
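
The selection rule on this slide can be sketched as follows. The sketch assumes an alignment is a list of equal-length gapped strings and uses a hypothetical match/mismatch/gap scheme for SP; the slide does not say which alignment is kept on a tie, so keeping the progressive pairwise result in that case is an assumption of this example.

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment (hypothetical scoring values)."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # a gap against a gap carries no score
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

def keep_better(progressive, triple):
    """Keep T' only if it scores strictly higher than S'' (tie rule assumed)."""
    return triple if sp_score(triple) > sp_score(progressive) else progressive

# S'' and T' for the same three sequences "AC", "AC", "ACC".
s2 = ["AC-", "A-C", "ACC"]
t1 = ["AC-", "AC-", "ACC"]
chosen = keep_better(s2, t1)
```

In this toy case the three-sequence alignment avoids one misplaced gap, so `chosen` is `t1`.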

  15. Algorithm (cont’) • Modification of sequence weights • Sequence weights are calculated as in ClustalW. • Before modification: weight of Hba_Human = 0.055 + 0.219/2 + 0.061/4 + 0.015/5 + 0.062/6 = 0.194 • After modification: length between node A and node C = 0.219 + 0.061 = 0.280 • Weight of Hba_Human = 0.055 + 0.280/2 + 0.077/5 = 0.210 • Gap penalty strategy • Introduce a position-specific gap penalty into three-sequence alignment (modified from ClustalW).
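
The weight arithmetic on this slide follows one pattern: each branch on the path from the sequence to the root contributes its length divided by the number of sequences sharing it. The sketch below replays the two sums with the branch lengths printed on the slide; the function name and the (length, shared) representation are assumptions for illustration. With these rounded lengths the first sum evaluates to roughly 0.193, so the slide's 0.194 presumably reflects unrounded branch lengths.

```python
def sequence_weight(branches):
    """Sum of branch_length / number_of_sequences_sharing_the_branch."""
    return sum(length / shared for length, shared in branches)

# Path of Hba_Human before the tree is modified.
before = [(0.055, 1), (0.219, 2), (0.061, 4), (0.015, 5), (0.062, 6)]

# After modification, the A-C branch is 0.219 + 0.061 and the remaining
# branch of length 0.077 is shared by five sequences.
after = [(0.055, 1), (0.219 + 0.061, 2), (0.077, 5)]

w_before = sequence_weight(before)
w_after = sequence_weight(after)
print(round(w_before, 3), round(w_after, 3))
```

Merging branches for a three-sequence step removes an internal node, so fewer and longer shared branches enter the sum and the weight changes.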

  16. Experiments • System environment • Linux (AMD Opteron 250, 2.4 GHz, with 512 MB of memory) • Data source • BAliBASE 2.0 • Reference sets 1-5. [T-Coffee, Muscle, Probcons, aln3nn, etc.] • Reference sets 6-8 contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed. [Muscle]

  17. Experiments (cont’) • Scoring functions • Sum-of-pairs (SP) score • Total column (TC) score • Proportion of best alignments (%) • No. of best alignments of the method / No. of total test sets • Compared algorithms • CrossWAfast, CrossWAfull, ClustalW 1.83, T-Coffee 5.05, Muscle 3.6. • CrossWAfast: uses only type I of the branch changing rule. • CrossWAfull: uses all types of the branch changing rule.
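
The SP and TC benchmark scores compare a test alignment against a reference alignment of the same sequences. The sketch below is a simplified version, assuming alignments are lists of equal-length gapped strings in the same row order and that every reference column is scored; it ignores any core-block annotation a benchmark may provide. Function names are invented for the example.

```python
def residue_columns(alignment):
    """Per column, the residue index of each row (None where there is a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for row, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns

def aligned_pairs(columns):
    """Set of residue pairs placed in the same column."""
    pairs = set()
    for col in columns:
        for r1 in range(len(col)):
            for r2 in range(r1 + 1, len(col)):
                if col[r1] is not None and col[r2] is not None:
                    pairs.add((r1, col[r1], r2, col[r2]))
    return pairs

def sp_tc(test, reference):
    """SP: fraction of reference residue pairs reproduced by the test.
    TC: fraction of reference columns reproduced exactly.
    Assumes the reference contains at least one aligned residue pair."""
    test_cols = residue_columns(test)
    ref_cols = residue_columns(reference)
    ref_pairs = aligned_pairs(ref_cols)
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    test_set = set(test_cols)
    tc = sum(col in test_set for col in ref_cols) / len(ref_cols)
    return sp, tc

reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
sp, tc = sp_tc(test, reference)
```

TC is the stricter of the two: a single misplaced residue spoils the whole column, while SP still credits the pairs that remain correct.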

  18. Experiments (cont’) • The comparison of SP scores among different alignment methods

  19. Experiments (cont’) • The comparison of TC scores among different alignment methods

  20. Experiments (cont’) • The SP scores of each method at varying average identities in the Reference 1 data set

  21. Experiments (cont’) • The TC scores of each method at varying average identities in the Reference 1 data set

  22. Experiments (cont’) • The performance of CrossWA with 20 sequences

  23. Experiments (cont’) • The performance of CrossWA with 40 sequences

  24. Experiments (cont’) • Comparison of performance among different methods with 20 sequences

  25. Experiments (cont’) • Comparison of performance among different methods with 40 sequences

  26. Conclusions • Three-sequence alignment can produce better alignments than pairwise alignment, but not for all data sets. • Combining three-sequence and pairwise alignment keeps the better alignment at each alignment step of progressive MSA. • The experimental results suggest that CrossWA is another useful tool for aligning multiple sequences. • CrossWA can also be used to align DNA sequences. • For genome data, computing time is a problem; it can be addressed with parallel programming. [CY Lin, ICPP 2007]

  27. Web service Http://140.114.91.10/Genome

  28. References • Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-453. [Needleman, J Mol Biol 1970] • Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197. [Smith, J Mol Biol 1981] • Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci U S A 1985, 82:3073-3077. [Murata, PNAS 1985] • Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, 327-337. [Gotoh, J Theor Biol 1986] • Huang X: Alignment of three sequences in quadratic space. Applied Computing Review 1993, 1:7-11. [Huang, ACM 1993] • Makoto H, Maski H, Masato I, Tomoyuki T: MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. J Mol Biol 1993, 2:161-167. [Makoto, Bioinformatics 1993]

  29. References (cont’) • CY Lin, CT Huang, YC Chung, Chuan YT: Parallel Three-sequence Alignment with Space-efficient. Proceedings of the 23rd Workshop on Combinatorial Mathematics and Computation Theory, Chang-Hua, Taiwan, April 2006, 160-165. [CY Lin, CMCT 2006] • CY Lin, CT Huang, YC Chung, Chuan YT: Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. International Conference on Parallel Processing 2007. [CY Lin, ICPP 2007] • Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30):10557-10562. [Loytynoja, PNAS 2005] • Matthias K, Peter FS: Progressive multiple sequence alignments from triplets. BMC Bioinformatics 2007. [Matthias, BMC Bioinformatics July, 2007] • Thompson JD: Introducing variable gap penalties to sequence alignment in linear space. Bioinformatics 1995, 11:181-186. [Thompson, Bioinformatics 1995]

  30. Thank you for your attention
