CrossWA is an algorithm that combines three-sequence alignment with pairwise alignment to improve the accuracy of multiple sequence alignment. It reduces the impact of the initial branching order and introduces a position-specific gap penalty into three-sequence alignment. The algorithm is compared with existing methods and shown to provide more accurate alignments.
Outline • Introduction • Motivation • Algorithm • Experiments • Conclusions SSLAB, Department of Computer Science, National Tsing Hua University
Introduction • Multiple sequence alignment (MSA) • NP-hard problem • Heuristic methods for MSA • Progressive methods: ClustalW, T-Coffee, POA, etc. • Iterative methods: Muscle, DIALIGN, etc. • Probabilistic methods: Probcons, Hmmt, Muscle, etc. • Anchor-based methods: MAFFT, Align-m, etc.
Introduction (cont’) • Pairwise alignment • Uses dynamic programming to find the optimal alignment. [Needleman, J Mol Biol 1970; Smith, J Mol Biol 1981] • Three-sequence alignment • Can be more accurate than pairwise alignment. [Murata, PNAS 1985] • A linear gap penalty was introduced. [Gotoh, J Theor Biol 1986] • Space was reduced from O(N^3) to O(N^2), with an affine gap penalty. [Huang, ACM 1993] • Useful for MSA. [Makoto, Bioinformatics 1993; CY Lin, CMCT 2006, ICPP 2007]
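The dynamic programming behind pairwise alignment can be sketched as follows. This is a minimal Needleman-Wunsch global alignment with a linear gap penalty; the function name `needleman_wunsch` and the scores (`match=1`, `mismatch=-1`, `gap=-2`) are illustrative assumptions, not values taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))


score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(score)
print(row_a)
print(row_b)
```

The table has (N+1) x (M+1) cells, so time and space are both quadratic for two sequences; the three-sequence case adds a third dimension.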
Introduction (cont’) • Progressive multiple sequence alignment (progressive pairwise MSA) • Aligns pairs of sequences following the branching order of the guide tree until all sequences are aligned. • The resulting alignment is affected by the initial branching order. • Gap problems • Gaps, once inserted, are never removed. • An insertion gap may be counted multiple times. [Loytynoja, PNAS 2005]
Introduction (cont’) • Progressive triple MSA - aln3nn • Published in [Matthias, BMC Bioinformatics July, 2007]. • Every alignment step is a three-sequence alignment. • The three-sequence alignment uses the same affine gap penalty as [Huang, ACM 1993]. • Uses Huang’s three-sequence alignment algorithm.
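To make the three-sequence step concrete, here is a minimal sketch of a three-way dynamic program that returns only the optimal score. It is a plain cubic-space recurrence with a linear gap penalty and sum-of-pairs column scoring; it is not Huang's quadratic-space algorithm and not the aln3nn implementation. The name `three_way_score` and all score values are assumptions for illustration.

```python
from itertools import product


def three_way_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score for globally aligning three sequences."""

    def pair(x, y):
        # Score one pair of symbols within a column; gap-gap pairs cost nothing.
        if x == '-' and y == '-':
            return 0
        if x == '-' or y == '-':
            return gap
        return match if x == y else mismatch

    n, m, k = len(a), len(b), len(c)
    neg_inf = float('-inf')
    # dp[i][j][l] is the best score for aligning a[:i], b[:j], c[:l].
    dp = [[[neg_inf] * (k + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    dp[0][0][0] = 0
    # The seven possible columns: each sequence contributes a residue or a gap,
    # excluding the all-gap column.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    # Lexicographic order guarantees every predecessor cell is already filled.
    for i, j, l in product(range(n + 1), range(m + 1), range(k + 1)):
        if i == 0 and j == 0 and l == 0:
            continue
        best = neg_inf
        for di, dj, dl in moves:
            pi, pj, pl = i - di, j - dj, l - dl
            if pi < 0 or pj < 0 or pl < 0:
                continue
            x = a[pi] if di else '-'
            y = b[pj] if dj else '-'
            z = c[pl] if dl else '-'
            column = pair(x, y) + pair(x, z) + pair(y, z)
            best = max(best, dp[pi][pj][pl] + column)
        dp[i][j][l] = best
    return dp[n][m][k]


print(three_way_score("ACG", "ACG", "AG"))
```

The table has (N+1)^3 cells and each cell looks at seven predecessors, which is why reducing the space requirement matters for longer sequences.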
Motivation • CrossWA - combines three-sequence and pairwise alignments • Mitigate the problems of progressive pairwise MSA • Use three-sequence alignment to reduce the effect of the initial branching order. • Increase the accuracy of alignment • Three-sequence alignment may obtain more accurate alignments. • Keep pairwise alignment, because three-sequence alignment is not always better than pairwise alignment. • For pairwise alignment, a position-specific gap penalty is more accurate than an affine gap penalty. [Thompson, Bioinformatics 1995] • Introduce a position-specific gap penalty into three-sequence alignment, unlike the algorithm “aln3nn”. • Avoid increasing the computing time
Motivation (cont’) • Comparison of alignments of three protein sequences produced by different methods
Motivation (cont’) • Three-sequence alignment vs. progressive pairwise MSA with three sequences (430 test sets, randomly selected from BAliBASE 2.0 Reference sets 1-5) • Three-sequence alignment with position-specific gap penalty and sequence weighting
Motivation (cont’) • Progressive pairwise MSA (ClustalW) vs. progressive triple MSA (aln3nn), reference set 1, BAliBASE 2.0 [Matthias, BMC Bioinformatics 2007, 7]
General process of progressive multiple sequence alignment • Input: unaligned sequences • Step 1. Calculate the distance matrix. • Step 2. Construct the guide tree. • Step 3. Alignment: align pairs of sequences or groups along the branching order. • Output: aligned sequences
Algorithm • Process of CrossWA • Step 1. Construct the distance matrix. • Step 2. Build the guide tree (Neighbour-Joining). • Sequence weights are calculated. • Step 3. Build a new guide tree modified from the guide tree of step 2. • Branches are changed to allow three-sequence and pairwise alignments. • Sequence weights are recalculated. • Step 4. Alignment. • Pairwise alignment • Three-sequence alignment • Compare with the alignment produced by progressive pairwise alignment of the same three sequences and select the better one.
Algorithm (cont’) • Input: unaligned sequences • Step 1. Calculate the distance matrix. • Step 2. Construct the guide tree. • Step 3. Construct a new tree modified from the guide tree in step 2. • Step 4. Alignment: align pairs or triples of sequences (or groups) along the branching order of the new tree; at three-way nodes, progressive pairwise MSA vs. three-sequence alignment. • Output: aligned sequences
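The traversal in step 4 can be sketched as a post-order walk over a tree whose internal nodes have either two or three children. The nested-tuple tree encoding, the function name `alignment_steps`, and the step labels are hypothetical; the slides do not specify a data structure, and the actual alignment calls are left out.

```python
def alignment_steps(tree):
    """List the alignment steps, in execution order, for a nested-tuple tree.

    Leaves are sequence names (str). A 2-tuple node is a pairwise step and a
    3-tuple node is a three-way step.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node  # a leaf: a single unaligned sequence
        # Children are aligned first, so their groups exist before this step.
        labels = tuple(visit(child) for child in node)
        if len(labels) == 2:
            kind = "pairwise"
        elif len(labels) == 3:
            # In CrossWA this result is compared with the progressive
            # pairwise alignment of the same three inputs.
            kind = "three-way"
        else:
            raise ValueError("internal nodes must have 2 or 3 children")
        steps.append((kind, labels))
        return "(" + ",".join(labels) + ")"

    visit(tree)
    return steps


for kind, members in alignment_steps((("A", "B", "C"), ("D", "E"))):
    print(kind, members)
```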
Algorithm (cont’) • The branch-changing rule (figure: Type I, Type II, Type III)
Algorithm (cont’) • The evaluation of three-sequence alignment • Progressive pairwise alignment of A, B, C: S’ = Align(B, C), S’’ = Align(A, S’) • Three-sequence alignment of A, B, C: T’ = Align(A, B, C) • If SP(S’’) > SP(T’), keep S’’ • If SP(T’) > SP(S’’), keep T’
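The selection rule above can be sketched with an unweighted sum-of-pairs (SP) score. The scoring values, the function names `sp_score` and `keep_better`, and the tie-breaking choice (keep the progressive pairwise result when the scores are equal) are assumptions; the slides give no rule for ties, and CrossWA itself may apply sequence weights and a substitution matrix.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Unweighted sum-of-pairs score of an alignment given as equal-length rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == '-' and y == '-':
                continue  # gap-gap pairs do not contribute
            if x == '-' or y == '-':
                total += gap
            else:
                total += match if x == y else mismatch
    return total


def keep_better(progressive, three_way):
    """Return the three-way alignment only if its SP score is strictly higher."""
    if sp_score(three_way) > sp_score(progressive):
        return three_way
    return progressive


s2 = ["ACG", "ACG", "AG-"]  # stands in for S'' = Align(A, Align(B, C))
t1 = ["ACG", "ACG", "A-G"]  # stands in for T'  = Align(A, B, C)
print(sp_score(s2), sp_score(t1))
print(keep_better(s2, t1))
```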
Algorithm (cont’) • Modification of sequence weights • Sequence weights are calculated in the same way as in ClustalW. • Before the branch change (figure: tree with nodes A, B, C, D): weight of Hba_Human = 0.055 + 0.219/2 + 0.061/4 + 0.015/5 + 0.062/6 = 0.194 • After the branch change (figure: tree with nodes A, C, D): length between node A and node C = 0.219 + 0.061 = 0.280; weight of Hba_Human = 0.055 + 0.280/2 + 0.077/5 = 0.210 • Gap penalty strategy • Introduce a position-specific gap penalty into three-sequence alignment (modified from ClustalW).
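The weight arithmetic above follows one pattern: each branch length on the path from a leaf to the root is divided by the number of sequences sharing that branch, and the quotients are summed. A minimal sketch using the figures from the slide; `sequence_weight` and the (length, share count) pair encoding are illustrative names, not part of CrossWA or ClustalW. Evaluated directly from the rounded branch lengths shown, the first sum comes to about 0.193, slightly below the 0.194 on the slide, presumably because the displayed lengths are rounded.

```python
def sequence_weight(path):
    """Sum branch_length / n_sharing over the branches from a leaf to the root.

    path is a list of (branch_length, number_of_sequences_sharing_the_branch).
    """
    return sum(length / shared for length, shared in path)


# Hba_Human before the branch change: five branches up to the root.
before = sequence_weight([(0.055, 1), (0.219, 2), (0.061, 4), (0.015, 5), (0.062, 6)])
# After the branch change: 0.219 + 0.061 merge into 0.280, 0.015 + 0.062 into 0.077.
after = sequence_weight([(0.055, 1), (0.280, 2), (0.077, 5)])
print(round(before, 3), round(after, 3))
```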
Experiments • System environment • Linux (AMD Opteron 250, 2.4 GHz, with 512 MB of memory) • Data source • BAliBASE 2.0 • Reference sets 1-5. [T-Coffee, Muscle, Probcons, aln3nn, etc.] • Reference sets 6-8 contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed. [Muscle]
Experiments (cont’) • Scoring functions • Sum-of-pairs score (SP) • Total column score (TC) • Proportion of best alignments (%) • Number of test sets on which the method gives the best alignment / total number of test sets • Algorithms compared • CrossWAfast, CrossWAfull, ClustalW 1.83, T-Coffee 5.05, Muscle 3.6. • CrossWAfast: uses only Type I of the branch-changing rule. • CrossWAfull: uses all types of the branch-changing rule.
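As a rough illustration of the two benchmark scores, the sketch below computes SP as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC as the fraction of reference columns reproduced exactly. This assumes the usual reference-based definitions and scores every column; an official BAliBASE scoring program may restrict scoring to annotated core regions. All function names are illustrative.

```python
def columns(rows):
    """Turn alignment rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(rows)
    result = []
    for column in zip(*rows):
        entry = []
        for seq, symbol in enumerate(column):
            if symbol == '-':
                entry.append(None)
            else:
                entry.append(counters[seq])
                counters[seq] += 1
        result.append(tuple(entry))
    return result


def aligned_pairs(rows):
    """Set of (seq_s, index_s, seq_t, index_t) residue pairs sharing a column."""
    pairs = set()
    for col in columns(rows):
        for s in range(len(col)):
            for t in range(s + 1, len(col)):
                if col[s] is not None and col[t] is not None:
                    pairs.add((s, col[s], t, col[t]))
    return pairs


def sp_accuracy(test, ref):
    """Fraction of reference residue pairs that the test alignment also aligns."""
    ref_pairs = aligned_pairs(ref)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


def tc_accuracy(test, ref):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    ref_cols = columns(ref)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


reference = ["ACG", "A-G"]
candidate = ["ACG", "AG-"]
print(sp_accuracy(candidate, reference), tc_accuracy(candidate, reference))
```

Both functions assume the reference contains at least one aligned residue pair and one column; an empty reference would need a guard.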
Experiments (cont’) • Comparison of SP scores among different alignment methods
Experiments (cont’) • Comparison of TC scores among different alignment methods
Experiments (cont’) • SP scores of each method at various average identities in the Reference 1 data set
Experiments (cont’) • TC scores of each method at various average identities in the Reference 1 data set
Experiments (cont’) • Performance of CrossWA with 20 sequences
Experiments (cont’) • Performance of CrossWA with 40 sequences
Experiments (cont’) • Comparison of performance among different methods with 20 sequences
Experiments (cont’) • Comparison of performance among different methods with 40 sequences
Conclusions • Three-sequence alignment can produce better alignments than pairwise alignment, but not for all data sets. • Combining three-sequence alignment and pairwise alignment keeps the better alignment at every alignment step in progressive MSA. • The experimental results suggest that CrossWA can be another useful tool for aligning multiple sequences. • CrossWA can be used to align DNA sequences. • For aligning genome data, computing time is a problem. It can be addressed by parallel programming. [CY Lin, ICPP 2007]
Web service Http://140.114.91.10/Genome
References • Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-453. [Needleman, J Mol Biol 1970] • Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197. [Smith, J Mol Biol 1981] • Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci U S A 1985, 82:3073-3077. [Murata, PNAS 1985] • Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, 327-337. [Gotoh, J Theor Biol 1986] • Huang X: Alignment of three sequences in quadratic space. Applied Computing Review 1993, 1:7-11. [Huang, ACM 1993] • Makoto H, Maski H, Masato I, Tomoyuki T: MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. J Mol Biol 1993, 2:161-167. [Makoto, Bioinformatics 1993]
References (cont’) • CY Lin, CT Huang, YC Chung, Chuan YT: Parallel Three-sequence Alignment with Space-efficient. Proceedings of the 23rd Workshop on Combinatorial Mathematics and Computation Theory, Chang-Hua, Taiwan, April 2006, 160-165. [CY Lin, CMCT 2006] • CY Lin, CT Huang, YC Chung, Chuan YT: Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. International Conference on Parallel Processing 2007. [CY Lin, ICPP 2007] • Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30):10557-10562. [Loytynoja, PNAS 2005] • Matthias K, Peter FS: Progressive multiple sequence alignments from triplets. BMC Bioinformatics 2007. [Matthias, BMC Bioinformatics July, 2007] • Thompson JD: Introducing variable gap penalties to sequence alignment in linear space. Bioinformatics 1995, 11:181-186. [Thompson, Bioinformatics 1995]
Thank you for your attention