Multiple Sequence Alignments. It is God’s privilege to conceal things, but the kings’ pride is to research them. (Proverbs 25:2; ascribed to King Solomon of Israel, BC 1000). 1-4, Jan, 2006 Protein Folding Winter School Keehyoung Joo School of Computational Sciences, KIAS , Seoul, Korea.
The major goal of computational sequence analysis is to predict the structure and function of genes and proteins from their sequence.
Contents
• How to make your model from sequence?
• What is a Multiple Sequence Alignment (MSA)?
• How can I use an MSA (motivation)?
• What is the matter of MSA?
  • The choice of the sequences
  • The choice of an objective function
  • The optimization of that function
• How to make an MSA?
How to make your model from sequence?
• Tertiary structure prediction methods
  • Homology modeling
  • Fold recognition
  • Ab initio method
[Figure: an unknown sequence (TTCCPAVRSISNF) is searched against a fold database (Protein Data Bank) to find template folds and an alignment; the model is then built from the templates and the alignment.]
What is a Multiple Sequence Alignment MSA can be seen as a generalization of Pairwise Sequence Alignment.
How can I use an MSA (motivation)?
• Clustering, classification, or categorization of genes/proteins.
• Identification of conserved regions.
• Detecting point mutations.
• Deducing evolutionary relationships and phylogenetic trees.
• Assisting in the prediction of secondary and tertiary structure.
What is the matter of MSA?
• It stands at the crossroads of three distinct technical difficulties:
  • Choice of the sequences (database search with the unknown sequence)
  • Choice of an objective function: what is a good alignment? (biology)
  • Optimization of that function: what is the good alignment? (computation)
• The choice of the sequences: sequences sharing a common ancestor (homologous sequences)
  • PSI-BLAST, FASTA, various search tools
• The choice of an objective function: a biological problem that lies in the definition of correctness
  • Sum of pairs, entropy score, consistency-based, ...
• The optimization of that function
  • Exact algorithms (dynamic programming)
  • Progressive alignment (ClustalW)
  • Iterative approaches (SA, GA, ...)
Example: Sum of pair score
Pairwise sequence alignment:
Seq A: ARGTCAGATACGLAG---PGMCTETWV
Seq B: ARATCGGAT---IAGTIYPGMCTHTWV
Substitution scores are represented in matrices. The popular ones are PAM and BLOSUM.
Example: Sum of pair score (cont.)
Multiple sequence alignment:
Seq A1: ARGTCAGATACGLAG---PGMCTETWV----
Seq A2: ARATCGGAT---IAGTIYPGMCTHTWVIAGQ
Seq A3: ARATCE--TACG--GTI-PGMCTHTWVIA--
Exact method: multi-dimensional dynamic programming
- Time complexity O(L^n 2^n), space complexity O(L^n), for n sequences of length L
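A sum-of-pairs score can be sketched as follows. This is a minimal illustration, not the scoring used by any particular program: the match/mismatch/gap values are assumed toy numbers standing in for a PAM or BLOSUM matrix, and the function names `pair_score` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols with a toy scheme (not PAM/BLOSUM)."""
    if a == GAP and b == GAP:
        return 0  # a gap aligned with a gap contributes nothing
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment, **scoring):
    """Sum pair_score over every column and every pair of rows."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scoring)
    return total

# The three-sequence alignment from the slide, scored with the toy scheme.
msa = [
    "ARGTCAGATACGLAG---PGMCTETWV----",
    "ARATCGGAT---IAGTIYPGMCTHTWVIAGQ",
    "ARATCE--TACG--GTI-PGMCTHTWVIA--",
]
print(sum_of_pairs(msa))
```

With two rows the function reduces to an ordinary pairwise alignment score, which is the sense in which MSA scoring generalizes the pairwise case.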
Recent research in the literature
• MAFFT (2002): based on the fast Fourier transform
• MUSCLE (2004): progressive alignment, pairwise profile alignment, position-specific gap penalty
• PROBCONS (2005): progressive alignment, probability table using HMM, probabilistic consistency-based MSA
Example: Progressive alignment
• Pairwise alignment of all sequence pairs: 1+2, 1+3, 1+4, 2+3, 2+4, 3+4
• Guide tree built from the pairwise alignments
• MSA built by adding sequences in the order given by the guide tree
Progressive alignment (cont.)
• Distance matrix: displays the distances of all sequence pairs (sequences 1 to 5), with D = 1 - S, where S is the pairwise similarity.
• Guide tree: built from the distance matrix with UPGMA (unweighted pair group method with arithmetic averages) or the Neighbour-Joining method.
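The relation D = 1 - S can be sketched as below. The slide does not say how S is computed, so this sketch assumes S is the fractional identity over gap-free columns, and it assumes the rows are already aligned to equal length. The names `identity` and `distance_matrix` are hypothetical.

```python
def identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def distance_matrix(aligned):
    """Symmetric matrix of D = 1 - S for every pair of aligned rows."""
    n = len(aligned)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = 1.0 - identity(aligned[i], aligned[j])
    return d

for row in distance_matrix(["ACGT", "ACGA", "AC-T"]):
    print(row)
```

In a real progressive aligner each pair would first be aligned on its own; the pre-aligned input here only keeps the sketch short.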
UPGMA Clustering (Guide Tree)
Initial distance matrix d_ij:
     1    2    3    4    5
1    0    2    6    9    7
2         0    5    7    7
3              0    5    4
4                   0    3
5                        0
Merge 1 and 2 (d = 2) into cluster u:
     u    3    4    5
u    0    5.5  8    7
3         0    5    4
4              0    3
5                   0
Merge 4 and 5 (d = 3) into cluster v:
     u    3    v
u    0    5.5  7.5
3         0    4.5
v              0
Merge 3 and v (d = 4.5) into cluster w:
     u    w
u    0    6.8
w         0
Finally, u and w are merged. Resulting guide tree: ((1,2),(3,(4,5))).
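The clustering steps above can be sketched in a few lines. This is a minimal UPGMA sketch under the usual definition (the distance to a merged cluster is the size-weighted average of the distances to its two parts); it returns only the tree topology and merge distances, not branch lengths, and the function name `upgma` is hypothetical.

```python
def upgma(dist, labels):
    """Return (tree, merges): a nested-tuple tree and a list of (left, right, distance)."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        merges.append((ta, tb, dab))
        # Distance from the new cluster to each remaining one:
        # average over all member pairs, i.e. weighted by cluster size.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges

# Distance matrix from the slide (sequences 1 to 5).
D = [
    [0, 2, 6, 9, 7],
    [2, 0, 5, 7, 7],
    [6, 5, 0, 5, 4],
    [9, 7, 5, 0, 3],
    [7, 7, 4, 3, 0],
]
tree, merges = upgma(D, ["1", "2", "3", "4", "5"])
print(tree)  # (('1', '2'), ('3', ('4', '5')))
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

Ties between equally close pairs are broken by dictionary order here; a production implementation would need an explicit rule.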
Progressive alignment (cont.)
• Columns, once aligned, are never changed; only new gaps are inserted.
• Depends strongly on the pairwise alignments and on the initial starting sequences.
• No guarantee that the globally optimal solution will be found.
• When sequence identity is below 25-30%, this approach becomes much less reliable.
[Figure: guide tree and alignment of alignments]
Progressive Alignment: Discussion
• Strengths:
  • Speed
  • Progression is biologically sensible (aligns using a tree)
• Weaknesses:
  • No objective function
  • No way of quantifying whether or not the alignment is good
  • Local minimum problem
Consistency based score function
COFFEE score function (Cédric Notredame): given a set of sequences, the optimal MSA is defined as the one that agrees the most with all the possible optimal pairwise alignments.
Score(Aij) = number of aligned pairs of residues that are shared between Aij and the library, where Aij is the pairwise alignment of sequences i and j taken from the MSA.
• Does not depend on a specific substitution matrix
• Position-dependent alignment
• The most consistent alignments are often closer to the truth
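The counting idea behind the score can be sketched as follows. This is a simplified, unweighted illustration and not the published COFFEE function, which also weights and normalizes the counts. The library is assumed to be a set of residue pairs collected from precomputed pairwise alignments, and all function names are hypothetical.

```python
from itertools import combinations

def residue_pairs(row_i, row_j, i, j):
    """Set of ((i, pos_i), (j, pos_j)) residue pairs aligned in two gapped rows."""
    pairs = set()
    pi = pj = 0  # positions in the ungapped sequences
    for a, b in zip(row_i, row_j):
        if a != "-" and b != "-":
            pairs.add(((i, pi), (j, pj)))
        if a != "-":
            pi += 1
        if b != "-":
            pj += 1
    return pairs

def build_library(pairwise):
    """pairwise maps (i, j), with i < j, to a tuple of two gapped rows."""
    library = set()
    for (i, j), (row_i, row_j) in pairwise.items():
        library |= residue_pairs(row_i, row_j, i, j)
    return library

def consistency_score(msa, library):
    """Count residue pairs aligned in the MSA that also occur in the library."""
    shared = 0
    for i, j in combinations(range(len(msa)), 2):
        shared += len(residue_pairs(msa[i], msa[j], i, j) & library)
    return shared

# Toy sequences ACG, AG, ACG with their pairwise alignments.
library = build_library({
    (0, 1): ("ACG", "A-G"),
    (0, 2): ("ACG", "ACG"),
    (1, 2): ("A-G", "ACG"),
})
print(consistency_score(["ACG", "A-G", "ACG"], library))  # 7
print(consistency_score(["ACG", "AG-", "ACG"], library))  # 5
```

The first MSA agrees with every pairwise alignment in the library and scores higher than the second, which misplaces the G of the middle sequence.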
Summary
• MSAs are essential tools in computational biology and bioinformatics. They are required for structure/function analysis and structure prediction.
• No perfect method exists for assembling an MSA, and all the available methods rely on approximations.
• The most commonly used methods for MSA use a progressive alignment algorithm (ClustalW).
• Recent progress has focused on the design of iterative (Prrp, SAGA) and consistency-based methods (T-Coffee, PROBCONS).
MSA applications
• Profile-profile alignment
• Profile: a table that lists the frequencies of each amino acid in each position of an MSA.
• Profiles can be used in database searches
  • Find new sequences that match the profile
  • Improve search sensitivity
  • Improve search accuracy
Example: Profiles
• Profile: a table that lists the frequencies of each amino acid in each position of a protein sequence.
  • Frequencies are calculated from an MSA containing a domain of interest
  • Allows us to identify a consensus sequence
  • The derived scoring scheme allows us to align a new sequence to the profile
• Profiles can be used in database searches
  • Find new sequences that match the profile
• Profiles are also used to compute multiple alignments heuristically
  • Progressive alignment
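A frequency profile of this kind can be sketched as below. The sketch assumes gaps are simply left out of the counts and uses raw frequencies as the score; practical tools typically add pseudocounts and log-odds scoring, which are omitted here. The function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(msa, alphabet=AMINO_ACIDS):
    """Per-column residue frequencies; gaps are excluded from the counts."""
    profile = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in alphabet}
        )
    return profile

def consensus(profile):
    """Most frequent residue per column; '-' for columns that hold only gaps."""
    return "".join(
        max(col, key=col.get) if any(col.values()) else "-" for col in profile
    )

def score_against_profile(seq, profile):
    """Sum of profile frequencies for an ungapped sequence of the same length."""
    return sum(col.get(res, 0.0) for res, col in zip(seq, profile))

profile = build_profile(["ARGT", "ARGT", "AKAT"])
print(consensus(profile))                       # ARGT
print(score_against_profile("AKAT", profile))
```

Scoring a database sequence against the profile rather than against a single sequence is what lets conserved columns count for more than variable ones.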