Lecture 8 Chapter 6 Multiple Sequence Alignment Methods
The major goal of computational sequence analysis is to predict the structure and function of genes and proteins from their sequence.
Biological Motivation • Compare a new sequence with the sequences in a protein family. Proteins can be categorized into families. A protein family is a collection of homologous proteins with similar sequence, 3-D structure, function, and/or evolutionary history. • Gain insight into evolutionary relationships. By counting the mutations needed to go from an ancestral sequence to an extant sequence, one can estimate how long ago the two sequences diverged.
The Global Alignment problem
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Contents • What a multiple alignment means • Scoring a multiple alignment • Position specific (minimum entropy) scores • Sum of pair scores • Multidimensional dynamic programming • Progressive alignment methods • Multiple alignment by profile HMM training
Multiple Sequence Alignment • In chapter 5, we assumed that a reasonable multiple sequence alignment was already known and provided the starting point for constructing a profile HMM • We now look at what a “reasonable” multiple alignment is, and at ways to construct one automatically from unaligned sequences • MSA must usually be inferred from primary sequences alone (the primary structure of a protein is the chain of amino acids, in a specific order, linked by peptide bonds)
MSA Biological sequences are typically grouped into functional families. Biologists produce high quality multiple sequence alignments by hand using expert knowledge. Important factors are: • Specific sorts of columns in alignments, such as highly conserved residues or buried hydrophobic residues; • The influence of the secondary structure (α-helices, β-strands etc.) and tertiary structure, such as the alternation of hydrophobic and hydrophilic columns in exposed β-strands; • Expected patterns of insertions and deletions, which tend to alternate with blocks of conserved sequence; • Phylogenetic relationships between sequences, which dictate constraints on the changes that occur in columns and in the patterns of gaps.
MSA • Manual multiple alignment is tedious • An automatic method must have a way to assign a score so that better multiple alignments get better scores
Multiple Sequence Alignment: Why? • Identify highly conserved residues • Likely to be essential sites for structure/function • More precision from multiple sequences • Better structure/function prediction, pairwise alignments • Building gene/protein families • Use conserved regions to guide search • Basis for phylogenetic analysis • Infer evolutionary relationships between genes
Multiple Sequence Alignment: Why? • Remember: The goal of biological sequence comparison is to discover functional (or structural) similarities. • Unfortunately, if the sequence similarity is weak, pairwise alignment can fail to identify biologically related sequences (because weak pairwise similarities may fail the statistical test for significance). Indeed, similar proteins may not exhibit a strong sequence similarity. • The good news is that simultaneous comparison of many sequences often allows one to find similarities that are invisible in pairwise sequence comparison. • [Hubbard et al., 1996]: “Pairwise alignment whispers… multiple alignment shouts out loud.”
What a multiple alignment means • In an MSA, homologous residues among a set of sequences are aligned together in columns • Homologous is meant in both the structural and evolutionary sense • Ideally, the residues in a column occupy similar 3-D structural positions and all diverge from a common ancestral residue
Figure 6.1 • A manually generated multiple alignment of 10 immunoglobulin superfamily sequences (proteins that all carry part of the immunoglobulin fold are collectively called the immunoglobulin superfamily) • A crystal structure of one of the sequences (1tlk, telokin) is known
Figure 6.1 • At the top: β-strands (a-g). At the bottom: identical residues (letters), or highly conservative residues (+). • The conserved regions include 8 β-strands and certain key residues such as two completely conserved cysteines (C) in the b and f strands • The other 9 sequences have been manually aligned to 1tlk based on this expert structural knowledge
Issues • Automatic multiple sequence alignment methods are a topic of extensive research in bioinformatics • Except for trivial cases of nearly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single “correct” multiple alignment • Since protein structures also evolve, we do not expect two protein structures with different sequences to be entirely superposable • Very similar sequences will generally be aligned unambiguously (a simple program can get the alignment right) • For cases of interest (e.g. a family of proteins with only 30% average pairwise sequence identity), there is no objective way to define an unambiguously correct alignment • Once again, in general, an automatic method must assign a score so that better multiple alignments get better scores
Issues • For cases of interest (e.g. a family of proteins with only 30% average pairwise sequence identity), there is no objective way to define an unambiguously correct alignment • The globin family, often used as a “typical” protein family in computational work, is in fact exceptional: almost the entire structure is conserved among divergent sequences
The Choice of the sequences: Sequences sharing a common ancestor (homologous sequences) • PSI-BLAST, FASTA, Various Search Tools • The Choice of an objective function Biological problem that lies in the definition of correctness • Sum of pair, Entropy score, Consistency based, … • The Optimization of that function • Exact Algorithms (Dynamic Programming) • Progressive alignment (ClustalW) • Iterative approaches (SA, GA, …)
Problem Statement What are the conserved regions among a set of sequences over the same alphabet?
12345678  Position index
EMQPILLL  Sequence 1
DMLR-LL-  Sequence 2
NMK-ILLL  Sequence 3
DMPPVLIL  Sequence 4
DM LL  Consensus sequence
Scoring a Multiple Alignment The scoring system should take into account that: • Some positions are more conserved than others, e.g. position-specific scoring; • The sequences are not independent, but instead related by a phylogenetic tree.
Complex Scoring • Specify a complete probabilistic model of molecular sequence evolution • Given the correct phylogenetic tree for the sequences to be aligned, the probability for a multiple alignment is the product of the probabilities of all the evolutionary events necessary to produce that alignment via ancestral intermediate sequences times the prior probability for the root ancestral sequence
Complex Scoring • The probabilities of evolutionary change would depend on the evolutionary times along each branch of the tree, as well as position-specific structural and functional constraints imposed by natural selection, so that the key residues and structural elements would be conserved • High-probability alignments would then be good structural and evolutionary alignments under this model • Unfortunately, we do not have enough data to parametrise such a complex evolutionary model
Simplifying Assumptions • Partly or (as we do in this chapter) entirely ignore the phylogenetic tree • Consider that individual columns of an alignment are statistically independent
Defining a scoring function for multiple alignment • Almost all multiple alignment methods assume that the individual columns are statistically independent. • However, most multiple alignment methods use affine gap scoring functions, so successive gap residues are not treated independently. • For simplicity, here we will focus on definitions of $S(m_i)$ for scoring a column of aligned residues with no gaps, which leads to $S(m) = \sum_i S(m_i)$, where $m$ is the multiple alignment and the $m_i$ are its columns.
Sum of Pairs (SP) Scores This is the standard method for scoring multiple alignments • Assumes the statistical independence of columns. • Columns are scored by a “sum of pairs” (SP) function. The SP score for a column is defined as: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$, where $m_i^k$ is the symbol of sequence k in column i and the scores s(a,b) come from a substitution matrix such as PAM or BLOSUM.
Sum of Pairs (SP) Scores Drawback: • There is no probabilistic justification of the SP score. • Each sequence is scored as if it descended from N-1 other sequences instead of a single ancestor. Evolutionary events are over-counted, a problem which increases as the number of sequences increases. Altschul, Carroll & Lipman [1989] proposed a weighting scheme designed to partially compensate for this defect in SP scores.
Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment. A pairwise alignment induced by the multiple alignment (the two rows taken together, with columns that contain only gaps removed).
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
Example : Sum of pair score Sequence alignments Seq A: ARGTCAGATACGLAG---PGMCTETWV Seq B: ARATCGGAT---IAGTIYPGMCTHTWV Scoring substitutions are represented in matrices. The popular ones are PAM or BLOSUM.
Similarity Measurement: SP-score The Sum of Pairs (SP) score is the similarity score among the amino acids (or bases) at a particular position of a multiple sequence alignment. A gap aligned with a gap scores 0 for both similarity and distance: s(-,-) = 0. The score of a position is $S(M) = \sum_{i<j} s(m_i, m_j)$, where M is the collection of symbols at that position of the alignment. Example: S(P,R,-,P) = s(P,R) + s(P,-) + s(P,P) + s(R,-) + s(R,P) + s(-,P)
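The column formula above can be sketched in a few lines of Python. The pair score used here (`toy_score`, with match 2, mismatch -1, gap -2) is a made-up stand-in chosen for illustration only; the slides assume a substitution matrix such as PAM or BLOSUM, which would be plugged in through the `s` argument.

```python
from itertools import combinations


def toy_score(a, b, match=2, mismatch=-1, gap=-2):
    """Hypothetical pair score; NOT a real substitution matrix."""
    if a == "-" and b == "-":
        return 0  # gap aligned with gap scores 0, as stated on the slide
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column_score(column, s=toy_score):
    """Sum s(m_i, m_j) over all unordered pairs i < j of one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


# The slide's example column (P, R, -, P): six pairs are summed.
print(sp_column_score("PR-P"))
```

With N symbols in the column, `combinations` yields the N(N-1)/2 pairs that appear in the slide's expansion of S(P,R,-,P).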
Similarity Measurement: SP-score Multiple alignment:
1 PEAALFGKFT---IKSDVW
2 AESALYGRFT---IKSDVW
3 PDTAIWGKF---SIKSETW
4 PEVIRMGDDNPFSFQSDVW
Use only sequences 2 and 3:
2 AESALYGRFT---IKSDVW
3 PDTAIWGKF---SIKSETW
Remove positions which contain only gaps (which produces an induced pairwise alignment, or a projection of the multiple alignment in 2 dimensions):
2 AESALYGRFT-IKSDVW
3 PDTAIWGKF-SIKSETW
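The projection step described above (keep two rows, drop the columns where both have a gap) can be sketched as follows. The function name `induced_pairwise` is an illustrative choice, not something defined in the lecture.

```python
def induced_pairwise(row_k, row_l):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns in which both rows carry a gap are dropped; everything else
    is kept in its original order.
    """
    kept = [(a, b) for a, b in zip(row_k, row_l)
            if not (a == "-" and b == "-")]
    return ("".join(a for a, _ in kept),
            "".join(b for _, b in kept))


# Rows 2 and 3 of the slide's alignment.
seq2 = "AESALYGRFT---IKSDVW"
seq3 = "PDTAIWGKF---SIKSETW"
for row in induced_pairwise(seq2, seq3):
    print(row)
```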
Similarity Measurement: SP-score
12345678  Position index
EMQPILLL  Sequence 1
DMLR-LL-  Sequence 2
NMK-ILLL  Sequence 3
DMPPVLIL  Sequence 4
Given a multiple sequence alignment and the Sum of Pairs (SP) score, we may compute the score of each position of the alignment and then add all the position scores to get the total score of the whole alignment. Or, we may compute the score for each induced pairwise alignment and add these scores. If we have N sequences, the number of pairs is N(N-1)/2.
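A small check that the two ways of totalling the SP score agree, using the four-sequence alignment on this slide. The pair score is again a made-up toy function (an assumption for illustration), and gaps are scored per symbol pair, which is what makes the two totals identical: a gap-gap pair contributes 0 whether or not the column is removed.

```python
from itertools import combinations


def toy_score(a, b, match=2, mismatch=-1, gap=-2):
    """Hypothetical pair score; a real run would use PAM or BLOSUM."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


alignment = ["EMQPILLL", "DMLR-LL-", "NMK-ILLL", "DMPPVLIL"]

# Way 1: score every column, then add the column scores.
by_columns = sum(
    sum(toy_score(a, b) for a, b in combinations(column, 2))
    for column in zip(*alignment)
)

# Way 2: score every induced pairwise alignment, then add the pair scores.
row_pairs = list(combinations(alignment, 2))
by_pairs = sum(
    sum(toy_score(a, b) for a, b in zip(row_k, row_l))
    for row_k, row_l in row_pairs
)

print(len(row_pairs), by_columns == by_pairs)
```

The equality would not hold in general under affine gap penalties, where the score of a gap depends on its neighbours within the induced pairwise alignment.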
Example : Sum of pair score (Cont.) Multiple Sequence alignments Seq A1: ARGTCAGATACGLAG---PGMCTETWV---- Seq A2: ARATCGGAT---IAGTIYPGMCTHTWVIAGQ Seq A3: ARATCE--TACG--GTI-PGMCTHTWVIA--
A problem with SP scores: Example • Consider an alignment of N sequences which all have leucine (L) at a certain position. The score of an L aligned to L is 5 (BLOSUM), so the score of the column is 5×N(N-1)/2, where N(N-1)/2 is the number of symbol pairs in the column. • If there were one glycine (G) in the column and N-1 Ls, the score would be 9×(N-1) less, because a G-L pair scores -4 (9 less than an L-L pair) and N-1 pairs are affected. • So, the SP score for a column with one G is worse than the score for a column of all Ls by a fraction of $\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$. • Notice the inverse dependence on N: the relative difference in score between the correct alignment and the incorrect alignment decreases with the number of sequences in the alignment. This is counter-intuitive, because the relative difference ought to increase the more evidence we have for a conserved leucine.
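The arithmetic of this example can be verified directly. The sketch below uses only the two scores quoted on the slide (L-L = 5, G-L = -4); the function names are illustrative.

```python
from fractions import Fraction

S_LL = 5   # score of L aligned to L, as quoted on the slide
S_GL = -4  # score of G aligned to L, as quoted on the slide


def all_leucine_score(n):
    """SP score of a column of n leucines: n(n-1)/2 L-L pairs."""
    return S_LL * n * (n - 1) // 2


def one_glycine_score(n):
    """SP score of a column with one G and n-1 Ls."""
    ll_pairs = (n - 1) * (n - 2) // 2
    gl_pairs = n - 1
    return S_LL * ll_pairs + S_GL * gl_pairs


def relative_drop(n):
    """Fraction by which the one-G column scores worse than the all-L column."""
    best = all_leucine_score(n)
    return Fraction(best - one_glycine_score(n), best)


for n in (4, 10, 100):
    print(n, relative_drop(n))
```

The printed fractions shrink as N grows, which is the counter-intuitive behaviour the slide points out.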
Position Specific Scores • $m$ is a multiple alignment; $m_i$ the column of aligned symbols in column i; $m_i^j$ the symbol in column i for sequence j; • $c_{ia}$ is the observed count for residue a in column i: $c_{ia} = \sum_j \delta(m_i^j = a)$, where $\delta(m_i^j = a)$ is 1 if $m_i^j = a$ and 0 otherwise
Position Specific Scores • $c_i = (c_{i1}, \ldots, c_{iK})$ is the count vector of observed symbols in column i for an alphabet of K different residues
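Minimum Entropy Scores • We assume that residues within the column are independent, as well as between columns. • The probability of a column $m_i$ is: $P(m_i) = \prod_a p_{ia}^{c_{ia}}$, where $p_{ia}$ is the probability of residue a in column i and $c_{ia}$ is the observed count of residue a in column i.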
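Minimum Entropy Scores • We define a column score as the negative log of the column probability: $S(m_i) = -\sum_a c_{ia} \log p_{ia}$. The column score is an entropy measure. A conserved column would score 0. • The maximum likelihood estimate for the parameter $p_{ia}$ is $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$, where $c_{ia}$ is the observed count of residue a in column i.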
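A minimal sketch of the minimum entropy column score, using the maximum likelihood estimate (observed frequency) for the residue probabilities. It assumes a gap-free column, as in this part of the lecture; the function name is illustrative.

```python
from collections import Counter
from math import log


def min_entropy_column_score(column):
    """Return -sum_a c_a * log(p_a), with p_a = c_a / (number of symbols).

    Lower is better: a fully conserved column scores 0, and the score
    grows as the column becomes more variable.
    """
    counts = Counter(column)
    total = sum(counts.values())
    return -sum(c * log(c / total) for c in counts.values())


for column in ("LLLL", "LLLG", "LGAV"):
    print(column, round(min_entropy_column_score(column), 3))
```

Unlike the SP score, this measure needs no substitution matrix; it depends only on how the counts are distributed within the column.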
Simultaneous multiple sequence alignment by Multidimensional dynamic programming Assumptions: - the columns of an alignment are statistically independent - gaps are scored with a linear gap cost $\gamma(g) = gd$ for a gap of length g and some gap cost d. Note: Extension to affine gap costs is possible but the formalism becomes tedious. Therefore the overall score for an alignment can be computed as the sum of the scores for each column i: $S(m) = \sum_i S(m_i)$
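Note Let $\alpha_{i_1,\ldots,i_N}$ be the maximum score of an alignment of the prefixes ending at $x^1_{i_1}, \ldots, x^N_{i_N}$. Using the notation $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and $\Delta_i \cdot x = -$ if $\Delta_i = 0$, the recursion relation becomes: $\alpha_{i_1,\ldots,i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \{ \alpha_{i_1-\Delta_1,\ldots,i_N-\Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \}$. Complexity In general, if we assume that the N sequences are roughly the same length L, the memory complexity of the (naive) dynamic programming algorithm for multiple sequence alignment is $O(L^N)$ and the time complexity is $O(2^N L^N)$.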
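The recursion can be sketched for any small number of short sequences. This is a naive illustration under stated assumptions, not the lecture's implementation: columns are scored with the SP function, the pair score is a made-up toy (match 1, mismatch -1), and the linear gap cost d = 2 is applied to every residue-gap pair. Only the optimal score is returned, without traceback.

```python
from functools import lru_cache
from itertools import combinations, product

MATCH, MISMATCH, GAP = 1, -1, -2  # toy values, assumed for illustration


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_score(column):
    """SP score of one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def msa_score(seqs):
    """Optimal SP score by full N-dimensional dynamic programming."""
    n = len(seqs)
    # All 2^N - 1 non-zero choices of (Delta_1, ..., Delta_N).
    moves = [d for d in product((0, 1), repeat=n) if any(d)]

    @lru_cache(maxsize=None)
    def alpha(idx):
        if not any(idx):
            return 0  # all prefixes empty
        best = None
        for delta in moves:
            prev = tuple(i - d for i, d in zip(idx, delta))
            if min(prev) < 0:
                continue  # would step before the start of a sequence
            column = tuple(s[i - 1] if d else "-"
                           for s, i, d in zip(seqs, idx, delta))
            candidate = alpha(prev) + column_score(column)
            if best is None or candidate > best:
                best = candidate
        return best

    return alpha(tuple(len(s) for s in seqs))


print(msa_score(["ACG", "AG", "ACG"]))
```

The `moves` list has 2^N - 1 entries and `alpha` is evaluated once per cell of the N-dimensional table, which is where the O(2^N L^N) time bound comes from. The recursive form is only suitable for very short inputs.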
Carrillo-Lipman Algorithm (1988) Implementation: MSA by Lipman, Altschul & Kececioglu (1989) • This algorithm reduces the volume of the multidimensional dynamic programming matrix. • MSA can optimally align up to five to seven protein sequences of reasonable length (200-300 residues). • Assumption: the score of a multiple alignment is the sum of the scores of all pairwise alignments defined by the multiple alignment.
The score of a complete alignment a is defined as $S(a) = \sum_{k<l} S(a^{kl})$, where $a^{kl}$ denotes the pairwise alignment between sequences k and l induced by a. • Let $\hat{a}^{kl}$ be the optimal pairwise alignment of k, l. Obviously, $S(a^{kl}) \le S(\hat{a}^{kl})$. • Assume that we have a lower bound $\sigma(a)$ on $S(a)$, the score of the optimal multiple alignment a, i.e. $\sigma(a) \le S(a)$.
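(We can obtain a good bound $\sigma(a)$ by any fast heuristic multiple alignment algorithm, for instance the progressive alignment algorithms introduced later.) • Due to the sum of pairs (SP) score definition, with $a^{kl}$ the pairwise alignment of sequences k and l induced by a and $\hat{a}^{kl}$ their optimal pairwise alignment, we have: $\sigma(a) \le S(a) = \sum_{k'<l'} S(a^{k'l'}) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$ and thus $S(a^{kl}) \ge \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$. • Therefore we can set a lower bound $\beta^{kl}$ on $S(a^{kl})$, where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$.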
The N(N-1)/2 optimum pairwise alignments are each calculated and scored by standard pairwise alignment. • The higher the bounds are, the smaller the volume of multidimensional dynamic programming matrix that must be calculated.
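For each pair k, l we can find the complete set $B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores more than the lower bound $\beta^{kl}$ on the induced pairwise score. This set is calculated in $O(L^2)$ time by combining the forward and backward Viterbi scores for each cell of the complete pairwise dynamic programming table (adding them for additive scores; with probabilities the two would be multiplied). • The costly multidimensional dynamic programming algorithm can then be restricted to evaluate only cells in the intersection of these sets: i.e. cells $(i_1, \ldots, i_N)$ for which $(i_k, i_l)$ is in $B^{kl}$ for all k, l.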
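The pairwise cell sets can be sketched with two ordinary global alignment tables, one filled forwards and one backwards. This is an illustration under assumptions: the toy scores (match 1, mismatch -1, linear gap -2) and the function names are invented here, and the bound `beta` is passed in rather than derived from a heuristic alignment. The test uses `>=` so that cells on paths scoring exactly the bound are kept.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # toy values, assumed for illustration


def sub(a, b):
    return MATCH if a == b else MISMATCH


def forward_table(x, y):
    """F[i][j] = best score of aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
                          F[i - 1][j] + GAP,
                          F[i][j - 1] + GAP)
    return F


def backward_table(x, y):
    """B[i][j] = best score of aligning x[i:] with y[j:]."""
    n, m = len(x), len(y)
    R = forward_table(x[::-1], y[::-1])
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def allowed_cells(x, y, beta):
    """Cells (i, j) whose best alignment through them scores at least beta."""
    F, B = forward_table(x, y), backward_table(x, y)
    return {(i, j)
            for i in range(len(x) + 1)
            for j in range(len(y) + 1)
            if F[i][j] + B[i][j] >= beta}


print(sorted(allowed_cells("ACG", "ACG", beta=3)))
```

A full implementation would compute one such set per sequence pair and visit only those N-dimensional cells whose projections fall inside every set.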
Progressive Multiple Alignment Methods These (greedy) methods are the most commonly used approach to multiple sequence alignment. The general idea: • Most progressive alignment algorithms build a “guide tree”, a binary tree whose leaves represent sequences and whose interior nodes represent alignments. (The methods for constructing guide trees can be “quick and dirty” versions of those for phylogenetic trees.)
Progressive Multiple Alignment Methods • Main heuristic: first align the most similar pairs of sequences, using a pairwise alignment method. Then walk up the tree and compute at each interior node the alignment of (alignments of) sequences associated with the direct descendants of that node. • The root node will represent a complete multiple alignment of the input sequences. Note: Progressive alignment methods use no global scoring function of alignment correctness.
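The heuristic can be sketched as a toy progressive aligner. Everything here is an assumption made for illustration: the scores (match 1, mismatch -1, gap -2), the function names, and the guide strategy, which simply merges the two groups whose alignment scores highest at each step instead of building an explicit guide tree. It is not ClustalW or any published method.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # toy values, assumed for illustration


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def cross_score(col_a, col_b):
    """Sum of pair scores between one column of each group."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)


def align_groups(rows_a, rows_b):
    """Globally align two existing alignments; earlier gaps stay frozen."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    gap_a, gap_b = ("-",) * len(rows_a), ("-",) * len(rows_b)
    n, m = len(cols_a), len(cols_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + cross_score(cols_a[i - 1], gap_b)
        move[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + cross_score(gap_a, cols_b[j - 1])
        move[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1]), "diag"),
                (score[i - 1][j] + cross_score(cols_a[i - 1], gap_b), "up"),
                (score[i][j - 1] + cross_score(gap_a, cols_b[j - 1]), "left"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda t: t[0])
    merged, i, j = [], n, m
    while i > 0 or j > 0:  # trace back, collecting merged columns
        if move[i][j] == "diag":
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    size = len(rows_a) + len(rows_b)
    return score[n][m], ["".join(col[k] for col in merged) for k in range(size)]


def progressive_align(seqs):
    """Greedily merge the best-scoring pair of groups until one remains."""
    groups = [([k], [s]) for k, s in enumerate(seqs)]
    while len(groups) > 1:
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda p: align_groups(groups[p[0]][1], groups[p[1]][1])[0])
        _, rows = align_groups(groups[i][1], groups[j][1])
        merged = (groups[i][0] + groups[j][0], rows)
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    ids, rows = groups[0]
    return [row for _, row in sorted(zip(ids, rows))]  # restore input order


for row in progressive_align(["ACGT", "ACGT", "AGT"]):
    print(row)
```

Because gaps introduced in an early merge are never revisited, an early mistake propagates to the final alignment, which is the price of having no global scoring function.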