creativecommons/licenses/by-sa/2.0/

http://creativecommons.org/licenses/by-sa/2.0/

Multiple Alignments & Molecular Evolution Prof:Rui Alves ralves@cmb.udl.es 973702406 Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08 Website of the Course:http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/ Course: http://10.100.14.36/Student_Server/

Part I: Multiple Alignments

Pairwise Alignment • We have seen how pairwise alignments are made. • Dynamic programming is an efficient algorithm for finding the optimal alignment. • Break problem into smaller subproblems • Solve subproblems optimally, recursively • Use optimal solutions to construct an optimal solution for the original problem • Alignments require a substitution (scoring) matrix that accounts for gap penalties.

Sub. Matrix: Basic idea • Probability of substitution (mutation)

PAM Matrices • A family of matrices (PAM-N) • Based upon an evolutionary model • The score for a substitution of nucleotides/amino acids is based on how much we expect that substitution to be observed after a certain length of evolutionary time • The scores are derived by a Markov model – i.e., the probability that one amino acid will change to another is not affected by changes that occurred at an earlier stage of evolutionary history

Nucleic acid PAM matrices • PAM = point accepted mutation • 1 PAM = 1% probability of mutation at each sequence position. • A uniform PAM1 matrix for a familiy of closely related proteins:

How did they get the values for PAM-1? • Look at 71 groups of protein sequences where the proteins in each group are at least 85% similar (Why these groups?) • Compute relative mutability of each amino acid – probability of change • From relative mutability, compute mutability probability for each amino acid pair X,Y– probability that X will change to Y over a certain evolutionary time • Normalize the mutability probability for each pair to a value between 0 and 1

Transitions and transversions • Transitions (A  G or C  T) are more likely than transversions (A  T or G  C) • Assume that transitions are three times as likely:

PAM-N Matrices • N is a measure of evolutionary distance • PAM-1 is modeled on an estimate of how long in evolutionary time it would take one amino acid out of 100 to change. That length of time is called 1 PAM unit, roughly 10 million years (abbreviated my). • Values in a PAM-1 matrix show the probability that an amino acid will change over 10 my. • To get the PAM-N matrix for any N, multiply PAM-(N-1) by PAM-1.

Distant relatives • If a family of proteins is say, 80% homologous use a PAM 2.

Computing Relative Mutability – A Measure of the Likelihood that an Amino Acid Will Mutate For each amino acid • changes = number of times the amino acid changed into something else • exposure to mutation = (percentage occurrence of the amino acid in the group of sequences being analyzed) * (frequency of amino acids changes in the group) • relative mutability = (changes/exposure to mutation) / 100

Computing Relative Mutability of A: changes = # times A changes into something else = 4 % occurrence of A in group = 10 / 63 = 0.159 frequency of all amino acid changes in group = 6 * 2 = 12 (Note: Count changes backwards and forwards.) exposure to mutation = (% occurrence of A in group) * (frequency of all amino acid changes in group) = 12 * 0.159 relative mutability = (changes / exposure to mutation) / 100 = (4 / (12 * 0.159)) = 2.09 / 100 = 0.0209 Example from Fundamental Concepts of Bioinformatics by Krane and Raymer.

How can we understand relative mutability intuitively? relative mutability = changes / exposure to mutation = the number of times A changed in proportion to the the probability that it COULD have changed exposure to mutation – that were 6 times when something changed in the tree. Each time, that change could have been A changing to something else, or something else changing to A – 12 chances for a change involving A. But A appears in a sequence only .159 of the time.

Computing Mutability Probability Between Amino Acid Pairs For each pair of amino acids X and Y: r = relative mutability of X c = num times X becomes Y or vice versa p = num changes involving X mutability probability of X to Y = (r * c) / p

Computing Mutability Probability that A will change to G: r = relative mutability of A = .0209 c = num times A becomes G or vice versa = 3 p = num changes involving A = 4 mutability probability of A to G = (r * c) / p = (0.0209 * 3) / 4 = 0.0156

Normalizing Mutability Probability, X to Y • For each Y among all amino acids, compute mutability probability of X to Y as described above • Get a total of these 20 probabilities. Divide them by a normalizing factor such that the probability that X will NOT change is 99% and the sum of probabilities that it will change to any other amino acid is 1% • These are the numbers that go in the PAM-1 matrix!

Converting Mutability Probabilities to Log Odds Score for X to Y • Compute the relative frequency of change for X to Y as follows: • Get the X to Y mutability probability • Divide by the % frequency of X in the sequence data • Convert to log base 10, multiply by 10 • In our example, we get log10(0.0156/0.1587) = log10(.098) • To compute log10(.098) solve for x: • 10x = 0.098 x = -1.01 10-1.01 = 1/101.01 = 0.098 • Compute log odds score for Y to X • Take the average of these two values

Usefulness of Log Odds Scores • A score of 0 indicates that the change from one amino acid to another is what is expected by chance • A negative score means that the change is probably due to chance • A positive score means that the change is more than expected by chance • Because the scores are in log form, they can be added (i.e., the chance that X will change to Y and then Y to Z)

Disadvantages of PAM Matrices • An alignment tree must be constructed first, implying some circularity in the analysis • The original PAM-1 matrix was based on a limited number of families, not necessarily representative of all protein families • The Markov model does not take into account that multi-step mutations should be treated differently from single-step ones

Most Commonly-Used Amino Acid Subtitution Matrices • PAM (Percent Accepted Mutation, also called Dayhoff Amino Acid Substitution Matrix) • BLOSUM (BLOcks amino acid SUbstitution Matrix)

BLOSUM Scoring Matrices • Based on a larger set of protein families than PAM (about 500 families). The proteins in the families are known to be biochemically related. • Focuses on blocks of conserved amino acid patterns in these families • Designed to find conserved domains in protein families • BLOSUM matrices with lower numbers are more useful for scoring matches in pairs that are expected to be less closely related through evolution – e.g., BLOSUM50 is used for more distantly-related proteins than BLOSUM62. (This is the opposite of the PAM matrices.)

BLOSUM Matrices • Target frequencies are identified directly and not by extrapolation • Sequences more than x% identical are collapsed into a single sequence • BLOSUM 50: >=50% Identity • BLOSUM 62: >=62% Identity

Building a BLOSUM Matrix • BLOSUM 62: • Collapse Sequences that have more than 62% identity into one • Calculate probability of a given pair of AAs being in same column (qij) • Calculate the frequency of a given AA (fi) • Calculate log odds ratio sij=log2(qij/fi). This is the value that goes into the BLOSUM matrix

BLOSUM50

What matrix to choose? • BLOSUM Matrices perform better in local similarity searches • BLOSUM 62 is the default matrix used for database searching

Gap Penalty (Gap Scoring)

Gap Penalties • Gaps in the alignment are necessary to increase score. • They must be penalized; however if penalty is to high no gaps will appear • On the other hand if they are too low, gaps everywhere!!! • The default settings of programs are usually ok for their default scoring matrices

Once a gap, can we widen it? >gi|729942|sp|P40601|LIP1_PHOLU Lipase 1 precursor (Triacylglycerol lipase) Length = 645 Score = 33.5 bits (75), Expect = 5.9 Identities = 32/180 (17%), Positives = 70/180 (38%), Gaps = 9/180 (5%) Query: 2038 IYSLYGLYNVPYENLFVEAIASYSDNKIRSKSRRVIATTLETVGYQTANGKYKSESYTGQ 2097 +++ YGL+ Y+ ++ Y D K +R ++ + N + G+ Sbjct: 441 VFTAYGLWRY-YDKGWISGDLHYLDMKYEDITRGIVLNDW----LRKENASTSGHQWGGR 495 Query: 2098 LMAGYTYMMPENINLTPLAGLRYSTIKDKGYKETGTTYQNLTVKGKNYNTFDGLLGAKVS 2157 + AG+ + + +P+ + KGY+E+G + + Y++ G LG ++ Sbjct: 496 ITAGWDIPLTSAVTTSPIIQYAWDKSYVKGYRESGNNSTAMHFGEQRYDSQVGTLGWRLD 555 Query: 2158 SNINVNEIVLTPELYAMVDYAFKNKVSAIDARLQGMTAPLPTNSFKQSKTSFDVGVGVTA 2217 +N P ++ F +K I + + + S KQ + +G+ A Sbjct: 556 TNFG----YFNPYAEVRFNHQFGDKRYQIRSAINSTQTSFVSESQKQDTHWREYTIGMNA 611 Real gaps are often more than one letter long.

Affine gap penalty LETVGY W----L • Separate penalties for gap opening and gap extension. • This requires modifying the DP algorithm to store three values in each box. -5 -1 -1 -1

Scoring Gap Penalties • Linear Gap Penalty Score • Affine Penalty Score • Opening a gap is costly; extending it not so much (open=12; extension=1)

Multiple Sequence Alignment

MSA Introduction • Goal of protein sequence alignment: • To discover “biological” (structural / functional) similarities • If sequence similarity is weak, pairwise alignment can fail to identify important features (eg interaction residues) • Simultaneous comparison of many sequences often find similarities that are invisible in PA.

Why do we care about sequence alignment? • Identify regions of a gene (or protein) susceptible to mutation and regions where residue replacement does not change function. • Information about the evolution of organisms. • Orthologs are genes that are evolutionarily related, have a similar function, but now appear in different species. • Homologous genes (genes with share evolutionary origin) have similar sequences. • Paralogs are evolutionarily related (share an origin) but no longer have the same function. • You can uncover either orthologs or paralogs through sequence alignment.

Multiple Sequence Alignment • Often applied to proteins (not very good with DNA) • Proteins that are similar in sequence are often similar in structure and function • Sequence changes more rapidly in evolution than does structure and function.

Work with proteins!If at all possible — • Twenty match symbols versus four, plus similarity! Way better signal to noise. • Also guarantees no indels are placed within codons. So translate, then align. • Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.

Overview of Methods • Dynamic programming – too computationally expensive to do a complete search; uses heuristics • Progressive – starts with pair-wise alignment of most similar sequences; adds to that (LOCAL OPTIMIZATION) • Iterative – make an initial alignment of groups of sequences, adds to these (e.g. genetic algorithms) (GLOBAL OPTIMIZATION) • Locally conserved patterns • Statistical and probabilistic methods

Dynamic Programming • Computational complexity – even worse than for pair-wise alignment because we’re finding all the paths through an n-dimensional hyperspace (Remember matrix, now add many dimensions) • Can align less than 20 relatively short (200-300) protein sequences in a reasonable amount of time; not much beyond that

A Heuristic for Reducing the Search Space in Dynamic Programming • Consider the pair-wise alignments of each pair of sequences. • Create alignments from these scores. • Consider a multiple sequence alignment built from the individual pairwise alignments. • These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.

The details • Create an “alignment of alignments” (AOA) based on pair-wise alignments (Pairs of sequences that have the best scores are paired first in the tree.) • Do a “first-cut” msa by incrementally doing pair-wise alignments in the order of “alikeness” of sequences as indicated by the AOA. Most alike sequences aligned first. • Use the pair-wise alignments and the “first-cut” msa to circumscribe a space within which to do a full msa that searches through this solution space. • The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight є indicating how far the pair-wise score differs from the first-cut msa alignment score.

Heuristic Dynamic Programming Method for MSA • Does not guarantee an optimal alignment of all the sequences in the group. • Does get an optimal alignment within the space chosen.

Progressive Methods • Similar to dynamic programming method in that it uses the first step (i.e., it creates an AOA, aligns the most-alike pair, and incrementally adds sequences to the alignment.) • Differs from dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA.)

Progressive Method: the details • Generally proceeds as follows: • Choose a starting pair of sequences and align them • Align each next sequence to those already aligned, one at a time • Heuristic method – doesn’t guarantee an optimal alignment • Details vary in implementation: • How to choose the first sequence to align? • Align all subsequence sequences cumulatively or in subfamilies? • How to score?

ClustalW • Based on phylogenetic analysis • A AOA is created using a pairwise distance matrix and nearest-neighbor algorithm • The most closely-related pairs of sequences are aligned using dynamic programming • Each of the alignments is analyzed and a profile of it is created • Alignment profiles are aligned progressively for a total alignment • W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the AOA

ClustalW Procedure AOA

“Once a gap, always a gap”

Basic Steps in Progressive Alignment “Once a gap, always a gap”

creativecommons/licenses/by-sa/2.0/