Dayhoff Model:

Dayhoff Model: Accepted Point Mutation (PAM)
Arthur W. Chou, Spring 2008, Clark University

Dr. Margaret Oakley Dayhoff (1925-1983)

• PhD in Chemistry, Columbia University, 1947
• Watson Computing Laboratory Fellow, 1947-48
• Atlas of Protein Sequence and Structure, 1965-1978
• Protein Sequence Database

PAM Score Matrix (1978) Log-odds matrix for PAM250

Dayhoff’s 34 protein superfamilies

Protein           PAMs per 100 million years
Ig kappa chain    37
Kappa casein      33
Lactalbumin       27
Hemoglobin alpha  12
Myoglobin         8.9
Insulin           4.4
Histone H4        0.10
Ubiquitin         0.00

Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins? Number of accepted point mutations, multiplied by 10 (Dayhoff 1978)

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases

fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

The relative mutability of amino acids

Asn  134    His  66
Ser  120    Arg  65
Asp  106    Lys  56
Glu  102    Pro  56
Ala  100    Gly  49
Thr   97    Tyr  41
Ile   96    Phe  41
Met   94    Leu  40
Gln   93    Cys  20
Val   74    Trp  18

Normalized frequencies of amino acids

Gly  8.9%    Arg  4.1%
Ala  8.7%    Asn  4.0%
Leu  8.5%    Phe  4.0%
Lys  8.1%    Gln  3.8%
Ser  7.0%    Ile  3.7%
Val  6.5%    His  3.4%
Thr  5.8%    Cys  3.3%
Pro  5.1%    Tyr  3.0%
Glu  5.0%    Met  1.5%
Asp  4.7%    Trp  1.0%

Color key in the original slide: blue = 6 codons (Leu, Ser, Arg); red = 1 codon (Met, Trp).


Dayhoff’s PAM1 mutation probability matrix

Estimating p(·,·) for proteins
Generate a large, diverse collection of accepted mutations. An accepted mutation is a mutation observed in an alignment of closely related protein sequences, for example the hemoglobin alpha chain in humans and other organisms (homologous proteins).
Let $p_a = n_a / n$, where $n_a$ is the number of occurrences of letter a and n is the total number of letters in the collection, so $n = \sum_a n_a$.
Mutation counts: let $f_{ab}$ be the number of mutations a ↔ b, $f_a = \sum_{b \ne a} f_{ab}$ be the total number of mutations that involve a, and $f = \sum_a f_a$ be the total number of amino acids involved in a mutation. Note that f is twice the number of mutations.

PAM-1 matrices
Define $M_{ab}$ to be the probability matrix for switching between a and b. We set $M_{aa} = 1 - m_a$, so that $m_a$ is the probability that a is involved in a change.
We define $M_{ab}$ such that only 1% of amino acids change according to this matrix, or 99% don’t. Hence the name, 1-Percent Accepted Mutation (PAM). In other words,
$$\sum_a p_a M_{aa} = 0.99$$

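PAM-1 matrices
We wish that $m_a$ will be proportional to the relative mutability of letter a compared to other letters:
$$m_a = \frac{1}{K} \cdot \frac{f_a}{f \, p_a}$$
where K is a proportionality constant, $f_a$ is the number of mutations that involve a, f is the sum of $f_a$ over all letters, and $p_a$ is the frequency of a.
We select K to satisfy the PAM-1 definition:
$$\sum_a p_a m_a = \frac{1}{K} \sum_a \frac{f_a}{f} = \frac{1}{K} = 0.01$$
So K = 100 for PAM-1 matrices. Note that K = 50 yields 2% change, etc.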
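The construction above can be sketched in a few lines of Python. This is a minimal sketch on made-up data, not Dayhoff's actual counts: the three-letter alphabet, the counts and the frequencies are hypothetical, and the off-diagonal rule $M_{ab} = m_a f_{ab} / f_a$ (the mutability of a, split among targets in proportion to the observed counts) is an assumption, since the slides only state the diagonal.

```python
def pam1_from_counts(counts, freqs, K=100):
    """Build a PAM-style matrix from symmetric mutation counts.

    counts[a][b] is the (hypothetical) number of accepted mutations a <-> b,
    freqs[a] is the background frequency p_a.
    """
    letters = sorted(freqs)
    # f_a: number of mutations that involve letter a
    f_a = {a: sum(counts[a][b] for b in letters if b != a) for a in letters}
    # f: each mutation is counted twice, once for each letter involved
    f = sum(f_a.values())
    M = {}
    for a in letters:
        m_a = f_a[a] / (K * f * freqs[a])  # probability that a changes
        M[a] = {}
        for b in letters:
            if b == a:
                M[a][b] = 1.0 - m_a
            elif f_a[a] > 0:
                # Assumption: off-diagonal entries share m_a in proportion to f_ab
                M[a][b] = m_a * counts[a][b] / f_a[a]
            else:
                M[a][b] = 0.0
    return M


# Toy data (not real protein statistics)
counts = {
    "A": {"A": 0, "B": 30, "C": 10},
    "B": {"A": 30, "B": 0, "C": 20},
    "C": {"A": 10, "B": 20, "C": 0},
}
freqs = {"A": 0.5, "B": 0.3, "C": 0.2}

M = pam1_from_counts(counts, freqs, K=100)
unchanged = sum(freqs[a] * M[a][a] for a in freqs)
print(round(unchanged, 6))  # fraction of letters left unchanged, 0.99 for K=100
```

Calling the same function with K=50 should leave 98% of the letters unchanged, which matches the remark that K = 50 yields 2% change.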

Evolutionary distance
The choice that 1% of amino acids change (and that K = 100) is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated. This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary time?

Model of Evolution
[Figure: nucleotide states (A T T C T and A C C G G) shown at times t, t+Δ, t+2Δ, t+3Δ, t+4Δ]
We make some assumptions:
• Each position changes independently of the rest
• The probability of mutations is the same in each position
• Evolution does not “remember”

Model of Evolution
• How do we model such a process?
• This process is called a Markov chain.
A chain is defined by its transition probability:
• $P(X_{t+\Delta} = b \mid X_t = a)$, the probability that the next state is b given that the current state is a
• We often describe these probabilities by a matrix: $M[\Delta]_{ab} = P(X_{t+\Delta} = b \mid X_t = a)$

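Multi-Step Changes
• Based on $M_{ab}$, we can compute the probabilities of changes over two time periods. Using conditional independence (no memory):
$$P(X_{t+2\Delta} = b \mid X_t = a) = \sum_c P(X_{t+\Delta} = c \mid X_t = a)\, P(X_{t+2\Delta} = b \mid X_{t+\Delta} = c) = \sum_c M[\Delta]_{ac}\, M[\Delta]_{cb}$$
• Thus $M[2\Delta] = M[\Delta]\, M[\Delta]$
• By induction: $M[n\Delta] = M[\Delta]^n$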
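The multi-step rule is ordinary matrix multiplication, so it can be illustrated without any library. The sketch below uses a made-up two-state transition matrix (not a real PAM matrix) and plain Python lists; `mat_mul` and `mat_pow` are illustrative helper names.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Return M multiplied by itself n times (identity for n = 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical one-step transition matrix; each row sums to 1
M = [[0.99, 0.01],
     [0.02, 0.98]]

M2 = mat_pow(M, 2)    # two time periods
M250 = mat_pow(M, 250)  # many time periods; rows still sum to 1

# Probability of going from state 0 to state 1 in two steps:
# stay then change, or change then stay
print(M2[0][1])
```

For large n, repeated squaring would need fewer multiplications, but the simple loop keeps the correspondence with the induction step visible.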

A Markov Model (chain)
[Figure: chain X1 → X2 → ... → Xn-1 → Xn]
• Every variable $x_i$ has a domain. For example, suppose the domain is the letters {a, c, t, g}.
• Every variable is associated with a local probability table: $P(X_i = x_i \mid X_{i-1} = x_{i-1})$ and $P(X_1 = x_1)$.
• The joint distribution is given by
$$P(x_1, \dots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1})$$
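As a small illustration of the joint distribution of a Markov chain (the prior times the product of the transition probabilities), the sketch below evaluates it for a short DNA string. The uniform prior and the transition table (0.97 to stay, 0.01 for each change) are made-up numbers chosen only so that each row sums to 1.

```python
def joint_probability(seq, prior, trans):
    """P(x1, ..., xn) = P(x1) * product over i of P(xi | xi-1)."""
    p = prior[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p


letters = "acgt"
# Hypothetical prior and stationary transition table
prior = {x: 0.25 for x in letters}
trans = {x: {y: (0.97 if x == y else 0.01) for y in letters} for x in letters}

# 0.25 (start in a) * 0.97 (a stays a) * 0.01 (a -> c) * 0.01 (c -> t)
p = joint_probability("aact", prior, trans)
print(p)
```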

Markov Model of Evolution Revisited
[Figure: chain X1 → X2 → ... → Xn-1 → Xn, with arrows labeled M]
The quantity we computed earlier from this model was the joint probability table.
In the evolution model we studied earlier we had $P(x_1) = (p_a, p_c, p_g, p_t)$, which sum to 1 and are called the prior probabilities, and $P(x_i \mid x_{i-1}) = M[\Delta]$, which is a stationary transition probability table, not depending on the index i.

Longer Term Changes
• Estimate $M[\Delta] = M$ (PAM-1 matrices)
• Use $M[n\Delta] = M^n$ (PAM-n matrices)
• Define the log-odds ratio $s_n(a, b) = \log \dfrac{(M^n)_{ab}}{p_b}$
• Use this quantity to define the score for your application of interest.

PAM250 mutation probability matrix Top: original amino acid Side: replacement amino acid

PAM250 log odds scoring matrix

Why do we go from a mutation probability matrix to a log odds matrix?
• We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.
• Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from a mutation probability matrix to a log odds matrix?
• The cells in a log odds matrix consist of an “odds ratio”: (the probability that an alignment is authentic) / (the probability that the alignment was random)
• The score S for an alignment of residues a, b is given by: S(a,b) = 10 log10 ( Mab / pb )
• As an example, for tryptophan, S(W, W) = 10 log10 ( 0.55 / 0.01 ) = 17.4

What do the numbers mean in a log odds matrix?
S(W, W) = 10 log10 ( 0.55 / 0.010 ) = 17.4
A score of +17 for tryptophan means that this alignment is about 50 times more likely than a chance alignment of two tryptophan residues.
S(W, W) = 17
Probability of replacement ( Mab / pb ) = x
Then 17 = 10 log10 x
1.7 = log10 x
10^1.7 = x ≈ 50

What do the numbers mean in a log odds matrix? A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance. A score of 0 is neutral. A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
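The arithmetic behind these scores can be checked with a short Python sketch. It uses the formula S(a,b) = 10 log10(Mab / pb) and the tryptophan values quoted in the slides (0.55 and 0.01); the function names `log_odds_score` and `odds_ratio` are illustrative, not from any library.

```python
import math


def log_odds_score(m_ab, p_b):
    """Score in tenths of a log10 unit: S = 10 * log10(M_ab / p_b)."""
    return 10 * math.log10(m_ab / p_b)


def odds_ratio(score):
    """Invert the score: how many times more likely than chance."""
    return 10 ** (score / 10)


s_ww = log_odds_score(0.55, 0.01)
print(round(s_ww, 1))            # 17.4, the W-W entry
print(round(odds_ratio(17), 1))  # about 50 times more likely than chance
print(round(odds_ratio(2), 2))   # about 1.6 times as frequent as chance
print(round(odds_ratio(0), 2))   # 1.0, neutral
print(round(odds_ratio(-10), 2)) # 0.1, one tenth as frequent as chance

# Summing scores corresponds to multiplying the underlying odds ratios
total = log_odds_score(0.55, 0.01) + log_odds_score(0.02, 0.04)
product = (0.55 / 0.01) * (0.02 / 0.04)
print(abs(total - 10 * math.log10(product)) < 1e-9)
```

The last check is the reason for taking logarithms: the score of an alignment is the sum of its per-residue scores, while the underlying likelihood ratios multiply. The second pair of values (0.02 and 0.04) is made up for the example.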

PAM10 log odds scoring matrix

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.

hsrbp, 136  CRLLNLDGTC
btlact, 3   CLLLALALTC
            * ** *  **

A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

hsrbp, 26   RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact, 21  QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                  *     ****  *        * *   *   *               **  *

hsrbp, 86   --CADMVGTFTDTEDPAKFKM
btlact, 80  GECAQKKIIAEKTKIPAVFKI
              **        *  ** **

Comments regarding PAM
• Historically researchers use PAM-250. (The only one published in the original paper.)
• Original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
• Used to be the most popular scoring rule, but there are some problems with PAM matrices.

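Degrees of freedom in PAM definition
With K = 100 the 1-PAM matrix $M[1]$ has diagonal $M[1]_{aa} = 1 - \dfrac{f_a}{100\, f\, p_a}$.
With K = 50 the basic matrix is different, namely $M[2]$ with diagonal $M[2]_{aa} = 1 - \dfrac{f_a}{50\, f\, p_a}$.
Thus we have two different ways to estimate the matrix $M[4]$:
• Use the 1-PAM matrix to the fourth power: $M[4] = M[1]^4$
• Or use the K = 50 matrix to the second power: $M[4] = M[2]^2$
The two estimates are in general not equal.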
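To see that the two routes to a PAM-4 matrix really differ, the sketch below builds both basic matrices from the same toy counts and compares their powers. The counts and frequencies are hypothetical, and the off-diagonal rule $M_{ab} = f_{ab} / (K f p_a)$ is an assumption consistent with the diagonal $1 - f_a / (K f p_a)$, not a formula taken from the slides.

```python
def pam_matrix(counts, freqs, K):
    """Basic matrix for proportionality constant K (assumed off-diagonal rule)."""
    n = len(freqs)
    f_a = [sum(counts[a][b] for b in range(n) if b != a) for a in range(n)]
    f = sum(f_a)  # twice the number of mutations
    M = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                M[a][b] = counts[a][b] / (K * f * freqs[a])
        M[a][a] = 1.0 - f_a[a] / (K * f * freqs[a])
    return M


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Toy symmetric counts and frequencies for a three-letter alphabet
counts = [[0, 30, 10],
          [30, 0, 20],
          [10, 20, 0]]
freqs = [0.5, 0.3, 0.2]

via_k100 = mat_pow(pam_matrix(counts, freqs, 100), 4)  # M[1]^4
via_k50 = mat_pow(pam_matrix(counts, freqs, 50), 2)    # M[2]^2

max_diff = max(abs(x - y)
               for row_a, row_b in zip(via_k100, via_k50)
               for x, y in zip(row_a, row_b))
print(max_diff)  # small but nonzero: the two estimates disagree
```

Both results are valid transition matrices (rows sum to 1); they agree to first order in 1/K and differ in the higher-order terms.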

Problems in building distance matrices
• How do we find pairs of aligned sequences?
• How far is the ancestor?
  • earlier divergence → low sequence similarity
  • later divergence → high sequence similarity
  E.g., M[250] is known not to reflect long-period changes well.
• Does one letter mutate to the other, or are they both mutations of a third letter?

BLOSUM (BLOcks SUbstitution Matrix)
• Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. Similar ideas, but better statistics and modeling. It uses 2000 conserved blocks from 500 families.
• Procedure:
  • Cluster together sequences in a family whenever more than L% identical residues are shared, for BLOSUM-L.
  • Count the number of substitutions across different clusters (in the same family).
  • Estimate frequencies using the counts.
• Practice: BLOSUM50 and BLOSUM62 are widely used. Considered the state of the art nowadays.
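The procedure above can be sketched on a toy block. This is a simplified illustration, not the published BLOSUM pipeline: the block of three four-residue sequences is made up, the single-linkage clustering and the 1/(size1 × size2) weighting of cross-cluster pairs are assumptions about how "cluster" and "count across clusters" are carried out, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def identity(s, t):
    """Fraction of identical residues between two equal-length sequences."""
    return sum(x == y for x, y in zip(s, t)) / len(s)


def cluster(block, threshold):
    """Single-linkage clustering: join sequences sharing more than `threshold` identity."""
    clusters = []
    for seq in block:
        linked = [c for c in clusters
                  if any(identity(seq, other) > threshold for other in c)]
        merged = [seq]
        for c in linked:
            merged.extend(c)
        clusters = [c for c in clusters if all(c is not l for l in linked)]
        clusters.append(merged)
    return clusters


def pair_counts(block, threshold):
    """Count aligned residue pairs between clusters, each cluster weighing as one sequence."""
    counts = defaultdict(float)
    for c1, c2 in combinations(cluster(block, threshold), 2):
        weight = 1.0 / (len(c1) * len(c2))
        for s in c1:
            for t in c2:
                for x, y in zip(s, t):
                    counts[tuple(sorted((x, y)))] += weight
    return dict(counts)


# Toy ungapped block (hypothetical sequences)
block = ["AVLK", "AVLR", "GILK"]
counts = pair_counts(block, 0.62)  # BLOSUM62-style threshold

total = sum(counts.values())
freqs = {pair: c / total for pair, c in counts.items()}  # estimated pair frequencies
print(counts)
```

Here AVLK and AVLR share 75% identity and collapse into one cluster, so together they contribute as a single sequence when compared with GILK. Pairs within a cluster are never counted, which is what keeps closely related sequences from dominating the statistics.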

BLOSUM Matrices All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. The BLOCKS database contains thousands of groups of multiple sequence alignments. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Blosum62 scoring matrix

BLOSUM Matrices
[Figure: percent amino acid identity scale (100, 62, 30) showing which sequences are collapsed for BLOSUM80, BLOSUM62 and BLOSUM30]

[Figure: two panels, “Rat versus mouse RBP” and “Rat versus bacterial lipocalin”]
