CAP5510 – Bioinformatics Substitution Patterns

CAP5510 – BioinformaticsSubstitution Patterns Tamer Kahveci CISE Department University of Florida

Goals • Understand how mutations occur • Learn models for predicting the number of mutations • Understand why scoring matrices are used and how they are derived • Learn major scoring matrices

Why Substitute Patterns ? • Mutations happen because of mistakes in DNA replication and repair. • Our genetic code changes due to mutations • Insert, delete, replace • Three types of mutations • Advantageous • Disadvantageous • Neutral • We only observe substitutions that passed selection process

Mutation Rates Parent Organism R = K/(2T) T time Organism A Organism B K: number of substitutions

Functional Constraints • Functional sites are less likely to mutate • Noncoding = 3.33 (subs/109 yr) • Coding = 1.58 (subs/109 yr) • Indels about 10 times less likely than substitutions

Nucleotide Substitutions and Amino Acids • Synonymous substitutions do not change amino acids • Nonsynonymous do change • Degeneracy • Fourfold degenerate: gly = {GGG, GGA, GGU, GGC} • Twofold degenerate: asp = {GAU, GAC}, glu = {GAA, GAG} • Non-degenerate: phe = UUU, leu = CUU, ile = AUU, val = GUU • Example substitution rates in human and mouse • Fourfold degenerate: 2.35 • Twofold degenerate: 1.67 • Non-degenerate: 0.56

Predicting Substitutions How can we count the true number of substitutions ?

x A C x x G T Jukes-Cantor Model • Each nucleotide can change into another one with the same probability P(A->A’, 1) = x, for each A’ P(A->A, 1) = 1 – 3x Compute P(A->A’, 2) & P(A->A, 2) P(A->A, t+1) = 3 P(A->A’, t) P(A’->A, 1) + P(A->A, t) P(A->A, 1) P(A->A, t) ~ ¼ + (3/4)e-4ft K = num. subst. = -¾ ln(1 – f4/3), f = fraction of observed substitutions Oversimplification

Two Parameter Model • Transition: • purine->purine (A, G), pyrimidine->pyrimidine (C, T) • Transversion: • purine <-> pyrimidine • Transitions are more likely than transversions. • Use different probabilities for transitions and transversions. Purine Pyrimidine

y A C y x G T Two Parameter Model P(AA,2) = (1-x-2y) P(AA,1) + x P(AG,1) + y P(AC,1) + y P(AT,1) P(AA,t) = ¼ + ¼ e-4yt + ½ e-2(x+y)t K = ½ ln(1/(1-2P-Q)) + ¼ ln(1/(1-2Q)) P,Q: fraction of transitions and transversions observed. • P(AA,1) = 1-x-2y • Compute P(AA,2)

More Parameters ? • Assign a different probability for each pair of nucleotides • Not harder to compute than simpler models • Not necessarily better than simpler models

Amino Acid substitutions (1) • Harder to model than nucleotides • An amino acid can be substituted for another in more than one ways • The number of nucleotide substitutions needed to transform one amino acid to another may differ • Pro = CCC, leu = CUC, ile = AUC • The likelihood of nucleotide substitutions may differ • Asp = GAU, asn = AAU, his = CAU • Amino acid substitutions may have different effects on the protein function

Amino Acid substitutions (2) • Mutation rates may vary greatly among genes • Nonsynonymous substitution may affect functionality with smaller probability in some genes • Molecular clock (Zuckerlandl, Paulding) • Mutation rates may be different for different organisms, but it remains almost constant over the time.

Scoring Matrices

What is it & why ? • Let alphabet contain N letters • N = 4 and 20 for nucleotides and amino acids • N x N matrix • (i,j) shows the relationship between ith and jth letters. • Positive number if letter i is likely to mutate into letter j • Negative otherwise • Magnitude shows the degree of proximity • Symmetric

The BLOSUM45 Matrix A R N D C Q E G H I L K M F P S T W Y V A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -2 -2 0 R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 -1 -2 N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3 D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3 C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3 E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3 G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3 H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 2 -3 I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3 L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1 K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 -1 -2 M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1 F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0 P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3 S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5 -3 -1 0 W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15 3 -3 Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1 V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5

Scoring Matrices for DNA Transitions & transversions BLAST identity

Scoring Matrices for Amino Acids • Chemical similarities • Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P) • Polar, Hydrophilic (S, T, C, Y, N, Q) • Electrically charged (D, E, K, R, H) • Requires expert knowledge • Genetic code: Nucleotide substitutions • E: GAA, GAG • D: GAU, GAC • F: UUU, UUC • Actual substitutions • PAM • BLOSUM

Scoring Matrices: Actual Substitutions • Manually align proteins • Look for amino acid substitutions • Entry ~ log(freq(observed)/freq(expected)) • Log-odds matrices

PAM Matrices (Dayhoff 1972)

PAM • PAM = “Point Accepted Mutation” interested only in mutations that have been “accepted” by natural selection • An accepted mutation is a mutation that occurred and was positively selected by the environment; that is, it did not cause the demise of the particular organism where it occurred.

Interpretation of PAM matrices • PAM-1 : one substitution per 100 residues (a PAM unit of time) • “Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?” • PAM-K : K PAM time units

PAM Matrices (1) • Starts with a multiple sequence alignment of very similar (>85% identity) proteins. Assumed to be homologous • Compute the relative mutability, mi, of each amino acid • e.g. mA = how many times was alanine substituted with anything else on the average?

Relative Mutability • ACGCTAFKIGCGCTAFKIACGCTAFKLGCGCTGFKIGCGCTLFKIASGCTAFKLACACTAFKL • Across all pairs of sequences, there are 28A  X substitutions • There are 10 ALA residues, so mA = 2.8

Pam Matrices (2) • Construct a phylogenetic tree for the sequences in the alignment FG,A = 3 • Calculate substitution frequencies FX,X • Substitutions may have occurred either way, so A  G also counts as G  A.

´ 2 . 8 3 = M G , A 4 Mutation Probabilities • Mi,j represents the probability of J  I substitution. = 2.1

The PAM Matrix • The entries of the scoring matrix are the Mi,j values divided by the frequency of occurrence, fi, of residue i. • fG = 10 GLY / 63 residues = 0.1587 • RG,A = log(2.1/0.1587) = log(12.760) = 1.106 • Log-odds matrix • Diagonal entries are Mjj = 1–mj

Computation of PAM-K • Assume that changes at time T+1 are independent of the changes at time T. • Markov chain • P(A-->B) = X P(A->X) P(X->B) • PAM-K = (PAM-1)K • PAM-250 is most commonly used

PAM - Discussion • Smaller K, PAM-K is better for closely related sequences, large K is better for distantly related sequences • Biased towards closely related sequences since it starts from highly similar sequences (BLOSUM solves this) • If Mi,j is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified. • Mutation rate may change from one gene to another

BLOSUM Matrices Henikoff & Henikoff 1992

BLOSUM Matrix • Begin with a set of protein sequences and obtain blocks. • ~2000 blocks from 500 families of related proteins • More data than PAM • A block is the ungapped alignment of a highly conserved region of a family of proteins. • MOTIF program is used to find blocks • Substitutions in these blocks are used to compute BLOSUM matrix block 1 block 2 block 3 WWYIR CASILRKIYIYGPV GVSRLRTAYGGRKNRG WFYVR … CASILRHLYHRSPA … GVGSITKIYGGRKRNG WYYVR AAAVARHIYLRKTV GVGRLRKVHGSTKNRG WYFIR AASICRHLYIRSPA GIGSFEKIYGGRRRRG

Constructing the Matrix • Count the frequency of occurrence of each amino acid. This gives the background distribution pa • Count the number of times amino acid a is aligned with amino acid b: fab • A block of width w and depth s contributes ws(s-1)/2 = np pairs • Compute the occurrence probability of each pair • qab = fab/ np • Compute the probability of occurrence of amino acid a • pa = qaa + Σ qab /2 • Compute the expected probability of occurrence of each pair • eab = 2papb, if a ≠ b papb otherwise • Compute the log likelihood ratios, normalize, and round. • 2* log2 qab / eab i a≠b

Constructing the Matrix: Example • fAA = 36, fAS = 9 • Observed frequencies of pairs • qAA = fAA/(fAA+fAS) = 36/45 = 0.8 • qAS = 9/45 = 0.2 • Expected frequencies of letters • pA = qAA + qAS/2 = 0.9 • pS = qAS/2 = 0.1 • Expected frequencies of pairs • eAA = pA x pA = 0.81 • eAS = 2 x pA x pS = 0.18 • Matrix entries • MAA = 2x log2(qAA/eAA) = -0.04 ~ 0 • MAS = 2 x log2(qAS/eAS) = 0.3 ~ 0 A A A A S … A … A A A A 9A, 1S

Computation of BLOSUM-K • Different levels of the BLOSUM matrix can be created by differentially weighting the degree of similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to sum to one. In this way the contributions of multiple entries of closely related sequences is reduced. • Larger numbers used to measure recent divergence, default is BLOSUM62 a b

BLOSUM 62 Matrix Check scores for M I L V -small hydrophobic N D E Q -acid, hydrophilic H R K -basic F Y W -aromatic S T P A G -small hydrophilic C -sulphydryl

PAM vs. BLOSUM Equivalent PAM and BLOSSUM matrices: PAM100 = Blosum90 PAM120 = Blosum80 PAM160 = Blosum60 PAM200 = Blosum52 PAM250 = Blosum45 BLOSUM62 is the default matrix to use.

PAM vs. BLOSUM

CAP5510 – Bioinformatics Substitution Patterns