1 / 27

Computational Genomics Lecture #3a (revised 24/3/09)

This lecture discusses scoring functions in computational genomics, focusing on the concepts of global and local alignment, and the probabilistic modeling and interpretation of scoring functions.

Download Presentation

Computational Genomics Lecture #3a (revised 24/3/09)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Intro to ML and Scoring functions PAM and BLOSUM AA scoring matrices Computational GenomicsLecture #3a (revised 24/3/09) Background Readings: Section 2.7, 2.8 in Biological Sequence Analysis, Durbin et al., 2001. Sections 15.7-15.11 in Algorithms on Strings, Trees, and Sequences, Gusfield, 1997. This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor. .

  2. Scoring Functions • So far, we discussed dynamic programming algorithms for • global alignment • local alignment • other, related versions • All of these assumed a scoring function: that determines the value of substitutions, insertions, and deletions.

  3. Where does the scoring function come from ? An additive scoring function is defined by specifying a function ( ,  ) such that • (x,y) is the score of replacingx by y (including x=y) • (x,-) is the score of deletingx • (-,x) is the score of insertingx But how do we come up with the “correct” score ? Idea: By encoding experience of what are similar sequences for the task at hand. Similarity depends on time, evolution trends, and sequence types.

  4. Why probability setting is appropriate to define and interpret a scoring function ? • Similarity is probabilistic in nature because biological changes like mutation, recombination, and selection, are random events. • We could answer questions such as: • How probable it is for two sequences to be similar? • Is the similarity found significant or spurious? • How to change a similarity score when, say, mutation rate of a specific area on the chromosome becomes known ?

  5. A Probabilistic Model • For starters, will focus on alignment without indels (implying aligned sequences are equal length). • For now, we assume each position (nucleotide /amino-acid) is independent of other positions. This is not a very realistic assumption, BUT itmakes our life a lot easier, and is a good start. • We consider two options: M: the sequences are Matched(related) R: the sequences are Random (unrelated)

  6. Unrelated Sequences (R) • Our random model R of unrelated sequences is simple • Each position is sampled independently from a fixed distribution q()over the alphabet  • We assume there is a distribution q() that describes the probability of single letters. • Then:

  7. Related Sequences (M) • We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor • Let p(a,b) be a distribution over pairs of letters. • p(a,b) is the probability that some ancestral letterevolved into this particular pair of letters. Compare to:

  8. Odd-Ratio Test for Alignment If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R). If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).

  9. Log Odd-Ratio Test for Alignment Score(s[i],t[i]) Taking logarithm of Q yields If log Q > 0, then s and t are more likely to be related. If log Q < 0, then they are more likely to be unrelated. (usually we want some constant positive threshold to “declare” relatedness). How can we relate this quantity to a scoring function ?

  10. Probabilistic Interpretation of Scores • We define the scoring function via • Then, the score of an alignment is the log-ratio between the two models: • Score > 0 Model is more likely • Score < 0 Random is more likely

  11. Probabilistic Interpretation of Scores • We’ve defined the scoring function via • For example, suppose q(a)=0.1, q(b)=0.1, f(a,b)=0.001, f(a,a)=0.08, f(b,b)=0.07. • Then score (a,b)=log 0.1 < 0; score (a,a)=log 8 > 0; score (b,b)=log 7 > 0; • Logarithms are convenient in order to have an additive scoring function under assumption of independence (log of product = sum of logs).

  12. Modeling Assumptions • It is important to note that this interpretation depends on our modeling assumption!! • For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form. • If we assume, for proteins, some joint distribution of letters that are nearby in 3D space after protein folding, then likelihood ratio will again be different.

  13. Estimating Probabilities • Suppose we are given a long string s[1..n] of letters from  • We want to estimate the single letters distribution, q(·), that best corresponds to the sequence • How should we go about this? The theory of parameter estimation in statistics Deals with this problem in detail. But in our case, there is no need for heavy tools.

  14. Estimating q() • Suppose we are given a long string s[1..n] of letters from  • s is the concatenation of many sequences in our database • We want to estimate the distribution q() • That is, q is defined per single letters Likelihood function:

  15. Estimating q()(cont.) How do we define q? Likelihood function: ML parameters (Maximum Likelihood) Namely q(a) issimply the observed frequency of a in the sequence

  16. Estimating p(·,·) Intuition: • Find pair of aligned sequences s[1..n], t[1..n], • Estimate probability of pairs: • The sequences s and t are be the concatenation of many aligned pairs of sequences from the database Number of times a is aligned with b in (s,t)

  17. Problems in Estimating p(·,·) • How do we find pairs of aligned sequences? • How far is the ancestor of the aligned sequences? • earlier divergence  low sequence similarity • recent divergence  high sequence similarity • Does one letter mutate to the other or are they both mutations of a common ancestor having yet another residue/nucleotide acid ?

  18. Definition: An accepted point mutationis an observed substitution in a gapless alignment of closely related (homologous) protein sequences. For example, Hemoglobin alpha chain in humans and other mammalians. Note that we cannot claim such mutation actually occurred (if we observe an A to B substitution, the “evolutionary truth” could be A to C to B (one substitution, two hidden mutations). Estimating p(·,·) for amino acids

  19. PAM-1 matrices (M. Dayhoff, 1970) We take pairs of sequences that are no more than “1 unit of evolutionary distance apart”. This is interpreted as having substitutions in 1 percent (1/100) of the aligned sites (all other are matches, as we assumed no gaps). In this ensemble of alignments, we count the frequencies of single letters, q(), and the frequencies of substitutions, p(·,·). The(a,b) entry in the PAM1 matrix is then

  20. PAM-1 matrices Values in matrix usually multiplied by 10 & rounded. Example: A portion of PAM1 matrix • A R N D C Q E • A 7 -15 -11 -11 -14 -12 -9 • R -15 9 -14 -27 -16 -9 -26 • N -11 -14 9 -5 -28 -11 -10 • D -11 -27 -5 9 -32 -10 -4 • C -14 -16 -28 -32 10 -31 -31 • Q -12 -9 -11 -10 -31 9 -6 • E -9 -26 -10 -4 -31 -6 9 Most off diagonal entries are negative, but not all. For example score(B,N)=score(B,D)=7 (out of our small portion).

  21. PAM-n matrices Generalize PAM1: Take pairs of sequences that are no more than “n unit of evolutionary distance apart”. Interpret this as having substitutions in n percent (n/100) of the aligned sites (all other are matches, as we assumed no gaps). Nice idea, but does not quite work. Sequences with more than 3 or 4 percent substitutions cannot be aligned without gaps. In addition, we may want evolutionary units that are larger than 100 (e.g. PAM250 exists and useful).

  22. PAM-n matrices Let be the conditional probability that in 1 evolutionary time unit, an A mutates to a B. In our terminology, What is then the probability that in n evolutionary time units, an A mutates to a B ? We should start with an A (an event whose probability is q(A) ), and then go through all possible paths that lead from A to B in n steps.

  23. PAM-n matrices We should start with an A (an event whose probability is q(A) ), and then go through all possible paths that lead from A to B in n steps. The probability of such process is given exactly by the A,B entry of the n-th power of the matrix M. Therefore, the probability of starting with A and moving to B in n steps equals q(A)Mn(A,B).

  24. PAM-n matrices Therefore, the probability of starting with A and moving to B in n steps equals q(A)Mn(A,B). Recall the random (unrelated sequences) model, where the probability of having S[i]=A and T[i]=B equals q(A) q(B). So PAMn(A,B), which is the log odds score, equals log( q(A)Mn(A,B) / q(A)q(b) ) = log(Mn(A,B) / q(b) ) Notice that for n=1 this is the same as PAM1.

  25. PAM-250 matrices Values in matrix usually multiplied by 10 & rounded. Example: A portion of PAM250 matrix • A R N D C Q E • A 2 -2 0 0 -2 0 0 • R -2 6 0 -1 -4 1 -1 • N 0 0 2 2 -4 1 1 • D 0 -1 2 4 -5 2 3 • C -2 -4 -4 -5 12 -5 -5 • Q 0 1 1 2 -5 4 2 • E 0 -1 1 3 -5 2 4 Many more off diagonal entries are positive (compare to PAM1).

  26. The BLOSUM substitution matrices BLOSUM stands for BLOcks Substitution Matrices. The basic principles are similar to PAM, but the sequences used to compute substitution frequencies differ. BLOSUM uses a database containing multiple sequence alignments from related (by function) but possibly remote (evolutionary) proteins. Conserved blocks (local alignments) are detected, and substitution frequencies are computed based on them. BLOSUMn reflects more conserved sequences for higher values of n.

  27. BLOSUM vs. PAM BLOSUM 62 is the default matrix in BLAST 2.0.

More Related