Significance of Scores • Now that we know how to find an optimal alignment, how can we assess the significance of its score? • How do we decide whether it is a biologically meaningful alignment giving evidence for homology, or just the best alignment between two entirely unrelated sequences? • Two approaches: • The Bayesian approach • The classical approach
Similarity Score Significance • We saw last time that the alignment score is a log likelihood ratio of M vs R • Score = log( P[X,Y | M] / P[X,Y | R] ) • M == match (homology) model • R == random-sequence model • Score > 0 => evidence for M • Score < 0 => evidence for R • Is a score of 1 convincing evidence of homology? • What about 5, 10, 15, or 20? • We need some notion of “scale” for the score axis, some measure of confidence.
Similarity Score Significance • Want P [ M | X,Y ]! • Are the two sequences, X and Y, homologous? • Bayes’ Theorem to the rescue! • Plus some other probability identities... • P[X|Y] = P[Y|X] P[X] / P[Y] • P[Y] = P[Y|A] P[A] + P[Y|B] P[B] for partition {A,B}.
The Bayesian Approach–Model Comparison • We want to derive the probability that the sequences are related as opposed to being unrelated, P(M|x,y) • Let P(M) be the prior probability that the sequences are related, and hence that the match model is correct • P(R)=1-P(M) represents the prior probability that the random model is correct
Let S′ = S + log( P(M)/P(R) ), where S = log( P(x,y|M) / P(x,y|R) ) is the log-odds score of the alignment. Then P(M|x,y) = σ(S′) (2.17), where σ(x) = 1/(1 + e^(−x)) is known as the logistic function.
σ(x) tends to 1 as x tends to infinity, tends to 0 as x tends to minus infinity, and takes the value ½ at x = 0 • The logistic function is widely used in neural network analysis to convert scores built from sums into probabilities–not entirely a coincidence
Similarity Score Significance • Bayesian: Our new understanding is based on our observation, plus whatever else we know. • Suppose we know (or believe) that a database of (N+1) proteins contains exactly 1 protein homologous to our query P. • P[M] = 1/(N+1), P[R] = N/(N+1), prior odds ratio = 1/N. • P[ M | X,Y ] = σ( S − log N ) • Now we need a higher score than before!
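A minimal sketch of this calculation in Python (the score S is assumed to be in natural-log units, and the database size N is a hypothetical value):

import math

def logistic(x):
    """sigma(x) = 1 / (1 + e^(-x)): maps a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def posterior_homology(S, N):
    """P[M | X,Y] = sigma(S - log N), for prior odds P(M)/P(R) = 1/N."""
    return logistic(S - math.log(N))

# A score of 10 is convincing against a single candidate, but not
# against a database of a million sequences:
print(posterior_homology(10.0, 1))          # ~0.99995
print(posterior_homology(10.0, 1_000_000))  # ~0.022

Equivalently, the match is believed (posterior above ½) exactly when S > log N, which is the threshold test discussed below.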
From (2.17), we just add the prior log-odds ratio, log(P(M)/P(R)) to the standard score of the alignment • This corresponds to multiplying the likelihood ratio by the prior odds ratio • The prior odds ratio becomes important when we are looking at a large number of different alignments for a possible significant match • A typical situation is when searching a database
Hypothesis Testing • Should we believe the match? • P(M|x,y) > P(R|x,y) ⟺ score > −log( P(M)/P(R) ) • That is, compare the score against a threshold (minus the log prior odds)
If we have a fixed prior odds ratio, then even if all the database sequences are unrelated, as the number of sequences we try to match increases, the probability of one of the matches looking significant by chance will also increase • Given a fixed prior odds ratio, the expected number of (falsely) significant observations will increase linearly • If we want it to stay fixed, then we must set the prior odds ratio in inverse proportion to the number of sequences in the database N
Significant Matches • If we want it to stay fixed, then we must set the prior odds ratio in inverse proportion to the number of sequences in the database N: P(M)/P(R) = 1/N, so the threshold −log( P(M)/P(R) ) = log(N)
Similarity Score Significance • Determining an appropriate prior odds ratio for the Bayesian analysis requires two pieces: • knowledge of homologies in the database • a model of non-homologies/random alignments • Classical/frequentist approach: • Show that it is very unlikely that the alignment is random • Reject the null hypothesis... ...that the alignment is plausibly just a chance match between unrelated sequences
Similarity score significance • Let's start out simple: • ungapped global alignments, • scoring model: match: +1, mismatch: −1 • Score S of a length-n alignment, under R? • Each position: +1 with prob. ¼, −1 with prob. ¾ • Each position independent. • Alignment score S ~ −n + 2·Binom(n, ¼)
Similarity score significance • E_R(S) = −n/2, Var_R(S) = ¾·n • For large enough n, S behaves like a normal random variable • So S ~ Normal( −n/2, √(¾n) ) (mean and standard deviation) • P_R[S > score] can be computed from normal distribution tables... • Example: • alignment of length 300 with score 120 • P[ N(−150, 15) > 120 ] ≈ 1×10⁻⁷²
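A sketch of this tail computation in Python (scipy assumed available); the exact binomial tail is shown alongside, since the normal approximation is unreliable this far out:

from math import sqrt
from scipy.stats import norm, binom

n, score = 300, 120

# Normal approximation: S ~ Normal(mean = -n/2, sd = sqrt(3n/4))
mean, sd = -n / 2, sqrt(0.75 * n)
p_normal = norm.sf(score, loc=mean, scale=sd)    # P[S > score]

# Exact: S = 2X - n with X ~ Binom(n, 1/4), so S > score iff X > (score + n)/2
p_exact = binom.sf((score + n) / 2, n, 0.25)

# Both are astronomically small, but they differ by many orders of
# magnitude: the normal approximation is poor this deep in the tail.
print(p_normal, p_exact)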
Similarity score significance • However, we are rarely considering just one alignment. • Suppose we have a database of N proteins to compare against query P • What is the probability that the best of N random alignments scores at least S? • Given the cdf F(x) = P_R[ score ≤ x ], and independent alignments, P[ all N alignments score ≤ S ] = F(S)^N
Similarity score significance • We want the prob. that at least one alignment scores > S: P_R[ max of N scores > S ] = 1 − F(S)^N • Alternative approach: P_R[ at least 1 score > S ] = P_R[ 1st score > S or 2nd score > S or ... ] ≤ Σ_{i=1..N} P_R[ score > S ] = N(1 − F(S)) • Doesn't assume independence...
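A quick sketch comparing the exact maximum formula with the union bound, under the same normal model as above (database size N is hypothetical):

from math import sqrt
from scipy.stats import norm

n, N = 300, 10_000                  # alignment length, number of sequences
mean, sd = -n / 2, sqrt(0.75 * n)

def p_best_exceeds(S):
    """Exact under independence: P[max of N scores > S] = 1 - F(S)^N."""
    return 1.0 - norm.cdf(S, loc=mean, scale=sd) ** N

def union_bound(S):
    """N * (1 - F(S)): an upper bound needing no independence assumption."""
    return N * norm.sf(S, loc=mean, scale=sd)

for S in (-100, -90, -80):
    print(S, p_best_exceeds(S), union_bound(S))
# The two agree closely once the tail probability is small; the bound
# is always >= the exact value.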
The Classical Approach–The Extreme Value Distribution • We can look at the distribution of the maximum of N match scores to independent random sequences • If the probability of this maximum being greater than the observed best score is small, then the observation is considered significant
For a fixed ungapped alignment, the score of a match to a random sequence is the sum of many similar random variables, and so will be very well approximated by a normal distribution • The asymptotic distribution of the maximum M_N of a series of N independent normal random variables is known, and has the form P( M_N ≤ x ) ≈ exp( −K N e^(−λx) ) (2.18) for some constants K, λ.
This form of limiting distribution is called the extreme value distribution or EVD • We can use equation (2.18) to calculate the probability that the best match from a search of a large number N of unrelated sequences has score greater than our observed maximal score, S • If this is less than some small value, such as 0.05 or 0.01, then we can conclude that it is unlikely that the sequence giving rise to the observed maximal score is unrelated, i.e. it is likely to be related
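A minimal sketch of this test in Python, with placeholder values for K and λ (in practice both depend on the score matrix and the background residue frequencies):

import math

def evd_pvalue(S, N, K, lam):
    """P[best of N random match scores > S] = 1 - exp(-K * N * exp(-lam * S))."""
    return 1.0 - math.exp(-K * N * math.exp(-lam * S))

# Hypothetical parameters, for illustration only:
print(evd_pvalue(S=40, N=10_000, K=0.1, lam=0.3))   # ~0.006: significant at 0.01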
Local Alignment • The normal approximation is not appropriate for local alignments • We need the maximum over roughly n × m candidate local alignments • Karlin-Altschul theory determines the EVD parameters in terms of n, m, and the score matrix.
Karlin-Altschul theory Assumes: • At least one alignment score is positive • The expected score is negative • Characters of the sequences are i.i.d. • No gaps • For simplicity, we assume the score s(a,b) is a log-likelihood ratio score
Karlin-Altschul theory • The number of unrelated matches with score greater than S is approximately Poisson distributed, with mean E = K m n e^(−λS), where λ is the positive root of Σ_{a,b} q_a q_b e^(λ s(a,b)) = 1, and K is a constant given by a geometrically convergent series, depending only on the q_a and s(a,b).
Karlin-Altschul theory • K compensates for the lack of independence of nearby local alignments • The number of local alignments with score ≥ S is Poisson distributed: P[ k local alignments ≥ S ] = e^(−E) E^k / k! • P[ at least one local alignment ≥ S ] = p-value = 1 − e^(−E) • When E < 0.01, E-value and p-value are essentially the same.
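A sketch for a toy DNA scoring model (+1/−1 with uniform background frequencies): λ is found by bisection from its defining equation, while K is left as a hypothetical placeholder since its defining series is not given here:

import math
from itertools import product

q = {a: 0.25 for a in "ACGT"}            # i.i.d. background frequencies

def s(a, b):
    """Toy scoring model: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def solve_lambda(lo=1e-6, hi=10.0, iters=100):
    """Positive root of sum_ab q_a q_b exp(lambda * s(a,b)) = 1, by bisection."""
    f = lambda lam: sum(q[a] * q[b] * math.exp(lam * s(a, b))
                        for a, b in product(q, repeat=2)) - 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return lo

lam = solve_lambda()
print(lam)                               # ~1.0986 = ln 3 for this model

# E-value and p-value for an observed score S (K is a placeholder):
m, n, K, S = 300, 10_000, 0.1, 25
E = K * m * n * math.exp(-lam * S)
print(E, 1 - math.exp(-E))               # for small E the two nearly coincide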
Similarity score significance • Karlin-Altschul doesn’t extend to gapped scoring models... • ...but simulation suggests the same approach works. • As with Bayesian approach, correct for “number of independent trials” • some fraction of nm.
Summary • Significance of a local alignment similarity score depends on: • the score matrix, the length of the query, and the database • Bayesian approach: • determine P[ M | X,Y ] • need the prior odds for M vs R • Frequentist approach: • determine P_R[ max score > S ] • need the cdf F(x) for the score function, or • the EVD for P[ max score > S ]
Correcting for Length • The chance of getting a high local alignment score against a longer sequence is higher • More possible start points • Decrease score by log of target sequence length • Simplistic, as it assumes independence between local matches to a given target
Estimating Substitution Costs • Basic idea for BLOSUM: count frequency of ungapped alignment pairs in database • Problem: bias from close homologs within protein families • Solution: cluster sequences with more than L% identical residues
BLOSUM Details • Counting residue pairings: observed pair frequencies q_ab = c_ab / Σ_{c,d} c_cd; background expectation e_ab = p_a p_b for a = b, 2 p_a p_b otherwise • Substitution score (in half-bits): s(a,b) = round( 2 log₂( q_ab / e_ab ) )
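A toy sketch of this construction in Python on a hypothetical two-letter alphabet (real BLOSUM matrices are built from clustered blocks of aligned protein segments over the 20 amino acids):

import math

# Hypothetical pair counts c_ab from an imagined alignment database:
c = {("A", "A"): 60, ("A", "B"): 10, ("B", "B"): 30}

total = sum(c.values())
q = {pair: count / total for pair, count in c.items()}    # observed pair freqs

# Background single-residue frequencies: each pair contributes two residues.
p = {"A": (2 * c[("A", "A")] + c[("A", "B")]) / (2 * total),
     "B": (2 * c[("B", "B")] + c[("A", "B")]) / (2 * total)}

def blosum_score(a, b):
    """Half-bit log-odds score: round(2 * log2(q_ab / e_ab))."""
    e = p[a] * p[b] if a == b else 2 * p[a] * p[b]        # expected pair freq
    return round(2 * math.log2(q[tuple(sorted((a, b)))] / e))

for pair in [("A", "A"), ("A", "B"), ("B", "B")]:
    print(pair, blosum_score(*pair))                      # 1, -4, 3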
Chapter 3 Markov Chains and Hidden Markov Models
CpG Islands • In humans, the CG dinucleotide is special. • C is often methylated, and then mutated to T, so CG dinucleotides are rarer than their individual frequencies would suggest. • However, methylation is suppressed in various places, such as prior to the transcription start site of a gene. • CpG islands: regions of over-represented CG dinucleotide pairs.
Dinucleotide Probability Model • Need to represent the probability of seeing dinucleotides within a sequence • Each symbol depends on the previous symbol. • Markov chain for DNA sequence: • A node for each of A, C, G, T. • An edge from every node to every node • Edges represent the probability of the next symbol, given the previous symbol.
Markov Chain for DNA • a_st = P[ x_i = t | x_{i−1} = s ] • For any sequence of symbols x of length L, P[x] = P[x_L, x_{L−1}, ..., x_2, x_1] = P[x_L | x_{L−1}, ..., x_1] · P[x_{L−1} | x_{L−2}, ..., x_1] · ... · P[x_2 | x_1] · P[x_1] = P[x_L | x_{L−1}] · P[x_{L−1} | x_{L−2}] · ... · P[x_2 | x_1] · P[x_1] (by the Markov property) = P[x_1] ∏_i P[x_i | x_{i−1}].
Markov Chain for DNA • a_st = P[ x_i = t | x_{i−1} = s ], P[x] = P[x_1] ∏_i P[x_i | x_{i−1}] • [Diagram: the four states A, C, G, T with a directed transition edge a_st between every ordered pair of states, sixteen edges in all.]
Markov Chain for DNA • Add a “Begin” node B with an edge to each DNA node, • Add an “End” node E with an edge from each DNA node. • Transitions a_BA, ..., a_BT capture the initial state probabilities P[x_1]; P[B] = 1. • Transitions a_AE, ..., a_TE represent the probability that the sequence ends.
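A minimal sketch of the model in Python (the transition table is hypothetical, and the end state is omitted for simplicity; the begin state B supplies P[x₁]):

import math

# Hypothetical transition probabilities a[s][t] = P[x_i = t | x_{i-1} = s];
# each row sums to 1.
a = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.30, "G": 0.10, "T": 0.35},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.20, "C": 0.25, "G": 0.30, "T": 0.25},
}

def log_prob(x):
    """log P[x] = log P[x_1] + sum_i log P[x_i | x_{i-1}] under the chain."""
    return sum(math.log(a[s][t]) for s, t in zip("B" + x[:-1], x))

print(log_prob("CGCGA"))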
Markov Chain Discriminators • Suppose we are given a set of CpG islands from the human genome. • Count all pairs of adjacent nucleotides (c⁺_st) and compute a⁺_st = c⁺_st / Σ_t′ c⁺_st′ • Count all pairs of nucleotides in non-CpG island sequence and compute a⁻_st in the same way. • These are the maximum likelihood estimates of the transition probabilities.
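A sketch of this maximum likelihood estimation in Python (the training sequences are hypothetical stand-ins for annotated CpG islands):

from collections import defaultdict

def estimate_transitions(sequences):
    """ML estimate a_st = c_st / sum_t' c_st' from adjacent-pair counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):        # count adjacent nucleotide pairs
            counts[s][t] += 1
    return {s: {t: n / sum(row.values()) for t, n in row.items()}
            for s, row in counts.items()}

a_plus = estimate_transitions(["CGCGCGTA", "GCGCGCAT"])
print(a_plus["C"])                            # {'G': 5/6, 'A': 1/6}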
Table properties • Rows sum to 1. • Asymmetric • Omitted begin and end nodes for brevity.
Log-odds ratio of CpG Island S(x) = log( P[x|+] / P[x|−] ) = Σ_i log( a⁺_{x_{i−1} x_i} / a⁻_{x_{i−1} x_i} ) = Σ_i β_{x_{i−1} x_i}
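Given the two estimated tables, the discriminator is a one-line sum; a sketch (both tables are assumed complete, i.e. every dinucleotide was observed in training):

import math

def log_odds_score(x, a_plus, a_minus):
    """S(x) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ).
    Positive scores favor the CpG-island model, negative the background."""
    return sum(math.log(a_plus[s][t] / a_minus[s][t]) for s, t in zip(x, x[1:]))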
Summary • Markov chains: • States (representing “memory”) • Transition probabilities • Begin and end states • Generate sequence according to model • Compute probability of sequence according to model • Transition probabilities extracted from “training” data.
Finding CpG Islands • The log likelihood score function S(x) = Σ_i β_{x_{i−1} x_i} is a good discriminator. • But how do we find CpG islands within a long stretch of non-CpG island sequence? • Could just test every 100 bp window... • ...but we can do better.
Finding CpG islands Simple-minded approach: • Pick a window of size N (N = 100, for example) • Compute the log-odds ratio for the sequence in the window, and classify based on that Problems: • How do we select N? • What do we do when the window straddles the boundary of a CpG island?
Finding CpG Islands • Simulator for sequence that contains CpG islands: • Generate sequence from the non-CpG island generator • With small probability, switch to the CpG island generator • Generate sequence from the CpG island generator • With small probability, switch back to the non-CpG island generator • Repeat!
Finding CpG Islands • States A+, C+, G+, T+ represent CpG islands and emit A, C, G, T with CpG island dinucleotide frequencies. • States A−, C−, G−, T− represent non-CpG islands and emit A, C, G, T with non-CpG island dinucleotide frequencies. • The probability of a transition between “+” states and “−” states is small. • The “+” to “−” probability is greater than the “−” to “+” probability, so the model generates more non-CpG island sequence.
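A minimal sketch of this generator in Python (the switching probabilities and within-regime chains are hypothetical placeholders; a real model would use the a⁺ and a⁻ tables estimated from data):

import random

random.seed(0)

# Placeholder within-regime chains (uniform); substitute estimated a+ / a-.
uniform = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}

def simulate(length, chain_plus, chain_minus, p_enter=0.01, p_leave=0.05):
    """Emit a sequence by switching between the CpG-island ('+') and
    background ('-') chains; p_leave > p_enter keeps '+' regions short."""
    seq, labels = [], []
    state, prev = "-", random.choice("ACGT")
    for _ in range(length):
        if state == "-" and random.random() < p_enter:
            state = "+"
        elif state == "+" and random.random() < p_leave:
            state = "-"
        chain = chain_plus if state == "+" else chain_minus
        prev = random.choices("ACGT",
                              weights=[chain[prev][t] for t in "ACGT"])[0]
        seq.append(prev)
        labels.append(state)
    return "".join(seq), "".join(labels)

x, lab = simulate(60, uniform, uniform)
print(x)
print(lab)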
C-G Islands: A Different HMM • Define C-G islands (aka CpG islands): DNA stretches which are very rich in CG. • [Diagram: two four-state chains over A, C, G, T, one for regular DNA and one for C-G islands, with low-probability switching transitions (parameters p and q) between the two sub-models.]