230 likes | 429 Views
(Alignment) Score Statistics. Motivation. Reminder: Basic motivation: we want to check if 2 sequences are “related” or not We align 2 sequences and get a score ( s ) which measures how similar they are Given s, do we accept the hypothesis the 2 are related or reject it ?.
E N D
(Alignment) Score Statistics cbio course, spring 2005, Hebrew University
Motivation • Reminder: • Basic motivation: we want to check if 2 sequences are “related” or not • We align 2 sequences and get a score (s) which measures how similar they are • Given s, do we accept the hypothesis the 2 are related or reject it ? How high should s be so that we “believe” they are related ?? cbio course, spring 2005, Hebrew University
Motivation (2) • We need a “rigorous way” to decide on a threshold s* so that for s > s*, we call the sequences “related” • Note: • s* should obviously be s*(n,m) where n and m are the length of the 2 sequences aligned • When we try matching sequence x against a D.Bof N (N>>1) sequences, we need to account for the fact we might see high scores “just by chance” • We can make 2 kinds of mistakes in our calls: • FP • FN → We want our“rigorous way”to control FP FN mistakes cbio course, spring 2005, Hebrew University
Motivation (3) • The problem of assigning statistical significance to scores and controlling our FP and FN mistakes is of general interest. • Examples: • Similarities between protein sequnece to profile HMM • Log ratio scores when searching DNA sequence motifs • …. • The methods we develop now will be of general use cbio course, spring 2005, Hebrew University
Reminder • In the last lesson we talked about 2 ways to analyze alignment scores and their significance: • Bayesian • Classical EVD approach • We reviewed how the amount of FP mistakes can be controlled using each of these approaces • We reviewed Karlin & Altshul (1990) results cbio course, spring 2005, Hebrew University
Review First Approach – Bayesian Assume we have two states in our world : M (Model = related sequences) R (Random = un realated sequences) Given a fixed alignment of two sequences (x,y) we ask “from which state it came from M or R ?” We saw: Where: cbio course, spring 2005, Hebrew University
Review Bayesian Approach cont. • We saw that in order to control the expected number of false identifications, when testing scores which came from R, we need the threshold over the scores S* to have S* ~ log(number of trials * K ) Where: Number of trials for scoring a sequence of length m in local aligment against N sequences of length n is nmN K in [0,1] is correlation factor compensating for the fact the trials are correlated. cbio course, spring 2005, Hebrew University
Review EVD Approach • In the EVD approach we are interested in the question: “given a score s for aligning x and y, If this s came from a distribution of scores for unrelated sequences (like R in the Bayesian approach), What’s the probability of seeing a score as good as s by chance, simply because I tried so many matches of sequences against x”? • R here is the null hypothesis we are testing against. • If P(score >= s | we tried N scores) < Threshold (say 0.01) then we “reject” the null hypothesis (R) • NOTE: • There is no “second” hypothesis here. • We are guarding against type 1 errors (FP) • No control or assumptions are made about FN here !! • This setting is appropriate for the problem we have at hand (D.B search) cbio course, spring 2005, Hebrew University
Toy Problem • Let s,t be two randomly chosen DNA sequences of length n sampled from the uniform distribution over the DNA alphabet. • Align s versus t with no gaps (i.e. s[1] is aligned to t[1] until s[n] is aligned to t[n].) • What is the probability that there are k matches (not necessarily continuous ones) between s and t? • Suppose you are a researcher and you have two main hypothesis: Either these two sequences are totally unrelated or there was a common ancestor to both of them (there is no indel option here). • How would you use the number of matches to decide between the two options and attach a statistical confidence measure to this decision? cbio course, spring 2005, Hebrew University
d Empirical Distribution over scores, for p=0.25, using M = 100K samples 0.25 0.2 0.15 n = 20 0.1 n = 100 0.05 0 0 10 20 30 40 50 60 70 80 90 100 Pvalue for score = 30 and n= 100 NOTE: As in our “real” problem - pvalue of score depends on n cbio course, spring 2005, Hebrew University
EVD for our problem In the EVD approach, we are interested in the question: “what is the probability of me seeing such a good a score as S*, only from matches to non related sequences, if I tried N such matches?” Compute: If we want to guarantee the P{ Max(S1 … SN) >= S*} < 0.05 where Si are scores of matches against non related sequences sampled i.i.d , then: [1 – pvalue(S*)]N > 0.95 i.e cbio course, spring 2005, Hebrew University
Guarding against mistakes & evaluating performace • In the EVD we kept guarding against FP mistakes. This is very important when doing tasks where many tests are preformed, as in our case of D.B search • Sometime we are not able to compute EVD and we still want to control FPR. A very strict and simple Solution is the “Bonferroni corrected pvalue” = pvalue*N Where N is the number of tests perfromed. Note: The relation to the “union bound” is clear Problem: Bonf. Controls the FWER (family wise error rate) i.e the probability of seeing even 1 mistake in the results we report as significant (a FP mistake). It does so with very basically no assumption on the distribution, the relations between the hypothesis tested etc. and still guaranties control over FWER The price to pay is in FN ….. cbio course, spring 2005, Hebrew University
Bonf. Example on our case • We saw: If we want to guarantee the P{ Max(S1 … SN) >= S*} < 0.05 where Si are scores of matches against non related sequences sampled i.i.d , then: [1 – pvalue(S*)]N > 0.95 i.e Compare for N = 10 the result for this equation: 0.005116 to the Bonf. Corrected pvalue: 0.05/N = 0.005 For N = 20: 0.00256 vs. 0.002 etc… If we used the strict Bonf. for the same guarantee level we wanted, we might have rejected some “good” results. cbio course, spring 2005, Hebrew University
How to estimate performance? • Say you have a method with a score (in our case: “method” = scoring local alignment with affine gaps, and scoring matrix “Sigma” (e.g. Sigma = PAM1) • You set a threshold over the scores based on some criteria (e.g EVD estimation of the scores in random matches) • You want to evaluate your methods performace on some “test” data set. The data set would typically contain some true and false examples. • Assumption: you KNOW the answers of this test set ! Aim: You want to see the tradeoff you get for using various thresholds on the scores, in terms of FP and FN on the data set. cbio course, spring 2005, Hebrew University
ROC curves • ROC = Receiver Operator Curve Sensitivity = TP/ (TP+FN) = “What ratio of the true ones we capture” Best Performance 100% FPR = False Positive Rate = Empirical pvalue = FP/ (FP + TN) = FP / ( “real negatives”) = “What ratio of the bad ones we pass” 0% cbio course, spring 2005, Hebrew University 0% 100%
Sensitivity = TP/ (TP+FN) = “What ratio of the true ones we capture” Best Performance 100% FPR = False Positive Rate = Empirical pvalue = FP/ (FP + TN) = FP / ( “real negatives”) = “What ratio of the bad ones we pass” 0% 2% 0% 100% ROC curves • NOTE: Each point in the ROC matches a certain threshold over the method’s scores • Each method gives a different Curve • We can now compare methods performance: • At a certain point on the graph • Via the total size of area under the graph cbio course, spring 2005, Hebrew University
Back to our Toy Problem Assume the data we need to handle came from two sources, as in the Bayesian approach: R – no related sequences, p(a,a) = 0.25 M – related sequences p(a,a) = 0.4 p(a,b) = 0.2 Delta scoring matrix i.e. S(a,a) = 1 S(a,b) = 0 cbio course, spring 2005, Hebrew University
Finish with a Thought… • In our toy problem – what’s the relation between the graph of the last slide and the ROC curve we talked about? • How does the relative amount of samples from M and R in our data set effects the ROC? How should the total distribution over the scores look like? cbio course, spring 2005, Hebrew University