Evaluation
Mining Biomedical Data
Steinbach and Kumar
Spring 2011

Overview of Evaluation
• General Discussion
• Classification for SNP data (covered in classification)
• Measures
• Discriminative patterns for SNP data
• Simple measures based on contingency tables
• Evaluation of gene expression modules
• Enrichment
• Multiple Hypothesis Testing
• Randomization
• Case-Control gene expression studies
• GSEA
CSCI 8980: Spring 2011: Mining Biomedical Data

What needs to be evaluated?
• Classification models
  • How well will the model perform on new observations?
  • What does the model tell us about the domain?
• Clusters
  • Are the clusters internally coherent?
  • What is the biological significance of the clusters?
• Patterns
  • Which of the discriminating patterns are significant?
  • How much discrimination does a pattern provide?
  • What is the biological significance of the patterns or sets of patterns?

Examples
• Case-control SNP association studies
  • Which SNPs or groups of SNPs have an association with the disease?
• Transcription modules or clusters from gene expression data
  • Functionally, what does a group of genes represent?
• Case-control gene expression studies
  • What genes or groups of genes are significantly under- or over-expressed?

Objective vs. Subjective
• Objective vs. domain/subjective
• Many times the domain expert is the final evaluator
• If the classification model, cluster, or pattern doesn't make sense to the expert, then it is often useless
• However, sometimes an initially suspect result represents an unexpected phenomenon

Objective Evaluation Measures
• Various measures are applied to the classification model, cluster, or association pattern to yield a number
  • Classification: accuracy, precision, recall, ...
  • Clustering: SSE, silhouette coefficient, entropy, purity, enrichment
  • Association patterns (no classes): support, h-confidence
  • Discriminative patterns: odds ratio, p-value, DiffSup, chi-square, ...

Measures of Statistical and Practical Significance
• Statistically, a result is significant if the probability (p-value) of the result arising by random chance is below a chosen threshold, e.g., 0.05 or 0.01
• Thus, we say a SNP or group of SNPs is significant if a measure of discrimination (e.g., odds ratio) is unlikely to be due to random variation
• But practical significance is also important
• Thus the magnitude of the discrimination is also important
• Example: A SNP that has a -log10 p-value of 10 but an odds ratio of 1.1 is not very interesting

Ground Truth
• Often a ground truth data set is needed
• In classification, ground truth is needed for training and evaluation
• In clustering and pattern finding, known groupings or patterns are used to evaluate whether the algorithm is producing useful results
• However, the ground truth may be incomplete and/or erroneous
• This can make model/cluster/pattern evaluation difficult or error prone

Measures of Classification Performance
α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).
β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).

Genetic Association Studies
• We consider only certain kinds of studies
  • Case-control
  • Not family based
• Given SNP data we can
  • Find biomarkers or
  • Build a classification model
• Both approaches need evaluation

Evaluating Biomarkers
• For SNP data, a biomarker is a SNP or a set of SNPs
• A patient has this marker or doesn't
• A biomarker defines a set of binary labels like the case-control labels
• Similar comments apply to gene expression based biomarkers

Contingency Tables
• The relationship between a potential biomarker and the case-control labels can be summarized by a 2 x 2 contingency table with cell counts a, b, c, and d

Evaluation Measures
• Many possible measures
• Classification measures
  • Accuracy, precision, recall, F-measure
• Statistical measures
  • P-value, odds ratio
• Association, similarity, and other measures
  • Interest, cosine, mutual information

Evaluation Measures …
• Measures have a variety of characteristics
• Symmetry, invariance to scaling, invariance to inversion, invariance to null addition
• Selecting the right objective measure for association analysis, Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava, Inf. Syst., Vol. 29, No. 4 (2004), pp. 293-313.
• Interestingness measures for data mining: A survey, Liqiang Geng and Howard J. Hamilton, ACM Comput. Surv. 38, 3 (Sep. 2006), 9. DOI= http://doi.acm.org/10.1145/1132960.1132963
• The most commonly used measures are the odds ratio and the p-value

Odds ratio
• For a contingency table with cell counts a, b, c, and d, the odds ratio is defined as (a*d)/(b*c)
• Measures whether two groups have the same odds of an event
• Log odds ratio is often used
• Odds ratio is invariant to row and column scaling

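The scaling invariance mentioned on this slide can be checked numerically. The sketch below is only an illustration: the function name `odds_ratio` and the cell counts are chosen here, and it assumes that no cell in the denominator is zero.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]].

    Assumes b and c are non-zero; a real analysis would need to handle
    zero cells (for example with a continuity correction).
    """
    return (a * d) / (b * c)


# Illustrative counts for the cells a, b, c, d.
a, b, c, d = 20, 30, 10, 40
base = odds_ratio(a, b, c, d)

# Multiply the first row by 10 and the second column by 3.
# Every product in the ratio picks up the same factors, so they cancel.
scaled = odds_ratio(10 * a, 10 * 3 * b, c, 3 * d)

log_odds = math.log(base)  # natural-log odds ratio, 0 means "no association"

print(base, scaled, log_odds)
```

Because the row and column factors cancel, the odds ratio does not change when cases or controls are over-sampled, which is one reason it is popular for case-control designs.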
P-value
• P-value
  • Statistical terminology for a probability value
  • Is the probability of getting, by random chance, an odds ratio at least as extreme as the one observed
  • Computed by using the chi-square statistic or Fisher's exact test
  • The chi-square approximation is not reliable if the count in any cell of the contingency table is small
• Fisher's exact test
  • A statistical test to determine if there are nonrandom associations between two categorical variables
  • p-value = 1 - hygecdf( a - 1, a+b+c+d, a+c, a+b ) when testing whether the observed value a is higher than expected by random chance
• P-values are often expressed in terms of the negative log of the p-value, e.g., -log10(0.005) = 2.3

Example
Contingency table cells: a = 20, b = 30, c = 10, d = 40
Odds ratio = (a*d)/(b*c) = (20 * 40) / (30 * 10) = 8/3 = 2.67
P-value = 1 - hygecdf( a - 1, a+b+c+d, a+c, a+b ) = 1 - hygecdf( 19, 100, 30, 50 ) = 0.0243
-log10(0.0243) = 1.61

Example …
Contingency table cells: a = 200, b = 300, c = 100, d = 400
Odds ratio = (a*d)/(b*c) = (200 * 400) / (300 * 100) = 8/3 = 2.67
P-value = 1 - hygecdf( a - 1, a+b+c+d, a+c, a+b ) = 1 - hygecdf( 199, 1000, 300, 500 ) = 2.7873e-012
-log10(2.7873e-012) = 11.55

Example …
Contingency table cells: a = 4, b = 1, c = 26, d = 39
Odds ratio = (a*d)/(b*c) = (4 * 39) / (1 * 26) = 6
P-value = 1 - hygecdf( a - 1, a+b+c+d, a+c, a+b ) = 1 - hygecdf( 3, 70, 30, 5 ) = 0.1023
-log10(0.1023) = 0.99

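The three worked examples can be reproduced without MATLAB. The sketch below is one possible stand-in, using only Python's standard library: `hypergeom_upper_tail` is a helper name chosen here, and it plays the role of `1 - hygecdf(k - 1, total, n_special, n_sample)` by summing the hypergeometric probabilities directly.

```python
from math import comb, log10


def hypergeom_upper_tail(k, total, n_special, n_sample):
    """P(X >= k) when n_sample items are drawn without replacement from
    `total` items, of which `n_special` are special."""
    upper = min(n_special, n_sample)
    numerator = sum(
        comb(n_special, i) * comb(total - n_special, n_sample - i)
        for i in range(max(k, 0), upper + 1)
    )
    return numerator / comb(total, n_sample)


def one_sided_fisher(a, b, c, d):
    """One-sided Fisher's exact test: is cell a larger than expected?"""
    return hypergeom_upper_tail(a, a + b + c + d, a + c, a + b)


# Cell counts (a, b, c, d) taken from the three examples.
for a, b, c, d in [(20, 30, 10, 40), (200, 300, 100, 400), (4, 1, 26, 39)]:
    p = one_sided_fisher(a, b, c, d)
    print(f"odds ratio = {(a * d) / (b * c):.2f}, "
          f"p = {p:.4g}, -log10(p) = {-log10(p):.2f}")
```

The first two tables have the same odds ratio but very different p-values, which illustrates the distinction between practical and statistical significance: multiplying every count by 10 leaves the effect size unchanged while making the result far less likely under chance.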
Other Approaches
• Chi-square and Fisher's exact test are most common for single SNPs
• Chi-square can be used for either binary or ternary representation and for pairs
• Can use the DiffSup measure
• Can also look at measures based on the jump, i.e., the change in a measure between a pattern and its subpatterns
• Example: synergy. D. Anastassiou. Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology, 3(1), 2007.

Evaluation of Modules and Clusters
• Gene expression data is often used to find modules and clusters
  • Transcription modules
  • Clusters from co-clustering or other clustering techniques
• All of these approaches define groups of genes
• We can evaluate these groups of genes with respect to how coherent or "tight" they are
• However, we often want to evaluate their meaning with respect to function
  • Evaluate in terms of Gene Ontology, KEGG pathways, etc.
  • Enrichment is a common way to do this
  • Entropy or purity often don't work well

Enrichment
• Evaluation is difficult
  • Many proteins are not annotated
  • Not all proteins have high similarity to others
  • The overall ability to predict function based on similarity is limited, and thus performance is poor according to measures such as accuracy, precision, and recall
• Still want to extract the most information possible
  • Ultimate goal is to combine the limited information available in the data set with additional information from other types of data
• Can evaluate performance by comparison with random groups
  • Is the set of objects (with known labels) assigned to a class or a cluster enriched (over random) with respect to some functions?
• Enrichment is the occurrence of a functional label in a group at a higher level than is likely by mere chance

Evaluation of Enrichment Using Fisher's Exact Test and the Hypergeometric Distribution
• Hypergeometric distribution
  • Used for objects with two sets of binary categories
  • In our case, genes that have the functional label and those that don't
• Four parameters
  • Total number of objects (genes or proteins)
  • Number of objects with the functional label
  • Size of a sample (cluster size)
  • Number of objects with the specified label in the sample
• Accounts for the size of the group and the frequency of the label
• Using the hypergeometric distribution, you can compute the probability of observing the given number of genes, or more, with the given label
• If the probability is low, then we say that the group is enriched in the functional label
• This approach is known as Fisher's exact test
• Could also use chi-square

Hypergeometric Distribution
• Suppose you have
  • N objects (genes)
  • A sample of objects of size n (the genes in a module or cluster)
  • A special class of objects of size m (genes with some functional label)
  • k objects of the special class in the sample
• What is the probability that this happens by chance?
• To find the probability of seeing at least k special objects, we sum the probabilities of every count from k up to the largest possible value, min(n, m)

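The formula itself does not survive in this transcript. Assuming the standard hypergeometric form and the slide's symbols (N objects, m in the special class, a sample of size n, k special objects in the sample), with X introduced here as a label for the random number of special objects in the sample, the two probabilities can be written as:

```latex
P(X = i) = \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}},
\qquad
P(X \ge k) = \sum_{i=k}^{\min(n,m)} \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}
```

The second expression is the quantity used as the enrichment p-value.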
Hypergeometric Distribution …
• Suppose we have 6,000 genes, a functional class of size 100, a cluster of size 50, and there are 7 genes of the functional class in the cluster.
• How likely is it that we see 7 or more genes of the functional class in our cluster?

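A minimal sketch of how this question could be answered numerically, using exact integer binomial coefficients from Python's standard library. The variable names follow the symbols N, m, n, and k used for the hypergeometric distribution; `expected_count` is added here only for comparison.

```python
from math import comb

N = 6000  # total number of genes
m = 100   # genes in the functional class
n = 50    # genes in the cluster
k = 7     # genes of the functional class observed in the cluster

# Number of class members expected in a random cluster of this size.
expected_count = n * m / N

# P(X >= k): sum the hypergeometric probabilities for k, k+1, ..., min(n, m).
p_value = sum(
    comb(m, i) * comb(N - m, n - i) for i in range(k, min(n, m) + 1)
) / comb(N, n)

print(f"expected by chance: {expected_count:.2f} genes")
print(f"P(X >= {k}) = {p_value:.3g}")
```

Fewer than one gene of the class is expected in a random cluster of 50, so seeing 7 gives a tail probability far below the usual 0.05 or 0.01 thresholds.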
Example of Evaluation of Enrichment Using the Hypergeometric Distribution
• Computed using a contingency table
• p-value = probability of the observed value or something higher
• p-value = 1 - hygecdf( a - 1, a+b+c+d, a+c, a+b ), where hygecdf is a MATLAB function that computes the hypergeometric cumulative distribution function
• p-value = 0.0560 for the contingency table shown on the original slide (the table is not reproduced in this transcript)

Applications
• Used to evaluate transcription modules
  • Defining transcription modules using large-scale gene expression data, Ihmels, Bergmann, and Barkai, Bioinformatics, Vol. 20, no. 13, 2004, pages 1993-2003. doi:10.1093/bioinformatics/bth166
  • The goal is to use expression data to find groups of genes that are expressed together under a subset of conditions
  • These are called transcription modules
  • The goodness of a module can be evaluated by the enrichment of GO terms among the genes in a module
• Used by the GO::TermFinder: http://www.yeastgenome.org/help/goTermFinder.html
• A correction for multiple hypothesis tests is used for these applications
  • More on this later

Functional Group Verification Using Gene Ontology
• Hypothesis: Proteins within the same group or pattern are more likely to perform the same function and participate in the same biological process
• Gene Ontology
  • Three separate ontologies: Biological Process, Molecular Function, Cellular Component
  • Organized as a DAG describing gene products (proteins and functional RNA)
  • Collaborative effort between major genome databases
  • http://www.geneontology.org

Multiple Hypothesis Testing
• Apparently interesting results can arise by random chance
• Ex: A SNP might match the case-control labels relatively well by chance, especially if the number of cases is small
• Ex: Given dozens of modules and hundreds of pathways, it is possible that one module will have an unusually high number of genes from one pathway by chance
• In other words, an apparent pattern or association can just be coincidence
• These results can lead to false conclusions and published results that cannot be replicated

Formalities
• In statistics, many problems are formulated as hypothesis testing:
• H0: The null hypothesis
  • For example, the means of two sets of values are the same
• H1: Alternative hypothesis
  • For example, the means of two sets of values are different
• A statistic that provides information about the null hypothesis must be computed
  • For example, to test whether two sets of values have the same mean, we can compute the t-statistic

Formalities ...
• The null hypothesis is evaluated by computing a statistic and seeing how likely the value of this statistic is given the null hypothesis
• The distribution of the statistic under H0 is the null distribution
• A region of extreme values (the critical region) is defined for the test statistic
• The probability of the critical region under the null distribution is known as the size of the test or α
• The key assumption is that an event that is unlikely under the null hypothesis is evidence that the null hypothesis is false
• Thus, if the statistic falls in the critical region, it is "better" to assume that the null hypothesis is false rather than that something very unlikely happened

Formalities ...
• Usually a threshold of 0.05 or 0.01 is used for the significance level
• However, unlikely events do happen, especially when many tests occur
• For example, a run of ten heads on a fair coin only occurs once in 1024 times, but if 10,000 people flip a fair coin 10 times, there will be about 10 such runs
• Given 20,000 genes, a case-control expression data set, and a significance level of 0.01, how many genes would you expect to be marked as significant by random chance?

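Both questions on this slide come down to multiplying the number of tests by the per-test probability. The short calculation below is a sketch under two assumptions that are made here, not stated on the slide: no gene is truly differentially expressed, and the tests are well calibrated.

```python
# Coin example: probability of ten heads in ten flips of a fair coin.
p_run = 0.5 ** 10                  # 1 / 1024
n_people = 10_000
expected_runs = n_people * p_run   # a little under 10 people

# Probability that at least one of the 10,000 people gets ten heads.
p_at_least_one_run = 1 - (1 - p_run) ** n_people

# Gene example: every gene is tested at significance level alpha.
n_genes = 20_000
alpha = 0.01
expected_false_positives = n_genes * alpha  # about 200 genes

print(expected_runs, p_at_least_one_run, expected_false_positives)
```

So roughly 200 genes would be flagged even if nothing real were going on, which is the motivation for the corrections that follow.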
Formalities …
• Rejecting the null hypothesis when it is true is a Type I error
  • This is also known as a false positive
  • Its probability is α
• Accepting the null hypothesis when it is false is a Type II error
  • This is also known as a false negative
  • Its probability is β

Connection to Evaluation
• When looking for markers in SNP data, each SNP is a test, and there can be hundreds of thousands
  • Situation is much worse if you consider pairs, triplets, etc.
• When evaluating under- or over-expression, each gene is a test, and there can be thousands
• When using gene sets, each gene set is a test
• When assessing modules and enrichment, the number of tests is the number of modules times the number of pathways, GO terms, etc.
  • This can be millions

Bonferroni Correction
• One simple way to address the problem of multiple testing is to use the Bonferroni correction
• Divide the significance level by the number of tests
• For example, if α = 0.01 and n = 10,000, the new significance level is α' = 10^-6
• This is often too strict since it tries to ensure that we don't get any false positives
• However, this means we get many false negatives
• Interesting results are discarded

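The correction is a one-line computation. The sketch below uses function names chosen here for illustration and a small made-up list of p-values.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """True for each p-value that stays significant after correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# The slide's example: alpha = 0.01 spread over 10,000 tests.
print(bonferroni_threshold(0.01, 10_000))

# Made-up p-values: only the first survives a threshold of 0.05 / 3.
print(bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05))
```

In the second call, 0.02 and 0.04 would each pass an uncorrected 0.05 threshold but are discarded after correction, which is the loss of power the slide warns about.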
False Discovery Rate
• More useful to control the proportion of incorrectly rejected null hypotheses
• This can be done by considering the false discovery rate (FDR)
• Note that FDR ≠ False Positive Rate
• FDR is the expected fraction of the results called significant that are actually false positives
• How many bogus results are you willing to accept to get more interesting results? 1 out of 100? 1 out of 10? 1 out of 2?
• Actual definition is a bit more complex

False Discovery Rate …
• Various techniques have been defined to compute the FDR
• Given a set of p-values, the FDR can be computed for various significance levels
• Choose the significance level that gives the desired FDR, if possible
• Key reference for FDR is Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289-300.

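As one concrete instance, the step-up procedure from the Benjamini and Hochberg reference can be sketched in a few lines. This is a hand-written illustration, not the code used in the course; the function name and the example p-values are made up here, and the procedure's guarantee assumes independent (or positively dependent) tests.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= (k / m) * fdr, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes

    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values for four tests.
p_values = [0.04, 0.03, 0.045, 0.02]
print(benjamini_hochberg(p_values, fdr=0.05))
```

With these values a Bonferroni threshold of 0.05 / 4 = 0.0125 rejects nothing, while the step-up rule rejects all four because the largest p-value, 0.045, is below its rank threshold of (4 / 4) * 0.05. That difference is the trade the FDR makes: more discoveries in exchange for tolerating a controlled fraction of false ones.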
q values
• Storey developed an approach to multiple hypothesis testing and FDR based on q values
• If a test has a q value of 10%, then calling that test and every test with a smaller q value significant results in an FDR of about 10% among the significant features
• Thus, a q value is computed for every test based on the set of all p-values for the tests
• A user chooses the q value threshold to trade off the number of hypotheses detected against the false discovery rate
• Statistical significance for genome-wide studies, J. D. Storey and R. Tibshirani, Proc Natl Acad Sci, 100:9440-9445, 2003.

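Example - FDR
% Load the example prostate cancer expression data
load prostatecancerexpdata
% Permutation-based t-test p-value for each gene
pvalues = mattest(dependentData, independentData, 'permute', true);
% Estimate the FDR with the Benjamini-Hochberg option and plot the result
[fdr, q] = mafdr(pvalues, 'showplot', true, 'BHFDR', true);
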
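Example - q values
% Load the example prostate cancer expression data
load prostatecancerexpdata
% Permutation-based t-test p-value for each gene
pvalues = mattest(dependentData, independentData, 'permute', true);
% Estimate the FDR and q values with the default (Storey) method and plot the result
[fdr, q] = mafdr(pvalues, 'showplot', true);
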
Randomization
• Another approach to multiple hypothesis testing is to use randomization
• Generate a null distribution for an entire set of values
• Example:
  • Randomize the case-control labels and compute the best (smallest) p-value over all SNPs, i.e., the maximum -log10 p-value
  • Do this 1000 times
  • This distribution of best p-values gives an overall bound on how good the best p-value is

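The label-shuffling idea can be sketched as follows. To keep the example short, it swaps the per-SNP p-value for a simpler statistic, the absolute difference in carrier frequency between cases and controls, and tracks the maximum of that statistic over all SNPs; the function names, the toy data, and the add-one p-value estimate are all choices made here for illustration.

```python
import random


def max_abs_freq_diff(genotypes, labels):
    """Largest |carrier frequency in cases - in controls| over all SNPs.

    genotypes: list of SNPs, each a list of 0/1 values (one per subject)
    labels:    list of 1 (case) / 0 (control), one per subject
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    best = 0.0
    for snp in genotypes:
        case_carriers = sum(g for g, y in zip(snp, labels) if y == 1)
        ctrl_carriers = sum(snp) - case_carriers
        best = max(best, abs(case_carriers / n_case - ctrl_carriers / n_ctrl))
    return best


def permutation_p_value(genotypes, labels, n_perm=1000, seed=0):
    """Compare the observed best statistic with its label-shuffled null."""
    rng = random.Random(seed)
    observed = max_abs_freq_diff(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any SNP-label association
        if max_abs_freq_diff(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one estimate so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 20 cases, 20 controls, 50 SNPs with no real signal.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 20
genotypes = [[data_rng.randint(0, 1) for _ in labels] for _ in range(50)]

observed, p = permutation_p_value(genotypes, labels, n_perm=200)
print(observed, p)
```

Because the genotype matrix is left untouched and only the labels are shuffled, correlations between SNPs are preserved in every permuted data set, and the resulting p-value is already adjusted for having searched over all SNPs.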
Example - SNPs (figure only; not reproduced in this transcript)

Randomization
• Also known as permutation tests
• Can also generate "randomized" or synthetic data
• This is much harder since it is important to preserve structure
• Example: A random set of binary vectors is not a realistic set of SNPs because it does not display linkage disequilibrium
• Best to randomize class labels if they are present, but not all problems fall in this category

Case-Control Gene Expression Studies
• Given gene expression data with case-control labels, we are interested in identifying which genes are over- or under-expressed
• Goal is to understand the difference in cell function between case and control by analyzing the differences in gene expression
• Basic approach is to analyze one gene at a time
  • Can use a t-test to compare expression values for a gene between cases and controls
  • If the p-value of the t-test is low, the gene is differentially expressed
  • Need a multiple comparison correction (see Multiple Hypothesis Testing)
• However, looking at individual genes is problematic

Challenges in Analyzing Differential Expression
• Subramanian et al. describe four key issues with testing a single gene at a time
  • No individual gene may show statistically significant differential expression
  • No coherent pattern among significant genes, so biological interpretation is hard
  • Smaller, coordinated changes among related groups of genes are missed
  • Different studies produce different significant results
• A new approach, Gene Set Enrichment Analysis (GSEA), was proposed
• Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles, Subramanian et al., PNAS, 2005.

Gene Set Enrichment Analysis (GSEA)
• Starting situation:
  • Gene expression matrix with case-control labels
  • Compute a value for each gene that assesses how much expression values for that gene differ between cases and controls, e.g., correlation with class labels or t-test statistic
  • Genes are ordered according to this value
• Ex.: ALL cancer vs. AML cancer tissue (figure: genes by cases and controls)

GSEA Algorithm
• Calculate the enrichment score of each gene set
  • Move down the ranked list and update a running counter
  • If the current gene is a member of the set, add a quantity whose magnitude depends on the gene's correlation with the class variable
  • For non-members, subtract a fixed amount
  • The enrichment score (ES) is the maximum deviation from 0
• Estimate the significance level of the gene set enrichment values
  • Repeatedly permute the class labels (1000 times) and then calculate the enrichment values of the gene sets
  • The significance level of each gene set is determined by this null distribution
  • For example, if the enrichment value of a gene set is greater than all but 5 of the enrichment values from the 1000 permutations, then its p-value is 0.005
• Perform an adjustment for multiple hypothesis testing
• Code and gene sets available at http://www.broad.mit.edu/gsea/
Figure 1 from Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles, Subramanian et al., PNAS, 2005.

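The running-sum step can be sketched as below. This is a simplified illustration rather than the Broad Institute implementation: the function name, the `weight` parameter, and the toy genes are assumptions made here. Hits add an amount proportional to the absolute value of the gene's score, and misses subtract a constant, 1 / (number of genes not in the set), so the running sum returns to zero at the end of the list.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene names sorted by decreasing score
    scores:       dict mapping gene name -> ranking statistic
    gene_set:     set of gene names
    Assumes the set contains at least one ranked gene with a non-zero
    score and that at least one ranked gene is outside the set.
    """
    members = [g for g in ranked_genes if g in gene_set]
    n_miss = len(ranked_genes) - len(members)
    hit_total = sum(abs(scores[g]) ** weight for g in members)

    running = 0.0
    best = 0.0
    for g in ranked_genes:
        if g in gene_set:
            running += abs(scores[g]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero, with sign
    return best


# Toy ranked list of six genes (hypothetical names and scores).
scores = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
ranked = sorted(scores, key=scores.get, reverse=True)

print(enrichment_score(ranked, scores, {"g1", "g2"}))  # set at the top of the list
print(enrichment_score(ranked, scores, {"g5", "g6"}))  # set at the bottom of the list
```

A set concentrated at the top of the ranking gets a large positive score and one concentrated at the bottom gets a large negative score. Significance would then be estimated by recomputing the ranking and the score after each permutation of the class labels, which is not shown here.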
Other Approaches
• Gene Set Analysis (GSA)
  • Efron B, Tibshirani R: On testing the significance of sets of genes. The Annals of Applied Statistics 2007, 1:107-129. http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1183143731
  • Different approach, but same goal: identify sets of genes that are different from a random set of genes in terms of over- or under-expression
  • It can have better statistical power than GSEA
  • Implemented in R
  • Code and gene sets available at http://www-stat.stanford.edu/~tibs/GSA/

GSA Algorithm
• For each gene, compute a t-statistic or some other summary statistic
• For each gene set, compute the mean or some other summary statistic of its genes' values
• Use randomization across genes to generate a null distribution, and use the mean and standard deviation of this distribution to standardize the gene set value from the previous step
• Use permutation of the class labels to generate another null distribution, and estimate the p-values of the standardized gene set values
• Apply multiple hypothesis corrections

Sub-GSE
• GSEA, GSA, and other approaches test the association of all genes in the set with the class labels
• In reality, only a subset of the genes may have an association
• Gene set enrichment by testing subset association (Sub-GSE) looks at subsets
• Uses only top-k subsets of a gene set by ordering genes according to t-test, correlation with class labels, etc.
• Comparisons with GSA and GSEA show greater sensitivity in some cases
• Program available at http://www-rcf.usc.edu/~fsun/Programs/SubGSEWebPages/SubGSEMain.html
• Testing gene set enrichment for subset of genes: Sub-GSE, Xiting Yan and Fengzhu Sun, BMC Bioinformatics 2008, 9:362. doi:10.1186/1471-2105-9-362