Evaluation

Evaluation Mining Biomedical Data Steinbach and Kumar Spring 2011

Overview of Evaluation
• General Discussion
• Classification for SNP data (covered in classification)
• Measures
• Discriminative patterns for SNP data
• Simple measures based on contingency tables
• Evaluation of gene expression modules
• Enrichment
• Multiple Hypothesis Testing
• Randomization
• Case-Control gene expression studies
• GSEA
CSCI 8980: Spring 2011: Mining Biomedical Data

What needs to be evaluated?
• Classification models
  • How well will the model perform on new observations?
  • What does the model tell us about the domain?
• Clusters
  • Are the clusters internally coherent?
  • What is the biological significance of the clusters?
• Patterns
  • Which of the discriminating patterns are significant?
  • How much discrimination does a pattern provide?
  • What is the biological significance of the patterns or sets of patterns?

Examples
• Case-control SNP association studies
  • Which SNPs or groups of SNPs have an association with the disease?
• Transcription modules or clusters from gene expression data
  • Functionally, what does a group of genes represent?
• Case-control gene expression studies
  • What genes or groups of genes are significantly under- or over-expressed?

Objective vs. Subjective
• Objective vs. domain/subjective
  • Many times the domain expert is the final evaluator
  • If the classification model, cluster, or pattern doesn’t make sense to the expert, then it is often useless
  • However, sometimes an initially suspect result represents an unexpected phenomenon

Objective Evaluation Measures
• Various measures are applied to the classification model, cluster, or association pattern to yield a number
  • Classification: accuracy, precision, recall, ...
  • Clustering: SSE, silhouette coefficient, entropy, purity, enrichment
  • Association patterns (no classes): support, h-confidence
  • Discriminative patterns: odds ratio, p-value, DiffSup, chi-square, …

Measures of Statistical and Practical Significance
• Statistically, a result is significant if the probability (p-value) of it arising by random chance is low, e.g., below 0.05 or 0.01
• Thus, we say a SNP or group of SNPs is significant if a measure of discrimination (e.g., odds ratio) is unlikely to be due to random variation
• But practical significance is also important
  • Thus the magnitude of the discrimination also matters
  • Example: A SNP that has a -log10 p-value of 10 but an odds ratio of 1.1 is not very interesting

Ground Truth
• Often a ground truth data set is needed
  • In classification, ground truth is needed for training and evaluation
  • In clustering and pattern finding, known groupings or patterns are used to evaluate whether the algorithm is producing useful results
• However, the ground truth may be incomplete and/or erroneous
  • This can make model/cluster/pattern evaluation difficult or error prone

Measures of Classification Performance
• α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).
• β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).

Genetic Association Studies
• We consider only certain kinds of studies
  • Case-control
  • Not family based
• Given SNP data we can
  • Find biomarkers or
  • Build a classification model
• Both approaches need evaluation

Evaluating Biomarkers
• For SNP data, a biomarker is a SNP or a set of SNPs
  • A patient either has this marker or doesn’t
  • A biomarker therefore defines a set of binary labels, like the case-control labels
• Similar comments apply to gene expression based biomarkers

Contingency Tables
• The relationship between a potential biomarker and the case-control labels can be summarized by a contingency table

Evaluation Measures
• Many possible measures
• Classification measures
  • Accuracy, precision, recall, F-measure
• Statistical measures
  • P-value, odds ratio
• Association, similarity, and other measures
  • Interest, cosine, mutual information
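As a rough illustration of the classification measures listed above, the sketch below computes them from the four cells of a 2x2 contingency table. The cell layout (a = cases with the marker, b = controls with the marker, c = cases without it, d = controls without it) is an assumption made for this example, since the table itself is not reproduced in the transcript; the function name is hypothetical.

```python
def classification_measures(a, b, c, d):
    """Accuracy, precision, recall, and F-measure from a 2x2 table.

    Assumed layout (not given in the slides):
      a = cases with the marker      b = controls with the marker
      c = cases without the marker   d = controls without the marker
    Having the marker is treated as a prediction of "case".
    """
    total = a + b + c + d
    accuracy = (a + d) / total
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


measures = classification_measures(20, 30, 10, 40)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Under this layout a marker can have reasonable recall and still poor precision, which is one reason a single measure is rarely enough.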

Evaluation Measures …
• Measures have a variety of characteristics
  • Symmetry, invariance to scaling, invariance to inversion, invariance to null addition
  • Selecting the right objective measure for association analysis, Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava, Inf. Syst., Vol. 29, No. 4 (2004), pp. 293-313.
  • Interestingness measures for data mining: A survey, Liqiang Geng and Howard J. Hamilton, ACM Comput. Surv. 38, 3 (Sep. 2006), 9. DOI= http://doi.acm.org/10.1145/1132960.1132963
• The most commonly used measures are the odds ratio and the p-value

Odds ratio
• For a contingency table with cells a, b, c, d, the odds ratio is defined as (a*d)/(b*c)
• Measures whether two groups have the same odds of an event
• Log odds ratio is often used
• Odds ratio is invariant to row and column scaling
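The invariance property can be checked numerically. The sketch below uses the (a*d)/(b*c) form that appears in the worked examples and assumes the table rows are (a, b) and (c, d); the natural logarithm for the log odds ratio is also an assumption, since the slides do not name a base.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table with rows (a, b) and (c, d)."""
    return (a * d) / (b * c)


base = odds_ratio(20, 30, 10, 40)
row_scaled = odds_ratio(20 * 10, 30 * 10, 10, 40)  # first row multiplied by 10
col_scaled = odds_ratio(20 * 5, 30, 10 * 5, 40)    # first column multiplied by 5
log_or = math.log(base)                            # natural log (assumed base)

print(base, row_scaled, col_scaled, log_or)
```

Scaling a row or a column multiplies numerator and denominator by the same factor, so the ratio is unchanged. This is why the odds ratio is insensitive to, for example, sampling ten times as many cases.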

P-value
• P-value
  • Statistical terminology for a probability value
  • Is the probability that we get an odds ratio as extreme as the one we observed by random chance
• Computed by using the chi-square statistic or Fisher’s exact test
  • Chi-square statistic is not valid if the number of entries in a cell of the contingency table is small
  • p-value = 1 – hygecdf( a – 1, a+b+c+d, a+c, a+b ) if we are using Fisher’s exact test to test whether the value a is higher than expected by random chance
  • Fisher’s exact test is a statistical test to determine if there are nonrandom associations between two categorical variables
• P-values are often expressed in terms of the negative log of the p-value, e.g., -log10(0.005) = 2.3

Example
Odds ratio = (a*d)/(b*c) = (20 * 40) / (30 * 10) = 8/3 = 2.67
P-value = 1 – hygecdf( a – 1, a+b+c+d, a+c, a+b ) = 1 – hygecdf( 19, 100, 30, 50 ) = 0.0243
-log10(0.0243) = 1.61

Example …
Odds ratio = (a*d)/(b*c) = (200 * 400) / (300 * 100) = 8/3 = 2.67
P-value = 1 – hygecdf( a – 1, a+b+c+d, a+c, a+b ) = 1 – hygecdf( 199, 1000, 300, 500 ) = 2.7873e-012
-log10(2.7873e-012) = 11.55

Example …
Odds ratio = (a*d)/(b*c) = (4 * 39) / (1 * 26) = 6
P-value = 1 – hygecdf( a – 1, a+b+c+d, a+c, a+b ) = 1 – hygecdf( 3, 70, 5, 30 ) = 0.1023
-log10(0.1023) = 0.99
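The three examples can be re-computed without MATLAB. The following is a minimal Python sketch of the same one-sided calculation; `hypergeom_upper_tail` and `fisher_one_sided` are hypothetical helper names, and the first is only intended to play the role of 1 – hygecdf( k – 1, ... ).

```python
from math import comb, log10


def hypergeom_upper_tail(k, population, special, draws):
    """P(X >= k) when `draws` items are taken without replacement from
    `population` items, of which `special` are marked.

    Intended to correspond to 1 - hygecdf(k - 1, population, special, draws).
    """
    upper = min(special, draws)
    favorable = sum(
        comb(special, i) * comb(population - special, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favorable / comb(population, draws)


def fisher_one_sided(a, b, c, d):
    """One-sided p-value that cell `a` is at least as large as observed."""
    return hypergeom_upper_tail(a, a + b + c + d, a + c, a + b)


tables = [(20, 30, 10, 40), (200, 300, 100, 400), (4, 1, 26, 39)]
results = []
for a, b, c, d in tables:
    p = fisher_one_sided(a, b, c, d)
    results.append(((a * d) / (b * c), p, -log10(p)))
    print(f"odds ratio={results[-1][0]:.2f}  p={p:.4g}  -log10(p)={results[-1][2]:.2f}")
```

The first two tables have the same odds ratio but very different p-values because the second has ten times as many subjects, while the third has the largest odds ratio and the weakest p-value. This is the distinction between practical and statistical significance made earlier.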

Other Approaches
• Chi-square and Fisher’s exact test are most common for single SNPs
• Chi-square can be used for either binary or ternary representation and for pairs
• Can use the DiffSup measure
• Can also look at measures based on the jump, i.e., the change in a measure between a pattern and its subpatterns
  • Example: synergy
  • D. Anastassiou. Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology, 3(1), 2007.
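For comparison with Fisher’s exact test, here is a small sketch of the Pearson chi-square test on a 2x2 table. It assumes no continuity correction and one degree of freedom, for which the upper tail probability can be written with the complementary error function; the function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and two-sided p-value for a 2x2 table.

    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi_square_2x2(20, 30, 10, 40)
print(f"chi-square={stat:.3f}  p={p_value:.4f}")
```

This p-value is two-sided, so it is not directly comparable to the one-sided hypergeometric values in the examples. With small cell counts the chi-square approximation becomes unreliable, and the exact test is preferable.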

Evaluation of Modules and Clusters
• Gene expression data is often used to find modules and clusters
  • Transcription modules
  • Clusters from co-clustering or other clustering techniques
• All of these approaches define groups of genes
• We can evaluate these groups of genes with respect to how coherent or “tight” they are
• However, we often want to evaluate their meaning with respect to function
  • Evaluate in terms of Gene Ontology, KEGG pathways, etc.
  • Enrichment is a common way to do this
  • Entropy or purity often don’t work well

Enrichment
• Evaluation is difficult
  • Many proteins are not annotated
  • Not all proteins have high similarity to others
  • The overall ability to predict function based on similarity is limited and thus, performance is poor according to measures such as accuracy, precision, and recall
• Still want to extract the most information possible
  • Ultimate goal is to combine the limited information available in the data set with additional information from other types of data
• Can evaluate performance by comparison with random groups
  • Is the set of objects (with known labels) assigned to a class or a cluster enriched (over random) with respect to some functions?
  • Enrichment is the occurrence of a functional label in a group at a higher level than is likely by mere chance

Evaluation of Enrichment Using Fisher’s Exact Test and the Hypergeometric Distribution
• Hypergeometric distribution
  • Used for objects with two sets of binary categories
  • In our case, genes that have the functional label and those that don’t
• Four parameters
  • Total number of objects (genes or proteins)
  • Number of objects with the functional label
  • Size of a sample (cluster size)
  • Number of objects with the specified label in the sample
• Accounts for the size of the group and the frequency of the label
• Using the hypergeometric distribution, you can compute the probability of observing the given number of genes or more with the given label in the sample
• If the probability is low, then we say that the group is enriched in the functional label
• This approach is known as Fisher’s exact test
• Could also use chi-square

Hypergeometric Distribution
• Suppose you have
  • N objects (genes)
  • A sample of objects of size n (the genes in a module or cluster)
  • A special class of objects of size m (genes with some functional label)
  • k objects of the special class in the sample
• What is the probability that this happens by chance?
• To find the probability of seeing at least k special objects, sum the probabilities over all counts from k up to the largest possible count
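The formulas are not preserved in this transcript. With the symbols N, n, m, and k defined above, the standard hypergeometric probabilities can be written as follows; the summation index i is introduced here only for illustration.

```latex
P(X = i) = \frac{\binom{m}{i}\,\binom{N-m}{n-i}}{\binom{N}{n}},
\qquad
P(X \ge k) = \sum_{i=k}^{\min(n,\,m)} \frac{\binom{m}{i}\,\binom{N-m}{n-i}}{\binom{N}{n}}
```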

Hypergeometric Distribution …
• Suppose we have 6,000 genes, a functional class of size 100, a cluster of size 50, and there are 7 genes of the functional class in the cluster.
• How likely is it that we see 7 or more genes of the functional class in our cluster?

Example of Evaluation of Enrichment Using the Hypergeometric Distribution
• Computed using a contingency table
• p-value = probability of the observed value or something higher
• p-value = 1 – hygecdf( a – 1, a+b+c+d, a+c, a+b ), where hygecdf is a MATLAB function that computes the hypergeometric cumulative distribution function
• p-value = 0.0560 for the table shown on the original slide (the table is not reproduced in this transcript)
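The question posed for the 6,000-gene example (7 or more labeled genes in a cluster of 50) can be answered with the same tail sum. This is a minimal sketch in Python rather than MATLAB; `enrichment_p_value` is a hypothetical name, and the result is unrelated to the 0.0560 quoted for the slide’s own table.

```python
from math import comb


def enrichment_p_value(total_genes, labeled_genes, cluster_size, labeled_in_cluster):
    """P(at least `labeled_in_cluster` labeled genes in a random cluster)."""
    upper = min(labeled_genes, cluster_size)
    favorable = sum(
        comb(labeled_genes, i) * comb(total_genes - labeled_genes, cluster_size - i)
        for i in range(labeled_in_cluster, upper + 1)
    )
    return favorable / comb(total_genes, cluster_size)


expected_count = 50 * 100 / 6000        # labeled genes expected by chance
p_value = enrichment_p_value(6000, 100, 50, 7)
print(f"expected by chance: {expected_count:.2f}, observed: 7, p-value: {p_value:.3g}")
```

Fewer than one labeled gene is expected by chance, so observing seven gives a very small p-value, and the cluster would be called enriched for that label (before any multiple-testing correction).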

Applications
• Used to evaluate transcription modules
  • Defining transcription modules using large-scale gene expression data, Ihmels, Bergmann, and Barkai, Bioinformatics, Vol. 20, no. 13, 2004, pages 1993–2003. doi:10.1093/bioinformatics/bth166
  • The goal is to use expression data to find groups of genes that are expressed together under a subset of conditions
  • These are called transcription modules
  • The goodness of a module can be evaluated by the enrichment of GO terms among the genes in a module
• Used by the GO:Termfinder http://www.yeastgenome.org/help/goTermFinder.html
• A correction for multiple hypothesis tests is used for these applications
  • More on this later

Functional Group Verification Using Gene Ontology
• Hypothesis: Proteins within the same group or pattern are more likely to perform the same function and participate in the same biological process
• Gene Ontology
  • Three separate ontologies: Biological Process, Molecular Function, Cellular Component
  • Organized as a DAG describing gene products (proteins and functional RNA)
  • Collaborative effort between major genome databases
  • http://www.geneontology.org

Multiple Hypothesis Testing
• Apparently interesting results can arise by random chance
  • Ex: A SNP might match the case-control labels relatively well by chance, especially if the number of cases is small
  • Ex: Given dozens of modules and hundreds of pathways, it is possible that one module will have an unusually high number of genes from one pathway by chance
• In other words, an apparent pattern or association can just be coincidence
• These results can lead to false conclusions and published results that cannot be replicated

Formalities
• In statistics, many problems are formulated as hypothesis testing:
  • H0: The null hypothesis
    For example, the means of two sets of values are the same.
  • H1: Alternative hypothesis
    For example, the means of two sets of values are different.
• A statistic that provides information about the null hypothesis must be computed
  • For example, to test whether two sets of values have the same mean, we can compute the t-statistic

Formalities ...
• The null hypothesis is evaluated by computing a statistic and seeing how likely the value of this statistic is given the null hypothesis
  • The distribution of the statistic under H0 is the null distribution
  • A region of extreme values (the critical region) is defined for the test statistic
  • The probability of the critical region under the null distribution is known as the size of the test, or α
• The key assumption is that an event that is unlikely under the null hypothesis is evidence that the null hypothesis is false
  • Thus, if the statistic falls in the critical region, it is “better” to assume that the null hypothesis is false rather than that something very unlikely happened.

Formalities ...
• Usually a significance level of 0.05 or 0.01 is used
• However, unlikely events do happen, especially when many tests are performed
  • For example, a run of ten heads in ten flips of a fair coin occurs only once in 1024 times, but if 10,000 people each flip a fair coin 10 times, there will be about 10 such runs
  • Given 20,000 genes, a case-control expression data set, and a significance level of 0.01, how many genes would you expect to be marked as significant by random chance?
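Both questions on this slide reduce to the same arithmetic: when every null hypothesis is true, the expected number of false positives is the number of tests times the per-test probability. The short sketch below works this out; the independence assumption in the second function is added for this example and is not claimed by the slides.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of tests called significant when all nulls are true."""
    return num_tests * alpha


def prob_at_least_one(num_tests, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** num_tests


coin_runs = expected_false_positives(10_000, 0.5 ** 10)  # runs of ten heads
gene_hits = expected_false_positives(20_000, 0.01)       # genes at alpha = 0.01
any_hit = prob_at_least_one(20_000, 0.01)

print(f"expected runs of ten heads: {coin_runs:.2f}")
print(f"expected 'significant' genes by chance: {gene_hits:.0f}")
print(f"probability of at least one false positive: {any_hit:.6f}")
```

So roughly 200 of the 20,000 genes would be flagged even if none were truly differentially expressed.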

Formalities …
• Rejecting the null hypothesis when it is true is a Type I error
  • This is also known as a false positive
• Accepting the null hypothesis when it is false is a Type II error
  • This is also known as a false negative

Connection to Evaluation
• When looking for markers in SNP data, each SNP is a test – can be hundreds of thousands
  • Situation is much worse if you consider pairs, triplets, etc.
• When evaluating under or over expression, each gene is a test – can be thousands
• When using gene sets, each gene set is a test
• When assessing modules and enrichment, the number of tests is the number of modules * the number of pathways, GO terms, etc.
  • This can be millions

Bonferroni Correction
• One simple way to address the problem of multiple testing is to use the Bonferroni correction
  • Divide the significance level by the number of tests
  • For example, if α = 0.01 and n = 10,000, the new significance level is α’ = 10^-6
• This is too strict since it tries to ensure that we don’t get any false positives
  • However, this means we get many false negatives
  • Interesting results are discarded
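A minimal sketch of the correction, with hypothetical function names and made-up p-values for illustration:

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that survives the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


slide_threshold = bonferroni_threshold(0.01, 10_000)   # the slide's example
toy_p_values = [0.0001, 0.004, 0.03, 0.2]              # illustrative values only
flags = bonferroni_significant(toy_p_values, alpha=0.05)

print(slide_threshold)
print(flags)
```

With only four tests the threshold is 0.0125, so the first two toy p-values pass. With hundreds of thousands of SNPs the threshold becomes so small that many real but modest effects are lost, which motivates the false discovery rate.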

False Discovery Rate
• More useful to control the proportion of incorrectly rejected null hypotheses
  • This can be done by considering the false discovery rate (FDR)
  • Note that FDR ≠ False Positive Rate
• FDR is the expected fraction of results declared significant that are actually false positives
  • How many bogus results are you willing to accept to get more interesting results? 1 out of 100? 1 out of 10? 1 out of 2?
  • Actual definition is a bit more complex

False Discovery Rate …
• Various techniques have been defined to compute the FDR
  • Given a set of p-values, the FDR can be computed for various significance levels
  • Choose the significance level that gives the desired FDR, if possible
• Key reference for FDR is Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289-300.
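One common way to turn a set of p-values into FDR estimates is the Benjamini-Hochberg step-up adjustment from the reference above. The sketch below is a plain Python version under the usual formulation (adjusted value = p * m / rank, made monotone from the largest p-value down); the function name and the toy p-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_p_values = [0.01, 0.04, 0.03, 0.005]
bh_adjusted = benjamini_hochberg(toy_p_values)
significant_at_5pct = [q <= 0.05 for q in bh_adjusted]
print(bh_adjusted)
print(significant_at_5pct)
```

Calling every test with an adjusted value at or below 0.05 significant aims for an FDR of at most 5% among those calls, under the assumptions of the original paper.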

q values
• Storey developed an approach to multiple hypothesis testing and FDR based on q values
  • If a test has a q value of 10%, then calling that test, and every test with a smaller q value, significant results in an FDR of about 10% among the significant features
  • Thus, a q value is computed for every test based on the set of all p-values for the tests
  • A user chooses the q value threshold to trade off the number of hypotheses detected against the false discovery rate
• Statistical significance for genome-wide studies, JD Storey, R. Tibshirani, Proc Natl Acad Sci, 100:9440–9445, 2003.
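The following is a simplified sketch of the q value idea, not the full procedure from the paper: it estimates the proportion of true nulls (pi0) from the p-values above a single fixed cutoff `lam`, whereas Storey and Tibshirani choose this tuning parameter more carefully. The function name, the cutoff of 0.5, the fallback when no p-value exceeds the cutoff, and the toy p-values are all assumptions made for illustration.

```python
def storey_q_values(p_values, lam=0.5):
    """Simplified q values using a single fixed lambda to estimate pi0."""
    m = len(p_values)
    # p-values from true nulls are roughly uniform, so the density of
    # p-values above lam estimates the fraction of true nulls.
    pi0 = sum(p > lam for p in p_values) / (m * (1.0 - lam))
    pi0 = min(1.0, pi0) if pi0 > 0 else 1.0  # conservative fallback (assumption)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q_values[idx] = running_min
    return pi0, q_values


toy_p_values = [0.01, 0.04, 0.03, 0.005, 0.6, 0.9]
pi0, q_values = storey_q_values(toy_p_values)
print(f"estimated pi0 = {pi0:.3f}")
print([round(q, 4) for q in q_values])
```

When pi0 is estimated as 1 this reduces to the Benjamini-Hochberg adjustment; a smaller pi0 makes the q values smaller, so more tests pass a given threshold.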

Example - FDR
    % Load the sample prostate cancer expression data
    load prostatecancerexpdata
    % Per-gene p-values from a permutation t-test
    pvalues = mattest(dependentData, independentData, 'permute', true);
    % FDR using the Benjamini-Hochberg procedure
    [fdr, q] = mafdr(pvalues, 'showplot', true, 'BHFDR', true);

Example - q values
    % Load the sample prostate cancer expression data
    load prostatecancerexpdata
    % Per-gene p-values from a permutation t-test
    pvalues = mattest(dependentData, independentData, 'permute', true);
    % FDR and q values
    [fdr, q] = mafdr(pvalues, 'showplot', true);

Randomization
• Another approach to multiple hypothesis testing is to use randomization
• Generate a null distribution for an entire set of values
• Example:
  • Randomize the case-control labels and compute the best (smallest) p-value over all SNPs
  • Do this 1000 times
  • This distribution of best p-values gives an overall bound on how good the best p-value can be by chance
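The label-randomization idea can be sketched as follows. To keep the example short and dependency-free it swaps in a simpler per-SNP score (the absolute difference in carrier frequency between cases and controls) in place of a per-SNP p-value, so the null distribution is built from the maximum score rather than the minimum p-value. The function names and the synthetic data are hypothetical.

```python
import random


def max_abs_freq_diff(genotypes, labels):
    """Largest |carrier freq in cases - carrier freq in controls| over SNPs."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    best = 0.0
    for snp in genotypes:  # each SNP is a list of 0/1 values, one per subject
        case_carriers = sum(g for g, y in zip(snp, labels) if y == 1)
        control_carriers = sum(g for g, y in zip(snp, labels) if y == 0)
        best = max(best, abs(case_carriers / n_cases - control_carriers / n_controls))
    return best


def permutation_p_value(genotypes, labels, n_perm=200, seed=0):
    """Compare the best observed score with the best score under shuffled labels."""
    rng = random.Random(seed)
    observed = max_abs_freq_diff(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_abs_freq_diff(genotypes, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Synthetic data: one SNP that matches the labels exactly plus 50 random SNPs.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 20
genotypes = [list(labels)] + [[data_rng.randint(0, 1) for _ in labels] for _ in range(50)]
observed, p_value = permutation_p_value(genotypes, labels)
print(f"best observed score: {observed:.2f}, permutation p-value: {p_value:.4f}")
```

Because the maximum is taken over all SNPs in every permutation, the resulting p-value already accounts for the number of SNPs tested.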

Example - SNPs

Randomization
• Also known as permutation tests
• Can also generate “randomized” or synthetic data
  • This is much harder since it is important to preserve structure
  • Example: A random set of binary vectors is not a realistic set of SNPs because it does not display linkage disequilibrium
• Best to randomize class labels if they are present, but not all problems fall in this category

Case-Control Gene Expression Studies
• Studies of gene expression data with case-control labels aim to identify which genes are over- or under-expressed
• Goal is to understand the difference in cell function between case and control by analyzing the differences in gene expression
• Basic approach is to analyze one gene at a time
  • Can use a t-test to compare expression values for a gene between cases and controls
  • If the p-value of the t-test is low, the gene is considered differentially expressed
  • Need multiple comparison correction, as discussed earlier
• However, looking at individual genes is problematic
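As a small illustration of the one-gene-at-a-time approach, the sketch below computes a t-statistic per gene and ranks genes by its magnitude. It uses the unequal-variance (Welch) form, which is a choice made for this example since the slides only say "t-test". The gene names and expression values are made up, and turning the statistic into a p-value would additionally need the t distribution from a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_statistic(cases, controls):
    """t-statistic for a difference in means, allowing unequal variances."""
    standard_error = sqrt(variance(cases) / len(cases) + variance(controls) / len(controls))
    return (mean(cases) - mean(controls)) / standard_error


# Hypothetical expression values: (cases, controls) per gene.
expression = {
    "geneA": ([5.1, 4.9, 5.3, 5.0], [4.0, 4.2, 3.9, 4.1]),
    "geneB": ([2.0, 2.3, 1.9, 2.1], [2.1, 2.0, 2.2, 1.9]),
}
t_stats = {gene: welch_t_statistic(cases, controls) for gene, (cases, controls) in expression.items()}
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)

for gene in ranked:
    print(f"{gene}: t = {t_stats[gene]:.2f}")
```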

Challenges in Analyzing Differential Expression
• Subramanian et al. describe four key issues with testing a single gene at a time
  • No individual gene may show statistically significant differential expression
  • No coherent pattern among significant genes, so biological interpretation is hard
  • Smaller, coordinated changes among related groups of genes are missed
  • Different studies produce different significant results
• A new approach, Gene Set Enrichment Analysis (GSEA), was proposed
  • Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles, Subramanian et al., PNAS, 2005.

Gene Set Enrichment Analysis (GSEA)
• Starting situation:
  • Gene expression matrix with case-control labels
• Compute a value for each gene that assesses how much expression values for that gene differ between cases and controls, e.g., correlation with class labels or t-test statistic
• Genes are ordered according to this value
• Ex. ALL cancer vs. AML cancer tissue
[Figure: gene expression matrix; labels: genes, cases, controls]

GSEA Algorithm
• Calculate the enrichment score of each gene set
  • Move down the ranked gene list and update a running sum
  • If the current gene is a member of the set, add a quantity whose magnitude depends on the gene’s correlation with the class variable
  • For non-members, subtract a penalty (a constant per non-member in Subramanian et al.)
  • The enrichment score (ES) is the maximum deviation from 0
• Estimate the significance level of the gene set enrichment values
  • Repeatedly permute the class labels (1000 times) and then calculate the enrichment values of the gene sets
  • The significance level of each gene set is determined by this null distribution
  • For example, with 1000 permutations, if the enrichment value of a gene set is greater than all but 5 of the enrichment values from the permutations, then its p-value is 0.005.
• Perform an adjustment for multiple hypothesis testing
• Code and gene sets available at http://www.broad.mit.edu/gsea/
Figure 1 from Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles, Subramanian et al., PNAS, 2005.
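The running-sum idea and the label permutation can be sketched in a few dozen lines. This is a simplified illustration, not the Broad Institute implementation: it ranks genes by a plain difference in means, weights hits by the absolute score, penalizes each miss by a constant, and computes a one-sided p-value for positive enrichment only. All function names and the synthetic data are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Mean expression in cases minus mean expression in controls."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def enrichment_score(expression, labels, gene_set):
    """Maximum deviation from zero of the running sum over the ranked genes."""
    scores = {g: mean_difference(v, labels) for g, v in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    hit_total = sum(abs(scores[g]) for g in ranked if g in gene_set)
    n_miss = sum(1 for g in ranked if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g in ranked:
        if g in gene_set:
            running += abs(scores[g]) / hit_total  # weighted step up on a hit
        else:
            running -= 1.0 / n_miss                # constant step down on a miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(expression, labels, gene_set, n_perm=200, seed=0):
    """Fraction of label permutations whose ES reaches the observed ES."""
    rng = random.Random(seed)
    observed = enrichment_score(expression, labels, gene_set)
    shuffled = list(labels)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(expression, shuffled, gene_set) >= observed:
            count += 1
    return observed, count / n_perm


# Synthetic data: 30 genes, 10 cases and 10 controls; genes g0..g4 are
# shifted upward in cases and form the gene set being tested.
data_rng = random.Random(42)
labels = [1] * 10 + [0] * 10
expression = {
    f"g{i}": [data_rng.gauss(3.0 if (i < 5 and y == 1) else 0.0, 1.0) for y in labels]
    for i in range(30)
}
gene_set = {f"g{i}" for i in range(5)}
observed_es, p_value = permutation_p_value(expression, labels, gene_set)
print(f"ES = {observed_es:.3f}, permutation p-value = {p_value:.3f}")
```

Because the genes in a set move together when labels are permuted, the null distribution reflects their correlation, which is one reason class labels are permuted here instead of shuffling gene names.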

Other Approaches
• Gene Set Analysis (GSA)
  • Efron B, Tibshirani R: On testing the significance of sets of genes. The Annals of Applied Statistics 2007, 1:107-129. http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1183143731
  • Different approach, but same goal – identify sets of genes that are different from a random set of genes in terms of over or under expression
  • It can have better statistical power than GSEA
  • Implemented in R
  • Code and gene sets available at http://www-stat.stanford.edu/~tibs/GSA/

GSA Algorithm
• For each gene, compute a t-statistic or some other summary statistic
• For each gene set, compute the mean or some other summary statistic of its genes’ values
• Use randomization across genes to generate a null distribution, and use the mean and standard deviation of this distribution to standardize the gene set value from the previous step
• Use permutation of the class labels to generate another null distribution, and estimate the p-values of the standardized gene set values
• Apply multiple hypothesis corrections
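The standardization step (randomization across genes) can be sketched as below. This only illustrates that one step with a plain mean as the set statistic; it is not the GSA package, and the label-permutation step is omitted. The function name, the number of random sets, and the per-gene statistics are made up for the example.

```python
import random
import statistics


def standardized_set_score(gene_stats, gene_set, n_random=1000, seed=0):
    """Standardize a gene set's mean statistic against random sets of equal size."""
    rng = random.Random(seed)
    genes = list(gene_stats)
    observed = statistics.mean(gene_stats[g] for g in gene_set)
    # Null distribution: mean statistic of randomly drawn gene sets.
    null_means = [
        statistics.mean(gene_stats[g] for g in rng.sample(genes, len(gene_set)))
        for _ in range(n_random)
    ]
    return (observed - statistics.mean(null_means)) / statistics.stdev(null_means)


# Hypothetical per-gene statistics: five strongly shifted genes, 95 near zero.
gene_stats = {f"g{i}": (3.0 if i < 5 else 0.1 * ((i % 7) - 3)) for i in range(100)}
gene_set = [f"g{i}" for i in range(5)]
z_score = standardized_set_score(gene_stats, gene_set)
print(f"standardized gene set score: {z_score:.2f}")
```

Standardizing against random gene sets answers "is this set unusual compared with other genes?", while the later label permutation answers "is this set associated with the class labels?".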

Sub-GSE
• GSEA, GSA and other approaches test the association of all genes in the set with the class labels.
  • In reality, only a subset of the genes may have an association
• Gene set enrichment by testing subset association (Sub-GSE) looks at subsets
  • Uses only top-k subsets of a gene set, obtained by ordering genes according to t-test, correlation with class labels, etc.
• Comparisons with GSA and GSEA show greater sensitivity in some cases
• Program available at http://www-rcf.usc.edu/~fsun/Programs/SubGSEWebPages/SubGSEMain.html
• Testing gene set enrichment for subset of genes: Sub-GSE, Xiting Yan and Fengzhu Sun, BMC Bioinformatics 2008, 9:362, doi:10.1186/1471-2105-9-362
