Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics kjarcher@vcu.edu
Basis of Cancer Diagnosis • Pathologist makes an interpretation based upon a compendium of knowledge which may include • Morphological appearance of the tumor • Histochemistry • Immunophenotyping • Cytogenetic analysis • etc.
Improved Cancer Diagnosis: Identify sub-classes • Divide morphologically similar tumors into different groups based on response. • Application of microarrays: Characterize molecular variations among tumors by monitoring gene expression • Goal: microarrays will lead to more reliable tumor classification and sub-classification (therefore, more appropriate treatments will be administered resulting in improved outcomes)
Distinguishing two types of acute leukemia (AML vs. ALL) • Golub, T.R. et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537. • http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi (near bottom of page)
Distinguishing AML vs. ALL • 38 BM samples (27 childhood ALL, 11 adult AML) were hybridized to Affymetrix GeneChips • GeneChip included 6,817 human genes. • Affymetrix MAS 4.0 software was used to perform image analysis. • MAS 4.0 Average Difference expression summary method was applied to the probe level data to obtain probe set expression summaries. • Scaling factor was used to normalize the GeneChips. • Samples were required to meet quality control criteria.
Distinguishing AML vs. ALL • Class comparison • Neighborhood analysis • Class prediction • Weighted voting
Class Discovery: Distinguishing AML vs. ALL • The mean of a random variable $X$ is a measure of central location of the density of $X$. • The variance of a random variable is a measure of spread or dispersion of the density of $X$. • $\mathrm{Var}(X) = E[(X - \mu)^2]$, estimated from a sample of size $n$ by $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$ • Standard deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$
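As a small illustration of the sample mean, the sample variance with the n - 1 divisor, and the standard deviation, here is a minimal Python sketch. The numbers in `values` are made up for the example and are not taken from the leukemia data.

```python
import math

def sample_mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance: sum of squared deviations divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

# Hypothetical expression values for one gene (illustration only).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(values), sample_variance(values), sample_sd(values))
```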
Class Discovery: Distinguishing AML vs. ALL • For each gene, compute the log of the expression values. For a given gene $g$: • For ALL, let $\mu_{\mathrm{ALL}}(g)$ represent the mean log expression value and $\sigma_{\mathrm{ALL}}(g)$ the standard deviation of the log expression values. • For AML, let $\mu_{\mathrm{AML}}(g)$ represent the mean log expression value and $\sigma_{\mathrm{AML}}(g)$ the standard deviation of the log expression values.
Class Discovery: Distinguishing AML vs. ALL • Illustration using ALL AML example.xls
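Class Discovery: Distinguishing AML vs. ALL • For each gene, compute a relative class separation (quasi-correlation measure) as follows: $P(g,c) = \dfrac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}$, where $\mu_k(g)$ and $\sigma_k(g)$ are the mean and standard deviation of the log expression values of gene $g$ in class $k$ (the measure used by Golub et al. 1999) • Define neighborhoods of radius $r$ about classes 1 and 2 such that $P(g,c) > r$ or $P(g,c) < -r$. $r$ was chosen to be 0.3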
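The relative class separation can be sketched in a few lines of Python. This assumes the Golub et al. definition, the difference of the class means of the log expression values divided by the sum of the class standard deviations. The function names and the toy expression values are hypothetical. The natural log is used here; because the measure is a ratio, the choice of log base does not change its value.

```python
import math
from statistics import mean, stdev

def class_separation(class1, class2):
    """Relative class separation (mean1 - mean2) / (sd1 + sd2) on log values."""
    log1 = [math.log(x) for x in class1]
    log2 = [math.log(x) for x in class2]
    return (mean(log1) - mean(log2)) / (stdev(log1) + stdev(log2))

def in_neighborhood(p, r=0.3):
    """True if the gene lies in the neighborhood of radius r of either class."""
    return p > r or p < -r

# Hypothetical raw expression values for one gene (illustration only).
class1_values = [100.0, 120.0, 110.0]
class2_values = [1000.0, 900.0, 1100.0]

p = class_separation(class1_values, class2_values)
print(p, in_neighborhood(p))
```

In this toy example the gene is much more highly expressed in class 2, so the statistic is strongly negative and its absolute value exceeds 1.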
Aside • This differs from Pearson’s correlation and is therefore not confined to the [-1,1] interval
Class Discovery: Distinguishing AML vs. ALL • A permutation test was used to calculate whether the observed number of genes in a neighborhood was significantly higher than expected.
Permutation based methods • Permutation based adjusted p-values • Under the complete null, the joint distribution of the test statistics can be estimated by permuting the columns of the gene expression matrix • Permuting entire columns creates a situation in which membership in the Class 1 and Class 2 groups is independent of gene expression but preserves the dependence structure between genes
Permutation based methods • Permutation algorithm for the $b$th permutation, $b = 1, \ldots, B$ • 1) Permute the $n$ class labels (equivalently, the columns) of the data matrix $X$ • 2) Compute the relative class separation $P(g_1,c)_b, \ldots, P(g_p,c)_b$ for each gene $g_i$. • The permutation distribution of the relative class separation for gene $g_i$, $i = 1, \ldots, p$, is given by the empirical distribution of $P(g_i,c)_1, \ldots, P(g_i,c)_B$.
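The permutation steps above can be sketched as follows. This is a minimal illustration, not the procedure as implemented by Golub et al.: the data are simulated, the function names are hypothetical, the expression values are assumed to be on the log scale already, and the p-value is computed as the share of permutations whose neighborhood count is at least the observed count (with the common add-one correction).

```python
import random
from statistics import mean, stdev

def separation(values, labels):
    """Relative class separation for one gene, given class labels 1 and 2."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 2]
    return (mean(g1) - mean(g2)) / (stdev(g1) + stdev(g2))

def neighborhood_count(matrix, labels, r):
    """Number of genes (rows) whose separation lies outside [-r, r]."""
    return sum(1 for row in matrix if abs(separation(row, labels)) > r)

def permutation_p_value(matrix, labels, r=0.3, n_perm=200, seed=0):
    """Permute the sample labels and compare neighborhood counts."""
    rng = random.Random(seed)
    observed = neighborhood_count(matrix, labels, r)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # whole columns move together, genes stay linked
        if neighborhood_count(matrix, shuffled, r) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Simulated data (assumption): 20 genes x 10 samples, first 10 genes shifted in class 2.
rng = random.Random(1)
labels = [1] * 5 + [2] * 5
matrix = []
for g in range(20):
    shift = 8.0 if g < 10 else 0.0
    matrix.append([rng.gauss(0.0, 1.0) + (shift if lab == 2 else 0.0) for lab in labels])

observed, p_value = permutation_p_value(matrix, labels)
print(observed, p_value)
```

Shuffling the label vector is equivalent to permuting whole columns of the matrix, so the dependence between genes within a sample is left intact.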
Distinguishing AML vs. ALL • Class comparisons using neighborhood analysis revealed that approximately 1,100 genes were more highly correlated with class (AML or ALL) than would be expected by chance.
Class Prediction: Distinguishing AML vs. ALL • Let $\mu_{\mathrm{ALL}}$ represent the mean expression value for ALL • Let $\mu_{\mathrm{AML}}$ represent the mean expression value for AML • For a set of informative genes, each expression value $x_i$ votes for either ALL or AML, depending on whether it is closer to $\mu_{\mathrm{ALL}}$ or $\mu_{\mathrm{AML}}$ • Informative genes were the $n/2$ genes with the largest $P(g,c)$ and the $n/2$ genes with the smallest (most negative) $P(g,c)$; here $n$ is the number of informative genes, not the number of samples • Golub et al. chose $n = 50$
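Selecting the informative genes amounts to ranking genes by their class separation and taking both tails. A minimal sketch, assuming the scores are held in a dictionary keyed by gene name; the function name, gene names, and scores below are hypothetical.

```python
def select_informative(scores, n=50):
    """Return the n/2 genes with the largest and the n/2 with the smallest scores."""
    ranked = sorted(scores, key=scores.get)  # ascending by score
    half = n // 2
    top = ranked[-half:][::-1]               # largest scores first
    bottom = ranked[:half]                   # most negative scores first
    return top, bottom

# Hypothetical class separation scores (illustration only).
scores = {"g1": 2.1, "g2": -1.8, "g3": 0.1, "g4": -0.4, "g5": 1.2, "g6": -2.5}
top, bottom = select_informative(scores, n=4)
print(top, bottom)
```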
Class Prediction: Distinguishing AML vs. ALL • $w_i$ is a weighting factor that reflects how well the gene is correlated with class distinction; $w_i v_i$ is the weighted vote • For each sample, the weighted votes for each class are summed to get $V_{\mathrm{ALL}}$ and $V_{\mathrm{AML}}$ • The sample is assigned to the class with the higher total, provided the Prediction Strength (PS) > 0.3, where $PS = (V_{\mathrm{win}} - V_{\mathrm{lose}}) / (V_{\mathrm{win}} + V_{\mathrm{lose}})$
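The weighted voting rule can be sketched as below. The slide does not spell out the weight or the vote, so this sketch makes two assumptions, based on the scheme described by Golub et al.: the weight of a gene is its class separation with ALL as the first class, and the vote is the distance of the sample's expression value from the midpoint of the two class means. All names and numbers are hypothetical.

```python
def predict(sample, genes, ps_threshold=0.3):
    """Weighted voting over informative genes; returns (label, prediction strength)."""
    v_all = v_aml = 0.0
    for x, g in zip(sample, genes):
        midpoint = (g["mu_all"] + g["mu_aml"]) / 2
        vote = g["weight"] * (x - midpoint)  # assumed form of the weighted vote
        if vote > 0:
            v_all += vote                    # value is closer to the ALL mean
        else:
            v_aml += -vote                   # value is closer to the AML mean
    v_win, v_lose = max(v_all, v_aml), min(v_all, v_aml)
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total > 0 else 0.0
    if ps <= ps_threshold:
        return "uncertain", ps               # margin too small to call
    return ("ALL" if v_all > v_aml else "AML"), ps

# Hypothetical informative genes: class means and weights (illustration only).
genes = [
    {"mu_all": 3.0, "mu_aml": 2.0, "weight": 1.5},
    {"mu_all": 2.0, "mu_aml": 3.2, "weight": -2.0},
    {"mu_all": 2.5, "mu_aml": 2.4, "weight": 0.2},
]

clear_label, clear_ps = predict([3.1, 2.1, 2.3], genes)
mixed_label, mixed_ps = predict([2.7, 2.75, 2.45], genes)
print(clear_label, clear_ps, mixed_label, mixed_ps)
```

The second sample splits its votes almost evenly between the two classes, so its prediction strength falls below the threshold and no call is made.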
Class Prediction: Distinguishing AML vs. ALL • Checking model adequacy • Cross-validation of training dataset • Applied model to an independent dataset of 34 samples
Class Discovery • Determine whether the samples can be divided based only on gene expression without regard to the class labels • Self-organizing maps
Hypothesis Testing • The hypothesis that two means $\mu_1$ and $\mu_2$ are equal is called a null hypothesis, commonly abbreviated $H_0$. • This is typically written as $H_0: \mu_1 = \mu_2$ • Its antithesis is the alternative hypothesis, $H_A: \mu_1 \neq \mu_2$
Hypothesis Testing • A statistical test of hypothesis is a procedure for assessing the compatibility of the data with the null hypothesis. • The data are considered compatible with H0 if any discrepancy from H0 could readily be due to chance (i.e., sampling error). • Data judged to be incompatible with H0 are taken as evidence in favor of HA.
Hypothesis Testing • If the sample means calculated are identical, we would suspect the null hypothesis is true. • Even if the null hypothesis is true, we do not really expect the sample means to be identically equal because of sampling variability. • We would feel comfortable concluding the data are compatible with H0 if the difference in the sample means does not exceed a couple of standard errors.
T-test • In testing $H_0: \mu_1 = \mu_2$ against $H_A: \mu_1 \neq \mu_2$, note that we could have restated the hypotheses as $H_0: \mu_1 - \mu_2 = 0$ and $H_A: \mu_1 - \mu_2 \neq 0$ • To carry out the t-test, the first step is to compute the test statistic and then compare the result to a t-distribution with the appropriate degrees of freedom (df)
T-test • Data must be independent random samples from their respective populations • Sample size should either be large or, in the case of small sample sizes, the population distributions must be approximately normally distributed. • When assumptions are not met, non-parametric alternatives are available (Wilcoxon Rank Sum/Mann-Whitney Test)
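As an illustration of the first step of the t-test, the sketch below computes a two-sample t statistic and its degrees of freedom. The slides do not say which form of the test was used; the pooled (equal-variance) form, the function name, and the data here are assumptions made for the example.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(x1), len(x2)
    # Pooled estimate of the common variance.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    # Standard error of the difference in sample means.
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / se, n1 + n2 - 2

# Hypothetical log expression values for one probe set (illustration only).
group1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group2 = [4.1, 4.5, 4.0, 4.8, 4.6]

t_stat, df = pooled_t(group1, group2)
print(t_stat, df)
```

The p-value is then read from a t-distribution with `df` degrees of freedom, using a table or a statistics library, since the Python standard library does not provide that distribution.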
T-test: Probe set 208680_at P=0.039