Impact of the Choice of Expression Metric on the Standard Statistical Analysis of Oligonucleotide Microarray Data David Elashoff UCLA Department of Biostatistics
Outline • Introduction to Affymetrix Microarrays • Data Preprocessing Methods • Within metric comparisons • Between metric comparisons • Results of a standard statistical analysis
DNA Microarrays in Publications • 1080 papers on the analysis of microarray data. (1997-2005) • 8052 papers specific to gene expression microarrays. (1995-2005)
Data Preprocessing • Five major techniques (MAS 4.0, MAS 5.0, dChip (PM-only, Diff), RMA) plus a number of newer techniques (SUM, PDNN, GCRMA, others) • Currently there is no agreement in the scientific community as to which method should be used.
Our project • How much does the choice of method impact the results of the data analysis? • Methods: 14 human U133A data sets, each a two-group comparison. • For each data set, compute the expression indices using each of the 5 data preprocessing techniques. • With each of these 70 resulting data sets, perform a standard statistical analysis and compare the results. (~30 million values)
Within Metric Comparison • The first step is to examine how the different methods perform. • We compute for each gene in each data set, using each method: 1. Two-sample t-statistic 2. Fold change 3. Overall mean 4. Sp: the pooled within-group variance estimator.
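A minimal pure-Python sketch of these per-gene statistics (the slides show no code; the function and variable names are illustrative, and fold change is taken here as a difference of group means on the log2 scale, which is an assumption):

```python
import math

def gene_stats(group1, group2):
    """Per-gene summaries on log2 expression values: two-sample t,
    fold change, overall mean, and the pooled within-group Sp.
    Hypothetical helper mirroring the four statistics in the slides."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    # Pooled within-group variance estimator Sp^2
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    fold_change = m1 - m2  # difference of means on the log2 scale
    overall_mean = (sum(group1) + sum(group2)) / (n1 + n2)
    return t, fold_change, overall_mean, math.sqrt(sp2)
```

These four quantities are what the within-metric scatter plots below compare across methods.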
Scatter plots between rank percentage of mean expression of all genes in the two groups
Within Metric Comparison Conclusion • RMA appears to have a number of properties that make it a better estimator. 1. Uncorrelated mean and test statistic. 2. Uncorrelated mean and standard deviation. 3. Less difference in ranks overall between groups.
Plots and Spearman’s correlations of Mean expression measure
Hierarchical tree of expression values based on their correlations; a) tree shape is for all data sets
Standard Statistical Analysis • Two main components • Gene Filtering • Clustering/Classification • Gene Filtering uses combinations of comparison statistics to identify a small number of differentially expressed genes • Clustering/Classification uses combinations of genes to develop prediction models.
Gene Filtering • Wide variety of criteria • Statistical Tests (t-test / ANOVA / Regression / Survival) • Fold Change(FC), • Confidence interval for the fold change • Absent/Present Call • Absolute Difference • Significance Analysis of Microarrays (SAM) • Much literature on controlling false positive rates or false discovery rates (FDR).
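The filtering criteria above are typically combined. A hypothetical sketch that keeps genes passing both a t-statistic cut-off and a fold-change cut-off (function name, input format, and thresholds are all illustrative, not taken from the slides):

```python
def filter_genes(stats, t_cutoff=2.0, fc_cutoff=1.0):
    """Keep genes passing both a |t| and a |log2 fold change| cut-off.

    `stats` maps gene id -> (t statistic, log2 fold change).
    Illustrative combination of two of the criteria listed above.
    """
    return [gene for gene, (t, fc) in stats.items()
            if abs(t) >= t_cutoff and abs(fc) >= fc_cutoff]
```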
Hierarchical tree of t-statistics based on their correlations; a) tree shape is for five data sets b) and c) in one set each
Assessing Agreement between methods • For each data set, each method, and each test statistic (t-statistic and fold change), we find the subset of genes with the 200 largest t-statistic or FC values.
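The agreement between two methods' top-k gene sets can be sketched as the percent overlap of the two lists (illustrative names; the slides use k = 200):

```python
def top_k_overlap(scores_a, scores_b, k=200):
    """Percent agreement between the top-k gene sets of two methods.

    `scores_a` and `scores_b` map gene id -> selection statistic
    (e.g. |t| or |fold change|); larger means more significant.
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return 100.0 * len(top_a & top_b) / k
```

Computing this for every pair of expression metrics gives the agreement matrices shown below.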
Average % (Min~Max) of significant genes by the cut-off value used for testing
Agreement of Gene Lists The matrix of the average % agreement on the most significant 200 genes identified by the t-statistics of each expression measure
Average agreement over 5 expression metrics on the top 1000 (4.5%) significant genes – Each cell gives the number of genes (% of genes) agreed upon by the column's number of expression metrics when the row's statistic is used.
Plots of rank percentage of mean expression of 200 significant gene sets in the two groups
Gene Filtering Conclusions • This is a nightmare in terms of reproducibility. • Overall, the methods are not identifying genes in different regions of the expression spectrum. • We know that all methods produce gene lists that can be confirmed via RT-PCR* *Rosati, B., Frau, F., Kuehler, A., Rodriguez, S. and McKinnon, D. (2004) Comparison of different probe-level analysis techniques for oligonucleotide microarrays. BioTechniques, 36(2):316-322
SAM • Can we compare the “quality” of the results between methods? • SAM is based on the permutation test. • Using a variable cut-off, it computes the FDR for varying numbers of “significant genes”. • It does not function well on all data sets: of the 40 data sets we used, 25 gave results and 18 had sufficient sample size for it to work well.
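The permutation idea behind SAM's FDR can be sketched very roughly as follows. This is a simplified sketch, not the SAM algorithm itself (SAM also uses a fudge factor in its modified t-statistic, omitted here); `perm_t_fn` is a hypothetical hook that recomputes the statistics under one permutation of the group labels:

```python
def permutation_fdr(observed_t, perm_t_fn, cutoff, n_perm=100):
    """Rough permutation-based FDR estimate in the spirit of SAM.

    observed_t: list of |t| values, one per gene.
    perm_t_fn:  callable returning a list of |t| values computed after
                permuting the group labels (hypothetical hook).
    FDR ~ (mean number of null genes past the cut-off)
          / (number of observed genes past the cut-off).
    """
    n_called = sum(1 for t in observed_t if t >= cutoff)
    if n_called == 0:
        return 0.0
    null_counts = [sum(1 for t in perm_t_fn() if t >= cutoff)
                   for _ in range(n_perm)]
    return (sum(null_counts) / n_perm) / n_called
```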
Clustering/Classification • Currently there is a huge literature on the application of every multivariate statistical method to the analysis of microarray data. • The techniques fall into two philosophical categories, unsupervised and supervised learning. • Typically we want to determine whether the microarray data can produce a classifier that correctly predicts the true classes. • There is no clear agreement on how many genes should be used for these methods
Unsupervised Learning (Clustering) • In each of the five expression indices for each of the seven data sets, samples are partitioned into two groups using the K-means clustering (we set K=2). • For the K-means clustering we use various subsets of the 22283 genes corresponding to typical gene filtering criteria. • 1) the subset of genes that are present in at least one sample (typically 5000-15000 genes) • 2) the subset of 5000 with the largest coefficient of variation (CV) • 3) the subset with the top 1000 CVs. • The Rand index is used to measure the level of agreement between predicted group assignment and the true group information.
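The Rand index used above measures, over all pairs of samples, how often two partitions agree on whether the pair belongs together. A minimal sketch (the clustering step itself is omitted; only the agreement measure is shown):

```python
def rand_index(labels_a, labels_b):
    """Rand index: fraction of sample pairs on which two partitions
    agree (both place the pair in the same group, or both separate it)."""
    n = len(labels_a)
    agree = pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            if same_a == same_b:
                agree += 1
    return agree / pairs
```

Note that the index is invariant to relabeling the clusters, which matters here because K-means assigns arbitrary group numbers.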
Supervised Learning (Classification) • We use a standard classification method, k-nearest-neighbor (KNN) classification, to assess the ability of the methods to produce gene expression information that can accurately classify the samples from each data set. • 1. Exclude one sample to be used as a test case. Next, we find the top x genes based on t-statistics computed from the remaining samples. • 2. These genes form a new x-dimensional space. Within that space, we compute the Euclidean distance between the left-out sample and all other samples. • 3. The KNN classification rule assigns the left-out sample to the class represented by a majority of its ‘k’ (k=3) nearest neighbors. • 4. This process is then iterated for each of the samples in the data set.
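Steps 2-4 of the leave-one-out procedure above can be sketched as follows. This is a simplified sketch: the per-fold re-selection of the top x genes in step 1 is omitted, and the samples are assumed to be already restricted to a fixed gene set:

```python
import math
from collections import Counter

def knn_loocv_accuracy(samples, labels, k=3):
    """Leave-one-out cross-validated accuracy of a k-nearest-neighbor
    rule with Euclidean distance. Each sample is a tuple of expression
    values over a (pre-selected) gene set; names are illustrative."""
    correct = 0
    for i, x in enumerate(samples):
        # Distances from the left-out sample to every other sample
        dists = [(math.dist(x, y), labels[j])
                 for j, y in enumerate(samples) if j != i]
        # Majority vote among the k nearest neighbors
        neighbors = [label for _, label in sorted(dists)[:k]]
        predicted = Counter(neighbors).most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(samples)
```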
Leave-One-Out Cross-Validation in KNN (k=3) –Accuracy rate (%) with n-1 genes
Leave-One-Out Cross-Validation in KNN (k=3) –Accuracy rate (%) with top 1000 genes
Clustering Conclusions • No method gives consistently better results. • The number of genes used does not seem to matter. • There is an apparent advantage for MAS5, although not enough to make any real conclusions. • It is interesting that each method appears to be producing information that can appropriately group the samples.
Final Conclusions • There is a large difference in the gene filtering results from each method. • We have no reason to think that one method is giving results that are more biologically relevant than another. • What do we do now?
Acknowledgements • Myungshin Oh • Fiona O’Kirwan • Nik Brown • Steve Horvath