Myths and Statistical Principles in DNA Microarray Research. Richard Simon, D.Sc., Chief, Biometric Research Branch; Head, Molecular Statistics & Bioinformatics; National Cancer Institute
All cells of a multi-cellular organism contain essentially the same DNA • Cells differ in function based on the spectra of which genes are expressed and the level of expression • Proteins do the work of cells and gene expression determines the intra-cellular concentration of proteins • mRNA is an intermediate product of gene expression; a gene is transcribed into an mRNA molecule which is then translated into a protein molecule
Types of DNA Microarrays • mRNA transcript quantification • Genomic DNA sequence determination • SNP identification • Genotyping • Detecting gene deletions or gene duplications
Types of Microarrays • DNA microarrays • Tissue microarrays • Protein microarrays
Biology in Transition • Biotechnology • Restriction enzymes • Ligases • Polymerases • PCR • Instruments, Tools, Reagents and Information Resources of Major Impact • DNA sequencing • Functional whole genomic assays
How to Deal With the Plethora of Data • Development of software tools • Training of biologists to use tools • Collaboration with mathematical & computational scientists • Training of mathematical & computational scientists
Bioinformatics • An ambiguous term that helps further confuse people who are sometimes already confused • Refers to a range of activities, all of which involve multi-disciplinary collaboration among biological, mathematical, and computational scientists and software engineers • Organizations are searching for structures that will support quality inter-disciplinary research in bioinformatics
Organizing for Bioinformatics • Collaborative, not service oriented • Enable extensive interaction and education • Enable scientists to be stimulated by important problems and to accomplish organizational and personal goals in solving them
Molecular Statistics & Bioinformatics Section • Utilize mathematical and computational sciences in conjunction with data from genomics & high-throughput technologies to elucidate the biological basis of cancer, and translate this into effective means of eradicating cancer • Train statisticians, mathematicians, physical and biological scientists in cancer computational biology
Microarray Research • Collaborative data analysis • Methodology development • Software development
Microarray Myths • That the greatest challenge is managing the mass of microarray data • That pattern recognition or data mining is the most appropriate paradigm for the analysis of microarray data • That pre-packaged analysis tools are a substitute for collaboration with statistical scientists in complex problems • That statistical collaboration can be a service function • That statisticians can be effective collaborators without substantial knowledge of biology and microarray technology
Applications of DNA Microarrays to Cancer Research • Identify genes and pathways involved in oncogenesis • Transgenic mouse models • Profiling pre-cancerous lesions • Identifying molecular targets for therapeutics and early detection
Applications of DNA Microarrays to Cancer Research • Diagnostic classification • For identifying disease subsets with distinctive pathogenesis • For selecting therapy • Large cell lymphoma • Stage I breast cancer
DNA Microarray Analytics • Design issues: arrays, specimens, labeling, replication • Image analysis: pixels to feature • Feature analysis: background adjustment, normalization • Features to genes: normalization • Analysis of biological objectives
Method of Analysis Should Be Tailored to Objectives • Class discovery • Identifying expression profiles characteristic of non-predefined subsets of tumors • Class/phenotype prediction • Identifying expression profiles that distinguish predefined subsets of tumors
Components of Class Prediction • Establish that expression “profiles” differ to a statistically significant degree and that differences observed are not due to examination of thousands of genes • Identify genes that account for the differences between classes • Develop multi-gene classifier to predict the class for a new sample and estimate the mis-classification rates
Do Expression Profiles Differ for Two Defined Classes of Arrays? • Not a clustering problem • Global similarity measures generally used for clustering arrays may not distinguish classes • Supervised vs unsupervised methods • Requires multiple biological samples from each class
Do Expression Profiles Differ for Two Defined Classes of Arrays? • Global test • Number of genes significantly differentially expressed among classes at specified nominal significance level • Cross-validated mis-classification rate • Multiple comparison adjustment for finding differentially expressed genes • Experiment-wise error • Univariate screening with p<0.001 threshold • False discovery rate
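To make the multiple-comparison bullets concrete: with 6000 genes and no true differences, screening at p < 0.001 is expected to flag about 6000 × 0.001 = 6 genes by chance. The sketch below also shows one common way to control the false discovery rate, the Benjamini-Hochberg step-up procedure. The slide does not name a specific FDR method, so that choice, the function name `benjamini_hochberg`, and the example p-values are assumptions for illustration only.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q (BH step-up rule)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])       # largest rank meeting its bound
        reject[order[:k + 1]] = True            # reject everything up to that rank
    return reject

# Expected number of chance findings under univariate screening (illustrative sizes).
n_genes, alpha = 6000, 0.001
expected_false_positives = n_genes * alpha

# Hypothetical p-values, only to show the procedure.
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
rejected = benjamini_hochberg(example_p, q=0.05)
print(expected_false_positives, rejected.tolist())
```

The fixed p < 0.001 threshold bounds the expected count of false positives, while the FDR approach bounds the expected proportion of false positives among the genes reported.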
[Figure: log-expression ratios by specimen, shown as the full data set and as a training set plus test set] • Non-cross-validated Prediction: 1. Prediction rule is built using the full data set. 2. Rule is applied to each specimen for class prediction. • Cross-validated Prediction (leave-one-out method): 1. Full data set is divided into training and test sets (test set contains 1 specimen). 2. Prediction rule is built using the training set. 3. Rule is applied to the specimen in the test set for class prediction. 4. Process is repeated until each specimen has appeared once in the test set.
Prediction on Simulated Null Data • Generation of Gene Expression Profiles • 14 specimens ($P_i$ is the expression profile for specimen $i$) • Log-ratio measurements on 6000 genes • $P_i \sim \mathrm{MVN}(0, I_{6000})$ • Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)? • Prediction Method • Compound covariate prediction (discussed later) • Compound covariate built from the log-ratios of the 10 most differentially expressed genes.
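The null-data experiment can be sketched in a few lines. This is a minimal illustration, not the authors' code: the pooled-variance t-statistic, the 0/1 class coding, the nearest-mean form of the midpoint rule, the random seed, and all function names are assumptions. The key point it shows is that gene selection must be repeated inside each leave-one-out step; otherwise the left-out specimen has already influenced the rule.

```python
import numpy as np

def t_stats(X, y):
    """Pooled-variance two-sample t-statistic for every gene (column of X)."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1) +
              (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

def fit_ccp(X, y, n_genes=10):
    """Select the n_genes with largest |t| and build the compound covariate."""
    t = t_stats(X, y)
    genes = np.argsort(-np.abs(t))[:n_genes]
    scores = X[:, genes] @ t[genes]             # t-weighted sum of log-ratios
    return genes, t[genes], scores[y == 0].mean(), scores[y == 1].mean()

def predict_ccp(model, x):
    """Assign x to the class whose mean compound covariate is closer."""
    genes, weights, mean0, mean1 = model
    score = x[genes] @ weights
    return int(abs(score - mean1) < abs(score - mean0))

rng = np.random.default_rng(0)
X = rng.standard_normal((14, 6000))             # null data: no real class difference
y = np.array([0] * 7 + [1] * 7)

# Non-cross-validated: rule built on all 14 specimens, then applied to them.
full_model = fit_ccp(X, y)
resub_errors = sum(predict_ccp(full_model, X[i]) != y[i] for i in range(14))

# Leave-one-out: gene selection and rule building repeated without specimen i.
loo_errors = 0
for i in range(14):
    keep = np.arange(14) != i
    model = fit_ccp(X[keep], y[keep])
    loo_errors += predict_ccp(model, X[i]) != y[i]

print(f"resubstitution errors: {resub_errors}/14, leave-one-out errors: {loo_errors}/14")
```

On data like these the non-cross-validated error count is typically close to zero even though the classes are pure noise, while the leave-one-out count tends to sit near chance level.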
Exact Permutation Test • Premise: Under the null hypothesis of no systematic difference in expression profiles between the two classes, it can be assumed that assignment of class labels to expression profiles is purely coincidental. • Performing the test: 1. Consider every possible permutation of the class labels among the gene expression profiles. 2. Determine the proportion of the permutations that result in a misclassification error rate less than or equal to the observed error rate. 3. This proportion is the achieved significance level in a test of the null hypothesis.
Monte Carlo Permutation Test • Examining all permutations is computationally burdensome. • Instead, a Monte Carlo method is used… • nperm permutations of the labels are randomly generated. • The proportion of these permutations that have m or fewer misclassifications is an estimate of the achieved significance level in a test of the null hypothesis. • nperm is chosen such that the variability in the estimate is less than an acceptable level. • If the true proportion of permutations with m ≤ 2 is 0.05, nperm = 2000 ensures the coefficient of variation of the estimate of the achieved significance level is less than 0.1.
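The three steps can be sketched for a small data set where full enumeration is cheap. This is an illustrative sketch only: the nearest-centroid classifier stands in for whatever prediction rule is being tested, the toy data (8 specimens, 5 genes, a strong shift between classes) are invented, and the function names are hypothetical. Because the class sizes are fixed, enumerating label permutations reduces to enumerating which specimens receive the class-1 label.

```python
from itertools import combinations
import numpy as np

def loo_errors(X, y):
    """Leave-one-out misclassification count for a nearest-centroid rule."""
    n = len(y)
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        c0 = X[train & (y == 0)].mean(axis=0)
        c1 = X[train & (y == 1)].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        errors += pred != y[i]
    return errors

def exact_permutation_pvalue(X, y):
    """Proportion of all label assignments with error count <= the observed one."""
    n, n1 = len(y), int(y.sum())
    observed = loo_errors(X, y)
    hits = total = 0
    for ones in combinations(range(n), n1):     # every distinct relabeling
        y_perm = np.zeros(n, dtype=int)
        y_perm[list(ones)] = 1
        hits += loo_errors(X, y_perm) <= observed
        total += 1
    return observed, hits / total, total

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
X[4:] += 5.0                                    # toy data with a strong class effect
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

observed, p_value, n_assignments = exact_permutation_pvalue(X, y)
print(observed, p_value, n_assignments)
```

With 7 versus 7 specimens the same enumeration covers 3432 label assignments, and the count grows quickly with sample size, which is the motivation for sampling permutations instead.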
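The nperm = 2000 figure can be checked with a short calculation. The sketch assumes the permutations are drawn independently, so that the number of permutations with m or fewer misclassifications is binomial; the symbols $p$ (true proportion) and $\hat{p}$ (its Monte Carlo estimate) are notation introduced here.

```latex
\hat{p} = \frac{\#\{\text{permutations with at most } m \text{ misclassifications}\}}{n_{\text{perm}}},
\qquad
\operatorname{CV}(\hat{p}) = \frac{\sqrt{p(1-p)/n_{\text{perm}}}}{p}
                           = \sqrt{\frac{1-p}{n_{\text{perm}}\,p}}

p = 0.05,\; n_{\text{perm}} = 2000:\quad
\operatorname{CV}(\hat{p}) = \sqrt{\frac{0.95}{2000 \times 0.05}} = \sqrt{0.0095} \approx 0.097 < 0.1
```

Smaller true proportions need more permutations for the same relative precision, since the coefficient of variation scales roughly as $1/\sqrt{n_{\text{perm}}\,p}$.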
Gene-Expression Profiles in Hereditary Breast Cancer (cDNA microarrays, parallel gene expression analysis) • Breast tumors studied: • 7 BRCA1+ tumors • 8 BRCA2+ tumors • 7 sporadic tumors • Log-ratio measurements of 3226 genes for each tumor after initial data filtering • Research question: Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?
The Compound Covariate Predictor (CCP) • We consider only genes that are differentially expressed between the two groups (using a two-sample t-test with small α). • The CCP • Motivated by J. Tukey, Controlled Clinical Trials, 1993 • Simple approach that may serve better than complex multivariate analysis • A compound covariate is built from the basic covariates (log-ratios): $c_i = \sum_j t_j x_{ij}$, where $t_j$ is the two-sample t-statistic for gene $j$, $x_{ij}$ is the log-ratio measure of sample $i$ for gene $j$, and the sum is over all differentially expressed genes. • Threshold of classification: midpoint of the CCP means for the two classes.
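The midpoint rule can be written out explicitly. The notation is assumed for illustration: $\bar{c}_1$ and $\bar{c}_2$ are the mean compound covariates of the training samples in classes 1 and 2, $c_{\text{new}}$ is the compound covariate of the sample to be classified (the t-weighted sum of its log-ratios over the selected genes), and each t-statistic is taken with the sign of the class 1 mean minus the class 2 mean, which guarantees $\bar{c}_1 \ge \bar{c}_2$.

```latex
\tau = \frac{\bar{c}_1 + \bar{c}_2}{2}, \qquad
\hat{y}_{\text{new}} =
\begin{cases}
\text{class 1} & \text{if } c_{\text{new}} > \tau \\
\text{class 2} & \text{otherwise}
\end{cases}
```

Because every selected gene contributes with a weight proportional to its own t-statistic, no multivariate model of the interactions among genes has to be estimated.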
Accuracy of class prediction as selection stringency increases
Advantages of Compound Covariate Classifier • Good feature selection • Does not over-fit data • Incorporates influence of multiple predictive variables without attempting to select the best small subset of variables • Does not attempt to model the multivariate interactions among the predictors and outcome
Extensions • Adjustment for covariates • Paired samples • Survival data • Other classification methods • More than 2 classes
Class Discovery • For determining whether a set of tumors is homogeneous with regard to expression profile
Class Discovery Methods • Cluster analysis • Multi-dimensional Scaling
Melanoma Gene Expression Data • Q: Can gene expression profiles of melanoma be used to distinguish sub-classes of disease? (M. Bittner et al., Nature Genetics, Aug 2000) • [Figure: clustering of tumors with distance 1 - correlation; 19-tumor cluster of interest]
Validation of Clusters • Clustering algorithms find clusters, even when they are spurious • Clusters found may change with re-assaying tumors or selection of new tumors
Clustering Arrays • Cluster significance • Cluster reproducibility
Cluster reproducibility • Add perturbation noise to original data • Re-cluster perturbed data to assess stability of original clusters • D: Proportion of pairs of samples in a specified cluster of the original data that are in separate clusters after perturbation • R: Average number of specimens lost or gained in a specified cluster || CP(C) - CP(C) ||
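The D index defined above can be computed directly from two sets of cluster labels. The sketch assumes the clustering itself has already been run twice, once on the original data and once on the perturbed data; the function name `discrepancy_D` and the toy labels are hypothetical.

```python
from itertools import combinations

def discrepancy_D(original, perturbed, cluster):
    """Proportion of sample pairs from `cluster` (original labels) that end up
    in different clusters after perturbation."""
    members = [i for i, label in enumerate(original) if label == cluster]
    pairs = list(combinations(members, 2))
    if not pairs:                               # singleton cluster: nothing to split
        return 0.0
    split = sum(perturbed[i] != perturbed[j] for i, j in pairs)
    return split / len(pairs)

# Toy labels: sample 3 leaves cluster 0 after the data are perturbed.
original_labels = [0, 0, 0, 0, 1, 1]
perturbed_labels = [0, 0, 0, 1, 1, 1]
d_index = discrepancy_D(original_labels, perturbed_labels, cluster=0)
print(d_index)  # 3 of the 6 pairs are split
```

In practice the index would be averaged over many independent perturbations, with values near 0 indicating a cluster that survives the added noise.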
Test of Cluster Significance • Multivariate Gaussian null hypothesis • Project to subspace determined by first three principal components • Compute the empirical distribution function (EDF) of nearest-neighbor (NN) Euclidean distances between samples • Compare the NN EDF observed to that expected under the null distribution using a squared difference discrepancy metric • Estimate null distribution by sampling from 3D Gaussian distribution with mean and covariance matrix corresponding to observed data
BRB ArrayTools: An integrated package for the analysis of DNA microarray data http://linus.nci.nih.gov/BRB-ArrayTools.html
BRB ArrayTools Design Objectives • Easy user interface (Excel front-end) • Ease of data loading • Integrated • Drill-down linkage to genomic databases • Educating biologists in microarray data analysis • Powerful analytic & visualization tools • Easily extensible (R back-end) • Portable • Non-proprietary • Ease of development (R back-end)
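The steps of the test can be sketched with NumPy alone. Several details are assumptions not given on the slide: the expected null EDF is taken as the average over simulated null data sets, the discrepancy is a sum of squared differences on a fixed grid, significance is read off as a Monte Carlo p-value, and the toy data and function names are invented. Treat it as an outline of the idea, not the reference implementation.

```python
import numpy as np

def nn_distances(Y):
    """Euclidean distance from each sample to its nearest neighbor."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # ignore distance to self
    return d.min(axis=1)

def edf(values, grid):
    """Empirical distribution function of `values` evaluated on `grid`."""
    return (values[:, None] <= grid[None, :]).mean(axis=0)

def cluster_significance(X, n_null=200, seed=0):
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ vt[:3].T                           # first three principal components
    mean, cov = Y.mean(axis=0), np.cov(Y, rowvar=False)

    obs_nn = nn_distances(Y)
    null_nn = [nn_distances(rng.multivariate_normal(mean, cov, size=len(Y)))
               for _ in range(n_null)]          # samples from the 3D Gaussian null
    upper = max(obs_nn.max(), max(d.max() for d in null_nn))
    grid = np.linspace(0.0, upper, 100)

    null_edfs = np.array([edf(d, grid) for d in null_nn])
    expected = null_edfs.mean(axis=0)           # EDF expected under the null
    observed = np.sum((edf(obs_nn, grid) - expected) ** 2)
    null_stats = np.sum((null_edfs - expected) ** 2, axis=1)
    return observed, (null_stats >= observed).mean()

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 50))
X[:15] += 4.0                                   # toy data with two separated groups
observed_stat, p_value = cluster_significance(X)
print(observed_stat, p_value)
```

Clustered data tend to have nearest-neighbor distances that are shorter than a single Gaussian with the same mean and covariance would produce, which is what the discrepancy statistic is designed to pick up.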
Collaborators • Molecular Statistics & Bioinformatics • Kevin Dobbin • Lisa McShane • Amy Peng • Michael Radmacher • Joanna Shih • George Wright • Yingdong Zhao