Class prediction for experiments with microarrays Lara Lusa Inštitut za biomedicinsko informatiko, Medicinska fakulteta Lara.Lusa at mf.uni-lj.si
Outline • Objectives of microarray experiments • Class prediction • What is a predictor? • How to develop a predictor? • What methods are available? • Which features should be used in the predictor? • How to evaluate a predictor? • Internal vs. external validation • Some examples of what can go wrong • The molecular classification of breast cancer
Scheme of an experiment: Study design → Performance of the experiment (sample preparation, hybridization) → Image analysis → Quality control and normalization → Data analysis (class comparison / class prediction / class discovery) → Interpretation of the results
Aims of high-throughput experiments • Class comparison - supervised • establish differences in gene expression between predetermined classes (phenotypes) • Tumor vs. Normal tissue • Recurrent vs. Non-recurrent patients treated with a drug (Ma, 2004) • ER+ vs ER- patients (West, 2001) • BRCA1, BRCA2 and sporadics in breast cancer (Hedenfalk, 2001) • Class prediction - supervised • prediction of phenotype using gene expression data • morphology of a leukemia patient based on his gene expression (ALL vs. AML, Golub 1999) • which patients with breast cancer will develop a distant metastasis within 5 years (van’t Veer, 2002) • Class discovery - unsupervised • discover groups of samples or genes with similar expression • Luminal A, B, C(?), Basal, ERBB2+, Normal in Breast Cancer (Perou 2001, Sørlie, 2003)
How to develop a predictor? • On a training set of samples: select a subset of genes (feature selection) and use the gene expression measurements (X) to obtain a RULE g(X), based on gene expression, for the classification of new samples • Use the rule to predict the class membership (Y) of new samples (test set)
Rule: Nearest-neighbor classifier • For each sample of the independent data set ("test set") calculate Pearson's (centered) correlation of its gene expression with each sample from the training set • Classification rule: assign the new sample to the class of the training-set sample that has the highest correlation with it [figure: correlations between the new sample and the training samples] Bishop, 2006
Rule: K-Nearest-neighbor classifier • For each sample of the independent data set ("test set") calculate Pearson's (centered) correlation of its gene expression with each sample from the training set • Classification rule: assign the new sample to the class to which the majority of the K training-set samples with the highest correlations belong [figure: the K = 3 most-correlated training samples vote on the class of the new sample] Bishop, 2006
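A minimal R sketch of the two neighbour rules above (hypothetical `knn.correlation` helper; `train` and `test` are genes × samples matrices, `train.class` the training labels; k = 1 gives the plain nearest-neighbour rule):

```r
# Correlation-based k-NN: classify each test sample by majority vote
# among the k training samples with the highest Pearson correlation.
knn.correlation <- function(train, test, train.class, k = 3) {
  apply(test, 2, function(new.sample) {
    # Pearson (centered) correlation with every training sample
    r <- cor(new.sample, train)
    # classes of the k most-correlated training samples
    neighbours <- train.class[order(r, decreasing = TRUE)[1:k]]
    # majority vote among the k nearest neighbours
    names(which.max(table(neighbours)))
  })
}
```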
Rule: Method of centroids (Sørlie et al. 2003) • Class prediction rule: • Define a centroid for each class on the original data set ("training set"): for each gene, average its expression over the samples assigned to that class • For each sample of the independent data set ("test set") calculate Pearson's (centered) correlation of its gene expression with each centroid • Classification rule: assign the sample to the class whose centroid has the highest correlation with it (if the highest correlation is below 0.1, do not assign) [figure: the new sample is assigned to the class whose centroid correlates most with it]
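A minimal R sketch of this rule (hypothetical `predict.centroids` helper; genes × samples matrices as above; the 0.1 cut-off mirrors the rule on the slide):

```r
# Method of centroids: one centroid per class, then assign each test
# sample to the class whose centroid it correlates with most.
predict.centroids <- function(train, test, train.class, cutoff = 0.1) {
  # centroid = gene-wise mean over the training samples of that class
  centroids <- sapply(levels(factor(train.class)), function(cl)
    rowMeans(train[, train.class == cl, drop = FALSE]))
  apply(test, 2, function(new.sample) {
    r <- cor(new.sample, centroids)   # correlation with each centroid
    if (max(r) < cutoff) return(NA)   # below the cut-off: unclassified
    colnames(centroids)[which.max(r)]
  })
}
```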
Rule: Diagonal Linear Discriminant Analysis (DLDA) • For each of the G genes, calculate the mean expression of the training samples from Class 1 and Class 2 ($\bar{x}_{1j}$ and $\bar{x}_{2j}$) and the pooled within-class variance $s_j^2$ • For each sample $x^*$ of the test set evaluate whether $\sum_{j=1}^{G} \frac{(x^*_j - \bar{x}_{1j})^2}{s_j^2} \le \sum_{j=1}^{G} \frac{(x^*_j - \bar{x}_{2j})^2}{s_j^2}$, where $x^*_j$ is the expression of the j-th gene for the new sample • Classification rule: if the above inequality is satisfied, classify the sample to Class 1, otherwise to Class 2
Rule: Diagonal Linear Discriminant Analysis (DLDA) • A particular case of discriminant analysis under the hypotheses that • the features are not correlated • the variances of the two classes are the same • Other methods used in microarray studies are variants of discriminant analysis • Compound covariate predictor • Weighted vote method Bishop, 2006
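A minimal R sketch of the DLDA rule, under the assumptions above (hypothetical names; two classes coded 1 and 2):

```r
# DLDA for two classes: compare standardised distances of the test
# sample to the class means, with a pooled per-gene variance.
dlda <- function(train, test, train.class) {
  x1 <- train[, train.class == 1, drop = FALSE]
  x2 <- train[, train.class == 2, drop = FALSE]
  m1 <- rowMeans(x1); m2 <- rowMeans(x2)
  n1 <- ncol(x1); n2 <- ncol(x2)
  # pooled within-class variance, gene by gene
  s2 <- ((n1 - 1) * apply(x1, 1, var) + (n2 - 1) * apply(x2, 1, var)) /
        (n1 + n2 - 2)
  apply(test, 2, function(x) {
    d1 <- sum((x - m1)^2 / s2)   # standardised distance to Class 1 means
    d2 <- sum((x - m2)^2 / s2)   # ... and to Class 2 means
    ifelse(d1 <= d2, 1, 2)
  })
}
```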
Other popular classification methods • Classification and Regression Trees (CART) • Prediction Analysis of Microarrays (PAM) • Support Vector Machines (SVM) • Logistic regression • Neural networks Bishop, 2006
How to choose a classification method? • No single method is optimal in every situation • No Free Lunch Theorem: in the absence of assumptions we should not prefer any classification algorithm over another • Ugly Duckling Theorem: in the absence of assumptions there is no "best" set of features
The bias-variance tradeoff
$$\mathrm{MSE} = E_D\big[(g(x;D) - F(x))^2\big] = \big(E_D[g(x;D)] - F(x)\big)^2 + E_D\big[\big(g(x;D) - E_D[g(x;D)]\big)^2\big] = \mathrm{Bias}^2 + \mathrm{Variance}$$
Hastie et al, 2001; Duda et al, 2001
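The decomposition follows by adding and subtracting $\bar{g}(x) = E_D[g(x;D)]$ inside the square; the cross term vanishes because $E_D[g(x;D) - \bar{g}(x)] = 0$:

```latex
% Expanding the square after adding and subtracting \bar{g}(x);
% the cross term has zero expectation over training sets D.
\begin{align*}
E_D\big[(g(x;D) - F(x))^2\big]
  &= E_D\big[(g(x;D) - \bar{g}(x) + \bar{g}(x) - F(x))^2\big] \\
  &= \underbrace{(\bar{g}(x) - F(x))^2}_{\text{Bias}^2}
   + \underbrace{E_D\big[(g(x;D) - \bar{g}(x))^2\big]}_{\text{Variance}}.
\end{align*}
```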
Feature selection • Can ALL the gene expression variables be included in the classifier? • Which variables should be used to build the classifier? • Filter methods • Prior to building the classifier • One feature at a time or joint distribution approaches • Wrapper methods • Performed implicitly by the classifier • CART, PAM From Fridlyand, CBMB Workshop
A comparison of classifiers' performance for microarray data • Dudoit, Fridlyand and Speed (2002, JASA), on 3 data sets • DA, DLDA, k-NN, SVM, CART • Good performance of simple classifiers such as DLDA and k-NN • Feature selection: a small number of features included in the classifier
How to evaluate the performance of a classifier • Classification error • a sample is classified into a class to which it does not belong: g(X) ≠ Y • Predictive accuracy = % of correctly classified samples • In a two-class problem, using the terminology of diagnostic tests ("+" = diseased, "-" = healthy): • Sensitivity = P(classified + | true +) • Specificity = P(classified - | true -) • Positive predictive value = P(true + | classified +) • Negative predictive value = P(true - | classified -)
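A small worked example of these quantities in R (toy labels, assuming the "+"/"-" coding above):

```r
# Toy true and predicted labels for ten samples
truth     <- c("+", "+", "+", "+", "-", "-", "-", "-", "-", "-")
predicted <- c("+", "+", "+", "-", "-", "-", "-", "-", "+", "-")

sensitivity <- mean(predicted[truth == "+"] == "+")   # 3/4  = 0.75
specificity <- mean(predicted[truth == "-"] == "-")   # 5/6 ~= 0.83
ppv         <- mean(truth[predicted == "+"] == "+")   # 3/4  = 0.75
npv         <- mean(truth[predicted == "-"] == "-")   # 5/6 ~= 0.83
accuracy    <- mean(predicted == truth)               # 8/10 = 0.80
```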
Class prediction: how to assess the predictive accuracy? • Use an independent data set • What if one is not available? • ABSOLUTELY WRONG: apply your predictor to the data you used to develop it and see how well it predicts • OK: • cross-validation • bootstrap • [figure: the data repeatedly split into training and test folds]
How to develop a cross-validated class predictor • Split the samples into a training set and a test set • Build the class predictor on the training set only • Predict the class of the test-set samples using the predictor built on the training set • Repeat over the splits [figure: repeated train/test splits]
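A sketch of this procedure in R: a leave-one-out cross-validation loop in which the feature selection is repeated inside every fold (hypothetical names; `x` is a genes × samples matrix, `y` a two-level factor; the t-test filter and the correlation k-NN from the earlier sketch are illustrative choices, not the method of any particular paper):

```r
# LOO CV with feature selection INSIDE each fold: the held-out sample
# must play no role in choosing the genes or building the rule.
loo.cv <- function(x, y, n.genes = 50, k = 3) {
  pred <- character(ncol(x))
  for (i in seq_len(ncol(x))) {
    x.train <- x[, -i, drop = FALSE]; y.train <- y[-i]
    # rank the genes by a two-sample t statistic on the training fold ONLY
    t.stat <- apply(x.train, 1, function(g)
      abs(t.test(g[y.train == levels(y)[1]],
                 g[y.train == levels(y)[2]])$statistic))
    top <- order(t.stat, decreasing = TRUE)[1:n.genes]
    # classify the held-out sample with the fold-specific gene set
    pred[i] <- knn.correlation(x.train[top, , drop = FALSE],
                               x[top, i, drop = FALSE],
                               y.train, k = k)
  }
  mean(pred == as.character(y))   # cross-validated predictive accuracy
}
```

Selecting the genes once on the complete data set and only cross-validating the final rule gives the biased "partial" cross-validation discussed on the following slides.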
Dupuy and Simon, JNCI 2007 • Critical review of published microarray studies • Supervised prediction: 12/28 reported a misleading estimate of prediction accuracy • 50% of the studies contained one or more major flaws
Class prediction: a famous example • van't Veer et al. reported the results of the wrong analysis in the main paper and the correct analysis (with less striking results) only in the supplementary material
What went wrong? • Performing only a partial cross-validation (the genes were selected on the complete data set, outside the cross-validation loop) produces highly biased estimates of predictive accuracy • PLUS: going beyond the quantification of predictive accuracy and attempting to make inference with the cross-validated class predictor: THE INFERENCE MADE IS NOT VALID
Hypothesis: there is no difference between the classes (null simulation, n = 100)

Nominal level                         0.01    0.05    0.10
Observed prop. of rejected H0 (LOO CV)  0.268   0.414   0.483

Lusa, McShane, Radmacher, Shih, Wright, Simon, Statistics in Medicine, 2007

Microarray predictor — multivariable logistic model:

Parameter       Logistic coeff.  Std. error  Odds ratio  95% CI
----------------------------------------------------------------
Grade               -0.08           0.79        1.1      [0.2, 5.1]
ER                   0.5            0.94        1.7      [0.3, 10.4]
PR                  -0.75           0.93        2.1      [0.3, 13.1]
Size (mm)           -1.26           0.66        3.5      [1.0, 12.8]
Age                  1.4            0.79        4.0      [0.9, 19.1]
Angioinvasion       -1.55           0.74        4.7      [1.1, 20.1]
Microarray           2.87           0.851       7.6      [3.3, 93.7]
----------------------------------------------------------------
Odds ratio = 15.0, p-value = 4 × 10^(-6)
Final remarks • Simple classification methods such as DLDA have proved to work well for microarray studies and to outperform fancier methods • Many classification methods that have been proposed in the field under new names are just slight modifications of already-known techniques
Final remarks • Report all the necessary information about your classifier so that others can apply it to their data • Evaluate the predictive accuracy of the classifier correctly • in the early days of microarrays, many papers presented analyses that were not correct, or drew wrong conclusions from their work • even now, mid- and low-impact-factor journals keep publishing obviously wrong analyses • Don't apply methods without understanding exactly • what they are doing • on which assumptions they rely
Other issues in classification • Missing data • Class representation • Choice of distance function • Standardization of observations and variables • An example where all this matters…
Class discovery • Mostly performed through hierarchical clustering of genes and samples • An often-abused method in microarray analysis, used instead of supervised methods • In very few examples • the stability and reproducibility of the clustering is assessed • the results are "validated" or further used after "discovery" • a rule for the classification of new samples is given • "Projecting" the clustering onto new data sets still seems problematic: it becomes a class prediction problem
Molecular taxonomy of breast cancer • Perou/Sørlie (Stanford/Norway) • Class sub-type discovery (Perou, Nature 2001, Sørlie, PNAS 2001, Sørlie, PNAS 2003) • Association of discovered classes with survival and other clinical variables (Sørlie, PNAS 2001, Sørlie, PNAS 2003) • Validation of findings assigning class labels defined from class discovery to independent data sets (Sørlie, PNAS 2003)
Sørlie et al, PNAS 2003 • Hierarchical clustering of the 122 samples from the paper using the "intrinsic gene set" (~500 genes) • Average linkage; distance = 1 − Pearson's (centered) correlation • [figure: dendrogram annotated, for each subtype, with the number of samples, the node correlation (ρ) of the core samples and the percentage of ER+ samples: 28 (>.32) 89%; 11 (>.28) 82%; 11 (>.34) 64%; 19 (>.41) 22%; 10 (>.31); core samples n = 79 (64%)]
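A minimal R sketch of this clustering step (assuming a genes × samples matrix `x`; cutting the tree into five groups is only illustrative):

```r
# Average-linkage hierarchical clustering of the samples with
# distance = 1 - Pearson's (centered) correlation.
d  <- as.dist(1 - cor(x))         # cor(x) correlates the columns (samples)
hc <- hclust(d, method = "average")
plot(hc)                          # dendrogram of the samples
subtypes <- cutree(hc, k = 5)     # e.g. cut into five candidate subtypes
```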
Can we assign subtype membership to samples from independent data sets? • Method of centroids (Sørlie et al. 2003), class prediction rule: • Define a centroid for each class on the original data set ("training set"): for each gene, average its expression over the samples assigned to that class • For each sample of the independent data set (e.g. the West data set, "test set") calculate Pearson's (centered) correlation of its gene expression with each centroid • Classification rule: assign the sample to the class whose centroid has the highest correlation with it (if below 0.1, do not assign) • Cited thousands of times • Widely used in research papers and praised in editorials • Recent concerns raised about their reproducibility and robustness
Predicted class membership: Sørlie subtypes on our data • Loris: "I obtained the subtypes on our data! All the samples from Tam113 are Lum A, a bit strange... there are no Lum B in our data set" • Lara: "Have you tried also on the BRCA60?" • Loris: "No [...] Those are mostly Lum A, too. Some are Normal, very strange... there are no Basal among the ER-!" • Lara: "[...] Have you mean-centered the genes?" • Loris: "No [...] Looks better on BRCA60: now the ER- are mostly Basal... On Tam113 I get many Lum B... But 50% of the samples from Tam113 are NOT luminal anymore!" • Something is wrong! • BRCA60: hereditary breast cancer (42 ER+ / 16 ER-) • Tam113: tamoxifen-treated breast cancer (113 ER+ / 0 ER-)
How are the systematic differences between microarray platforms/batches taken into account? • Sørlie et al. 2003 data set: genes were mean- (and in some cases median-) centered • "[…] the data file was adjusted for array batch differences as follows; on a gene-by-gene basis, we computed the mean of the nonmissing expression values separately in each batch. Then for each sample and each gene, we subtracted its batch mean for that gene. Hence, the adjusted array would have zero row-means within each batch. This ensures that any variance in a gene is not a result of a batch effect." • "Rows (genes) were median-centered and both genes and experiments were clustered by using an average hierarchical clustering algorithm." • West et al. data set (Affymetrix, single-channel data): genes were "centered" • "Data were transformed to a compatible format by normalizing to the median experiment […] Each absolute expression value in a given sample was converted to a ratio by dividing by its average expression value across all samples." • van't Veer et al. data set: genes do not seem to have been mean-centered • Other data sets where the method was applied: genes were always centered
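A short R sketch of the batch-wise gene centering quoted above (hypothetical `center.by.batch` helper; `x` is a genes × samples matrix and `batch` labels each sample's batch):

```r
# Gene-wise, batch-wise mean-centering: within each batch, subtract
# from every gene its mean over the samples of that batch, so each
# gene has zero mean within each batch.
center.by.batch <- function(x, batch) {
  for (b in unique(batch)) {
    cols <- batch == b
    x[, cols] <- x[, cols] - rowMeans(x[, cols, drop = FALSE], na.rm = TRUE)
  }
  x
}
```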
Possible concerns about the application of the method of centroids • How are the classification results influenced by... • the normalization of the data (mean-centering of the genes)? • differences in subtype prevalence across data sets? • the presence of study (or batch) effects? • the choice of the method of centroids as a classification method? • the use of an arbitrary cut-off for non-classifiable samples? • Lusa et al, Challenges in projecting clustering results across gene expression-profiling datasets, JNCI 2007
Sotiriou's data set • ER (ligand-binding assay): 34 ER- / 65 ER+ • 7650 clones (6878 unique)
1. Effects of mean-centering the genes • Method of centroids with Sørlie's centroids (derived from the centered data set) applied to Sotiriou's data set (336/552 common and unique clones) • Genes centered (C) vs. non-centered (N) • Applied to the full data set (99 samples), the ER+ subset (65 samples) and the ER- subset (34 samples)
2. Effects of the prevalence of subgroups in the (training and) test set • Training set: 10 ER+ / 10 ER-

Test set          Predictive accuracy (ER+ / ER-)
55 ER+ / 24 ER-   95% / 79%
55 ER+ / 24 ER-   78% / 88%
24 ER+ / 24 ER-   88% / 83%
12 ER+ / 24 ER-   92% / 79%
55 ER+ /  0 ER-   53% / ND
 0 ER+ / 24 ER-   ND / 62%
2b. What role does the prevalence of subgroups in the training and test set play? • ER status prediction on Sotiriou's data set • Multiple (100) random splits into training and test sets; method of centroids; 751 variance-filtered unique clones • Training set: ω_tr = 1/2 (n_tr = 20), i.e. 10 ER+ / 10 ER-, centered (C) or non-centered (N) • Test set: 0 ≤ ω_test ≤ 1 (n_test = 24), from 0 ER+ / 24 ER- through 24 ER+ / 0 ER- (ω: % of ER+ samples in the test set) • Outcomes: % correctly classified among the ER+, among the ER-, and overall
3. (Possible) study effect on real data: Sotiriou + van't Veer • [figure: predicted class membership of the van't Veer samples, centered vs. non-centered] • The predictive accuracy is the same • Most of the samples in the non-centered analysis would not be classifiable using the threshold
Conclusions I • "Musts" for a clinically useful classifier • It classifies a new sample unambiguously, independently of any other samples being considered for classification at the same time • The clinical meaning of the subtype assignment (survival probability, probability of response to treatment) must be stable across the populations to which the classifier might be applied • The technology used to assay the samples must be stable and reproducible: a sample assayed on different occasions must be assigned to the same subtype • BUT we showed that the subgroup assignments of new samples can be substantially influenced by • the normalization of the data (the appropriateness of gene-centering depends on the situation) • the proportion of samples from each subtype in the test set • the presence of systematic differences across data sets • the use of arbitrary rules for identifying non-classifiable samples • Most of our conclusions also apply to other classification methods
Conclusions II • Most of the studies claiming to have validated the subtypes have focused only on comparing clinical outcome differences • This shows consistency of results between studies • BUT it does not provide a direct measure of the robustness of the classification, which is essential before using the subtypes in clinical practice • Careful thought must be given to the comparability of patient populations and data sets • Many difficulties remain in validating and extending class discovery results to new samples, and a robust classification rule remains elusive • The subtyping of breast cancer seems promising, BUT a standardized definition of the subtypes based on a robust measurement method is needed
Some useful resources and readings • Books • Simon et al. – Design and Analysis of DNA Microarray Investigations – Ch. 8 • Speed (Ed.) – Statistical Analysis of Gene Expression Microarray Data – Ch. 3 • Bishop – Pattern Recognition and Machine Learning • Hastie, Tibshirani and Friedman – The Elements of Statistical Learning • Duda, Hart and Stork – Pattern Classification • Software for data analysis • R and Bioconductor (www.r-project.org, www.bioconductor.org) • BRB ArrayTools (http://linus.nci.nih.gov) • Web sites • BRB/NCI web site (NIH) • Tibshirani's web site (Stanford) • Terry Speed's web site (Berkeley)