Multiclass classification of microarray data with repeated measurements: application to cancer. Ka Yee Yeung & Roger E Bumgarner. Genome Biology 2003, 4:R83.
Sample Classification
• Use gene expression measurements from microarray experiments to classify biological samples (e.g. types of tumors).
• Goals
  • Utilize repeated measurements
  • Multiclass classification
  • Remove redundancy
  • No assumption of distribution
Shrunken Centroid Classification
• Feature selection
  • Consider features individually
  • Calculate overall centroid and each class centroid
  • “Shrink” class centroids toward the overall centroid by shrinkage amount Δ
  • Compare shrunken class centroids to overall centroid
  • If still different after shrinkage, feature is predictive for the class
  • Estimate optimum Δ using 10-fold cross validation
• Classification
  • Calculate standardized, squared difference of sample to each shrunken class centroid for selected features
  • Assign to class with nearest centroid
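The feature-selection and classification steps on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code. It assumes the usual nearest-shrunken-centroid formulation: a pooled within-class standard deviation per gene, a soft threshold Δ, and equal class priors. The slide does not spell out any of these details. The function names `fit_shrunken_centroids` and `predict_nearest_centroid` are hypothetical.

```python
import numpy as np

def fit_shrunken_centroids(X, y, delta):
    """X: (n_samples, n_genes) expression matrix; y: class labels; delta: shrinkage amount."""
    classes = np.unique(y)
    n = X.shape[0]
    overall = X.mean(axis=0)                                   # overall centroid
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class standard deviation of each gene (assumed standardization).
    resid = X - centroids[np.searchsorted(classes, y)]
    s = np.sqrt((resid ** 2).sum(axis=0) / (n - len(classes)))
    s0 = np.median(s)                                          # guards against tiny s
    n_k = np.array([(y == k).sum() for k in classes])
    m = np.sqrt(1.0 / n_k - 1.0 / n)[:, None]
    # Standardized difference between each class centroid and the overall centroid.
    d = (centroids - overall) / (m * (s + s0))
    # Soft-threshold: move every difference toward zero by delta.
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    shrunken = overall + m * (s + s0) * d_shrunk
    # A gene is kept if at least one class centroid survives the shrinkage.
    selected = np.any(d_shrunk != 0, axis=0)
    return classes, shrunken, s + s0, selected

def predict_nearest_centroid(X_new, classes, shrunken, scale, selected):
    """Assign each sample to the class with the nearest shrunken centroid."""
    diff = (X_new[:, None, :] - shrunken[None, :, :]) / scale
    dist = (diff[:, :, selected] ** 2).sum(axis=2)             # selected genes only
    return classes[dist.argmin(axis=1)]
```

In practice Δ would be chosen by running both functions inside a 10-fold cross-validation loop over a grid of candidate values and keeping the value with the lowest held-out error.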
Redundancy & Error Estimation
• Uncorrelated Shrunken Centroid (USC)
  • Removes redundant genes
  • For each set of relevant genes
    • Compute pairwise correlations
    • Remove least relevant gene from pairs with correlation above given threshold
  • Use cross-validation to determine best pair (shrinkage factor, correlation threshold)
• Error Weighted Uncorrelated SC (EWUSC)
  • The standard deviation of the sample mean is used to down weight the most variable genes and experiments
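The two ideas on this slide can be illustrated with a short sketch. It is one plausible reading of the bullets, not the published procedure. The redundancy filter is written as a greedy pass over genes in order of decreasing relevance. It uses the signed Pearson correlation; the slide does not say whether the absolute value was intended. The `relevance` score is a hypothetical input, for example the largest shrunken difference of a gene. The second function only computes the standard deviation of the sample mean over repeated measurements; how that quantity enters the weighting is not specified here and is left out.

```python
import numpy as np

def remove_redundant_genes(X, relevance, threshold):
    """Return sorted indices of genes kept after correlation filtering.

    X: (n_samples, n_genes) expression values of the relevant genes.
    relevance: (n_genes,) score, higher means more relevant (hypothetical input).
    threshold: correlation above which two genes are treated as redundant.
    """
    corr = np.corrcoef(X, rowvar=False)            # gene-by-gene Pearson correlation
    order = np.argsort(-np.asarray(relevance))     # most relevant gene first
    kept = []
    for g in order:
        # Drop g if it is too correlated with a more relevant gene already kept.
        if all(corr[g, h] <= threshold for h in kept):
            kept.append(g)
    return sorted(int(g) for g in kept)

def mean_and_standard_error(replicates):
    """replicates: (n_samples, n_genes, n_repeats) repeated measurements.

    Returns the per-gene, per-sample mean and the standard deviation of that
    mean (sample standard deviation divided by sqrt(n_repeats)).
    """
    r = replicates.shape[2]
    mean = replicates.mean(axis=2)
    sem = replicates.std(axis=2, ddof=1) / np.sqrt(r)
    return mean, sem
```

A larger standard error marks a less reliable measurement, which is the quantity EWUSC is described as using to down weight variable genes and experiments.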
Experiments
• Datasets
  • Synthetic datasets, varying:
    • Biological noise level
    • Technical noise level
    • Number of repeated measurements
    • Percent of relevant genes
  • Real Datasets
    • Multiple tumor dataset – 7,129 genes, 123 samples, 11 classes (types of tumors)
    • Breast cancer dataset – 25,000 genes, 97 samples, 2 classes (good or poor prognosis)
• Evaluation Criteria
  • Prediction Accuracy
  • Number of relevant features selected
  • Feature stability
Synthetic data results
• Removing redundant genes (USC)
  = Similar accuracy
  + Using same or fewer genes
• Error weighting results on synthetic datasets
  • Two types of error defined
    • Technical noise – variation over repeated measurements (λ)
      • Low (1) or High (5, 10)
      + Handled “technical noise” well (similar accuracy, fewer genes)
    • Biological noise – signal to noise ratio (α)
      • 20 to 1, 2 to 1, or 1 to 1
      • Accuracy was worse with increased “biological noise”, despite increasing the number of repeated measurements
• Criticism
  • Noise is the same over the entire dataset; it should vary between genes
  • Each dataset would have some high signal to noise genes
Real Data Results
• Removing redundant genes (USC)
  = Similar, but varying accuracy
  + Using many fewer genes
• Error weighting – Real Datasets
  • Multiple tumor data
    + Improved accuracy
    + Improved feature stability
    = Using similar number of genes
  • Breast cancer data
    + Improved accuracy
    = Similar feature stability
    – Using increased number of genes