Biomarker and Classifier Selection in Diverse Genetic Datasets. University of Connecticut: (1) Department of Computer Science and Engineering, (2) Department of Molecular and Cell Biology. James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2).
Motivation 1: Cell-type Identification • The Question: What is the smallest number of genes needed to identify each cluster? • B: Bone • C: Myeloid • D: Endothelial • Available Data: literature-annotated present/absent calls for 50 cell types and 600 genes in the mesoderm lineage. In collaboration with Dr. Hector Leonardo Aguila, UCHC
Motivation 2: Clinical Diagnostics • Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS ONE, 2012
Multi-class Classification Problem • There are 2 or more classes • Supervised learning • Key Problems: • Feature Selection: What are the most predictive biomarkers? • Classification: What is the best classification algorithm?
Challenges • Different types of data • Gene expression • Epigenetic data • Methylation • Histone modification • Proteomics • Metabolomics • Phenotypes • Different Platforms • Microarray • Sequencing • In-situ hybridization • Different Resolutions • Discrete vs Continuous • Sparse vs Complete
Minimal Unique Marker Panel Selection (Mumps) Pipeline • Input: # of biomarkers • Nested Cross Validation • Inner Cross-validation: parameterize each combination of feature selection and classification algorithms • Outer Cross-validation: rank models by AUC • Output: the best features and classifier
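A minimal sketch of the nested cross-validation idea on this slide, using only the Python standard library. It is not the Mumps implementation: the mean-difference filter, the nearest-centroid scorer, the two-class synthetic data, and every function name here are assumptions made for illustration. The inner loop picks the number of features; the outer loop estimates AUC on samples the inner loop never saw.

```python
import random
from statistics import mean

def stratified_folds(indices, y, k, rng):
    """Split indices into k folds with roughly equal class proportions."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y[i] for i in indices)):
        members = [i for i in indices if y[i] == label]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def select_features(X, y, train, n):
    """Toy filter (assumption): rank features by |mean(class 1) - mean(class 0)|."""
    def gap(f):
        return abs(mean(X[i][f] for i in train if y[i] == 1)
                   - mean(X[i][f] for i in train if y[i] == 0))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:n]

def centroid_scores(X, y, train, test, feats):
    """Nearest-centroid score: larger means closer to the class-1 centroid."""
    c1 = [mean(X[i][f] for i in train if y[i] == 1) for f in feats]
    c0 = [mean(X[i][f] for i in train if y[i] == 0) for f in feats]
    def dist(i, centroid):
        return sum((X[i][f] - m) ** 2 for f, m in zip(feats, centroid))
    return [dist(i, c0) - dist(i, c1) for i in test]

def auc(scores, labels):
    """Rank-based AUC: chance that a positive sample outscores a negative one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_and_score(X, y, train, test, n):
    """Select n features on train only, then report AUC on test."""
    feats = select_features(X, y, train, n)
    return auc(centroid_scores(X, y, train, test, feats), [y[i] for i in test])

def nested_cv(X, y, sizes, k=3, seed=0):
    """Return one (chosen feature count, outer-fold AUC) pair per outer fold."""
    rng = random.Random(seed)
    outer = stratified_folds(list(range(len(y))), y, k, rng)
    results = []
    for test in outer:
        train = [i for fold in outer if fold is not test for i in fold]
        inner = stratified_folds(train, y, k, rng)
        def inner_auc(n):
            # Mean validation AUC across the inner folds for a given feature count.
            return mean(
                fit_and_score(X, y, [i for g in inner if g is not val for i in g], val, n)
                for val in inner)
        best_n = max(sizes, key=inner_auc)
        results.append((best_n, fit_and_score(X, y, train, test, best_n)))
    return results

# Synthetic data (hypothetical): 24 samples, 10 features, features 0 and 1 informative.
data_rng = random.Random(1)
y = [i % 2 for i in range(24)]
X = [[(2.0 * label if f < 2 else 0.0) + data_rng.gauss(0, 1) for f in range(10)]
     for label in y]
results = nested_cv(X, y, sizes=[1, 2, 4])
for best_n, score in results:
    print(f"chosen features: {best_n}, outer AUC: {score:.2f}")
```

Feature selection is repeated inside every training split so that held-out samples never influence which features are chosen, which is the point of nesting the two loops.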
Algorithms Feature Selection Classification • (SVM)-recursive feature elimination (RFE) • ANOVA F-value • Random Forests • Extra Trees • Correlation • Cosine • K-Nearest Neighbors (KNN) • Support Vector Machine (SVM) • Decision Tree • Random Forests • Extra Trees • Gradient Boosting
Datasets • From Broad Institute • Affymetrix Gene expression microarray • 15 hematopoietic cell types • 82 samples • 4-7 samples per cell type. • Multiple Sources • 70 samples • Approximately 3-7 samples per cell type. • Affymetrix & Illumina Bead Array • Different labs
Experiments • Complete • Complete gene expression profile from microarray datasets. • Simulated Sparse • 70% and 50% missing data • Coverage of a marker (the fraction of cell types with a known expression status for that marker) followed a Beta distribution. • Fifteen simulations • Cross-validation • 3-fold, stratified • # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384 • Best set of features and classifier for each # features • External validation • Use Broad data as training • Test against external datasets
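The simulated-sparse setup can be sketched as follows, assuming rows are cell types and columns are markers. The slide does not give the Beta shape parameters, so the `concentration` value, the function name, and the matrix dimensions below are hypothetical; the sketch only fixes the Beta mean to the target coverage (1 minus the missing fraction).

```python
import random

def simulate_sparse(matrix, missing_fraction, concentration=10.0, seed=0):
    """Return a copy of matrix with entries masked to None, marker by marker.

    Each marker (column) draws its coverage from Beta(a, b), whose mean is
    1 - missing_fraction. The concentration a + b is an assumed value.
    """
    rng = random.Random(seed)
    n_types, n_markers = len(matrix), len(matrix[0])
    target = 1.0 - missing_fraction
    a, b = target * concentration, (1.0 - target) * concentration
    sparse = [row[:] for row in matrix]  # leave the input untouched
    for m in range(n_markers):
        coverage = rng.betavariate(a, b)
        n_known = round(coverage * n_types)
        known = set(rng.sample(range(n_types), n_known))
        for t in range(n_types):
            if t not in known:
                sparse[t][m] = None  # expression status unknown
    return sparse

# Hypothetical complete matrix: 15 cell types x 200 markers, present/absent calls.
fill_rng = random.Random(42)
complete = [[fill_rng.randint(0, 1) for _ in range(200)] for _ in range(15)]
sparse_70 = simulate_sparse(complete, missing_fraction=0.7)
observed = sum(v is None for row in sparse_70 for v in row) / (15 * 200)
print(f"observed missing fraction: {observed:.2f}")
```

Repeating the call with different `seed` values would give independent simulations, in the spirit of the fifteen runs listed on the slide.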
Correlation: Broad vs External • Why the Big Gap? • Cross-platform normalization • Similarities in cell-types • Over-fitting
Motivation Results • Mesoderm Cell-type Identification • Anti-TNF Responsiveness
Future Work • Broader Data-types • NCI-60 • microarray mRNA • microarray microRNA • copy number variation • protein array • SNPs • … • Minimizing over-fitting • Cross-platform normalization • Different Data types • Integrate multiple data types simultaneously
Conclusion and Thanks • Thanks to: • Ed Hemphill • Chih Lee • Ion Mandoiu • Craig Nelson • Smpl Bio: a commercial service coming in late 2013
Extra Slides Don’t Go Beyond, Tis A Silly Place
Experiment Overview • Input: # of biomarkers • Nested Cross Validation on Broad Data • Inner Cross-validation: parameterize each combination of feature selection and classification algorithms • Outer Cross-validation: rank models by AUC • Output: the best features and classifier • External Testing: test best model • Output: AUC of best features / classifier
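The slides do not say how AUC was computed for more than two classes. One common convention, assumed here, is a one-vs-rest macro average: compute a binary AUC for each class against all others, then average. The sketch below uses the cluster labels B, C, and D from the motivation slide; the scores are made up for illustration.

```python
def binary_auc(scores, is_positive):
    """Rank-based AUC: chance that a positive sample outscores a negative one."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(score_rows, labels):
    """Average the one-vs-rest AUC over classes.

    score_rows[i][c] is the model's score that sample i belongs to class c.
    """
    classes = sorted(set(labels))
    per_class = [
        binary_auc([row[c] for row in score_rows], [l == c for l in labels])
        for c in classes
    ]
    return sum(per_class) / len(per_class)

# Hypothetical external test set: true labels and per-class scores from a trained model.
labels = ["B", "C", "D", "B"]
score_rows = [
    {"B": 0.7, "C": 0.2, "D": 0.1},
    {"B": 0.2, "C": 0.6, "D": 0.2},
    {"B": 0.1, "C": 0.3, "D": 0.6},
    {"B": 0.5, "C": 0.4, "D": 0.1},
]
print(macro_ovr_auc(score_rows, labels))  # every class is ranked perfectly, so 1.0
```

In an external-testing setting the model would be fit on the training dataset only, and `score_rows` would come from applying it unchanged to the external samples.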