Biomarker and Classifier Selection in Diverse Genetic Datasets

James Lindsay¹, Ed Hemphill², Chih Lee¹, Ion Mandoiu¹, Craig Nelson²
¹Department of Computer Science and Engineering
²Department of Molecular and Cell Biology
University of Connecticut

Motivation 1: Cell-type Identification
• The question: what is the smallest number of genes needed to identify each cluster?
  • B: Bone
  • C: Myeloid
  • D: Endothelial
• Available data: literature-annotated present/absent expression for 50 cell types and 600 genes in the mesoderm lineage.
In collaboration with: Dr. Hector Leonardo Aguila, UCHC

Motivation 2: Clinical Diagnostics
• Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS ONE 2012

Multi-class Classification Problem
Multi-class Classification
• There are 2 or more classes
• Supervised learning
Key Problems:
• Feature Selection: What are the most predictive biomarkers?
• Classification: What is the best classification algorithm?

Challenges
• Different types of data
  • Gene expression
  • Epigenetic data
    • Methylation
    • Histone modification
  • Proteomics
  • Metabolomics
  • Phenotypes
• Different platforms
  • Microarray
  • Sequencing
  • In-situ hybridization
• Different resolutions
  • Discrete vs. continuous
  • Sparse vs. complete

Minimal Unique Marker Panel Selection (Mumps) Pipeline
• Input: # of biomarkers
• Nested cross-validation (an inner loop inside an outer loop)
  • Feature selection, then classification
  • Inner cross-validation: parameterize each combination of feature selection and classification algorithms
• Rank models by AUC
• Output: the best features and classifier
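The nested cross-validation step can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Mumps implementation: the function names `stratified_folds` and `nested_cv`, the `evaluate` callback, and the fold-assignment scheme are all assumptions made for the example. The inner loop picks a (feature selection, classifier, # features) configuration using only the outer training samples; the outer loop then scores that choice on held-out samples.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping classes balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal each class round-robin across folds
    return folds


def nested_cv(labels, configs, evaluate, k_outer=3, k_inner=3, seed=0):
    """Return one (chosen config, outer score) pair per outer fold.

    evaluate(config, train_idx, test_idx) is a hypothetical callback that
    fits on train_idx and returns a score on test_idx (higher is better,
    for example an AUC).
    """
    results = []
    outer = stratified_folds(labels, k_outer, seed)
    for o, test_idx in enumerate(outer):
        train_idx = [i for f, fold in enumerate(outer) if f != o for i in fold]
        # Inner folds are built from the outer training samples only.
        inner = stratified_folds([labels[i] for i in train_idx], k_inner, seed + 1)

        def inner_score(cfg):
            scores = []
            for v, val_pos in enumerate(inner):
                val = [train_idx[p] for p in val_pos]
                fit = [train_idx[p]
                       for f, fold in enumerate(inner) if f != v for p in fold]
                scores.append(evaluate(cfg, fit, val))
            return sum(scores) / len(scores)

        best = max(configs, key=inner_score)  # model selection (inner loop)
        results.append((best, evaluate(best, train_idx, test_idx)))
    return results
```

Keeping selection in the inner loop means the outer score is never computed on samples that influenced the choice of features or classifier.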

Algorithms Feature Selection Classification • (SVM)-recursive feature elimination (RFE) • ANOVA F-value • Random Forests • Extra Trees • Correlation • Cosine • K-Nearest Neighbors (KNN) • Support Vector Machine (SVM) • Decision Tree • Random Forests • Extra Trees • Gradient Boosting
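As an illustration of one of the listed feature-selection scores, the ANOVA F-value of a single marker can be computed as the ratio of between-class to within-class variance. The sketch below is plain Python written for this example; `anova_f` and `top_features` are hypothetical helper names, and the pipeline itself may rely on a library routine instead.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the classes."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    means = {lab: sum(g) / len(g) for lab, g in groups.items()}
    ss_between = sum(len(g) * (means[lab] - grand) ** 2
                     for lab, g in groups.items())
    ss_within = sum((v - means[lab]) ** 2
                    for lab, g in groups.items() for v in g)
    if ss_within == 0:
        # No within-class spread: perfectly separating or constant feature.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_features(X, labels, n_features):
    """Indices of the n_features columns of X with the largest F value."""
    n_cols = len(X[0])
    scores = [anova_f([row[j] for row in X], labels) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:n_features]
```

A marker whose expression differs strongly between cell types and varies little within each type receives a high score and is kept first.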

Datasets
• From Broad Institute
  • Affymetrix gene expression microarray
  • 15 hematopoietic cell types
  • 82 samples
  • 4-7 samples per cell type
• Multiple sources
  • 70 samples
  • Approximately 3-7 samples per cell type
  • Affymetrix & Illumina BeadArray
  • Different labs

Experiments
• Complete
  • Complete gene expression profile from microarray datasets
• Simulated sparse
  • 70% and 50% missing data
  • Coverage of a marker follows a Beta distribution
    • Coverage: the fraction of cell types having a known expression status for a marker
  • Fifteen simulations
• Cross-validation
  • 3-fold, stratified
  • # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
  • Best set of features and classifier for each # features
• External validation
  • Use Broad data as training
  • Test against external datasets
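The simulated-sparse setting can be sketched as below. The slides state only that marker coverage follows a Beta distribution; the specific parameterization here (a Beta whose mean equals 1 minus the missing fraction, with an assumed `concentration` of 4.0) and the function name `simulate_sparse` are illustrative assumptions, not values from the study.

```python
import random


def simulate_sparse(matrix, missing_fraction, concentration=4.0, seed=0):
    """Mask entries of a (cell type x marker) matrix with None.

    For each marker, coverage (the fraction of cell types with a known
    status) is drawn from Beta(a, b) with mean 1 - missing_fraction.
    `concentration` (a + b) is an assumed value, not from the slides.
    Requires 0 < missing_fraction < 1.
    """
    rng = random.Random(seed)
    n_types, n_markers = len(matrix), len(matrix[0])
    mean_cov = 1.0 - missing_fraction
    a = mean_cov * concentration
    b = (1.0 - mean_cov) * concentration
    sparse = [list(row) for row in matrix]  # copy; the input is left unchanged
    for j in range(n_markers):
        coverage = rng.betavariate(a, b)
        n_known = round(coverage * n_types)
        known = set(rng.sample(range(n_types), n_known))
        for i in range(n_types):
            if i not in known:
                sparse[i][j] = None  # expression status unknown
    return sparse
```

Repeating the call with fifteen different seeds would give fifteen simulated datasets per missing-data level.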

Performance: Complete Data

By Algorithm: Complete Data

Performance: 70% Missing

Summary: Best Algorithms

Correlation: Broad vs External
Why the big gap?
• Cross-platform normalization
• Similarities in cell types
• Overfitting

Motivation Results Mesoderm Cell-type Identification Anti-TNF Responsiveness

Future Work
• Broader data types
  • NCI-60
    • microarray mRNA
    • microarray microRNA
    • copy number variation
    • protein array
    • SNPs
    • …
• Minimizing overfitting
• Cross-platform normalization
• Different data types
  • Integrate multiple data types simultaneously

Conclusion and Thanks
• Thanks to:
  • Ed Hemphill
  • Chih Lee
  • Ion Mandoiu
  • Craig Nelson
Smpl Bio: a commercial service coming in late 2013

Extra Slides Don’t Go Beyond, Tis A Silly Place

Experiment Overview
• Input: # of biomarkers
• Nested cross-validation on Broad data
  • Feature selection, then classification
  • Inner cross-validation: parameterize each combination of feature selection and classification algorithms
• Rank models by AUC
• Output: the best features and classifier
• External testing: test the best model
• Output: AUC of best features / classifier
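The slides rank models by AUC but do not say how AUC is extended to more than two classes. One common choice, assumed here purely for illustration, is the macro-average of one-vs-rest AUCs. The helper names `binary_auc` and `macro_ovr_auc` are hypothetical.

```python
def binary_auc(scores, positives):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(score_rows, labels, classes):
    """Unweighted mean of one-vs-rest AUCs.

    score_rows[i][c] is the classifier's score for classes[c] on sample i.
    """
    aucs = []
    for c, cls in enumerate(classes):
        col = [row[c] for row in score_rows]
        aucs.append(binary_auc(col, [lab == cls for lab in labels]))
    return sum(aucs) / len(aucs)
```

The same function could be applied both to cross-validation predictions on the training data and to predictions on the external datasets, so the two AUCs are directly comparable.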

Performance: 50% Missing
