
Biomarker and Classifier Selection in Diverse Genetic Datasets

Biomarker and Classifier Selection in Diverse Genetic Datasets. University Of Connecticut 1 Department of Computer Science and Engineering 2 Department of Molecular and Cell Biology. James Lindsay 1 Ed Hemphill 2 Chih Lee 1 Ion Mandoiu 1 Craig Nelson 2.


Presentation Transcript


  1. Biomarker and Classifier Selection in Diverse Genetic Datasets
     University of Connecticut
     ¹ Department of Computer Science and Engineering
     ² Department of Molecular and Cell Biology
     James Lindsay¹, Ed Hemphill², Chih Lee¹, Ion Mandoiu¹, Craig Nelson²

  2. Motivation 1: Cell-type Identification
     • The Question: What is the smallest # of genes needed to identify each cluster?
       • B: Bone
       • C: Myeloid
       • D: Endothelial
     • Available Data: Literature-annotated present/absent status for 50 cell types, 600 genes in mesoderm lineage.
     • In collaboration with: Dr. Hector Leonardo Aguila, UCHC

  3. Motivation 2: Clinical Diagnostics
     • Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS ONE 2012

  4. Multi-class Classification Problem
     Multi-class Classification
     • There are 2 or more classes
     • Supervised learning
     Key Problems:
     • Feature Selection: What are the most predictive biomarkers?
     • Classification: What is the best classification algorithm?

  5. Challenges
     • Different types of data
       • Gene expression
       • Epigenetic data
         • Methylation
         • Histone modification
       • Proteomics
       • Metabolomics
       • Phenotypes
     • Different Platforms
       • Microarray
       • Sequencing
       • In-situ hybridization
     • Different Resolutions
       • Discrete vs Continuous
       • Sparse vs Complete

  6. Minimal Unique Marker Panel Selection (Mumps) Pipeline
     [Flow diagram; recoverable labels:]
     • Input: # of biomarkers
     • Nested Cross Validation
       • Inner Cross-validation: Feature Selection and Classification. Parameterize each combination of feature selection and classification algorithms
       • Outer Cross-validation: Rank Models by AUC
     • Output: the best features and classifier
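
The nested cross-validation idea on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the Mumps implementation: a mean-spread score stands in for the feature selection algorithms, 1-nearest-neighbor stands in for the classifiers, accuracy is swapped in for AUC to keep the code short, and every function name (`stratified_folds`, `rank_features`, `fit_predict`, `cv_accuracy`, `nested_cv`) is hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def rank_features(X, y):
    """Rank features by the spread of their per-class means (stand-in score)."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        means = []
        for c in classes:
            vals = [row[j] for row, lab in zip(X, y) if lab == c]
            means.append(sum(vals) / len(vals))
        scores.append(max(means) - min(means))
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def fit_predict(train_X, train_y, test_X, n_features):
    """Select features on the training split only, then classify with 1-NN."""
    keep = rank_features(train_X, train_y)[:n_features]
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in keep)
    preds = []
    for x in test_X:
        nearest = min(range(len(train_X)), key=lambda i: dist(x, train_X[i]))
        preds.append(train_y[nearest])
    return preds

def cv_accuracy(X, y, n_features, k, rng):
    """Inner loop: cross-validated accuracy for one feature count."""
    correct = 0
    for fold in stratified_folds(y, k, rng):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in fold], n_features)
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(y)

def nested_cv(X, y, feature_counts, k=3, seed=0):
    """Outer loop: tune on the training part, score on the held-out fold."""
    rng = random.Random(seed)
    results = []
    for fold in stratified_folds(y, k, rng):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        tr_X, tr_y = [X[i] for i in train], [y[i] for i in train]
        # The held-out fold is never seen while choosing the feature count.
        best = max(feature_counts,
                   key=lambda m: cv_accuracy(tr_X, tr_y, m, k, rng))
        preds = fit_predict(tr_X, tr_y, [X[i] for i in fold], best)
        acc = sum(p == y[i] for p, i in zip(preds, fold)) / len(fold)
        results.append((best, acc))
    return results
```

Calling `nested_cv(X, y, feature_counts=[2, 8, 16])` returns one `(chosen feature count, held-out accuracy)` pair per outer fold. Keeping selection inside the inner loop is the point of the design: the outer score is not inflated by features chosen with knowledge of the test samples.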

  7. Algorithms (Feature Selection and Classification)
     • (SVM)-recursive feature elimination (RFE)
     • ANOVA F-value
     • Random Forests
     • Extra Trees
     • Correlation
     • Cosine
     • K-Nearest Neighbors (KNN)
     • Support Vector Machine (SVM)
     • Decision Tree
     • Random Forests
     • Extra Trees
     • Gradient Boosting
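
As one concrete example from this list, the ANOVA F-value scores a single marker by comparing variation between class means with variation within classes. The sketch below computes the standard one-way ANOVA F statistic in plain Python; the function names `anova_f` and `top_features` are hypothetical, and nothing here is taken from the presenters' code.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature across labeled samples."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    if ss_within == 0:
        # No within-class variation: perfectly separating or constant feature.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_features(X, y, n_features):
    """Indices of the n_features columns of X with the largest F-values."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:n_features]
```

A larger F-value means the class means differ more relative to the noise within each class, so the highest-scoring markers are kept.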

  8. Datasets
     • From Broad Institute
       • Affymetrix gene expression microarray
       • 15 hematopoietic cell types
       • 82 samples
       • 4-7 samples per cell type
     • Multiple Sources
       • 70 samples
       • Approximately 3-7 samples per cell type
       • Affymetrix & Illumina Bead Array
       • Different labs

  9. Experiments
     • Complete
       • Complete gene expression profile from microarray datasets.
     • Simulated Sparse
       • 70% and 50% missing data
       • Coverage of a marker followed a Beta distribution.
       • Coverage: the fraction of cell types having known expression statuses for a marker.
       • Fifteen simulations
     • Cross-validation
       • 3-fold, stratified
       • # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
       • Best set of features and classifier for each # features
     • External validation
       • Use Broad data as training
       • Test against external datasets
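
The simulated-sparse setup can be illustrated with a small stdlib-only sketch: for each marker, a coverage value is drawn from a Beta distribution and only that fraction of cell types keeps a known status. The Beta parameters are an assumption (the slide does not give them); here they are chosen so that the mean coverage equals 1 minus the missing rate. The names `simulate_sparse` and `concentration` are hypothetical.

```python
import random

def simulate_sparse(matrix, missing_rate, concentration=10.0, seed=0):
    """Mask entries of matrix[cell_type][marker] with None.

    Per-marker coverage is drawn from Beta(alpha, beta) with mean
    1 - missing_rate; `concentration` (alpha + beta) is an assumed value.
    Requires 0 < missing_rate < 1. The input matrix is not modified.
    """
    rng = random.Random(seed)
    n_types = len(matrix)
    n_markers = len(matrix[0])
    alpha = (1.0 - missing_rate) * concentration
    beta = missing_rate * concentration
    sparse = [row[:] for row in matrix]
    for j in range(n_markers):
        coverage = rng.betavariate(alpha, beta)
        n_known = round(coverage * n_types)
        # Cell types whose status for marker j stays known.
        known = set(rng.sample(range(n_types), n_known))
        for i in range(n_types):
            if i not in known:
                sparse[i][j] = None
    return sparse
```

Fifteen independent simulations at 70% missing could then be generated with `[simulate_sparse(matrix, 0.7, seed=s) for s in range(15)]`.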

  10. Performance: Complete Data

  11. By Algorithm: Complete Data

  12. Performance: 70% Missing

  13. Summary: Best Algorithms

  14. Correlation: Broad vs External
     Why the Big Gap?
     • Cross-platform normalization
     • Similarities in cell-types
     • Over-fitting

  15. Motivation Results
     • Mesoderm Cell-type Identification
     • Anti-TNF Responsiveness

  16. Future Work
     • Broader Data-types
       • NCI-60
         • microarray mRNA
         • microarray microRNA
         • copy number variation
         • protein array
         • SNPs
         • …
     • Minimizing overfitting
     • Cross-platform normalization
     • Different Data types
       • Integrate multiple data types simultaneously

  17. Conclusion and Thanks
     • Thanks to:
       • Ed Hemphill
       • Chih Lee
       • Ion Mandoiu
       • Craig Nelson
     Smpl Bio: A commercial service coming in late 2013

  18. Extra Slides Don’t Go Beyond, Tis A Silly Place

  19. Experiment Overview
     [Flow diagram; recoverable labels:]
     • Input: # of biomarkers
     • Broad Data: Nested Cross Validation
       • Inner Cross-validation: Feature Selection and Classification. Parameterize each combination of feature selection and classification algorithms
       • Outer Cross-validation: Rank Models by AUC
       • Output the best features and classifier
     • External Testing: Test Best Model
     • Output: AUC of best features / classifier

  20. Performance: 50% Missing
