Domain agnostic tools for multi-scale/integrative sensor data analysis. Joel Saltz MD, PhD Stony Brook University. Radiology Imaging. Integrative Biomedical Informatics Analysis. Patient Outcome.
Domain agnostic tools for multi-scale/integrative sensor data analysis Joel Saltz MD, PhD Stony Brook University
Radiology Imaging Integrative Biomedical Informatics Analysis Patient Outcome • Reproducible anatomic/functional characterization at fine level (Pathology) and gross level (Radiology) • High throughput multi-scale image segmentation, feature extraction, analysis of features • Integration of anatomic/functional characterization with multiple types of “omic” information “Omic” Data Pathologic Features
Overview • Pathology Computer Aided Diagnosis • Integrative analysis of tissue: pathology, radiology, ‘omics’ and outcome • Management, query, analysis of integrative data • High end Computing tools for multi-scale analysis • Electronic health data: analytics, tools for Clinical phenotype characterization, population health
Pathology Computer Assisted Diagnosis Gurcan, Shimada, Kong, Saltz
Neuroblastoma Classification (CANCER 2003; 98:2274-81) FH: favorable histology; UH: unfavorable histology • Ganglioneuroma (Schwannian stroma-dominant): Maturing subtype, Mature subtype; FH • Ganglioneuroblastoma, Intermixed (Schwannian stroma-rich): FH • Ganglioneuroblastoma, Nodular (composite, Schwannian stroma-rich/stroma-dominant and stroma-poor): UH/FH*; variant forms* • Neuroblastoma (Schwannian stroma-poor): Undifferentiated, Poorly differentiated and Differentiating subtypes; FH or UH depending on subtype, age (<1.5 yr, ≥1.5 yr, <5 yr, ≥5 yr) and mitotic & karyorrhectic cells (<100, 100-200, ≥200 per 5,000 cells) [Flowchart: branches on Schwannian development (absent, none to <50%, ≥50%) and on neuroblastic foci (microscopic) or grossly visible nodule(s)]
Computerized Classification System for Grading Neuroblastoma • Background Identification • Image Decomposition (Multi-resolution levels) • Image Segmentation (EMLDA) • Feature Construction (2nd order statistics, Tonal Features) • Feature Extraction (LDA) + Classification (Bayesian) • Multi-resolution Layer Controller (Confidence Region) [Flowchart: TRAINING path (Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training) and TESTING path (Image Tile, Background?, Initialization I = L, Create Image I(L), Segmentation, Feature Construction, Feature Extraction, Classification, Within Confidence Region?, I > 1?, I = I - 1, Label)]
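The multi-resolution layer controller can be read as a coarse-to-fine loop: classify at the coarsest level, and move to a finer level only when the result is not confident enough. The following is a minimal sketch of that idea, not the authors' implementation. It assumes level L is the most down-sampled image and level 1 is full resolution; the `Classifier` interface, the brightness-based background rule, and the names `classify_tile`, `min_confidence` and `background_level` are all hypothetical.

```python
from typing import Callable, List, Tuple

Tile = List[List[float]]
# Hypothetical interface: (image at one level, level) -> (label, confidence in [0, 1]).
Classifier = Callable[[Tile, int], Tuple[str, float]]


def downsample(tile: Tile, factor: int) -> Tile:
    """Average non-overlapping factor x factor blocks (dimensions assumed divisible)."""
    out = []
    for r in range(0, len(tile), factor):
        row = []
        for c in range(0, len(tile[0]), factor):
            block = [tile[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def classify_tile(tile: Tile, classify: Classifier, levels: int,
                  min_confidence: float, background_level: float = 0.95):
    """Return (label, level at which the decision was made)."""
    flat = [v for row in tile for v in row]
    # Assumed background rule: a very bright tile is treated as empty glass.
    if sum(flat) / len(flat) >= background_level:
        return "background", levels
    level = levels  # I = L: start at the coarsest level
    while True:
        image = downsample(tile, 2 ** (level - 1))
        label, confidence = classify(image, level)
        # Accept when confident, or when no finer level remains (I = 1).
        if confidence >= min_confidence or level == 1:
            return label, level
        level -= 1  # I = I - 1: retry at the next finer level
```

Most tiles would stop at a coarse level under this scheme, so the expensive full-resolution pass runs only on the ambiguous ones.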
INTEGRATIVE ANALYSIS OF TISSUE: pathology, radiology, ‘omics’ and outcome
Quantitative Feature Analysis in Pathology: Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)
Using TCGA Data to Study Glioblastoma Diagnostic Improvement Molecular Classification Predictors of Progression
TCGA Network Digital Pathology Neuroimaging
Morphological Tissue Classification Whole Slide Imaging Cellular Features Nuclei Segmentation Lee Cooper, Jun Kong
Millions of Nuclei Defined by n Features Top-down analysis: use the features with existing diagnostic constructs
TCGA Whole Slide Images Step 1: Nuclei Segmentation • Identify individual nuclei and their boundaries Jun Kong
Nuclear Analysis Workflow Step 1: Nuclei Segmentation Step 2: Feature Extraction • Describe individual nuclei in terms of size, shape, and texture
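As an illustration of Step 2, here is a minimal sketch of a few size and shape descriptors computed from a nucleus boundary. It assumes each segmented nucleus is available as a list of (x, y) boundary vertices; the function names are hypothetical, and the eccentricity is approximated from the covariance of the vertices rather than from the filled region. Texture features are not covered.

```python
import math


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Sum of edge lengths around the closed boundary."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def circularity(points):
    """4*pi*area / perimeter^2: 1.0 for a circle, smaller for irregular shapes."""
    perimeter = polygon_perimeter(points)
    return 4.0 * math.pi * polygon_area(points) / (perimeter * perimeter)


def eccentricity(points):
    """Eccentricity of the ellipse matching the vertex covariance (an approximation)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    half_trace = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    major, minor = half_trace + spread, half_trace - spread  # covariance eigenvalues
    return math.sqrt(max(0.0, 1.0 - minor / major)) if major > 0 else 0.0
```

A round nucleus scores high on circularity and low on eccentricity; an elongated one does the opposite.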
Step 3: Nuclei Classification Nuclear Qualities 1 10 Astrocytoma Oligodendroglioma
Survival Analysis Human Machine
Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification Oligo Related Genes Myelin Basic Protein Proteolipoprotein HoxD1 Nuclear features most Associated with Oligo Signature Genes: Circularity (high) Eccentricity (low)
Millions of Nuclei Defined by n Features Bottom-up analysis: let nuclear features define and drive the analysis
Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information Lee Cooper, Carlos Moreno
Nuclear Features Used to Classify GBMs Consensus clustering of morphological signatures Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
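Consensus clustering quantifies co-clustering by running K-means many times and recording how often each pair of samples lands in the same cluster. The sketch below shows that idea on toy data and is not the study's pipeline. Re-running on random subsamples, the 80% subsample fraction, and the names `kmeans` and `co_clustering` are assumptions made for illustration.

```python
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means on tuples; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def co_clustering(points, k, runs, subsample=0.8, seed=0):
    """Fraction of runs in which each pair, when sampled together, shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), max(k, int(subsample * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

Entries near 1 or 0 indicate stable group membership. One common way to choose the number of clusters is to repeat this for each candidate `k` and compare how crisp the matrices are.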
Clustering identifies three morphological groups • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides) • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB) • Prognostically-significant (logrank p=4.5e-4)
Molecular and Pathology Correlates of MR Features Using TCGA Data MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools MR Features compared to TCGA Transcriptional Classes, Genetic Alterations and Pathology NCI/in silico group led by Adam Flanders
Emory CTD2 Center: High throughput protein-protein interaction interrogation in cancer • Emory Molecular Interaction Center for Functional Genomics (MicFG) Principal Investigator and Director: Haian Fu Co-Directors: Fadlo R. Khuri, Joel Saltz Project Manager: Margaret Johns Aim 2 Leader Carlos Moreno Aim 1 Leader Yuhong Du Genomics informatics and data integration Cancer genomics-based HT PPI network discovery & validation Winship Cancer Institute Center for Comprehensive Informatics Emory Chemical Biology Discovery Center
Rich morphological and molecular characterizations of macroscopic tissue samples at microscopic resolution multi-Scale Imaging: Integrated structure and molecular characterization
Quantum Dot Immunohistochemistry, LCM + NGS, Imaging Mass Spec Genomics Excellent Molecular Resolution Limited Spatial Resolution Imaging Excellent Spatial Resolution Limited Molecular Resolution 1000’s of genes
Integrative Multi-scale Biomedical Informatics • Quantitative analyses of the interplay between morphology and spatially mapped genetics and molecular data to be used in studies that predict outcome and response to treatment • Assemble, visualize and quantify detailed, multi-scale descriptions of tissue morphologic changes originating from a wide range of microscopy instruments • Create/adapt computational and pattern recognition tools to integrate these descriptions with corresponding genomic, proteomic, glycomic, and clinical signatures.
Driving Biomedical Problems • Human: Lung Cancer Heterogeneity and Targeted Therapy (Khuri, Marcus) • Human: Gastrointestinal Cancer Risk Stratification and Prevention (Bostick, Baron) • Human and Mouse model: Glioma Microenvironment and Systems Biology (Brat, Mikkelsen) • Mouse model: Role of PTEN in the orchestrated sequence of events leading to tumor initiation (Leone) • Mouse model: Role of Tn, STn tumor antigens in cancer initiation and progression, the impact of tissue-type specific alterations in Cosmc and the impact of altered expression of T-synthase (Cummings)
Tumor heterogeneity • Multiple definitions: • Genetic, epigenetic heterogeneity within tumor • Differences in microenvironments within tumor • Phenome differences within tumor • Heterogeneity involving primary and metastases • Characterization: • Imaging phenotype (radiology, pathology, optical…) • Molecular phenotype • Spatially characterized molecular phenotype (Laser captured microdissection, imaging mass spec, molecular imaging) • … Correlating Imaging Phenotypes with Genomic Signatures: Scientific Opportunities
Clinical Approach and Use • Development of imaging+analysis methods to characterize heterogeneity • within a tumor at one time point • evolution over time • among different tumor types • Development of imaging metrics that: • can predict and detect emergence of resistance? • correlate with genomic heterogeneity? • correlate with habitat heterogeneity? • can identify more homogeneous sub-types? Correlating Imaging Phenotypes with Genomic Signatures: Scientific Opportunities
Radiology Imaging Patient Outcome “Omic” Data Pathologic Features Management, query, analysis of INTEGRATIVE DATA
Large Scale Spatial Query, Analysis and Data Management • Highly optimized spatial query and analyses • Hadoop/HDFS, IBM DB2, optimized CPU/GPU spatial algorithms • Supported by two NLM R01 grants – Saltz/Foran PAIS Database • Implemented with IBM DB2 for large scale pathology image metadata (~million markups per slide) • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc. • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships • Support for high-level data statistical analysis
Spatial Centric – Pathology Imaging “GIS” • Point query: human marked point inside a nucleus • Window query: return markups contained in a rectangle • Containment query: nuclear feature aggregation in tumor regions • Spatial join query: algorithm validation/comparison Fusheng Wang
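The point, window and containment queries can be illustrated in memory with a few lines of geometry. This is a minimal sketch that assumes markups are stored as a dict from a markup id to a list of (x, y) polygon vertices; it is not the PAIS/DB2 implementation, the function names are hypothetical, and the containment test checks vertices only, which is exact for convex regions.

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: count boundary crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_query(markups, pt):
    """Ids of markups (e.g. nuclei) that contain a human-marked point."""
    return sorted(mid for mid, poly in markups.items() if point_in_polygon(pt, poly))


def window_query(markups, window):
    """Ids of markups whose bounding box lies inside (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = window
    return sorted(
        mid for mid, poly in markups.items()
        if all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in poly)
    )


def containment_query(markups, region):
    """Ids of markups with every vertex inside a region polygon (e.g. a tumor region)."""
    return sorted(
        mid for mid, poly in markups.items()
        if all(point_in_polygon(v, region) for v in poly)
    )
```

At roughly a million markups per slide, scanning every polygon like this would be too slow; that workload is what motivates the spatial indexing and optimized CPU/GPU algorithms named on the slide.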
VLDB 2012, 2013 Spatial Query, Change Detection, Comparison, and Quantification
High end Computing tools for multi-scale analysis (aka Big Data) Partnership with Oak Ridge National Laboratory (collaborators -- Scott Klasky, Jeff Vetter)
Macroscopic 3-D Tissue at Micron Resolution: OSU BISTI NBIB Center Big Data (2005) • Associate genotype with phenotype • Big science experiments on cancer, heart disease, pathogen host response • Tissue specimen: 1 cm^3 at 0.3 μ resolution, roughly 10^13 bytes • Molecular data (spatial location) can add an additional significant factor, e.g. 10^2 • Multispectral imaging, laser captured microdissection, Imaging Mass Spec, Multiplex QD • Multiple tissue specimens: another factor of 10^3 • Total: 10^18 bytes, an exabyte per big science experiment
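The order-of-magnitude estimate on this slide can be checked with a few lines of arithmetic. The sketch assumes cubic voxels 0.3 micrometers on a side and one byte per voxel; both are illustrative assumptions, since the slide states neither.

```python
SIDE_UM = 1e4        # 1 cm expressed in micrometers
VOXEL_UM = 0.3       # assumed isotropic voxel size
BYTES_PER_VOXEL = 1  # assumed storage per voxel

voxels_per_side = SIDE_UM / VOXEL_UM
specimen_bytes = voxels_per_side ** 3 * BYTES_PER_VOXEL  # one 1 cm^3 specimen

molecular_factor = 1e2  # spatially mapped molecular data
specimen_factor = 1e3   # multiple tissue specimens
experiment_bytes = specimen_bytes * molecular_factor * specimen_factor

print(f"per specimen:   {specimen_bytes:.1e} bytes")
print(f"per experiment: {experiment_bytes:.1e} bytes")
```

Under these assumptions a single specimen lands in the 10^13 range and the full experiment in the 10^18 (exabyte) range, matching the slide to within an order of magnitude.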
Integrate Information from Sensors, Images, Cameras • Multi-dimensional spatial-temporal datasets • Radiology and Microscopy Image Analyses • Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation • Biomass monitoring and disaster surveillance using multiple types of satellite imagery • Weather prediction using satellite and ground sensor data • Analysis of Results from Large Scale Simulations • Square Kilometer Array • Google Self Driving Car • Correlative and cooperative analysis of data from multiple sensor modalities and sources • Equivalent from standpoint of data access patterns – we propose an integrative sensor data mini-App
Core Transformations • Data Cleaning and Low Level Transformations • Data Subsetting, Filtering, Subsampling • Spatio-temporal Mapping and Registration • Object Segmentation • Feature Extraction • Object/Region/Feature Classification • Spatio-temporal Aggregation • Change Detection, Comparison, and Quantification
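Several of these transformations can be chained on a toy 2-D grid to show how they compose independently of the application domain. This is a minimal sketch with hypothetical function names: threshold-based object segmentation with 4-connectivity, two per-object features, aggregation over the grid, and change detection between two time points.

```python
from collections import deque


def segment(grid, threshold):
    """Object segmentation: 4-connected components of cells above threshold."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] > threshold and not seen[r][c]:
                seen[r][c] = True
                queue, cells = deque([(r, c)]), []
                while queue:  # breadth-first flood fill of one object
                    y, x = queue.popleft()
                    cells.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] > threshold and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                objects.append(cells)
    return objects


def extract_features(grid, cells):
    """Feature extraction: area and mean intensity of one object."""
    values = [grid[y][x] for y, x in cells]
    return {"area": len(cells), "mean": sum(values) / len(values)}


def aggregate(grid, threshold):
    """Spatial aggregation: object count and total object area for one grid."""
    feats = [extract_features(grid, cells) for cells in segment(grid, threshold)]
    return {"count": len(feats), "total_area": sum(f["area"] for f in feats)}


def detect_change(before, after, threshold):
    """Change detection: difference of the aggregates at two time points."""
    a, b = aggregate(before, threshold), aggregate(after, threshold)
    return {key: b[key] - a[key] for key in a}
```

The same chain applies whether the grid is a microscopy tile, a satellite image, or a simulation output slice, which is the sense in which the transformations are domain agnostic.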
Runtime Support Objectives - (Similar to what is required for most applications discussed today!) • Coordinated mapping of data and computation to complex memory hierarchies • Hierarchical work assignment with flexibility capable of dealing with data dependent computational patterns, fluctuations in computational speed associated with power management, faults • Linked to comprehensible programming model – model targeted at abstract application class but not to application domain (In the sensor, image, camera case -- Region Templates) • Software stack including coordinated compiler/runtime support/autotuning frameworks
HPC Segmentation and Feature Extraction Pipeline Tony Pan, George Teodoro, Tahsin Kurc and Scott Klasky
Electronic health data: analytics, tools for Clinical phenotype characterization, population health Andrew Post, Sharath Cholleti, Doris Gao, Joel Saltz, Bill Bornstein (Emory); David Levine, Sam Hohmann (UHC)
Clinical Phenotype Characterization and the Emory Analytic Information Warehouse • Find hot spots in readmissions within 30 days • What fraction of patients with a given principal diagnosis will be readmitted within 30 days? • What fraction of patients with a given set of diseases will be readmitted within 30 days? • How do severity and time course of co-morbidities affect readmissions? • Geographic analyses • Compare and contrast with UHC Clinical Data Base • Repeat analyses across all 180+ UHC hospitals • Hospital to hospital differences • Ability to predict readmissions across hospitals • Need a repeatable process that we can apply identically to both local and UHC data
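The first readmission question reduces to a grouped count over encounter records. The sketch below is not the Analytic Information Warehouse logic. It assumes each encounter is a dict with hypothetical keys `patient_id`, `admit`, `discharge` and `principal_dx`, and it counts index encounters rather than distinct patients. It applies no exclusions, such as the chemotherapy and radiation therapy readmissions removed from the actual datasets.

```python
from collections import defaultdict
from datetime import date


def readmission_fraction(encounters, window_days=30):
    """Per principal diagnosis: share of encounters followed by a readmission
    of the same patient within `window_days` of discharge."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient_id"]].append(enc)

    totals = defaultdict(int)
    readmits = defaultdict(int)
    for visits in by_patient.values():
        visits.sort(key=lambda enc: enc["admit"])
        for i, enc in enumerate(visits):
            totals[enc["principal_dx"]] += 1
            if i + 1 < len(visits):
                gap = (visits[i + 1]["admit"] - enc["discharge"]).days
                if 0 <= gap <= window_days:
                    readmits[enc["principal_dx"]] += 1
    return {dx: readmits[dx] / totals[dx] for dx in totals}


# Toy data with placeholder diagnosis codes.
example = [
    {"patient_id": "A", "admit": date(2012, 1, 3), "discharge": date(2012, 1, 10), "principal_dx": "DX_1"},
    {"patient_id": "A", "admit": date(2012, 1, 25), "discharge": date(2012, 1, 28), "principal_dx": "DX_2"},
    {"patient_id": "B", "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5), "principal_dx": "DX_1"},
]
fractions = readmission_fraction(example)
```

Because the function takes plain records, the identical computation can be run on local and UHC extracts, in line with the slide's call for a repeatable process.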
Analytic Information Warehouse 5-year Datasets from Emory and University HealthSystem Consortium (UHC) • EUH, EUHM and WW (inpatient encounters) • Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data) • Encounter location (down to unit for Emory) • Providers (Emory only) • Discharge disposition • Primary and secondary ICD9 codes • Procedure codes • DRGs • Medication orders (Emory only) • Labs (Emory only) • Vitals (Emory only) • Geographic information (CDW only + US Census and American Community Survey)
Geographic Analyses UHC Medicine General Product Line (#15) Analytic Information Warehouse
Analytic Information Warehouse Predictive Modeling for Readmission • Random forests (ensemble of decision trees) • Create a decision tree using a random subset of the variables in the dataset • Generate a large number of such trees • All trees vote to classify each test example in a training dataset • Generate a patient-specific readmission risk for each encounter • Rank the encounters by risk for a subsequent 30-day readmission
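The voting and ranking steps can be sketched with a deliberately simplified ensemble. To stay short, each "tree" here is a depth-1 decision stump trained on a bootstrap sample and a random subset of the variables, which is a stand-in for the full decision trees of a random forest and not the model used in the study. The feature layout and all function names are hypothetical.

```python
import random


def fit_stump(rows, labels, features):
    """Best single-feature threshold rule (fewest training errors) among `features`."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            for polarity in (1, 0):  # class predicted when row[f] > t
                preds = [polarity if row[f] > t else 1 - polarity for row in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    return best[1:]  # (feature, threshold, polarity)


def fit_forest(rows, labels, n_trees=50, n_features=1, seed=0):
    """Train stumps on bootstrap samples, each seeing a random subset of variables."""
    rng = random.Random(seed)
    n = len(rows)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of encounters
        feats = rng.sample(all_features, n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def readmission_risk(forest, row):
    """Encounter-specific risk: fraction of trees voting 'readmitted' (class 1)."""
    votes = [pol if row[f] > t else 1 - pol for f, t, pol in forest]
    return sum(votes) / len(votes)


def rank_by_risk(forest, rows):
    """Indices of encounters ordered from highest to lowest predicted risk."""
    return sorted(range(len(rows)), key=lambda i: readmission_risk(forest, rows[i]),
                  reverse=True)
```

Taking the vote fraction as a risk score is what allows encounters to be ranked and a top slice, such as the highest-risk 10%, to be compared against the rest.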
Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
Predictive Modeling Applied to 180 UHC Hospitals Readmission fraction of top 10% high risk patients