This case study explores the integration of semantics and numerics to enhance genomic and disease data analysis. The study demonstrates the use of semantic technologies to integrate and analyze data from multiple sources, enabling advanced visualizations and mathematical analytics.
Integrating Semantics & Numerics: Case Study on Enhancing Genomic and Disease Data Using Linked Data Technologies (Semantic Technologies Meets Data Analysis) Deborah L. McGuinness Tetherless World Senior Constellation Chair Professor of Computer Science and Cognitive Science Director RPI Web Science Research Center RPI Institute for Data Exploration and Applications Health Informatics Lead Thanks to the extended RPI Tetherless World & SemNExT Teams: in particular Kristin Bennett and Evan Patton, as well as the rest of the SemNExT team: Elisabeth Brown, Hannah De Los Santos, Spencer Norris, Matt Poegel, & the ReDrugS team: Jim McCusker, Michel Dumontier, Rui Yan
Motivation • Data and data analysis tasks are exploding and the tasks are often time consuming and results are often difficult to understand for non-experts • Semantic representation languages and environments are available and enjoying increased usage • Structured vetted and maintained resources are increasingly available on the web, particularly in bioinformatics • Two groups (McGuinness – Semantic Technologies & Bennett – Data Analysis) had maturing processes that we believed could be improved if integrated • We believed tooling support could be produced to help identify and link experimental bioinformatics data and analyses with relevant semantic knowledge
SemNExT – Semantic Numeric Exploration Technology • Developing next generation integrated semantic/numeric data exploration and analysis. • Joint analysis of experimental and semantic data streams from several sources. • Combination of multiple statistical and machine learning techniques with semantically encoded domain knowledge. • Advanced interactive visualizations and mathematical data analytics techniques with full semantic markup, designed for RPI's unique platforms. Joint work between McGuinness’ Semantics group and Bennett’s Applied Math group, leveraging Experimental Media and Performing Arts Center (EMPAC) infrastructure http://tw.rpi.edu/web/project/SemNExT
Background: Ontologies An ontology specifies a rich description of the • Terminology, concepts, nomenclature • Relationships among concepts and individuals • Sentences distinguishing concepts, refining definitions and relationships (constraints, restrictions, regular expressions) relevant to a particular domain or area of interest. * Based on AAAI ‘99 Ontologies Panel ̶ McGuinness, Welty, Uschold, Gruninger, Lehmann
SemNExT Workflow • Identify data sources and ontologies • Generate ontology and instance mappings between sources (e.g., identifiers) • Identify appropriate statistical analyses based on the types of data (e.g., nominal, ratio) • Identify appropriate visualization for statistical results • Capture and expose provenance for end users
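The step "identify appropriate statistical analyses based on the types of data" can be pictured as a lookup keyed on measurement scale. The sketch below is only an illustration under assumptions: the `Scale` enum, the `CANDIDATE_ANALYSES` table, and `suggest_analyses` are hypothetical names, and the pairings are common textbook choices, not the rules SemNExT actually encodes.

```python
from enum import Enum


class Scale(Enum):
    """Stevens' levels of measurement (the slide names nominal and ratio)."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Hypothetical lookup table: pairs of variable scales -> candidate analyses.
# These are standard textbook pairings, assumed here for illustration only.
CANDIDATE_ANALYSES = {
    (Scale.NOMINAL, Scale.NOMINAL): ["chi-squared test of independence"],
    (Scale.NOMINAL, Scale.RATIO): ["one-way ANOVA", "Kruskal-Wallis test"],
    (Scale.ORDINAL, Scale.ORDINAL): ["Spearman rank correlation"],
    (Scale.RATIO, Scale.RATIO): ["Pearson correlation", "linear regression"],
}


def suggest_analyses(x: Scale, y: Scale) -> list:
    """Return candidate analyses for two variables, ignoring argument order."""
    return CANDIDATE_ANALYSES.get((x, y)) or CANDIDATE_ANALYSES.get((y, x)) or []


if __name__ == "__main__":
    print(suggest_analyses(Scale.RATIO, Scale.NOMINAL))
```

In a semantic setting the scale of each variable would presumably come from ontology annotations on the data source rather than being passed in by hand.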
Ontologies as an Enabling Technology • Identify Data Source Vocabularies. Determine equivalency and subclass relationships between different data sources • Model inputs, outputs, assumptions, techniques of statistical models, simulations • Provide automated mappings between individuals using reasoning • Upper level ontology specialized with domain knowledge
Background: Understanding Human Cerebral Cortex Development and Disease Neural Differentiation Fertilization Cortical Layer Formation Human embryonic Stem Cells hESCs Cortical Development Clock from Analysis of RNA-Seq from Day 0 to 77 captures genes' temporal role in stages of corticogenesis Analyze brain grown in a dish model Create molecular signature of normal cortical development Analyze mutated genes associated with disease to understand developmental origins Compare to diseased patient stem cell lines to identify differences, e.g., an autism signature Joint work on CORTECON Data with Dr. Chris Fasano and Sally Temple at Neural Stem Cell Inst
Primary Components • Data Source Ingest • Generate ontology and instance mappings between sources (e.g., identifiers) • Identify appropriate statistical analyses based on the types of data (e.g., nominal, ratio) • Identify appropriate visualization for statistical results • Capture and expose provenance for end users
Semantic Numeric Exploration Technology Components Ontologies are used to integrate and map concepts between data sources. Also used to power smart search, browsing, and visualizations. Semantic technologies capture the provenance of mapping, analysis, and visualization.
How gene mutations alter stages of corticogenesis to cause disease.
Knowledge Graph Ex: Associations: p ≥ 0.9 nanopub McGuinness 10/614
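As a rough illustration of restricting a knowledge graph to associations with p ≥ 0.9, the sketch below filters a list of edges by a probability threshold. The `Association` type, the `confident_associations` function, and the example node names are all hypothetical; the actual graph is queried through the ReDrugS infrastructure, which is not shown here.

```python
from typing import List, NamedTuple


class Association(NamedTuple):
    """One edge of a (hypothetical) knowledge graph with a confidence score."""
    source: str
    target: str
    probability: float


def confident_associations(edges: List[Association],
                           threshold: float = 0.9) -> List[Association]:
    """Keep only edges whose probability is at least the threshold."""
    return [e for e in edges if e.probability >= threshold]


if __name__ == "__main__":
    # Made-up example edges; names and scores are placeholders, not real data.
    edges = [
        Association("entity:A", "entity:B", 0.95),
        Association("entity:A", "entity:C", 0.90),
        Association("entity:B", "entity:D", 0.42),
    ]
    for edge in confident_associations(edges):
        print(edge.source, "->", edge.target, edge.probability)
```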
ReDrugS (Repurposing of Drugs using Semantics) • Use semantic technologies to encode and process biological knowledge to generate hypotheses about new uses for existing drugs. • Leverage existing curated data sources, build reusable integrated content sources and infrastructure McCusker, J., Solanki, K., Chang, C., Dumontier, M., Dordick, J., and McGuinness, D.L. 2014. A Nanopublication Framework for Systems Biology and Drug Repurposing. Proc. of CSHALS 2014 Boston, MA. McCusker, J., Yan, R., Solanki, K., Erickson, J.S., Chang, C., Dumontier, M., Dordick, J., and McGuinness, D.L. 2014. A Nanopublication Framework for Biological Networks using Cytoscape.js. In Proceedings of International Conference on Biomedical Ontologies (ICBO 2014) (October 6-9 2014, Houston, TX). http://tw.rpi.edu/web/doc/redrugsnanopub
Nanopublications Simple yet semantically-rich encodings allow algorithms to not just find correlation but to look for causality using reasoning NanoPub_501799_Supporting NanoPub_501799_Assertion NanoPub_501799_Attribution
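The three graph names on the slide (Supporting, Assertion, Attribution) suggest that each nanopublication bundles a claim with its evidence and its credit information. A minimal sketch of that shape, using plain Python tuples instead of an RDF library, might look as follows. The `Nanopub` class and the example triples are hypothetical and only mirror the naming pattern shown above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Nanopub:
    """A nanopublication modeled as three small named graphs."""
    identifier: str
    assertion: List[Triple] = field(default_factory=list)    # the claim itself
    supporting: List[Triple] = field(default_factory=list)   # evidence for the claim
    attribution: List[Triple] = field(default_factory=list)  # who asserted it

    def graph_names(self) -> List[str]:
        """Names of the three graphs, following the slide's naming pattern."""
        parts = ("Assertion", "Supporting", "Attribution")
        return [f"{self.identifier}_{part}" for part in parts]


if __name__ == "__main__":
    # Placeholder triples; predicates and values are illustrative only.
    pub = Nanopub(
        identifier="NanoPub_501799",
        assertion=[("ex:proteinA", "ex:interactsWith", "ex:proteinB")],
        supporting=[("ex:thisAssertion", "ex:detectedBy", "ex:someMethod")],
        attribution=[("ex:thisAssertion", "ex:importedFrom", "ex:iRefIndex")],
    )
    print(pub.graph_names())
```

Keeping the claim separate from its evidence and attribution is what lets a reasoner weigh assertions by how, and by whom, they were established.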
Experimental Method Coverage • 99.98% coverage of the ~936,000 nanopubs with evidence data from iRefIndex. • Top 10 methods (86% coverage):
Powering Interfaces: Querying the Knowledge Graph nanopub McGuinness 1/7/2015
Annotating the Chord Heat Map / Group Interactive Visualization Chord Heat Map is interactive and annotated. Also available on additional platforms: CAMPFIRE Proprietary Platform being developed at RPI
Discussion • Data analyst has much less manual work to find connections AND potential semantic relationships • Helping to move along the path from human expert to semi-automatic service to help move from correlation to potential causation • This is scratching the surface in the potential for semantic numeric integration but has potential now • Platform is ready for usage and collaborators • Contact us! dlm@cs.rpi.edu
Current Opportunities & Challenges: Vocabularies child health/ exposure… Metadata characterizing studies & methods... Definitions... Studies... Evolving Ontology Data Science Domain Ontologies & Mappings... Use Cases... Examples: Relationships between: small for gestational age (SGA) and lifecycle outcome; preterm birth and neurocognitive faltering Policy The Open Biological and Biomedical Ontologies
Current Challenges & Opportunities: Annotation for Reuse • Data analysis is both science and art • Many decisions, such as how to handle missing values, are made and often not recorded • Some toolkits automatically do some “cleanup”, again often not recorded • Integrating results from multiple analyses often requires deep understanding of what was done • Ongoing work is addressing adequate markup • Motivating example: 2 CPP (Collaborative Perinatal Project) analyses done at RPI – what does it take to combine them…
Current Challenges and Opportunities: Context • Context of data is often missing, yet it is often critical • True in many settings including current work on global health • E.g., Access to food – weather impacts; SES metadata inconsistent, incomplete, highly variable, … • What is your favorite challenge / opportunity?
More Information • Questions? dlm@cs.rpi.edu http://tw.rpi.edu/web/project/SemNExT Also, SemStats paper at International Semantic Web Conference (ISWC) https://semstats.wordpress.com/
Legend • GO term ID located in slices • Selected genes highlighted in cluster color • 2) Cluster Match Percentile = (cluster percentile) / (sum of fuzzy-cmeans.m cluster percentiles) • Only clusters with > .16 • Scaled to 100% for comparison purposes • p-score printed, wedge height scaled- p-score determines enrichedness of cluster per term • Chords drawn between instances of repeated terms (similarities in class provenance) • - Color of cluster w/ highest p-value for term chosen • 5) Pastel colored lines circling figure = clusters’ average GO term p-values 22.8% 42.62% 22.80%
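The Cluster Match Percentile in the legend can be read as a normalize, filter, rescale computation. The sketch below follows that reading under stated assumptions: that the 0.16 cutoff applies to each cluster's share of the total, and that "scaled to 100%" means the surviving shares are renormalized to sum to 100. The function name and input values are hypothetical, and the original fuzzy c-means code is not reproduced here.

```python
from typing import Dict


def cluster_match_percentiles(percentiles: Dict[str, float],
                              threshold: float = 0.16) -> Dict[str, float]:
    """Normalize per-cluster percentiles, drop weak clusters, rescale to 100%."""
    total = sum(percentiles.values())
    if total <= 0:
        return {}
    # Step 1: each cluster's share = cluster percentile / sum of percentiles.
    shares = {c: p / total for c, p in percentiles.items()}
    # Step 2 (assumed reading of "> .16"): keep clusters whose share exceeds it.
    kept = {c: s for c, s in shares.items() if s > threshold}
    kept_total = sum(kept.values())
    if kept_total == 0:
        return {}
    # Step 3: rescale the surviving shares so they sum to 100%.
    return {c: 100.0 * s / kept_total for c, s in kept.items()}


if __name__ == "__main__":
    # Made-up memberships of one gene in four clusters.
    print(cluster_match_percentiles({"c1": 0.5, "c2": 0.3, "c3": 0.1, "c4": 0.1}))
```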
Semantic Results - Certain strongly enriched terms in different clusters - Weaker link in one cluster suggests membership to others with higher p-val. Ex. GO:0031105 owl:sameAs umls:C1423771; umls:C1423771 rdfs:label ‘SEPT6’; GO:0031105 rdfs:label ‘septin complex’ → SEPT6 is the same gene as septin complex. Visualization techniques reveal multiple other such relationships between semantics and statistics. However, only Cluster 4 (red) is enriched for septin complex; strengthens case for membership in Cluster 4. Same logic applies to other terms heavily enriched for specific clusters. Semantics conflicts with statistical assessment of cluster assignments but also opens up dynamic between the two. (Figure labels: 22.8%, 42.62%, 22.80%)
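The owl:sameAs example above amounts to sharing labels across identifiers that are declared equivalent. A minimal sketch of that inference over plain tuples, treating owl:sameAs as symmetric and transitive, is given below. The function name is hypothetical, and a real deployment would presumably rely on an OWL reasoner or triple store rather than hand-rolled code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

SAME_AS = "owl:sameAs"
LABEL = "rdfs:label"


def labels_with_same_as(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Collect, for each node, the labels of every node it is owl:sameAs."""
    neighbors = defaultdict(set)  # symmetric sameAs adjacency
    labels = defaultdict(set)     # labels stated directly on each node
    for s, p, o in triples:
        if p == SAME_AS:
            neighbors[s].add(o)
            neighbors[o].add(s)
        elif p == LABEL:
            labels[s].add(o)

    result = {}
    for node in set(neighbors) | set(labels):
        # Walk the sameAs links to find the node's whole equivalence class.
        seen, stack = {node}, [node]
        while stack:
            current = stack.pop()
            for nxt in neighbors.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result[node] = set().union(*(labels.get(n, set()) for n in seen))
    return result


if __name__ == "__main__":
    # The three statements from the slide.
    triples = [
        ("GO:0031105", SAME_AS, "umls:C1423771"),
        ("umls:C1423771", LABEL, "SEPT6"),
        ("GO:0031105", LABEL, "septin complex"),
    ]
    print(labels_with_same_as(triples)["GO:0031105"])
```

After inference both identifiers carry both labels, which is the link the visualization then compares against the statistical cluster assignment.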