Tactical Formalization of Linked Open Data • Michel Dumontier, Ph.D. • Associate Professor of Medicine (Biomedical Informatics) • Stanford University @micheldumontier::RDA Workshop:23-02-2015
Linked Open Data provides an incredibly dynamic, rapidly growing set of interlinked resources
Bio2RDF is an open source project to unify the representation and interlinking of biological data (Linked Data for the Life Sciences)
• chemicals/drugs/formulations
• genomes/genes/proteins, domains
• interactions, complexes & pathways
• animal models and phenotypes
• disease, genetic markers, treatments
• terminologies & publications
11B triples from 35 datasets. Each dataset is described using its own schema.
Query data, ontologies, and services simultaneously using federated queries over independent SPARQL databases
Example (Gene Ontology): get all protein catabolic processes (and more specific) in biomodels; query against <http://bioportal.bio2rdf.org/sparql>
    SELECT ?go ?label count(distinct ?x)
    WHERE {
      ?go rdfs:label ?label .
      ?go rdfs:subClassOf ?tgo .
      ?tgo rdfs:label ?tlabel .
      FILTER regex(?tlabel, "^protein catabolic process")
      service <http://biomodels.bio2rdf.org/sparql> {
        ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
        ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
      }
    }
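To make the shape of the federated join concrete, here is a minimal in-memory sketch in Python. It does not contact any endpoint: the two triple lists stand in for the ontology endpoint and the biomodels endpoint, and every identifier and label in them is invented for illustration. The function mirrors the query pattern above: find terms whose direct parent label starts with the given prefix, then count the distinct reactions in the second store that point at them.

```python
from collections import defaultdict

# Toy stand-ins for two independent SPARQL endpoints (all data is invented).
ONTOLOGY_STORE = [
    ("go:A", "rdfs:label", "proteasomal protein catabolic process"),
    ("go:A", "rdfs:subClassOf", "go:P"),
    ("go:P", "rdfs:label", "protein catabolic process"),
    ("go:B", "rdfs:label", "glycolysis"),
    ("go:B", "rdfs:subClassOf", "go:M"),
    ("go:M", "rdfs:label", "metabolic process"),
]
MODEL_STORE = [
    ("model:r1", "identical-to", "go:A"),
    ("model:r1", "rdf:type", "BiochemicalReaction"),
    ("model:r2", "identical-to", "go:A"),
    ("model:r2", "rdf:type", "BiochemicalReaction"),
    ("model:r3", "identical-to", "go:B"),
    ("model:r3", "rdf:type", "BiochemicalReaction"),
]


def objects(store, subject, predicate):
    """Return all objects of triples matching (subject, predicate, ?o)."""
    return {o for s, p, o in store if s == subject and p == predicate}


def federated_count(ontology, models, parent_label_prefix):
    """Count distinct reactions per ontology term whose parent label matches."""
    # "Local" pattern: ?go rdfs:subClassOf ?tgo . ?tgo rdfs:label ?tlabel . FILTER(...)
    matching_terms = set()
    for s, p, o in ontology:
        if p != "rdfs:subClassOf":
            continue
        parent_labels = objects(ontology, o, "rdfs:label")
        if any(label.startswith(parent_label_prefix) for label in parent_labels):
            matching_terms.add(s)

    # "Remote" pattern: ?x identical-to ?go . ?x a BiochemicalReaction .
    reactions = {
        s for s, p, o in models
        if p == "rdf:type" and o == "BiochemicalReaction"
    }
    hits = defaultdict(set)
    for s, p, o in models:
        if p == "identical-to" and s in reactions and o in matching_terms:
            hits[o].add(s)

    # Join with ?go rdfs:label ?label and aggregate.
    result = {}
    for term, xs in hits.items():
        for label in objects(ontology, term, "rdfs:label"):
            result[(term, label)] = len(xs)
    return result


counts = federated_count(ONTOLOGY_STORE, MODEL_STORE, "protein catabolic process")
```

In a real deployment the join is performed by the query engine through the service clause; the sketch only shows which variable (`?go`) links the two sources.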
Despite all the data, it’s still hard to find answers to questions, because there are many ways to represent the same data and each dataset represents it differently
Multiple models for the same kind of data do emerge, each with its own merit
This lack of coordination makes Linked Open Data somewhat chaotic and unwieldy
Massive Proliferation of Ontologies / Vocabularies could be harnessed to bring order out of chaos
The Semanticscience Integrated Ontology (SIO) is used to ground Bio2RDF and SADI semantic web services
• 1300+ classes
• 201 object properties (incl. inverses)
• 1 datatype property
Ontology Design Patterns: objects, processes, and their attributes. @micheldumontier::NIDM:Jan 26,2015
Multi-Stakeholder Efforts to Standardize Representations are Reasonable, Long Term Strategies for Data Integration
Tactical formalization: turn linked data into whatever you need
• Standards: W3C • Community standards • Application standards • Visualization standards
• Application specific: Drug repurposing • Verifying annotations in biomodels • Discovering aberrant pathways
With Deborah McGuinness and Jim McCusker @ RPI
ReDrugS uses Nanopublications
• A nanopublication is a structured digital object that associates a statement, composed of one or more triples, with its evidence/provenance and digital object metadata.
• http://nanopub.org
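The three parts named in the definition above (statement, evidence/provenance, metadata) can be sketched as a small Python data structure. This is an illustrative model only, not the nanopub.org serialization: the class name, field names, and all `ex:` identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Nanopublication:
    """Hypothetical container mirroring the three parts of a nanopublication."""
    uri: str
    assertion: List[Triple]          # the statement: one or more triples
    provenance: List[Triple]         # evidence for / origin of the assertion
    publication_info: List[Triple]   # metadata about the digital object itself

    def is_well_formed(self) -> bool:
        # Assumption for this sketch: every part must carry at least one triple.
        return bool(self.assertion and self.provenance and self.publication_info)


example = Nanopublication(
    uri="http://example.org/np/1",
    assertion=[("ex:drugA", "ex:treats", "ex:diseaseB")],
    provenance=[("ex:assertion1", "prov:wasDerivedFrom", "ex:sourceDataset")],
    publication_info=[("ex:np1", "dcterms:created", "2015-02-23")],
)
```

Keeping provenance and publication metadata separate from the assertion is what lets a consumer judge a statement by where it came from.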
Nanopublications are dynamically generated from a SPARQL query
• Fit for purpose to a target schema
• Here, we make an ontological commitment
We can do more
Have you heard of OWL?
SBML-based biomodels place semantic annotations in an annotation element
    <species metaid="_525530" id="GLCi" compartment="cyto">
      <annotation>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
                 xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
          <rdf:Description rdf:about="#_525530">
            <bqbiol:is>
              <rdf:Bag>
                <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
                <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
              </rdf:Bag>
            </bqbiol:is>
          </rdf:Description>
        </rdf:RDF>
      </annotation>
    </species>
Each annotation reads as: subject, predicate (qualifier), object
• The intent is to express that the species represents a substance composed of glucose molecules
• We also know from the SBML model that this substance is located in the cytosol and has an (initial) concentration of 0.09765 M
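As a rough illustration of the subject / predicate (qualifier) / object reading, the following Python sketch pulls triples out of the annotation using only the standard library. The function name is made up for this example, the snippet omits the SBML document wrapper and the unused `bqmodel` namespace, and it handles only the `rdf:Description` / `rdf:Bag` shape shown on the slide.

```python
import xml.etree.ElementTree as ET

SBML_SNIPPET = """<species metaid="_525530" id="GLCi" compartment="cyto">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
      <rdf:Description rdf:about="#_525530">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
            <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def annotation_triples(xml_text):
    """Return (subject, qualifier, object) tuples found in the annotation."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for qualifier in description:
            # ElementTree tags look like "{namespace}local"; join them into one IRI.
            predicate = qualifier.tag.replace("{", "").replace("}", "")
            for item in qualifier.iter(f"{{{RDF_NS}}}li"):
                triples.append((subject, predicate, item.get(f"{{{RDF_NS}}}resource")))
    return triples


triples = annotation_triples(SBML_SNIPPET)
```

Each `rdf:li` yields one triple, so the species above produces two statements that share the same subject and qualifier.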
By converting models into formal representations of knowledge, we get to:
• capture the semantics of models and the biological systems they represent
• leverage knowledge explicit in linked terminologies
• validate the accuracy of the annotations / models
• discover biological implications inherent in the models
• query the results of simulations in the context of the biological knowledge
Model verification: after reasoning, we found 27 models to be inconsistent. Reasons:
• Our representation: functions sometimes found in the place of physical entities (e.g. entities that secrete insulin); better to constrain with appropriate relations
• SBML abused: e.g. species used as a measure of time
• Incorrect annotations: constraints in the ontologies themselves mean that the annotation is simply not possible
Reference: Integrating systems biology models and biomedical ontologies. BMC Syst Biol. 2011 Aug 11;5:124. doi: 10.1186/1752-0509-5-124.
Identification of drug and disease enriched pathways
Reference: Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012.
Approach:
• Integrated 3 datasets (DrugBank, PharmGKB and CTD) & 7 terminologies (MeSH, ATC, ChEBI, UMLS, SNOMED, ICD, DO)
• Formalized into an OWL-EL ontology: 4 class top level ontology, logical class axioms, 650,000+ classes, 3.2M subClassOf axioms
• Identified significant associations using enrichment analysis over the fully inferred knowledge base
Benefit 1: Enhanced Query Capability
• Use any mapped terminology to query a target resource.
• Use knowledge in target ontologies to formulate more precise questions. Example: ask for drugs that are associated with diseases of the joint. ‘Chikungunya’ (do:0050012) is defined as a viral infectious disease located in the ‘joint’ (fma:7490) and caused by a ‘Chikungunya virus’ (taxon:37124).
• Learn relationships that are inferred by automated reasoning. Example: alcohol (CHEBI:30879) is associated with alcoholism (PA443309) since alcoholism is directly associated with ethanol (CHEBI:16236).
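The alcohol/alcoholism example can be mimicked with a few lines of Python. This is only a sketch of the idea, not the OWL-EL reasoning used in the work: it assumes that ethanol is a subclass of alcohol and that an association with a class is also reported for each of its superclasses. The function names are illustrative.

```python
# Assumption for illustration: ethanol (CHEBI:16236) is a subclass of alcohol (CHEBI:30879).
SUBCLASS_OF = {"CHEBI:16236": {"CHEBI:30879"}}

# Direct association from the slide: ethanol is associated with alcoholism (PA443309).
DIRECT_ASSOCIATIONS = {("CHEBI:16236", "PA443309")}


def ancestors(term, subclass_of):
    """Collect all transitive superclasses of a term."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_associations(direct, subclass_of):
    """Propagate each direct association to every superclass of its chemical."""
    result = set(direct)
    for chemical, disease in direct:
        for ancestor in ancestors(chemical, subclass_of):
            result.add((ancestor, disease))
    return result


associations = inferred_associations(DIRECT_ASSOCIATIONS, SUBCLASS_OF)
```

A query for "alcohol" then returns alcoholism even though no dataset states that link directly.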
Benefit 2: Knowledge Discovery through Enrichment Analysis
• OntoFunc: tool to discover significant associations between sets of objects and ontology categories.
• We found 22,653 disease-pathway associations, where for each pathway we find genes that are linked to disease.
• Example: mood disorder (do:3324) is associated with the Zidovudine Pathway (pharmgkb:PA165859361). Zidovudine is used to treat HIV/AIDS. Side effects include fatigue, headache, myalgia, malaise and anorexia.
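The slides do not say which statistic OntoFunc uses. A common choice for enrichment analysis is the one-sided hypergeometric test, and the sketch below assumes that choice purely for illustration; the function name and parameters are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, successes, draws, observed):
    """P(X >= observed) under a hypergeometric distribution.

    population: total number of genes considered
    successes:  genes in the population linked to the disease
    draws:      genes in the pathway
    observed:   pathway genes that are linked to the disease
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes, 5 disease-linked, a 5-gene pathway, all 5 disease-linked.
p_value = hypergeom_pvalue(population=10, successes=5, draws=5, observed=5)
```

With thousands of pathway/disease pairs tested, such p-values would normally be corrected for multiple testing before calling an association significant.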
http://tiny.cc/hcls-datadesc-ed
61 metadata elements (editor & validator underway)
• Core: Identifiers • Title • Description • Attribution • Homepage • License • Language • Keywords • Concepts and vocabularies used • Standards • Publication
• Provenance and Change: Version number • Source • Provenance (retrieved from, derived from, created with) • Frequency of change
• Availability: Format • Download URL • Landing page • SPARQL endpoint
• 13 Content Statistics: with SPARQL queries
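A validator for such dataset descriptions could start as simply as checking that required elements are present. The Python sketch below is an assumption-laden illustration: the key names and the choice of which elements count as required are hypothetical, not taken from the specification linked above.

```python
# Hypothetical subset of the "Core" elements; key names are invented for this sketch.
CORE_ELEMENTS = ("identifier", "title", "description", "license", "homepage")


def missing_core_elements(description):
    """Return the core elements that are absent or empty, in declaration order."""
    return [name for name in CORE_ELEMENTS if not description.get(name)]


example_description = {
    "identifier": "http://example.org/dataset/1",
    "title": "Example dataset",
    "description": "An invented dataset description.",
    "homepage": "http://example.org/",
}

missing = missing_core_elements(example_description)
```

Here the example description lacks a license, which is the kind of gap a validator would flag before publication.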
NIH Big Data to Knowledge (BD2K) Center of Excellence
• Team: Mark Musen (PI), Michel Dumontier (Co-I), Purvesh Khatri (Co-I), Olivier Gevaert (Co-I)
• Goals: tools to facilitate template construction and semi-automated metadata annotation; enable dataset discovery & reuse
• Partnership with the ImmPort immunology repository, BioSharing, and the Stanford Library
Contact: michel.dumontier@stanford.edu • Website: http://dumontierlab.com • Presentations: http://slideshare.com/micheldumontier