Agenda: Introduction to the Semantic Web; Semantic Web Solutions at Lilly; W3C's Semantic Web for Health Care and Life Sciences Interest Group
1. The Integration of Biological Data Using Semantic Web Technologies
3. Introduction to the Semantic Web
4. Drivers for the Semantic Web Business models develop rapidly these days, so infrastructure that supports change is needed
Organizations are increasingly forming and disbanding collaborations
Data is growing so quickly that it is no longer possible for individuals to identify patterns in their heads
Increasing recognition of the benefits of collective intelligence
5. Characterizing the Semantic Web Semantic Web is an interoperability technology
An architecture for interconnected communities and vocabularies
A set of interoperable standards for knowledge exchange
6. Creating a Web of Data
7. Resource Description Framework (RDF)
8. RDFS and OWL RDFS
Is a simple vocabulary for describing properties and classes of RDF resources
Provides semantics for hierarchies of properties and classes
Designed to support inferencing
OWL
Explicitly represents meaning of terms in vocabularies and the relationships between those terms
Separate layers have been defined balancing expressibility vs. implementability (OWL Lite, OWL DL, OWL Full)
Supports inferencing
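As an illustration of what "semantics for hierarchies" and "inferencing" mean in practice, the following is a minimal sketch in plain Python, not a real RDFS reasoner. It applies only two RDFS-style rules (rdfs:subClassOf is transitive, and instances inherit the types of their superclasses) to triples held as tuples. The ex: class and instance names are assumptions made up for the example.

```python
# Toy RDFS-style inference over (subject, predicate, object) tuples.
# The ex: names are hypothetical; only two entailment rules are applied.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Repeat two rules until nothing new appears:
    subClassOf is transitive, and instances inherit superclass types."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Only pair a triple with a subClassOf triple that starts
                # at the first triple's object.
                if p2 != SUBCLASS or s2 != o:
                    continue
                if p == SUBCLASS:
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:
                    new.add((s, TYPE, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


facts = {
    ("ex:Kinase", SUBCLASS, "ex:Enzyme"),
    ("ex:Enzyme", SUBCLASS, "ex:Protein"),
    ("ex:AKT1", TYPE, "ex:Kinase"),
}
closure = rdfs_closure(facts)
print(("ex:AKT1", TYPE, "ex:Protein") in closure)
```

The statement that ex:AKT1 is an ex:Protein is never asserted; it follows from the class hierarchy. OWL reasoners handle far richer constructs than this sketch.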
9. SPARQL as a Unifying Source
10. Semantic Web Solutions at Lilly
11. Discovery Metadata: Goals Integrate master data throughout the discovery process to enable information sharing/integration for the scientific community
Model key relationships between master data classes
Provide the ability to integrate disparate data sets more quickly than the normal warehouse paradigm typically allows
Create a re-usable and sustainable semantic implementation
Allow for user-driven, manual curation of key data relationships
12. Discovery Metadata: Ontology
13. Discovery Metadata: Architecture
14. CATIE: Overview Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE)
Was the most comprehensive independent trial ever completed to examine existing anti-psychotic therapies for schizophrenia
Provides detailed information comparing the effectiveness and side effects of five medications currently used to treat schizophrenia
Greatly enhances the knowledge available to guide treatment choices for people with schizophrenia
15. CATIE: Goals Determine whether semantic integration and analysis of the CATIE data set in the context of metabolic and signal transduction pathways with receptor affinities can provide answers to specific scientific questions:
Which pathways are associated with response to the 5 different schizophrenia drugs?
How do these pathways compare between treatment arms?
Which receptors are associated with response to the 5 schizophrenia drugs?
How are the pathways, receptors and the drug response genes from the CATIE data set related?
16. CATIE: Drugs and Data Sets CATIE Drugs:
Olanzapine
Perphenazine
Quetiapine
Risperidone
Ziprasidone
Datasets:
Entrez Gene
Pubchem
Assay (Receptor Affinity Data)
KEGG
Reactome
Biocyc
Transpath
17. CATIE: Architecture
18. CATIE: Conclusions Efficient semantic integration can be accomplished by using RDF
Powerful complex data modeling can be achieved by using graph principles inherent in RDF
Easy translation of scientific questions to graph queries using SPARQL and SEM_MATCH
Customized outputs can easily be generated by making slight changes in the SPARQL query pattern
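To make the last two conclusions concrete, here is a minimal sketch of graph pattern matching in plain Python. It is not SPARQL or Oracle's SEM_MATCH; it only mimics how a list of triple patterns with ?variables is matched against a graph. The ex: drug, receptor and pathway names and the predicates are placeholders, not CATIE data.

```python
def match(pattern, triples, binding=None):
    """Yield variable bindings that satisfy every triple pattern.
    Terms starting with '?' are variables, as in SPARQL."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


# Hypothetical graph: placeholder drugs, receptors and pathways.
triples = [
    ("ex:drugA", "ex:bindsReceptor", "ex:receptor1"),
    ("ex:drugB", "ex:bindsReceptor", "ex:receptor2"),
    ("ex:receptor1", "ex:participatesIn", "ex:pathwayX"),
    ("ex:receptor2", "ex:participatesIn", "ex:pathwayY"),
]

# "Which pathways are reachable from each drug via its receptors?"
query = [
    ("?drug", "ex:bindsReceptor", "?receptor"),
    ("?receptor", "ex:participatesIn", "?pathway"),
]
results = list(match(query, triples))

# A slight change to the pattern restricts the output to one drug.
query_a = [("ex:drugA", "ex:bindsReceptor", "?receptor"), query[1]]
pathways_a = [r["?pathway"] for r in match(query_a, triples)]
```

Replacing one variable with a constant is the kind of small pattern change the slide refers to: the graph stays the same and only the query differs.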
19. Competitive Intelligence: Overview Competitive Intelligence is the purposeful, ethical and coordinated monitoring of the competitors in any industry within a specific marketplace to:
Strategically gain foreknowledge of competitors' plans and recent developments
Make calculated informed business decisions and formulate operational strategy
Provide a mechanism for actively surveying the public information for competitive intelligence in Endocrine
20. Competitive Intelligence: Goals Does such a CI effort significantly benefit from a semantic component?
Does the project significantly benefit from semantic integration?
Are there pre-existing ontologies for company and method of action domains?
Does NLP or text mining work for this kind of data?
Does “buried” knowledge exist within datasets that can be discovered using inference and reasoning?
21. Competitive Intelligence: Integration Challenges
22. Competitive Intelligence: NLP
24. Competitive Intelligence: Inferencing
25. Competitive Intelligence: Conclusions Semantic Integration (instance mapping using NLP) coupled with RDF data model was successful in answering questions in Competitive Intelligence
Ontologies provide a powerful framework of dictionaries and taxonomic relations that support reasoning and inference over the data for knowledge discovery
Manual curation is a tedious, error-prone and labor-intensive task
A semi-automated computer-based solution that utilizes ontologies, semantic integration and NLP could drastically reduce the manual curation effort while maintaining high-quality information
26. Metadata Repository: Goals Aggregate experiment metadata from diverse relational databases into an Oracle 11g database for scientific investigation
Provide a unified vocabulary for scientific investigation
Avoid a complex architecture and extended development effort
Realize benefits in the near-term
Preprocess metadata to improve efficiency
Characterize the types of questions that the ontology should answer
Identify stable semantic technologies; do not employ parsers
Allow semantic and relational databases to work together
Provide browser, visualization, and query access into repository
27. Metadata Repository: Ontology
28. Metadata Repository: Architecture
29. Metadata Repository: Implementation Protégé Ontology Editor
Oracle Semantic Technologies 11g
D2R Map (Database to RDF Mapping)
C# development in Visual Studio 2005
Current data sources include:
Expression Data : Affymetrix, Illumina, Agilent
aCGH Data
RNAi Screening Data
Reagent Data
Gene Ontology (GO)
Medical Subject Headings (MeSH)
Currently ~30 million triples
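The relational-to-RDF step can be pictured with a minimal sketch. This is not D2R Map's mapping language or the Lilly implementation; it is a plain Python illustration, using an in-memory SQLite table, of the usual idea that each row becomes a subject URI and each non-key column becomes a triple. The table, the rows and the http://example.org/ URIs are assumptions.

```python
import sqlite3


def rows_to_triples(conn, table, key_column, base_uri):
    """Map each row to one subject URI and one triple per non-key column.
    The table and column names are trusted here (SQL identifiers cannot
    be passed as bound parameters)."""
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [d[0] for d in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{base_uri}{table}/{record[key_column]}"
        for column, value in record.items():
            # Skip the key itself and NULLs: no value means no triple.
            if column == key_column or value is None:
                continue
            triples.append((subject, f"{base_uri}vocab/{column}", value))
    return triples


# Hypothetical sample table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (id INTEGER PRIMARY KEY, cell_line TEXT, platform TEXT)"
)
conn.execute("INSERT INTO sample VALUES (1, 'MCF7', 'Affymetrix')")
conn.execute("INSERT INTO sample VALUES (2, 'MCF7', NULL)")

triples = rows_to_triples(conn, "sample", "id", "http://example.org/")
for triple in triples:
    print(triple)
```

A real mapping would also turn foreign keys into links between subjects and align column names with ontology properties; those steps are left out here.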
30. Metadata Repository: Conclusion It’s now possible for users to ask questions such as:
Get all the interactions for methylases that are involved in Colon cancer. For all these genes, get the expression and aCGH values for all LSCDD colon cancer samples
Find cell lines in which RNAi data has been generated using Dharmacon reagents
Retrieve the antibodies that have been used to assess the AKT1 pathway activity in MCF7
Find all the experiments that were done using my sample
Find all samples which are grade III colorectal cancer. For these samples, retrieve the expression, mutation and aCGH data
31. External Collaborations RDF Access to Relational Databases - Chris Bizer, Eric Prud'hommeaux
Scalability testing of relational to RDF mapping approaches
End User Semantic Web Authoring - David Karger
Enhancing the scalability and robustness of the Exhibit and Potluck tools
Scientist-Driven Semantic Integration of Knowledge in Alzheimer's Disease - Tim Clark, June Kinoshita
Project to develop an integrated knowledge infrastructure for the neuromedical research community, pairing rich digital semantic context with the ever-growing digital scientific content on the web
Provenance Collection and Management - Carole Goble, Beth Plale
Project to develop a metadata taxonomy for global data at Lilly which enables the rapid integration of data and mining/analysis algorithms into dataflows which support clinical and discovery decisions
W3C’s Health Care and Life Sciences Interest Group
32. Conclusion Semantic Web provides a flexible framework for data integration
Data integration needs (and issues) abound at Lilly
Lilly is seeing tangible benefits in multiple projects from the Semantic Web
Focus on incremental adoption of the technology
Tools are improving, but more work is needed
Lilly use of Semantic Web technology isn’t atypical in health care and life sciences organizations
33. W3C Semantic Web for Health Care and Life Sciences Interest Group
34. What is the Mission of HCLS IG? The mission of HCLS is to develop, advocate for, and support the use of Semantic Web technologies for biological science, translational medicine and health care. These domains stand to gain tremendous benefit by adoption of Semantic Web technologies, as they depend on the interoperability of information from many domains and processes for efficient decision support.
35. Task Forces Terminology – Semantic Web representation of existing resources
Task lead - John Madden
BioRDF – integrated neuroscience knowledge base
Task lead - Kei Cheung
Linking Open Drug Data – aggregation of Web-based drug data
Task lead - Chris Bizer
Scientific Discourse – building communities through networking
Task leads - Tim Clark, John Breslin
Clinical Observations Interoperability – patient recruitment in trials
Task lead - Vipul Kashyap
Other Projects: Clinical Decision Support, URI Workshop, Collaborations with CDISC & HL7
36. Terminology Task Force Task Lead: John Madden
Participants: Chimezie Ogbuji, Helen Chen, Holger Stenzhorn, Mary Kennedy, Xiashu Wang, Rob Frost, Jonathan Borden, Guoqian Jiang
37. Terminology: Overview Goal is to identify use cases and methods for extracting Semantic Web representations from existing, standard medical record terminologies, e.g. UMLS
Methods should be reproducible and, to the extent possible, not lossy
Identify and document issues along the way related to identification schemes, expressiveness of the relevant languages
Initial effort will start with SNOMED-CT and UMLS Semantic Networks and focus on a particular sub-domain (e.g. pharmacological classification)
38. BioRDF Task Force Task Lead: Kei Cheung
Participants: Scott Marshall, Eric Prud’hommeaux, Susie Stephens, Andrew Su, Steven Larson, Huajun Chen, TN Bhat, Matthias Samwald, Erick Antezana, Rob Frost, Ward Blonde, Holger Stenzhorn, Don Doherty
39. BioRDF: Answering Questions Goals: Get answers to questions posed to a body of collective knowledge in an effective way
Knowledge used: Publicly available databases, and text mining
Strategy: Integrate knowledge using careful modeling, exploiting Semantic Web standards and technologies
40. BioRDF: Looking for Targets for Alzheimer’s Signal transduction pathways are considered to be rich in “druggable” targets
CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease
Casting a wide net, can we find candidate genes known to be involved in signal transduction and active in Pyramidal Neurons?
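The question above amounts to a join between two kinds of annotation. As a rough sketch only (plain Python rather than the SPARQL the task force used, with placeholder gene names and invented annotations), the join can be expressed as a set intersection:

```python
# Hypothetical annotations: gene symbol -> set of labels.
# GENE_A, GENE_B and GENE_C are placeholders, not real results.
process_annotations = {
    "GENE_A": {"signal transduction"},
    "GENE_B": {"lipid metabolism"},
    "GENE_C": {"signal transduction"},
}
expression_annotations = {
    "GENE_A": {"CA1 pyramidal neuron"},
    "GENE_B": {"CA1 pyramidal neuron"},
    "GENE_C": {"astrocyte"},
}


def candidate_genes(processes, expression, process, cell_type):
    """Return genes annotated with both the process and the cell type."""
    in_process = {g for g, labels in processes.items() if process in labels}
    in_cell = {g for g, labels in expression.items() if cell_type in labels}
    return sorted(in_process & in_cell)


candidates = candidate_genes(
    process_annotations,
    expression_annotations,
    "signal transduction",
    "CA1 pyramidal neuron",
)
print(candidates)
```

In the integrated knowledge base the two annotation sets come from different public databases, which is why a shared data model is needed before such a join is possible.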
42. BioRDF: SPARQL Query
43. BioRDF: Results: Genes, Processes DRD1, 1812 adenylate cyclase activation
ADRB2, 154 adenylate cyclase activation
ADRB2, 154 arrestin mediated desensitization of G-protein coupled receptor protein signaling pathway
DRD1IP, 50632 dopamine receptor signaling pathway
DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917 G-protein coupled receptor protein signaling pathway
GNG3, 2785 G-protein coupled receptor protein signaling pathway
GNG12, 55970 G-protein coupled receptor protein signaling pathway
DRD2, 1813 G-protein coupled receptor protein signaling pathway
ADRB2, 154 G-protein coupled receptor protein signaling pathway
CALM3, 808 G-protein coupled receptor protein signaling pathway
HTR2A, 3356 G-protein coupled receptor protein signaling pathway
DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second messenger
SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second messenger
MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide second messenger
CNR2, 1269 G-protein signaling, coupled to cyclic nucleotide second messenger
HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second messenger
GRIK2, 2898 glutamate signaling pathway
GRIN1, 2902 glutamate signaling pathway
GRIN2A, 2903 glutamate signaling pathway
GRIN2B, 2904 glutamate signaling pathway
ADAM10, 102 integrin-mediated signaling pathway
GRM7, 2917 negative regulation of adenylate cyclase activity
LRP1, 4035 negative regulation of Wnt receptor signaling pathway
ADAM10, 102 Notch receptor processing
ASCL1, 429 Notch signaling pathway
HTR2A, 3356 serotonin receptor signaling pathway
ADRB2, 154 transmembrane receptor protein tyrosine kinase activation (dimerization)
PTPRG, 5793 transmembrane receptor protein tyrosine kinase signaling pathway
EPHA4, 2043 transmembrane receptor protein tyrosine kinase signaling pathway
NRTN, 4902 transmembrane receptor protein tyrosine kinase signaling pathway
CTNND1, 1500 Wnt receptor signaling pathway
44. LODD Task Force Task Lead: Chris Bizer
Participants: Anja Jentzsch, Kristin Tolle, Eric Prud’hommeaux, Don Doherty, Susie Stephens, Bosse Andersson, Scott Marshall, Glen Newton, Michel Dumontier, TN Bhat, Oktie Hassanzadeh
45. LODD: Introduction
46. LODD: Potential Links between Data Sets
47. LODD: Data Set Evaluation
48. LODD: Potential questions to answer Physicians and Pharmacists
What are alternative drugs for a given indication (disease)?
What are equivalent drugs (generic version of a brand name, or the chemical name of an active ingredient)?
Are there ongoing clinical trials for a drug?
Patients
What background information is available about a drug?
What are the contraindications of a drug?
Which alternative drugs are available?
What are the results of clinical trials for a drug?
Pharmaceutical Companies
What are other companies with drugs in similar areas?
Which companies have a similar therapeutic focus?
49. LODD: Linked Version of ClinicalTrials.gov Total number of triples: 6,998,851
Number of Trials: 61,920
RDF links to other data sources: 177,975
Links to:
DBpedia and YAGO (from intervention and conditions)
GeoNames (from locations)
Bio2RDF.org's PubMed (from references)
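For a sense of scale, the figures above can be turned into per-trial averages. The short calculation below uses only the three numbers quoted on this slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
total_triples = 6_998_851
trials = 61_920
rdf_links = 177_975

triples_per_trial = total_triples / trials
links_per_trial = rdf_links / trials

print(f"{triples_per_trial:.1f} triples per trial")
print(f"{links_per_trial:.2f} outgoing RDF links per trial")
```

That is roughly 113 triples and just under 3 links to other data sources per trial, on average.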
50. LODD: Mashing Clinical Trials and Geo
51. Scientific Discourse Task Force Task Leads: Tim Clark, John Breslin
Participants: Uldis Bojars, Paolo Ciccarese, Sudeshna Das, Ronan Fox, Tudor Groza, Christoph Lange, Matthias Samwald, Elizabeth Wu, Holger Stenzhorn, Marco Ocana, Kei Cheung, Alexandre Passant
52. Scientific Discourse: Overview
53. Scientific Discourse: Goals Provide a Semantic Web platform for scientific discourse in biomedicine
Linked to key concepts, entities and knowledge
Specified by ontologies
Integrated with existing software tools
Useful to Web communities of working scientists
54. Scientific Discourse: Some Parameters Discourse categories: research questions, scientific assertions or claims, hypotheses, comments and discussion, and evidence
Biomedical categories: genes, proteins, antibodies, animal models, laboratory protocols, biological processes, reagents, disease classifications, user-generated tags, and bibliographic references
Driving biological project: cross-application of discoveries, methods and reagents in stem cell, Alzheimer and Parkinson disease research
Informatics use cases: interoperability of web-based research communities with (a) each other (b) key biomedical ontologies (c) algorithms for bibliographic annotation and text mining (d) key resources
55. Scientific Discourse: SWAN+SIOC SIOC
Represent activities and contributions of online communities
Integration with blogging, wiki and CMS software
Use of existing ontologies, e.g. FOAF, SKOS, DC
SWAN
Represents scientific discourse (hypotheses, claims, evidence, concepts, entities, citations)
Used to create the SWAN Alzheimer knowledge base
Active beta participation of 144 Alzheimer researchers
Ongoing integration into SCF Drupal toolkit
56. COI Task Force Task Lead: Vipul Kashyap
Participants: Eric Prud’hommeaux, Helen Chen, Jyotishman Pathak, Rachel Richesson, Holger Stenzhorn
57. COI: Bridging Bench to Bedside How can existing Electronic Health Records (EHR) formats be reused for patient recruitment?
Quasi standard formats for clinical data:
HL7/RIM/DCM – healthcare delivery systems
CDISC/SDTM – clinical trial systems
How can we map across these formats?
Can we ask questions in one format when the data is represented in another format?
58. COI: Use Case Pharmaceutical companies pay a lot to test drugs
Pharmaceutical companies express protocol in CDISC
-- precipitous gap --
Hospitals exchange information in HL7/RIM
Hospitals have relational databases
59. Inclusion Criteria Type 2 diabetes on diet and exercise therapy or monotherapy with metformin, insulin secretagogue, or alpha-glucosidase inhibitors, or a low-dose combination of these at 50% maximal dose. Dosing is stable for 8 weeks prior to randomization.
…
?patient takes metformin .
60. Exclusion Criteria Use of warfarin (Coumadin), clopidogrel (Plavix) or other anticoagulants.
…
?patient doesNotTake anticoagulant .
61. Criteria in SPARQL
?medication1 sdtm:subject ?patient ;
             spl:activeIngredient ?ingredient1 .
?ingredient1 spl:classCode 6809 . # metformin
OPTIONAL {
  ?medication2 sdtm:subject ?patient ;
               spl:activeIngredient ?ingredient2 .
  ?ingredient2 spl:classCode 11289 . # anticoagulant
}
FILTER (!BOUND(?medication2))
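The OPTIONAL block followed by FILTER (!BOUND(...)) is the usual SPARQL idiom for negation: try to find an anticoagulant for the patient, then keep only the patients for whom none was found. The sketch below mirrors that logic in plain Python. The class codes 6809 and 11289 are taken from the query; the patient records and function name are made up for the example.

```python
METFORMIN = 6809        # class code used in the slide's query
ANTICOAGULANT = 11289   # class code used in the slide's query

# Hypothetical records: (patient, class code of a medication's active ingredient).
medications = [
    ("patient1", METFORMIN),
    ("patient2", METFORMIN),
    ("patient2", ANTICOAGULANT),
    ("patient3", ANTICOAGULANT),
]


def eligible_patients(records, required, excluded):
    """Patients with at least one required medication and no excluded one.
    The set difference plays the role of OPTIONAL + FILTER (!BOUND(...))."""
    has_required = {p for p, code in records if code == required}
    has_excluded = {p for p, code in records if code == excluded}
    return sorted(has_required - has_excluded)


eligible = eligible_patients(medications, METFORMIN, ANTICOAGULANT)
print(eligible)
```

Here patient2 meets the inclusion criterion but is removed by the exclusion criterion, and patient3 never meets the inclusion criterion.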
62. Getting Involved Benefits to getting involved include:
early access to use cases and best practices
influence on standards recommendations
cost effective exploration of new technology through collaboration
Get involved by contacting the chairs:
team-hcls-chairs@w3.org