Tutorial: Semantic Web Applications in Clinical Data Management • Eric Neumann • Clinical Semantic Group • W3C HCLS chair, MIT Fellow
Tutorial Overview • Bench-to-Bedside Vision • Information Challenges • Semantic Web: What is it? • RDF: Recombinant Data (Aggregation) • OWL: Vocabularies (NCI, SNOMED) • Rules • Translational Medicine Needs • Clinical Data Standards: CDISC • Re-Using Clinical Knowledge • Retrospective DBs: JANUS • Open Knowledge Benefits: Tox Commons
Bench-to-Bedside • Connecting pre-clinical and clinical studies • Translational Medicine • Patient Stratification & Personalized medicine (not the same) • Knowledge and Data Integration • Better Disease Understanding • Next Generation Therapies, New Applications • More Predictive (earlier) Safety Signals
New Regulatory Issues Confronting Pharmaceuticals (figure labels: Tox/Efficacy, ADME Optim). From "Innovation or Stagnation", FDA Report, March 2004
Translational Medicine • Enable physicians to more effectively translate relevant findings and hypotheses into therapies for human health • Support the blending of huge volumes of clinical research and phenotypic data with genomic research data • Apply that knowledge to patients and finally make individualized, preventative medicine a reality for diseases that have a genetic basis
Drug Discovery & Development Knowledge Qualified Targets Molecular Mechanisms Lead Generation Toxicity & Safety Lead Optimization Pharmacogenomics Biomarkers Clinical Trials Launch
Biomedical Research Clinical Practice Ecosystem: Goal State Merging Biomed Research, Clinical Trials and Clinical Practice
HCChoices HCLS Ecosystem Insurers Grants HMO,PPO Biomed Research Publications and Public Databases BKB Large Studies Gov/Funding Risks & Benefits Disease Areas Drug R&D EHR Mol Path Res Clin Res Chem Manuf Drug Programs Clin POC Surveillance BiomarkerTox HCP Public Preclin Marketing VA System R&D Gov/Regulatory CROs Clin Safety JANUS SafetyCommons
Information Challenges • No common way to bring data and docs together • HTML links carry no meaning with them • Today’s integration approaches prevent data re-use • No global way to annotate our experiments and experiences • Most annotations cannot be found by context • No “sci-blog” for data interpretation • Enterprise information access and discoverability are weak • Making timely discoveries! • Why we all like Google • Cutting and pasting between docs promotes fact mutation and loss of provenance • Address business operations and tracking, and reduce static data copying
A web of information (courtesy of R. Stevens)
Distributed Nature of Biomedical Knowledge • Silos of Data… (figure labels: Patents, Tox, HCS, Biomarkers, Targets, Libraries, Assays, Drug Registry, Diseases, Genotypes, Clinical Trials)
The Big Picture In Drug R&D Hard to understand from just a few isolated Points of View
Clinical Papers Disease Subjects Genotype EnrollmentCriteria Dosing Observations Audit Trail Tox Signals Statistics Trials Ontology Whose Schema?
Why Searching à la Google is not enough • Google’s ability to rank and graph without using semantics is comparable to a Drug R&D project that looks for associations, but makes no attempt to find or represent mechanisms of action
The Current Web • What the computer sees: “Dumb” links • No semantics - <a href> treated just like <bold> • Minimal machine-processable information
The Semantic Web • Machine-processable semantic information • Semantic context published – making the data more informative to both humans and machines
Needed to realize the SW vision • A standard way of identifying things • A standard way of describing things • A standard way of linking things • Standard vocabularies for talking about things
The Semantic Web: Basic Standards for Describing Things • Richer structure for basic resources (XML) • Describe data by semantics and not syntax: RDF • Define semantics using RDFS or OWL • Reference and relate all resources using URIs • SPARQL: a query language for RDF graphs, playing the role SQL plays for relational data • Rules for higher level reasoning
The Technologies: RDF • Resource Description Framework (RDF) • W3C standard for making statements or hypotheses about data and concepts • Descriptive statements are expressed as triples: (Subject, Property, Object) • E.g., <Compound HB-2182> <binds_to> <Target P38_alpha>
Facts as triples • PARK1 (subject) has_associated_disease (predicate) Parkinson disease (object)
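The triple model above can be sketched in a few lines of Python. This is only an illustration: the `Triple` type and the `objects_of` helper are names invented here, and short labels stand in for the full URIs that real RDF would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate (property), object."""
    subject: str
    predicate: str
    obj: str


# The two example statements from the slides. Real RDF would identify
# each term with a URI; plain labels are an assumption made for readability.
facts = {
    Triple("Compound HB-2182", "binds_to", "Target P38_alpha"),
    Triple("PARK1", "has_associated_disease", "Parkinson disease"),
}


def objects_of(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {t.obj for t in triples
            if t.subject == subject and t.predicate == predicate}


print(objects_of(facts, "PARK1", "has_associated_disease"))
```

Because every fact has the same three-part shape, new kinds of statements can be added without changing any schema.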
From triples to a graph • Each triple is one edge labelled has_associated_disease: • MAPT → Parkinson disease • MAPT → Pick disease • PARK1 → Parkinson disease • TBP → Parkinson disease • TBP → Spinocerebellar ataxia
Connecting graphs • Integrate graphs from multiple resources • Query across resources • Example: Alzheimer disease isa Neurodegenerative diseases; Parkinson disease isa Neurodegenerative diseases; APP has_associated_disease Alzheimer disease; PARK1 has_associated_disease Parkinson disease
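As a rough sketch of "integrate, then query across resources", two triple sets can be merged with a plain set union. The split into a gene resource and a disease classification is an assumption made for illustration, as are the variable and function names.

```python
HAS_DISEASE = "has_associated_disease"
ISA = "isa"

# Triples from a hypothetical gene-disease resource (labels from the slides).
gene_graph = {
    ("MAPT", HAS_DISEASE, "Parkinson disease"),
    ("MAPT", HAS_DISEASE, "Pick disease"),
    ("PARK1", HAS_DISEASE, "Parkinson disease"),
    ("TBP", HAS_DISEASE, "Parkinson disease"),
    ("TBP", HAS_DISEASE, "Spinocerebellar ataxia"),
    ("APP", HAS_DISEASE, "Alzheimer disease"),
}

# Triples from a hypothetical disease classification.
disease_graph = {
    ("Alzheimer disease", ISA, "Neurodegenerative diseases"),
    ("Parkinson disease", ISA, "Neurodegenerative diseases"),
}

# Merging is a set union: shared node labels join the two graphs.
merged = gene_graph | disease_graph


def genes_for_class(graph, disease_class):
    """Genes linked to any disease that 'isa' member of disease_class."""
    diseases = {s for s, p, o in graph if p == ISA and o == disease_class}
    return {s for s, p, o in graph if p == HAS_DISEASE and o in diseases}


print(sorted(genes_for_class(merged, "Neurodegenerative diseases")))
```

The query only returns results on the merged graph; neither source alone holds both the gene links and the classification.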
The URI - global identification URI serves as a universal and uniform identifier for all web based resources.
A Family of Identifiers • URI = Uniform Resource Identifier • URL = Uniform Resource Locator • URN = Uniform Resource Name • LSID = Life Science Identifier • http://www.w3.org/Addressing/
Uniform Resource Locator • A type of resource identifier • Identifies the location of a resource (or part thereof) • Specifies a protocol to access the resource • http, ftp, mailto • E.g., • http://www.nlm.nih.gov/
Uniform Resource Name • A type of resource identifier • Identifies the name of a resource • Location independent • Defines a namespace • E.g., • urn:isbn:0-262-02591-4 • urn:umls:C0001403
Life Science Identifier • A type of resource identifier • A type of URN • For biological entities • Specific properties • Versioned • Resolvable • Immutable • E.g., • urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434 (DNS name: ncbi.nlm.nih.gov; namespace: pubmed; unique ID: 12571434) • http://lsid.sourceforge.net/
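The parts of an LSID can be pulled apart with a simple split on colons. The sketch below assumes the layout urn:lsid:authority:namespace:object with an optional trailing revision; the `LSID` type and `parse_lsid` function are illustrative names, not part of any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str           # DNS name of the issuing authority
    namespace: str           # namespace chosen by that authority
    object_id: str           # unique ID within the namespace
    revision: Optional[str]  # optional version component


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:5])
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434"))
```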
RDF Examples
…as RDF-XML
<cdisc:Subject rdf:about="http://clinic.com/study/T2271/subject/4183542663506">
  <nci:sex_code rdf:resource="nci#Female" />
  <cdisc:treatment rdf:resource="http://clinic.com/study/T2271/subject/4183542663506/observation/O2241" />
  <cdisc:vitalSigns rdf:resource="http://clinic.com/study/T2271/subject/4183542663506/observation/O6561" />
  <cdisc:adverseEvent rdf:resource="http://clinic.com/study/T2271/subject/4183542663506/observation/O6622" />
</cdisc:Subject>
…as N3
<http://clinic.com/study/T2271/subject/4183542663506> a cdisc:Subject ;
  nci:sex_code nci:Female ;
  cdisc:treatment <http://clinic.com/study/T2271/subject/4183542663506/observation/O2241> ;
  cdisc:vitalSigns <http://clinic.com/study/T2271/subject/4183542663506/observation/O6561> ;
  cdisc:adverseEvent <http://clinic.com/study/T2271/subject/4183542663506/observation/O6622> .
Semantic Data Integration: Incremental Roadmap • Data assets remain as they are! They do not need to be modified • The wrapper abstracts out details related to location, access and data structure • Integration happens at the information level • Highly configurable and incremental process • Ability to specify declarative rules and mappings for further hypothesis generation
RDBMS => RDF • Virtualized RDF (figure: rows identified by {primary keys} are exposed as <URI> nodes, linked by properties such as <hasDisease>, <interactsWith>, <canCause>)
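One way to read the RDBMS-to-RDF mapping is that each primary key becomes a URI and each foreign-key relationship becomes a property. The sketch below uses an in-memory SQLite database; the table names, the `http://example.org/` base URI, and the helper functions are all hypothetical and chosen only for illustration.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base URI, not from the slides

# A tiny illustrative relational source (schema and rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE diagnosis (patient_id INTEGER, disease_id INTEGER);
    INSERT INTO patient VALUES (1, 'Mr. X');
    INSERT INTO diagnosis VALUES (1, 6);
""")


def row_uri(table, key):
    """Mint a URI for a row from its table name and primary key."""
    return f"<{BASE}{table}/{key}>"


def export_triples(connection):
    """Expose the relational rows as (subject, property, object) triples."""
    triples = []
    for pid, name in connection.execute(
            "SELECT id, name FROM patient ORDER BY id"):
        triples.append((row_uri("patient", pid), "<name>", f'"{name}"'))
    for pid, did in connection.execute(
            "SELECT patient_id, disease_id FROM diagnosis "
            "ORDER BY patient_id, disease_id"):
        triples.append((row_uri("patient", pid), "<hasDisease>",
                        row_uri("disease", did)))
    return triples


for s, p, o in export_triples(conn):
    print(s, p, o, ".")
```

The source tables are read, not modified, which matches the idea that data assets remain as they are behind a wrapper.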
Semantic Data Integration: Bridging Clinical and Genomic Information • Rule/Semantics-based Integration: • Match nodes with same IDs • Create new links: IF a patient’s structured test result indicates a disease THEN add a “suffers_from” link to that disease (Figure: two source graphs, EMR Data and LIMS Data. Nodes: Patient (id = URI1), Person (id = URI2), FamilyHistory (id = URI3), MolecularDiagnosticTestResult (id = URI4), MYH7 missense Ser532Pro (id = URI5), Dilated Cardiomyopathy (id = URI6). Edge labels: name, related_to, has_family_history, associated_relative, problem, type, degree, has_structured_test_result, identifies_mutation, indicates_disease, evidence1, evidence2. Literal values: “Mr. X”, “Sudden Death”, “Paternal”, 1, 90%, 95%.)
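The two integration steps on this slide (match nodes with the same ID, then apply the IF/THEN rule) can be sketched over plain triple sets. The edges used below are one assumed reading of the figure, and `apply_suffers_from_rule` is a name invented for this illustration.

```python
# EMR and LIMS fragments, using the node IDs shown in the figure.
# Which edge connects which nodes is an assumed reading of the diagram.
emr = {
    ("URI1", "name", "Mr. X"),
    ("URI1", "has_family_history", "URI3"),
    ("URI3", "problem", "Sudden Death"),
}
lims = {
    ("URI1", "has_structured_test_result", "URI4"),
    ("URI4", "identifies_mutation", "URI5"),
    ("URI4", "indicates_disease", "URI6"),
}


def apply_suffers_from_rule(graph):
    """IF a patient's structured test result indicates a disease
    THEN add a suffers_from link from the patient to that disease."""
    inferred = set()
    for patient, p1, result in graph:
        if p1 != "has_structured_test_result":
            continue
        for subject, p2, disease in graph:
            if subject == result and p2 == "indicates_disease":
                inferred.add((patient, "suffers_from", disease))
    return graph | inferred


# Step 1: nodes with the same ID (URI1) merge automatically in a set union.
# Step 2: the rule adds the new link.
merged = apply_suffers_from_rule(emr | lims)
print(("URI1", "suffers_from", "URI6") in merged)
```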
Semantic Data Integration: Bridging Clinical and Genomic Information • RDF graphs provide a semantics-rich substrate for decision support • Can be exploited by SWRL rules (Figure: the merged graph. Nodes: Patient (id = URI1), Person (id = URI2), FamilyHistory (id = URI3), StructuredTestResult (id = URI4), MYH7 missense Ser532Pro (id = URI5), Dilated Cardiomyopathy (id = URI6). Edge labels: name, related_to, has_family_history, associated_relative, problem, type, degree, has_structured_test_result, identifies_mutation, indicates_disease, has_gene, suffers_from, evidence. Literal values: “Mr. X”, “Sudden Death”, “Paternal”, 1, 90%.)
Semantic Data Integration and Visualization: Drug Discovery • Drug Discovery Dashboard • http://www.w3.org/2005/04/swls/BioDash (Figure: dashboard panels. Topic: GSK3beta; Disease: DiabetesT2; Alt Dis: Alzheimers; Target: GSK3beta; Cmpd: SB44121; CE: DBP; Team: GSK3 Team; Person: John; Path: WNT; Related Set)
Semantic Data Integration: Bridging Chemistry and Molecular Biology • Semantic Lenses: Different views of the same data • BioPax Components • Target Model • urn:lsid:uniprot.org:uniprot:P49841 • Apply Correspondence Rule: if ?target.xref.lsid == ?bpx:prot.xref.lsid then ?target.correspondsTo.?bpx:prot
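The correspondence rule amounts to a join on the cross-referenced LSID. In the sketch below the record layout, the record IDs, and the second LSID are hypothetical; only urn:lsid:uniprot.org:uniprot:P49841 comes from the slide.

```python
from collections import defaultdict

# Target-model records and BioPAX protein records, each carrying an xref LSID.
# IDs such as "target:T1" and "bpx:prot2" are made up for illustration.
targets = [
    {"id": "target:T1", "xref_lsid": "urn:lsid:uniprot.org:uniprot:P49841"},
]
biopax_proteins = [
    {"id": "bpx:prot1", "xref_lsid": "urn:lsid:uniprot.org:uniprot:P49841"},
    {"id": "bpx:prot2", "xref_lsid": "urn:lsid:example.org:demo:0001"},
]


def correspondences(target_records, protein_records):
    """if target.xref.lsid == prot.xref.lsid then target correspondsTo prot."""
    by_lsid = defaultdict(list)
    for prot in protein_records:
        by_lsid[prot["xref_lsid"]].append(prot["id"])
    return {
        (t["id"], "correspondsTo", prot_id)
        for t in target_records
        for prot_id in by_lsid.get(t["xref_lsid"], [])
    }


print(correspondences(targets, biopax_proteins))
```

The rule links the two records without claiming they are the same object; they only share an identical reference.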
Semantic Data Integration: Bridging Chemistry and Molecular Biology • Lenses can aggregate, accentuate, or even analyze new result sets • Behind the lens, the data can be persistently stored as RDF-OWL • Correspondence does not need to mean “same descriptive object”, but may mean objects with identical references
Semantic Data Integration: Pathway Polymorphisms • Non-synonymous polymorphisms from dbSNP • Merge directly onto pathway graph • Identify targets with lowest chance of genetic variance • Predict parts of pathways with highest functional variability • Map genetic influence to potential pathway elements • Select mechanisms of action that are minimally impacted by polymorphisms
Scenario: Biomarker Qualification • Semantics which Define… • Biomarker Roles • Disease • Toxicity • Efficacy • Molecular and cytological markers • Tissue-specific • High content screening derived information • Different sets associated with different predictive tools • Statistical discrimination based on selected samples • Predictive power • Alternative cluster prediction algorithms • Support qualifications from multiple studies (comparisons) • Causal mechanisms • Pathways • Population variation
Semantic Data Integration: Advantages • RDF: Graph based data model • More expressive than the tree based XML Schema Model • RDF: Reification • Same piece of information can be given different values of belief by different clinical genomic researchers • Potential for “Schema-less” Data Integration • Hypothesis driven approach to defining mapping rules • Can define mapping rules on the fly • Incremental approach for Data Integration • Ability to introduce new data sources into the mix incrementally at low cost • Use of Ontology to disallow meaningless mapping rules? • E.g., mapping a gene to a protein…
Semantic Data Integration: “Schema-free” data integration • Low cost approach for data integration • No need for maintenance of costly schema mappings • Ability to “merge” RDF graphs based on simple declarative rules that specify: • Equality of URIs • Connecting nodes of same type • Connecting two nodes associated by a “path” • Disadvantage: Potential for specifying spurious, nonsensical rules
Semantic Data Integration: Use of Reification • Level of accuracy of test result • Sensitivity and specificity of lab result • Level of confidence in genotyping or gene sequencing • Probabilistic relationships • Likelihood that a particular test result or condition is indicative of a disease or other medical condition • Level of trust in a resource • Results from one lab may be trusted more than results from another • Results from well known health sites (NLM) may be trusted more than others • Belief attribution • Scientific hypotheses may be attributed to appropriate researchers
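Reification treats a statement as a thing that can itself be described. A minimal sketch, assuming a statement is wrapped in an annotation that records who asserts it and with what confidence: the class names and researcher names are hypothetical, and the 0.90 and 0.95 values simply echo the evidence percentages shown in the earlier figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Annotation:
    """A statement about a statement: who asserts it, and how strongly."""
    statement: Statement
    asserted_by: str
    confidence: float


claim = Statement("URI4", "indicates_disease", "URI6")

# The same claim, given different belief values by two hypothetical sources.
annotations = [
    Annotation(claim, "researcher_A", 0.90),
    Annotation(claim, "researcher_B", 0.95),
]


def believed(notes, threshold):
    """Statements that at least one source asserts with enough confidence."""
    return {n.statement for n in notes if n.confidence >= threshold}


def sources_for(notes, statement):
    """Belief attribution: who stands behind this statement?"""
    return sorted(n.asserted_by for n in notes if n.statement == statement)


print(believed(annotations, 0.93), sources_for(annotations, claim))
```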
The Available Data Space Separate RDF documents are merged automatically into one aggregate graph.
Recombination in Molecular Genetics works due to proper alignment of genetic regions, thereby preventing gene loss, mangling, or duplication.
Recombinant Data Graphs can be filtered and pivoted, without losing meaning
Recombinant Data • Mash-ups that don’t lose perspective • Dynamic mixing of data • Provide different views for different roles and functions • Dashboards • Direct output of a SPARQL query
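To give a feel for how a dashboard view could be the direct output of a query, here is a toy matcher for conjunctive triple patterns. It only imitates the basic idea behind a SPARQL graph pattern and is not a SPARQL engine; the function names and the sample graph are assumptions for illustration.

```python
def is_var(term):
    """Variables are written like SPARQL variables: '?name'."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so pattern matches triple, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if extended.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to something else
        elif p_term != t_term:
            return None      # constant does not match
    return extended


def query(graph, patterns):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


graph = {
    ("APP", "has_associated_disease", "Alzheimer disease"),
    ("PARK1", "has_associated_disease", "Parkinson disease"),
    ("TBP", "has_associated_disease", "Spinocerebellar ataxia"),
    ("Alzheimer disease", "isa", "Neurodegenerative diseases"),
    ("Parkinson disease", "isa", "Neurodegenerative diseases"),
}

rows = sorted(
    (b["?gene"], b["?disease"])
    for b in query(graph, [
        ("?gene", "has_associated_disease", "?disease"),
        ("?disease", "isa", "Neurodegenerative diseases"),
    ])
)
for gene, disease in rows:
    print(f"{gene:8} {disease}")
```

Different roles can reuse the same graph with different pattern lists, which is the sense in which the data stays "recombinant".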
Key Functionality offered by Semantic Web • Ubiquity • Same identifiers for anything from anywhere • Discoverability • Global search on any entity • Interoperability • => “Recombinant Data” is Application Independence