The Semantic Web: New-style data-integration (and how it works for life-scientists too!). Frank van Harmelen AI Department Vrije Universiteit Amsterdam. What’s the problem? (data-mess in bio-inf). The Study of Genes. Chromosomal location Sequence Sequence Variation Splicing
The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen AI Department Vrije Universiteit Amsterdam
The Study of Genes... • Chromosomal location • Sequence • Sequence Variation • Splicing • Protein Sequence • Protein Structure
… and Their Function • Homology • Motifs • Publications • Expression • HTS • In Vivo/Vitro Functional Characterization
Understanding Mechanisms of Disease • Metabolic and regulatory pathway induction
Development of Drugs, Vaccines, Diagnostics • Differing types of Drugs, Vaccines, and Diagnostics • Small molecules • Protein therapeutics • Gene therapy • In vitro, In vivo diagnostics • Development requires • Preclinical research • Clinical trials • Long-term clinical research • All of which often feeds back into ongoing Genomics research and discovery.
Sample Problem: Hyperprolactinemia • Overproduction of prolactin • prolactin stimulates mammary gland development and milk production • Hyperprolactinemia is characterized by: • inappropriate milk production • disruption of the menstrual cycle • can lead to conception difficulty
Understanding transcription factors for prolactin production • EXPRESSION (Q1): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells” • SEQUENCE (Q2): “Show me all genes that are homologous to known transcription factors” • LITERATURE (Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia” • Combined (Q1 ∩ Q2 ∩ Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
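The combined question amounts to intersecting three per-source result sets. A minimal Python sketch, assuming each source has already returned a set of gene identifiers; the gene names and the `combined_query` helper are hypothetical placeholders, not real findings:

```python
# Hypothetical per-source results; the gene symbols are placeholders.
expression_hits = {"GENE_A", "GENE_B", "GENE_C"}  # > 3-fold expression differential
homology_hits = {"GENE_B", "GENE_C", "GENE_D"}    # homologous to known transcription factors
literature_hits = {"GENE_B", "GENE_C", "GENE_E"}  # putatively related in the literature


def combined_query(*result_sets):
    """Return the identifiers that occur in every result set."""
    if not result_sets:
        return set()
    return set.intersection(*(set(r) for r in result_sets))


candidates = combined_query(expression_hits, homology_hits, literature_hits)
print(sorted(candidates))
```

The intersection is only meaningful when all three sources name a gene the same way, which is exactly what unintegrated sources do not guarantee.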
The Industry’s Problem Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources
Andy Law’s First Law “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.” ESTC Sept, 2008
Andy Law’s Second Law “The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”
What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …
Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006
Semantic Web Approach • Convert all data sources to an RDF representation (local or distributed) • Optional: collect the data in a scalable semantic repository • Apply light-weight reasoning to specify formal interpretations of the data, e.g.: • remove redundancy • establish equalities, etc. • Derive new implicit knowledge
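The first step, converting a data source to RDF, can be sketched for the simplest case of a flat file. This is only an illustration in plain Python: the column names, the `urn:example:` prefixes and the `rows_to_triples` helper are assumptions, and a real conversion would use an RDF library and agreed vocabularies.

```python
import csv
import io

# Hypothetical flat-file export; columns and values are placeholders.
FLAT_FILE = """gene_id,symbol,chromosome
G001,GENE_A,6
G002,GENE_B,3
"""


def rows_to_triples(text, id_column, prefix="urn:example:gene:"):
    """Turn each row into (subject, predicate, object) triples,
    one triple per non-empty column other than the key column."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = prefix + row[id_column]
        for column, value in row.items():
            if column != id_column and value != "":
                triples.append((subject, "urn:example:vocab:" + column, value))
    return triples


triples = rows_to_triples(FLAT_FILE, "gene_id")
for triple in triples:
    print(triple)
```

Once every source is expressed as triples of this shape, the later steps (equalities, redundancy removal, derivation) operate on one uniform data model.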
Machine-accessible meaning (What it’s like to be a machine) • META-DATA (Figure: the tags <treatment>, <name>, <symptoms>, <drug>, <disease> and <drugadministration>, connected by the relations “alleviates” and “IS-A”)
What is meta-data? • it's just data • it's data describing other data • it's meant for machine consumption (Figure labels: name, symptoms, disease, drug, administration)
Required are: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached • mechanisms for attribution and trust
What are ontologies & what are they used for? • No shared understanding: conceptual and terminological confusion • Agree on a conceptualization • Make it explicit in some language • Actors: both humans and machines (Figure labels: world, concept, language)
Standard vocabularies (“Ontologies”) • Identify the key concepts in a domain • Identify a vocabulary for these concepts • Identify relations between these concepts • Make these precise enough so that they can be shared between • humans and humans • humans and machines • machines and machines
Real life examples • handcrafted • music: CDnow (2410/5), MusicMoz (1073/7) • biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k) • Systems biology • ranging from lightweight (Yahoo, UNSPSC, Open Directory (400k)) to heavyweight (Cyc (300k)) • ranging from small (METAR) to large (UNSPSC)
Biomedical ontologies (a few...) • MeSH • Medical Subject Headings, National Library of Medicine • 22,000 descriptions • EMTREE • Commercial (Elsevier), drugs and diseases • 45,000 terms, 190,000 synonyms • UMLS • Integrates 100 different vocabularies • SNOMED • 200,000 concepts, College of American Pathologists • Gene Ontology • 15,000 terms in molecular biology • NCBI Cancer Ontology • 17,000 classes (about 1M definitions)
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached
Bluffer’s guide to RDF (1) • Object --Attribute-> Value triples • objects are web-resources • Value is again an Object: • triples can be linked • data-model = graph (Figure: pers05 --Author-of-> ISBN... --Publ-by-> MIT)
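Because the value of one triple can be the object of the next, a set of triples forms a graph that can be walked. A minimal sketch in plain Python, modelled on the slide's figure; the `ex:` identifiers and the `build_graph` and `follow` helpers are illustrative assumptions:

```python
from collections import defaultdict

# Triples modelled on the slide's figure; the "ex:" names are placeholders.
triples = [
    ("ex:pers05", "ex:author-of", "ex:book-isbn"),
    ("ex:book-isbn", "ex:publ-by", "ex:MIT"),
]


def build_graph(triples):
    """Index triples by subject: subject -> list of (attribute, value)."""
    graph = defaultdict(list)
    for subject, attribute, value in triples:
        graph[subject].append((attribute, value))
    return graph


def follow(graph, start, *attributes):
    """Follow a chain of attributes from `start`; return every node reached."""
    frontier = {start}
    for attribute in attributes:
        frontier = {
            value
            for node in frontier
            for (attr, value) in graph.get(node, [])
            if attr == attribute
        }
    return frontier


graph = build_graph(triples)
publishers = follow(graph, "ex:pers05", "ex:author-of", "ex:publ-by")
print(publishers)
```

Here the publisher of the person's book is found by chaining two triples, even though no single triple states it.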
What does RDF Schema add? • Defines vocabulary for RDF • Organizes this vocabulary in a typed hierarchy • Class, subClassOf, type • Property, subPropertyOf • domain, range (Figure: the classes Person, Teacher and Student linked by subClassOf; the property supervises with a domain and a range; the individuals Frank and Marta linked by type and supervises)
OWL: things RDF Schema can’t do • equality • enumeration • number restrictions • Single-valued/multi-valued • Optional/required values • inverse, symmetric, transitive • boolean algebra • Union, complement • …
Web of Data: anybody can say anything about anything • All identifiers are URLs (= on the Web) • Allows total decoupling of • data • vocabulary • meta-data • each with different owners & locations (Figure: [<x> IsOfType <T>], with x, T and <prince> held in different places)
RDF(S) has a (very small) formal semantics • Defines what other statements are implied by a given set of RDF(S) statements • Ensures mutual agreement on minimal content between parties without further contact • In the form of “entailment rules” • Very simple to compute (and not explosive in practice)
RDF(S) semantics: examples • Aspirin isOfType Painkiller + Painkiller subClassOf Drug ⇒ Aspirin isOfType Drug • aspirin alleviates headache + alleviates range symptom ⇒ headache isOfType symptom
RDF(S) semantics • X R Y + R domain T ⇒ X IsOfType T • X R Y + R range T ⇒ Y IsOfType T • T1 SubClassOf T2 + T2 SubClassOf T3 ⇒ T1 SubClassOf T3 • X IsOfType T1 + T1 SubClassOf T2 ⇒ X IsOfType T2
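These four entailment rules can be computed by applying them repeatedly until nothing new appears. A minimal sketch, assuming triples are plain Python tuples and the predicates are spelled `isOfType`, `subClassOf`, `domain` and `range`; the function name `rdfs_closure` is hypothetical, and the full RDF(S) rule set is larger than these four rules:

```python
def rdfs_closure(triples):
    """Apply the four rules from the slide until a fixed point is reached."""
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                # X R Y + R domain T  =>  X isOfType T
                if p2 == "domain" and s2 == p:
                    new.add((s, "isOfType", o2))
                # X R Y + R range T  =>  Y isOfType T
                if p2 == "range" and s2 == p:
                    new.add((o, "isOfType", o2))
                # T1 subClassOf T2 + T2 subClassOf T3  =>  T1 subClassOf T3
                if p == "subClassOf" and p2 == "subClassOf" and o == s2:
                    new.add((s, "subClassOf", o2))
                # X isOfType T1 + T1 subClassOf T2  =>  X isOfType T2
                if p == "isOfType" and p2 == "subClassOf" and o == s2:
                    new.add((s, "isOfType", o2))
        if new <= facts:
            return facts
        facts |= new


stated = {
    ("Aspirin", "isOfType", "Painkiller"),
    ("Painkiller", "subClassOf", "Drug"),
    ("Aspirin", "alleviates", "headache"),
    ("alleviates", "range", "symptom"),
}
inferred = rdfs_closure(stated) - stated
print(sorted(inferred))
```

The nested loop is quadratic per pass, which is fine for a demonstration; a repository at scale would index triples by predicate instead.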
OWL also has a formal semantics • Defines what other statements are implied by a given set of statements • Ensures mutual agreement on content (both minimal and maximal) between parties without further contact • Can be used for integrity/consistency checking • Hard to compute (and rarely/sometimes/always explosive in practice)
OWL semantics: minimal • vanGogh isOfType Impressionist + Impressionist subClassOf Painter ⇒ vanGogh isOfType Painter • vanGogh painter-of sunflowers + painter-of domain painter ⇒ vanGogh isOfType painter
OWL semantics: maximal • vanGogh isOfType Impressionist + Impressionist disjointFrom Cubist ⇒ NOT: vanGogh isOfType Cubist • painted-by has-cardinality 1 + sun-flowers painted-by vanGogh + Picasso different-individual-from vanGogh ⇒ NOT: sun-flowers painted-by Picasso
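The "maximal" side is what makes consistency checking possible: some statements are ruled out by what is already known. The sketch below checks only the two patterns on the slide (disjoint classes and a cardinality of 1) and is in no way a full OWL reasoner; the tuple encoding and the `find_conflicts` helper are assumptions for illustration.

```python
def find_conflicts(facts, candidate):
    """Return reasons why adding `candidate` would contradict `facts`."""
    s, p, o = candidate
    reasons = []
    # Pattern 1: the subject already belongs to a class disjoint from `o`.
    if p == "isOfType":
        for (s2, p2, o2) in facts:
            if s2 == s and p2 == "isOfType":
                if (o2, "disjointFrom", o) in facts or (o, "disjointFrom", o2) in facts:
                    reasons.append(f"{s} is a {o2}, which is disjoint from {o}")
    # Pattern 2: the property allows one value and a provably different one exists.
    if (p, "has-cardinality", 1) in facts:
        for (s2, p2, o2) in facts:
            if s2 == s and p2 == p and o2 != o:
                distinct = ((o, "different-individual-from", o2) in facts
                            or (o2, "different-individual-from", o) in facts)
                if distinct:
                    reasons.append(f"{s} already has {p} {o2}, which differs from {o}")
    return reasons


facts = {
    ("vanGogh", "isOfType", "Impressionist"),
    ("Impressionist", "disjointFrom", "Cubist"),
    ("painted-by", "has-cardinality", 1),
    ("sun-flowers", "painted-by", "vanGogh"),
    ("Picasso", "different-individual-from", "vanGogh"),
}
print(find_conflicts(facts, ("vanGogh", "isOfType", "Cubist")))
print(find_conflicts(facts, ("sun-flowers", "painted-by", "Picasso")))
```

Note the role of different-individual-from: without it, a second painter would not be a contradiction, since OWL would instead conclude that the two names refer to the same individual.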
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached
Question: who writes the ontologies? Professional bodies, scientific communities, companies, publishers, …. • See previous slide on Biomedical ontologies • Same developments in many other fields Good old fashioned Knowledge Engineering Convert from DB-schema, UML, etc.
Question: Who writes the meta-data? • Automated learning • shallow natural language analysis • Concept extraction • Example: Encyclopedia Britannica on “Amsterdam” (Figure: extracted concepts include trade, antwerp, europe, amsterdam, netherlands, merchant, center, city, town)
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached
How to handle multiple ontologies: ontology matching • Linguistics & structure • Shared vocabulary • Instance-based matching • Shared background knowledge
Matching through shared vocabulary
Matching using shared background knowledge (Figure: ontology 1 and ontology 2, each linked to the shared background knowledge)
Some working examples? • Linked Life Data http://www.linkedlifedata.com • DOPE • HCLS http://www.w3.org/2001/sw/hcls/
Linked Life Data Overview • LinkedLifeData statistics: • Number of statements: 1,159,857,602 • Number of explicit statements: 403,361,589 • Number of entities: 128,948,564 • Platform to automate the process: • Infrastructure to store the data and perform inference • Transform the structured data sources to RDF • Provide a web interface to access the data • Currently operates over the OWLIM semantic repository • Publicly available at: http://www.linkedlifedata.com
Light Weight Reasoning in Linked Life Data • Use relationships to derive new implicit knowledge • Resolve the syntactic differences in the identifiers (Figure: example graph in which urn:uniprot:, urn:biogrid:, urn:intact: and urn:pubmed: resources are connected by rdf:type, sameAs, rdf:seeAlso, hasParticipant and interactsWith edges; these are only example resource names)
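Both ideas on this slide can be sketched in a few lines: merge identifiers connected by sameAs, then derive interactsWith between participants of the same interaction. The derivation rule is an assumption read off the figure's edge labels, the resource names are invented placeholders, and the helper names are hypothetical; this is not a description of how OWLIM works internally.

```python
from itertools import combinations


def canonical_ids(triples):
    """Return a function mapping each identifier to one representative
    of its sameAs group (a small union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "sameAs":
            a, b = find(s), find(o)
            if a != b:
                parent[max(a, b)] = min(a, b)  # smallest name wins, for determinism
    return find


def resolve_and_derive(triples):
    """Rewrite triples onto canonical identifiers, then add interactsWith
    between every pair of participants in the same interaction."""
    find = canonical_ids(triples)
    merged = {(find(s), p, find(o)) for s, p, o in triples if p != "sameAs"}
    participants = {}
    for s, p, o in merged:
        if p == "hasParticipant":
            participants.setdefault(s, set()).add(o)
    derived = set()
    for members in participants.values():
        for a, b in combinations(sorted(members), 2):
            derived.add((a, "interactsWith", b))
            derived.add((b, "interactsWith", a))
    return merged | derived


# Placeholder resource names, in the spirit of the slide's example URNs.
data = [
    ("urn:intact:i1", "hasParticipant", "urn:uniprot:pA"),
    ("urn:intact:i1", "hasParticipant", "urn:uniprot:pB"),
    ("urn:biogrid:gA", "sameAs", "urn:uniprot:pA"),
    ("urn:biogrid:gA", "rdf:seeAlso", "urn:pubmed:doc1"),
]
result = resolve_and_derive(data)
for triple in sorted(result):
    print(triple)
```

After merging, the literature link recorded under the BioGRID name and the interaction recorded under the UniProt name sit on the same node, so one query can reach both.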
Some working examples? • Linked Life Data http://www.linkedlifedata.com • DOPE • HCLS http://www.w3.org/2001/sw/hcls/
The Data • Document repositories: • ScienceDirect: approx. 500,000 full-text articles • MEDLINE: approx. 10,000,000 abstracts • Extracted Metadata • The Collexis Metadata Server: concept-extraction ("semantic fingerprinting") • Thesauri and Ontologies • EMTREE: 60,000 preferred terms, 200,000 synonyms