The Semantic Web: New-style data-integration (and how it works for life-scientists too!). Frank van Harmelen AI Department Vrije Universiteit Amsterdam. What’s the problem? (data-mess in bio-inf). Kenneth Griffiths and Richard Resnick Tut. At Intell. Systems for Molec. Biol., 2003.
The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen, AI Department, Vrije Universiteit Amsterdam
Life Science Data (Kenneth Griffiths and Richard Resnick, Tutorial at Intelligent Systems for Molecular Biology, 2003)
Recent focus on genetic data
“genomics: the study of genes and their function. Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy.” The Pharmaceutical Research and Manufacturers of America - http://www.phrma.org/genomics/lexicon/g.html
• Study of genes and their function
• Understanding molecular mechanisms of disease
• Development of drugs, vaccines, and diagnostics
The Study of Genes... • Chromosomal location • Sequence • Sequence Variation • Splicing • Protein Sequence • Protein Structure
… and Their Function • Homology • Motifs • Publications • Expression • HTS • In Vivo/Vitro Functional Characterization
Understanding Mechanisms of Disease • Metabolic and regulatory pathway induction
Development of Drugs, Vaccines, Diagnostics • Differing types of Drugs, Vaccines, and Diagnostics • Small molecules • Protein therapeutics • Gene therapy • In vitro, In vivo diagnostics • Development requires • Preclinical research • Clinical trials • Long-term clinical research • All of which often feeds back into ongoing Genomics research and discovery.
The Industry’s Problem Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources
What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …
Sample Problem: Hyperprolactinemia Overproduction of prolactin • prolactin stimulates mammary gland development and milk production Hyperprolactinemia is characterized by: • inappropriate milk production • disruption of menstrual cycle • can lead to conception difficulty
Understanding transcription factors for prolactin production
• Q1 (EXPRESSION): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
• Q2 (SEQUENCE): “Show me all genes that are homologous to known transcription factors”
• Q3 (LITERATURE): “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
• Combined (Q1 ∩ Q2 ∩ Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
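A minimal sketch of what answering the combined query involves once the three sources can be queried separately: map each source's gene names onto one shared identifier, then intersect the three result sets. The gene names, the `CANONICAL` table and the function names are hypothetical placeholders, not data from the talk.

```python
# Hypothetical mapping from source-specific gene names to one shared identifier.
# Real sources have no such table ready-made; that gap is the integration problem.
CANONICAL = {
    "GENE_1": "gene1", "g1": "gene1",
    "Gene-2": "gene2", "g2": "gene2",
    "g3": "gene3",
}

def normalise(names):
    """Translate source-specific names to shared identifiers (unknown names pass through)."""
    return {CANONICAL.get(name, name) for name in names}

def combined_query(*result_sets):
    """Return the genes that occur in every result set after normalisation."""
    normalised = [normalise(r) for r in result_sets]
    return set.intersection(*normalised) if normalised else set()

# Placeholder answers to the expression, sequence and literature queries.
expression_hits = {"g1", "g2", "g3"}
homology_hits = {"GENE_1", "Gene-2"}
literature_hits = {"gene1", "g3", "gene2"}

candidates = combined_query(expression_hits, homology_hits, literature_hits)
print(sorted(candidates))
```

Without the normalisation step the intersection would be empty, because no two sources spell the same gene the same way.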
Pharmaceutical Productivity Source: PhRMA & FDA 2003
Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006
The Medical tower of Babel
• MeSH
  • Medical Subject Headings, National Library of Medicine
  • 22,000 descriptions
• EMTREE
  • Commercial (Elsevier), drugs and diseases
  • 45,000 terms, 190,000 synonyms
• UMLS
  • Integrates 100 different vocabularies
• SNOMED
  • 200,000 concepts, College of American Pathologists
• Gene Ontology
  • 15,000 terms in molecular biology
• NCI Cancer Ontology
  • 17,000 classes (about 1M definitions)
META-DATA: machine accessible meaning (What it’s like to be a machine)
[Figure: a page marked up with the tags <name>, <symptoms>, <disease>, <drug>, <drugadministration> and <treatment>, connected by the relations IS-A and alleviates]
What is meta-data? • it's just data • it's data describing other data • it's meant for machine consumption
[Figure labels: name symptoms disease drug administration]
Required are: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached • mechanisms for attribution and trust (is this page really about Pamela Anderson?)
What are ontologies & what are they used for?
[Figure labels: world, concept, language]
• No shared understanding: conceptual and terminological confusion
• Agree on a conceptualization
• Make it explicit in some language
• Actors: both humans and machines
standard vocabularies (“Ontologies”) • Identify the key concepts in a domain • Identify a vocabulary for these concepts • Identify relations between these concepts • Make these precise enough so that they can be shared between • humans and humans • humans and machines • machines and machines
Shared content-vocabularies: Ontologies
Formal, explicit specification of a shared conceptualisation
• Formal: machine processable
• Explicit: concepts, properties, relations, functions
• Shared: consensual knowledge
• Conceptualisation: abstract model of some domain
Real life examples • handcrafted • music: CDnow (2410/5), MusicMoz (1073/7) • biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology • ranging from lightweight (Yahoo, UNSPSC, Open directory (400k)) to heavyweight (Cyc (300k)) • ranging from small (METAR) to large (UNSPSC)
Biomedical ontologies (a few..)
• MeSH
  • Medical Subject Headings, National Library of Medicine
  • 22,000 descriptions
• EMTREE
  • Commercial (Elsevier), drugs and diseases
  • 45,000 terms, 190,000 synonyms
• UMLS
  • Integrates 100 different vocabularies
• SNOMED
  • 200,000 concepts, College of American Pathologists
• Gene Ontology
  • 15,000 terms in molecular biology
• NCI Cancer Ontology
  • 17,000 classes (about 1M definitions)
What’s inside an ontology? (listed by increasing semantic “weight”) • terms + specialisation hierarchy • classes + class-hierarchy • instances • slots/values • inheritance (multiple? defaults?) • restrictions on slots (type, cardinality) • properties of slots (symm., trans., …) • relations between classes (disjoint, covers) • reasoning tasks: classification, subsumption
NB: we’re not doing philosophy • Ontologies are not definitive descriptions of what exists in the world (= philosophy) • Ontologies are models of the world constructed to facilitate communication • Yes, ontologies exist (because we build them)
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached
Stack of languages • XML: • Surface syntax, no semantics • XML Schema: • Describes structure of XML documents • RDF: • Datamodel for “relations” between “things” • RDF Schema: • RDF Vocabulary Definition Language • OWL: • A more expressive Vocabulary Definition Language
Bluffer’s guide to RDF (1)
• Object --Attribute-> Value triples
• objects are web-resources
• Value is again an Object:
  • triples can be linked
  • data-model = graph
[Figure: example graph with the nodes pers05, ISBN... and MIT and the edges Author-of and Publ-by]
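The triple model can be sketched in a few lines of plain Python, without any RDF library. The two triples below are one assumed reading of the slide's figure (pers05 is author of ISBN..., which is published by MIT); `build_graph` and `follow` are illustrative helper names, not part of any RDF toolkit.

```python
from collections import defaultdict

# Object --Attribute-> Value triples; "ISBN..." is kept elided as on the slide.
triples = [
    ("pers05", "Author-of", "ISBN..."),
    ("ISBN...", "Publ-by", "MIT"),
]

def build_graph(triples):
    """Index triples by their object so that linked triples form a graph."""
    graph = defaultdict(list)
    for obj, attribute, value in triples:
        graph[obj].append((attribute, value))
    return graph

def follow(graph, start, *attributes):
    """Walk a chain of attributes from `start` and return the nodes reached."""
    frontier = {start}
    for attribute in attributes:
        frontier = {
            value
            for node in frontier
            for (attr, value) in graph.get(node, [])
            if attr == attribute
        }
    return frontier

graph = build_graph(triples)
publishers = follow(graph, "pers05", "Author-of", "Publ-by")
print(publishers)
```

Because the value of the first triple is the object of the second, the two triples link up and the walk from pers05 reaches the publisher.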
Bluffer’s guide to RDF (2)
• Every identifier is a URL = world-wide unique naming!
• Has XML syntax:
  <rdf:Description rdf:about="#pers05"> <authorOf>ISBN...</authorOf> </rdf:Description>
• Any statement can be an object
• graphs can be nested
[Figure: NYT --claims-> (pers05 --Author-of-> ISBN...)]
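A rough illustration of "any statement can be an object": in the sketch below a triple is a plain tuple, and a whole tuple is allowed in the value position of another triple. This is a simplification for illustration, not how RDF serialises such statements, and the function name is hypothetical.

```python
# A statement, and a second statement that has the first one as its value.
authorship = ("#pers05", "authorOf", "ISBN...")
claim = ("NYT", "claims", authorship)

# Only the claim is stored as a top-level statement.
store = [claim]

def statements_claimed_by(source, store):
    """Return the nested statements that `source` is recorded as claiming."""
    return [
        value
        for (obj, attribute, value) in store
        if obj == source and attribute == "claims" and isinstance(value, tuple)
    ]

claimed = statements_claimed_by("NYT", store)
print(claimed)
print(authorship in store)  # the nested statement itself is not asserted
```

The design point is that recording who claims a statement does not assert the statement itself, which is what makes nested graphs useful for attribution and trust.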
What does RDF Schema add?
• Defines vocabulary for RDF
• Organizes this vocabulary in a typed hierarchy
  • Class, subClassOf, type
  • Property, subPropertyOf
  • domain, range
[Figure: class Person with the subclasses Teacher and Student (subClassOf); property supervises with a domain and a range; instances Frank and Marta (type), linked by supervises]
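What a typed hierarchy buys can be shown with a small hand-written inference loop. The sketch assumes that supervises has domain Teacher and range Student, which is one plausible reading of the flattened figure, and it covers only the domain, range and subClassOf rules rather than the full RDF Schema semantics.

```python
# Schema (assumed reading of the figure).
SUBCLASS_OF = {"Teacher": "Person", "Student": "Person"}
DOMAIN = {"supervises": "Teacher"}
RANGE = {"supervises": "Student"}

# Instance data: a single triple, with no explicit type statements.
facts = [("Frank", "supervises", "Marta")]

def infer_types(facts):
    """Derive (instance, class) pairs from domain, range and subClassOf."""
    types = set()
    for subject, prop, value in facts:
        if prop == "type":
            types.add((subject, value))
        if prop in DOMAIN:
            types.add((subject, DOMAIN[prop]))   # subject belongs to the domain class
        if prop in RANGE:
            types.add((value, RANGE[prop]))      # value belongs to the range class
    # Propagate membership up the class hierarchy until nothing new appears.
    changed = True
    while changed:
        changed = False
        for instance, cls in list(types):
            parent = SUBCLASS_OF.get(cls)
            if parent is not None and (instance, parent) not in types:
                types.add((instance, parent))
                changed = True
    return types

inferred = infer_types(facts)
print(sorted(inferred))
```

From one supervises triple the schema alone yields that Frank is a Teacher, Marta is a Student, and both are Persons.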
Stack of languages • XML: • Surface syntax, no semantics • XML Schema: • Describes structure of XML documents • RDF: • Datamodel for “relations” between “things” • RDF Schema: • RDF Vocabulary Definition Language • OWL: • A more expressive Vocabulary Definition Language
OWL: things RDF Schema can’t do • equality • enumeration • number restrictions • Single-valued/multi-valued • Optional/required values • inverse, symmetric, transitive • boolean algebra • Union, complement • …
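To make the property characteristics concrete, here is a small sketch that computes the statements following from declaring a property inverse, symmetric or transitive. The property names (partOf, hasPart, interactsWith) and the instance names are hypothetical, and the loop is a naive fixpoint computation rather than what an OWL reasoner actually does.

```python
def closure(facts, transitive=frozenset(), symmetric=frozenset(), inverses=None):
    """Add every triple implied by the declared property characteristics."""
    inverses = inverses or {}
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverses:
                new.add((o, inverses[p], s))
            if p in transitive:
                for s2, p2, o2 in known:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= known:          # fixpoint reached: nothing new was derived
            return known
        known |= new

facts = [
    ("nucleus", "partOf", "cell"),
    ("cell", "partOf", "tissue"),
    ("p1", "interactsWith", "p2"),
]
derived = closure(
    facts,
    transitive={"partOf"},
    symmetric={"interactsWith"},
    inverses={"partOf": "hasPart", "hasPart": "partOf"},
)
print(sorted(derived))
```

None of the derived triples could be stated as a rule in RDF Schema alone, which only offers class and property hierarchies plus domain and range.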
OWL: more expressivity
• OWL Lite
  • (sub)classes, individuals
  • (sub)properties, domain, range
  • conjunction
  • (in)equality
  • cardinality 0/1
  • datatypes
  • inverse, transitive, symmetric
  • hasValue
  • someValuesFrom
  • allValuesFrom
• OWL DL
  • Negation
  • Disjunction
  • Full Cardinality
  • Enumerated types
• OWL Full
  • Allow meta-classes etc
[Figure: layers labelled RDF Schema, Lite, DL, Full]
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached
Question: who writes the ontologies? • Professional bodies, scientific communities, companies, publishers, … • See previous slide on Biomedical ontologies • Same developments in many other fields • Good old fashioned Knowledge Engineering • Convert from DB-schema, UML, etc.
Question: Who writes the meta-data? • Automated learning • shallow natural language analysis • Concept extraction Example: Encyclopedia Britannica on “Amsterdam”
[Figure: extracted concepts: trade, antwerp, europe, amsterdam, netherlands, merchant, center, city, town]
Question: Who writes the meta-data? • exploit existing legacy-data • Amazon • Lab equipment? • side-effect from user interaction • MIT Lab photo-annotator • NOT from manual effort • Web 2.0 community/social interaction
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached
Some working examples? • DOPE • HCLS (http://www.w3.org/2001/sw/hcls/)
DOPE: Background • Vertical Information Provision • Buy a topic instead of a journal! • Web provides new opportunities • Business driver: drug development • Rich, information-hungry market • Good thesaurus (EMTREE)
The Data • Document repositories: • ScienceDirect: approx. 500,000 full-text articles • MEDLINE: approx. 10,000,000 abstracts • Extracted Metadata • The Collexis Metadata Server: concept-extraction ("semantic fingerprinting") • Thesauri and Ontologies • EMTREE: 60,000 preferred terms, 200,000 synonyms
Architecture:
[Figure: Query interface; RDF Schema (EMTREE); RDF for Datasource 1 … RDF for Datasource n]
Architecture:
[Figure components: GUI: Spectacle (Aduna), using http requests; Mediator: Sesame (Aduna), queried in SeRQL; Document Model (RDFS); EMTREE Thesaurus (RDFS); Source Model (RDF); Metadata Server (Collexis), reached via SOAP; Java Client; Additional Source of Data: Gene Thesaurus (RDFS) with its own Source Model (RDF)]
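The mediator idea in this architecture can be sketched as follows: one query interface expands a preferred term with its thesaurus synonyms and then asks every datasource in turn. Everything here is hypothetical (class names, document ids, the single thesaurus entry); it does not use the Sesame, SeRQL or Collexis interfaces, only the general pattern.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: str
    concepts: frozenset  # concept labels attached by a metadata extractor

class Datasource:
    def __init__(self, name, documents):
        self.name = name
        self.documents = list(documents)

    def query(self, terms):
        """Return the documents annotated with at least one of the terms."""
        return [d for d in self.documents if d.concepts & terms]

class Mediator:
    """Single entry point that hides how many datasources sit behind it."""

    def __init__(self, thesaurus, sources):
        self.thesaurus = thesaurus  # preferred term -> set of synonyms
        self.sources = sources

    def search(self, preferred_term):
        terms = frozenset(self.thesaurus.get(preferred_term, ())) | {preferred_term}
        hits = []
        for source in self.sources:
            for doc in source.query(terms):
                hits.append((source.name, doc.doc_id))
        return hits

# Hypothetical thesaurus entry and two hypothetical datasources.
thesaurus = {"aspirin": {"acetylsalicylic acid"}}
source_1 = Datasource("datasource_1", [
    Document("doc-a", frozenset({"aspirin", "headache"})),
    Document("doc-b", frozenset({"insulin"})),
])
source_n = Datasource("datasource_n", [
    Document("doc-c", frozenset({"acetylsalicylic acid"})),
])

mediator = Mediator(thesaurus, [source_1, source_n])
results = mediator.search("aspirin")
print(results)
```

A further datasource is added by appending it to the mediator's source list, while the client's query stays the same.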