The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen, AI Department, Vrije Universiteit Amsterdam
Pharmaceutical Productivity Source: PhRMA & FDA 2003
The Industry’s Problem (Source: Kenneth Griffiths and Richard Resnick, Tut. at Intell. Systems for Molec. Biol., 2003) Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources
What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …
Sample Problem: Hyperprolactinemia Overproduction of prolactin • prolactin stimulates mammary gland development and milk production Hyperprolactinemia is characterized by: • inappropriate milk production • disruption of the menstrual cycle • possible difficulty conceiving
Understanding transcription factors for prolactin production • EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells” • SEQUENCE: “Show me all genes that are homologous to known transcription factors” • LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia” • Combined: “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.” (Q1 ∩ Q2 ∩ Q3)
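The combined question is the intersection of the three single-source answers. A minimal sketch, assuming each source has already been queried and has returned a set of gene identifiers: the gene names and result sets below are made up for illustration, and the hard part (getting three sources to agree on identifiers at all) is assumed away.

```python
# Hypothetical result sets, one per data source. Real sources would
# typically each use their own naming scheme for the same gene.
literature_hits = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
expression_hits = {"GENE_B", "GENE_C", "GENE_E"}
sequence_hits = {"GENE_B", "GENE_C", "GENE_F"}


def integrate(*result_sets):
    """Return the identifiers present in every result set, sorted."""
    if not result_sets:
        return []
    common = set.intersection(*(set(s) for s in result_sets))
    return sorted(common)


candidates = integrate(literature_hits, expression_hits, sequence_hits)
print(candidates)
```

The intersection itself is trivial; it only works because the toy sets already share one identifier scheme, which is exactly what unintegrated sources lack.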
The Medical Tower of Babel • MeSH • Medical Subject Headings, National Library of Medicine • 22,000 descriptions • EMTREE • Commercial (Elsevier), drugs and diseases • 45,000 terms, 190,000 synonyms • UMLS • Integrates 100 different vocabularies • SNOMED • 200,000 concepts, College of American Pathologists • Gene Ontology • 15,000 terms in molecular biology • NCI Cancer Ontology • 17,000 classes (about 1M definitions)
Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006
Machine-accessible meaning (what it’s like to be a machine) Figure: a page annotated with META-DATA tags <name>, <symptoms>, <disease>, <drug>, <treatment> and <drugadministration>, connected by relations such as IS-A and alleviates
What is meta-data? • it's just data • it's data describing other data • it's meant for machine consumption (Figure labels: name, symptoms, disease, drug, administration)
Required are: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached • mechanisms for attribution and trust (is this page really about Pamela Anderson?)
What are ontologies & what are they used for? • No shared understanding: conceptual and terminological confusion • Agree on a conceptualization • Make it explicit in some language • Actors: both humans and machines (Figure labels: world, concept, language)
Standard vocabularies (“Ontologies”) • Identify the key concepts in a domain • Identify a vocabulary for these concepts • Identify relations between these concepts • Make these precise enough so that they can be shared between • humans and humans • humans and machines • machines and machines
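A minimal sketch of what "concepts plus relations, precise enough for machines" can mean in practice: a handful of concepts linked by IS-A, and a function that follows those links. The concept names and the hierarchy are illustrative assumptions, not taken from MeSH, SNOMED or any other real vocabulary.

```python
# Hypothetical mini-ontology: child concept -> parent concept (IS-A).
IS_A = {
    "hyperprolactinemia": "pituitary disorder",
    "pituitary disorder": "endocrine disease",
    "endocrine disease": "disease",
}


def ancestors(concept, is_a=IS_A):
    """Follow IS-A links upward; return broader concepts, nearest first."""
    found = []
    seen = {concept}
    while concept in is_a:
        concept = is_a[concept]
        if concept in seen:  # guard against accidental cycles
            break
        seen.add(concept)
        found.append(concept)
    return found


def is_a_kind_of(concept, broader, is_a=IS_A):
    """True if `broader` can be reached from `concept` via IS-A links."""
    return broader in ancestors(concept, is_a)


print(ancestors("hyperprolactinemia"))
print(is_a_kind_of("hyperprolactinemia", "disease"))
```

Because the relation is explicit, a machine can answer "is this a disease?" without that fact being stated directly for every concept.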
Biomedical ontologies (a few...) • MeSH • Medical Subject Headings, National Library of Medicine • 22,000 descriptions • EMTREE • Commercial (Elsevier), drugs and diseases • 45,000 terms, 190,000 synonyms • UMLS • Integrates 100 different vocabularies • SNOMED • 200,000 concepts, College of American Pathologists • Gene Ontology • 15,000 terms in molecular biology • NCI Cancer Ontology • 17,000 classes (about 1M definitions)
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached
Stack of languages • XML: • Surface syntax, no semantics • XML Schema: • Describes structure of XML documents • RDF: • Data model for “relations” between “things” • RDF Schema: • RDF Vocabulary Definition Language • OWL: • A more expressive Vocabulary Definition Language
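The RDF layer of this stack reduces to statements of the form (subject, predicate, object). A minimal sketch of that data model in plain Python, with no RDF library involved: the `ex:` names are illustrative placeholders rather than real URIs, and the `match` helper is a hypothetical stand-in for a proper query language.

```python
# Each statement is a (subject, predicate, object) triple.
# All names are illustrative placeholders, not real URIs.
triples = {
    ("ex:prolactin", "rdf:type", "ex:Hormone"),
    ("ex:hyperprolactinemia", "rdf:type", "ex:Disease"),
    ("ex:hyperprolactinemia", "ex:hasSymptom", "ex:disrupted_menstrual_cycle"),
    ("ex:hyperprolactinemia", "ex:hasSymptom", "ex:inappropriate_milk_production"),
}


def match(pattern, store=triples):
    """Return sorted triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


symptoms = [o for _, _, o in match(("ex:hyperprolactinemia", "ex:hasSymptom", None))]
print(symptoms)
```

RDF Schema and OWL would then add statements about the vocabulary itself (for example, that `ex:Disease` is a class), which this sketch leaves out.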
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached
Question: who writes the ontologies? Professional bodies, scientific communities, companies, publishers, …. • See previous slide on Biomedical ontologies • Same developments in many other fields Good old fashioned Knowledge Engineering Convert from DB-schema, UML, etc.
Question: Who writes the meta-data? • Automated learning • shallow natural language analysis • Concept extraction Example: Encyclopedia Britannica on “Amsterdam” (extracted concepts shown: trade, antwerp, europe, amsterdam, netherlands, merchant, center, city, town)
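A minimal sketch of shallow concept extraction, assuming the crudest possible approach: count the non-stopword words and keep the most frequent ones. The sample sentence and the stopword list are made up for illustration (the text is not the Britannica article), and production systems use far richer linguistic analysis than word counts.

```python
import re
from collections import Counter

# Small illustrative stopword list; real systems use much larger ones.
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "is", "it", "its",
    "was", "as", "for", "to", "with", "by", "on",
}


def extract_concepts(text, top_n=5):
    """Return the top_n most frequent non-stopword words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by descending count, then alphabetically for deterministic output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


sample = (
    "Amsterdam is a city in the Netherlands. The city was a center of trade, "
    "and trade with Antwerp made Amsterdam merchants rich."
)
print(extract_concepts(sample, top_n=3))
```

The extracted words could then be attached to the page as meta-data without anyone annotating it by hand.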
Question: Who writes the meta-data? • exploit existing legacy-data • Databases • Lab equipment • (Amazon) • side-effect from user interaction • email keyword extraction • NOT from manual effort
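Exploiting legacy data can be as simple as re-expressing existing database rows as statements. A minimal sketch, assuming rows arrive as dictionaries with one column acting as the key: the column names, the `ex:` prefix and the `rows_to_triples` helper are all hypothetical.

```python
def rows_to_triples(rows, key, prefix="ex:"):
    """Turn each row (a dict) into (subject, predicate, object) triples.

    The value in the `key` column becomes the subject; every other
    non-empty column becomes one statement about that subject.
    """
    triples = []
    for row in rows:
        subject = f"{prefix}{row[key]}"
        for column, value in row.items():
            if column == key or value is None:
                continue  # skip the key itself and missing values
            triples.append((subject, f"{prefix}{column}", value))
    return triples


# Hypothetical rows, as they might come out of an existing gene table.
legacy_rows = [
    {"gene_id": "G1", "symbol": "GENE_A", "organism": "human"},
    {"gene_id": "G2", "symbol": "GENE_B", "organism": None},
]
for triple in rows_to_triples(legacy_rows, key="gene_id"):
    print(triple)
```

No manual annotation is involved: the meta-data falls out of data that already exists.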
Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached
Some working examples? • DOPE
DOPE: Background • Vertical Information Provision • Buy a topic instead of a journal! • Web provides new opportunities • Business driver: drug development • Rich, information-hungry market • Good thesaurus (EMTREE)
The Data • Document repositories: • ScienceDirect: approx. 500,000 fulltext articles • MEDLINE: approx. 10,000,000 abstracts • Extracted Metadata • The Collexis Metadata Server: concept-extraction ("semantic fingerprinting") • Thesauri and Ontologies • EMTREE: 60,000 preferred terms, 200,000 synonyms
Architecture: (Diagram labels: Query interface; EMTREE in RDF Schema; RDF for Datasource 1 …. Datasource n)
Ontology disambiguates query
Ontology refines query
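A minimal sketch of both uses of a thesaurus at query time, under stated assumptions: disambiguation is modelled as mapping a user's word to a preferred term, and refinement as offering the narrower terms beneath it. The two small tables are illustrative and not taken from EMTREE, and the slides do not say how DOPE implements either step.

```python
# Hypothetical thesaurus fragments, for illustration only.
SYNONYMS = {"prl": "prolactin", "lactotropin": "prolactin"}
NARROWER = {
    "hormone": ["pituitary hormone"],
    "pituitary hormone": ["prolactin"],
}


def preferred_term(user_term, synonyms=SYNONYMS):
    """Map a user's word to its preferred term (unchanged if unknown)."""
    term = user_term.strip().lower()
    return synonyms.get(term, term)


def narrower_terms(term, narrower=NARROWER):
    """Collect all terms below `term` in the hierarchy, breadth-first."""
    result = []
    queue = list(narrower.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in result:
            result.append(current)
            queue.extend(narrower.get(current, []))
    return result


print(preferred_term("PRL"))
print(narrower_terms("hormone"))
```

A query interface could run the search on the preferred term and present the narrower terms as clickable refinements.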
Some working examples? • DOPE • HCLS (http://www.w3.org/2001/sw/hcls/)
Architecture: (Diagram labels: Query interface; EMTREE in RDF Schema; Gene Ontology in RDF Schema; …. ; RDF for Datasource 1 …. Datasource n)
Summarising… • Data integration on the Web: • machine-processable data besides human-processable data • Syntax for meta-data • Representation • Inference • Vocabularies for meta-data • Lots of them in bio-inf. • Actual meta-data: • Lots in bio-inf. • Will enable: • Better search engines (recall, precision, concepts) • Combining information across pages (inference) • …
Things to do for you • Practical: Use existing software to construct new use scenarios • Conceptual: Create an ontology for some area of bio-medical expertise • from scratch • as a refinement of an existing ontology • Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)