Scientific RDF Databases Michael Mertens K.U.Leuven
Outline • Introduction to RDF • RDF Databases • Advantages for scientific R&D • In practice • Criticism
Introduction RDF: Resource Description Framework • Originally: a metadata data model • Now: a general method for the conceptual description of web resources (Semantic Web)
Introduction > Semantic Web • Traditional Web in 2009: • Sharing documents • URL as retrieval mechanism • HTML as standard format • Hypertext links Image taken from “The Emerging Web of Linked Data”, Chris Bizer
Introduction > Semantic Web • Data on the web • HTML describes documents and the links between them • Semantic web: • Publish data in RDF, OWL, XML, .. • Describe arbitrary things: people, books, events, .. • Link between these concepts • Machine-readable, web-accessible databases
Introduction > Semantic Web > Linked Data • Tim Berners-Lee: LINKED DATA • Connected structured data • 3 simple principles: • Use URLs to name conceptual things • Looking up a URL returns useful data about that thing • Relationships link to other URLs
Introduction > Semantic Web > Linked Data > Example • Before: scientific data usually not shared • Pharmaceutical drug discovery • Data spread across many sources • DrugBank, ClinicalTrials.gov, Health Care and Life Science • Genomics data, protein data, .. • A question nobody examined before: “What proteins are involved in signal transduction AND are related to pyramidal neurons?” Example taken from “Tim Berners-Lee on the next Web”
Introduction > Semantic Web > Linked Data > Example • The web: 223,000 hits, 0 results Example taken from “Tim Berners-Lee on the next Web”
Introduction > Semantic Web > Linked Data > Example • Linked Data: 32 hits, 32 results
Gene, ID, Process
DRD1, 1812, adenylate cyclase activation
ADRB2, 154, adenylate cyclase activation
ADRB2, 154, arrestin mediated desensitization of G-protein coupled …
DRD1IP, 50632, dopamine receptor signaling pathway
DRD1, 1812, dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813, dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917, G-protein coupled receptor protein signaling pathway
GNG3, 2785, G-protein coupled receptor protein signaling pathway
GNG12, 55970, G-protein coupled receptor protein signaling pathway
DRD2, 1813, G-protein coupled receptor protein signaling pathway
ADRB2, 154, G-protein coupled receptor protein signaling pathway
CALM3, 808, G-protein coupled receptor protein signaling pathway
HTR2A, 3356, G-protein coupled receptor protein signaling pathway
DRD1, 1812, G-protein signaling, coupled to cyclic nucleotide second…
SSTR5, 6755, G-protein signaling, coupled to cyclic nucleotide second…
MTNR1A, 4543, G-protein signaling, coupled to cyclic nucleotide …
HTR6, 3362, G-protein signaling, coupled to cyclic nucleotide second …
GRIK2, 2898, glutamate signaling pathway
GRIN1, 2902, glutamate signaling pathway
GRIN2A, 2903, glutamate signaling pathway
GRIN2B, 2904, glutamate signaling pathway
ADAM10, 102, integrin-mediated signaling pathway
GRM7, 2917, negative regulation of adenylate cyclase activity
LRP1, 4035, negative regulation of Wnt receptor signaling pathway
ADAM10, 102, Notch receptor processing
ASCL1, 429, Notch signaling pathway
HTR2A, 3356, serotonin receptor signaling pathway
ADRB2, 154, transmembrane receptor protein tyrosine kinase …
PTPRG, 5793, transmembrane receptor protein tyrosine kinase …
EPHA4, 2043, transmembrane receptor protein tyrosine kinase …
NRTN, 4902, transmembrane receptor protein tyrosine kinase …
CTNND1, 1500, Wnt receptor signaling pathway
Example taken from “Tim Berners-Lee on the next Web”
Introduction > Semantic Web > Linked Data > Example
PREFIX go: <http://purl.org/obo/owl/GO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX mesh: <http://purl.org/commons/record/mesh/>
# The sc: and ro: prefixes used below are not declared on the slide.
SELECT ?genename ?processname
WHERE {
  graph <http://purl.org/commons/hcls/pubmesh> {
    ?paper ?p mesh:D017966 .
    ?article sc:identified_by_pmid ?paper .
    ?gene sc:describes_gene_or_gene_product_mentioned_by ?article .
  }
  graph <http://purl.org/commons/hcls/goa> {
    ?protein rdfs:subClassOf ?res .
    ?res owl:onProperty ro:has_function .
    ?res owl:someValuesFrom ?res2 .
    ?res2 owl:onProperty ro:realized_as .
    ?res2 owl:someValuesFrom ?process .
    graph <http://purl.org/commons/hcls/20070416/classrelations> {
      { ?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166 }
      union
      { ?process rdfs:subClassOf go:GO_0007166 }
    }
    ?protein rdfs:subClassOf ?parent .
    ?parent owl:equivalentClass ?res3 .
    ?res3 owl:hasValue ?gene .
  }
  graph <http://purl.org/commons/hcls/gene> { ?gene rdfs:label ?genename }
  graph <http://purl.org/commons/hcls/20070416> { ?process rdfs:label ?processname }
}
• mesh:D017966: related to pyramidal neurons • go:GO_0007166: part of signal transduction • Used 4 sources
Example taken from “Tim Berners-Lee on the next Web”
Introduction > Semantic Web > Linked Data • What do we need? • Identifiers: URIs • Linking mechanism: HTTP • Vocabulary: Web Ontology Language (OWL) • Serialization: RDF/XML
Introduction > Semantic Web > Linked Data • Identifiers: URIs • Use of HTTP URLs • Link to “Resources” • Possibly many documents per resource • Shift to non-information resources:
Resource: http://dbpedia.org/resource/London
HTML: http://dbpedia.org/page/London
RDF: http://dbpedia.org/data/London.rdf
N-Triples: http://dbpedia.org/data/London.ntriples
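The relation between one resource URI and its several documents can be sketched in a few lines of Python. This is only an illustration, assuming the `/resource/`, `/page/` and `/data/` path layout of the DBpedia URLs on the slide; the function name `document_urls` is hypothetical and no network request is made.

```python
def document_urls(resource_uri: str) -> dict:
    """Map a DBpedia-style resource URI to the documents that describe it."""
    base, sep, name = resource_uri.partition("/resource/")
    if not sep or not name:
        raise ValueError(f"not a /resource/ URI: {resource_uri}")
    return {
        "html": f"{base}/page/{name}",                 # human-readable page
        "rdf_xml": f"{base}/data/{name}.rdf",          # RDF/XML document
        "n_triples": f"{base}/data/{name}.ntriples",   # N-Triples document
    }


urls = document_urls("http://dbpedia.org/resource/London")
for kind, url in urls.items():
    print(kind, url)
```

In practice a Linked Data server usually picks the document through HTTP content negotiation rather than through a client-side mapping like this one.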
Introduction > Semantic Web > Linked Data • Linking mechanism: HTTP • Accessible through generic data browsers • Can be crawled by search engines • Connects different sources • In contrast, Web APIs each expose a different interface
Introduction > Semantic Web > Linked Data • Vocabulary: Web Ontology Language (OWL) • Knowledge representation language • Designed to be interpreted by computers • Describes data in terms of classes, individuals (instances of classes) and property assertions (relationships)
<owl:Class rdf:ID="Money">
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
</owl:Class>
<owl:DatatypeProperty rdf:ID="currency">
  <rdfs:domain rdf:resource="#Money"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
</owl:DatatypeProperty>
Introduction > Semantic Web > Linked Data • Vocabulary: Web Ontology Language (OWL) • Knowledge representation language • Designed to be interpreted by computers • Describes data in terms of classes, individuals (instances of classes) and property assertions (relationships) • Different URIs that identify the same thing are linked with ‘owl:sameAs’
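One way to picture what ‘owl:sameAs’ buys an integrator: URIs that are linked, directly or through a chain, can be treated as one entity. The sketch below groups such URIs with a small union-find structure. The `example.org` URIs and the function name `same_as_groups` are made up for illustration; a real OWL reasoner does considerably more than this.

```python
from collections import defaultdict


def same_as_groups(pairs):
    """Group URIs linked (directly or transitively) by owl:sameAs statements."""
    parent = {}

    def find(uri):
        # Return the representative URI of the group that `uri` belongs to.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps chains short
            uri = parent[uri]
        return uri

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for uri in parent:
        groups[find(uri)].add(uri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sameAs statements published by three different sources.
pairs = [
    ("http://example.org/a/London", "http://example.org/b/London"),
    ("http://example.org/b/London", "http://example.org/c/London"),
    ("http://example.org/a/Paris", "http://example.org/b/Paris"),
]
groups = same_as_groups(pairs)
for group in groups:
    print(group)
```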
RDF: Resource Description Framework • Based on triples • Subject, predicate, object • Resources identified by URI • URIs can be dereferenced to look up RDF information • RDF information links to other URIs
< http://dbpedia.org/resource/London,
  http://dbpedia.org/ontology/country,
  http://dbpedia.org/resource/United_Kingdom >
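The triple model is small enough to sketch directly. The following Python fragment stores triples as tuples and looks them up with wildcards; the London triple comes from the slide, while the Leuven triple, the `Triple` class and the `match` helper are assumptions added for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every given (non-None) component."""
    pattern = (subject, predicate, obj)
    return [
        t for t in triples
        if all(want is None or want == have for want, have in zip(pattern, t))
    ]


COUNTRY = "http://dbpedia.org/ontology/country"
triples = [
    Triple("http://dbpedia.org/resource/London", COUNTRY,
           "http://dbpedia.org/resource/United_Kingdom"),
    # Illustrative second triple, not taken from the slides.
    Triple("http://dbpedia.org/resource/Leuven", COUNTRY,
           "http://dbpedia.org/resource/Belgium"),
]

london = match(triples, subject="http://dbpedia.org/resource/London")
print(london[0].obj)
print(len(match(triples, predicate=COUNTRY)))
```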
RDF: Resource Description Framework • This looks a lot like XML. Why not just use XML?
RDF vs XML • RDF: one triple states the fact
<Page, author, Name>
• XML: many equally valid encodings of the same fact
<document href="Page">
  <author>Name</author>
</document>

<document>
  <details>
    <uri>Page</uri>
    <author>Name</author>
  </details>
</document>

<author>
  <uri>Page</uri>
  <name>Name</name>
</author>
...
RDF: Serialization • RDF/XML: standardized by W3C • N3 or Turtle: more human-readable
RDF/XML:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>
Turtle:
@prefix dc: <http://purl.org/dc/elements/1.1/>.
<http://en.wikipedia.org/wiki/Tony_Benn>
  dc:title "Tony Benn";
  dc:publisher "Wikipedia".
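The two serializations above carry the same two triples. As a rough illustration of that point, the Python sketch below writes those triples in a third syntax, N-Triples, one statement per line. The helper names are hypothetical, only plain string literals and URI objects are handled, and language tags, datatypes and blank nodes are left out.

```python
def escape_literal(text):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_literal) tuples."""
    lines = []
    for subject, predicate, obj, is_literal in triples:
        rendered = f'"{escape_literal(obj)}"' if is_literal else f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


DC = "http://purl.org/dc/elements/1.1/"
PAGE = "http://en.wikipedia.org/wiki/Tony_Benn"
output = to_ntriples([
    (PAGE, DC + "title", "Tony Benn", True),
    (PAGE, DC + "publisher", "Wikipedia", True),
])
print(output)
```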
Outline • Introduction to RDF • RDF Databases • Advantages for scientific R&D • In practice • Criticism
RDF Databases • Also called “Triple Store” • Data in the form of triples: Subject – predicate – object • Dominant query language: SPARQL
PREFIX abc: <nul://sparql/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
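To show what a triple store does with such a query, here is a minimal Python sketch that evaluates the same four triple patterns by joining variable bindings pattern by pattern. It is a naive nested-loop evaluation, not how a production SPARQL engine works, and the sample data (`ex:` identifiers, Nairobi, Brussels) is invented for the example.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None  # constant term does not match
    return extended


def query(triples, patterns):
    """Evaluate a list of triple patterns as a conjunctive query."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


ABC = "nul://sparql/exampleOntology#"
triples = [  # invented sample data
    ("ex:city1", ABC + "cityname", "Nairobi"),
    ("ex:city1", ABC + "isCapitalOf", "ex:kenya"),
    ("ex:kenya", ABC + "countryname", "Kenya"),
    ("ex:kenya", ABC + "isInContinent", ABC + "Africa"),
    ("ex:city2", ABC + "cityname", "Brussels"),
    ("ex:city2", ABC + "isCapitalOf", "ex:belgium"),
    ("ex:belgium", ABC + "countryname", "Belgium"),
    ("ex:belgium", ABC + "isInContinent", ABC + "Europe"),
]
patterns = [
    ("?x", ABC + "cityname", "?capital"),
    ("?x", ABC + "isCapitalOf", "?y"),
    ("?y", ABC + "countryname", "?country"),
    ("?y", ABC + "isInContinent", ABC + "Africa"),
]
rows = [(b["?capital"], b["?country"]) for b in query(triples, patterns)]
print(rows)
```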
RDF Databases • Built on W3C’s “Linked Data” • Subset of “Graph databases” • Nodes (entities), edges (relationships), properties • Directed, labeled graph structure (predicate URI as label)
Graph View Image taken from w3.org
RDF Databases • Only standardised NoSQL database • In contrast to a normal RDBMS: • Very flexible data model • Does not require a fixed table schema • Information stored as its most basic building blocks • Can improve data-intensive operations • Examples: eBay, Facebook, Digg, ..
RDF Databases • Scalable: distributed design • Self-documenting data • Vocabulary defined in OWL or RDFS • Allows multiple schemata • Open • Discover new data sources at run-time • Often weak consistency guarantees • Solved with additional middleware
RDF Databases Limitations of Relational Databases: • Not directly visible to web agents • Primary-foreign key relationships • Meaning is implicit, unspecified semantics • No relationships across separate databases • Parent-child relationships are not natural • “Self-joins” for each level in hierarchy
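The self-join point can be made concrete. In a relational table, each extra level of a hierarchy needs one more self-join; over a graph of triples, the same question is a single traversal that follows one predicate until no new nodes appear. The Python sketch below illustrates this, with an invented `ex:` class hierarchy and a hypothetical `ancestors` helper.

```python
def ancestors(triples, node, predicate):
    """Return every node reachable from `node` by following `predicate` edges."""
    parents = {}
    for subject, pred, obj in triples:
        if pred == predicate:
            parents.setdefault(subject, []).append(obj)

    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
hierarchy = [  # invented three-level hierarchy
    ("ex:DopamineReceptorSignaling", SUBCLASS, "ex:GPCRSignaling"),
    ("ex:GPCRSignaling", SUBCLASS, "ex:SignalTransduction"),
    ("ex:SignalTransduction", SUBCLASS, "ex:BiologicalProcess"),
]
result = ancestors(hierarchy, "ex:DopamineReceptorSignaling", SUBCLASS)
print(sorted(result))
```

The depth of the hierarchy does not change the code, which is the contrast the slide draws with one self-join per level.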
Outline • Introduction to RDF • RDF Databases • Advantages for scientific R&D • In practice • Criticism
Advantages for Scientific R&D • Studies continue to show that research in all fields is increasingly collaborative • Example: genomic research • Complex data distributed over many datasets • Entrez Gene (EG), Gene Ontology (GO), Swiss-Prot, GenBank, ..
Advantages for Scientific R&D • Problem = lack of well-defined standards • Integration nightmare: • data scattered, different formats, lacking information • synonyms, ambiguity • Changing models: • maintenance not feasible • Understanding and reasoning • need for connecting ontologies • Challenge: syntactic and semantic heterogeneity
Integration of Databases > Challenges • Localization of resources • Identify relevant web resources • Data formats • Resources are represented in HTML, TXT, images, .. • Synonyms • Researchers can name the same data differently
Integration of Databases > Challenges • Ambiguity • E.g. “insulin” can represent a drug, protein, gene, .. • Relations • One-to-one / One-to-many between identifiers • Granularity • Can cause missing data, ..
Integration of Databases > Approaches • Data Warehouse Approach • Translate data into one local database • Eliminates unavailability & slow response • Allows data processing and optimization • Maintenance problem • evolution of content and structure • Examples: BioWarehouse, Biozon, DataFoundry
Integration of Databases > Approaches • Federated Database Approach • Translate queries for individual sources • Easier to maintain (e.g. adding a new source) • Poor performance • Examples: BioKleisli, DiscoveryLink, QIS
Integration of Databases > Approaches • Semantic Web Approach • No need to map data models • Rely on standardized ontologies • Less work, better performance • But only if sources comply
Outline • Introduction to RDF • RDF Databases • Advantages for scientific R&D • In practice • Criticism
In Practice • Scientists need: • Access to data • Ability to utilize data • Handle uncertainty
In Practice • Linked Open Data: • “We all need the same databases, for different decisions or applications” • Complements data in internal/licensed sources • Stimulates cross scientific sharing
Examples • Biological data: Human Genome Project • Increase in web-accessible databases • GenBank, Gene Ontology, UniProt, PhenoDB, .. • Integration is key problem • Increase in RDF availability
Examples • YeastHub • Registration of web-accessible databases • Metadata according to Dublin Core standards, using RSS 1.0 to describe an ontology • Data Conversion • XML or RDB to RDF conversion • (e.g. unique ID = RDF ID, remaining columns become properties) • Data Integration • Ad hoc RDF queries • Form-based queries (supervised)
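The conversion rule on this slide (unique ID becomes the RDF ID, the remaining columns become properties) can be sketched as follows. This is not YeastHub's actual code: the function name, the `example.org` namespaces and the sample row are all assumptions made for illustration.

```python
def rows_to_triples(rows, id_column, resource_base, property_base):
    """Turn relational rows into triples: the ID column names the subject,
    every other non-empty column becomes one property of that subject."""
    triples = []
    for row in rows:
        subject = f"{resource_base}{row[id_column]}"
        for column, value in row.items():
            if column == id_column or value is None:
                continue  # the ID is the subject itself; NULLs yield no triple
            triples.append((subject, f"{property_base}{column}", str(value)))
    return triples


# Hypothetical table of yeast genes, one dict per row.
rows = [
    {"orf": "YAL001C", "gene_name": "TFC3", "chromosome": 1, "alias": None},
]
converted = rows_to_triples(
    rows,
    id_column="orf",
    resource_base="http://example.org/yeast/gene/",
    property_base="http://example.org/yeast/property/",
)
for triple in converted:
    print(triple)
```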
Outline • Introduction to RDF • RDF Databases • Advantages for scientific R&D • In practice • Criticism
Criticism • Feasibility • Human behavior and personal preferences • ‘Database hugging’ • Organizations tend to keep data for themselves • Censorship and privacy
Criticism • Is published data reusable in research? • Requires: • Provenance information • Quality • Attribution • Consistency • ... • Out-of-context data fails to respect scientific research methodology
References • Bringing Web 2.0 to bioinformatics (2008), Zhang Zhang, Kei-Hoi Cheung and Jeffrey P. Townsend • Semantic web approach to database integration in life sciences (2006), Kei-Hoi Cheung, Andrew K. Smith, Kevin Y.L. Yip, Christopher J.O. Baker and Mark B. Gerstein • Integrating large biomedical knowledge resources with RDF (2007), Satya S. Sahoo, Olivier Bodenreider, Kelly Zeng and Amit Sheth • RDF/RDFS-based Relational Database Integration (2006), Huajun Chen, Zhaohui Wu, Heng Wang and Yuxin Mao
Discussion • Has anyone ever worked with linked (RDF) data before? What are your experiences? • Will the semantic web grow to become the Giant Global Graph? • Why haven’t RDF databases taken off like Relational Databases?