EDBT Panel, March 2002, Prague:
Scientific Data Integration for Complex Multiple-Worlds Scenarios: Databases Meets Knowledge Representation
Bertram Ludäscher
Data and Knowledge System
San Diego Supercomputer Center
U.C. San Diego
EDBT'02, Prague
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
[Diagram: "?" posed to Information Integration over the sources Crime Stats, Demographics, Realtor, School Rankings]
“Simple Multiple-Worlds” Mediation Problem => XML-Based Mediator
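The home buyer's question can be read as a join across the four sources. The following is only a minimal sketch of that reading: all rows, the ZIP-code join key, the field names, and the diversity threshold are assumptions made up for illustration, and a real mediator would query remote sources rather than in-memory tables.

```python
from statistics import mean

# Hypothetical rows standing in for the four sources named on the slide.
realtor = [
    {"id": "h1", "price": 450_000, "baths": 2, "beds": 3, "zip": "92037"},
    {"id": "h2", "price": 480_000, "baths": 1, "beds": 2, "zip": "92037"},
    {"id": "h3", "price": 390_000, "baths": 2, "beds": 2, "zip": "92101"},
    {"id": "h4", "price": 620_000, "baths": 3, "beds": 4, "zip": "92122"},
]
crime_rate = {"92037": 1.2, "92101": 4.8, "92122": 2.0}            # Crime Stats
school_percentile = {"92037": 0.85, "92101": 0.70, "92122": 0.40}  # School Rankings
diversity_index = {"92037": 0.62, "92101": 0.71, "92122": 0.30}    # Demographics


def candidate_houses(max_price=500_000, min_diversity=0.5):
    """Join the realtor listing with the three neighborhood sources by ZIP code."""
    avg_crime = mean(crime_rate.values())
    result = []
    for house in realtor:
        z = house["zip"]
        if (house["price"] < max_price
                and house["baths"] >= 2
                and house["beds"] >= 2
                and school_percentile[z] >= 2 / 3      # upper third
                and crime_rate[z] < avg_crime          # below-average crime
                and diversity_index[z] >= min_diversity):
            result.append(house["id"])
    return result


print(candidate_houses())  # ['h1']
```

Every condition here is a plain comparison on a shared key, which is what makes this a "simple" multiple-worlds problem: no domain knowledge is needed to link the sources.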
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Diagram: "?" posed to Information Integration over the sources sequence info (CaPROT), protein localization (NCMIR), morphometry (SYNAPSE), neurotransmission (SENSELAB)]
“Complex Multiple-Worlds” Mediation Problem => Model-Based Mediator
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
[Diagram: "?" posed to Information Integration over the sources GeoPhysical (gravity contours), Geologic Map (Virginia), GeoChronologic (Concordia), Foliation Map (structure DB), GeoChemical]
“Complex Multiple-Worlds” Mediation
Scientific Data Integration Challenges: Heterogeneities in the 4S’s ...
• System Aspects
  • platforms, devices, physical distribution, transport protocols, access APIs, impedance mismatch, user interfaces, application integration ...
• Syntaxes
  • heterogeneous data formats (one for each tool ...)
• Structures
  • heterogeneous schemas (one for each DB ...)
  • heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs)
• Semantics
  • unclear semantics: e.g., incoherent terminology, multiple taxonomies, ...
Data Integration: Approaches / Solutions
[Diagram labels: Semantics, Structure, Syntax, System aspects]
• (Data-)Grid / Middleware
  • system: distributed data & computing (SDSC SRB, Globus, web services, WSDL)
  • source = file or DB
• XML-Based Mediators
  • structure: XML queries and views
  • source = XML-DB
• Model-Based/Semantic Mediators
  • semantics: conceptual models and declarative views
  • source = Knowledge Base (DB+CMs+ICs)
• Semantic Web Formalisms
  • semantics: ontologies, description logics (RDF(S), DAML+OIL, ...)
• Knowledge/Semantic-Grid
  • combination
What’s in a Link?
[Diagram: a link between X and Y]
• Syntactic Joins
  • (X,Y) := X.SSN = Y.SSN   (equality)
  • (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins
  • (X,Y,Score) := BLAST(X,Y,Score)   (similarity)
• Semantic/Rule-Based Joins
  • (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8   (homology, lub)
  • (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.   (rule-based)
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• Challenge:
  • compile semantic joins into efficient syntactic ones
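The three kinds of join can be sketched side by side. This is an illustrative sketch only: `similarity` is a toy stand-in for BLAST (it is not BLAST), the sequences, records, and the class name `calcium_sensor` are invented, and the semantic join checks for a shared direct class rather than computing a least upper bound in a full isa hierarchy.

```python
from itertools import product

# Hypothetical records; only the SSN field name comes from the slide.
people = [{"name": "a", "SSN": "111"}, {"name": "b", "SSN": "222"}]
patients = [{"case": 7, "SSN": "222"}, {"case": 8, "SSN": "333"}]


def syntactic_join(xs, ys, key):
    """(X,Y) := X.key = Y.key"""
    return [(x, y) for x, y in product(xs, ys) if x[key] == y[key]]


def similarity(a, b):
    """Toy stand-in for BLAST: fraction of positions that match."""
    if not a or not b:
        return 0.0
    matches = sum(1 for p, q in zip(a, b) if p == q)
    return matches / max(len(a), len(b))


# Made-up sequences and a made-up one-level isa relation.
SEQ = {"ratNCS1": "MGKSNSKLKP", "humanNCS1": "MGKSNSKLKA", "ratActin": "MDDDIAALVV"}
ISA = {"ratNCS1": "calcium_sensor", "humanNCS1": "calcium_sensor",
       "ratActin": "cytoskeletal"}


def semantic_join(xs, ys, threshold=0.8):
    """(X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S > threshold"""
    return [(x, y, ISA[x]) for x, y in product(xs, ys)
            if ISA[x] == ISA[y] and similarity(SEQ[x], SEQ[y]) > threshold]


# Rule-based join: X produces B, B increased_in Y.
produces = [("secretase", "beta amyloid")]
increased_in = [("beta amyloid", "Alzheimer's disease")]


def rule_join():
    return [(x, y, ["produces", b, "increased_in"])
            for (x, b) in produces
            for (b2, y) in increased_in if b == b2]


print(semantic_join(["ratNCS1", "ratActin"], ["humanNCS1"]))
print(rule_join())
```

Note that the semantic join still ends in an equality test (`ISA[x] == ISA[y]`) once the class of each object is known, which hints at how a semantic join might be compiled into a syntactic one.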
XML-Based vs. Model-Based Mediation
• XML-Based Mediation
  • Integrated-DTD := XQuery(Src1-DTD,...)
  • No Domain Constraints
  • Structural Constraints (DTDs), Parent, Child, Sibling, ...
    e.g., A = (B*|C),D   B = ...
  • XML Elements, XML Models
  • Raw Data
• Model-Based Mediation
  • Integrated-CM := CM-QL(Src1-CM,...)
  • “Glue” Maps: Domain Maps, Process Maps
  • Logical Domain Constraints (IF ... THEN ... rules)
  • Classes, Relations, Ontologies: is-a, has-a, ... (classes C1, C2, C3; relation R)
  • (XML) Objects, Conceptual Models
  • Raw Data
• CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
• CM-QL ~ {F-Logic, DAML+OIL, …}
NCMIR ANATOM Domain Map:
• concepts
• relations
• logic rules
Semantics-Aware Browsing and Querying
[Diagram labels: Source 1, Source 2, Source 3; edges labeled "has a"; concepts: Cerebellum, Cerebellar Cortex, Granule Cell Layer, Purkinje Cell layer, Molecular Layer, Purkinje Neuron, Purkinje Cell Dendrite, Dendritic spines, Dendritic shaft, Endoplasmic reticulum]
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Expert Knowledge:
  Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figures: Domain Map (DM); DM in Description Logic]
• Domain Map
  • = labeled graph with
    • concepts ("classes") and
    • roles ("associations")
  • additional semantics: expressed as logic rules (F-logic)
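A domain map of this kind can be sketched as a set of labeled edges plus a rule. The edges below are one possible reading of the expert sentences on the slide; the identifier names and the single inheritance rule are assumptions, and Python stands in for the F-logic rules the slide mentions.

```python
# Edges (concept, role, concept), read off the expert sentences on the slide.
DOMAIN_MAP = {
    ("purkinje_cell", "has", "dendrite"),
    ("pyramidal_cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "isa", "ion_regulating_component"),
    ("spine", "has", "ion_binding_protein"),
    ("ion_binding_protein", "controls", "ion_activity"),
    ("ion_regulating_component", "affects", "ion_activity"),
    ("neurotransmission", "involves", "ion_activity"),
}


def saturate(edges):
    """Apply one rule to a fixpoint:  X isa C, C role Y  =>  X role Y."""
    facts = set(edges)
    while True:
        derived = {
            (x, role, y)
            for (x, r1, c) in facts if r1 == "isa"
            for (c2, role, y) in facts if c2 == c and role != "isa"
        }
        if derived <= facts:      # nothing new: fixpoint reached
            return facts
        facts |= derived


facts = saturate(DOMAIN_MAP)
print(("spine", "affects", "ion_activity") in facts)  # True
```

The derived fact (spines affect ion activity) is not stated in any single expert sentence; it follows from combining an isa edge with a role edge, which is the kind of additional semantics the rules are meant to capture.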
Source Registration/Data Contextualization
• sources can register new concepts at the mediator ...
• a source registers data with an existing ontology, using description logics
• it may also refine the mediator’s domain map ... [ICDE01]
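Registration can be pictured as a small mediator-side registry. The sketch below is hypothetical throughout: the class `DomainMap`, its method names, and the concept and source names are invented for illustration, and it records only explicit isa links rather than the description-logic axioms the slide refers to.

```python
class DomainMap:
    """Hypothetical mediator-side registry of concepts and registered source data."""

    def __init__(self):
        self.parents = {}     # concept -> set of direct super-concepts
        self.instances = {}   # concept -> list of (source, object_id)

    def register_concept(self, concept, isa=()):
        """Add a concept; a source may refine the map by naming existing parents."""
        for parent in isa:
            if parent not in self.parents:
                raise KeyError(f"unknown concept: {parent}")
        self.parents.setdefault(concept, set()).update(isa)

    def register_data(self, source, object_id, concept):
        """Anchor one source object at a known concept."""
        if concept not in self.parents:
            raise KeyError(f"unknown concept: {concept}")
        self.instances.setdefault(concept, []).append((source, object_id))

    def subsumed_by(self, concept, ancestor):
        if concept == ancestor:
            return True
        return any(self.subsumed_by(p, ancestor) for p in self.parents[concept])

    def instances_of(self, concept):
        """All registered objects whose concept is subsumed by the given one."""
        return sorted(item
                      for c, items in self.instances.items()
                      if self.subsumed_by(c, concept)
                      for item in items)


dm = DomainMap()
dm.register_concept("neuron")
dm.register_concept("purkinje_cell", isa=["neuron"])
# A source refines the map with its own concept, then registers data under it.
dm.register_concept("src1_cell", isa=["purkinje_cell"])
dm.register_data("SRC1", "obj42", "src1_cell")
print(dm.instances_of("neuron"))  # [('SRC1', 'obj42')]
```

Once the source's concept is placed in the hierarchy, a query phrased in the mediator's vocabulary ("neuron") reaches data that the source described in its own terms.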
Source Registration: Semantic Annotations
Multiple Ways of Querying Data
• Ontologies
  • Brain has a Cerebellum has a Purkinje Cell Layer has a Purkinje cell
  • Purkinje cell is a neuron
• Spatial Representation (Atlases)
• Transformations
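Querying through the ontology amounts to following the has-a chain and filtering by is-a. A minimal sketch, assuming the two relations are stored as plain dictionaries (the identifier names are illustrative):

```python
# The has-a chain and the is-a link shown on the slide.
HAS_A = {
    "brain": ["cerebellum"],
    "cerebellum": ["purkinje_cell_layer"],
    "purkinje_cell_layer": ["purkinje_cell"],
}
IS_A = {"purkinje_cell": "neuron"}


def parts(concept):
    """All concepts reachable from `concept` via one or more has-a edges."""
    found = []
    stack = list(HAS_A.get(concept, []))
    while stack:
        part = stack.pop()
        if part not in found:
            found.append(part)
            stack.extend(HAS_A.get(part, []))
    return found


def parts_of_kind(concept, kind):
    """Parts of `concept` that are declared to be of the given kind."""
    return [p for p in parts(concept) if IS_A.get(p) == kind]


print(parts("brain"))                    # ['cerebellum', 'purkinje_cell_layer', 'purkinje_cell']
print(parts_of_kind("brain", "neuron"))  # ['purkinje_cell']
```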
Model-Based Mediator Architecture
[Architecture diagram labels, roughly top to bottom:]
• USER/Client
• CM (Integrated View); CM Queries & Results (exchanged in XML)
• Mediator Engine: FL rule proc., LP rule proc., Graph proc., XSB Engine, GCM
• Integrated View Definition (IVD)
• “Glue” Maps (GMs): Domain Maps (DMs), Process Maps (PMs); semantic context CON(S)
• CM S1, CM S2, CM S3, each behind a CM-Wrapper (XML-Wrapper)
• Sources S1, S2, S3
• Conceptual Model =
  • Object Model
  • Knowledge Base
  • Contextualization
First Results & Demos: [SSDBM’00] [VLDB’00] [ICDE’01] [HBP’01] [EDBT’02] [BNCOD’02]
Model-Based Mediation Methodology ...
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
  • Object Model OM(S):
    • complex objects (frames), class hierarchy, OO constraints
  • Knowledge Base KB(S):
    • explicit representation of (“hidden”) source semantics
    • logic rules over OM(S)
  • Contextualization CON(S):
    • situate OM(S) data using “glue maps” (GMs):
      • domain maps DMs (ontology) = terminological knowledge: concepts + roles
      • process maps PMs = “procedural knowledge”: states + transitions
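The decomposition CM(S) = OM(S) + KB(S) + CON(S) can be sketched as three small data structures. All class names, field names, and the example rule below are assumptions for illustration; rules are plain Python functions instead of logic rules, and process maps are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ObjectModel:                      # OM(S)
    classes: dict[str, Optional[str]]   # class name -> superclass (or None)
    objects: list[dict]                 # frames; each has a "class" entry


@dataclass
class KnowledgeBase:                    # KB(S)
    # Each rule maps an object to the attributes it derives for that object.
    rules: list[Callable[[dict], dict]] = field(default_factory=list)


@dataclass
class Contextualization:                # CON(S)
    anchors: dict[str, str] = field(default_factory=dict)  # class -> domain-map concept


@dataclass
class ConceptualModel:                  # CM(S) = OM(S) + KB(S) + CON(S)
    om: ObjectModel
    kb: KnowledgeBase
    con: Contextualization

    def export(self):
        """Objects with KB-derived attributes and their domain-map concept attached."""
        lifted_objects = []
        for obj in self.om.objects:
            lifted = dict(obj)
            for rule in self.kb.rules:
                lifted.update(rule(lifted))
            lifted["concept"] = self.con.anchors.get(obj["class"])
            lifted_objects.append(lifted)
        return lifted_objects


def spine_density(obj):
    """Example of making "hidden" source semantics explicit (hypothetical rule)."""
    if "spine_count" in obj and "length_um" in obj:
        return {"spine_density": obj["spine_count"] / obj["length_um"]}
    return {}


cm = ConceptualModel(
    om=ObjectModel(classes={"DendriteSegment": None},
                   objects=[{"class": "DendriteSegment", "length_um": 10.0, "spine_count": 25}]),
    kb=KnowledgeBase(rules=[spine_density]),
    con=Contextualization(anchors={"DendriteSegment": "dendrite"}),
)
print(cm.export())
```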
... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
  • declarative (logic) rules with object-oriented features
  • defined over CM(S), domain maps, process maps
  • needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  • mediator composes the user query Q with the IVD ...
  • rewrites (Q o IVD), sends subqueries to sources ...
  • post-processes returned results (e.g., situate in context)
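The runtime steps (compose Q with the IVD, send subqueries, post-process) can be sketched as follows. This is a strong simplification under stated assumptions: the IVD is reduced to per-source attribute renamings, whereas the slides describe declarative logic rules; the sources, rows, and the context table are invented.

```python
# Hypothetical sources with different local schemas.
SOURCES = {
    "S1": [{"prot": "NCS-1", "loc": "purkinje_cell"}],
    "S2": [{"protein_name": "NCS-1", "cell_type": "granule_cell"},
           {"protein_name": "calbindin", "cell_type": "purkinje_cell"}],
}

# IVD reduced to attribute renamings: view attribute -> source column.
IVD = {
    "protein_distribution": {
        "S1": {"protein": "prot", "location": "loc"},
        "S2": {"protein": "protein_name", "location": "cell_type"},
    }
}

# Fragment of a domain map used to situate results in context.
CONTEXT = {"purkinje_cell": "cerebellum", "granule_cell": "cerebellum"}


def rewrite(view, conditions):
    """Q o IVD: unfold a query on the view into one subquery per source."""
    return {src: {mapping[attr]: value for attr, value in conditions.items()}
            for src, mapping in IVD[view].items()}


def run_subquery(source, subquery):
    """Evaluate equality conditions against one source."""
    return [row for row in SOURCES[source]
            if all(row[col] == value for col, value in subquery.items())]


def answer(view, conditions):
    results = []
    for src, subquery in rewrite(view, conditions).items():
        mapping = IVD[view][src]
        for row in run_subquery(src, subquery):
            unified = {attr: row[col] for attr, col in mapping.items()}
            unified["source"] = src
            unified["context"] = CONTEXT.get(unified["location"])  # post-processing
            results.append(unified)
    return results


for row in answer("protein_distribution", {"protein": "NCS-1"}):
    print(row)
```

The user only sees the view attributes `protein` and `location`; which source column each one maps to is decided by the view definition, not by the query.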
Mediation Scenarios & Techniques
• Federated Databases
  • One-World
  • Common Schema
  • SQL, rules
  • Schema Transformations
  • Syntactic Joins
  • DB expert
• XML-Based Mediation
  • One-/Multiple-Worlds
  • Mediated Schema
  • XML query languages
  • Syntax-Aware Mappings
  • Syntactic Joins
  • DB expert
• Model-Based Mediation
  • Complex Multiple-Worlds
  • Common Glue Maps
  • DOOD query languages
  • Semantics-Aware Mappings
  • “Semantic” Joins via Glue Maps
  • KRDB + domain experts
Some Observations
• Scientific Data Integration is different
  • e.g., complex and hidden semantics, ...
• Co-Education (CS=>DS, DS=>CS) takes time
  • NIH BioInformatics Research Network (BIRN) – Neuroscientists
  • DOE Scientific Data Management Center (SDM)
  • Starting with Ecologists, Geoscientists, ...
• A good thing about standards:
  • There are so many to choose from:
    • SQL, http, HTML, XML, XQuery, XSLT, XML Schema, RDF(S), DAML+OIL, DAML-S, UMLS, GO, XMI, SOAP, WSDL, ...
• Syntax is overrated (and its impact underestimated?)
  • nobody likes LISP any more, but everybody likes XML ...
• 2nd Marriage of Knowledge Representation & Databases:
  • Semantic Web
  • (child from 1st marriage: Deductive Databases; aren’t they cute siblings? ;)
=> model-based/semantic mediators
The Road Ahead: Scientific Data Integration with the Semantic Web !?
[Diagram labels: Ivory Tower, Logic, inference, description logics, subsumption, ontologies, rules, DOOD, RDB, ORDB, XMLDB, Scientific Data, Integrated Data Views, XML, XQuery, RDF, DAML, OIL, DAML-S, WSDL, SOAP, Internet2, Data-Grid]
Some Related References: Mediation of Neuroscience Data
• Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, April 2001.
• Navigating Virtual Information Sources with Know-ME, X. Qian, B. Ludäscher, M. E. Martone, A. Gupta, demonstration track, Intl. Conference on Extending Database Technology (EDBT), Prague, Czech Republic, March 2002.
• Model-Based Information Integration in a Neuroscience Mediator System, B. Ludäscher, A. Gupta, M. E. Martone, demonstration track, 26th Intl. Conference on Very Large Databases (VLDB), Cairo, Egypt, September 2000.
• Knowledge-Based Integration of Neuroscience Data Sources, A. Gupta, B. Ludäscher, M. E. Martone, 12th Intl. Conference on Scientific and Statistical Database Management (SSDBM), Berlin, Germany, IEEE Computer Society, July 2000.
• A Cell-Centered Database for Electron Tomographic Data, M. E. Martone, A. Gupta, M. Wong, X. Qian, G. Sosinsky, S. Lamont, B. Ludäscher, and M. H. Ellisman, Journal of Structural Biology, 2002, to appear.