From Data Integration To Semantic Mediation: Addressing Heterogeneities in Data. Bertram Ludäscher, LUDAESCH@SDSC.EDU, Knowledge-Based Information Systems Lab, San Diego Supercomputer Center and Department of Computer Science & Engineering, University of California, San Diego.
Outline • Information Integration from a Database Perspective • XML-Based Data Integration • Model-Based / Semantic Mediation • Discussion
An Online Shopper’s Information Integration Problem
• El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
• Sites in the diagram: addall.com, barnes&noble.com, A1books.com, amazon.com, half.com
• “One-World” scenario: XML-based mediator (virtual DB), as opposed to a data warehouse
A Home Buyer’s Information Integration Problem
• Question: Which houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
• Sources in the diagram: Crime Stats, Demographics, Realtor, School Rankings
• “Multiple-Worlds” scenario: XML-based mediator
A Neuroscientist’s Information Integration Problem
• Question: What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
• Sources in the diagram: sequence info (CaPROT), protein localization (NCMIR), morphometry (SYNAPSE), neurotransmission (SENSELAB)
• “Complex Multiple-Worlds” scenario: model-based mediator
A Geoscientist’s Information Integration Problem
• Question: What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
• Sources in the diagram: GeoPhysical (gravity contours), Geologic Map (Virginia), GeoChronologic (Concordia), Foliation Map (structure DB), GeoChemical
• “Complex Multiple-Worlds” scenario: model-based mediator
Information Integration Challenges: Heterogeneities = S4 ...
• System Aspects
  • platforms, devices, distribution, APIs, protocols, …
• Syntaxes
  • heterogeneous data formats (one for each tool ...)
• Structures
  • heterogeneous schemas (one for each DB ...)
  • heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
• Semantics
  • unclear & “hidden” semantics: e.g., incoherent terminology, multiple / informal taxonomies, implicit assumptions, ...
Information Integration Challenges
• Layers (as drawn): Semantics, Structure, Syntax, System aspects
• Goals
  • reconciling S4 heterogeneities
  • “gluing” together multiple data sources
  • bridging information and knowledge gaps computationally
• System aspects: “Grid” middleware
  • distributed data & computing
  • Web services, WSDL/SOAP, …
  • sources = functions, files, databases, …
• Syntax & Structure: (XML-Based) Mediators
  • wrapping, restructuring
  • (XML) queries and views
  • sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
  • conceptual models and declarative views
  • Semantic Web: ontologies, description logics, RDF(S), DAML+OIL, OWL, ...
  • sources = knowledge bases (DB+CMs+ICs)
Information Integration from a DB Perspective
• Information Integration Problem
  • Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  • Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
  • Si has a schema (relational, XML, OO, ...)
  • Si can be queried
  • define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
  • questions become queries Qi against V(S1, ..., Sk)
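A minimal sketch of the database perspective, assuming two hypothetical wrapped book sources with different schemas (the source contents, field names, and the `Offer` type are illustrative assumptions, not part of the talk). The integrated view V is virtual: it is computed lazily from the sources each time a query runs against it.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Offer:
    """Global (integrated) schema; an assumption made for this sketch."""
    store: str
    title: str
    price: float
    shipping: float


def s1_rows() -> Iterator[dict]:
    # Hypothetical source S1: prices in dollars.
    yield {"name": "Tractatus", "usd": 12.0, "ship": 4.0}
    yield {"name": "Other Book", "usd": 5.0, "ship": 4.0}


def s2_rows() -> Iterator[dict]:
    # Hypothetical source S2: prices in cents, different field names.
    yield {"title": "Tractatus", "price_cents": 1100, "shipping_cents": 600}


def integrated_view() -> Iterator[Offer]:
    """Virtual view V(S1, S2): maps each source schema to the global one."""
    for r in s1_rows():
        yield Offer("S1", r["name"], r["usd"], r["ship"])
    for r in s2_rows():
        yield Offer("S2", r["title"], r["price_cents"] / 100, r["shipping_cents"] / 100)


def cheapest(title: str) -> Offer:
    """A user question expressed as a query against V, not against the sources."""
    offers = [o for o in integrated_view() if o.title == title]
    return min(offers, key=lambda o: o.price + o.shipping)


best = cheapest("Tractatus")
print(best.store, best.price + best.shipping)
```

A materialized variant would store the output of `integrated_view()` once (the data warehouse approach) instead of recomputing it per query.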
Outline • Information Integration from a Database Perspective • XML-Based Data Integration • Model-Based / Semantic Mediation • Discussion
Extensible Markup Language (XML)
• (meta)language for marking up text & data with user-definable tags
  • (X)HTML, XSLT, XML Schema, ...
  • MathML, BioML, GeoML, NeuroML, ...
  • XML-RPC, SOAP, WSDL, OWL, ...
• semistructured tree data model
  • flexible: marked-up text, web-pages, databases, ...
• container model:
  • “boxes within boxes”
• Example, from plain text to structured data:
  • plain text: ... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...
  • title marked up: ... in their wonderful book called <title>SemWeb Tractat</title> by B. Schatz and T.B. Lee, the authors show how ...
  • title and authors marked up: ... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T.B. Lee</author>, the authors show how ...
  • as a record: <book> <title>SemWeb Tractat</title> <author>B. Schatz</author> <author>T.B. Lee</author> </book>
  • as a tree: book with children title (“SemWeb Tractat”), author (“B. Schatz”), author (“T.B. Lee”)
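The tree data model can be illustrated with the slide's own `<book>` record. This is a small sketch using Python's standard `xml.etree.ElementTree`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The <book> record from the slide.
doc = """<book>
  <title>SemWeb Tractat</title>
  <author>B. Schatz</author>
  <author>T.B. Lee</author>
</book>"""

book = ET.fromstring(doc)                           # root of the labeled ordered tree
title = book.findtext("title")                      # text of the first <title> child
authors = [a.text for a in book.findall("author")]  # all <author> children, in document order

print(book.tag, title, authors)
```

The parser only recovers tags and nesting; what "author" means is not part of the XML itself.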
XML-Based Mediator Architecture
• USER/Client: poses query Q(G(S1, ..., Sk)) against the integrated global XML view G
• MEDIATOR: holds the integrated view definition, G(..) defined over S1(..) … Sk(..), and exchanges XML queries & results with the sources
• Sources S1, S2, ..., Sk: each is exported through a wrapper as an XML view
Some Challenges in XML-Based Integration ...
• XML Query/Transformation Languages
  • DB community: QLs for semistructured data, e.g., TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
  • CSE/SDSC: XMAS [SSD99, SIGMOD99, WebDB99, EDBT00]
  • W3C: XPath, XSLT, XQuery (Working Draft, June 2001)
• XML Schema Languages
  • DTDs, RELAX NG, XML Schema, ... [XMLDM02]
• DB Theoreticians:
  • Expressiveness/Complexity Trade-Off
  • querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ..., all
  • reasoning: query satisfiability, containment, equivalence
  • ...
XMAS: XML Matching And Structuring language [QL98, SIGMOD99] [EDBT00]
• Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”
• Diagram labels: XMAS, XMAS Algebra
• XMAS query:
    CONSTRUCT <books>
                <book> $a1 $t
                  <pubs> $p {$p} </pubs>
                </book> {$a1, $t}
              </books>
    WHERE <books.book> $a1 : <author /> $t : <title /> </> IN "amazon.com"
      AND <authors.author> $a2 : <author /> <pubs> $p : <pub/> </> </> IN "www...DBLP… "
      AND value($a1) = value($a2)
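To make the join-and-group semantics of the view definition concrete, here is a rough Python paraphrase over in-memory records. It is not XMAS and not how the XMAS algebra evaluates the query; the sample data and the function name `integrated_books` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical, already-wrapped source data.
amazon_books = [
    {"author": "A. Author", "title": "Book One"},
    {"author": "B. Writer", "title": "Book Two"},
]
dblp_authors = [
    {"author": "A. Author", "pubs": ["Paper X", "Paper Y"]},
    {"author": "C. Nobody", "pubs": ["Paper Z"]},
]


def integrated_books(books, authors):
    """Join on author value, then group publications by (author, title)."""
    grouped = defaultdict(list)
    for b in books:
        for a in authors:
            if b["author"] == a["author"]:  # value($a1) = value($a2)
                grouped[(b["author"], b["title"])].extend(a["pubs"])
    # One <book> per (author, title) group, with its collected <pubs>.
    return [
        {"author": author, "title": title, "pubs": pubs}
        for (author, title), pubs in grouped.items()
    ]


print(integrated_books(amazon_books, dblp_authors))
```

Authors that appear in only one source ("B. Writer", "C. Nobody") drop out, as expected for an inner join.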
XML (XMAS) Query Processing
• Inputs: XML query Q and XML global view definition G(S)
• Translator: produces algebraic plans
• Composition: Q(G), the composed plan
• Compile-time Rewriter/Optimizer: Q’(S), the optimized plan
• Run-time query evaluation: plan execution
… New Challenges in (XML-Based) Mediation
• Global-As-View (GAV)
  • user query Q is posed over global relations G: Q(G)
  • global relations G are defined as views over source relations S: G(S)
  • challenge: compute answers Q(G(V(S))) without computing all of V and G
  • query rewriting (with limited source capabilities): Q’(S) = Q(G)
• Local-As-View (LAV)
  • user query Q is posed over global relations G: Q(G)
  • source relations S are defined as views over global relations G: S(G)
  • challenge: “reverse/rewrite rules” from S(G) to some G’(S)
  • answering queries using views: equivalent rewritings may not exist
  • find maximally contained ones: Q’(G’(S)) ⊆ Q(G)
• Inter(CS)disciplinary research needed: DB, FP, LP
  • GAV/LAV view (un)folding; Clark’s completion, resolution, factoring
Querying XML Streams: A New Frontier
• New applications for stream-based XML processing:
  • Continuous, real-time data streams (wireless sensor networks, …)
  • Data / message transformation in Web services (SOAP, RMI, processing …)
  • Extract-transform-load applications (Tera/Peta-byte archival migration, …)
• … leading to a new XML querying & transformation paradigm:
  • how to execute (some) XML queries & transformations on very large (infinite) data streams using only limited memory
  • XML stream machine (XSM): extended XML transducers with buffers
• Figure labels: XQuery, XSM network
• XSMs clearly outperform tree-based approaches on streamable queries (100x over Xalan) [A Transducer-Based XML Query Processor, Ludäscher, Mukhopadhyay, Papakonstantinou, VLDB’02]
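The limited-memory idea can be hinted at with standard-library event parsing. This sketch uses `xml.etree.ElementTree.iterparse`, which is not an XSM or a transducer network; the element names (`books`, `book`, `title`) and the function name are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_titles(source):
    """Yield the <title> of each <book> as soon as the book has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            yield elem.findtext("title")
            # Drop the subtree that was just processed. The root still keeps
            # the emptied <book> element, so memory stays small, not constant.
            elem.clear()


# Usage on a small in-memory stream; a real stream would be a file or socket.
data = io.BytesIO(
    b"<books><book><title>A</title></book><book><title>B</title></book></books>"
)
for t in stream_titles(data):
    print(t)
```

Only queries whose answers can be emitted from a bounded buffer stream this way; a query that needs the last element before producing the first result does not.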
Outline • Information Integration from a Database Perspective • XML-Based Data Integration • Model-Based / Semantic Mediation • Discussion
A Neuroscientist’s Information Integration Problem
• Question: What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
• Sources in the diagram: sequence info (CaPROT), protein localization (NCMIR), morphometry (SYNAPSE), neurotransmission (SENSELAB)
• “Complex Multiple-Worlds” Mediation
A Geoscientist’s Information Integration Problem
• Question: What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
• Sources in the diagram: GeoPhysical (gravity contours), Geologic Map (Virginia), GeoChronologic (Concordia), Foliation Map (structure DB), GeoChemical
• “Complex Multiple-Worlds” Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  • ... for labeled ordered trees
  • ... all semantics lies outside of XML
  • XML DTDs => tags + nesting
  • XML Schema => DTDs + data modeling
  • need anything else? => write comments!
• Domain Semantics is Complex:
  • implicit assumptions, hidden semantics
  • sources seem unrelated to the non-expert
• Need Structure and Semantics beyond trees!
  • employ richer OO models
  • make domain semantics and “glue knowledge” explicit
  • use ontologies to fix terminology and conceptualization
  • avoid ambiguities by using KR and formal semantics
Information Integration Landscape (figure)
• Vertical axis: conceptual complexity/depth (low to high)
• Horizontal axis: conceptual distance (one-world to multiple-worlds)
• Regions: DB mediation techniques, KR formalisms, Model-Based Mediation
• Plotted examples: GO, EcoCyc, Bioinformatics, Ontologies, RiboWeb, UMLS, Geo-, Ecoinformatics, Cyc, Tambis, WordNet, Entrez, MIA, BLAST, home-buyer, 24x7 consumer, addall, book-buyer
XML-Based vs. Model-Based Mediation
• XML-based: Integrated-DTD, defined via XML-QL(Src1-DTD, ...)
  • structural constraints (DTDs): parent, child, sibling, ..., e.g. A = (B*|C),D and B = ...
  • no domain constraints
  • XML elements, over XML models of the raw data
• Model-based: Integrated-CM, defined via CM-QL(Src1-CM, ...)
  • classes, relations, is-a, has-a, ... (C1, C2, C3 linked by R)
  • logical domain constraints (IF ... THEN ... rules)
  • (XML) objects, over conceptual models of the raw data
  • “glue maps” = domain & process maps (ontologies)
• CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
• CM-QL ~ {F-Logic, DAML+OIL, …}
What’s the Glue? What’s in a Link? (between sources X and Y)
• Syntactic Joins
  • (X,Y) := X.SSN = Y.SSN (equality)
  • (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins
  • (X,Y,Score) := BLAST(X,Y,Score) (similarity)
• Semantic/Rule-Based Joins
  • (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
  • (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based), e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• CS Challenge:
  • compile semantic joins into efficient syntactic ones
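The semantic join rule (X isa C, Y isa C, similarity above a threshold) can be sketched as follows. The `ISA` hierarchy, the record layout, and the `score` callable standing in for BLAST are all assumptions; the sketch assumes single inheritance so that the least upper bound (lub) is unique.

```python
# Hypothetical single-inheritance isa hierarchy: concept -> direct superconcept.
ISA = {
    "purkinje_cell": "neuron",
    "pyramidal_cell": "neuron",
    "neuron": "cell",
    "astrocyte": "glia",
    "glia": "cell",
}


def ancestors(concept):
    """The concept followed by its superconcepts, most specific first."""
    chain = [concept]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain


def lub(x, y):
    """Least upper bound: the most specific concept C with x isa C and y isa C."""
    above_y = set(ancestors(y))
    for c in ancestors(x):
        if c in above_y:
            return c
    return None


def semantic_join(xs, ys, score, threshold=0.8):
    """(X, Y, C) for pairs sharing a concept C whose similarity exceeds the threshold."""
    out = []
    for x in xs:
        for y in ys:
            c = lub(x["concept"], y["concept"])
            if c is not None and score(x, y) > threshold:
                out.append((x["id"], y["id"], c))
    return out


# Toy similarity standing in for BLAST: identical sequences score 0.9.
toy_score = lambda x, y: 0.9 if x["seq"] == y["seq"] else 0.1
left = [{"id": "x1", "concept": "purkinje_cell", "seq": "ACGT"}]
right = [{"id": "y1", "concept": "astrocyte", "seq": "ACGT"},
         {"id": "y2", "concept": "pyramidal_cell", "seq": "TTTT"}]
print(semantic_join(left, right, toy_score))
```

The nested loop is the naive form; the challenge named on the slide is to replace it with an indexable syntactic join, for example by precomputing each record's ancestor set as a join key.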
Semantic Mediation Methodology @ SOURCES
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  • complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  • explicit representation of (“hidden”) source semantics
  • logic rules over OM(S)
• Contextualization CON(S):
  • situate OM(S) data using “glue maps” (ontologies):
  • domain maps DMs = terminological knowledge: concepts + roles
  • process maps PMs = “procedural knowledge”: states + transitions
Semantic Mediation Methodology @ MEDIATOR
• Integrated View Definition (IVD)
  • declarative (logic) rules with object-oriented features
  • defined over CM(S), domain maps, process maps
  • needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  • mediator composes the user query Q with the IVD
  • ... rewrites (Q o IVD), sends subqueries to sources
  • ... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
• USER/Client: works against the integrated view (CM); CM queries & results are exchanged in XML
• Mediator engine: Integrated View Definition (IVD) over CM S1, CM S2, CM S3; “glue” maps (GMs): domain maps (DMs) and process maps (PMs); GCM
• Processors: FL rule proc., LP rule proc., graph proc.; XSB engine
• Sources S1, S2, S3: each exported through a CM-wrapper (over an XML-wrapper) as CM(S) = OM(S)+KB(S)+CON(S), with semantic context CON(S)
• First results & demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] [BNCOD02] [ER02] [EDBT02] [BioInf02]
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
• Domain expert knowledge:
  • Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines.
  • Dendritic spines are ion (calcium) regulating components.
  • Spines have ion binding proteins.
  • Neurotransmission involves ionic activity (release).
  • Ion-binding proteins control ion activity (propagation) in a cell.
  • Ion-regulating components of cells affect ionic activity (release).
• Domain Map (DM) = labeled graph with
  • concepts ("classes") and
  • roles ("associations")
  • additional semantics: expressed as logic rules (F-logic)
• Figure panels: Domain Expert Knowledge, Domain Map (DM), DM in Description Logic
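A domain map read as a labeled graph can be sketched as a set of (concept, role, concept) edges. The edges below paraphrase some of the expert statements on the slide; the identifiers, the role names `has`, `contains`, `isa`, and the helper `parts_of` are assumptions, and the transitive rule stands in for the kind of logic rule the talk expresses in F-logic.

```python
# (concept, role, concept) edges; names are illustrative, not the KIND vocabulary.
DOMAIN_MAP = {
    ("purkinje_cell", "has", "dendrite"),
    ("pyramidal_cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "isa", "ion_regulating_component"),
    ("spine", "has", "ion_binding_protein"),
}

# Roles treated as part-of-like for the derived rule below (an assumption).
PART_ROLES = {"has", "contains"}


def parts_of(concept):
    """All concepts reachable from `concept` via has/contains edges (transitive)."""
    seen, frontier = set(), [concept]
    while frontier:
        current = frontier.pop()
        for source, role, target in DOMAIN_MAP:
            if source == current and role in PART_ROLES and target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen


print(sorted(parts_of("purkinje_cell")))
```

With this rule a source that reports data about spines can be related to a query about Purkinje cells even though neither source mentions the other's terms.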
Source Contextualization & DM Refinement
• In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
• sources can register new concepts at the mediator ...
Mediator View Definition
• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)
• View definition:
    DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
    WHERE
      I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
          anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}] ,   % from PROLAB
      NAE:neuro_anatomic_entity[name->Anatom;   % from ANATOM
          located_in->>{Brain_region}],
      AS..segments..features[name->Feature_name; value->Value].
• Contextualization CON(Result) wrt. ANATOM.
• Query results in context: Query Processing Demo
Example: Inside Query Evaluation
• Query: “How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?”
• push selection @SENSELAB: X1 := select targets of “output from parallel fiber”;
• determine source context @MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;
• compute region of interest (here: downward closure) @MEDIATOR: X3 := subregion-closure(X2);
• push selection @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
• compute protein distribution @MEDIATOR: X5 := compute aggregate(X4);
• display in context @MEDIATOR/GUI: display X5 in context (ANATOM)
• => DEMONSTRATION
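The evaluation plan X1 to X5 can be traced with stubbed sources. Everything concrete here is an assumption made for illustration: the `SUBREGIONS` fragment standing in for the ANATOM domain map, the rows standing in for NCMIR data, the function names, and the use of a sum as the aggregate.

```python
from collections import Counter

# Hypothetical ANATOM fragment: region -> direct subregions.
SUBREGIONS = {
    "cerebellum": ["cerebellar_cortex"],
    "cerebellar_cortex": ["molecular_layer", "purkinje_layer"],
}

# Hypothetical NCMIR rows: (region, protein, amount).
NCMIR_ROWS = [
    ("molecular_layer", "ryanodine_receptor", 3.0),
    ("purkinje_layer", "ryanodine_receptor", 5.0),
    ("purkinje_layer", "calbindin", 2.0),
    ("hippocampus", "ryanodine_receptor", 9.0),
]


def senselab_targets(signal):
    """X1: selection pushed to the (stubbed) SENSELAB source."""
    return ["cerebellar_cortex"] if signal == "output from parallel fiber" else []


def situate(regions):
    """X2: keep only the regions that the domain map knows about."""
    known = set(SUBREGIONS) | {s for subs in SUBREGIONS.values() for s in subs}
    return [r for r in regions if r in known]


def subregion_closure(regions):
    """X3: downward closure over the subregion edges."""
    closed, stack = set(), list(regions)
    while stack:
        region = stack.pop()
        if region not in closed:
            closed.add(region)
            stack.extend(SUBREGIONS.get(region, []))
    return closed


def ncmir_select(regions, protein):
    """X4: selection pushed to the (stubbed) NCMIR source."""
    return [(r, v) for r, p, v in NCMIR_ROWS if r in regions and p == protein]


def aggregate(rows):
    """X5: total amount per region."""
    totals = Counter()
    for region, value in rows:
        totals[region] += value
    return dict(totals)


def answer(signal, protein):
    x3 = subregion_closure(situate(senselab_targets(signal)))
    return aggregate(ncmir_select(x3, protein))


print(answer("output from parallel fiber", "ryanodine_receptor"))
```

The closure step is what connects the two sources: SENSELAB names a coarse region, NCMIR records data at finer regions, and only the domain map relates them.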
Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
  • GAV & LAV with semantic query optimization (NIH BIRN, NSF GEON)
  • description logic reasoner for DMs (FaCT)?
  • reconciliation of conflicting DMs via argumentation frameworks (“games”) using well-founded and stable models of logic programs [ICDT97, PODS97, TCS00, TODS02]
• Modeling “Process Knowledge” => Process Maps
  • formal semantics? (dynamic/temporal/Kripke models/Petri nets?)
  • executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
  • expressible in F-logic [InfSystems98]
  • scalability? (UMLS Domain Map has millions of entries)
• How to incorporate “procedural features”?
  • Bioinformatics, Ecoinformatics, … => sources = DBs + analytical tools + …
  • scientific workflow planning and management (“promoter identification workflow” for DOE SciDAC, NSF/ITR SEEK)
Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges: (shown in figure)
• related formalisms
A Scientific Workflow: Promoter Identification (Matthew Coleman, LLNL, 2002)
• Figure (workflow diagram); recoverable labels:
  • inputs: gi#’s from clusfavor; Genomic gi#, cDNA gi#
  • tools: blast (human and other species), GRAIL, TRANSFAC, CLUSTAL
  • intermediate data: Chr #, gene location, gene name; GC island location, exon/intron location, repeats location, promoter location; TAF’s, location on genomic gi#’s, probabilities of match, probabilities of random match; validates polII promoter location
  • data consolidation: consensus sequences, promoter location, shared TAF’s across cluster, common consensus sequence
• Questions: Are chr#’s in common? Are chr#’s locations in common? Are there conserved upstream sequences? Are gene locations conserved across species?
• Questions: RNA POLII promoter? GpC Island present? Are there common TAF’s across genomic gi#?
• Questions: Are there other common genes?
SDM Demo & Architecture
• Translation approach: Abstract Workflow (AWF) => Executable Workflow (EWF)
Summary: Mediation Scenarios & Techniques
• Federated Databases: One-World; glue: Common Schema; SQL, rules; Schema Transformations; Syntactic Joins; DB expert
• XML-Based Mediation: One-/Multiple-Worlds; glue: Mediated Schema; XML query languages; Syntax-Aware Mappings; Syntactic Joins; DB expert
• Model-Based Mediation: Complex Multiple-Worlds; glue: Common Glue Maps; DOOD query languages; Semantics-Aware Mappings; “Semantic” Joins via Glue Maps; KRDB + domain experts
Outline • Information Integration from a Database Perspective • XML-Based Data Integration • Model-Based / Semantic Mediation • Discussion
Thank you! Questions? Queries?
Some References
• Model-Based Mediation:
  • A Model-Based Mediator System for Scientific Data Management, B. Ludäscher, A. Gupta, M. Martone, Bioinformatics: Managing Scientific Data, Lacroix, Critchlow (eds), Morgan Kaufmann, to appear, 2003.
  • Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE’01), Heidelberg, Germany, IEEE Computer Society, 2001.
  • Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
• XML-Based Mediation:
  • VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT’00), Konstanz, Germany, LNCS 1777, Springer, 2000.
  • XML Streams: A Transducer-Based XML Query Processor, B. Ludäscher, P. Mukhopadhyay, Y. Papakonstantinou, Intl. Conference on Very Large Databases (VLDB’02), Hong Kong, 2002.
Knowledge Representation: Relating Theory to the World via Formal Models
• “All models are wrong, but some are useful!”
• Reference: John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations