From Database Federation to Model-Based Mediation: Databases Meets* Knowledge Representation (* or rather rediscovers). Bertram Ludäscher, LUDAESCH@SDSC.EDU, Data and Knowledge Systems, San Diego Supercomputer Center, U.C. San Diego.
Outline
• Information Integration from a database perspective
  • examples, mediator approach, some technical challenges
• Part I: XML-Based Mediation
  • based on querying semistructured data & XML
  • navigation-driven query evaluation
  • ongoing/future research: querying XML streams
• Part II: Model-Based Mediation
  • basic ideas & architecture, lifting data to knowledge sources
  • “glue maps” (domain maps, process maps)
  • ongoing/future research: mix of DB & KR techniques
• Summary
An Online Shopper’s Information Integration Problem
• El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
• Sources to integrate: addall.com, public library, WWW, barnes&noble.com, A1books.com, amazon.com, half.com
• “One-World” Mediation
A Home Buyer’s Information Integration Problem
• “What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?”
• Sources to integrate: Crime Stats, Demographics, Realtor, School Rankings
• “Multiple-Worlds” Mediation
Information Integration from a DB Perspective
• Information Integration Challenge
  • Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and user questions Q_1, ..., Q_n that can be answered using the S_i
  • Find: the answers to Q_1, ..., Q_n
• The Database Perspective: source = “database”
  • S_i has a schema (relational, XML, OO, ...)
  • S_i can be queried
  • define virtual (or materialized) integrated views V over S_1, ..., S_k using database query languages
  • questions become queries Q_i against V(S_1, ..., S_k)
• Why a Database Perspective?
  • scalability, efficiency, reusability (declarative queries), ...
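The database perspective above can be illustrated with a minimal Python sketch. It assumes the simplest possible integrated view, a union of all source rows; the source functions `shop_a` and `shop_b` and their rows are made-up toy data, not part of the talk.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Source = Callable[[], Iterable[Row]]  # assumption: a source is queried by calling it


def integrated_view(sources: List[Source]) -> Iterator[Row]:
    """Virtual view V(S_1,...,S_k); here simply the union of all source rows."""
    for source in sources:
        yield from source()


def answer(query: Callable[[Row], bool], sources: List[Source]) -> List[Row]:
    """A user question becomes a query Q evaluated against V(S_1,...,S_k)."""
    return [row for row in integrated_view(sources) if query(row)]


# Hypothetical toy sources (invented titles and prices).
def shop_a() -> List[Row]:
    return [{"title": "Tractatus", "price": 12.0, "shop": "A"}]


def shop_b() -> List[Row]:
    return [
        {"title": "Tractatus", "price": 9.5, "shop": "B"},
        {"title": "Investigations", "price": 15.0, "shop": "B"},
    ]


offers = answer(lambda r: r["title"] == "Tractatus", [shop_a, shop_b])
cheapest = min(offers, key=lambda r: r["price"])
```

Because `integrated_view` is a generator, the view is virtual: nothing is stored at the mediator, and rows are pulled from the sources only when a query runs.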
Abstract XML-Based Mediator Architecture
[Figure: the USER/Client poses a query Q o V(S_1,...,S_k) against the Integrated XML View V. The MEDIATOR holds the Integrated View Definition IVD(S_1,...,S_k) and exchanges XML queries & results with one wrapper per source; each wrapper exports an XML view of its source S_1, S_2, ..., S_k.]
A Concrete (Future) XML-Based Mediator System
[Figure: the USER/Client sends XQuery to the MEDIATOR Engine and receives XML (Integrated View). The mediator contains the Integrated View Definition IVD and an XQuery Processor, and exchanges XML queries & results with XML-Wrappers over sources S1, S2, S3. Interfaces labeled in the figure: XQuery, XSQL, XSLT, XPath, http-get, SQL, XScan.]
• First Results & Demos: XMAS language and algebra, VXD evaluation, BBQ UI [WebDB99] [SSD99] [SIGMOD99] [EDBT00] (w/ Papakonstantinou, Vianu, ...)
Some Technical Challenges ...
• XML Query Languages
  • DB community: QLs for semistructured data, e.g., TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
  • CSE/SDSC: XMAS [SSD99, WebDB99, EDBT00]
  • W3C: XPath, XSLT, XQuery (Working Draft, June 2001)
• DB Theory: Expressiveness/Complexity Trade-Off
  • querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ..., all
  • reasoning: query satisfiability, containment, equivalence
... Some More Technical Challenges ...
• DB Practice: Query Composition
  • compute Q o V(S_1,...,S_k) w/o computing all of V
  • “push Q through V into S_i”
  • in Datalog: view unfolding (resolution, unification) + simplification ~ top-down evaluation ~ magic sets
  • in XML: some solutions (Papakonstantinou, ...)
• Navigation-Driven Evaluation of Integrated View V:
  • V materialized => warehousing approach
  • V virtual => mediator approach
  • V virtual & driven by user-navigation => VXD approach [EDBT00] (w/ Papakonstantinou, Velikhov)
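As a rough illustration of “push Q through V into S_i”, the following Python sketch compares computing all of V and then filtering with pushing a selection into the sources. It assumes V is a plain union, for which a selection can safely be pushed down; the `Source` class, its `shipped` counter, and the rows are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Pred = Callable[[Row], bool]


class Source:
    """Toy source that counts how many rows it ships to the mediator."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = list(rows)
        self.shipped = 0

    def scan(self, pred: Optional[Pred] = None) -> List[Row]:
        result = [r for r in self.rows if pred is None or pred(r)]
        self.shipped += len(result)
        return result


def view(sources: List[Source], pred: Optional[Pred] = None) -> List[Row]:
    """V = union of the sources; an optional selection is applied at each source."""
    rows: List[Row] = []
    for s in sources:
        rows.extend(s.scan(pred))
    return rows


def naive(q: Pred, sources: List[Source]) -> List[Row]:
    """Compute all of V at the mediator, then apply Q."""
    return [r for r in view(sources) if q(r)]


def pushed(q: Pred, sources: List[Source]) -> List[Row]:
    """Push the selection Q through V into each S_i."""
    return view(sources, q)


def make_sources() -> List[Source]:
    return [
        Source([{"title": "Tractatus", "price": 12.0},
                {"title": "Investigations", "price": 15.0}]),
        Source([{"title": "Tractatus", "price": 9.5}]),
    ]


def is_tractatus(row: Row) -> bool:
    return row["title"] == "Tractatus"


a, b = make_sources(), make_sources()
same_answers = naive(is_tractatus, a) == pushed(is_tractatus, b)
shipped_naive = sum(s.shipped for s in a)
shipped_pushed = sum(s.shipped for s in b)
```

Both strategies return the same answers, but the pushed variant ships fewer rows. Views with joins, grouping, or restructuring need the more careful rewriting that the slide alludes to.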
XMAS: XML Matching And Structuring language
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”

    CONSTRUCT <books>
                <book> $a1 $t
                  <pubs> $p {$p} </pubs>
                </book> {$a1, $t}
              </books>
    WHERE <books.book> $a1 : <author /> $t : <title /> </> IN "amazon.com"
      AND <authors.author> $a2 : <author /> <pubs> $p : <pub/> </> </> IN "www...DBLP… "
      AND value($a1) = value($a2)

[Figure: the XMAS view definition is translated into the XMAS Algebra.]
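To make the intent of this view definition concrete, here is a minimal Python sketch of the same join-and-group logic over in-memory records. It is not XMAS and does not reproduce XMAS semantics in detail; the record layout (`author`, `title`, `pubs` keys) and the sample data are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def books_with_pubs(amazon_books: List[dict], dblp_authors: List[dict]) -> List[dict]:
    """Join books and DBLP entries on author, group publications by (author, title)."""
    # Index the DBLP side by author value (the join condition value($a1) = value($a2)).
    pubs_by_author: Dict[str, List[str]] = defaultdict(list)
    for entry in dblp_authors:
        pubs_by_author[entry["author"]].extend(entry["pubs"])

    # Group the joined publications per (author, title), as in {$a1, $t}.
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for book in amazon_books:
        author = book["author"]
        if author in pubs_by_author:
            grouped[(author, book["title"])].extend(pubs_by_author[author])

    return [{"author": a, "title": t, "pubs": pubs} for (a, t), pubs in grouped.items()]


# Hypothetical sample data.
amazon = [{"author": "A", "title": "T1"}, {"author": "B", "title": "T2"}]
dblp = [{"author": "A", "pubs": ["p1", "p2"]}]
result = books_with_pubs(amazon, dblp)
```

Author "B" has no DBLP entry and therefore drops out of the result, which mirrors the inner-join reading of the WHERE clause.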
XML (XMAS) Query Processing
• Inputs: XMAS Query Q, XMAS View Definition V
• Compile-time: Translator => algebraic plans => Composition (Q o V) => composed plan => Rewriter/Optimizer => optimized plan
• Run-time: Plan Execution with lazy VXD evaluation
Navigation-Driven Evaluation: Lazy Mediators
[Figure: the Lazy Mediator takes as input the client navigations into the result of the view definition ans = V(S_1 … S_k) and produces as output source navigations into the XML sources S_1, ..., S_k.]
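The lazy-mediator idea can be sketched with Python generators: each client navigation step (asking for the next result) triggers only as many source navigations as are needed. This is only an analogy under simplifying assumptions, not the VXD algorithm; the source names, items, and the trivial view are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def source(name: str, items: Iterable[str], log: List[str]) -> Iterator[str]:
    """Toy XML source: every item handed out is logged as one source navigation."""
    for item in items:
        log.append(name)
        yield item


def lazy_view(s1: Iterator[str], s2: Iterator[str]) -> Iterator[str]:
    """Virtual view: results of S_1 followed by results of S_2, lightly restructured."""
    for x in s1:
        yield x.upper()
    for y in s2:
        yield y.upper()


navigations: List[str] = []
view = lazy_view(
    source("S_1", ["a", "b", "c"], navigations),
    source("S_2", ["d", "e"], navigations),
)

# The client navigates to the first two results only.
first_two = list(islice(view, 2))
```

After the two client steps, only two navigations into S_1 have happened and S_2 has not been touched at all, which is the behavior a warehousing approach cannot offer.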
Open Issue: Querying XML Streams
• Given:
  • stream S of XML events (open, close, data)
  • XML query Q over S
  • constraints: 1-pass “on-the-fly” processing, bounded memory
• Find:
  • decide whether, and if so how, Q can be evaluated given the constraints
• Initial Approach:
  • transducer model XSM (XML Stream Machine) to approximate “streamable” queries (w/ Papakonstantinou, Mukhopadhyay, Vianu)
Example: XML Stream Query
• XML query (r) = for each customer $C, list all orders $O
• Query-aware DTD design is even more important for stream queries!
Example: XML Stream Machine (XSM)
• transducer model
• input/output: stream of XML events
• memory: finite state control, buffers
• transitions: on EVENT do ACTION
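A minimal Python sketch of a one-pass, bounded-memory transducer in the spirit of “on EVENT do ACTION” follows. It is not the XSM model itself; the event encoding as `(kind, value)` pairs, the element names `customer`, `name`, and `order`, and the assumption that each customer's name precedes its orders are all illustrative choices.

```python
from typing import Iterable, Iterator, Optional, Tuple

Event = Tuple[str, str]  # ("open", tag) | ("close", tag) | ("data", text)


def list_orders(events: Iterable[Event]) -> Iterator[Tuple[Optional[str], str]]:
    """For each customer, emit (customer name, order) pairs in a single pass."""
    state = "start"              # finite state control
    name: Optional[str] = None   # one bounded buffer: the current customer's name
    for kind, value in events:
        if kind == "open" and value == "customer":
            state, name = "customer", None
        elif kind == "open" and value in ("name", "order") and state == "customer":
            state = value
        elif kind == "data" and state == "name":
            name = value
        elif kind == "data" and state == "order":
            yield (name, value)  # output is produced on the fly
        elif kind == "close" and value in ("name", "order") and state == value:
            state = "customer"
        elif kind == "close" and value == "customer":
            state = "start"


# Hypothetical event stream for two customers.
stream = [
    ("open", "customer"), ("open", "name"), ("data", "Ann"), ("close", "name"),
    ("open", "order"), ("data", "o1"), ("close", "order"),
    ("open", "order"), ("data", "o2"), ("close", "order"),
    ("close", "customer"),
    ("open", "customer"), ("open", "name"), ("data", "Bob"), ("close", "name"),
    ("open", "order"), ("data", "o3"), ("close", "order"),
    ("close", "customer"),
]
pairs = list(list_orders(stream))
```

The sketch works with a single buffer only because the name arrives before the orders. If the document layout put orders first, all of a customer's orders would have to be buffered, which illustrates why query-aware DTD design matters for stream queries.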
A Geoscientist’s Information Integration Problem
• “What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?”
• Sources to integrate: GeoPhysical (gravity contours), Geologic Map (Virginia), GeoChronologic (Concordia), Foliation Map (structure DB), GeoChemical
• “Complex Multiple-Worlds” Mediation
A Neuroscientist’s Information Integration Problem
• “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
• Sources to integrate: sequence info (CaPROT), protein localization (NCMIR), morphometry (SYNAPSE), neurotransmission (SENSELAB)
• “Complex Multiple-Worlds” Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  • canonical syntax for labeled ordered trees
  • a metalanguage, but all semantics lies outside of XML
  • DTDs => tags + nesting, XML Schema => DTDs + data modeling
  • need anything else? => write comments!
• Domain Semantics is complex:
  • implicit assumptions, hidden semantics
  • sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
  • employ richer OO models
  • make domain semantics and “glue knowledge” explicit
  • use ontologies to fix terminology and conceptualization
  • avoid ambiguities by using formal semantics
Information Integration Landscape
[Figure: a two-axis chart. Vertical axis: conceptual complexity/depth (low to high). Horizontal axis: conceptual distance (one-world to multiple-worlds). Regions and items plotted: Model-Based Mediation, KR formalisms, DB mediation techniques; GO, EcoCyc, Bioinformatics, Ontologies, RiboWeb, UMLS, Geoinformatics, Cyc, Tambis, WordNet, Entrez, MIA, BLAST, home-buyer, 24x7 consumer, addall, book-buyer.]
XML-Based vs. Model-Based Mediation
• XML-Based Mediation
  • Integrated-DTD := XML-QL(Src1-DTD,...)
  • No Domain Constraints
  • Structural Constraints (DTDs): Parent, Child, Sibling, ..., e.g., A = (B*|C),D B = ...
  • Raw Data => XML Elements => XML Models
• Model-Based Mediation
  • Integrated-CM := CM-QL(Src1-CM,...)
  • Glue Maps: DMs, PMs
  • Logical Domain Constraints (IF ... THEN ...)
  • Classes, Relations, is-a, has-a, ... (e.g., classes C1, C2, C3 linked by a relation R)
  • Raw Data => (XML) Objects => Conceptual Models
• CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
• CM-QL ~ {F-Logic, DAML+OIL, …}
What’s the Glue? What’s in a Link? (between X and Y)
• Syntactic Joins (equality)
  • (X,Y) := X.SSN = Y.SSN
  • (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
  • (X,Y,Score) := BLAST(X,Y,Score)
• Semantic/Rule-Based Joins
  • (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
  • (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based), e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• YAC (Yet Another Challenge):
  • compile semantic joins into efficient syntactic ones
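The difference between a syntactic and a semantic join can be sketched in a few lines of Python. The `similarity` function stands in for a BLAST-style scorer and is a hypothetical stub, the `isa` table is made-up, and the sketch returns every shared concept rather than computing a least upper bound.

```python
from typing import Callable, Dict, List, Set, Tuple


def syntactic_join(xs: List[dict], ys: List[dict], key_x: str, key_y: str) -> List[Tuple[dict, dict]]:
    """(X,Y) := X.key_x = Y.key_y, i.e., plain equality on identifiers."""
    return [(x, y) for x in xs for y in ys if x[key_x] == y[key_y]]


def semantic_join(
    xs: List[str],
    ys: List[str],
    isa: Dict[str, Set[str]],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,
) -> List[Tuple[str, str, str]]:
    """(X,Y,C) := X isa C, Y isa C, similarity(X,Y) > threshold."""
    out = []
    for x in xs:
        for y in ys:
            shared = isa.get(x, set()) & isa.get(y, set())
            if shared and similarity(x, y) > threshold:
                out.extend((x, y, c) for c in sorted(shared))
    return out


# Hypothetical toy data.
people = [{"SSN": "1", "name": "n1"}]
records = [{"SSN": "1", "item": "i1"}, {"SSN": "2", "item": "i2"}]
equal_pairs = syntactic_join(people, records, "SSN", "SSN")

isa = {"x1": {"protein"}, "y1": {"protein"}, "y2": {"gene"}}
score = lambda x, y: 0.9  # stub scorer; a real system would call a similarity service
links = semantic_join(["x1"], ["y1", "y2"], isa, score)
```

The nested loops make the cost of the semantic join obvious, which is one way to read the challenge on the slide: precompute or index the concept and similarity information so that the join reduces to cheap equality lookups.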
Model-Based Mediation Methodology ...
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  • complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  • explicit representation of (“hidden”) source semantics
  • logic rules over OM(S)
• Contextualization CON(S):
  • situate OM(S) data using “glue maps” (GMs):
  • domain maps DMs (ontology) = terminological knowledge: concepts + roles
  • process maps PMs = “procedural knowledge”: states + transitions
... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
  • declarative (logic) rules with object-oriented features
  • defined over CM(S), domain maps, process maps
  • needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  • mediator composes the user query Q with the IVD
  • ... rewrites (Q o IVD), sends subqueries to sources
  • ... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
[Figure: the USER/Client queries the CM (Integrated View). The Mediator Engine (FL rule proc., LP rule proc., Graph proc., XSB Engine) uses the Integrated View Definition IVD and the “Glue” Maps GMs (Domain Maps DMs, Process Maps PMs) over the GCM. CM Queries & Results are exchanged in XML with CM-Wrappers (XML-Wrappers) over sources S1, S2, S3, each exporting CM(S) = OM(S)+KB(S)+CON(S), where CON(S) is the semantic context.]
• First results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] (w/ Gupta, Martone)
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
• Domain Expert Knowledge:
  • Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines.
  • Dendritic spines are ion (calcium) regulating components.
  • Spines have ion binding proteins.
  • Neurotransmission involves ionic activity (release).
  • Ion-binding proteins control ion activity (propagation) in a cell.
  • Ion-regulating components of cells affect ionic activity (release).
• Domain Map = labeled graph with concepts ("classes") and roles ("associations")
  • additional semantics: expressed as logic rules (F-logic)
[Figure: Domain Expert Knowledge => Domain Map (DM) => DM in Description Logic.]
Source Contextualization & DM Refinement
• In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
• sources can register new concepts at the mediator ...
Compilation: Domain Maps => F-Logic Rules
• Domain Maps ~ Ontologies
• DMs have a formal semantics via a translation to F-Logic (~ Datalog + OO features)
• => Declarative + “Executable” Specification
  • query evaluation with deductive rules
  • reasoning over decidable fragments:
  • checking concept subsumption, equivalence
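As a very reduced illustration of an “executable” domain map, the Python sketch below treats isa edges as facts and checks subsumption by reachability, i.e., the reflexive-transitive closure of isa. Real description-logic subsumption involves much more than graph reachability; the concept names and the `isa_edges` table are toy assumptions.

```python
from typing import Dict, List


def subsumes(isa_edges: Dict[str, List[str]], general: str, specific: str) -> bool:
    """True if `specific` isa* `general` (reflexive-transitive closure of isa)."""
    seen, stack = set(), [specific]
    while stack:
        concept = stack.pop()
        if concept == general:
            return True
        if concept in seen:
            continue
        seen.add(concept)
        stack.extend(isa_edges.get(concept, ()))
    return False


def equivalent(isa_edges: Dict[str, List[str]], c1: str, c2: str) -> bool:
    """Two concepts are equivalent if each subsumes the other."""
    return subsumes(isa_edges, c1, c2) and subsumes(isa_edges, c2, c1)


# Hypothetical fragment of a domain map: concept -> direct superconcepts.
isa_edges = {
    "purkinje_cell": ["neuron"],
    "pyramidal_cell": ["neuron"],
    "neuron": ["cell"],
}
```

For example, `subsumes(isa_edges, "cell", "purkinje_cell")` holds through the intermediate concept `neuron`, while the reverse direction does not.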
Query Processing “Demo”
Integrated View Definition:

    DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
    IF I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
           anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}] ,   % from PROLAB
       NAE:neuro_anatomic_entity[name->Anatom;                                     % from ANATOM
           located_in->>{Brain_region}],
       AS..segments..features[name->Feature_name; value->Value].

• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)
• Contextualization CON(Result) wrt. ANATOM.
• Query results in context
Example: Inside Query Evaluation
"How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?”
• push selection @SENSELAB: X1 := select targets of “output from parallel fiber”;
• determine source context @MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;
• compute region of interest (here: downward closure) @MEDIATOR: X3 := subregion-closure(X2);
• push selection @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
• compute protein distribution @MEDIATOR: X5 := compute aggregate(X4);
• display in context @MEDIATOR/GUI: display X5 in context (ANATOM)
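The evaluation steps X1 to X5 can be traced with a small Python sketch. Everything here is a toy stand-in: the tables `situate`, `has_part`, and `ncmir_rows`, the region names, and the amounts are invented for illustration and do not come from SENSELAB, NCMIR, or the ANATOM domain map.

```python
from typing import Dict, Iterable, List, Set


def subregion_closure(has_part: Dict[str, List[str]], regions: Iterable[str]) -> Set[str]:
    """Downward closure: the given regions plus all their (transitive) subregions."""
    closure, stack = set(), list(regions)
    while stack:
        region = stack.pop()
        if region not in closure:
            closure.add(region)
            stack.extend(has_part.get(region, ()))
    return closure


def run_plan(targets: List[str], situate: Dict[str, str], has_part: Dict[str, List[str]],
             ncmir_rows: List[dict], protein: str) -> Dict[str, float]:
    x1 = targets                                        # X1: selection pushed to the first source
    x2 = {situate[t] for t in x1 if t in situate}       # X2: situate X1 in the domain map
    x3 = subregion_closure(has_part, x2)                # X3: region of interest
    x4 = [r for r in ncmir_rows                         # X4: selection pushed to the second source
          if r["region"] in x3 and r["protein"] == protein]
    x5: Dict[str, float] = {}                           # X5: aggregate per region
    for r in x4:
        x5[r["region"]] = x5.get(r["region"], 0.0) + r["amount"]
    return x5


# Hypothetical toy inputs.
targets = ["purkinje_dendrite"]
situate = {"purkinje_dendrite": "cerebellar_cortex"}
has_part = {"cerebellar_cortex": ["molecular_layer", "purkinje_layer"]}
ncmir_rows = [
    {"region": "molecular_layer", "protein": "RyR", "amount": 2.0},
    {"region": "purkinje_layer", "protein": "RyR", "amount": 1.5},
    {"region": "purkinje_layer", "protein": "RyR", "amount": 0.5},
    {"region": "hippocampus", "protein": "RyR", "amount": 9.0},
    {"region": "molecular_layer", "protein": "other", "amount": 1.0},
]
distribution = run_plan(targets, situate, has_part, ncmir_rows, "RyR")
```

The closure step is what lets the second selection be restricted to relevant regions before any protein data is shipped, so the domain map acts as the glue between the two otherwise unrelated sources.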
Some Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
  • FaCT description logic reasoner for DMs?
  • or reconciliation of DMs via argumentation frameworks (“games”) using well-founded and stable models of logic programs [ICDT97, PODS97, TCS00]
• Modeling “Process Knowledge” => Process Maps
  • formal semantics? (dynamic/temporal/Kripke models?)
  • executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
  • expressible in F-logic [InfSystems98]
  • scalability? (UMLS Domain Map has millions of entries)
• ...
Towards Process Maps with Abstractions and Elaborations
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges:
Summary: Mediation Scenarios & Techniques

Federated Databases    | XML-Based Mediation    | Model-Based Mediation
One-World              | One-/Multiple-Worlds   | Complex Multiple-Worlds
Common Schema          | Mediated Schema        | Common Glue Maps
SQL, rules             | XML query languages    | DOOD query languages
Schema Transformations | Syntax-Aware Mappings  | Semantics-Aware Mappings
Syntactic Joins        | Syntactic Joins        | “Semantic” Joins via Glue Maps
DB expert              | DB expert              | KRDB + domain expert
Questions? Queries?
Some References
• XML-Based and Model-Based Mediation:
  • MBM: Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, 2001.
  • VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT), Konstanz, Germany, LNCS 1777, Springer, 2000.
  • DOOD: Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
• STATELOG (Logic Programming with States)
  • On Active Deductive Databases: The Statelog Approach, G. Lausen, B. Ludäscher, W. May. In Transactions and Change in Logic Databases, H. Decker, B. Freitag, M. Kifer, A. Voronkov, editors, LNCS 1472, Springer, 1998.
• Argumentation Frameworks as Games
  • Games and Total DatalogNeg Queries, J. Flum, M. Kubierschky, B. Ludäscher, Theoretical Computer Science, 239(2), pp. 257-276, Elsevier, 2000.
  • Referential Actions as Logical Rules, B. Ludäscher, W. May, G. Lausen, Proc. 16th ACM Symposium on Principles of Database Systems (PODS'97), Tucson, Arizona, ACM Press, 1997.