Lessons from the TSIMMIS Project. Yannis Papakonstantinou Department of Computer Science & Engineering University of California, San Diego. Overview. TSIMMIS’ goals, technical challenges, and solutions Insufficiencies of the TSIMMIS’ framework Going forward.
Lessons from the TSIMMIS Project Yannis Papakonstantinou Department of Computer Science & Engineering University of California, San Diego
Overview • TSIMMIS’ goals, technical challenges, and solutions • Insufficiencies of the TSIMMIS’ framework • Going forward
Information Resides on Heterogeneous Information Sources Personal database Ticker Tape WWW Dialog • different interfaces • different data representations • redundant and conflicting information
Goal: System Providing Integrated View of Heterogeneous Data Integration System • collects and combines information • provides integrated view, uniform user interface Personal database Ticker Tape WWW Dialog
The Wrapper and Mediator Architecture Client Common Data Model portfolios for each company Mediator stock market prices business reports Wrapper Wrapper Ticker Tape Dialog
The Data Warehousing Approach to Integration Client Stored Integrated View Mediator Wrapper Wrapper Ticker Tape Dialog
The Lazy Integration Approach Query Decomposition, Translation and Result Fusion Client IBM portfolio Mediator IBM price IBM related reports (in common model) Wrapper Wrapper IBM related reports Ticker Tape Dialog
Wrappers & Mediators from High-Level Specifications Client Mediator Specification Interpreter Mediator Mediator Specification Wrapper Generator Wrapper Wrapper Wrapper Specification Source Source
Challenge: Sources Without a Well-Structured Schema Examples • semistructured • irregular • deeply nested • cross-referenced • incomplete schema knowledge • autonomous • dynamic • HTML pages • SGML documents • genome data • chemical structures • bibliographic information • results of the integration process
Challenge: Different and Limited Source Capabilities Client retrieve IBM data Mediator (U = A + B) retrieve IBM data retrieve IBM data Wrapper (A) Wrapper (B)
Mediator has to Adapt to Query Capabilities of Sources Client retrieve IBM data Mediator (U = A + B) retrieve IBM data retrieve IBM data retrieve everything (A) does not allow selection Wrapper (A) Wrapper (B)
Part B • Semistructured Data Representation • Mediator Generation • Wrapper Generation • Capabilities-Based Rewriting
Representation of Semistructured Information using OEM semantic object-id label Set Value <http://www/~doe, faculty, {&f1,&l1,&r1}> <&f1, first_name, “John”> <&l1, last_name, “Doe”> <&r1, rank, “professor”> Atomic Value structural object-id
Graph Representation of OEM Data <http://www/~doe, faculty, {&f1,&l1,&r1}> <&f1, first_name, “John”> <&l1, last_name, “Doe”> <&r1, rank, “professor”> http://www/~doe faculty first_name “John” last_name “Doe” rank “professor”
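The OEM triples on this slide can be held in a small in-memory store. The following is only a minimal sketch: the class name `OEMObject`, the `store` dictionary, and the `edges` helper are illustrative assumptions, not part of TSIMMIS.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OEMObject:
    """One OEM object: <object-id, label, value> (hypothetical class name)."""
    oid: str
    label: str
    # Either an atomic value (a string here) or a set of sub-object ids.
    value: Union[str, frozenset]


# The four objects shown on the slide, indexed by object-id.
store = {
    o.oid: o
    for o in [
        OEMObject("http://www/~doe", "faculty", frozenset({"&f1", "&l1", "&r1"})),
        OEMObject("&f1", "first_name", "John"),
        OEMObject("&l1", "last_name", "Doe"),
        OEMObject("&r1", "rank", "professor"),
    ]
}


def edges(store, oid):
    """Return the labeled edges leaving `oid` as sorted (label, value) pairs."""
    obj = store[oid]
    if isinstance(obj.value, frozenset):
        return sorted((store[child].label, store[child].value) for child in obj.value)
    return []  # atomic objects have no outgoing edges


print(edges(store, "http://www/~doe"))
```

Following the sub-object ids from the set value is what turns the flat list of triples into the labeled graph drawn on the slide.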
OEM Structures Represent Arbitrary Labeled Graphs http://www/~smith faculty name “Mary Smith” project “Air DB” paper author name “John Doe” author name “Mary Smith” title “Thin Air DB” http://www/~doe faculty first_name “John” last_name “Doe” rank “professor”
Overview • Semistructured Data Representation • Mediator Generation • Example of mediator specification • Language expressiveness • Implementation and performance • Wrapper Generation • Capabilities-Based Rewriting
Merge Information Relating to a Faculty faculty name “John Doe” rank “professor” birthday “April 1” papers ... s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”
Mediator Specification Example faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”
Mediator Specification Example: Semantics of Rule Bodies faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”
Mediator Specification Example: Semantics of Rule Heads “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”
Incrementally Add to Semantically Identified Object “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”
Irregularities & Incomplete Schema Knowledge “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers faculty name “Mary Smith” project “Air DB” “Mary Smith” <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 s1 faculty name “John Doe” rank “professor” papers faculty name “Mary Smith” project “Air DB” s2 person name “John Doe” birthday “April 1”
Second Rule Attaches More Subobjects to View Objects “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”
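The effect of the two rules can be imitated in a few lines of Python. This is a rough sketch of the fusion semantics only, not the MSL interpreter: the list-of-tuples encoding of the sources and the names `apply_rule` and `fuse` are assumptions, and the elided papers sub-objects are left out.

```python
from collections import defaultdict

# Hypothetical encodings of the two sources: (top-level label, sub-objects).
s1 = [
    ("faculty", [("name", "John Doe"), ("rank", "professor")]),
    ("faculty", [("name", "Mary Smith"), ("project", "Air DB")]),
]
s2 = [
    ("person", [("name", "John Doe"), ("birthday", "April 1")]),
]


def apply_rule(view, source, top_label):
    """Imitate <N faculty {<L V>}> :- <top_label {<name N> <L V>}>@source."""
    for label, subobjects in source:
        if label != top_label:
            continue
        # Every binding of N identifies the view object to extend.
        for n in [v for l, v in subobjects if l == "name"]:
            # <L V> matches every sub-object, whatever its label.
            view[n].update(subobjects)


def fuse(s1, s2):
    """Build the integrated view, keyed by the semantic object-id N."""
    view = defaultdict(set)
    apply_rule(view, s1, "faculty")  # first rule
    apply_rule(view, s2, "person")   # second rule adds to the same objects
    return view


view = fuse(s1, s2)
print(sorted(view["John Doe"]))
```

Because the view is keyed by the name N, the second rule extends the object that the first rule created instead of producing a second one, and the irregular "Mary Smith" object passes through without the rules mentioning the project label.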
Language Expressiveness • Information fusion problems solved by MSL • Irregularities • Incomplete knowledge of source structure • Transformation of cross-referenced structures • Inconsistent and redundant data • Use of arbitrary matching criteria • Theoretical analysis of expressiveness • Consider the relational representation of OEM graphs. Then MSL is equivalent to “SQL + special form of transitive closure”
Inconsistent and Redundant Information “John Doe” faculty name “John Doe” rank “associate” rank “assistant” <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 AND NOT <faculty {<name N> <L V1>}>@s1 s1 s2 faculty name “John Doe” rank “associate” person name “John Doe” rank “assistant”
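The AND NOT condition in the second rule can be read as "take a sub-object from s2 only if s1 has no sub-object with the same label for that name". A minimal sketch of that reading follows; the function name `fuse_prefer_s1`, the data encoding, and the extra birthday sub-object in s2 are illustrative assumptions.

```python
def fuse_prefer_s1(s1, s2):
    """Fuse s1 and s2 by name, letting s1 win whenever both supply a label."""
    view = {}
    s1_labels = {}  # name -> labels that s1 already provides for that name

    # First rule: copy every sub-object of s1 faculty objects.
    for label, subs in s1:
        if label != "faculty":
            continue
        for n in [v for l, v in subs if l == "name"]:
            view.setdefault(n, set()).update(subs)
            s1_labels.setdefault(n, set()).update(l for l, _ in subs)

    # Second rule: copy s2 sub-objects unless s1 has the same label (AND NOT).
    for label, subs in s2:
        if label != "person":
            continue
        for n in [v for l, v in subs if l == "name"]:
            for l, v in subs:
                if l not in s1_labels.get(n, set()):
                    view.setdefault(n, set()).add((l, v))
    return view


s1 = [("faculty", [("name", "John Doe"), ("rank", "associate")])]
# The birthday sub-object is added here only to show a non-conflicting label.
s2 = [("person", [("name", "John Doe"), ("rank", "assistant"), ("birthday", "April 1")])]

print(sorted(fuse_prefer_s1(s1, s2)["John Doe"]))
```

With this data the conflicting rank from s2 is dropped, while the birthday, which s1 does not supply, is still attached.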
Overview • Semistructured Data Representation • Mediator Generation • Example of mediator specification • Language expressiveness • Implementation and performance • Wrapper Generation • Capabilities-Based Rewriting
Mediator Specification Interpreter Architecture Result Query Mediator Specification Query Rewriter logical datamerge program Cost-Based Optimizer plan Datamerge Engine Queries to Wrappers Results
Query Rewriting When Origins of Information Are Known • <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1 <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2 • <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000
Query Rewriter Pushes Conditions to Sources • <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1 <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2 • <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000 • logical datamerge program <well-paid {<name N> <salary X>}> :- (<faculty {<name N> <salary X>}> AND X>65000)@s1 AND <person {<name N> <rank assistant>}>@s2
Passing Bindings & Local Join Plans Passing Bindings s1 s2 <salary X> :- <faculty {<name $N> <salary X>}> AND X>65000 <name N> :- <person {<rank assistant>}> Local Join s1 s2 <a {<s X> <n N>}>:- <faculty {<name N> <salary X>}> AND X>65000 N <name N> :- <person {<rank assistant>}>
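The two plans can be compared side by side on toy data. This is a sketch under stated assumptions: the sample records and all function names are invented for illustration, and each "source query" is simulated by a Python function over an in-memory list.

```python
# Hypothetical contents of the two sources.
S1 = [
    {"name": "Ann", "salary": 70000},
    {"name": "Bob", "salary": 60000},
    {"name": "Cy", "salary": 90000},
]
S2 = [
    {"name": "Ann", "rank": "assistant"},
    {"name": "Bob", "rank": "assistant"},
    {"name": "Cy", "rank": "professor"},
]


def s2_names_with_rank(rank):
    """Simulates <name N> :- <person {<rank assistant>}> at s2."""
    return [p["name"] for p in S2 if p["rank"] == rank]


def s1_salary_for(name, min_salary):
    """Simulates the parameterized s1 query with $N bound to `name`."""
    return [f["salary"] for f in S1 if f["name"] == name and f["salary"] > min_salary]


def s1_names_and_salaries(min_salary):
    """Simulates the unparameterized s1 query used by the local join plan."""
    return [(f["name"], f["salary"]) for f in S1 if f["salary"] > min_salary]


def passing_bindings_plan():
    """Query s2 first, then send one s1 query per binding of N."""
    result = []
    for n in s2_names_with_rank("assistant"):
        for x in s1_salary_for(n, 65000):
            result.append((n, x))
    return sorted(result)


def local_join_plan():
    """Query both sources independently and join on N at the mediator."""
    names = set(s2_names_with_rank("assistant"))
    return sorted((n, x) for n, x in s1_names_and_salaries(65000) if n in names)


print(passing_bindings_plan(), local_join_plan())
```

Both plans return the same answer; they differ in how many queries reach s1 and how much data travels to the mediator, which is the trade-off a cost-based optimizer has to weigh.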
Query Decomposition When Origins of Information Are Unknown <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 <X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}>
Plan Considers All Possible Sources of birthday <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 <X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}> s1 s2 birthday name name birthday
Overview • Semistructured-Data Representation • Mediator Generation • Wrapper Generation • Capabilities-Based Rewriting
Query Translation in Wrappers SELECT * FROM person SELECT * FROM person WHERE name=“Smith” Wrapper Query Translator Result Translator find -all find -n Smith Source
Rapid Query Translation Using Templates and Actions SELECT * FROM person SELECT * FROM person WHERE name=“Smith” SELECT * FROM person {emit “find -all” } SELECT * FROM person WHERE name=$N {emit “find -n $N”} Template Interpreter Result Translator find -all find -n Smith Source
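A template interpreter of this kind can be approximated with regular expressions. The sketch below is an assumption-laden illustration, not the TSIMMIS wrapper toolkit: the `TEMPLATES` table, `compile_template`, and `translate` are hypothetical names, and placeholders are assumed to match double-quoted string constants.

```python
import re

# (query template, action) pairs taken from the slide.
TEMPLATES = [
    ("SELECT * FROM person", "find -all"),
    ("SELECT * FROM person WHERE name=$N", "find -n $N"),
]


def compile_template(template):
    """Turn a template into a regex; each $X placeholder captures a quoted constant."""
    pattern = ""
    for part in re.split(r"(\$[A-Za-z]+)", template):
        if part.startswith("$"):
            pattern += r'"(?P<%s>[^"]*)"' % part[1:]
        else:
            pattern += re.escape(part)
    return re.compile(pattern)


COMPILED = [(compile_template(t), action) for t, action in TEMPLATES]


def translate(query):
    """Return the native command emitted by the first template matching `query`."""
    for regex, action in COMPILED:
        match = regex.fullmatch(query)
        if match:
            command = action
            for name, value in match.groupdict().items():
                command = command.replace("$" + name, value)
            return command
    raise ValueError("query not supported by this wrapper: " + query)


print(translate('SELECT * FROM person WHERE name="Smith"'))
```

A query that matches no template is rejected, which is how the wrapper's limited capabilities surface to the mediator.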
Description of Infinite Sets of Supported Queries • uses recursive nonterminals • Example: job description contains word w1 and word w2 and ... • SELECT subset(person) FROM person WHERE \CJob • \CJob: job LIKE $W AND \CJob • \CJob: TRUE
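Read as a grammar, the description says that a CJob condition is either TRUE or "job LIKE $W AND" followed by another CJob condition. A recursive recognizer for that reading might look as follows; this is a sketch only, and the names `match_cjob` and `job_words` as well as the double-quoted constant syntax are assumptions.

```python
import re

PREFIX = "SELECT subset(person) FROM person WHERE "
# One step of the recursive rule: job LIKE $W AND <rest>.
STEP = re.compile(r'job LIKE "([^"]*)" AND (.*)')


def match_cjob(text):
    """Return the list of words if `text` derives from CJob, else None."""
    if text == "TRUE":  # base rule: CJob -> TRUE
        return []
    step = STEP.fullmatch(text)
    if step is None:
        return None
    rest = match_cjob(step.group(2))  # recursive rule: CJob -> job LIKE $W AND CJob
    if rest is None:
        return None
    return [step.group(1)] + rest


def job_words(query):
    """Return the words of a supported query, or None if it is not supported."""
    if not query.startswith(PREFIX):
        return None
    return match_cjob(query[len(PREFIX):])


print(job_words(PREFIX + 'job LIKE "db" AND job LIKE "web" AND TRUE'))
```

Two short rules are enough to cover conjunctions of any length, which is the point of allowing recursion in the description language.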
Overview • Semistructured-Data Representation • Mediator Generation • Wrapper Generation • Capabilities-Based Rewriting
Capabilities-Based Rewriter in Mediator Architecture Query logical datamerge program Mediator Specification Query Rewriter Capabilities-Based Rewriter supported plans Cost-Based Optimizer optimal plan Datamerge Engine Wrapper Supported Queries Description Wrapper Supported Queries Description
Capabilities-Based Rewriter Finds Supported Plans SELECT * FROM A WHERE salary>65000 Supported Queries SELECT * FROM A
Capabilities-Based Rewriter Finds Most-Selective Supported Plans SELECT * FROM B WHERE salary>65000 Supported Queries SELECT * FROM B WHERE salary >65000 SELECT * FROM B
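The choice between the two supported plans can be sketched as a search over the condition sets each wrapper accepts. The encoding below is an assumption made for illustration (conditions as strings, capabilities as lists of frozensets, and the function name `most_selective_plan`); it is not the rewriting algorithm used in TSIMMIS.

```python
# For each source, the sets of conditions its wrapper can evaluate.
# The empty set stands for the plain "SELECT * FROM <source>" query.
SUPPORTED = {
    "A": [frozenset()],
    "B": [frozenset(), frozenset({"salary>65000"})],
}


def most_selective_plan(wanted, supported):
    """Split `wanted` into (conditions pushed to the source, conditions left to the mediator).

    Returns None when no supported query is usable for this request.
    """
    candidates = [s for s in supported if s <= wanted]
    if not candidates:
        return None
    pushed = max(candidates, key=len)  # push as many conditions as the source allows
    return pushed, wanted - pushed


wanted = frozenset({"salary>65000"})
for source in ("A", "B"):
    print(source, most_selective_plan(wanted, SUPPORTED[source]))
```

For source A the selection stays at the mediator, which must retrieve everything and filter locally; for source B the whole condition is pushed to the wrapper.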
Capabilities-Based Rewriter Architecture Query Query Capabilities Description Component SubQuery Discovery Component SubQueries Plan Construction Plans (not fully optimized) Plan Refinement Algebraically optimal plans
What TSIMMIS Achieved • system for integration of heterogeneous sources • challenges and solutions • semistructured data & incomplete schema knowledge • appropriate specification language and query processing algorithms • limited and different query capabilities • query translation algorithm • capabilities-based query rewriting algorithm
Overview • TSIMMIS’ goals, technical challenges, and solutions • Insufficiencies of the TSIMMIS’ framework • Going forward
Insufficiencies of the TSIMMIS framework • OEM was really unstructured data • some loose and partial schematic info may pay off tremendously • too “databasy” user/mediator/source interaction
Overview • TSIMMIS’ goals, technical challenges, and solutions • Insufficiencies of the TSIMMIS’ framework • Going forward
Web emerges as a Distributed DB and XML as its Data Model XMAS Query Language Also export: 1. Schemas & Metadata (XML-Data, RDF,…) 2. Description of supported queries XML View Document(s) XML View Document(s) XML View Document(s) Data Source Wrapper Native XML Database Legacy Source
Definition of Integrated Views Integrated XML View View Definition in XMAS Mediator XML View Document(s) XML View Document(s) XML View Document(s) Data Source Data Source Data Source
Non-Materialized Views in the MIX mediator system Blended Browsing & Querying (BBQ) GUI Application XMAS query XML document Integrated View DTD DOM for Virtual XML Doc’s View Definition in XMAS MIX Mediator DTD Inference Query Processor Source DTD XML Source XML Source
Application XML Document Fragments Blended Browsing & Querying (BBQ) GUI DOM (VXD) Client API XMAS Query View DTD MIX Mediator XMAS Mediator View Definition Resolution Unfolded Query DTD Inference Simplification Translation to Algebra Optimization DTD Execution XMAS Query XML Document Fragments XML Source 1 RDB2XML Wrapper XML Source 2 RDB