340 likes | 480 Views
Data Integration: A Status Report. Alon Halevy University of Washington, Seattle BTW 2003. Data Integration Report. Recent progress Mediation languages Query processing (XML and other) Commercial Current challenges Flexible architectures: peer-data mgmt.
E N D
Data Integration:A Status Report Alon Halevy University of Washington, Seattle BTW 2003
Data Integration Report • Recent progress • Mediation languages • Query processing (XML and other) • Commercial • Current challenges • Flexible architectures: peer-data mgmt. • Getting to the root of semantic heterogeneity: schema mapping. BTW 2003
Data Integration Systems • This is one possible architecture (virtual integration) • Only logical mediated schema is central. Data stays at the sources.
Motivation and Activity • Application areas of data integration: • Enterprise information integration ($$) • The government • Data sources on the web • Scientific data sharing. • Many research projects: • Mine: Information Manifold, Tukwila, LSD. • Companies: • Many startups, big guys getting in. BTW 2003
Outline • Recent progress • Mediation languages • Adaptive Query processing • XML data management • Commercial • Current challenges • Flexible architectures: peer-data mgmt. • Getting to the root of semantic heterogeneity: schema mapping. • Crossing the Structure Chasm. BTW 2003
Q Q’ Q’ Q’ Q’ Q’ Source Source Source Source Source Mediation Languages Goal: Mediated Schema Language for Specifying Semantic relationships BTW 2003
Source Source Source Source Source Global-as-View (GAV) Create view Actor AS R1 Union Select A,B From S2 Union … Mediated Schema Title, Actor, … R1 R2 R3 R4 R5 BTW 2003
Source Source Source Source Source Local-as-View (LAV) (GLAV) Create View R5 as Select * From Movie Where lang=“German” Create View R1 as Select title, name From Title Join Actor Where Year>1970 Mediated Schema Title, Actor … R1 R2 R3 R4 R5 BTW 2003
Adaptive Query Processing • Problem: no stats, network unstable • Cannot ‘Plan and then execute’ • Need to adapt plan during execution. • Idea already in Ingres (1976) • Proposed before data integration: • Cole and Graefe (choose nodes) • Kabra and Dewitt (mid-query re-opt). BTW 2003
Convergent Query Processing[Zack Ives, Ph.D 2002, U. Penn] • Processor starts with initial plan • Monitors execution, accumulating stats. • Switches plan when a better one found • Reuses intermediate results. • Final, cleanup phase. • Possible transformation types: • Plan partitioning, data partitioning, low-level rescheduling. • Can be aggressive (e.g., with aggregations). BTW 2003
XML Query Processing • XML facilitates integration. • Mediator query processor may manipulate XML directly. • Progress on: • Publishing to XML, XML views on relations • Physical algebras for manipulating XML • Optimization of XQuery. BTW 2003
The Commercial World • Some startups: • Nimble, MetaMatrix, Calixa, Enosys, … • Big guys making announcements: • IBM, BEA, MS, (Oracle still being defiant). • Progress: analysts have buzzword -- EII. • Challenges: • Integration with EAI? • Yet another middleware? • Horizontal vs. vertical? BTW 2003
Outline • Recent progress • Mediation languages • Adaptive Query processing • XML data management • Commercial • Current challenges • Flexible architectures: peer-data mgmt. • Getting to the root of semantic heterogeneity: schema mapping. BTW 2003
Peer Data-Management • PDMS: a network of peers • Peers can: • Export base data • Provide views on base data • Serve as logical mediators for other peers • A peer can be both a server and a client. • Semantic relationships are specified locally(between small sets of peers). BTW 2003
Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Network of Mappings (Piazza) CiteSeer UW Stanford GAV, LAV GLAV DBLP Leipzig Saarbruecken Berlin
Advantages of PDMS • No need for a central mediated schema. • Can map data opportunistically, as is most convenient. • Queries are posed using the peer’s schema. Answers come from anywhere in the system. • Semantic Web. • This is not P2P file sharing. • Data has rich semantics • Membership is not as dynamic. BTW 2003
Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Schema Mediation When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML] CiteSeer UW Stanford GAV, LAV GLAV DBLP Leipzig Saarbruecken Berlin
Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Query Optimization • Problems: • redundant paths • expensive reformulation. CiteSeer UW Stanford • Possible solution: • Pre-compose some paths DBLP Leipzig Saarbruecken Berlin
Mapping Composition • Incredibly subtle! [w/ Madhavan] • In general, composition can be an infinite set of GLAV formulas. • Results: • Finite in many cases • Even when infinite, often has finite, useful encoding. • Hence, compositions can usually be pre-optimized. BTW 2003
Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Management of Updates[w/ Mork, Gribble] • Problem: when updates are generated, we don’t know who will use them. • Solution: • represent updates as first-class citizens • Complement with boosters • Rules for usage. CiteSeer UW Stanford DBLP Leipzig Saarbruecken Berlin
Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Other Research Issues Intelligent data placement Management of mapping networks Improving networks: finding additional connections. Indexing of views CiteSeer UW Stanford DBLP Leipzig Saarbruecken Berlin
Schema Matching/Mapping • Given • S1 and S2: a pair of schemas/DTDs/ontologies,… • Possibly, data accompanying instances • Additional domain knowledge • Find: • A match between S1 and S2 • A set of correspondences between the terms. • Ultimately, a mapping • Should enable translating data between the schemas. BTW 2003
Example: House Listings house address Water view num-baths LakeMountains ? 1-1 mapping non 1-1 mapping house location view full-baths half-baths front back
Motivations • Heart of any data sharing architecture • Virtual, warehouse, messaging, • web services, semantic web • Translation of legacy data, EAI, … • Key operator in model management • Algebra for manipulating models of data • See [Bernstein, CIDR-03], Melnik et al. [SIGMOD 03]. • Currently, a bottleneck. Done mostly by hand. BTW 2003
Approaches to Matching • Matching is hard because schema does not fully capture the semantics. • Many techniques proposed. They consider similarities in: • Attribute names (synonyms) • Data values, data types • Relationships between columns • Structural similarities • Anything a human expert would try! • Hence, let’s try to simulate a human. BTW 2003
Philosophy of Solutions • Effective schema matching requires a principled combination of techniques. • Like human experts, the matcher should improve over time • Learn from seeing many schemas, matches. • LSD [Doan, Ph.D 2002, U. of Illinois] • COMA [Do et al.] BTW 2003
Corpus Based Solution[Madhavan, Bernstein, Chen, Halevy, Shenoy] • Collect a corpus of schemas and matches. • Learn from the corpus: • Create a classifier for every corpus element • Use multi-strategy learning. • Given S1 and S2 : • Compare each schema element to corpus elements. • If two elements’ similarity vectors are close, then maybe they match each other. BTW 2003
Finding Different Matches BTW 2003
Other Corpus Based Tools • Conjecture: a corpus of schemas can be the basis for many useful tools. • Auto-complete: • I start creating a schema (or show sample data), and the tool suggests a completion. • Query reformulation: • I ask a query using my terminology, and it gets reformulated appropriately. • Improving structured queries over structured web sites (and focused crawling, a la BINGO!) BTW 2003
The Corpus • Contents: • Schemas, ontologies, meta-data, data, queries. • Sample statistics: • How often does a word appear as a relation name? • When it does, what tend to be the attribute names? • What other tables are there? What are the foreign keys? BTW 2003
schema mapping Conclusion: Crossing the Structure Chasm • Data authoring, querying and sharing is everywhere; done by novices too. • Semantic web: the extreme example. Corpus Of schemas BTW 2003
Some References • www.cs.washington.edu/homes/alon • Piazza: WebDB01, ICDE03, WWW03 • The Structure Chasm: CIDR-03 • Mediation surveys: VLDB Journal 01 • Lenzerini, PODS 02 tutorial. • Schema matching: • Rahm and Bernstein, VLDB Journal 01. BTW 2003