
Data Integration: A Status Report

Alon Halevy, University of Washington, Seattle. BTW 2003.


Presentation Transcript


  1. Data Integration: A Status Report. Alon Halevy, University of Washington, Seattle. BTW 2003

  2. Data Integration Report • Recent progress • Mediation languages • Query processing (XML and other) • Commercial • Current challenges • Flexible architectures: peer data management. • Getting to the root of semantic heterogeneity: schema mapping.

  3. Data Integration Systems • This is one possible architecture (virtual integration). • Only the logical mediated schema is central. The data stays at the sources.

  4. Motivation and Activity • Application areas of data integration: • Enterprise information integration ($$) • The government • Data sources on the web • Scientific data sharing. • Many research projects: • Mine: Information Manifold, Tukwila, LSD. • Companies: • Many startups, big guys getting in.

  5. Outline • Recent progress • Mediation languages • Adaptive query processing • XML data management • Commercial • Current challenges • Flexible architectures: peer data management. • Getting to the root of semantic heterogeneity: schema mapping. • Crossing the Structure Chasm.

  6. Mediation Languages • Goal: a language for specifying semantic relationships between the mediated schema and the sources. [Figure: a query Q over the mediated schema is reformulated into queries Q’ over five sources.]

  7. Global-as-View (GAV) • Mediated schema: Title, Actor, … • Sources: R1, R2, R3, R4, R5 • Mediated relations are defined as views over the sources: Create View Actor AS R1 Union Select A, B From S2 Union … [Figure: mediated schema above five sources.]
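The GAV view on this slide can be tried out with a small sketch. It uses Python's built-in sqlite3 module; the column names, the extra year column in S2, and the sample rows are assumptions made for illustration, not part of the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical sources, each with its own local schema.
cur.execute("CREATE TABLE R1 (title TEXT, actor TEXT)")
cur.execute("CREATE TABLE S2 (A TEXT, B TEXT, year INTEGER)")
cur.execute("INSERT INTO R1 VALUES ('Metropolis', 'Brigitte Helm')")
cur.executemany(
    "INSERT INTO S2 VALUES (?, ?, ?)",
    [("Run Lola Run", "Franka Potente", 1998),
     ("Metropolis", "Brigitte Helm", 1927)],
)

# GAV: the mediated relation Actor is defined as a view over the sources.
cur.execute("""
    CREATE VIEW Actor AS
    SELECT title, actor FROM R1
    UNION
    SELECT A, B FROM S2
""")

# A query over the mediated schema is answered by unfolding the view.
rows = cur.execute("SELECT title, actor FROM Actor ORDER BY title").fetchall()
print(rows)
```

Because the mediated relation is a plain view, reformulation in GAV amounts to view unfolding, which the database engine does here on its own.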

  8. Local-as-View (LAV) (GLAV) • Mediated schema: Title, Actor, … • Sources: R1, R2, R3, R4, R5 • Sources are described as views over the mediated schema: Create View R5 as Select * From Movie Where lang='German' • Create View R1 as Select title, name From Title Join Actor Where Year>1970 [Figure: mediated schema above five sources.]
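One consequence of LAV descriptions is that the mediator can rule out sources whose descriptions contradict a query. The sketch below covers only that pruning step for equality selections. The class SourceDescription, the function is_relevant, and the sources R6 and R7 are hypothetical; full LAV reformulation requires answering queries using views, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    """A source described as a selection view over one mediated relation."""
    name: str
    relation: str                                # mediated relation the view ranges over
    equals: dict = field(default_factory=dict)   # attribute -> value fixed by the view


def is_relevant(source, relation, equals):
    """True unless the source description contradicts the query's selections."""
    if source.relation != relation:
        return False
    return all(source.equals.get(attr, value) == value
               for attr, value in equals.items())


sources = [
    SourceDescription("R5", "Movie", {"lang": "German"}),  # from the slide
    SourceDescription("R6", "Movie", {"lang": "French"}),  # hypothetical
    SourceDescription("R7", "Movie"),                      # hypothetical, no selection
]

# Query over the mediated schema: Select * From Movie Where lang='German'
relevant = [s.name for s in sources if is_relevant(s, "Movie", {"lang": "German"})]
print(relevant)
```

Adding a new source in this style means adding one description, without touching the definitions of the mediated relations.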

  9. Adaptive Query Processing • Problem: no statistics, network unstable • Cannot ‘plan and then execute’ • Need to adapt the plan during execution. • Idea already in Ingres (1976) • Proposed before data integration: • Cole and Graefe (choose nodes) • Kabra and DeWitt (mid-query re-optimization).

  10. Convergent Query Processing [Zack Ives, Ph.D. 2002, U. Penn] • Processor starts with an initial plan • Monitors execution, accumulating statistics. • Switches plans when a better one is found • Reuses intermediate results. • Final cleanup phase. • Possible transformation types: • Plan partitioning, data partitioning, low-level rescheduling. • Can be aggressive (e.g., with aggregations).
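The monitor-and-switch loop can be illustrated with a toy example. This is not the algorithm from the cited thesis: it is a minimal sketch, assuming a plan that is only an ordering of filter predicates, and the function adaptive_filter and its batch size are invented for illustration. It shows data partitioning in the simplest form: each batch runs under the plan that looked best so far, and the per-batch outputs are concatenated.

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Apply all predicates to rows, re-ordering them between batches.

    Returns the surviving rows and the predicate order used for each batch.
    """
    order = list(range(len(predicates)))   # initial plan: the given order
    seen = [0] * len(predicates)           # rows each predicate has evaluated
    passed = [0] * len(predicates)         # rows each predicate has let through
    output, plans = [], []

    for start in range(0, len(rows), batch_size):
        plans.append(tuple(order))
        for row in rows[start:start + batch_size]:
            keep = True
            for i in order:
                seen[i] += 1
                if predicates[i](row):
                    passed[i] += 1
                else:
                    keep = False
                    break
            if keep:
                output.append(row)
        # Re-plan: run the predicate with the lowest observed pass rate first.
        # The rates are biased by the current order; a real system would correct for that.
        order.sort(key=lambda i: passed[i] / seen[i] if seen[i] else 1.0)
    return output, plans


rows = list(range(1000))
predicates = [lambda x: x % 2 == 0, lambda x: x % 10 == 0]
result, plans = adaptive_filter(rows, predicates)
print(len(result), plans[0], plans[1])
```

In this run the executor starts with the given order and moves the more selective predicate to the front after the first batch.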

  11. XML Query Processing • XML facilitates integration. • Mediator query processor may manipulate XML directly. • Progress on: • Publishing to XML, XML views on relations • Physical algebras for manipulating XML • Optimization of XQuery.

  12. The Commercial World • Some startups: • Nimble, MetaMatrix, Calixa, Enosys, … • Big guys making announcements: • IBM, BEA, MS, (Oracle still being defiant). • Progress: analysts have a buzzword: EII. • Challenges: • Integration with EAI? • Yet another middleware? • Horizontal vs. vertical?

  13. Outline • Recent progress • Mediation languages • Adaptive query processing • XML data management • Commercial • Current challenges • Flexible architectures: peer data management. • Getting to the root of semantic heterogeneity: schema mapping.

  14. Peer Data-Management • PDMS: a network of peers • Peers can: • Export base data • Provide views on base data • Serve as logical mediators for other peers • A peer can be both a server and a client. • Semantic relationships are specified locally (between small sets of peers).

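  15. Network of Mappings (Piazza) • Peers: CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken, Berlin • Mappings between peers: GAV, LAV, GLAV [Figure: a query Q posed at one peer is reformulated into Q’ and Q’’ along the mappings.]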

  16. Advantages of PDMS • No need for a central mediated schema. • Can map data opportunistically, as is most convenient. • Queries are posed using the peer’s schema. Answers come from anywhere in the system. • Semantic Web. • This is not P2P file sharing. • Data has rich semantics • Membership is not as dynamic.

  17. Schema Mediation • When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML] [Figure: the peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken and Berlin, with GAV, LAV and GLAV mappings and queries Q, Q’, Q’’.]

  18. Query Optimization • Problems: • Redundant paths • Expensive reformulation. • Possible solution: • Pre-compose some paths [Figure: the peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken and Berlin, with queries Q, Q’, Q’’.]
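Redundant paths and pre-composition can be sketched on a tiny mapping network. The edges, the attribute names, and the helpers simple_paths and compose are all hypothetical. Mappings are reduced to attribute renamings, a deliberately trivial case; GAV, LAV and GLAV mappings are far harder to compose.

```python
from collections import defaultdict

# Hypothetical mapping edges: (from_peer, to_peer) -> attribute renaming.
mappings = {
    ("UW", "DBLP"): {"title": "paper_title", "author": "author_name"},
    ("UW", "Stanford"): {"title": "name", "author": "writer"},
    ("Stanford", "DBLP"): {"name": "paper_title", "writer": "author_name"},
    ("DBLP", "Leipzig"): {"paper_title": "titel", "author_name": "autor"},
}


def simple_paths(src, dst, edges):
    """Enumerate all cycle-free mapping paths from src to dst."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    paths, stack = [], [[src]]
    while stack:
        path = stack.pop()
        if path[-1] == dst:
            paths.append(path)
            continue
        for nxt in graph[path[-1]]:
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def compose(path, edges):
    """Compose the renamings along a path into one direct mapping."""
    attrs = {a: a for a in edges[(path[0], path[1])]}
    for a, b in zip(path, path[1:]):
        step = edges[(a, b)]
        attrs = {src: step[cur] for src, cur in attrs.items() if cur in step}
    return attrs


paths = simple_paths("UW", "Leipzig", mappings)
composed = {tuple(p): compose(p, mappings) for p in paths}
for path, mapping in composed.items():
    print(" -> ".join(path), mapping)
```

Both paths from UW to Leipzig yield the same renaming, so following both at query time repeats work. Storing the composed mapping as a direct UW to Leipzig edge is one reading of "pre-compose some paths".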

  19. Mapping Composition • Incredibly subtle! [with Madhavan] • In general, a composition can be an infinite set of GLAV formulas. • Results: • Finite in many cases • Even when infinite, it often has a finite, useful encoding. • Hence, compositions can usually be pre-optimized.

  20. Management of Updates [with Mork, Gribble] • Problem: when updates are generated, we don’t know who will use them. • Solution: • Represent updates as first-class citizens • Complement with boosters • Rules for usage. [Figure: the peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken and Berlin, with queries Q, Q’, Q’’.]

  21. Other Research Issues • Intelligent data placement • Management of mapping networks • Improving networks: finding additional connections. • Indexing of views [Figure: the peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken and Berlin, with queries Q, Q’, Q’’.]

  22. Schema Matching/Mapping • Given • S1 and S2: a pair of schemas/DTDs/ontologies, … • Possibly, accompanying data instances • Additional domain knowledge • Find: • A match between S1 and S2 • A set of correspondences between the terms. • Ultimately, a mapping • Should enable translating data between the schemas.

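  23. Example: House Listings [Figure: two schemas for house listings. Labels in one: house, address, Water view, num-baths, Lake, Mountains. Labels in the other: house, location, view, full-baths, half-baths, front, back. Correspondences between them are marked as 1-1 or non 1-1 mappings, and one is marked “?”.]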

  24. Motivations • Heart of any data sharing architecture • Virtual, warehouse, messaging, web services, semantic web • Translation of legacy data, EAI, … • Key operator in model management • Algebra for manipulating models of data • See Bernstein [CIDR-03], Melnik et al. [SIGMOD 03]. • Currently, a bottleneck. Done mostly by hand.

  25. Approaches to Matching • Matching is hard because a schema does not fully capture the semantics. • Many techniques proposed. They consider similarities in: • Attribute names (synonyms) • Data values, data types • Relationships between columns • Structural similarities • Anything a human expert would try! • Hence, let’s try to simulate a human.
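Two of the signals listed above, attribute names and data values, can be combined in a few lines. This is a minimal sketch rather than any of the cited matchers: the equal weighting, the helper names, and the sample values are assumptions, and the attribute names are borrowed from the house-listings example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(xs, ys):
    """Jaccard overlap of two sets of sample data values, in [0, 1]."""
    xs, ys = set(xs), set(ys)
    if not xs and not ys:
        return 0.0
    return len(xs & ys) / len(xs | ys)


def combined_score(name_a, vals_a, name_b, vals_b, w_name=0.5):
    """Weighted combination of the two signals (the weight is an assumption)."""
    return (w_name * name_similarity(name_a, name_b)
            + (1 - w_name) * value_overlap(vals_a, vals_b))


# Hypothetical sample values for attributes of the second schema.
target = {
    "location": ["Seattle", "Kent"],
    "full-baths": [1, 2],
    "view": ["Lake", "Mountains"],
}
source_name, source_vals = "num-baths", [1, 2, 3]

scores = {name: combined_score(source_name, source_vals, name, vals)
          for name, vals in target.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A fixed weight is the crudest way to combine signals, and neither signal captures non 1-1 cases such as num-baths corresponding to full-baths and half-baths together.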

  26. Philosophy of Solutions • Effective schema matching requires a principled combination of techniques. • Like human experts, the matcher should improve over time. • Learn from seeing many schemas, matches. • LSD [Doan, Ph.D. 2002, U. of Illinois] • COMA [Do et al.]

  27. Corpus Based Solution [Madhavan, Bernstein, Chen, Halevy, Shenoy] • Collect a corpus of schemas and matches. • Learn from the corpus: • Create a classifier for every corpus element • Use multi-strategy learning. • Given S1 and S2: • Compare each schema element to corpus elements. • If two elements’ similarity vectors are close, then maybe they match each other.
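The similarity-vector step can be sketched as follows. The slide calls for a learned classifier per corpus element; here a Jaccard overlap of sample values stands in for each classifier, which is an assumption made to keep the example short. The corpus contents, element names, and sample values are hypothetical as well.

```python
import math


def jaccard(xs, ys):
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs | ys) if (xs or ys) else 0.0


def similarity_vector(values, corpus):
    """Similarity of one schema element to every corpus element."""
    return [jaccard(values, corpus_values) for corpus_values in corpus.values()]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical corpus elements with sample values seen in earlier schemas.
corpus = {
    "city": {"Seattle", "Portland", "Boston"},
    "bathrooms": {1, 2, 3},
}

# Elements of the two schemas to be matched, with hypothetical sample values.
location = {"Seattle", "Boston"}      # from S1
address = {"Seattle", "Portland"}     # from S2
num_baths = {1, 2}                    # from S2

v_location = similarity_vector(location, corpus)
v_address = similarity_vector(address, corpus)
v_num_baths = similarity_vector(num_baths, corpus)

print(cosine(v_location, v_address), cosine(v_location, v_num_baths))
print(jaccard(location, address))
```

Compared directly, location and address share only one value. Their similarity vectors against the corpus are nearly identical, which is the signal the corpus adds.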

  28. Learning from Corpus vs. Learning from the schemas

  29. Finding Different Matches

  30. Other Corpus Based Tools • Conjecture: a corpus of schemas can be the basis for many useful tools. • Auto-complete: • I start creating a schema (or show sample data), and the tool suggests a completion. • Query reformulation: • I ask a query using my terminology, and it gets reformulated appropriately. • Improving structured queries over structured web sites (and focused crawling, a la BINGO!)

  31. The Corpus • Contents: • Schemas, ontologies, meta-data, data, queries. • Sample statistics: • How often does a word appear as a relation name? • When it does, what tend to be the attribute names? • What other tables are there? What are the foreign keys?
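The first two sample statistics are simple counts, and they are enough for a crude form of the auto-complete tool from slide 30. The three schemas below are a hypothetical corpus, and suggest_attributes is an invented helper, not a tool described in the talk.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each schema maps relation names to attribute names.
corpus = [
    {"house": ["address", "price"], "agent": ["name", "phone"]},
    {"house": ["address", "num_baths"], "school": ["name"]},
    {"listing": ["house", "price"]},
]

relation_counts = Counter()                  # how often a word is a relation name
attrs_given_relation = defaultdict(Counter)  # attribute names seen for each relation

for schema in corpus:
    for relation, attributes in schema.items():
        relation_counts[relation] += 1
        attrs_given_relation[relation].update(attributes)


def suggest_attributes(relation, k=3):
    """Auto-complete sketch: the attributes most often seen with this relation."""
    return [attr for attr, _ in attrs_given_relation[relation].most_common(k)]


print(relation_counts["house"])
print(suggest_attributes("house"))
```

Note that "house" appears once as an attribute name too, in the listing relation. The counts above ignore that occurrence because the first statistic asks only about relation names.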

  32. Conclusion: Crossing the Structure Chasm • Data authoring, querying and sharing are everywhere, and are done by novices too. • Semantic web: the extreme example. [Figure labels: schema mapping; corpus of schemas.]

  33. Some References • www.cs.washington.edu/homes/alon • Piazza: WebDB01, ICDE03, WWW03 • The Structure Chasm: CIDR-03 • Mediation surveys: VLDB Journal 01 • Lenzerini, PODS 02 tutorial. • Schema matching: • Rahm and Bernstein, VLDB Journal 01.
