1 / 36

On Schema Matching with Opaque Column Names and Data Values

On Schema Matching with Opaque Column Names and Data Values. Jaewoo Kang NC State (Aug 2003) Jeffrey F. Naughton Univ. of Wisconsin-Madison. What is Schema Matching?. Finding semantic correspondences of schema elements across heterogeneous sources. Old problem yet attracting new interests.

Download Presentation

On Schema Matching with Opaque Column Names and Data Values

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. On Schema Matching with Opaque Column Names and Data Values Jaewoo Kang NC State (Aug 2003) Jeffrey F. Naughton Univ. of Wisconsin-Madison

  2. What is Schema Matching? • Finding semantic correspondences of schema elements across heterogeneous sources. • Old problem yet attracting new interests. SIGMOD 2003

  3. What is Schema Matching? (Cont’d) • Important for enterprise applications • Data warehouses, data migration. • Also important for Internet data • Virtual databases, web information systems. • Fundamental element of data integration. SIGMOD 2003

  4. No Silver Bullet! • State of the art: A collection of techniques that propose matches. • We have added a new technique to this collection that works when previous techniques don’t even apply. SIGMOD 2003

  5. Some Previous Approaches • Schema-based approaches Site 1 Site 2 SIGMOD 2003

  6. Some Previous Approaches II • Instance-based approaches Site 1 Site 2 SIGMOD 2003

  7. So two previous approaches • Schema-based (interpret column names) • Instance-based (interpret data values) SIGMOD 2003

  8. But what about this problem? Site 1 Site 2 SIGMOD 2003

  9. This is the “Un-interpreted Matching” Problem. • Focus of this talk • Outline of the remainder of this talk • Formal definition • Terminology • Algorithm • Experimental Results SIGMOD 2003

  10. Un-interpreted Matching M1 = match(R(r1, r2, .., rn), S(s1, s2, .., sm)) M2 = match(R(r1, r2, .., rn), S’(f1(s1), f2(s2), .., fm(sm)) where match = a schema matching algorithm, Mi = {(ri-sj)} : set of matching column pairs, fi = arbitrary one-to-one function. ‘match’ is an un-interpreted matching iff M1=M2 for all fi’s. • Main idea: specific token representing column name and value is not important. SIGMOD 2003

  11. Motivating example Two Car Part Tables SIGMOD 2003

  12. Motivating example Two Car Part Tables SIGMOD 2003

  13. Motivating example Two Car Part Tables SIGMOD 2003

  14. Motivating example Two Car Part Tables SIGMOD 2003

  15. Background • Before introducing our algorithm, need: • Information Entropy • Mutual Information • Modeling Dependency Relations • Graph Matching SIGMOD 2003

  16. Information Entropy • Measures the uncertainty of values in an attribute • Standard information theoretic measure SIGMOD 2003

  17. Mutual Information • Another standard information theoretic measure • Measures the amount of information captured in one attribute about the other. • Note Self-information MI(X;X) = H(X) SIGMOD 2003

  18. Modeling Dependency Relation Table R G=Table2DepGraph(R) SIGMOD 2003

  19. Graph Matching • Our algorithm will use graph matching. • {(G1(a),G2(b))}=GraphMatch(G1,G2) • Finds a mapping that minimizes the distance between the two graphs. G1 G2 SIGMOD 2003

  20. Distance Between the Graphs • Euclidean distance metric (Frobenius norm) • where aij and bij = mutual information between node i and j. • m(node in A) = matching node in B. SIGMOD 2003

  21. Measuring the quality of match results SIGMOD 2003

  22. Finally, Our Matching Algorithm 1. G1 = Table2DepGraph(S1); G2 = Table2DepGraph(S2); 2. {(G1(a), G2(b))} = GraphMatch(G1, G2); where Si = an input table, Gi = a dependency graph, (G1(a), G2(b)) = a matching node pair. SIGMOD 2003

  23. Validating the Framework • Graph matching algorithm • Used exhaustive search w/ simple filtering. • Can be replaced w/ approximate algorithms in practice. • System • Java HotSpot VM 1.4 SIGMOD 2003

  24. Goals of experiments… • Main goal: see if mutual information-based un-interpreted matching works. • Secondary goal: see if mutual information is necessary, or if a simpler approach, Entropy-only Matching, works just as well. • Only compares the entropies of attributes in isolation, without considering mutual information. SIGMOD 2003

  25. Data Set I • Census Data (U.S. Census Bureau) • State census data files: NY and CA. • Can algorithm find mapping between attributes in NY and CA tables? SIGMOD 2003

  26. Data Set II • Medical Data • Thrombosis lab exam data (12 years of patient records.) • Range partitioned into two tables based on exam dates. • Can algorithm find mapping between attributes in resulting two tables? SIGMOD 2003

  27. Results Census data Thrombosis exam • Match precision deteriorates as the size of match increases. • However, deterioration is small compared to the exponential increase in search space. • MI-based approach dominates entropy-only approach. SIGMOD 2003

  28. Why does mutual information-based approach dominate entropy-only approach? SIGMOD 2003

  29. Cardinality Constraints in Schema Matching • One-to-one mapping (bijective) A B C A B C G1 G2 SIGMOD 2003

  30. Cardinality Constraints in Schema Matching • One-to-one mapping (bijective) • Onto mapping (surjective) A B C D A B C G1 G2 SIGMOD 2003

  31. Cardinality Constraints in Schema Matching • One-to-one mapping • Onto mapping • Partial mapping A B C E A B C D G1 G2 SIGMOD 2003

  32. What about schemas that don’t match? • Examined how our matching algorithm reacts to the matching of unrelated schemas. (NY-CA vs. Lab1-CA) SIGMOD 2003

  33. Distinguishing Good and Bad Matches • Clearly detects case where there is no good matching. SIGMOD 2003

  34. Summary • Identified new class of schema matching problems that have not been addressed by existing solutions. • First to introduce an un-interpreted matching technique that addresses the new class of problems. • Evaluation suggests it may be useful as an addition to existing matching techniques. SIGMOD 2003

  35. Future Work • Find an efficient, accurate graph matching approximation algorithm. • Extend the techniques to nested structures such as XML, OO schemas. • See if the technique applicable to the problems of schema classification / clustering. SIGMOD 2003

  36. Questions? • For more information: • jaewoo@cs.wisc.edu • http://www.cs.wisc.edu/~jaewoo SIGMOD 2003

More Related