520 likes | 533 Views
Putting Context into Schema Matching. Philip Bohannon* Yahoo! Research Eiman Elnahrawy* Rutgers University. Wenfei Fan Univ of Edinburgh and Bell Labs Michael Flaster* Google. *Work performed at Lucent Technologies -- Bell Laboratories. Overview. Motivation Background Strawman
E N D
Putting Context into Schema Matching Philip Bohannon* Yahoo! Research Eiman Elnahrawy* Rutgers University Wenfei Fan Univ of Edinburgh and Bell Labs Michael Flaster* Google *Work performed at Lucent Technologies -- Bell Laboratories.
Overview • Motivation • Background • Strawman • Framework • Experimental Evaluation • Related Work • Conclusions Slide 2
.88 .93 .97 Schema Matching vs. Schema Mapping • Arrows inferred based on meta-data or sample instance data • Associated confidence score • Meaning (variant of): RS.Person.City RT.Student.City • Schema Matching means “computer-suggested arrows” RS.Person RT.Student First Name Target Schema: RT Source Schema: RS Last Address City City . . . . . . Slide 3
Schema Mapping: “From Arrows to Queries” • Given a set of arrows user input, produce a query that maps instances of RS into instances of RTRT • Transformations, joins [Miller, Has, Hernandez, VLDB 2002]added by, or with help from, the user • Most of this talk is about matching, some implications for mapping later • select concat(First, “ ”,Last) as Name, • City as City • from RS.Person, RS.Education,… • where … RT.Student RS.Person First Name Last Address City City . . . . . . Q: RS -> RT Slide 4
RT.book RS.inv title: string id: integer isbn: string name: string price: float code: string format: string type: integer instock: string descr: string RT.music arrival: date title: string asin: string price: float sale: float label: string Motivation: inventory mapping example • Consider integrating two inventory schemas • Books, music in separate tables in RT • Run some nice schema match software Slide 5
RT.book RS.inv title: string id: integer isbn: string name: string price: float code: string format: string type: integer where type=1 instock: string descr: string RT.music arrival: date title: string asin: string price: float sale: float label: string where type = 2 Inventory where clause • The lines are helpful (schema matching is a best-effort affair), but… • lines are semantically correct only in the context of a selection condition Slide 6
M M RS.aa RT.bb true RS.aa RT.bb RS.c=3 Definition and Goals • Contextual schema match: An arrow between source and target schema elements, annotated with a selection condition • In a standard schema match, the condition “true” is always used • Goal: Adapt instance-driven schema matching techniques to infer semantically valid contextual schema matches, and create schema maps from those matches Slide 7
where Assgn=2 Name Grade1 Grade2 Grade3 … =3 = … Name Assgn Grade Attribute promotion example • Consider integrating data about grade assignments [Fletcher, Wyss, SIGMOD 2005 demo] • Again context is needed, but semantics are slightly different: attribute promotion where Assgn=1 Joe 1 84 Joe 2 86 Joe 3 75 Mary 1 92 Mary 2 94 Mary 3 85 Slide 8
Overview • Motivation • Background • Strawman • Framework • Experimental Evaluation • Related Work • Conclusion Slide 9
M M RS.ac RT1.bb true RS.ac RT1.ac true Background: Instance-level matching Dubious, at best! Nice match! Sunnyvale Los Angeles Cupertino Gilroy San Diego San Jose Cupertino Palo Alto Gilroy Pleasanton Sunnyvale Sunnyvale Los Angeles Cupertino Gilroy San Diego (408) 123-4456 (212) 223-3455 (408) 123-2222 (408) 324-4444 Slide 10
M M RS.ac RT1.bb true RS.ac RT1.ac true More Whatever String Edit Distance Type Expert Cosine Similarity Bayesian Tri-gram Whatever Background: Instance-level matching Dubious, at best! Perfect match! • Coming up with a good score is far from simple! • Derive comparable scores across sample size, data types, etc. Slide 11
M M M M M M M M M M M M RS.db RT1.ar true RS.ac RT1.ac true RS.ac RT1.ac true RS.ba RT1.cd true RS.ba RT1.sb true RS.ac RT1.vw true RS.ac RT1.vw true RS.bd RT1.ad true RS.ba RT1.cd true RS.db RT1.ar true RS.ba RT1.sb true RS.bd RT1.ad true StandardMatch(RS,RT,) Consider all |RS||RT| matches, score them, normalize the scores Rank by normalized score Apply as a cutoff, and return Slide 12
What attributes are candidates for the where clause? We focus on “categorical” attributes (leaving non-categorical attributes as future work) If not identified by schema, infer from sample data, as any attribute with more than 1 value most values associated with more than one tuple Background: Categorical Attributes RS.inv id: integer name: string code: string type: integer instock: string descr: string arrival: date Slide 13
Overview • Motivation • Background • Strawman • Framework • Experimental Evaluation • Related Work • Conclusion Slide 14
Strawman Algorithm • Use instance-based matching algorithm to compute a set of matches, L = M1..Mn, along with associated scores • For each Mi in L, of the form (RS.s,RT.t,true) For each categorical attribute c in the source (or target) For each value v taken by c in the sample • Restrict the sample of RS to tuples where c=v • Re-compute the match score on the new sample • For c,v that most improves score, replace Mi with (RS.s,RT.t,c=v) Slide 15
4. Try each context condition M M M M M M RS.ba RT2.sb true RS.db RT2.ar true RS.ac RT1.ac true RS.ac RT1.vw true RS.ba RT1.cd true RS.bd RT1.ad true RS.d = “open” RS.c = 2 RS.t = 0 RS.c = 2 or RS.c = 3 RS.t = 1 5. Evaluate quality of match M RS.ba RT1.cd Rs.t=1 ContextMatch(RS,RT,) StandardMatch… Rank by normalized score Apply as a cutoff, and return 6. Keep the best! Slide 16
Problems with Strawman • False Positives • the increase in the score may not be meaningful, since some random subsets of corpus will match better than the whole (even with size-adjusted metrics) • False Negatives • original matching algorithm only returned matches with quality above some threshold to be in L, but a match that didn’t make the cut may improve greatly with contextual matching • Time • with disjuncts -- too many expressions to test Slide 17
Strawman 2.0 • Like Strawman, but require an improvement threshold, w, to cut down on false positives • Not much better • Setting w is problematic, as matcher scores are not perfect Slide 18
Overview • Motivation • Background • Strawman • Framework • Experimental Evaluation • Related Work • Conclusion Slide 19
RT.book RS.inv title: string id: integer isbn: string name: string price: float code: string format: string type: integer instock: string descr: string RT.music arrival: date title: string asin: string price: float sale: float label: string Our approach: • Pre-filter conditions based on classification • Find conditions that improve several matches from the same table Slide 20
RS.inv where type = 1 RT.book RS.inv id: integer title: string id: integer isbn: string name: string name: string price: float code: string code: string format: string type: integer type: integer instock: string instock: string descr: string descr: string RT.music arrival: date arrival: date title: string asin: string RS.inv where type = 2 price: float sale: float id: integer name: string label: string code: string type: integer instock: string descr: string arrival: date View-oriented contextual mapping (cont’d) Slide 21
Algorithm ContextMatch(RS,RT,) • L = ; • M = StandardMatch(RS,RT,); • C = InferCandidateViews(RS,M,EarlyDisjuncts); • for c C do • Vc = select * from RSwhere c; • for m M do • m’ := m with RS replaced by Vc; • s := ScoreMatch(m’); • L = L {(m’,s)}; • returnSelectContextualMatches(M, L,EarlyDisjuncts) Slide 22
M M M M M M RS.bd RT1.ad true RS.ac RT1.vw true RS.db RT2.ar true RS.ba RT2.sb true RS.ba RT1.cd true RS.ac RT1.ac true RS.d = “open” RS.c = 2 RS.t = 0 RS.c = 2 or RS.c = 3 RS.t = 1 5. Evaluate quality of matches M RS.ba RT1.cd Rs.t=1 ContextMatch(RS,RT,) StandardMatch… InferCandidateViews Rank by normalized score Apply as a cutoff, and return For each candidate view V, 4. Re-compute summaries for V as: “select * from RS where RS.t = 1” Slide 23
How to Filter Candidate Views • Naïve • Any Boolean condition involving a categorical attribute (strawman approach) • SourceClassifier, TargetClassifier • Check for categorical attributes that do a “good job” categorizing other attributes • Disjunct Handling (early or late) • Conjunct Handling Slide 24
RS.inv id: integer name: string code: string type: integer instock: string descr: string arrival: date Source Classifier Intuition • how well do the categorical attributes serve as classifier labels for the other attributes? Slide 25
Source Classifier Intuition: type • how about ‘type’? Slide 26
Source Classifier Intuition: instock • how about ‘instock’? Slide 27
What do we really mean by a “good job”? • Split the sample into a training set and a testing set (randomly) • For each categorical attribute C and non-categorical attribute A • Train a classifier H by treating the value of A as the document and the value of C as the label • Test H against test set, determine precision, p, and recall, r • Score(C) w.r.t. A based on combination of precision and recall (F = 2pr/(p+r)) • Compare Score(C) to Score(NC), wher NC is a Naïve Classifier: • This classifier chooses most frequent label • C does a good job with H if H’s improvement over Naïve is statistically significant with 95% confidence Slide 28
Target Classifier Intuition • Train a new classifier, T, treating each target schema attribute as a class of documents • Check source values against this classifier • Label each value with best guess label • Use labels instead of values in the same framework Book.comment Book.comment Music.label Slide 29
Handling Disjunctive Conditions • Why Disjuncts? What if type field had separate categories for hardback and paperback? • Two approaches to handling disjunctive conditions, “early” and “late” • Early Disjuncts • InferCandidateViews is responsible for identifying “interesting” disjuncts • Each interesting disjunct is evaluated separately, no overlapping conditions are output • Late Disjuncts • InferCandidateViews returns no disjuncts • All high-scoring conditions are unioned together (Clio semantics), effectively creating a disjunct Slide 30
Early Disjuncts: A Heuristic Approach • When evaluating trained classifier on test set for some categorical attribute C, make note of misclassifications of the form “should be A, but guessed B” • Consider merging the (A,B) pair that would repair most errors • by merge, we mean “replace” A and B values with (A,B) • Re-evaluate • Repeat • Keep all alternatives formed this way that score well • Only accept 1 view that mentions attribute C (don’t union) Slide 31
Handling Conjuncts • Proposed Approach: • Assumes that a good conjunctive view has a good disjunctive view as one of the terms in the conjunct. • Run Context Match Repeatedly • At stage i, consider views VC identified by the previous (i-1)th run as the input base tables • where C was the select condition defining the view • When considering candidate attributes for a run, only consider categorical attributes not in C. • (Conjunct handling not in current experiments) Slide 32
Selecting Contextual Matches • Each view V based on condition c is evaluated, rather than each match • Compute overall confidence of matches from V, and compare to overall confidence from base table • If overall confidence is better than w, use V instead of the base table • If more than one qualifies • If EarlyDisunct, choose the best • Else, take all that are over w Slide 33
Comments on Schema Mapping • Seek to apply the Clio ([Popa et al, VLDB 2002]) approach to mapping construction • Create ‘logical tables’ based on key-foreign key constraints • Two challenges • Extend notion of foreign-key constraints in context of selection views, undecidability result • Extend join rules of [Popa et al, VLDB 2002] to handle the selection views • See paper for details Slide 34
Overview • Motivation • Background • Strawman • Framework • Experimental Evaluation • Related Work • Conclusion Slide 35
Experimental Study • Used schemas from the retail domain • schemas created by students at UW • Aaron, Ryan, Barrett • Populated code, descr info by scraping web-sites, used some name data from Illinois Semantic Integration Archive • ItemType is split, so that instead of just CD, BOOK • e.g. CD1, CD2, BOOK1, BOOK2, =4 • Compare matched edges to correct edges • Accuracy: how many of BOOKi edges go to book target table? • Precision: of the BOOKi edges, how many go to book target? • Fmeas: 2(Accuracy * Precision)/(Accuracy + Precision) Slide 36
View improvement threshold: w • How sensitive is technique to w? • Depends on disjunct strategy • Easier to pick w with EarlyDisjunct Aaron Barett Ryan Slide 37
Strawman • Strawman means • Late disjunct (EarlyDisjunct=false) • Pick best arrow from each source attribute on per-attribute basis (MultiTable) Slide 38
Sensitivity to Decoy Categorical Attributes • Add 3 extra categorical attributes • Vary their correlation with ItemType (higher correlation makes it harder) • Naïve is not only slow, it is overly confusing to the quality metrics • EarlyDisjunct heuristic based on classification helps with quality LateDisjunct EarlyDisjunct Slide 39
Varying schema size • Add n non-categorical attributes to every table, all taken from same domain • Add n/4 categorical attributes to tables with categorical attributes • Early dip is before non-categorical attributes match each other Slide 40
Runtime as schema gets larger • Same experiment, compare runtimes • TgtClass is somewhat higher quality (not shown), but takes much longer for large schemas Slide 41
Name Grade1 Grade2 Grade3 … Name Assgn Grade Grades Example • Create an experiment based on grades example • Artificial data • mean of assignment I is 40 + 10(I-1) (as grades improve) • standard deviation is varied where Assgn=2 =3 = … where Assgn=1 Joe 1 Bob 84 Joe 2 Sue 86 Joe 3 75 Mary 1 92 Mary 2 94 Mary 3 85 Slide 42
Overview • Motivation • Background • Strawman • Framework • Experimental Evaluation • Related Work • Conclusion Slide 44
Related Work • Instance level schema matching • Survey [Rahm, Bernstein, VLDB Journal 2001], Coma [Do, Rahm, VLDB02], Coma++ [SIGMOD 05], iMAP [Doan et al, SIGMOD 01], Cupid [Madhavan, Bernstien, Rahm, VLDB 01], etc. • Schema mapping • Clio [Popa, et al, VLDB 02], [Haas et al, SIGMOD 2005], etc • Model Management (many papers) • Overcoming heterogeneity during match process • Schema Mapping as Query Discovery [Miller, Haas, Hernandez, VLDB 2000] - present user with examples to derive join conditions • MIQIS [Fletcher, Wyss, (demo) SIGMOD 2005] - search through a large space of schema transformations (beyond what is given here), but requires the same data to appear in both source and target • We focus on inferring selection views only, but are very compatible with existing schema match work Slide 45
Conclusions • Contributions • Introduced contextual matching as an important extension to schema matching • Defined a general framework in which instance-level match technique is treated as a black box • Identified two techniques based on classification to find good conditions • Identified filtering criterea for contextual matches • Define contextual foreign key and new join rules to extend a Clio-style schema mapper to better handle contextual matches • Experimental study illustrating time/quality tradeoffs • Future Work • More complex view conditioning (horizontal partitioning + attribute promotion) • Consider taking constraints on target into account in quality functions Slide 46
The End Thank you, any questions?
sizes_fmeas.eps Slide 48
Standard Match Algorithm • StandardMatch(RS,RT, ) • Evaluate quality of match between all pairs of (source, target) attributes • Ignore complex (multi-attribute) matches for simplicity • return matches between source table RS and target schema RT that have confidence threshold >= Slide 49
M M M M M M M RS.ac RT1.ac true RS.ba RT1.cd true RS.ba RT1.sb true RS.db RT1.ar true RS.ac RT1.vw true RS.bd RT1.ad true RS.af RT1.ca true Slide 50