iMAP: Discovering Complex Semantic Matches Between Database Schemas Ohad Edry January 2009 Seminar in Databases
Motivation • Consider merging the databases of two banks. • We need to generate a mapping between their schemas. [Figure: Bank A tables and Bank B tables]
Introduction • Semantic mappings specify the relationships between data stored in disparate sources. • A mapping relates an attribute of the target schema to attributes of the source schema, according to their semantics.
Introduction • Most of the work in this field has focused on the matching process. • Matches fall into two types: • 1-1 matching. • Complex matching – a combination of attributes in one schema corresponds to a combination of attributes in the other schema. • Match candidate – any proposed matching of attributes from the source and target schemas.
Motivation – Example (cont.) • Semantic mapping! [Figure: Bank A tables and Bank B tables, annotated with a 1-1 matching candidate and a complex matching candidate]
Introduction – examples • Example 1 (1-1 matching): Name = Student, Address = Location, Phone = Cellular. • Example 2 (complex matching, Company A vs. Company B): Product Price = Price * (1 - Discount)
Difficulties in Generating Matchings • Finding matches is difficult because: • Finding complex matches is not trivial at all – how should the system know that Product Price = Price * (1 - Discount)? • The number of candidates for complex matches is large. • Sometimes tables must be joined to express a match such as Product Price = Price * (1 - Discount).
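To see why the candidate space is large, consider this small illustrative sketch (not iMAP code; the attribute names are invented): even when restricted to single attributes and two-attribute arithmetic combinations, the number of candidates already grows quadratically with the number of numeric attributes.

```python
from itertools import combinations

def arithmetic_candidates(attrs, ops=("+", "-", "*", "/")):
    """Enumerate 1-1 candidates plus all two-attribute arithmetic candidates."""
    cands = list(attrs)  # 1-1 candidates
    for a, b in combinations(attrs, 2):
        for op in ops:
            cands.append(f"{a} {op} {b}")
    return cands

attrs = ["price", "discount", "tax", "fee"]
cands = arithmetic_candidates(attrs)
print(len(cands))  # 4 singles + C(4,2) pairs * 4 operators = 4 + 24 = 28
```

With longer expressions and joins the space grows far faster, which is why iMAP treats generation as a search problem.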
Main Parts of the iMAP System • Generating matching candidates. • Pruning matching candidates by exploiting domain knowledge. • Explaining match predictions – provides an explanation for each selected predicted match, which makes the system semi-automatic.
iMAP System Architecture • Consists of three main modules: • Match Generator – generates the matching candidates, using specialized searchers over the target and source schemas. • Similarity Estimator – generates a matrix that stores the similarity score of each (target attribute, match candidate) pair. • Match Selector – examines the score matrix and outputs the best matches subject to certain constraints.
iMAP System Architecture – cont. • For each attribute t of T, iMAP generates match candidates from S. • Similarity Estimator: receives the match candidates and outputs a similarity matrix. • Match Selector: receives the similarity matrix and outputs the final match candidates.
Part 1: Match Generation – searchers • The key to match generation is to SEARCH through the space of possible match candidates. • Search space – all attributes and data in the source schemas. • Searchers rely on knowledge of operators and attribute types, such as numeric and textual, and on heuristic methods.
The Internals of Searchers • Search strategy – the large space is handled with standard beam search. • Match evaluation – assigning a score that approximates the distance between the candidate and the target. • Termination condition – the search must be cut off early because the search space is large.
The Internals of Searchers – Example • At each iteration i, the searcher keeps the k best candidates (e.g., 1. Product Price = Price * (1 - Discount), 2. Product Price = Product ID, …, k. …); let MAXi be the best score after iteration i. • Stop when the best score no longer improves, i.e., MAXi - MAXi+1 < delta, and return the first k candidates.
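The iteration scheme above can be sketched as a generic beam search (a hypothetical simplification, not the iMAP implementation; `expand` and `score` stand in for the searcher's candidate-transformation and evaluation functions):

```python
def beam_search(initial, expand, score, k=5, delta=1e-3, max_iters=20):
    """Keep the k best candidates per iteration; stop when the best
    score no longer improves by at least delta (the MAXi test)."""
    beam = sorted(initial, key=score, reverse=True)[:k]
    best = score(beam[0])
    for _ in range(max_iters):
        children = [c for cand in beam for c in expand(cand)]
        beam = sorted(set(beam) | set(children), key=score, reverse=True)[:k]
        new_best = score(beam[0])
        if abs(new_best - best) < delta:  # MAXi - MAXi+1 < delta
            break
        best = new_best
    return beam

# Toy usage: candidates are integers, the score saturates at 10
result = beam_search([0], lambda n: [n + 1], lambda n: min(n, 10), k=3)
```

The beam width k and the threshold delta are the two knobs that trade search coverage against running time.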
The Internals of Searchers – Join Paths • Matches over join paths are found in two steps: • First step – create join paths between tables: Join(T1, T2). • Second step – the search process uses the join paths. • Example (Company A vs. Company B): Product Price = Price * (1 - Discount)
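A minimal sketch of the two steps (hypothetical table and attribute names; real join-path discovery would follow declared keys or foreign keys): first materialize the join, then let the searchers evaluate candidate expressions over the joined rows.

```python
def join_rows(t1, t2, key):
    """Step 1: nested-loop join of two tables (lists of dicts) on a shared key."""
    return [{**r1, **r2} for r1 in t1 for r2 in t2 if r1[key] == r2[key]]

products = [{"pid": 1, "price": 100.0}, {"pid": 2, "price": 50.0}]
discounts = [{"pid": 1, "discount": 0.1}, {"pid": 2, "discount": 0.2}]
joined = join_rows(products, discounts, "pid")

# Step 2: searchers can now test candidates such as price * (1 - discount)
candidate_values = [r["price"] * (1 - r["discount"]) for r in joined]
```

In iMAP the join paths are computed once per table pair, so every searcher can reuse them.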
Implemented searchers in iMAP • iMAP contains the following searchers: • Text • Numeric • Category • Schema Mismatch • Unit Conversion • Date • Overlap versions of Text, Numeric, Category, Schema Mismatch, and Unit Conversion
Implemented searchers – Text Searcher example • Text Searcher: • Purpose: finds matching candidates that are concatenations of text attributes. • Method: for a target attribute (e.g., Name), the search space consists of the source-schema attributes with textual properties; the searcher considers single attributes or concatenations of attributes from this space.
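A toy version of the concatenation step (illustrative only; iMAP's actual Text Searcher also scores the candidates and prunes them with beam search):

```python
from itertools import permutations

def concat_candidates(row, text_attrs, max_len=2):
    """All concatenations of up to max_len textual attributes,
    keyed by the candidate expression they represent."""
    out = {}
    for n in range(1, max_len + 1):
        for combo in permutations(text_attrs, n):
            out[" + ".join(combo)] = " ".join(row[a] for a in combo)
    return out

row = {"first": "Ada", "last": "Lovelace"}
cands = concat_candidates(row, ["first", "last"])
# e.g. the candidate "first + last" yields the value "Ada Lovelace"
```

Each candidate's values would then be compared against the target attribute's data (e.g., against full names) to pick the best concatenation.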
Implemented searchers – Numeric Searcher example • Numeric Searcher: • Purpose: finds the best matches for numeric attributes. • Issues: computing the similarity score of a complex match; comparing value distributions. • Types of matches: arithmetic combinations (+, -, *, /) of one or two columns, e.g., dim1 * dim2 = size.
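One simple way to compare value distributions, as a stand-in for the Numeric Searcher's evaluation (the scoring formula here is invented for illustration; it just rewards candidates whose mean and spread resemble the target's):

```python
from statistics import mean, stdev

def distribution_score(target_vals, candidate_vals):
    """Crude distribution similarity: closer means and standard
    deviations give a higher score, capped at 1.0."""
    d = abs(mean(target_vals) - mean(candidate_vals)) + \
        abs(stdev(target_vals) - stdev(candidate_vals))
    return 1.0 / (1.0 + d)

target = [50.0, 60.0, 30.0]            # e.g., Product Price in the target
price = [100.0, 80.0, 40.0]
discount = [0.5, 0.25, 0.25]
candidate = [p * (1 - d) for p, d in zip(price, discount)]
score = distribution_score(target, candidate)  # identical distributions score 1.0
```

A real evaluator would use a more robust statistic than mean and standard deviation, but the principle is the same: the candidate expression's computed values are compared against the target attribute's observed values.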
Implemented searchers in iMAP – cont. • Category Searcher: finds matches between categorical attributes in the source and target schemas. • Schema Mismatch Searcher: relates the data of one schema to the schema of the other; such mismatches occur very often. • Unit Conversion Searcher: finds matches between different types of units. • Date Searcher: finds complex matches for date attributes.
Part 2: Similarity Estimator • Receives from the Match Generator candidate matches ranked by the score each searcher assigns. • Problem: each searcher may score on a different scale. • Solution: compute a final, more accurate score for each match using additional types of information. • iMAP uses evaluator modules: • Name-based evaluator – computes a score based on the similarity of attribute names. • Naive Bayes evaluator. • Why not perform this phase during the search phase? It would be very expensive!
Module example – Naive Bayes evaluator • Consider the match agent-address = location. • Building the model: data instances of the target attribute are labeled positive; all other data are labeled negative. • A Naive Bayes classifier learns the model. • The trained classifier is applied to the source attribute's data. • Each data instance receives a score. • The average over all scores is returned as the result.
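A minimal word-level Naive Bayes sketch of this evaluator (illustrative; the real evaluator's features and smoothing may differ):

```python
from collections import Counter
import math

def train_nb(positive, negative):
    """Count word frequencies for the positive (target-attribute)
    and negative (all other) training instances."""
    pos, neg = Counter(), Counter()
    for s in positive:
        pos.update(s.lower().split())
    for s in negative:
        neg.update(s.lower().split())
    return pos, neg, len(set(pos) | set(neg))

def log_odds(instance, model):
    """Add-one-smoothed log-odds that an instance is positive."""
    pos, neg, v = model
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    score = 0.0
    for w in instance.lower().split():
        score += math.log((pos[w] + 1) / (n_pos + v))
        score -= math.log((neg[w] + 1) / (n_neg + v))
    return score

# Positives: values of the target attribute; negatives: other attributes' values
model = train_nb(["12 main st", "5 oak ave"], ["john smith", "jane doe"])
scores = [log_odds(v, model) for v in ["7 main st", "9 oak ave"]]
avg = sum(scores) / len(scores)  # a positive average suggests a match
```

The final evaluator score is exactly this kind of average over all data instances of the candidate source attribute.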
Part 3: Match Selector • Receives the scored match candidates from the Similarity Estimator. • Problem: these matches may violate certain domain integrity constraints – for example, mapping two source attributes to the same target attribute. • Solution: a set of domain constraints, defined by domain experts or users.
Constraint Example • The Match Selector receives a list of candidates, e.g., candidate k: Product Price = Price + Club Members Price. • Constraint: Price and Club Members Price are unrelated. • The Match Selector deletes this match candidate.
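The pruning step can be sketched as follows (hypothetical candidate representation; iMAP supports richer constraint types than the "unrelated" pairs shown here):

```python
def select_matches(candidates, unrelated_pairs):
    """Drop candidates whose source attributes include a pair
    that a domain constraint declares unrelated."""
    kept = []
    for target, source_attrs in candidates:
        if any({a, b} <= set(source_attrs) for a, b in unrelated_pairs):
            continue  # violates an "unrelated" constraint
        kept.append((target, source_attrs))
    return kept

cands = [("product_price", ["price", "club_members_price"]),
         ("product_price", ["price", "discount"])]
constraints = [("price", "club_members_price")]
selected = select_matches(cands, constraints)
# Only the (price, discount) candidate survives
```

Because constraints are declarative, adding one new "unrelated" pair can prune many candidates at once.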
Exploiting Domain Knowledge • iMAP uses 4 different types of knowledge: • Domain constraints • Past matches • Overlap data • External data • iMAP applies its knowledge at all levels of the system, and as early as it can in match generation.
Types of knowledge • Domain constraints – three cases: • Attributes of the source schema are unrelated (e.g., Name and ID are unrelated) – exploited by the searchers. • A constraint on a single target attribute t (e.g., Account < 10000) – exploited by the Similarity Estimator and the searchers. • Attributes of the target schema are unrelated (e.g., Account and ID are unrelated) – exploited by the Match Selector.
Types of knowledge – cont. • Past complex matches – the Numeric Searcher can reuse past expressions as templates: Price = Price*(1-Discount) generates the template VARIABLE*(1-VARIABLE). • External data – using external sources to learn about attributes and their data: given a target attribute and a useful feature of that attribute, iMAP learns about its value distribution (example: the number of cities per state).
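Turning a past match into a reusable template might look like this (a regex-based sketch; the attribute names are illustrative, and the real template mechanism may differ):

```python
import re

def to_template(expression, attribute_names):
    """Replace known attribute names with VARIABLE so a past
    complex match can seed the search in a new domain."""
    pattern = r"\b(" + "|".join(map(re.escape, attribute_names)) + r")\b"
    return re.sub(pattern, "VARIABLE", expression)

template = to_template("Price*(1-Discount)", ["Price", "Discount"])
# "Price*(1-Discount)" becomes "VARIABLE*(1-VARIABLE)"
```

The Numeric Searcher can then instantiate VARIABLE with numeric attributes of the new source schema, rather than searching the full expression space from scratch.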
Types of knowledge – cont. • Overlap data – provides additional information for the mapping process. • iMAP contains searchers that can exploit overlap data: the Overlap Text, Category & Schema Mismatch searchers. • Example: S and T share a state listing; for the candidate matches city = state and country = state, re-evaluating on the overlap data scores city = state at 0 and country = state at 1. • Overlap Numeric Searcher – using the overlap data and the equation discovery system LAGRAMGE, the best arithmetic expression for t is found.
Generating Explanations • One goal is to provide a design environment in which the user inspects the matches predicted by the system and modifies them manually, giving the system feedback. • Since the system uses complex algorithms, it needs to explain its matches to the user. • Explanations help the user as well: • Correct matches quickly. • Tell the system where its mistake is.
Generating Explanations – so, what do you want to know about the matches? • iMAP defines 3 main questions: • Explain an existing match – why is a certain match X present in the output of iMAP? Why did the match survive the whole process? • Explain an absent match – why is a certain match Y not present in the output of iMAP? • Explain match ranking – why is match X ranked higher than match Y? • Each of these questions can be asked of each module of iMAP, and questions can be reformulated recursively for underlying components.
Generating Explanations – Example • Suppose we have 2 real-estate schemas. • iMAP produces the ranked matches: • (1) List-price = price * (1 + monthly-fee-rate) • (2) List-price = price • iMAP's explanation: both matches were generated by the Numeric Searcher, and the Similarity Estimator agreed with the ranking.
Generating Explanations – Example (cont.) • The current order: • (1) List-price = price * (1 + monthly-fee-rate) • (2) List-price = price • The Match Selector has 2 constraints: (1) month-posted = monthly-fee-rate, (2) month-posted and price do not share common attributes. • As a result, the List-price = price match is selected.
Generating Explanations – Example (cont.) • The current order: • (1) List-price = price • (2) List-price = price * (1 + monthly-fee-rate) • iMAP explains that the source of the month-posted = monthly-fee-rate match is the Date Searcher. • The user corrects iMAP: monthly-fee-rate is not a date type.
Generating Explanations – Example (cont.) • List-price = price * (1 + monthly-fee-rate) is again the chosen match. • The final order: • (1) List-price = price * (1 + monthly-fee-rate) • (2) List-price = price
Example cont. – generated dependency graph • The dependency graph is small: searchers produce only the k best matches, and iMAP goes through only three stages.
What do you want to know about the matches? • Why is a certain match X present in the output of iMAP? – Return the part of the graph that describes the match. • Why is match X ranked higher than match Y? – Return the part of the graph that compares the two matches. • Why is a certain match Y not present in the output of iMAP? – If the match was eliminated during the process, the part of the graph responsible for the elimination explains why; otherwise, iMAP asks the searchers to check whether they can generate the match and to explain why it was not generated.
Evaluating iMAP on real-world domains • iMAP was evaluated on 4 real-world domains: • For the Cricket domain, 2 independently developed databases were used. • For the other 3 domains, one real-world source database was used, together with a target schema created by volunteers. • Both databases with overlapping data and databases with disjoint data were used.
Evaluating iMAP on real-world domains – cont. • Data preprocessing: removing values such as "unknown" and adding the most obvious constraints. • Experiments: there are actually 8 experimental domains – 2 versions of each domain, one with overlapping data and one with disjoint data. • Performance measures: • top-1 matching accuracy • top-3 matching accuracy • complex match accuracy • partial complex match accuracy
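The top-k accuracy measure can be computed as in this sketch (the candidate strings are invented for illustration):

```python
def top_k_accuracy(ranked_predictions, correct, k):
    """Fraction of target attributes whose correct match appears
    among the top-k ranked candidates."""
    hits = sum(1 for t, ranked in ranked_predictions.items()
               if correct[t] in ranked[:k])
    return hits / len(ranked_predictions)

preds = {"price": ["price*(1-discount)", "price"],
         "name": ["student", "location"]}
truth = {"price": "price*(1-discount)", "name": "student"}
top1 = top_k_accuracy(preds, truth, 1)  # both correct matches rank first
```

Top-3 accuracy is the same computation with k = 3; partial complex-match accuracy additionally gives credit when the predicted expression contains some, but not all, of the correct attributes.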
Results (1) • Overall and 1-1 matching accuracy: • Not shown in the figure, but according to the article, the top-3 accuracy is even higher, and iMAP also achieves top-1 and top-3 accuracies of 77%-100% for 1-1 matching. • (a) Exploiting domain constraints and overlap data improves accuracy. • (b) Disjoint domains achieve lower accuracy than overlap-data domains.
Results (2) Complex matching accuracy – Top 1 and Top 3:
Results (2) – cont. • Complex matching accuracy – Top 1: • Results are low for the default iMAP (for example: 9% on Inventory), both in overlap domains and in disjoint domains. • (a) Exploiting domain constraints and overlap data improves accuracy. • (b) In disjoint domains, iMAP achieves lower accuracy than in overlap-data domains: the missing overlap data decreases the accuracy of the Numeric Searcher and the Text Searcher.