iMAP: Discovering Complex Semantic Matches between Database Schemas  



  1. iMAP: Discovering Complex Semantic Matches between Database Schemas   ITCS6010 Fall 2008 Anuradha Venkataraman 800556407

  2. Introduction • Semantic Mapping: specifies how data in two different data sources are related and how to transform data from one source to the other • It is a two-step process: • find semantic matches between the attributes of the two schemas • location = concat(city, state) • find mappings that transform data in one schema to the other based on the matches found (e.g., SQL queries, Clio) • location = select concat(city, state) from S2

  3. Motivation • Semantic mapping is a fundamental task in data integration (DI) and other data-sharing systems such as data warehousing, EDI, and cross-organization collaborations • At the time the paper was written, semantic mapping was done manually, which was time-consuming and error-prone • The majority of prior work focused on creating 1-1 matches • Complex matches are common in real-world schemas

  4. Matching • 1-1 matches - an attribute in the source schema is matched to a single attribute in the target schema • name = custname • Complex matching - a combination of attributes in the source schema relates to a combination of attributes in the target schema • name = concat(fname, lname) • Finding complex matches is harder because the search space is unbounded.

  5. Problem Definition • Schema matching for relational schemas - the idea can be applied to other data models as well • Goal: provide an interactive design environment where a human can quickly create mappings between schemas using the system's suggestions • Build a semi-automatic system that finds both 1-1 and complex matches between two schemas • Exploit domain knowledge, external data, and overlap data

  6. Overview of the iMAP system • Match generation uses search techniques over the space of candidate matches • Uses a number of searchers (search modules); each searcher explores a subset of the search space • Text searcher - finds matches for textual target attributes using the concatenation operator • Beam search is used to control the exploration of the search space • Machine learning, statistics, and heuristics are used to evaluate candidate matches • The diminishing-returns principle determines when a search should terminate • Considers name similarity, domain knowledge, integrity constraints, external data, and overlap data to re-rank and select the best matches • An explanation module gives the user the reasons behind decisions taken at various stages

  7. Key Contributions • Semi-automatic discovery of complex matches, combining search through a set of candidate matches with methods for evaluating each match • Use of new kinds of domain knowledge (overlap data and mined external data) • A mechanism for explaining the decisions made by the matching system

  8. The iMAP architecture • [Architecture diagram: source schema S and target schema T feed the match generator, which produces match candidates; the similarity estimator builds a similarity matrix; the match selector outputs 1-1 and complex matches; all modules draw on the user, domain knowledge and data, and the explanation module]

  9. The iMAP architecture • Match Generator • Takes the two input schemas (S - source schema, T - target schema) • For each attribute t in T, generates possible match candidates by employing various searchers • Similarity Estimator • Computes a similarity score for each match candidate, indicating its level of similarity to target attribute t • Outputs a similarity score matrix • Match Selector • Selects the best matches based on the similarity scores and domain constraints

  10. The iMAP architecture • The three modules use external data, domain knowledge, and overlap data during processing to improve accuracy and efficiency • The modules also interact with an explanation module that explains the actions each module performs

  11. Finding Candidate Matches • Searches over the space of possible matches • The search space is extremely large • Uses a multi-searcher strategy • Each searcher has a specific purpose • Each searcher explores only a subset of the search space • Advantages: • Makes the system extensible - new searchers can be added and integrated • Each searcher considers small, relevant portions of the search space

  12. Internals of a Searcher • Search Strategy: • Uses beam search to control the search process • Evaluates each candidate with a scoring function and retains only the top k matches at each level • Match Evaluation • Assigns a score based on the semantic distance between the candidate match and the target attribute • Uses machine learning, statistics, and heuristics • Termination • Uses the diminishing-returns principle • Stops when the difference between the best matches at consecutive levels of the iteration falls below a threshold (sketched below)
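
  A minimal Python sketch of the beam-search loop with diminishing-returns termination described above; the expand and score_fn callbacks and the parameter values are illustrative assumptions, not interfaces taken from the paper:

  # Minimal sketch of the beam-search loop used by an iMAP searcher.
  # `expand` generates next-level candidates from a partial match and
  # `score_fn` evaluates a candidate against the target attribute.
  def beam_search(seed_candidates, expand, score_fn, k=10, epsilon=0.01, max_levels=5):
      beam = sorted(seed_candidates, key=score_fn, reverse=True)[:k]
      best_score = score_fn(beam[0]) if beam else 0.0
      for _ in range(max_levels):
          next_level = [new for cand in beam for new in expand(cand)]
          if not next_level:
              break
          beam = sorted(beam + next_level, key=score_fn, reverse=True)[:k]
          new_best = score_fn(beam[0])
          # Diminishing returns: stop when the best score barely improves.
          if new_best - best_score < epsilon:
              break
          best_score = new_best
      return beam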

  13. Numeric Searcher • Handles target attributes that are numeric • price, total, age • Restricted to common operations such as addition, subtraction, division, and multiplication • total = qty * unitprice • Compares the value distributions of the target attribute and the complex match candidate using the Kullback-Leibler (KL) divergence measure • This measure quantifies the divergence between two distributions and is grounded in information theory and probability
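
  The KL-divergence comparison can be illustrated with a short Python sketch; the shared binning step and the probability values below are made-up assumptions:

  import math

  def kl_divergence(p, q, eps=1e-10):
      # D(P || Q) between two discrete value distributions (same binning).
      return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

  # Compare the value distribution of target attribute `total` with that of
  # the candidate expression qty * unitprice after binning both the same way.
  target_hist    = [0.10, 0.40, 0.30, 0.20]
  candidate_hist = [0.10, 0.35, 0.35, 0.20]
  print(kl_divergence(target_hist, candidate_hist))  # close to 0 => similar distributions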

  14. Text Searcher • Examines concatenations of text attributes in S • location = concat(city, state, country) • Uses a Naïve Bayes text classifier to compute the score • The classifier is trained on data from the target schema to learn the target attribute; it returns a probability for each data instance of the source mapping, and the average of these probabilities is the score of the mapping • Starts with 1-1 matches and then explores concatenations using the beam search strategy
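
  A rough Python sketch of the Naïve Bayes scoring step, using scikit-learn as a stand-in since the paper does not prescribe an implementation; function and variable names are assumptions:

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB

  def nb_match_score(target_instances, other_target_instances, candidate_instances):
      # Train on target-schema data: instances of attribute t vs. other attributes.
      texts = target_instances + other_target_instances
      labels = [1] * len(target_instances) + [0] * len(other_target_instances)
      vec = CountVectorizer()
      clf = MultinomialNB().fit(vec.fit_transform(texts), labels)
      # Score of the mapping = average probability that its instances belong to t.
      probs = clf.predict_proba(vec.transform(candidate_instances))[:, 1]
      return probs.mean()

  # e.g. candidate_instances = [f"{c}, {s}" for c, s in zip(city_values, state_values)]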

  15. Category Searcher • A target attribute t is categorical if its data instances have fewer than x distinct values • product-category = product-type • Finds conversions between categorical attributes • waterfront = f(near-water) • Uses the KL divergence measure to determine similarity

  16. Schema Mismatch Searcher • Relates the data of one schema to the schema of the other • Focuses on binary (yes/no) attributes in the target schema • mp3support = {yes, no} • Identifies binary attributes in T and searches for the attribute name in the data of S • description = "…plays mp3, wav…" • Transforms source attribute s into a categorical version of t if s contains more than x instances of t's attribute name • mp3support = yes if description contains "mp3", no otherwise (see the sketch below)
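
  A minimal sketch of the schema-mismatch idea, assuming a simple keyword test and an illustrative threshold; the names below are not from the paper:

  def derive_binary_attribute(source_values, keyword, min_occurrences=3):
      # Count how often the target attribute name (e.g. "mp3") appears in the
      # source data; below the threshold x, assume s does not encode the flag.
      hits = sum(1 for v in source_values if keyword.lower() in v.lower())
      if hits < min_occurrences:
          return None
      return ["yes" if keyword.lower() in v.lower() else "no" for v in source_values]

  descriptions = ["...plays mp3, wav...", "CD player only", "mp3 and FM radio"]
  print(derive_binary_attribute(descriptions, "mp3", min_occurrences=1))  # ['yes', 'no', 'yes']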

  17. Unit Conversion Searcher • Finds matches that are a conversion between two different units • length-mm = 10 * size-cm • Detects physical attributes by looking for units in the data or in the attribute name
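
  A small illustrative sketch of a unit-conversion match; the conversion table and the convert function are assumptions for illustration:

  # The searcher would detect units from attribute names ("length-mm",
  # "size-cm") or from the data itself, then apply a fixed factor.
  CONVERSIONS = {("cm", "mm"): 10.0, ("m", "cm"): 100.0, ("kg", "g"): 1000.0}

  def convert(values, from_unit, to_unit):
      factor = CONVERSIONS.get((from_unit, to_unit))
      if factor is None:
          raise ValueError(f"no known conversion {from_unit} -> {to_unit}")
      return [v * factor for v in values]

  print(convert([2.5, 10.0], "cm", "mm"))  # length-mm = 10 * size-cm -> [25.0, 100.0]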

  18. Date Searcher • Finds complex matches for date attributes • Bdate = bday/bmonth/byear • Uses an ontology to identify date attributes or parts of dates • Bdate maps to the ontology concept Date; bday to Day, etc. • Uses relationships in the ontology to determine the type of conversion

  19. Similarity Estimator • Employs scoring techniques that cannot be used inside the searchers for efficiency reasons • Uses two modules: • Name-based evaluator - similarity between the names (the concatenation of the names in a complex match) and the target attribute name • Naïve Bayes evaluator - based on a Naïve Bayes classifier

  20. Match Selector • Selects the best matches based on the scores and on domain integrity constraints • Example constraint: name does not contain numbers • The match name = login_name {avenkat5} is ranked lower • Uses domain knowledge to clean up complex matches (see the sketch below) • Candidate match - total = price*(qty - product_id), pid = product_id • Domain constraint - total and pid are not related • iMAP drops product_id to get total = price*qty
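
  A minimal sketch of the clean-up step, assuming a candidate match is represented as a list of the source attributes it uses and the constraint as a set of unrelated attribute pairs:

  def clean_match(target_attr, used_attrs, unrelated_pairs):
      # Drop source attributes that domain constraints say are unrelated to the target.
      return [a for a in used_attrs
              if (target_attr, a) not in unrelated_pairs
              and (a, target_attr) not in unrelated_pairs]

  # total = price * (qty - product_id); constraint: total and product_id are unrelated.
  print(clean_match("total", ["price", "qty", "product_id"], {("total", "product_id")}))
  # -> ['price', 'qty'], i.e. total = price * qty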

  21. Using Domain Knowledge • Directs the search process and prunes meaningless candidates early • Example: knowing that fname and location are not related, a searcher does not consider complex matches that combine both attributes • Applies domain knowledge as early as possible to prune the search space • Types of domain knowledge used: • Domain constraints - present in the schema or provided by the user • Used in different phases depending on the type of constraint • Example: if the average of the total number of rooms is less than 10, all matches that evaluate to more than 10 are ignored (see the sketch below); this check may be postponed to a later stage if the evaluation is expensive • Past complex matches - reuses knowledge gained from previously matched schemas in similar domains • Extracts expression templates for complex operations, which the searchers then use • Previous match: cost = price*(1 + 0.5); template: attribute1 = variable*(1 + constant); the matcher looks for similar expressions when building complex mappings • External data - mines properties of attributes from external sources; these properties are used as domain constraints at various stages • The external data source is generally provided by the user (a domain expert) • Example of mined information: the average unit cost does not exceed $1000
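
  A minimal sketch of pruning candidates with a data-level constraint such as "the average number of rooms is less than 10"; the candidate expressions and values shown are made up:

  def satisfies_avg_constraint(evaluated_values, max_avg=10):
      # Data-level check: the evaluated candidate must have an average below max_avg.
      return sum(evaluated_values) / len(evaluated_values) < max_avg

  candidates = {
      "beds + baths":          [3, 4, 5, 6],      # average 4.5 -> kept
      "beds * baths * floors": [24, 36, 40, 48],  # average 37  -> pruned
  }
  kept = [expr for expr, vals in candidates.items() if satisfies_avg_constraint(vals)]
  print(kept)  # ['beds + baths']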

  22. Overlap Data • In practice, real-world source and target schemas often share some data • Common tuples, or data that represent the same entity • This overlap provides important information about attribute mappings that can be exploited • Special overlap searchers are used when overlap data exists between the source and target schemas

  23. Overlap Text Searcher • Used instead of the text searcher • Uses overlap data to re-evaluate the mappings generated by the text searcher • The new score is the fraction of the overlap data that the mapping satisfies (see the sketch below) • Overlap data - John Smith = concat(John, Smith) • The mapping name = concat(fname, city) does not match the overlap data - score 0; name = concat(fname, lname) matches the overlap data - score 1 • Overlap Numeric Searcher • Used instead of the numeric searcher • Uses an equation discovery system to find the arithmetic expression that best matches attribute t • Overlap Category and Schema Mismatch Searchers • Use a technique similar to the overlap text searcher
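
  A minimal sketch of the overlap-based re-scoring, assuming overlapping tuples are available as dictionaries holding both source and target values (an illustrative representation, not the paper's):

  def overlap_score(overlap_rows, candidate_fn, target_attr):
      # Fraction of overlapping tuples for which the candidate expression
      # reproduces the stored target value.
      hits = sum(1 for row in overlap_rows if candidate_fn(row) == row[target_attr])
      return hits / len(overlap_rows)

  rows = [{"fname": "John", "lname": "Smith", "city": "Charlotte", "name": "John Smith"}]
  print(overlap_score(rows, lambda r: f"{r['fname']} {r['lname']}", "name"))  # 1.0
  print(overlap_score(rows, lambda r: f"{r['fname']} {r['city']}", "name"))   # 0.0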

  24. Generating Explanations • Explanations help users better understand the system • The system uses complex processes, so its decisions need to be explained to the user • This helps the user guide the system toward the correct matches • Types of questions a user can ask: • Explain a match - why was numrooms = baths + beds + floor generated? • Explain why a match is absent - why was price = listprice + agent_fee not generated? • Explain the ranking between matches - why is price = listprice ranked higher than price = listprice + agent_fee? • Questions can be directed to any specific module

  25. Explanation Module • Uses a dependency graph that is built as processing flows through each module • Records matches, assumptions, and data • Nodes: attributes, assumptions, candidate matches, and pieces of domain knowledge, connected by directed edges labeled with the module responsible for the decision
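
  A minimal sketch of such a dependency graph using plain Python structures; the node labels and module names are illustrative, not the paper's actual representation:

  class DependencyGraph:
      def __init__(self):
          self.edges = []  # (from_node, to_node, module_label)

      def record(self, source, target, module):
          self.edges.append((source, target, module))

      def explain(self, node):
          # Return every recorded decision that directly led to `node`.
          return [(s, m) for s, t, m in self.edges if t == node]

  g = DependencyGraph()
  g.record("city, state", "location = concat(city, state)", "text searcher")
  g.record("location = concat(city, state)", "final match: location", "match selector")
  print(g.explain("final match: location"))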

  26. Explanation - example

  27. Experiments and Evaluation • Domains and data • Cricket, Inventory, Financial, Real Estate • Collected from various sources • Uses both schemas with overlap data and disjoint schemas • Performance measures • Top-1 matching accuracy - the percentage of attributes for which the best-ranked match is the correct match • Top-3 matching accuracy - the percentage of attributes for which the top 3 ranked matches contain the correct match; more relevant, since the tool is intended only to suggest a ranked list of matches
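
  A small sketch of the two measures; the rankings and ground truth below are made up for illustration:

  def top_k_accuracy(ranked_matches, true_matches, k):
      # Fraction of target attributes whose correct match appears in the top k candidates.
      correct = sum(1 for attr, ranking in ranked_matches.items()
                    if true_matches[attr] in ranking[:k])
      return correct / len(ranked_matches)

  ranked = {"name":  ["concat(fname, lname)", "fname", "lname"],
            "total": ["qty", "qty * unitprice", "unitprice"]}
  truth  = {"name": "concat(fname, lname)", "total": "qty * unitprice"}
  print(top_k_accuracy(ranked, truth, 1))  # 0.5
  print(top_k_accuracy(ranked, truth, 3))  # 1.0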

  28. 1-1 Mapping Accuracy • Overlap data • Default iMAP - 58-74% accuracy • Exploiting domain knowledge - 68-92% • Disjoint data • Default iMAP - 55-76% accuracy • Exploiting domain knowledge - 62-79% • Top-1 accuracy - 62-92% • Top-3 accuracy - 64-95% • Exploiting domain knowledge and the presence of overlap data increases accuracy

  29. Complex Matching Accuracy • Default system - 33-55% • Exploiting domain constraints and overlap data - 50-86% • Top-1 accuracy • Disjoint data - 27-58% • Overlap data - 70-85% • Top-3 accuracy • Disjoint data - 43-92% • Overlap data - 43-92% • The benefit of exploiting domain constraints and overlap data is significant • High accuracy in top-3 matches for complex matches

  30. Why 100% Is Not Achieved • Top-3 accuracy is better than top-1 • Removing small attributes that act as noise in a complex match is difficult • Phone = concat(id, areacode, number); id is a small value that acts as noise • It is also difficult to find small attributes that form parts of a complex match • Address = concat(street, city, state); apt# is a small value that is missed

  31. Related Work • L. Xu and D. Embley. Using domain ontologies to discover direct and indirect matches for schema elements. In Proc. of the Semantic Integration Workshop at ISWC-2003. • Uses a domain ontology to find relationships and mappings between two schemas • Can be useful in some contexts, such as the date searcher • More such modules could be added • Clio - complementary work that takes schema matches and provides an interactive system for generating transformations from one schema to another

  32. Conclusion • Finding complex matches is a difficult task • iMAP identifies both 1-1 and complex matches • Uses search modules to generate matches • Exploits domain knowledge and overlap data to improve accuracy • Provides an explanation feature to better assist the user • The accuracy levels obtained in the experiments show that, with further fine-tuning and improvement, the system could do even better

  33. Future Work • Handle data sources other than relational ones • Provide a better style of user interaction • Integrate more ontology-based searchers

  34. Discussion • How is overlap data identified? • How are domain constraints taken as input? • How does the user interact with the system? • How can the accuracy be improved? • Can the system learn from past experiences, such as user suggestions (not just past matches)? • Is it scalable? Can it handle multi-schema matching (as in DI systems, where a number of schemas are matched against a mediated schema)? • What other search modules do you think should be integrated? • How efficient is it when handling complex schemas (the schemas used in the experiments were rather small)? • Can it be extended to find complex matches spanning multiple tables more accurately?

  35. Thank You
