A survey of approaches to automatic schema matching
Sushant Vemparala, Gaurang Telang
Motivating Example
• Assume UTA needs to integrate 40 databases from its different schools, with a total of 27,000 elements.
• Integrating them manually would take approximately 12 person-years.
• How would you reduce the manual burden?
Schema Matching

Schema 1:
<Schema name="Schema T">
  <ElementType name="Customer">
    <element type="FName"/>
    <element type="LName"/>
    <element type="CAddress"/>
  </ElementType>
  <ElementType name="CAddress">
    <element type="street"/>
    <element type="city"/>
    <element type="province"/>
    <element type="code"/>
  </ElementType>
</Schema>

Schema 2:
<Schema name="Schema S">
  <ElementType name="AccountOwner">
    <element type="Name"/>
    <element type="Address"/>
    <element type="BirthDate"/>
  </ElementType>
  <ElementType name="Address">
    <element type="street"/>
    <element type="city"/>
    <element type="state"/>
    <element type="ZIP"/>
  </ElementType>
</Schema>
Schema Matching Definition
Schema matching is the task of finding the semantic correspondences between elements of two schemas.
• Input: schemas S1 and S2
• Auxiliary information: user feedback, dictionaries, previous mappings
• Output of Match: the match result
Application Domains
• Schema integration: developing a global view over a set of independently developed schemas
• Comparing data schemas:
  - Items from different shopping sites
  - Merger between two corporations
• Preparing data for data warehousing and analysis processes
Any other examples?
High Level Architecture of Generic Match http://db18.informatik.uni-leipzig.de:8080/WebEdition/
Classification of Schema Matching Approaches
1) Schema-level matching
   Granularity of schema-level matching:
   • Element level
   • Structure level
2) Instance-level matching
3) Hybrid and composite matching
Schema Level Matching
• Uses only schema-level information (no data content)
• Properties considered: name, description, data type, is-a/part-of relationships, constraints, and structure
• Match finds match candidates, each with a similarity value
Granularity: Element Level
• For each element of the source schema, determine the matching elements in the target schema
• Element level:
  - atomic level (attributes in an XML schema)
  - higher level (columns in relational tables)
E.g.: Address = CustomerAddress
Granularity: Structure-Level
• Matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
• Full structure match vs. partial structure match
Granularity: Structure-Level (Contd)
• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library
Matching Cardinality
• One or more S1 elements can match one or more S2 elements.
• Possible cardinalities: 1:1, 1:n, n:1, m:n
Instance Level Matching
• Gives insight into the contents and meaning of schema elements
• Useful when schema information is limited and when semi-structured data is used
• Can correct an incorrect interpretation of schema-level information
E.g.: X is a match candidate for both CompanyName and Manufacturer; instance data can help choose between them
Techniques for Schema Level Matching
• Linguistic approaches
  Name based (equality of names):
  - equality of canonical names (Cust# = CustNo)
  - equality of synonyms (make = brand)
  - equality of hypernyms (book is-a publication and article is-a publication implies book = article)
Techniques for Schema Level Matching: Name Matching (Contd)
• Similarity based on pronunciation or Soundex (ship2 = ShipTo)
• User-provided name matches (issue = bug)
• Not limited to 1:1 matches (phone = {homePhone, officePhone})
• Context based: payroll application (salary = income) vs. tax reporting application (salary != income)
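The name-based checks listed on the two slides above can be sketched in a few lines of Python. This is only an illustration, not the matcher of any surveyed system: the lookup tables (`SYNONYMS`, `ABBREVIATIONS`, `DIGIT_WORDS`) and the function names are hypothetical, and a real matcher would draw on a thesaurus or dictionary. Soundex here is the standard American Soundex code.

```python
import re

# Hypothetical lookup tables for illustration only.
SYNONYMS = {frozenset({"make", "brand"}), frozenset({"issue", "bug"})}
ABBREVIATIONS = {"#": "no"}
DIGIT_WORDS = {"2": "two", "4": "four"}
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def canonical(name):
    """Lower-case a name and expand known abbreviations (Cust# -> custno)."""
    tokens = re.findall(r"[A-Za-z]+|\d+|#", name)
    return "".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens)


def soundex(word):
    """American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c in "abcdefghijklmnopqrstuvwxyz"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            result += code
        if c not in "hw":  # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]


def spell_digits(name):
    """Spell out digits so that 'ship2' sounds like 'shiptwo'."""
    return "".join(DIGIT_WORDS.get(c, c) for c in name)


def name_match(a, b):
    """Return which name criterion matched, or None if none did."""
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return "canonical"
    if frozenset({ca, cb}) in SYNONYMS:
        return "synonym"
    if soundex(spell_digits(a)) == soundex(spell_digits(b)):
        return "soundex"
    return None


for pair in [("Cust#", "CustNo"), ("make", "brand"), ("ship2", "ShipTo")]:
    print(pair, name_match(*pair))
```

Soundex is a loose test and produces false positives on short names, so in practice it would be one signal among several rather than a final decision.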
Techniques for Schema Level Matching
• Description based
  E.g.: comments in schema elements
Techniques for Schema Level Matching
• Constraint based
  - E.g.: data types and value ranges, optionality, relationship types, cardinalities, etc.
  - Combined with other matchers to limit match candidates
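As a rough sketch of how constraints can limit match candidates, the snippet below drops candidate pairs whose data types or value ranges are incompatible. The `Element` class, the `TYPE_FAMILY` table, and the example elements are assumptions made for illustration; they do not come from any of the surveyed systems.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical grouping of concrete data types into compatible families.
TYPE_FAMILY = {"int": "numeric", "decimal": "numeric", "float": "numeric",
               "char": "text", "varchar": "text", "date": "date"}


@dataclass(frozen=True)
class Element:
    name: str
    dtype: str
    value_range: Optional[Tuple[float, float]] = None


def compatible(a: Element, b: Element) -> bool:
    """True if neither the data types nor the value ranges rule the pair out."""
    if TYPE_FAMILY.get(a.dtype, a.dtype) != TYPE_FAMILY.get(b.dtype, b.dtype):
        return False
    if a.value_range and b.value_range:
        (lo_a, hi_a), (lo_b, hi_b) = a.value_range, b.value_range
        if hi_a < lo_b or hi_b < lo_a:  # ranges do not overlap
            return False
    return True


def limit_candidates(pairs: List[Tuple[Element, Element]]):
    """Keep only the candidate pairs that pass the constraint check."""
    return [(a, b) for a, b in pairs if compatible(a, b)]


emp_no = Element("EmpNo", "int", (1000, 9999))
dept_no = Element("DeptNo", "int", (1, 99))
salary = Element("Salary", "decimal")
income = Element("Income", "float")
kept = limit_candidates([(emp_no, dept_no), (salary, income)])
print([(a.name, b.name) for a, b in kept])
```

The check only removes candidates; it says nothing about whether the surviving pairs really correspond, which is why the slide pairs it with other matchers.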
Techniques for Schema Level Matching
• Reusing schema and mapping information
  - Idea: schemas from the same domain are often very similar, e.g. address fields and name fields are repeated
  - Create a schema library that schema editors can access (analogy: XML namespaces)
  - S -> S2 (known). Goal: S1 -> S? S1 -> S2? (easy to find)
Techniques for Instance Level
• IR techniques (measures such as the Jaccard coefficient)
• Constraint-based characterization (EmpNo range vs. DeptNo range)
• Auxiliary information
• Learning (e.g. evaluate S1 contents to build Characterization 1, then evaluate S2 contents against Characterization 1)
Drawback of instance-based matching?
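The Jaccard coefficient mentioned above is the size of the intersection of two value sets divided by the size of their union. A minimal sketch, with made-up instance values for two columns:

```python
def jaccard(values_a, values_b):
    """Jaccard coefficient |A & B| / |A | B| of two collections of values."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty columns
    return len(a & b) / len(a | b)


# Hypothetical instance values of two columns from different sources.
company_name = ["Acme", "Globex", "Initech"]
manufacturer = ["Globex", "Initech", "Umbrella"]
print(jaccard(company_name, manufacturer))  # 2 shared values out of 4 distinct
```

A high coefficient suggests that the two columns draw from the same value domain, which is evidence for a match even when the column names differ.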
Combining Matchers: Hybrid Matcher
• Integrates multiple matching criteria
  E.g.: a matcher with name matching and constraint-based matching
• Single pass
• Matching criteria are hard-wired
Combining Matchers: Composite Matcher
• Combines the results of several independently executed matchers
• Iterative (the match result of the 1st matcher is consumed by the 2nd matcher)
• Flexible ordering
Which is more efficient: hybrid or composite?
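One simple way to combine independently executed matchers is a weighted average of their similarity scores followed by a threshold. The sketch below assumes that combination rule; the two toy matchers, the weights, and the threshold are hypothetical choices, and real composite matchers may combine results differently.

```python
def name_equality(a, b):
    """1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def name_containment(a, b):
    """1.0 if one name contains the other ignoring case, else 0.0."""
    la, lb = a.lower(), b.lower()
    return 1.0 if la in lb or lb in la else 0.0


def composite_match(s1, s2, matchers, threshold=0.5):
    """Run every matcher on every pair and keep pairs whose weighted
    average score reaches the threshold."""
    total_weight = sum(weight for _, weight in matchers)
    result = {}
    for a in s1:
        for b in s2:
            score = sum(w * m(a, b) for m, w in matchers) / total_weight
            if score >= threshold:
                result[(a, b)] = score
    return result


matches = composite_match(
    ["Address", "street", "city"],
    ["CAddress", "street", "code"],
    matchers=[(name_equality, 1.0), (name_containment, 1.0)],
)
print(matches)
```

Because each matcher is just a function, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the slide contrasts with a hard-wired hybrid matcher.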
How good is a Match?
• Assessing match quality is difficult
• Human verification and tuning of the matching is often required
• A useful metric would be the amount of human work required to reach the perfect match
Recall: what fraction of the good matches did we show?
Precision: what fraction of the matches we show are good?
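Precision and recall can be computed by comparing the proposed matches against a manually verified reference set. A small sketch; the element pairs are invented for the example:

```python
def precision_recall(proposed, reference):
    """Return (precision, recall) of proposed matches against a reference set."""
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical matcher output and hand-checked reference matches.
proposed = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")}
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
             ("a4", "b4"), ("a5", "b5"), ("a6", "b6")}
print(precision_recall(proposed, reference))  # 3 of 4 shown are good; 3 of 6 good ones shown
```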
Current Work • LSD • SKAT • Similarity Flooding
LSD (Learning Source Descriptions)
• Produces 1:1 instance-level mappings
Suppose a user wants to integrate 100 data sources
• User:
  - manually creates mappings for a few sources, say 3
  - shows LSD these mappings
• LSD learns from the mappings
• "Multi-strategy" learning incorporates many types of info in a general way
• Knowledge of constraints further helps
• LSD proposes mappings for the remaining 97 sources
LSD: Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

realestate.com data:
  location      listed-price   phone            comments
  Miami, FL     $250,000       (305) 729 0831   Fantastic house
  Boston, MA    $110,000       (617) 253 1429   Great location
  ...           ...            ...              ...

Learned hypotheses:
• If "phone" occurs in the name => agent-phone
• If "fantastic" & "great" occur frequently in data values => description

homes.com data:
  price         contact-phone    extra-info
  $550,000      (278) 345 7215   Beautiful yard
  $320,000      (617) 335 2315   Great beach
  ...           ...              ...
LSD: Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

realestate.com listings:
  <location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>
  <location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>

Name Learner training examples:
  (location, address)
  (listed-price, price)
  (phone, agent-phone)
  (comments, description)
  ...

Naive Bayes Learner training examples:
  ("Miami, FL", address)
  ("$ 250,000", price)
  ("(305) 729 0831", agent-phone)
  ("Fantastic house", description)
  ...
LSD: Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
Pipeline for each source element: Name Learner + Naive Bayes -> Meta-Learner

area:
  Instances: <area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
  Per-instance predictions:
    (address,0.8), (description,0.2)
    (address,0.6), (description,0.4)
    (address,0.7), (description,0.3)
  Prediction for area: (address,0.7), (description,0.3)

day-phone:
  Instances: <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
  Prediction for day-phone: (agent-phone,0.9), (description,0.1)

extra-info:
  Instances: <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
  Prediction for extra-info: (address,0.6), (description,0.4)
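The numbers on this slide are consistent with the three predictions for the area instances being averaged into the single prediction (address,0.7), (description,0.3). The sketch below assumes plain averaging; that is an assumption read off the example, not a statement of how LSD actually combines predictions, and the function name is hypothetical.

```python
from collections import defaultdict


def average_predictions(predictions):
    """Average a list of {label: confidence} predictions label by label.

    Plain averaging is an assumption made for this sketch.
    """
    if not predictions:
        return {}
    totals = defaultdict(float)
    for prediction in predictions:
        for label, confidence in prediction.items():
            totals[label] += confidence
    return {label: total / len(predictions) for label, total in totals.items()}


# The three predictions listed for the area instances on the slide.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
combined = average_predictions(area_predictions)
best = max(combined, key=combined.get)
print(best, {label: round(c, 2) for label, c in combined.items()})
```

Under this assumption, area is mapped to address because address has the highest averaged confidence.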
SKAT (Semantic Knowledge Articulation Tool)
• Expert supplies SKAT with a few initial rules
  Ex: 1) Match US.president US.chancellor
      2) MisMatch human.nail factory.nail
• SKAT articulates on the supplied matching rules
• Expert approves/rejects
• Creates correct rules and computes an updated articulation (knowledge gained from irrelevant and rejected rules is stored)
Similarity Flooding
• Intuition: whenever any two elements in the graphs G1 and G2 are similar, their neighbors tend to be similar.
• Transform schemas into directed labeled graphs
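The intuition above can be illustrated with a heavily simplified propagation loop. This is a sketch only: it pairs nodes that are connected by equally labeled edges, spreads similarity between neighboring pairs with a uniform weight of 1.0, and normalizes by the maximum after each round. The uniform weight, the fixed iteration count, and the tiny example graphs are assumptions; the published algorithm uses its own propagation coefficients and stopping criterion.

```python
from itertools import product


def similarity_flooding(g1, g2, initial, iterations=10):
    """Propagate similarity between node pairs of two labeled graphs.

    g1, g2: lists of (source, label, target) edges.
    initial: {(node1, node2): similarity} seed values.
    """
    # (a, b) and (a2, b2) are neighbors when a -label-> a2 in g1
    # and b -label-> b2 in g2 with the same label.
    neighbors = {}
    for (a, label1, a2), (b, label2, b2) in product(g1, g2):
        if label1 == label2:
            neighbors.setdefault((a, b), set()).add((a2, b2))
            neighbors.setdefault((a2, b2), set()).add((a, b))

    sigma = {pair: initial.get(pair, 0.0) for pair in neighbors}
    for _ in range(iterations):
        flooded = {p: sigma[p] + sum(sigma[q] for q in neighbors[p])
                   for p in sigma}
        top = max(flooded.values(), default=0.0)
        if top == 0.0:
            break
        sigma = {p: value / top for p, value in flooded.items()}
    return sigma


# Hypothetical graphs derived from parts of two schemas.
g1 = [("Customer", "has", "CAddress"), ("CAddress", "has", "city")]
g2 = [("AccountOwner", "has", "Address"), ("Address", "has", "city")]
sim = similarity_flooding(g1, g2, initial={("city", "city"): 1.0})
for pair, score in sorted(sim.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Starting from a single seed (the two city elements are similar), similarity flows to (CAddress, Address) and then to (Customer, AccountOwner), while pairs that are not connected to the seed stay at zero.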
Conclusion
• User feedback:
  - User interaction: minimize user input but maximize the impact of the feedback
  - If we require user acceptance for our matches, what happens if the matcher returns hundreds or thousands of matches?
• The more configurable the matcher, the better
• Problems with schema representation and data:
  - Dealing with inconsistent data values for a schema element
  - Independence of schema representation
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Sophisticated techniques are required for n:m matches (current work is based on 1:1)
Conclusion
• Areas that need more attention:
  1) Re-use opportunities
  2) Learning from user feedback
Any other issues to address?