
eTuner: Tuning Schema Matching Software using Synthetic Scenarios

Explore how eTuner optimizes schema matching systems using synthetic scenarios for improved accuracy and efficiency in various contexts. Learn about the challenges, solutions, and benefits of tuning matching components.

jmerritt

Presentation Transcript


  1. eTuner: Tuning Schema Matching Software using Synthetic Scenarios • Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA) • Arnon Rosenthal (MITRE Corp., USA)

  2. Main Points • Tuning matching systems: a long-standing problem • and one that keeps getting worse • We propose a principled solution • exploits synthetic input/output pairs • promising, though much work remains • Idea applicable to other contexts

  3. Schema Matching
     Schema 1: price | agent-name | address
       120,000 | George Bush | Crawford, TX
       239,900 | Hillary Clinton | New York City, NY
     Schema 2: listed-price | contact-name | city | state
       320K | Jane Brown | Seattle | WA
       240K | Mike Smith | Miami | FL
     • 1-1 match: e.g., price = listed-price
     • complex match: e.g., address = concat(city, state)

  4. Schema Matching is Ubiquitous • Databases • data integration, model management, data translation, collaborative data sharing • keyword querying, schema/view integration • data warehousing, peer data management, … • AI • knowledge bases, ontology merging, information gathering agents, ... • Web • e-commerce, Deep Web, Semantic Web • eGovernment, bio-informatics, scientific data management

  5. Current State of Affairs • Finding semantic mappings is now a key bottleneck! • largely done by hand, labor intensive & error prone • Numerous matching techniques have been developed • Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ... • AI: Stanford, Karlsruhe University, NEC Japan, ... • Techniques are often synergistic, leading to multi-component matching architectures • each component employs a particular technique • final predictions combine those of the components

  6. An Example: LSD [SIGMOD-01]
     • Schema 1: agent name, address (sample data: James Smith, Urbana, IL; Mike Doan, Seattle, WA)
     • Schema 2: area, contact-agent (sample data: Peoria, IL, (206) 634 9435; Kent, WA, (617) 335 4243)
     • Pipeline: Name Matcher and Naive Bayes Matcher, then Combiner, then Constraint Enforcer, then Match Selector
     • Scores shown for agent-name: 0.5 (Name Matcher), 0.1 (Naive Bayes Matcher), 0.3 (Combiner)
     • Combiner output: area => (address, 0.7), (description, 0.3); contact-agent => (agent-phone, 0.7), (agent-name, 0.3); comments => (address, 0.6), (desc, 0.4)
     • Constraint Enforcer: only one attribute of Schema 2 matches address
     • Match Selector output: area = address, contact-agent = agent-phone, ..., comments = desc
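
The constraint-enforcement and match-selection steps on this slide can be illustrated with a small Python sketch. It is not LSD's actual algorithm: the brute-force search, the "highest total score" objective, and the function names are assumptions made for illustration. Only the candidate scores and the constraint on address come from the slide.

```python
from itertools import product

# Combiner output from the slide: candidate (label, score) pairs
# for each Schema 2 attribute.
candidates = {
    "area": [("address", 0.7), ("description", 0.3)],
    "contact-agent": [("agent-phone", 0.7), ("agent-name", 0.3)],
    "comments": [("address", 0.6), ("desc", 0.4)],
}

def satisfies_constraint(assignment):
    """Constraint from the slide: at most one attribute maps to 'address'."""
    return sum(1 for label in assignment.values() if label == "address") <= 1

def select_matches(candidates):
    """Hypothetical selector: try every assignment, keep the valid one
    with the highest total score."""
    attrs = list(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[a] for a in attrs)):
        assignment = {a: label for a, (label, _) in zip(attrs, combo)}
        if not satisfies_constraint(assignment):
            continue
        score = sum(s for _, s in combo)
        if score > best_score:
            best, best_score = assignment, score
    return best

matches = select_matches(candidates)
```

Under these assumptions the sketch reproduces the slide's outcome: comments falls back to desc because area already claims address with a higher score.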

  7. Multi-Component Matching Solutions • Developed in many recent works • e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al.-02; Bernstein et al., SIGMOD Record-04; Madhavan et al. 05 • Now commonly adopted, with industrial-strength systems • e.g., Protoplasm [MSR], COMA++ [Univ. of Leipzig] • Such systems are very powerful ... • maximize accuracy; highly customizable to individual domains • ... but place a serious tuning burden on domain users • [Figure: execution graphs of LSD, COMA, SF, and LSD-SF, built from Matcher 1 … Matcher n, Combiner, Constraint enforcer, and Match selector components]

  8. Tuning Schema Matching Systems • Given a particular matching situation • how to select the right components? • how to adjust the multitude of knobs? • Untuned versions produce inferior accuracy, however ...
     Library of matching components (each slot of the execution graph is filled from it):
     • Match selectors: Bipartite graph selector, Threshold selector
     • Constraint enforcers: A* search enforcer, Relax. labeler, ILP
     • Combiners: Average combiner, Min combiner, Max combiner, Weighted sum combiner
     • Matchers: q-gram name matcher, Decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
     Knobs of decision tree matcher: Characteristics of attr. • Split measure • Post-prune? • Size of validation set
     [Figure: execution graph with Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector]

  9. ... Tuning is Extremely Difficult • Large number of knobs • e.g., 8-29 in our experiments • Wide variety of techniques • database, machine learning, IR, information theory, etc. • Complex interaction among components • Not clear how to compare the quality of knob configs • Matching systems are still tuned manually, by trial and error • Multiple component systems make tuning even worse • Developing efficient tuning techniques is crucial to making matching systems attractive in practice

  10. The eTuner Solution • Given schema S & matching system M • tunes M to maximize average accuracy of matching S with future schemas • incurs virtually no cost to user • Key challenge 1: Evaluation • must search for “best” knob config • how to compute the quality of any knob config C? • if knowing “ground-truth” matches for a representative workload W = {(S,T1), ..., (S,Tn)}, then can use W to evaluate C • but often have no such W • Key challenge 2: Search • how to efficiently evaluate the huge space of knob configs?
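
The evaluation challenge can be made concrete with a short Python sketch of scoring a knob config C against a workload W with known ground-truth matches. The toy matcher, its single `prefix_len` knob, and the use of F1 averaged over the workload are assumptions for illustration (a later slide reports accuracy as F1); none of this is eTuner's own code.

```python
def f1(predicted, truth):
    """F1 score between two sets of (source_column, target_column) pairs."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def config_quality(matcher, config, workload):
    """Average F1 of `matcher` under `config` over entries (S, T, truth)."""
    scores = [f1(matcher(S, T, config), truth) for S, T, truth in workload]
    return sum(scores) / len(scores)

def toy_matcher(source_cols, target_cols, config):
    """Hypothetical matcher with one knob: columns match when their first
    `prefix_len` characters agree (case-insensitive)."""
    n = config["prefix_len"]
    return {(s, t) for s in source_cols for t in target_cols
            if s[:n].lower() == t[:n].lower()}

S = ["price", "phone", "address"]
T1 = ["pri", "pho", "addr"]
truth1 = {("price", "pri"), ("phone", "pho"), ("address", "addr")}
workload = [(S, T1, truth1)]

loose = config_quality(toy_matcher, {"prefix_len": 1}, workload)
strict = config_quality(toy_matcher, {"prefix_len": 3}, workload)
```

Here the stricter setting scores higher because the one-character prefix confuses price with phone. Comparing configs this way only works when the ground-truth matches are available, which is exactly the gap the slide points out.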

  11. Key Idea: Generate Synthetic Input/Output Pairs • Need workload W = {(S,T1), (S,T2), …, (S,Tn)} • To generate W • start with S • perturb S to generate T1 • perturb S to generate T2 • etc. • Know the perturbation => know matches between S & Ti
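
A minimal Python sketch of this idea follows, assuming a schema is just a list of column names and using two made-up perturbations (drop a column, abbreviate a name). The function names and probabilities are hypothetical. The point is that the generator records the correct matches as it perturbs, so each pair (S, Ti) comes with its ground truth for free.

```python
import random

def perturb(columns, rng):
    """Return (perturbed_columns, matches) for one synthetic schema.

    `matches` holds (original, perturbed) name pairs, known by construction.
    """
    kept = list(columns)
    if len(kept) > 1 and rng.random() < 0.5:
        kept.remove(rng.choice(kept))          # perturbation: drop a column
    perturbed, matches = [], set()
    for col in kept:
        new_name = col[:4] if rng.random() < 0.5 else col  # abbreviate
        if new_name in perturbed:
            # Fall back to the original name on a clash
            # (sufficient for this toy example).
            new_name = col
        perturbed.append(new_name)
        matches.add((col, new_name))
    return perturbed, matches

def generate_workload(columns, n, seed=0):
    """Build W = [(S, T1, matches1), ..., (S, Tn, matchesn)]."""
    rng = random.Random(seed)
    return [(columns, *perturb(columns, rng)) for _ in range(n)]

workload = generate_workload(["id", "last_name", "salary"], n=5)
```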

  12. Key Idea: Generate Synthetic Input/Output Pairs • Split schema S into V and U with disjoint data tuples • Perturb V to obtain V1, …, Vn • perturb # of tables • perturb # of columns in each table • perturb column and table names • perturb data tuples in each table • Example: table EMPLOYEES in U, perturbed table EMPS in V1 • Ω1: a set of semantic matches between V1 and U • EMPS.emp-last = EMPLOYEES.last • EMPS.id = EMPLOYEES.id • EMPS.wage = EMPLOYEES.salary($)

  13. Examples of Perturbation Rules • Number of tables • merge two tables based on a join path • split a table into two • Structure of table • merge two columns • e.g., neighboring columns, or sharing prefix/suffix (last-name, first-name) • drop a column • swap location of two columns • Names of tables/columns • rules capture common name transformations • abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc. • Data values • rules capture common format transformations: 12/4 => Dec 4 • values are changed based on some distributions (e.g., Gaussian) • See paper for details
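
A few of the name and data-value rules listed above can be written as small Python functions. This is a sketch only: the function names, the hyphen used when prefixing a table name, and the month/day reading of "12/4" are assumptions, and the paper's actual rule set is richer.

```python
VOWELS = set("aeiouAEIOU")
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def abbreviate(name, length=3):
    """Abbreviate a name to its first `length` characters (3-4 on the slide)."""
    return name[:length]

def drop_vowels(name):
    """Drop all vowels from a name."""
    return "".join(c for c in name if c not in VOWELS)

def add_table_name(table, column):
    """Prefix a column name with its table name (separator is an assumption)."""
    return f"{table}-{column}"

def reformat_date(value):
    """Format transformation from the slide: '12/4' becomes 'Dec 4'.
    Assumes the input is month/day."""
    month, day = value.split("/")
    return f"{MONTHS[int(month) - 1]} {int(day)}"
```

For example, `abbreviate("employees", 4)` gives `"empl"` and `drop_vowels("salary")` gives `"slry"`.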

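  14. The eTuner Architecture • Workload Generator: takes Schema S and Perturbation Rules, produces the Synthetic Workload (U, Ω1, V1), (U, Ω2, V2), …, (U, Ωn, Vn) • Staged Tuner: takes the Synthetic Workload, Matching Tool M, and Tuning Procedures, outputs Tuned Matching Tool M • [Figure: architecture diagram; one input is marked "(Optional)"]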

  15. The Staged Tuner • Tune sequentially starting with lowest-level components • Assume • execution graph has k levels, m nodes per level • each node can be assigned one of n components • each component has p knobs, each of which has q values • tuning examines npqkm out of (npq)^(km) knob configs • [Figure: execution graph (Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector) with Levels 1 to 4 and an arrow marking the tuning direction]
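
To get a feel for the size of the saving, the two counts on this slide can be evaluated for concrete numbers. The Python sketch below takes both formulas exactly as stated on the slide; the sample values of n, p, q, k, m are made up for illustration.

```python
def staged_configs(n, p, q, k, m):
    """Knob configs examined by staged tuning, per the slide: n*p*q*k*m."""
    return n * p * q * k * m

def exhaustive_configs(n, p, q, k, m):
    """Size of the full knob-config space, per the slide: (n*p*q)^(k*m)."""
    return (n * p * q) ** (k * m)

# Hypothetical small system: 3 components per node, 2 knobs each,
# 4 values per knob, 2 levels, 2 nodes per level.
params = dict(n=3, p=2, q=4, k=2, m=2)
staged = staged_configs(**params)          # 96
exhaustive = exhaustive_configs(**params)  # 331776
```

Even for this tiny system, tuning level by level looks at 96 configurations instead of 331,776; the gap grows exponentially with the number of nodes k·m.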

  16. Empirical Evaluation • [Tables: Domains; Matching systems]

  17. Matching Accuracy • [Figure: four bar charts, one per matching system (COMA, LSD, SF, LSD-SF); accuracy from 0 to 0.9 for the Real Estate, Product, Inventory, and Course domains; series: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTuner: Automatic, eTuner: Human-assisted] • eTuner achieves higher accuracy than current best methods, at virtually no cost to the user

  18. Cost of Using eTuner • You have a schema S and a matching system M • Vendor supplies eTuner • will hook it up with matching system M • Vendor supplies a matching system M • bundles eTuner inside

  19. Sensitivity Analysis • Adding perturbation rules • Exploiting prior match results (enriching the workload) • [Figure: two plots of accuracy (F1) for tuned LSD, with curves for the Inventory domain, the Real Estate domain, and their average; x-axes: Schemas in Synthetic Workload (#), from 1 to 50, and Previous matches in collection (%), from 0 to 88]

  20. Summary: The eTuner Project @ Illinois • Tuning matching systems is crucial • long standing problem, is getting worse • a next logical step in schema matching research • Provides an automatic & principled solution • generates a synthetic workload, employs it to tune efficiently • incurs virtually no cost to human users • exploits user assistance whenever available • Extensive experiments over 4 domains with 4 systems • Future directions • find optimal synthetic workload • apply to other matching scenarios • adapt ideas to scenarios beyond schema matching (see 3rd speaker)

  21. Backup: User Assistance • S(phone1,phone2,…) • Generate V by dropping phone2: V(phone1,…) • Rename phone1 in V: V(x,…) • Problem: • x matches phone1, x does not match phone2 • User: • group phone1 and phone2 • so if x matches phone1, it will also match phone2 • Intuition: tell the system not to bother trying to distinguish phone1 from phone2
