Explore how eTuner optimizes schema matching systems using synthetic scenarios for improved accuracy and efficiency in various contexts. Learn about the challenges, solutions, and benefits of tuning matching components.
eTuner: Tuning Schema Matching Software using Synthetic Scenarios Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan University of Illinois, USA Arnon Rosenthal MITRE Corp., USA
Main Points • Tuning matching systems: a long-standing problem • becoming increasingly worse • We propose a principled solution • exploits synthetic input/output pairs • promising, though much work remains • Idea applicable to other contexts
Schema Matching
Schema 1: price | agent-name | address
  120,000 | George Bush | Crawford, TX
  239,900 | Hillary Clinton | New York City, NY
Schema 2: listed-price | contact-name | city | state
  320K | Jane Brown | Seattle | WA
  240K | Mike Smith | Miami | FL
[Figure: arrows between the two schemas mark a 1-1 match and a complex match]
Schema Matching is Ubiquitous • Databases • data integration, • model management • data translation, • collaborative data sharing • keyword querying, schema/view integration • data warehousing, peer data management, … • AI • knowledge bases, ontology merging, information gathering agents, ... • Web • e-commerce, Deep Web, Semantic Web • eGovernment, bio-informatics, scientific data management
Current State of Affairs • Finding semantic mappings is now a key bottleneck! • largely done by hand, labor intensive & error prone • Numerous matching techniques have been developed • Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ... • AI: Stanford, Karlsruhe University, NEC Japan, ... • Techniques are often synergistic, leading to multi-component matching architectures • each component employs a particular technique • final predictions combine those of the components
An Example: LSD [SIGMOD-01] • Schema 1: agent name, address (sample data: James Smith, Urbana, IL; Mike Doan, Seattle, WA) • Schema 2: area, contact-agent (sample data: Peoria, IL, (206) 634 9435; Kent, WA, (617) 335 4243) • Pipeline: Name Matcher and Naive Bayes Matcher feed the Combiner, followed by the Constraint Enforcer and the Match Selector • Scores shown in the figure for agent-name: 0.5, 0.1, 0.3 • Combined predictions • area => (address, 0.7), (description, 0.3) • contact-agent => (agent-phone, 0.7), (agent-name, 0.3) • comments => (address, 0.6), (desc, 0.4) • Constraint: only one attribute of Schema 2 matches address • Selected matches: area = address, contact-agent = agent-phone, ..., comments = desc
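The combine-then-select step in the LSD example can be sketched as follows. This is a minimal illustration that assumes a plain average combiner (one of the combiners named in the component library), which is not necessarily how LSD itself combines predictions. The per-matcher scores are hypothetical values chosen so that their average reproduces the combined prediction shown for `area`; the function names are also made up for this sketch.

```python
from typing import Dict, List

Scores = Dict[str, float]  # candidate Schema 1 attribute -> confidence


def average_combiner(predictions: List[Scores]) -> Scores:
    """Average the confidence that each matcher assigns to every candidate."""
    candidates = set().union(*predictions)
    return {
        c: sum(p.get(c, 0.0) for p in predictions) / len(predictions)
        for c in candidates
    }


def select_best(combined: Scores) -> str:
    """Pick the candidate with the highest combined confidence."""
    return max(combined, key=combined.get)


# Hypothetical per-matcher scores for the Schema 2 attribute "area".
name_matcher = {"address": 0.6, "description": 0.4}
naive_bayes = {"address": 0.8, "description": 0.2}

combined = average_combiner([name_matcher, naive_bayes])
best = select_best(combined)  # "address" has the highest average score
```

A constraint enforcer would sit between these two steps and adjust or veto candidates that violate domain constraints; it is left out here to keep the sketch short.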
Multi-Component Matching Solutions • Such systems are very powerful ... • maximize accuracy; highly customizable to individual domain • ... but place a serious tuning burden on domain users • Developed in many recent works • e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al.-02; Bernstein et al., SIGMOD Record-04; Madhavan et al. 05 • Now commonly adopted, with industrial-strength systems • e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig] • [Figure: execution graphs of LSD, COMA, SF, and LSD-SF, each built from matchers (Matcher 1 … Matcher n), combiners, constraint enforcers, and match selectors]
Tuning Schema Matching Systems • Given a particular matching situation • how to select the right components? • how to adjust the multitude of knobs? • Execution graph: Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector • Library of matching components • Matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naïve Bayes matcher, SVM matcher • Combiners: average combiner, min combiner, max combiner, weighted sum combiner • Constraint enforcers: A* search enforcer, relax. labeler, ILP • Match selectors: bipartite graph selector, threshold selector • Knobs of decision tree matcher • characteristics of attr. • split measure • post-prune? • size of validation set • Untuned versions produce inferior accuracy, however ...
... Tuning is Extremely Difficult • Large number of knobs • e.g., 8-29 in our experiments • Wide variety of techniques • database, machine learning, IR, information theory, etc. • Complex interaction among components • Not clear how to compare the quality of knob configs • Matching systems are still tuned manually, by trial and error • Multi-component systems make tuning even worse • Developing efficient tuning techniques is crucial to making matching systems attractive in practice
The eTuner Solution • Given schema S & matching system M • tunes M to maximize average accuracy of matching S with future schemas • incurs virtually no cost to user • Key challenge 1: Evaluation • must search for “best” knob config • how to compute the quality of any knob config C? • if knowing “ground-truth” matches for a representative workload W = {(S,T1), ..., (S,Tn)}, then can use W to evaluate C • but often have no such W • Key challenge 2: Search • how to efficiently evaluate the huge space of knob configs?
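The evaluation challenge can be made concrete with a small sketch: given a workload with known ground-truth matches, the quality of a knob config C is taken here to be the average F1 accuracy over the workload. The names `f1`, `config_quality`, and `toy_matcher`, the single `threshold` knob, and the candidate scores are all assumptions made for illustration, not part of eTuner itself.

```python
from typing import Callable, Dict, List, Set, Tuple

Match = Tuple[str, str]                  # (attribute of S, attribute of T)
Workload = List[Tuple[str, Set[Match]]]  # (target schema id, ground-truth matches)
Config = Dict[str, float]                # knob name -> value


def f1(predicted: Set[Match], truth: Set[Match]) -> float:
    """Harmonic mean of precision and recall of the predicted matches."""
    correct = len(predicted & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def config_quality(
    run_matcher: Callable[[Config, str], Set[Match]],
    config: Config,
    workload: Workload,
) -> float:
    """Average F1 of the matcher, run with `config`, over the whole workload."""
    scores = [f1(run_matcher(config, target), truth) for target, truth in workload]
    return sum(scores) / len(scores)


# Hypothetical matcher: keeps candidate matches scoring at or above a threshold.
CANDIDATES = {
    "T1": {("price", "listed-price"): 0.9, ("address", "city"): 0.6, ("price", "city"): 0.4},
}


def toy_matcher(config: Config, target: str) -> Set[Match]:
    return {m for m, s in CANDIDATES[target].items() if s >= config["threshold"]}


workload: Workload = [("T1", {("price", "listed-price"), ("address", "city")})]
best = max([0.3, 0.5, 0.7], key=lambda t: config_quality(toy_matcher, {"threshold": t}, workload))
```

In this toy setup the threshold 0.5 wins because it recovers both true matches without admitting the false one.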
Key Idea: Generate Synthetic Input/Output Pairs • Need workload W = {(S,T1), (S,T2), …, (S,Tn)} • To generate W • start with S • perturb S to generate T1 • perturb S to generate T2 • etc. • Know the perturbation => know matches between S & Ti
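The idea that knowing the perturbation means knowing the matches can be sketched in a few lines. The version below is a simplification that treats a schema as a flat list of column names and uses only two hypothetical rules (drop a column, abbreviate a name); the function names, probabilities, and seed are assumptions, and the real workload generator also perturbs tables and data.

```python
import random
from typing import List, Set, Tuple

Match = Tuple[str, str]  # (column of S, column of the perturbed schema Ti)


def abbreviate(name: str, length: int = 4) -> str:
    """Hypothetical renaming rule: keep only the first `length` characters."""
    return name[:length]


def perturb(schema: List[str], rng: random.Random) -> Tuple[List[str], Set[Match]]:
    """Derive one schema Ti from S and record the matches implied by the edits."""
    kept = list(schema)
    if len(kept) > 1 and rng.random() < 0.5:
        kept.remove(rng.choice(kept))      # rule: drop a column
    target: List[str] = []
    matches: Set[Match] = set()
    for col in kept:
        new_name = abbreviate(col) if rng.random() < 0.5 else col  # rule: rename
        target.append(new_name)
        matches.add((col, new_name))       # the perturbation tells us the match
    rng.shuffle(target)                    # rule: swap column locations
    return target, matches


def generate_workload(schema: List[str], n: int, seed: int = 0):
    """Build W = {(S,T1), ..., (S,Tn)} together with ground-truth matches."""
    rng = random.Random(seed)
    return [perturb(schema, rng) for _ in range(n)]


S = ["id", "last-name", "salary"]
W = generate_workload(S, n=5)
```

Each entry of `W` pairs a perturbed schema with the set of matches back to `S`, which is the ground truth that tuning needs and that real workloads usually lack.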
Key Idea: Generate Synthetic Input/Output Pairs • Split Schema S into V and U with disjoint data tuples • Perturb V to obtain V1, ..., Vn • perturb # of tables • perturb # of columns in each table • perturb column and table names • perturb data tuples in each table • Example: table EMPLOYEES in U and its perturbed counterpart EMPS in V1 • Ω1: a set of semantic matches between V1 and U • EMPS.emp-last = EMPLOYEES.last • EMPS.id = EMPLOYEES.id • EMPS.wage = EMPLOYEES.salary($)
Examples of Perturbation Rules • Number of tables • merge two tables based on a join path • split a table into two • Structure of table • merge two columns • e.g., neighboring columns, or sharing prefix/suffix (last-name, first-name) • drop a column • swap location of two columns • Names of tables/columns • rules capture common name transformations • abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc. • Data values • rules capture common format transformations: 12/4 => Dec 4 • values are changed based on some distributions (e.g., Gaussian) • See paper for details
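A few of the name and data-value rules listed above can be written as small functions. This is a hedged sketch: the function names, the choice to keep a name's first character when dropping vowels, the hyphen used when prefixing a table name, and the relative standard deviation are all assumptions, since the slide does not spell out these details.

```python
import random

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def drop_vowels(name: str) -> str:
    """Name rule: drop all vowels, keeping the first character for readability."""
    return name[:1] + "".join(ch for ch in name[1:] if ch.lower() not in "aeiou")


def add_table_name(table: str, column: str) -> str:
    """Name rule: add the table name to the column name (separator is assumed)."""
    return f"{table}-{column}"


def reformat_date(value: str) -> str:
    """Data rule: rewrite a month/day value such as 12/4 as Dec 4."""
    month, day = value.split("/")
    return f"{MONTHS[int(month) - 1]} {int(day)}"


def gaussian_perturb(value: float, rel_sigma: float, rng: random.Random) -> float:
    """Data rule: redraw a numeric value from a Gaussian centred on the original."""
    return rng.gauss(value, abs(value) * rel_sigma)


rng = random.Random(0)
examples = {
    "name": drop_vowels("salary"),
    "qualified": add_table_name("emps", "id"),
    "date": reformat_date("12/4"),
    "price": gaussian_perturb(120000.0, 0.05, rng),
}
```

Keeping each rule as an independent function makes it easy to add or remove rules, which is the kind of variation the sensitivity analysis later in the deck looks at.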
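The eTuner Architecture • Inputs: Schema S, Matching Tool M • Workload Generator applies Perturbation Rules to produce the Synthetic Workload: (V1, U, Ω1), (V2, U, Ω2), ..., (Vn, U, Ωn) • Staged Tuner applies Tuning Procedures using the Synthetic Workload • Output: Tuned Matching Tool M • [Figure: one element of the diagram is marked (Optional)]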
The Staged Tuner • Tune sequentially starting with lowest-level components • Assume • execution graph has k levels, m nodes per level • each node can be assigned one of n components • each component has p knobs, each of which has q values • Then tuning examines n·p·q·k·m out of (npq)^(km) knob configs • [Figure: execution graph (Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector) arranged in Levels 1 to 4, with an arrow showing the tuning direction]
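To get a feel for the gap between the two counts on this slide, the sketch below evaluates both expressions for one set of values. The function names and the sample values of n, p, q, k, and m are illustrative assumptions; only the two formulas come from the slide.

```python
def staged_configs(n: int, p: int, q: int, k: int, m: int) -> int:
    """Configs examined when each of the k*m nodes is tuned on its own."""
    return n * p * q * k * m


def exhaustive_configs(n: int, p: int, q: int, k: int, m: int) -> int:
    """Size of the full space: n*p*q choices at each of the k*m nodes."""
    return (n * p * q) ** (k * m)


# Hypothetical setting: 3 components per node, 4 knobs, 5 values per knob,
# 4 levels, 2 nodes per level.
params = dict(n=3, p=4, q=5, k=4, m=2)
staged = staged_configs(**params)          # 480
exhaustive = exhaustive_configs(**params)  # 167_961_600_000_000
```

Staged tuning trades optimality for tractability: it grows linearly in the number of nodes, while the full space grows exponentially in it.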
Empirical Evaluation • [Tables not recovered: Domains; Matching systems]
Matching Accuracy • [Figure: four bar charts, one per matching system (COMA, LSD, SF, LSD-SF); x-axis: domains Real Estate, Product, Inventory, Course; y-axis: accuracy, 0 to 0.9; series: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTuner: Automatic, eTuner: Human-assisted] • eTuner achieves higher accuracy than current best methods, at virtually no cost to the user
Cost of Using eTuner • You have a schema S and a matching system M • Vendor supplies eTuner • will hook it up with matching system M • Vendor supplies a matching system M • bundles eTuner inside
Sensitivity Analysis • Adding perturbation rules • Exploiting prior match results (enriching the workload) • [Figure: two plots of Accuracy (F1) for tuned LSD; legend: Inventory Domain, Real Estate Domain, Average; left x-axis: Schemas in Synthetic Workload (#), 1 to 50; right x-axis: Previous matches in collection (%), 0 to 88]
Summary: The eTuner Project @ Illinois • Tuning matching systems is crucial • long-standing problem, getting worse • a next logical step in schema matching research • Provides an automatic & principled solution • generates a synthetic workload, employs it to tune efficiently • incurs virtually no cost to human users • exploits user assistance whenever available • Extensive experiments over 4 domains with 4 systems • Future directions • find optimal synthetic workload • apply to other matching scenarios • adapt ideas to scenarios beyond schema matching (see 3rd speaker)
Backup: User Assistance • S(phone1,phone2,…) • Generate V by dropping phone2: V(phone1,…) • Rename phone1 in V: V(x,…) • Problem: • x matches phone1, x does not match phone2 • User: • groups phone1 and phone2 • so if x matches phone1, it will also match phone2 • Intuition: tell the system not to bother trying to distinguish phone1 from phone2
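One way to read the grouping idea is as an expansion of the synthetic ground truth: any match to one member of a user-defined group also counts as a match to the other members. The sketch below assumes that reading; the function name and the representation of groups as sets of attribute names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Match = Tuple[str, str]  # (attribute of V, attribute of S)


def expand_with_groups(matches: Set[Match], groups: Iterable[Set[str]]) -> Set[Match]:
    """Treat attributes in the same user-defined group as interchangeable."""
    groups = list(groups)
    expanded = set(matches)
    for synthetic, original in matches:
        for group in groups:
            if original in group:
                # A match to one group member counts for every member.
                expanded.update((synthetic, member) for member in group)
    return expanded


ground_truth = {("x", "phone1")}
user_groups = [{"phone1", "phone2"}]
relaxed_truth = expand_with_groups(ground_truth, user_groups)
```

With the relaxed ground truth, a config that maps `x` to `phone2` is no longer penalized during tuning, which matches the intuition stated on the slide.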