Making Holistic Schema Matching Robust: An Ensemble Approach

Bin He
Joint work with: Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign

Background: MetaQuerier – large-scale integration of the deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]

MetaQuerier: System architecture [CIDR’05]
• Front-end: Query Execution (queries Web databases in the Deep Web)
  • Result Compilation
  • Query Translation
  • Source Selection
• Deep Web Repository
  • Query Interfaces
  • Query Capabilities
  • Subject Domains
  • Unified Interfaces
• Back-end: Semantics Discovery (finds Web databases in the Deep Web)
  • Database Crawler
  • Interface Extraction
  • Source Organization
  • Schema Matching

Matching query interfaces (QIs)
[Figure: query interfaces from the Book Domain and the Music Domain, annotated with a 1:1 simple matching and an m:n complex matching]

Traditional approaches of schema matching – Pairwise attribute correspondence
Pairwise Matching
S1: author title subject ISBN
S2: writer title category format
S3: name title keyword binding
Pairwise Attribute Correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category
• Typical matching approaches
  • Cupid [VLDB’01]
  • LSD [SIGMOD’01]
• Scale is a challenge
  • Only small scale
  • Large-scale is a must for our task
• Scale is an opportunity
  • Context information is not exploited
  • similar attributes across multiple schemas
  • co-occurrence patterns among attributes

Emerging paradigm: Holistic schema matching approach
• Match many schemas at the same time and find all the matchings at once
Input: a set of schemas
S1: author title subject ISBN
S2: writer title category format
S3: name title keyword binding
Holistic Schema Matching
Output: a ranked list of matchings
author = writer = name
subject = category
format = binding

Various techniques to realize holistic matching
• Matching as hidden model discovery: Model the generative behavior of schemas from attributes and their semantic relationships
  • The MGS framework [SIGMOD’03]
• Matching as correlation mining: The correlation of attributes across sources reflects complex relationships
  • The DCM framework [KDD’04]
• Matching as clustering: Attributes in two schemas may be similar through attributes in other schemas
  • Interactive clustering based matcher [SIGMOD’04]
  • WISE-Integrator [VLDB’03]

Holistic matching is, in essence: Data mining to discover semantics for information integration
• Hypothesis: Semantics (semantic correspondences) and Hidden Regularities drive the Generation of Observations (attribute occurrences)
• Holistic matching approach: Statistical Analysis of the observations
  • hidden model discovery
  • correlation mining
  • clustering

The baseline holistic matching architecture with matching as correlation mining
Input: query interfaces from AA.com, United.com, Expedia.com, Delta.com
Matcher: The DCM matcher
Output:
{adult, child, senior} = passenger
departure date = depart
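The intuition behind matching as correlation mining can be sketched with simple co-occurrence counts. This is a toy illustration only, not the DCM matcher or its correlation measure: the function names, the `min_support` threshold, and the "never co-occur" rule are all assumptions made for the example. The idea shown is that attributes which are each frequent across sources but never appear together in one interface are candidates for being synonyms.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple


def cooccurrence_counts(schemas: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count how often each attribute, and each attribute pair, occurs."""
    single: Counter = Counter()
    pair: Counter = Counter()
    for schema in schemas:
        attrs = sorted(set(schema))  # sorted so pair keys are canonical
        single.update(attrs)
        pair.update(combinations(attrs, 2))
    return single, pair


def never_cooccurring_pairs(
    schemas: Iterable[Sequence[str]], min_support: int = 2
) -> List[Tuple[str, str]]:
    """Hypothetical rule: frequent attributes that never share a schema."""
    single, pair = cooccurrence_counts(schemas)
    frequent = sorted(a for a, c in single.items() if c >= min_support)
    return [(a, b) for a, b in combinations(frequent, 2) if pair[(a, b)] == 0]


if __name__ == "__main__":
    toy = [
        ["author", "title"],
        ["author", "title"],
        ["writer", "title"],
        ["writer", "title"],
    ]
    print(never_cooccurring_pairs(toy))
```

On the toy input, `author` and `writer` are both frequent and never co-occur, so they are the only candidate pair; `title` co-occurs with both and is excluded.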

The challenge in holistic input: Noisy data quality
Pipeline: The Deep Web → Database Crawler → Interface Extraction → Source Organization → Holistic Schema Matching
Because of its mining nature, holistic matching suffers from the inherent problem of noisy data quality!
• Noisy input is inevitable
  • extraction of QIs may contain errors
  • organization of QIs may not be fully accurate

Example of errors in interface extraction
[Figure: result of extraction for AA.com]
The correlation between (adult, children) and passenger is affected by a single extraction error!

The impact of noise: Error cascade
• Upstream subsystem (e.g., Interface Extraction) has accuracy Ai; downstream subsystem (e.g., Holistic Schema Matching) has accuracy Aj
• Accuracy of the cascade: Aj? Ai*Aj?
• Q: Errors are often a minority, so why do they cascade?
• A: The technique of a semantics-related task, e.g., data integration, is often context-sensitive: constraints, heuristics, measures, parameters, procedures
A general solution: Sampling and voting techniques, i.e., the ensemble framework

The intuition of the ensemble idea
• Sampling: a way to reduce noise in the input. A sample should
  1) contain sufficient good schemas to mine matchings
  2) contain less noise, so that the holistic matcher has more chance to be sustained
• Voting: a single sample may be biased, so repeat the sampling multiple times and then vote
It is likely that the holistic matcher can be sustained in most samples

The ensemble framework for holistic schema matching
Input schemas:
S1: author title subject ISBN
S2: name title keyword binding
S3: writer title category format
• Multiple Sampling: draw one sample of the input schemas per trial (1st trial, ..., Tth trial)
• Holistic Schema Matching: run the matcher on each sample
• Rank Aggregation (Voting): combine the results of all trials
Output:
author = name = writer
subject = category
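The flow of the framework (sample, match, repeat, vote) can be sketched as a small driver. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_trials` and `ensemble_match` are hypothetical names, the matcher and the aggregation step are passed in as plain callables, and each sample is assumed to be drawn without replacement.

```python
import random
from typing import Callable, List, Sequence

Schema = Sequence[str]
RankedMatchings = List[str]


def run_trials(
    schemas: Sequence[Schema],
    matcher: Callable[[List[Schema]], RankedMatchings],
    sample_size: int,
    num_trials: int,
    seed: int = 0,
) -> List[RankedMatchings]:
    """Run the matcher on num_trials random samples of sample_size schemas."""
    rng = random.Random(seed)  # seeded so that a run is repeatable
    pool = list(schemas)
    results = []
    for _ in range(num_trials):
        # Assumption: sampling without replacement within a trial.
        sample = rng.sample(pool, sample_size)
        results.append(matcher(sample))
    return results


def ensemble_match(
    schemas: Sequence[Schema],
    matcher: Callable[[List[Schema]], RankedMatchings],
    aggregate: Callable[[List[RankedMatchings]], RankedMatchings],
    sample_size: int,
    num_trials: int,
    seed: int = 0,
) -> RankedMatchings:
    """Multiple sampling, per-sample matching, then voting over all trials."""
    trials = run_trials(schemas, matcher, sample_size, num_trials, seed)
    return aggregate(trials)
```

Keeping the matcher and the aggregation as parameters means any holistic matcher and any voting scheme can be plugged in without changing the sampling loop.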

How the ensemble framework works: An example
Ranked lists of matchings in the figure (boxes labeled Holistic Schema Matching):
• 1. author = ISBN, 2. publisher = category, 3. author = name
• 1. author = name, 2. subject = category, 3. author = ISBN
• 1. author = name, 2. subject = category, 3. author = ISBN
• 1. subject = category, 2. author = ISBN, 3. author = name
• 1. author = name, 2. publisher = category, 3. author = ISBN
Please refer to our paper for more formal analysis

The ensemble idea is inspired by bagging predictors
• Bagging is used in machine learning to maintain the accuracy of a classifier in the presence of a biased distribution of input data
• We are essentially applying bagging techniques in a new scenario of schema matching
• However, we are different in
  • setting: supervised vs. unsupervised
  • technique: sampling and voting techniques
  • analytic model: our modeling is specific to matching

Configuration of multiple sampling
• The configuration dilemma
  • Sample size S
    • If S is too small, the sampled data may not be sufficiently representative
    • If S is too large, the sampled data may contain too much noise
  • Number of trials T
    • If T is too small, the voting result may not be sufficiently convincing
    • If T is too large, more execution time is needed
• Two ways to choose S and T
  • ST: first choose an S, then derive an appropriate T
  • TS: first choose a T, then derive an appropriate S
• TS is better than ST, since the accuracy is very sensitive to S, not T
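The trade-off between S and T can be illustrated with a back-of-the-envelope calculation. This is not the analytic model from the paper; it is a sketch under simplifying assumptions that are all hypothetical: the input holds `n_total` schemas of which `n_noisy` are noisy, a trial counts as successful when its sample contains at most `k` noisy schemas, samples are drawn without replacement, and trials are independent.

```python
from math import comb


def prob_at_most_k_noisy(n_total: int, n_noisy: int, sample_size: int, k: int) -> float:
    """Hypergeometric probability that a sample holds at most k noisy schemas."""
    n_good = n_total - n_noisy
    upper = min(k, sample_size, n_noisy)
    favorable = sum(
        comb(n_noisy, i) * comb(n_good, sample_size - i) for i in range(upper + 1)
    )
    return favorable / comb(n_total, sample_size)


def prob_majority_succeeds(p: float, num_trials: int) -> float:
    """Binomial probability that more than half of the trials succeed."""
    need = num_trials // 2 + 1
    return sum(
        comb(num_trials, j) * p**j * (1 - p) ** (num_trials - j)
        for j in range(need, num_trials + 1)
    )


if __name__ == "__main__":
    for s in (20, 50):
        p = prob_at_most_k_noisy(n_total=100, n_noisy=10, sample_size=s, k=1)
        print(s, round(p, 3), round(prob_majority_succeeds(p, 21), 3))
```

Under these assumptions, a larger S lowers the per-trial success probability, and a larger T only helps the vote when that probability is above one half, which is one way to read the claim that accuracy is more sensitive to S than to T.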

Aggregating matchings from all trials: Enforcing the majority matching results
• Each trial outputs a ranked list of matchings
• Voting thus aggregates a set of ranked lists into a single ranked list R, which reflects the ranking results of the majority
• Candidate selection
  • If the majority of trials do not find a matching M, M is not considered a correct matching and thus does not appear in R
• Ranking aggregation
  • If the majority of trials rank M1 higher than M2, R should also rank M1 higher than M2

An example of voting
T1: 1. author = name, 2. subject = category, 3. author = ISBN
T2: 1. subject = category, 2. author = ISBN, 3. author = name
T3: 1. author = name, 2. publisher = category, 3. author = ISBN
All Matchings: M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category
Candidate Selection: M1. author = name, M2. subject = category, M3. author = ISBN (M4. publisher = category is found only by T3, not by a majority of trials, so it is dropped)
Rank Aggregation: Borda’s aggregation: B(Mi) = Σj (rank of Mi in Tj)
B(M1) = 1 + 3 + 1 = 5, B(M2) = 2 + 1 + 3 = 6, B(M3) = 3 + 2 + 2 = 7
(The T3 terms are consistent with ranking after M4 is removed: M3 moves up to rank 2, and M2, which T3 does not find, takes the last rank 3.)
Rank matchings according to B(Mi), smallest first:
M1. author = name
M2. subject = category
M3. author = ISBN
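The voting example can be reproduced with a short sketch of candidate selection followed by Borda's aggregation. The function names are hypothetical, and one detail is an assumption inferred from the numbers on the slide rather than stated there: a candidate that a trial does not find is given the rank just after that trial's surviving candidates.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def select_candidates(trials: List[List[str]]) -> Set[str]:
    """Keep only matchings found by a strict majority of trials."""
    counts = Counter(m for trial in trials for m in set(trial))
    return {m for m, c in counts.items() if c > len(trials) / 2}


def borda_aggregate(trials: List[List[str]]) -> Tuple[List[str], Dict[str, int]]:
    """Sum each candidate's rank over all trials; a lower total ranks higher."""
    candidates = select_candidates(trials)
    scores = {m: 0 for m in candidates}
    for trial in trials:
        kept = [m for m in trial if m in candidates]  # drop non-candidates first
        rank = {m: i + 1 for i, m in enumerate(kept)}
        missing_rank = len(kept) + 1  # assumption: unseen candidates rank last
        for m in candidates:
            scores[m] += rank.get(m, missing_rank)
    ranking = sorted(candidates, key=lambda m: (scores[m], m))
    return ranking, scores


if __name__ == "__main__":
    t1 = ["author = name", "subject = category", "author = ISBN"]
    t2 = ["subject = category", "author = ISBN", "author = name"]
    t3 = ["author = name", "publisher = category", "author = ISBN"]
    ranking, scores = borda_aggregate([t1, t2, t3])
    print(ranking)
    print(scores)
```

With the three trials from the example, the scores come out as 5, 6, and 7 for author = name, subject = category, and author = ISBN, and publisher = category is excluded by candidate selection.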

Experimental setup
• Subsystems integration scenario
  • Interface Extraction + Holistic Schema Matching
  • Interface Extractor [SIGMOD’04]
  • The DCM Matcher [KDD’04]
• Datasets
  • Two representative domains in the TEL-8 dataset in the UIUC Web Integration Repository
  • Books and Airfares
  • http://metaquerier.cs.uiuc.edu/repository/

Experimental result: Baseline vs. Ensemble
[Figure: baseline approach vs. ensemble approach; panels (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares]

Experimental result: Outliers vs. Missing Data
• Upper bound exists
• Two types of data quality problems
  • Outliers (noise)
  • Missing data
• Outliers
  • data that ideally should not be observed, but is
  • can be solved by the ensemble approach
• Missing data
  • data that ideally should be observed, but is not
  • cannot be solved by the ensemble approach
[Figure: panels (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares]

Contributions
• Problem
  • noisy data quality is an inherent challenge for large-scale schema matching
  • critical for sustaining holistic schema matching as a practical and viable technique
• Solution
  • an ensemble framework with sampling and voting techniques, inspired by bagging predictors
  • we are essentially applying bagging techniques in a new scenario of schema matching

Thank You!
