Data Integration with Uncertainty

Xin (Luna) Dong, Data Management Dept @ AT&T
Joint work with Mike Franklin (Berkeley), Alon Halevy (Google), Anish Das Sarma (Stanford), Daisy Zhe Wang (Berkeley), Cong Yu (Yahoo!)

Many Applications Need to Manage Heterogeneous Data Sources
[Figure: data sources D1..D5 with differing schemas, e.g. Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid) versus Bib(title, authors, conf, year)]

Traditional Data Integration Systems [Ullman, 1997]
Mediated schema: Publication(title, author, conf, year)
Source schema: Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
Mapping:
SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid = B.aid AND P.pid = B.pid
[Figure: mediated schema over data sources D1..D5]

Querying on Traditional Data Integration Systems
[Figure: a query Q over the mediated schema, with queries Q1..Q5 over the data sources D1..D5]

Uncertainty Can Occur at Three Levels in Data Integration Applications
• I. Data level
• II. Semantics level
• III. Query level
[Figure: queries Q, mediated schema, and data sources D1..D5, annotated with the three levels]

Data Integration with Uncertainty
• I. Probabilistic model
• II. Keyword query reformulation
• III. Top-k query answering
[Figure: queries Q1..Qm and Q over a P-Med-Schema, linked through PM1..PM5 to data sources D1..D5 labeled PD1..PD5]

Our Approach
• Many applications require only approximate answers and do not need full integration.
• Pay-as-you-go data integration [Halevy et al., 2005]
• Handle uncertainty and provide best-effort services from the outset
• Evolve mappings between data sources over time
• This talk: build a completely self-configuring data integration system by applying probabilistic models.

Technical Contributions
1. Probabilistic schema mappings
• Query answering w.r.t. p-mappings [VLDB’07]
• Automatic creation [SIGMOD’08]
2. Probabilistic mediated schema [SIGMOD’08]
• Expressiveness
• Automatic creation
3. Probabilistic integrity constraints [WebDB’09]
• Automatic creation
• An application
[Figure: queries Q1..Qm and Q over a P-Med-Schema, linked through PM1..PM5 to data sources D1..D5 labeled PD1..PD5]

Roadmap
1. Probabilistic schema mappings
• Query answering w.r.t. p-mappings [VLDB’07]
• Automatic creation [SIGMOD’08]
2. Probabilistic mediated schema [SIGMOD’08]
• Expressiveness
• Automatic creation
3. Probabilistic integrity constraints [WebDB’09]
• Automatic creation
• An application

Example Probabilistic Mappings
Mediated schema: T(name, email, mailing-addr, home-addr, office-addr)
Source schema: S(pname, email-addr, current-addr, permanent-addr)
Candidate mappings between S and T: m1 (Pr=0.5), m2 (Pr=0.4), m3 (Pr=0.1)
Q: SELECT mailing-addr FROM T
Reformulations over S, one per candidate mapping:
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S

Top-k Query Answering w.r.t. Probabilistic Mappings
Candidate mappings between S and T: m1 (Pr=0.5), m2 (Pr=0.4), m3 (Pr=0.1)
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S

Schema Mapping
• S=(pname, email-addr, home-addr, office-addr)
• T=(name, mailing-addr)
• Mappings: one-to-one schema matching
• Queries: select-project-join queries

Probabilistic Mapping
• S=(pname, email-addr, home-addr, office-addr)
• T=(name, mailing-addr)

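By-Table vs. By-Tuple Semantics
By-table semantics: one mapping applies to every tuple of the source instance Ds, so each mapping yields one possible target instance DT.
Pr(m1)=0.5, Pr(m2)=0.4, Pr(m3)=0.1
[Figure: source instance Ds and the possible target instances DT]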

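By-Table vs. By-Tuple Semantics
By-tuple semantics: each tuple of Ds may follow a different mapping, so every sequence of mappings yields one possible target instance DT.
…, Pr(<m1,m3>)=0.05, Pr(<m2,m3>)=0.04, Pr(<m3,m3>)=0.01
[Figure: source instance Ds and the possible target instances DT]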
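The sequence probabilities on this slide are products of the individual mapping probabilities (for example 0.5 × 0.1 = 0.05 for <m1,m3>). A minimal sketch of both semantics, assuming a source instance with two tuples and independent mapping choices per tuple; the function names are illustrative and not from the talk:

```python
from itertools import product

# Mapping probabilities as given on the slides.
MAPPING_PROB = {"m1": 0.5, "m2": 0.4, "m3": 0.1}

def by_table_worlds(mapping_prob):
    """By-table: a single mapping is chosen for the whole source table."""
    return {(m,): p for m, p in mapping_prob.items()}

def by_tuple_worlds(mapping_prob, n_tuples):
    """By-tuple: every source tuple picks its own mapping (assumed independent)."""
    worlds = {}
    for seq in product(mapping_prob, repeat=n_tuples):
        prob = 1.0
        for m in seq:
            prob *= mapping_prob[m]
        worlds[seq] = prob
    return worlds

table_worlds = by_table_worlds(MAPPING_PROB)
tuple_worlds = by_tuple_worlds(MAPPING_PROB, n_tuples=2)
for seq, prob in tuple_worlds.items():
    print(seq, round(prob, 2))
```

With three mappings and n tuples there are 3^n sequences, which is where the cost of by-tuple semantics comes from.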

By-Table Query Answering
Q: SELECT mailing-addr FROM T
[Figure: source instance Ds and the possible target instances DT with probabilities 0.5, 0.4, 0.1]

By-Tuple Query Answering
Q: SELECT mailing-addr FROM T
[Figure: source instance Ds and the possible target instances DT with probabilities …, 0.05, 0.04, 0.01]

Complexity of Query Answering

By-Table Query Answering: Query Rewriting
Q: SELECT mailing-addr FROM T
Rewritten queries:
• SELECT home-addr FROM S (Pr=0.5)
• SELECT office-addr FROM S (Pr=0.4)
• SELECT email-addr FROM S (Pr=0.1)
Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping.
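The rewriting idea can be sketched as follows: run one rewritten query per candidate mapping and add that mapping's probability to every value it returns. The source rows below are hypothetical (the slide's table did not survive), and `answer_by_table` is an illustrative name rather than anything from the talk:

```python
from collections import defaultdict

# Hypothetical instance of S(pname, email-addr, home-addr, office-addr).
S = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "home-addr": "Sunnyvale", "office-addr": "Mountain View"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "home-addr": "Mountain View", "office-addr": "Mountain View"},
]

# Source attribute that mailing-addr maps to under each mapping, with its probability.
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

def answer_by_table(source_rows, p_mapping):
    """Answer SELECT mailing-addr FROM T under by-table semantics."""
    answers = defaultdict(float)
    for src_attr, prob in p_mapping:
        # Rewritten query: SELECT DISTINCT <src_attr> FROM S
        for value in {row[src_attr] for row in source_rows}:
            answers[value] += prob
    return dict(answers)

by_table = answer_by_table(S, P_MAPPING)
for value, prob in sorted(by_table.items(), key=lambda kv: -kv[1]):
    print(f"{value}: {prob:.2f}")
```

Each rewritten query touches the data once, so the work grows with (number of mappings) × (data size), in line with the PTIME claim.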

By-Tuple Query Answering: Target Enumeration
Q: SELECT mailing-addr FROM T
[Figure: source instance Ds and the enumerated target instances DT, with numeric labels …, 0.05, 0.04, 0.01, 0.2]
Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping.
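A brute-force version of target enumeration, as a sketch only: it walks every sequence of mappings (one choice per source tuple, assumed independent) and accumulates the probability of each returned value. The source rows are hypothetical and the function name is illustrative:

```python
from collections import defaultdict
from itertools import product

# Hypothetical instance of S(pname, email-addr, home-addr, office-addr).
S = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "home-addr": "Sunnyvale", "office-addr": "Mountain View"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "home-addr": "Mountain View", "office-addr": "Mountain View"},
]
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

def answer_by_tuple_bruteforce(source_rows, p_mapping):
    """Answer SELECT mailing-addr FROM T by enumerating all mapping sequences."""
    answers = defaultdict(float)
    # len(p_mapping) ** len(source_rows) sequences: exponential in the data size.
    for seq in product(p_mapping, repeat=len(source_rows)):
        prob = 1.0
        values = set()
        for row, (src_attr, p) in zip(source_rows, seq):
            prob *= p
            values.add(row[src_attr])
        for value in values:
            answers[value] += prob
    return dict(answers)

by_tuple = answer_by_tuple_bruteforce(S, P_MAPPING)
```

On this toy instance the enumeration visits 9 sequences; the count triples with every additional source tuple.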

More on By-Tuple Query Answering
• The high complexity comes from computing probabilities
• Theorem: Even computing the probability of one possible answer is #P-hard
• Theorem: Computing all possible answers without probabilities is in PTIME
• In general, by-tuple query answering cannot be done by query rewriting
• There are two subsets of queries that can be answered in PTIME by query rewriting

PTIME for Queries with a Single P-Mapping Target
Q: SELECT mailing-addr FROM T
Rewritten queries:
• SELECT home-addr FROM S (Pr=0.5)
• SELECT office-addr FROM S (Pr=0.4)
• SELECT email-addr FROM S (Pr=0.1)
Answer probabilities shown on the slide:
• 1-(1-0.4)(1-0.5-0.4) = 0.94
• 1-(1-0.5)(1-0) = 0.5
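The two products on this slide suggest a per-answer formula: for each source tuple, sum the probabilities of the mappings under which that tuple yields the answer, then combine tuples as 1 − Π(1 − sum). The sketch below applies that reading directly; it is not the authors' rewriting algorithm, and the source rows are hypothetical values chosen so that the slide's arithmetic is reproduced:

```python
# Hypothetical instance of S(pname, email-addr, home-addr, office-addr).
S = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "home-addr": "Sunnyvale", "office-addr": "Mountain View"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "home-addr": "Mountain View", "office-addr": "Mountain View"},
]
P_MAPPING = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

def answer_by_tuple_single_target(source_rows, p_mapping):
    """By-tuple answers to SELECT mailing-addr FROM T without enumerating sequences."""
    candidates = {row[attr] for row in source_rows for attr, _ in p_mapping}
    answers = {}
    for value in candidates:
        miss = 1.0  # probability that no source tuple produces this value
        for row in source_rows:
            hit = sum(p for attr, p in p_mapping if row[attr] == value)
            miss *= 1.0 - hit
        answers[value] = 1.0 - miss
    return answers

closed_form = answer_by_tuple_single_target(S, P_MAPPING)
# "Mountain View": 1-(1-0.4)(1-0.5-0.4); "Sunnyvale": 1-(1-0.5)(1-0)
```

The loop is polynomial in the number of tuples and mappings, in contrast to enumerating every sequence of mappings.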

PTIME for Queries that Return Join Attributes
Q: SELECT mailing-addr FROM T, V WHERE T.mailing-addr = V.hightech
Subqueries: SELECT mailing-addr FROM T; SELECT hightech FROM V
Answer probability: 0.94*0.8

Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes
Q: SELECT 'true' FROM T, V WHERE T.mailing-addr = V.hightech
Subqueries: SELECT mailing-addr FROM T; SELECT hightech FROM V
Answer probability: ≠ 0.94*0.8 + 0.5*0.8

Extensions to More Expressive Mappings
• The complexity results for query answering carry over to three extensions to more expressive mappings
• Complex mappings
  E.g., address → street, city, and state
• GLAV mappings
  E.g., Paper ⋈ Authorship ⋈ Author → Publication
• Conditional mappings
  E.g., if age > 65, Pr(home-addr → mailing-addr) = 0.8; if age <= 65, Pr(home-addr → mailing-addr) = 0.5

Creation: 1) Matching Attributes
• Current schema matching techniques can compute similarity between source attributes and target attributes: weighted attribute correspondences
• Goal: find a p-mapping that is consistent with a set of weighted correspondences
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: correspondences between S and T with weights 0.2, 0.8, 0.9, 0.9]

Creation: 1) Matching Attributes
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: correspondences with weights 0.2, 0.8, 0.9, 0.9, and four candidate mappings between S and T with p = .4, .4, .1, .1]

Creation: 2) Normalization
• Theorem: Let C be a set of weighted correspondences. There exists a p-mapping consistent with C if and only if for every source/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
• Approach: Normalization
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: correspondences with weights 0.2, 0.8, 0.9, 0.9]

Creation: 2) Normalization
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: correspondences after normalization, with weights 0.2, 0.8, 0.5, 0.5]
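A small sketch of the theorem's condition and of one normalization scheme that reproduces the weights shown (0.9, 0.9 becoming 0.5, 0.5). Two things here are assumptions: the pairing of weights to attribute pairs, which is inferred from the order on the slide, and the per-attribute rescaling itself, since the slides do not spell out the procedure:

```python
from collections import defaultdict

# (source attribute, target attribute) -> weight; pairing assumed from slide order.
CORRESPONDENCES = {
    ("num", "name"): 0.2,
    ("pname", "name"): 0.8,
    ("home-addr", "mailing-addr"): 0.9,
    ("office-addr", "mailing-addr"): 0.9,
}

def attribute_sums(corr):
    """Total weight of the correspondences touching each source/target attribute."""
    sums = defaultdict(float)
    for (src, tgt), weight in corr.items():
        sums[("src", src)] += weight
        sums[("tgt", tgt)] += weight
    return sums

def admits_p_mapping(corr, eps=1e-9):
    """Theorem's condition: every attribute's weights sum to at most 1."""
    return all(total <= 1.0 + eps for total in attribute_sums(corr).values())

def normalize(corr):
    """Rescale the correspondences of any attribute whose weights sum above 1."""
    corr = dict(corr)
    for side, attr in list(attribute_sums(corr)):
        idx = 0 if side == "src" else 1
        keys = [k for k in corr if k[idx] == attr]
        total = sum(corr[k] for k in keys)
        if total > 1.0:
            for k in keys:
                corr[k] /= total  # weights only shrink, so earlier attributes stay valid
    return corr

normalized = normalize(CORRESPONDENCES)
```

Because rescaling never increases a weight, a single pass over the attributes is enough for this scheme to satisfy the condition.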

Creation: 2) Normalization
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: normalized weights 0.2, 0.8, 0.5, 0.5, and four candidate mappings between S and T with p = .4, .4, .1, .1]

Creation: 3) Generating p-Mappings
• However, different p-mappings can be consistent with the same set of weighted correspondences.
• Solution: Choose the p-mapping with the maximum entropy.
• Algorithm:
• Given C, enumerate all possible one-to-one mappings with a subset of correspondences in C.
• Solve the following optimization problem:
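The optimization problem itself did not survive in this transcript. One formulation consistent with the maximum-entropy statement above would be the following, where the symbols are assumptions introduced here: m_1, …, m_l are the enumerated one-to-one mappings, p_k is the probability assigned to m_k, and w_c is the (normalized) weight of correspondence c in C.

```latex
\begin{aligned}
\max_{p_1,\dots,p_l}\quad & -\sum_{k=1}^{l} p_k \log p_k \\
\text{subject to}\quad & 0 \le p_k \le 1 \qquad (k = 1,\dots,l), \\
& \sum_{k=1}^{l} p_k = 1, \\
& \sum_{k \,:\, c \in m_k} p_k = w_c \qquad \text{for every correspondence } c \in C .
\end{aligned}
```

The last constraint is what "consistent with the weighted correspondences" means in this reading: the probabilities of all mappings containing c add up to the weight of c.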

Creation: 3) Generating p-Mappings
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: normalized weights 0.2, 0.8, 0.5, 0.5, and a consistent p-mapping with four candidate mappings, p = .4, .4, .1, .1]

Creation: 3) Generating p-Mappings
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: normalized weights 0.2, 0.8, 0.5, 0.5, and another consistent p-mapping with three candidate mappings, p = .3, .5, .2]


Creation: 3) Generating p-Mappings
• S=(num, pname, home-addr, office-addr), T=(name, mailing-addr)
[Figure: normalized weights 0.2, 0.8, 0.5, 0.5, and the p-mapping with four candidate mappings, p = .4, .4, .1, .1]
Entropy = 0.52 (vs. 0.45 for the alternative with p = .3, .5, .2)
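The two entropy values can be checked with a few lines. The sketch assumes base-10 logarithms, which is the base that reproduces 0.52 and 0.45, and it assumes a concrete assignment of correspondences to each candidate mapping, since the arrows in the figure were lost; under that assignment both p-mappings have the same per-correspondence totals (0.2, 0.8, 0.5, 0.5):

```python
import math
from collections import defaultdict

def entropy(probs, base=10):
    """Shannon entropy of a probability vector (base 10 assumed to match the slide)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def marginals(p_mapping):
    """Total probability of the mappings that contain each correspondence."""
    totals = defaultdict(float)
    for correspondences, prob in p_mapping:
        for c in correspondences:
            totals[c] += prob
    return dict(totals)

PN, NN = ("pname", "name"), ("num", "name")
HM, OM = ("home-addr", "mailing-addr"), ("office-addr", "mailing-addr")

# Assumed composition of the candidate mappings behind each probability.
four_way = [({PN, HM}, 0.4), ({PN, OM}, 0.4), ({NN, HM}, 0.1), ({NN, OM}, 0.1)]
three_way = [({PN, HM}, 0.3), ({PN, OM}, 0.5), ({NN, HM}, 0.2)]

h_four = entropy([p for _, p in four_way])
h_three = entropy([p for _, p in three_way])
print(round(h_four, 2), round(h_three, 2))
```

Both candidates satisfy the same constraints, so entropy is what separates them; the four-way p-mapping is the less committal of the two.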

Related Work
• Approximate schema mappings
• Using top-k schema mappings can increase the recall of query answering without sacrificing much precision [Magnani & Montesi, 2007]
• Generating top-k schema mappings by combining results from various matchers [Gal, 2007]

Roadmap
1. Probabilistic schema mappings
• Query answering w.r.t. p-mappings [VLDB’07]
• Automatic creation [SIGMOD’08]
2. Probabilistic mediated schema [SIGMOD’08]
• Expressiveness
• Automatic creation
3. Probabilistic integrity constraints [WebDB’09]
• Automatic creation
• An application

An Example Mediated Schema
• A mediated schema can be considered as a clustering of important attributes in source schemas
S1(name, email, phone-num, address)
S2(person-name, phone, mailing-addr)
Med-S(name, email, phone, addr)
Clusters: {name, person-name}, {email}, {phone-num, phone}, {address, mailing-addr}

Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Q: SELECT name, hPhone, oPhone FROM Med

Why Probabilistic Mediated Schema?
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Candidate mediated schemas:
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Q: SELECT name, phone, address FROM Med

Our Solution: Probabilistic Mediated Schema
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Pr=.5
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=.5

P-Mappings w.r.t. P-Med-Schema
PM1: between Med3 (name, hPP, oP, hAA, oA) and S1(name, hP, oP, hA, oA)
• Four candidate mappings with Pr = .64, .16, .16, .04
PM2: between Med4 (name, oPP, hP, oAA, hA) and S1(name, hP, oP, hA, oA)
• Four candidate mappings with Pr = .64, .16, .16, .04

Query Answering w.r.t. P-Mappings & P-Med-Schema
Q: SELECT name, phone, address FROM Med-S
[Figure: source S1 and the resulting answers]

Expressive Power of P-Med-Schema vs. P-Mapping
Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mappings) > (p-med-schema + mappings)
Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) > (mediated schema + p-mappings)

Creation: 1) Creating a Single Med-Schema
Input: Single-table source schemas S1, …, Sn
Output: Single-table mediated schema M
Algorithm:
1. Remove all infrequent attributes
2. Find similarity between every pair of attributes and construct a weighted graph
3. Remove edges with weight below τ (e.g., τ=.5)
4. Each connected component is a cluster
[Figure: S1 with attributes name, address, email-address; S2 with attributes pname, home-address; edge weights .6, .6, 1, .2]
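The four steps can be sketched as a thresholded graph followed by a connected-components pass. Everything concrete below is an assumption for illustration: the `min_frequency` parameter (the slide gives no frequency cut-off), the function name, and the similarity values, since the pairing of the figure's edge weights to attribute pairs is not recoverable:

```python
from collections import Counter

def build_mediated_schema(source_schemas, similarity, min_frequency=0.0, tau=0.5):
    """Cluster source attributes into mediated-schema attributes."""
    # Step 1: drop attributes that appear in too few source schemas.
    counts = Counter(a for schema in source_schemas for a in set(schema))
    n = len(source_schemas)
    attrs = sorted(a for a, c in counts.items() if c / n >= min_frequency)
    # Steps 2-3: build the weighted graph, keeping only edges with weight >= tau.
    adjacent = {a: set() for a in attrs}
    for (a, b), weight in similarity.items():
        if a in adjacent and b in adjacent and weight >= tau:
            adjacent[a].add(b)
            adjacent[b].add(a)
    # Step 4: each connected component (found by depth-first search) is a cluster.
    clusters, seen = [], set()
    for start in attrs:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacent[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

schemas = [["name", "address", "email-address"], ["pname", "home-address"]]
# Hypothetical pairwise similarities.
similarity = {
    ("name", "pname"): 0.6,
    ("address", "home-address"): 0.6,
    ("address", "email-address"): 0.2,
}
clusters = build_mediated_schema(schemas, similarity)
```

With τ = .5 the 0.2 edge is dropped, so email-address ends up in a cluster of its own.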
