490 likes | 897 Views
Managing Inconsistent Data in Data Integration and Data Exchange . Renée J. Miller University of Toronto Periklis Andritsos, Ariel Fuxman Tasos Kementsietsidis, Yannis Velegrakis. Outline. Schema Mapping – reconciling differences in schemas Clio: Creating mappings (VLDB00,VLDB02)
E N D
Managing Inconsistent Datain Data Integration and Data Exchange Renée J. Miller University of Toronto Periklis Andritsos, Ariel Fuxman Tasos Kementsietsidis, Yannis Velegrakis
Outline • Schema Mapping – reconciling differences in schemas • Clio: Creating mappings (VLDB00,VLDB02) • Using semantics of schemas and data • ToMAS: Managing schema mappings (VLDB03) • Evolving schemas and semantics • Using Mappings • Data Exchange (ICDT03) • Querying Inconsistent Data (IJCAI/IIWeb03) • Data Mapping – reconciling differences in data • Hyperion: managing data mappings (SIGMOD03) • Using networks of P2P data mappings R.J. Miller - U. Toronto
Q Mapping “conforms to” “conforms to” data data Mapping Independent Data Sources Source Schema S’’ Source Schema S’ data Source Schema S Target Schema T data • Data Integration – answer target queries using data from source(s) • Target data is virtual R.J. Miller - U. Toronto
Q Mapping “conforms to” data Mapping Independent Data Sources Source Schema S Target Schema T “conforms to” data • Data Exchange – answer target queries answered locally • Target data is materialized R.J. Miller - U. Toronto
Overview • Goal: interoperability between independent data sources • Creating Mappings • Managing Mappings – as sources change • Using Mappings – to query and exchange data • Even when data is dirty or inconsistent • Challenges • Schemas can be arbitrarily different • Still, data must not lose its meaning • Use semantics embedded in schemas & data • Facilitate specification of any additional semantics • Performed manually: complex user queries, programs, etc. • Hard to debug; understand; verify correctness R.J. Miller - U. Toronto
Q Mapping “conforms to” Schema Mapping • Wants data from S • Understands T • May not understand S • XML Schema • DTD • Relational Source schema S Target schema T “conforms to” data • Automate (to the extent possible) the creation of mappings • Mappings used for (virtual) data integration or (materialized) data exchange R.J. Miller - U. Toronto
Illustration: Mapping Creation • Support Nested Structures • Element correspondences • Human friendly • Automatic discovery • Preserve data meaning • Discover data associations • Use constraints & schema • Create New Target Values • Produce Correct Grouping R.J. Miller - U. Toronto
Creating Correspondences • Graphical User Interface • DBA interactively specifies • Automatic Discovery • Attribute (Element) Classifier • Extensible to • Other Schema Matchers • VLDB J. 01 Survey • Correspondence based on syntactic information • Within schema or data R.J. Miller - U. Toronto
Interpreting Correspondences statDB: Set of Rcd cityStat: Rcd orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project What semantics do we associate to an arrow? Good enough for one arrow ! cidexpenseDB.companiescidstatDB.cityStat.orgs Still works for these two arrows! cid,nameexpenseDB.companiescid,namestatDB.cityStat.orgs How about now ? gidexpenseDB.grantsgidstatDB.cityStat.orgs.fundings R.J. Miller - U. Toronto
Associations btw Elements statDB: Set of Rcd cityStat: Rcd orgs: Set of Rcd org: Rcd cid name fundings:Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount city expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project • We must recognize that grants are associated to companies • Association (in the source): grants⋈ companies • Association (in the target):statDB⋈ orgs ⋈ fundings ⋈ financials R.J. Miller - U. Toronto
Schema Mapping • Enumerate ALLlogical associations consistent with schema semantics • Constraints • Nesting (schema structure) • Data • Interpret correspondences (arrows) over pair source & target association R.J. Miller - U. Toronto
Mappings as Views q • Views: st have a special form: • GAV:Qs(S) Ti where Ti is a relation in T, Qs is a query on S • Company(C,N,Ct),Grant(C,G,A,P) Projects(P,Ct) • Plain old view: create view projects (p, ct) as (select p,ct from …) • LAV:Si Qt(T) where Si is a relation in S , Qt is a query on T • Company(C,N,City) Org(C,N) City(C,Ct) Target schema T (Global) Source schema S (Local) t st virtual ! I J R.J. Miller - U. Toronto
Clio Mappings q • Clio Schema Mapping: • Qs(S) Qt(T) and constraints on S and T (s , t ) • More general than views • Generality often required when S, T are fixed • No design control Target schema T (Global) Source schema S (Local) t st Virtual or Materialized I J R.J. Miller - U. Toronto
Using Mappings and Views q • Data Integration • The target is not materialized; it is just a querying interface • Queries are posed on the target schema; data is in the source. • Problem: how to answer the query in the “best” possible way • AKA: Answering queries using views • GAV/LAV (mostly) assumes conjunctive queries • (mostly) assumes no target constraints – target is a view • Uses relational (not nested relational) model Target schema T Source schema S t st virtual ! I J R.J. Miller - U. Toronto
Using Mappings q • Data Exchange • The target is materialized • Queries are posed on the target schema; answered using target data • Problem: what is “best” instance to exchange • Mapping:Qs(S) Qt(T) and constraints on S and T (s , t ) • Given instance of S there may be many instances of T • Which is best instance to exchange? • Grant(C,G,A,P,S) Funding(C,G,Aid),Financials(Aid,D,A) Target schema T (Global) Source schema S (Local) t st Materialize! I J R.J. Miller - U. Toronto
Semantics of Query Answering • Answering queries using views • Query is answered using source data • Answer is set of tuples in query result on ALL possible target instances: certain answers • Data Exchange • Query is answered using ONE materialized target • Can single target give same information as source(s)? • Is query result the same in both settings? R.J. Miller - U. Toronto
Mappings at Data Level Employee Salary Employee Salary Employee Salary Financial Target Database HumanResources • Mapping • Financial(e,s) Global(e,s) • HumanRes(e,s) Global(e,s) R.J. Miller - U. Toronto
Data Inconsistency Employee Salary Employee Salary Employee Salary Financial Target Database HumanResources Mapping Financial(e,s) Global(e,s) HumanRes(e,s) Global(e,s) R.J. Miller - U. Toronto
Reconciling Inconsistencies (I) Employee Salary Employee Salary Employee Salary 1 – Delete all tuples for John Financial Target Database HumanResources Mapping Financial(e,s) Global(e,s) HumanRes(e,s) Global(e,s) R.J. Miller - U. Toronto
Reconciling Inconsistencies (II) Employee Salary Employee Salary Employee Salary 2 – Delete the salaries of John Financial Target Database HumanResources Mapping Financial(e,s) Global(e,s) HumanRes(e,s) Global(e,s) R.J. Miller - U. Toronto
Reconciling Inconsistencies (III) Employee Salary Employee Salary Employee Salary 3 – Delete only one tuple for John Financial Target Database HumanResources Mapping Financial(e,s) Global(e,s) HumanRes(e,s) Global(e,s) R.J. Miller - U. Toronto
Repairing an integrated database Employee Salary Repair 1 An integrated inconsistent database R.J. Miller - U. Toronto
Repairing an integrated database Repair 1 An integrated inconsistent database Repair 2 R.J. Miller - U. Toronto
Consistent Query Answers Repair 1 • Intuition: • Input: query Q • Get a query result Q( ) for each • repair . • A tuple is in the consistent answer • if it appears in all query results. Repair 2 R.J. Miller - U. Toronto
Consistent Query Answers Repair 1 Q(e,s)=Target(e,s) “Get all employees and their salaries” Repair 2 R.J. Miller - U. Toronto
Consistent Query Answers Repair 1 Q(e,s)=Target(e,s) “Get all employees and their salaries” Repair 2 ConsistentS(Q,I)={(Mary,3000)} R.J. Miller - U. Toronto
Consistent Query Answers Repair 1 Q(e)= s: Global(e,s) “Get all employees” Repair 2 R.J. Miller - U. Toronto
Consistent Query Answers Result 1 Q(e)= s: Global(e,s) “Get all employees” Result 2 ConsistentS(Q,I)={(John),(Mary)} R.J. Miller - U. Toronto
Consistent Query Answers Repair 1 Q= e Target(e,2000) “Is there an employee who earns $2000?” Repair 2 R.J. Miller - U. Toronto
Consistent Query Answers Repair 1 FALSE Q= e Target(e,2000) “Is there an employee who earns $2000?” Repair 2 ConsistentS(Q,I)=FALSE TRUE R.J. Miller - U. Toronto
Our work (IJCAI/IIWeb03) Problem: Retrieving consistent answers is co-NP complete in general (i.e., we need to explore an exponential number of repairs) [Chomicki and Marcinkowski 2002, Cali et al. 2003] R.J. Miller - U. Toronto
Our work Problem: Retrieving consistent answers is co-NP complete in general (i.e., we need to explore an exponential number of repairs) [Chomicki and Marcinkowski 2002, Cali et al. 2003] Goal: Find a class of tractable queries (i.e., the consistent answers can be retrieved in polynomial time without explicitly building all repairs). R.J. Miller - U. Toronto
Example: A tractable query Are there two employees with the same salary? Inconsistent instance Graph of the inconsistent instance 1000 John Mary 2000 Anna 3000 Employee Salary R.J. Miller - U. Toronto
Example: A tractable query 1000 John Mary 2000 Anna 3000 R.J. Miller - U. Toronto
Example: A tractable query 1000 John Mary 2000 Anna 3000 R.J. Miller - U. Toronto
Inexpressibility result • Query rewriting • Input: query Q • Output: query Q’ s.t. Q’(I)=consistentS(Q,I) for every I. • Appealing approach • tractable • reuses existing DBMSs • BUT: so far known to be applicable only to a restricted classes of queries ([ABC, PODS 1999]) R.J. Miller - U. Toronto
Inexpressibility result • Can we use query rewriting? R.J. Miller - U. Toronto
Inexpressibility result • Can we use query rewriting? NO R.J. Miller - U. Toronto
Practical Considerations (I) Conflicts are usually confined to a small portion of the database Robert 4000 Fred 5000 Paul 6000 7000 Peter 1000 John 2000 Mary Anna 3000 R.J. Miller - U. Toronto
Practical Considerations (I) Conflicts are usually confined to a small portion of the database 1000 John 2000 Mary Anna 3000 R.J. Miller - U. Toronto
Practical Considerations (II) Reasonable assumption in integration and exchange: constant number of conflicts per key. Financial Target Database Employee ! Salary Human Resources Employee ! Salary R.J. Miller - U. Toronto
Bibliography J. Chomicki and J. Marcinkowski. On the Computational Complexity of Consistent Query Answers. coRR cs.DB/0204010, 2002. M. Arenas, L. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases, Proc. ACM PODS, 1999. Andrea Calì, Domenico Lembo, Riccardo Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases, Proc. ACM PODS, 2003. R.J. Miller - U. Toronto
Data Mapping (SIGMOD03) • What if sources unwilling to share schemas? • Common in more autonomous P2P settings • How can such sources share data? • Shared schema mappings not appropriate • Need to manage and share • Data mappings • Hyperion – P2P data sharing R.J. Miller - U. Toronto
P2P File-Sharing Systems Currently, P2P querying relies on the use of value searches. e.g., retrieve songs for music band “NewOrder” However, P2P query mechanisms do not capture the intricacies of values, i.e., that values are often associated to each other. e.g. the value “New Order” is an alias for the value “Joy Division” We propose the use of mapping tables to record such associations e.g. a mapping table that records artist aliases R.J. Miller - U. Toronto
A P2P Genome Database System Peers store information about genes, proteins, etc. SwissProt(pid, name) “alias” Gene(gid, name) Characteristics of mapping tables: • The recorded associations can be 1:1, 1:n or m:n • They are, in general, non-binary • They associate values within or across domains R.J. Miller - U. Toronto
Contributions State of the art: Mapping tables represent expert knowledge. Currently, they are created manually by domain specialists. Our contributions: We automate the creation and maintenance of these tables. More specifically: • We investigate alternative semantics for mapping tables. • We motivate why reasoning capabilities are needed to manage them. • We propose efficient algorithms for both finding inconsistencies in mapping tables and inferring new mapping tables R.J. Miller - U. Toronto
Conclusions • Managing Data Inconsistency • Tolerate inconsistency • Identify inconsistency at query time • Recognizes that cleaning not always possible or desirable • Reconciling inconsistency • Data mappings record reconciliation • Manage use and combination of data mappings www.cs.toronto.edu/db www.cs.toronto.edu/db/tomas www.cs.toronto.edu/db/hyperion www.cs.toronto.edu/~miller R.J. Miller - U. Toronto