350 likes | 524 Views
Efficient Management of Inconsistent and Uncertain Data. Renée J. Miller University of Toronto. Contributors. Ariel Fuxman , PhD Thesis Microsoft Search Labs Jim Gray SIGMOD 2008 Dissertation Award Periklis Andritsos, PhD Jiang Du, MS Elham Fazli, MS Diego Fuxman, Undergrad.
E N D
Efficient Management of Inconsistent and Uncertain Data Renée J. Miller University of Toronto
Contributors • Ariel Fuxman, PhD Thesis • Microsoft Search Labs • Jim Gray SIGMOD 2008 Dissertation Award • Periklis Andritsos, PhD • Jiang Du, MS • Elham Fazli, MS • Diego Fuxman, Undergrad
Dirty Databases No. I don’t see Any problem with the data • The presence of dirty data is a major problem in enterprises • Traditional solution: data cleaning
Limitations of Data Cleaning • Semi-automatic process • Requires highly-qualified domain experts • Time consuming • May not be possible to wait until the database is clean • Operational systems answer queries assuming clean data
Our Work Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases Show how to do it efficiently and reusing existing database technology
Why is this Business Intelligence? • Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of information. • The goal of BI is to support better decision making, based on information. • DBMS should provide meaningful query answers even over data that is dirty
Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions
Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions
A Data Integration Example Sales Integrated Customer Database Shipping Customer Support Web Forms Demographic Data Integrating customer data…
Matching and Merging Matching and merging are two fundamental tasks in data integration Web Sales
True Disagreement Between Sources What’s Peter’s salary? Web Sales
Inconsistent Integrated Databases SATISFY custid KEY VIOLATES custid KEY Inconsistent Integrated Database Web Sales In the absence of complete resolution rules…
Query: “Get customers who make more than 100K” Querying Inconsistent Databases Example: Offering a Platinum credit card… Peter,Paul,Mary Are we sure that we want to offer a card to Peter? web sales sales/web web sales
Aggressive: Get customers who possibly make more than 100K Peter, Paul, Mary Conservative: Get customers who certainly make more than 100K Paul, Mary Querying Inconsistent Databases
Formal Semantics • Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98] • Possible world: “complete” databases • Consistent answers • Proposed by Arenas, Bertossi, and Chomicki in 1999 • Corresponds to conservative semantics • Possible world: “consistent” databases
Consistent Answers Repairs Inconsistent database web sales sales/web web sales Key: custid
Consistent Answers Query=“Get customers who make more than 100K” Repairs q CONSISTENT ANSWER= {Paul,Mary} CONSISTENT ANSWERSAnswers obtained no matter which repair we choose q q q
Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions
When We Started… • Semantics well understood • Problem • Potentially HUGE number of repairs! • Negative results [Chomicki et al 02, Arenas et al. 01, Cali et al 04] • Few tractability results [Arenas et al. 99, Arenas et al. 01] • Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03] • Expressive queries and constraints • Computationally expensive • Applicable only to small databases with small number of inconsistencies
Our Proposal: ConQuer SQL query q Keys Consistent answertoq ConQuer’s Rewriting Algorithm Commercial database engine Inconsistent database Rewritten SQL query Q*
Class of Rewritable Queries • ConQuer handles a broad class of SPJ queries with • Set semantics • Bag semantics, grouping, and aggregation • No restrictions on • Number of relations • Number of joins • Conditions or built-in predicates • Key-to-key joins • The class is “maximal”
Why not all SPJ queries? • Some SPJ queries cannot be rewritten into SQL • Consistent query answering is coNP-complete even for some SPJ queries and key constraints • Maximality of ConQuer’s class • Minimal relaxations lead to intractability • Restrictions only on • Nonkey-to-nonkey joins • Self joins • Nonkey-to-key joins that form a cycle
Example: A Rewritable Query • TPC-H Query 10 SELECT c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment FROM customer, orders, lineitem, nation WHERE c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= '1993-10-01' and o_orderdate < date('1993-10-01') + 3 MONTHS and l_returnflag = 'R' and c_nationkey = n_nationkey GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment ORDER BY revenue desc
Rewritings Can Get Quite Complex Rewriting of TPC-H Query 10 Can this rewriting be executed efficiently? 1.7 overhead 20 GB database, 5% inconsistency
Experimental Evaluation • Goals • Quantify the overhead of the rewritings • Assess the scalability of the approach • Determine sensitivity of the rewritten queries to level of inconsistency of the instance • Queries and databases • Representative decision support queries (TPC-H benchmark) • TPC-H databases, altered to introduce inconsistencies • Database parameters • database size • percentage of the database that is inconsistent • conflicts per key value (in inconsistent portion)
Scalability Worst Case 5.8 overhead Selectivity 98.56 % 5 % inconsistent tuples 2 conflicts per inconsistent key value Best Case 1.2 overhead Selectivity 0.001 % Size (GB)
Contributions – Theory • Formal characterization of a broad class of queries • For which computing consistent answers is tractable under key constraints • That can be rewritten into first-order/SQL • Query rewriting algorithms for a class of Select-Project-Join queries • With set semantics • With bag semantics, grouping, and aggregation • Maximality of the class of queries
Contributions – Practice • Implementation of ConQuer • Designed to compute consistent answers efficiently • Multiple rewriting strategies • Experimental validation of efficiency and scalability • Representative queries from TPC-H • Large databases
Uncertain Data PROVENANCE INFORMATION (e.g., source reputation) Integrated Database 0.3 Web 0.3 0.7 1 0.3 0.7 0.7 Sales
Publications and Demo • These and other contributions appear in • ICDT05/JCSS06 • SIGMOD05 • ICDE06 • PODS06/TODS06 • VLDB06 • Demo given at VLDB05 • http://queens.db.toronto.edu/project/conquer/demo2/
Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions
A Virtuous Cycle Query Answering Data Integration • Use knowledge about inconsistencies to: • give better answers • suggest ways to clean the database Recognize and characterize inconsistent data
Beyond the Enterprise • Can we apply principled models of inconsistency or uncertainty to the Web? • Different assumptions • Uncertainty in queries • There’s never a “true” answer • Challenge • Build models based on user preferences • Leverage massive repositories of user behavior data
THANK YOU Plug:Discovering Data Quality Rules, Fei Chiang Thursday 11:15am Research Session 33