Efficient Management of Inconsistent and Uncertain Data

Efficient Management of Inconsistent and Uncertain Data Renée J. Miller University of Toronto

Contributors • Ariel Fuxman, PhD Thesis • Microsoft Search Labs • Jim Gray SIGMOD 2008 Dissertation Award • Periklis Andritsos, PhD • Jiang Du, MS • Elham Fazli, MS • Diego Fuxman, Undergrad

Dirty Databases No. I don’t see Any problem with the data • The presence of dirty data is a major problem in enterprises • Traditional solution: data cleaning

Limitations of Data Cleaning • Semi-automatic process • Requires highly-qualified domain experts • Time consuming • May not be possible to wait until the database is clean • Operational systems answer queries assuming clean data

Our Work Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases Show how to do it efficiently and reusing existing database technology

Why is this Business Intelligence? • Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of information. • The goal of BI is to support better decision making, based on information. • DBMS should provide meaningful query answers even over data that is dirty

Outline • Introduction • Semantics for dirty databases • Contributions • Conclusions

A Data Integration Example Sales Integrated Customer Database Shipping Customer Support Web Forms Demographic Data Integrating customer data…

Matching and Merging Matching and merging are two fundamental tasks in data integration Web Sales

True Disagreement Between Sources What’s Peter’s salary? Web Sales

Inconsistent Integrated Databases SATISFY custid KEY VIOLATES custid KEY Inconsistent Integrated Database Web Sales In the absence of complete resolution rules…

Query: “Get customers who make more than 100K” Querying Inconsistent Databases Example: Offering a Platinum credit card… Peter,Paul,Mary Are we sure that we want to offer a card to Peter? web sales sales/web web sales

Aggressive: Get customers who possibly make more than 100K Peter, Paul, Mary Conservative: Get customers who certainly make more than 100K Paul, Mary Querying Inconsistent Databases

Formal Semantics • Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98] • Possible world: “complete” databases • Consistent answers • Proposed by Arenas, Bertossi, and Chomicki in 1999 • Corresponds to conservative semantics • Possible world: “consistent” databases

Consistent Answers Repairs Inconsistent database web sales sales/web web sales Key: custid

Consistent Answers Query=“Get customers who make more than 100K” Repairs q CONSISTENT ANSWER= {Paul,Mary} CONSISTENT ANSWERSAnswers obtained no matter which repair we choose q q q

When We Started… • Semantics well understood • Problem • Potentially HUGE number of repairs! • Negative results [Chomicki et al 02, Arenas et al. 01, Cali et al 04] • Few tractability results [Arenas et al. 99, Arenas et al. 01] • Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03] • Expressive queries and constraints • Computationally expensive • Applicable only to small databases with small number of inconsistencies

Our Proposal: ConQuer SQL query q Keys Consistent answertoq ConQuer’s Rewriting Algorithm Commercial database engine Inconsistent database Rewritten SQL query Q*

Class of Rewritable Queries • ConQuer handles a broad class of SPJ queries with • Set semantics • Bag semantics, grouping, and aggregation • No restrictions on • Number of relations • Number of joins • Conditions or built-in predicates • Key-to-key joins • The class is “maximal”

Why not all SPJ queries? • Some SPJ queries cannot be rewritten into SQL • Consistent query answering is coNP-complete even for some SPJ queries and key constraints • Maximality of ConQuer’s class • Minimal relaxations lead to intractability • Restrictions only on • Nonkey-to-nonkey joins • Self joins • Nonkey-to-key joins that form a cycle

Example: A Rewritable Query • TPC-H Query 10 SELECT c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment FROM customer, orders, lineitem, nation WHERE c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= '1993-10-01' and o_orderdate < date('1993-10-01') + 3 MONTHS and l_returnflag = 'R' and c_nationkey = n_nationkey GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment ORDER BY revenue desc

Rewritings Can Get Quite Complex Rewriting of TPC-H Query 10 Can this rewriting be executed efficiently? 1.7 overhead 20 GB database, 5% inconsistency

Experimental Evaluation • Goals • Quantify the overhead of the rewritings • Assess the scalability of the approach • Determine sensitivity of the rewritten queries to level of inconsistency of the instance • Queries and databases • Representative decision support queries (TPC-H benchmark) • TPC-H databases, altered to introduce inconsistencies • Database parameters • database size • percentage of the database that is inconsistent • conflicts per key value (in inconsistent portion)

Scalability Worst Case 5.8 overhead Selectivity 98.56 % 5 % inconsistent tuples 2 conflicts per inconsistent key value Best Case 1.2 overhead Selectivity 0.001 % Size (GB)

Contributions – Theory • Formal characterization of a broad class of queries • For which computing consistent answers is tractable under key constraints • That can be rewritten into first-order/SQL • Query rewriting algorithms for a class of Select-Project-Join queries • With set semantics • With bag semantics, grouping, and aggregation • Maximality of the class of queries

Contributions – Practice • Implementation of ConQuer • Designed to compute consistent answers efficiently • Multiple rewriting strategies • Experimental validation of efficiency and scalability • Representative queries from TPC-H • Large databases

Uncertain Data PROVENANCE INFORMATION (e.g., source reputation) Integrated Database 0.3 Web 0.3 0.7 1 0.3 0.7 0.7 Sales

Publications and Demo • These and other contributions appear in • ICDT05/JCSS06 • SIGMOD05 • ICDE06 • PODS06/TODS06 • VLDB06 • Demo given at VLDB05 • http://queens.db.toronto.edu/project/conquer/demo2/

A Virtuous Cycle Query Answering Data Integration • Use knowledge about inconsistencies to: • give better answers • suggest ways to clean the database Recognize and characterize inconsistent data

Beyond the Enterprise • Can we apply principled models of inconsistency or uncertainty to the Web? • Different assumptions • Uncertainty in queries • There’s never a “true” answer • Challenge • Build models based on user preferences • Leverage massive repositories of user behavior data

THANK YOU Plug:Discovering Data Quality Rules, Fei Chiang Thursday 11:15am Research Session 33

Efficient Management of Inconsistent and Uncertain Data

Efficient Management of Inconsistent and Uncertain Data

Presentation Transcript

Probabilistic/Uncertain Data Management -- III

Uncertain Data Management

Managing Inconsistent Data in Data Integration and Data Exchange

Probabilistic/Uncertain Data Management

Probabilistic/Uncertain Data Management -- IV

Nectar: Efficient Management of Computation and Data in Data Centers

Uncertain Data Management for Sensor Networks

Inconsistent Systems

OLAP Over Uncertain and Imprecise Data

Clustering Uncertain Data

Probabilistic Queries and Uncertain Data

EFFICIENT RANK BASED K-NN QUERY PROCESSING OVER UNCERTAIN DATA

COMP9315 Uncertain and Probabilistic Data

Robust Ranking of Uncertain Data

Managing Uncertain Data

Inconsistent Heuristics

Efficient Evaluation of Probabilistic Advanced Spatial Queries on Existentially Uncertain Data

Inconsistent

ConQuer: Efficient Management of Inconsistent Databases

Efficient And Imperative Test Data Management Strategies

ConQuer: Efficient Management of Inconsistent Databases

Efficient Probabilistic Reverse Nearest Neighbor Query Processing on Uncertain Data