OLAP Over Uncertain and Imprecise Data T.S. Jayram (IBM Almaden) with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM)
Dimensions in OLAP • Automobile: ALL; Truck (F150, Sierra) and Sedan (Civic, Camry) • Location: ALL; East (MA, NY) and West (CA, TX)
Measures, Facts, and Queries • Example fact: Auto = F150, Loc = NY, Repair = $200 • Example query cell: Auto = Truck, Loc = East, SUM(Repair) = ? • [Figure: facts p1 to p8 placed in cells over the Automobile × Location hierarchies]
Extend the OLAP model to handle data ambiguity • Imprecision • Uncertainty
Imprecision • Example imprecise fact: Auto = F150, Loc = East, Repair = $200 • [Figure: facts p1 to p11 placed over the Automobile × Location hierarchies]
Representing Imprecision using Dimension Hierarchies • Dimension hierarchies lead to a natural space of “partially specified” objects • Sources of imprecision: incomplete data, multiple sources of data
Motivating Example • Query: COUNT • We propose desiderata that enable appropriate definition of query semantics for imprecise data • [Figure: facts p1 to p5 over Truck (F150, Sierra) × East (MA, NY)]
Desideratum I: Consistency • Consistency specifies the relationship between answers to related queries on a fixed data set • [Figure: facts p1 to p5 over Truck (F150, Sierra) × East (MA, NY)]
Desideratum II: Faithfulness • Faithfulness specifies the relationship between answers to a fixed query on related data sets • [Figure: Data Sets 1, 2, and 3, each placing facts p1 to p5 over F150/Sierra × MA/NY]
Formal definitions of both Consistency and Faithfulness depend on the underlying aggregation operator • Can we define query semantics that satisfy these desiderata?
Query Semantics • Possible Worlds [Kripke63,…] • [Figure: facts p1 to p5 over F150/Sierra × MA/NY, expanded into possible worlds w1 to w4]
Possible Worlds Query Semantics • Given all possible worlds together with their probabilities, queries are easily answered (using expected values) • But the number of possible worlds is exponential in the number of imprecise facts!
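The possible-worlds semantics can be illustrated with a brute-force sketch. Everything in it is hypothetical: the toy facts, the function names, and in particular the assumption that each imprecise fact picks one of its candidate cells uniformly and independently of the others. The number of worlds is the product of the candidate-cell counts, which is why enumeration does not scale.

```python
from itertools import product

# Hypothetical toy data: fact id -> cells the fact could occupy.
# A precise fact has one candidate cell; an imprecise fact has several.
facts = {
    "p1": [("F150", "MA")],
    "p2": [("Sierra", "NY")],
    "p3": [("F150", "MA"), ("F150", "NY")],      # Auto = F150, Loc = East
    "p4": [("F150", "NY"), ("Sierra", "NY")],    # Auto = Truck, Loc = NY
}

def possible_worlds(facts):
    """Yield (world, probability) pairs.

    Assumption for this sketch: every fact chooses a candidate cell
    uniformly at random, independently of all other facts.
    """
    names = list(facts)
    for choice in product(*(facts[n] for n in names)):
        prob = 1.0
        for n in names:
            prob /= len(facts[n])
        yield dict(zip(names, choice)), prob

def expected_count(facts, query_cells):
    """Expected COUNT of facts falling inside query_cells over all worlds."""
    return sum(
        prob * sum(1 for cell in world.values() if cell in query_cells)
        for world, prob in possible_worlds(facts)
    )

worlds = list(possible_worlds(facts))
answer = expected_count(facts, {("F150", "NY")})
```

With two imprecise facts of two candidates each, `worlds` holds four entries; adding one more two-way imprecise fact doubles that number.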
Allocation • Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data • Size increase is linear in number of (completions of) imprecise facts • Queries operate over this extended version • Key contributions: • Appropriate characterization of the large space of allocation policies • Designing efficient allocation policies that take into account the correlations in the data
Storing Allocations using Extended Data Model • [Figure: facts p1 to p5 over Truck (F150, Sierra) × East (MA, NY), with imprecise facts allocated to the cells they may complete to]
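One way to picture the extended data model is as a table with one weighted row per (fact, completion) pair. The sketch below uses the simplest policy, uniform allocation; the fact tuples, the row layout, and the function names are assumptions made for illustration and are not taken from the slides.

```python
# Hypothetical facts: (fact id, candidate cells, Repair measure).
facts = [
    ("p1", [("F150", "MA"), ("F150", "NY"),
            ("Sierra", "MA"), ("Sierra", "NY")], 100.0),  # Truck, East
    ("p2", [("F150", "MA"), ("F150", "NY")], 200.0),       # F150, East
    ("p3", [("F150", "MA")], 50.0),                        # precise fact
]

def uniform_allocation(facts):
    """Build extended rows (fact_id, cell, weight, measure).

    Each fact spreads a total weight of 1 evenly over its candidate cells,
    so the number of rows is linear in the number of completions.
    """
    rows = []
    for fact_id, cells, measure in facts:
        weight = 1.0 / len(cells)
        for cell in cells:
            rows.append((fact_id, cell, weight, measure))
    return rows

def weighted_sum_and_count(rows, query_cells):
    """Answer SUM and COUNT directly over the extended rows."""
    total = sum(w * m for _, cell, w, m in rows if cell in query_cells)
    count = sum(w for _, cell, w, _ in rows if cell in query_cells)
    return total, count

rows = uniform_allocation(facts)
total, count = weighted_sum_and_count(rows, {("F150", "MA")})
```

Here three facts expand to seven rows, and the query on the single cell (F150, MA) is answered by one pass over those rows without enumerating any possible worlds.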
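Classifying Allocation Policies • Dimension correlation ignored, measure correlation ignored: Uniform • Dimension correlation used, measure correlation ignored: Count • Dimension correlation used, measure correlation used: EM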
Results on Query Semantics • Evaluating queries over extended version of data yields expected value of the aggregation operator over all possible worlds • intuitively, the correct value to compute • Efficient query evaluation algorithms for SUM, COUNT • consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions • Dynamic programming algorithm for AVERAGE • Unfortunately, consistency does not hold for AVERAGE
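The first result for SUM and COUNT can be read as an instance of linearity of expectation. As a sketch, the symbols below are introduced only for illustration: $p_{c,r}$ is the allocation weight of fact $r$ to cell $c$ (assumed to equal the probability that $r$ completes to $c$), $v_r$ is the measure value of $r$, and $Q$ is the set of cells covered by the query.

```latex
\begin{align*}
E[\mathrm{SUM}(Q)]
  &= \sum_{r} v_r \cdot \Pr[r \text{ completes to a cell in } Q]
   = \sum_{r} \sum_{c \in Q} p_{c,r}\, v_r, \\
E[\mathrm{COUNT}(Q)]
  &= \sum_{r} \Pr[r \text{ completes to a cell in } Q]
   = \sum_{r} \sum_{c \in Q} p_{c,r}.
\end{align*}
```

Both right-hand sides are plain weighted aggregates over the extended data, so neither requires enumerating possible worlds. AVERAGE is a ratio of two such quantities, and the expectation of a ratio does not decompose this way.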
Alternative Semantics for AVERAGE • APPROXIMATE AVERAGE • E[SUM] / E[COUNT] instead of E[SUM/COUNT] • simpler and more efficient • satisfies consistency • extends to aggregation operators for uncertain measures
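The gap between the two AVERAGE semantics shows up even on a two-fact example. The sketch below is hypothetical throughout (data, names, uniform and independent completions) and compares E[SUM/COUNT], computed by enumerating worlds, with E[SUM] / E[COUNT]. It assumes the query region holds at least one fact in every world, so the per-world average is always defined.

```python
from itertools import product

# Hypothetical facts: (fact id, candidate cells, Repair measure).
facts = [
    ("p1", [("F150", "NY")], 100.0),                    # precise
    ("p2", [("F150", "NY"), ("Sierra", "NY")], 200.0),  # Truck, NY
]

def worlds(facts):
    """Yield (chosen cells, probability), assuming uniform independent choices."""
    for choice in product(*(cells for _, cells, _ in facts)):
        prob = 1.0
        for _, cells, _ in facts:
            prob /= len(cells)
        yield choice, prob

def exact_and_approximate_average(facts, query_cells):
    """Return (E[SUM/COUNT], E[SUM] / E[COUNT]) for the given query cells."""
    e_sum = e_count = e_avg = 0.0
    for choice, prob in worlds(facts):
        inside = [m for (_, _, m), cell in zip(facts, choice)
                  if cell in query_cells]
        s, c = sum(inside), len(inside)
        e_sum += prob * s
        e_count += prob * c
        e_avg += prob * (s / c)  # assumes c > 0 in every world
    return e_avg, e_sum / e_count

exact, approx = exact_and_approximate_average(facts, {("F150", "NY")})
```

In this example the exact value averages 150 and 100 over two equally likely worlds, while the approximate value divides an expected sum of 200 by an expected count of 1.5, so the two semantics differ.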
Uncertainty • Measure value is modeled as a probability distribution function over some base domain • e.g., measure Brake is a pdf over values {Yes,No} • sources of uncertainty: measures extracted from text using classifiers • Adapt well-known concepts from statistics to derive appropriate aggregation operators • Our framework and solutions for dealing with imprecision also extend to uncertain measures
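To make the idea of aggregating pdf-valued measures concrete, the sketch below uses a linear opinion pool, that is, a weighted average of the input distributions. This is one well-known option from the statistics literature and is an assumption here; the slides do not say which operator is used. The function name and the Brake values are hypothetical.

```python
def linear_opinion_pool(pdfs, weights=None):
    """Combine discrete pdfs (dicts value -> probability) by weighted averaging.

    weights defaults to equal weights; they are expected to sum to 1 so that
    the pooled result is again a probability distribution.
    """
    if weights is None:
        weights = [1.0 / len(pdfs)] * len(pdfs)
    pooled = {}
    for pdf, w in zip(pdfs, weights):
        for value, p in pdf.items():
            pooled[value] = pooled.get(value, 0.0) + w * p
    return pooled

# Hypothetical Brake measures, e.g. classifier outputs for three repair notes.
brake = [
    {"Yes": 0.9, "No": 0.1},
    {"Yes": 0.6, "No": 0.4},
    {"Yes": 0.3, "No": 0.7},
]
pooled = linear_opinion_pool(brake)
```

Passing allocation weights (normalized to sum to 1) as `weights` is one way such an operator could be combined with imprecise facts.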
Summary • Consistency and faithfulness • desiderata for designing query semantics for imprecise data • Allocation is the key to our framework • Efficient algorithms for aggregation operators with appropriate guarantees of consistency and faithfulness • Iterative algorithms for allocation policies
Correlation-based Allocation • Involves defining an objective function to capture some underlying correlation structure • a more stringent requirement on the allocations • solving the resulting optimization problem yields the allocations • EM-based iterative allocation policy • interesting highlight: allocations are re-scaled iteratively by computing appropriate aggregations
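The iterative re-scaling can be sketched for a count-based variant. This is a simplified illustration and not the authors' algorithm: it does not model the objective function, the names are hypothetical, and it assumes each cell's current mass is the total weight allocated to it by all facts, precise and imprecise.

```python
from collections import defaultdict

def em_count_allocation(facts, iterations=20):
    """Iteratively re-scale allocations in proportion to cell mass.

    facts: dict fact_id -> list of candidate cells.
    Returns dict (fact_id, cell) -> weight, with weights per fact summing to 1.
    """
    # Start from the uniform allocation.
    alloc = {(f, c): 1.0 / len(cells)
             for f, cells in facts.items() for c in cells}
    for _ in range(iterations):
        # Aggregation step: total weight currently allocated to each cell.
        mass = defaultdict(float)
        for (_, cell), w in alloc.items():
            mass[cell] += w
        # Re-scaling step: each fact redistributes its unit weight over its
        # candidate cells in proportion to those cells' masses.
        new_alloc = {}
        for f, cells in facts.items():
            total = sum(mass[c] for c in cells)
            for c in cells:
                new_alloc[(f, c)] = mass[c] / total
        alloc = new_alloc
    return alloc

# Hypothetical data: three precise facts and one imprecise fact (p4).
facts = {
    "p1": [("F150", "NY")],
    "p2": [("F150", "NY")],
    "p3": [("Sierra", "NY")],
    "p4": [("F150", "NY"), ("Sierra", "NY")],
}
alloc = em_count_allocation(facts)
```

In this toy data set the imprecise fact p4 ends up weighted toward (F150, NY), the cell with more precise facts, instead of being split evenly as under uniform allocation.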