OLAP Over Uncertain and Imprecise Data

T.S. Jayram (IBM Almaden) with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM).

Presentation Transcript


  1. OLAP Over Uncertain and Imprecise Data T.S. Jayram (IBM Almaden) with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM)

  2. Dimensions in OLAP • Automobile: ALL, with children Truck (F150, Sierra) and Sedan (Civic, Camry) • Location: ALL, with children East (MA, NY) and West (CA, TX)

  3. Measures, Facts, and Queries [Figure: facts p1 to p8 placed in the cells of the Automobile × Location grid] • Example fact: Auto = F150, Loc = NY, Repair = $200 • Example query over a cell: Auto = Truck, Loc = East, SUM(Repair) = ?
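The query on this slide can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the function names, and every fact other than the F150/NY/$200 example are assumptions made for the sketch, not part of the talk.

```python
# Hypothetical encoding of the two dimension hierarchies from the slides.
CHILDREN = {
    "Automobile": {"ALL": ["Truck", "Sedan"],
                   "Truck": ["F150", "Sierra"],
                   "Sedan": ["Civic", "Camry"]},
    "Location": {"ALL": ["East", "West"],
                 "East": ["MA", "NY"],
                 "West": ["CA", "TX"]},
}

def leaves(dim, node):
    """Return the leaf-level values below `node` in dimension `dim`."""
    kids = CHILDREN[dim].get(node)
    if not kids:
        return {node}
    out = set()
    for kid in kids:
        out |= leaves(dim, kid)
    return out

def region(auto, loc):
    """All leaf cells covered by the (auto, loc) pair."""
    return {(a, l)
            for a in leaves("Automobile", auto)
            for l in leaves("Location", loc)}

# Precise facts. Only the first row comes from the slide; the others,
# and the fact ids, are made up for illustration.
FACTS = [
    {"id": "a", "auto": "F150", "loc": "NY", "repair": 200},
    {"id": "b", "auto": "Sierra", "loc": "MA", "repair": 150},
    {"id": "c", "auto": "Civic", "loc": "CA", "repair": 100},
]

def sum_repair(facts, auto, loc):
    """SUM(Repair) over all facts whose cell lies in the query region."""
    cells = region(auto, loc)
    return sum(f["repair"] for f in facts if (f["auto"], f["loc"]) in cells)

TRUCK_EAST_TOTAL = sum_repair(FACTS, "Truck", "East")
```

With the made-up facts above, the Truck/East query picks up the F150/NY and Sierra/MA facts and skips the Civic/CA one.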

  4. Extend the OLAP model to handle data ambiguity • Imprecision • Uncertainty

  5. Imprecision [Figure: facts p1 to p11 on the Automobile × Location grid] • Example imprecise fact: Auto = F150, Loc = East, Repair = $200

  6. Representing Imprecision using Dimension Hierarchies • Dimension hierarchies lead to a natural space of “partially specified” objects • Sources of imprecision: incomplete data, multiple sources of data
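One way to picture a "partially specified" object is as the set of leaf cells it could turn out to be. The sketch below assumes a flat parent-to-children dictionary and hypothetical helper names; the ALL nodes are omitted to keep it short.

```python
# Hypothetical flat encoding of both hierarchies (ALL nodes omitted).
HIERARCHY = {
    "Truck": ["F150", "Sierra"],
    "Sedan": ["Civic", "Camry"],
    "East": ["MA", "NY"],
    "West": ["CA", "TX"],
}

def leaf_values(value):
    """Expand a hierarchy value to the leaf values beneath it."""
    kids = HIERARCHY.get(value)
    if kids is None:
        return [value]  # already a leaf
    return [leaf for kid in kids for leaf in leaf_values(kid)]

def completions(auto, loc):
    """Leaf cells that a (possibly imprecise) fact could complete to."""
    return [(a, l) for a in leaf_values(auto) for l in leaf_values(loc)]

def is_imprecise(auto, loc):
    """A fact is imprecise when it has more than one possible completion."""
    return len(completions(auto, loc)) > 1

F150_EAST = completions("F150", "East")
```

Under these assumptions the imprecise fact from the Imprecision slide (Auto = F150, Loc = East) has two completions, one in MA and one in NY, while a fact recorded at leaf level has exactly one.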

  7. Motivating Example [Figure: facts p1 to p5 over Truck (F150, Sierra) × East (MA, NY)] • Query: COUNT • We propose desiderata that enable appropriate definition of query semantics for imprecise data

  8. Desideratum I: Consistency [Figure: facts p1 to p5 over Truck (F150, Sierra) × East (MA, NY)] • Consistency specifies the relationship between answers to related queries on a fixed data set

  9. Desideratum II: Faithfulness [Figure: Data Set 1, Data Set 2, and Data Set 3, each placing facts p1 to p5 over F150, Sierra × MA, NY] • Faithfulness specifies the relationship between answers to a fixed query on related data sets

  10. Formal definitions of both Consistency and Faithfulness depend on the underlying aggregation operator • Can we define query semantics that satisfy these desiderata?

  11. Query Semantics [Figure: a data set with facts p1 to p5 over F150, Sierra × MA, NY, and its possible worlds w1 to w4] • Possible Worlds [Kripke63,…]

  12. Possible Worlds Query Semantics • Given all possible worlds together with their probabilities, queries are easily answered (using expected values) • But the number of possible worlds is exponential in the number of imprecise facts!
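A brute-force sketch of this semantics makes the blow-up concrete. The facts, their completion probabilities, and the assumption that imprecise facts are completed independently of one another are all illustrative choices for this example, not taken from the talk.

```python
from itertools import product

# Hypothetical data: each fact maps to its possible completions
# (leaf cell -> probability). Precise facts have a single completion.
FACTS = {
    "p1": {("F150", "NY"): 1.0},
    "p2": {("Sierra", "MA"): 1.0},
    "p3": {("F150", "MA"): 0.5, ("F150", "NY"): 0.5},
    "p4": {("F150", "MA"): 0.5, ("Sierra", "MA"): 0.5},
}

def possible_worlds(facts):
    """Yield (world, probability); a world picks one completion per fact.

    Assumes the facts are completed independently, so a world's
    probability is the product of the chosen completion probabilities.
    """
    ids = list(facts)
    for choice in product(*(facts[i].items() for i in ids)):
        world = {fid: cell for fid, (cell, _) in zip(ids, choice)}
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

def expected_count(facts, query_cells):
    """Expected COUNT of facts falling in `query_cells` over all worlds."""
    return sum(
        prob * sum(1 for cell in world.values() if cell in query_cells)
        for world, prob in possible_worlds(facts)
    )

NUM_WORLDS = sum(1 for _ in possible_worlds(FACTS))
F150_EAST_COUNT = expected_count(FACTS, {("F150", "MA"), ("F150", "NY")})
```

Two imprecise facts with two completions each already give four worlds; every further imprecise fact multiplies that number by its completion count, which is why direct enumeration does not scale.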

  13. Allocation • Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data • Size increase is linear in number of (completions of) imprecise facts • Queries operate over this extended version • Key contributions: • Appropriate characterization of the large space of allocation policies • Designing efficient allocation policies that take into account the correlations in the data
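As a rough illustration of allocation, the sketch below builds an extended data set using the simplest policy, uniform weights over a fact's completions. The facts, the row layout `(fact_id, cell, weight)`, and the function name are assumptions for this example.

```python
# Hypothetical facts: id -> list of leaf cells the fact could complete to.
FACTS = {
    "p1": [("F150", "NY")],                       # precise
    "p2": [("Sierra", "MA")],                     # precise
    "p3": [("F150", "MA"), ("F150", "NY")],       # Auto = F150, Loc = East
    "p4": [("F150", "MA"), ("F150", "NY"),
           ("Sierra", "MA"), ("Sierra", "NY")],   # Auto = Truck, Loc = East
}

def uniform_allocation(facts):
    """Return extended rows (fact_id, cell, weight).

    Each fact's weights sum to 1; under the uniform policy every
    completion of a fact receives the same weight.
    """
    rows = []
    for fact_id, cells in facts.items():
        weight = 1.0 / len(cells)
        rows.extend((fact_id, cell, weight) for cell in cells)
    return rows

EXTENDED = uniform_allocation(FACTS)
```

The extended data set has one row per (fact, completion) pair, eight rows here, so its size grows with the number of completions rather than with the number of possible worlds.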

  14. Storing Allocations using Extended Data Model [Figure: facts p1 to p5 on a grid with Automobile values F150, Sierra, Truck and Location values MA, NY, East]

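  15. Classifying Allocation Policies (by whether dimension correlation and measure correlation are ignored or used) • Uniform: dimension correlation ignored, measure correlation ignored • Count: dimension correlation used, measure correlation ignored • EM: dimension correlation used, measure correlation used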

  16. Results on Query Semantics • Evaluating queries over extended version of data yields expected value of the aggregation operator over all possible worlds • intuitively, the correct value to compute • Efficient query evaluation algorithms for SUM, COUNT • consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions • Dynamic programming algorithm for AVERAGE • Unfortunately, consistency does not hold for AVERAGE
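Evaluating SUM and COUNT over the extended data can be sketched as weighted aggregation. The rows below, including weights and Repair values, are invented for illustration, and the functions are a minimal sketch rather than the algorithms from the talk.

```python
# Hypothetical extended rows: (fact_id, cell, allocation weight, Repair).
EXTENDED = [
    ("p1", ("F150", "NY"), 1.0, 200.0),
    ("p2", ("Sierra", "MA"), 1.0, 100.0),
    ("p3", ("F150", "MA"), 0.5, 300.0),
    ("p3", ("F150", "NY"), 0.5, 300.0),
]

def weighted_sum(rows, query_cells):
    """SUM: each row contributes weight * measure if its cell matches."""
    return sum(w * m for _, cell, w, m in rows if cell in query_cells)

def weighted_count(rows, query_cells):
    """COUNT: each row contributes its allocation weight."""
    return sum(w for _, cell, w, _ in rows if cell in query_cells)

F150_MA = {("F150", "MA")}
F150_NY = {("F150", "NY")}
F150_EAST = F150_MA | F150_NY

SUM_EAST = weighted_sum(EXTENDED, F150_EAST)
COUNT_EAST = weighted_count(EXTENDED, F150_EAST)
```

In this toy data the answer for F150/East equals the sum of the answers for F150/MA and F150/NY, which is the kind of relationship between related queries that the consistency desideratum is concerned with.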

  17. Alternative Semantics for AVERAGE • APPROXIMATE AVERAGE • E[SUM] / E[COUNT] instead of E[SUM/COUNT] • simpler and more efficient • satisfies consistency • extends to aggregation operators for uncertain measures
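The difference between E[SUM] / E[COUNT] and E[SUM/COUNT] shows up even on a two-fact example. The sketch below uses made-up facts and computes the exact expectation by brute-force enumeration of possible worlds (not the dynamic programming algorithm mentioned in the talk), assuming facts are completed independently and that every world has at least one fact in the query cell.

```python
from itertools import product

# Hypothetical facts: (Repair value, {leaf cell: completion probability}).
FACTS = [
    (200.0, {("F150", "NY"): 1.0}),
    (500.0, {("F150", "NY"): 0.5, ("F150", "MA"): 0.5}),
]
QUERY = {("F150", "NY")}

def approximate_average(facts, query):
    """E[SUM] / E[COUNT], computed directly from completion weights."""
    e_sum = sum(m * p for m, comps in facts
                for cell, p in comps.items() if cell in query)
    e_count = sum(p for _, comps in facts
                  for cell, p in comps.items() if cell in query)
    return e_sum / e_count

def exact_expected_average(facts, query):
    """E[SUM/COUNT] by enumerating every possible world.

    Assumes independent completions and a non-empty query result in
    every world (true for the data above).
    """
    total = 0.0
    for choice in product(*(comps.items() for _, comps in facts)):
        prob = 1.0
        values = []
        for (measure, _), (cell, p) in zip(facts, choice):
            prob *= p
            if cell in query:
                values.append(measure)
        total += prob * (sum(values) / len(values))
    return total

APPROX = approximate_average(FACTS, QUERY)
EXACT = exact_expected_average(FACTS, QUERY)
```

Here the approximate average is 450 / 1.5 = 300, while the exact expectation is 0.5 × 350 + 0.5 × 200 = 275, so the two semantics give different answers; the approximate form needs only two weighted aggregates instead of a pass over all worlds.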

  18. Uncertainty • Measure value is modeled as a probability distribution function over some base domain • e.g., measure Brake is a pdf over values {Yes,No} • sources of uncertainty: measures extracted from text using classifiers • Adapt well-known concepts from statistics to derive appropriate aggregation operators • Our framework and solutions for dealing with imprecision also extend to uncertain measures
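The slide does not name the aggregation operator it adapts from statistics. As one well-known possibility, the sketch below combines uncertain measures with a linear opinion pool, that is, a weighted average of the individual distributions. The choice of operator, the function name, and the example values are assumptions for illustration only.

```python
def linear_opinion_pool(pdfs, weights=None):
    """Combine discrete pdfs (dicts: value -> probability) by weighted average.

    `weights` defaults to equal weights; they are normalized to sum to 1.
    This operator is an assumed example, not confirmed by the slides.
    """
    if weights is None:
        weights = [1.0] * len(pdfs)
    total = float(sum(weights))
    pooled = {}
    for pdf, weight in zip(pdfs, weights):
        for value, prob in pdf.items():
            pooled[value] = pooled.get(value, 0.0) + (weight / total) * prob
    return pooled

# Hypothetical Brake measures for two facts, e.g. produced by a text classifier.
BRAKE_PDFS = [
    {"Yes": 0.8, "No": 0.2},
    {"Yes": 0.4, "No": 0.6},
]
POOLED = linear_opinion_pool(BRAKE_PDFS)
```

The result is again a pdf over {Yes, No}, so it can be stored and aggregated further like any other uncertain measure; allocation weights from the imprecision framework could be passed in as `weights`.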

  19. Summary • Consistency and faithfulness • desiderata for designing query semantics for imprecise data • Allocation is the key to our framework • Efficient algorithms for aggregation operators with appropriate guarantees of consistency and faithfulness • Iterative algorithms for allocation policies

  20. Correlation-based Allocation • Involves defining an objective function to capture some underlying correlation structure • a more stringent requirement on the allocations • solving the resulting optimization problem yields the allocations • EM-based iterative allocation policy • interesting highlight: allocations are re-scaled iteratively by computing appropriate aggregations
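The iterative re-scaling idea can be illustrated with a count-based scheme: each imprecise fact is allocated to its candidate cells in proportion to the current cell counts, and the counts are then recomputed by aggregating precise facts plus allocations. The update rule, the function name, and the data are assumptions made for this sketch; the slides do not give the objective function or the exact update.

```python
def em_count_allocation(precise_counts, imprecise, iterations=20):
    """Iteratively re-scale allocations using per-cell counts.

    precise_counts: leaf cell -> number of precise facts in that cell.
    imprecise: fact id -> list of candidate leaf cells.
    Returns fact id -> {cell: weight}, with each fact's weights summing to 1.
    """
    delta = {cell: float(n) for cell, n in precise_counts.items()}
    for cells in imprecise.values():
        for cell in cells:
            delta.setdefault(cell, 0.0)
    alloc = {}
    for _ in range(iterations):
        # Re-scale: allocate each fact in proportion to current cell counts.
        alloc = {}
        for fact_id, cells in imprecise.items():
            total = sum(delta[c] for c in cells)
            if total > 0:
                alloc[fact_id] = {c: delta[c] / total for c in cells}
            else:  # no evidence yet: fall back to uniform weights
                alloc[fact_id] = {c: 1.0 / len(cells) for c in cells}
        # Re-aggregate: counts = precise facts + allocated fractions.
        delta = {c: float(precise_counts.get(c, 0)) for c in delta}
        for weights in alloc.values():
            for cell, weight in weights.items():
                delta[cell] += weight
    return alloc

# Hypothetical input: three precise F150/MA facts, one precise F150/NY fact,
# and one imprecise fact recorded as Auto = F150, Loc = East.
PRECISE = {("F150", "MA"): 3, ("F150", "NY"): 1}
IMPRECISE = {"p3": [("F150", "MA"), ("F150", "NY")]}
ALLOC = em_count_allocation(PRECISE, IMPRECISE)
```

With this input the imprecise fact settles at weight 0.75 for MA and 0.25 for NY, mirroring the 3:1 ratio of precise facts, which shows how an allocation can follow correlations already present in the data.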
