This paper discusses efficient allocation algorithms for OLAP queries over imprecise data. It proposes a template for allocation policies and presents an operational framework for allocation. The paper also introduces an allocation graph formalism and provides an Extended Database maintenance algorithm. The proposed algorithms are evaluated experimentally.
Efficient Allocation Algorithms For OLAP Over Imprecise Data • Doug Burdick (University of Wisconsin – Madison) • Prasad Deshpande (IBM India Research Lab, SIRC) • T.S. Jayram (IBM Almaden Research Center) • Raghu Ramakrishnan (Yahoo! Research) • Shivakumar Vaithyanathan (IBM Almaden Research Center)
Imprecise Data • Multidimensional data over two hierarchical dimensions • AUTOMOBILE levels: 1 Model (Civic, Camry, F150, Sierra), 2 Category (Sedan, Truck), 3 ALL • LOCATION levels: 1 State (MA, NY, TX, CA), 2 Region (East, West), 3 ALL • [Figure: facts p1 to p5 plotted on the LOCATION × AUTOMOBILE grid] • [BDJ+05] Burdick et al. OLAP Over Uncertain and Imprecise Data. In VLDB 2005
Sources of Imprecision • Dimensions extracted from free text • Assume given extractor for Auto dimension values
Sources of Imprecision • More details on dimensions extracted from text: [BDJ+06] Burdick et al. OLAP Over Uncertain and Imprecise Data. To appear in VLDB Journal
Sources of Imprecision • Data Integration • Fact table constructed by integrating multiple data sources • Different sources record same dimension attribute at different granularities • [Figure: AUTOMOBILE hierarchy (1 Model, 2 Category, 3 ALL) annotated with two sources, Mailing List and Call Center]
Imprecision In Real Data • Obtained real-world dataset from auto manufacturer • Fact table entries from several source relations • Integrated fact table contained 798,570 facts • Real data has many imprecise facts
Querying Imprecise Facts • Query: Auto = F150, Loc = MA, SUM(Repair) = ??? • [Figure: facts p1 to p5 on the grid of Truck models (F150, Sierra) and East states (MA, NY)]
Solution: Allocation • Intuitively: Replace each imprecise fact r with set of precise facts, one for each possible completion of r • Each completion is assigned an allocation weight • Refer to the resulting fact table as the Extended Database (EDB) • Queries operate over this Extended Database
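The replacement of an imprecise fact by its possible completions can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hierarchy dictionaries, the `extend_fact` helper, the measure values, and the uniform default weights are all assumptions made for the example. An allocation policy would supply the real weights.

```python
from itertools import product

# Hypothetical hierarchies based on the running example (parent -> children).
AUTO = {"ALL": ["Sedan", "Truck"], "Sedan": ["Civic", "Camry"], "Truck": ["F150", "Sierra"]}
LOCATION = {"ALL": ["East", "West"], "East": ["MA", "NY"], "West": ["TX", "CA"]}


def leaves(hierarchy, value):
    """Return the leaf-level values covered by `value` in one dimension."""
    children = hierarchy.get(value)
    if children is None:          # already a leaf (a State or a Model)
        return [value]
    out = []
    for child in children:
        out.extend(leaves(hierarchy, child))
    return out


def completions(loc, auto):
    """All precise cells (state, model) that a fact <loc, auto> could complete to."""
    return list(product(leaves(LOCATION, loc), leaves(AUTO, auto)))


def extend_fact(fact_id, loc, auto, measure, weight_of=None):
    """Emit one EDB row per completion: (id, state, model, measure, weight).

    ASSUMPTION: with no policy given, weights are uniform. This is only a
    placeholder so the sketch runs; it is not the policy from the slides.
    """
    cells = completions(loc, auto)
    rows = []
    for state, model in cells:
        w = weight_of((state, model)) if weight_of else 1.0 / len(cells)
        rows.append((fact_id, state, model, measure, w))
    return rows


imprecise_rows = extend_fact("p5", "MA", "Truck", 100.0)  # two completions
precise_rows = extend_fact("p3", "MA", "F150", 80.0)      # one completion, weight 1
```

A precise fact passes through unchanged with weight 1, so the same routine can build the whole Extended Database.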
Handle Imprecision With Allocation • [Figure: the same grid after allocation; p5 now appears twice]
Querying The Extended Database • Query: Auto = F150, Loc = MA, SUM(Repair) = ??? • [Figure: the grid after allocation, with p5 appearing twice]
Querying The Extended Database • Query: Auto = F150, Loc = MA, SUM(Repair) = 150 • Procedure for assigning allocation weights is referred to as an allocation policy • [Figure: the grid after allocation, with p5 appearing twice]
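One plausible way to answer a SUM query over the Extended Database is to weight each row's measure by its allocation weight. The sketch below assumes that reading. The rows, measures, and weights are hypothetical and are not the data behind the 150 on the slide.

```python
def weighted_sum(edb, loc, auto):
    """SUM(measure) for one cell, counting each EDB row by its allocation weight.

    ASSUMPTION: the query semantics are a weight-scaled sum; rows are
    (fact_id, state, model, measure, weight) tuples.
    """
    return sum(
        measure * weight
        for _fact_id, state, model, measure, weight in edb
        if state == loc and model == auto
    )


# Hypothetical Extended Database: p5 was imprecise and is split across two cells.
edb = [
    ("p3", "MA", "F150", 100.0, 1.0),
    ("p4", "MA", "Sierra", 80.0, 1.0),
    ("p5", "MA", "F150", 90.0, 0.5),
    ("p5", "MA", "Sierra", 90.0, 0.5),
]

f150_ma_total = weighted_sum(edb, "MA", "F150")  # 100*1.0 + 90*0.5
```

Because the weights of each fact sum to 1, the fact's measure is counted exactly once across all cells.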
Contributions • Propose generalized template for allocation policies presented in [BDJ+05] • Present operational framework for allocation • Allocation graph formalism • Used to derive Independent, Block, Transitive Algorithms • Propose Extended Database Maintenance Algorithm • Update EDB to reflect changes to given fact table • Experimental Evaluation
Allocation Policy Template • [Figure: imprecise fact r and the cells c1 and c2 it overlaps, on the grid of Truck models (F150, Sierra) and East states (MA, NY)]
Interactions between overlapping facts • Allocation weights for imprecise fact p6 depend on allocation weights for fact p7 (and vice-versa) • Would like assigned weights to capture these interactions • Idea: Repeatedly allocate p6 and p7 until allocation weights converge • [Figure: grid of Truck models (F150, Sierra) and East states (MA, NY) showing the overlapping imprecise facts p6 and p7 among the other facts]
Iterative Allocation Policies
1) Initialize each Q0(c) in cell c (using precise facts)
2) For each iteration t, until all Qt(c) have converged:
   • For each imprecise fact r
   • For each cell c
      • For each imprecise fact r overlapping c
3) For each imprecise fact r
   • For each cell c in region(r)
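The loop structure above can be sketched in code. The update rules used here are an assumption, since the slide's formulas are not in the text: each fact's normalizer is the sum of Q over its region, each cell's new Q is its initial value plus the fractional share it receives from every overlapping fact, and the final weight is Q(c) divided by the normalizer. The function name and data layout are hypothetical.

```python
def iterative_allocation(q0, regions, max_iters=100, tol=1e-9):
    """Return {(fact, cell): weight} for every imprecise fact.

    q0      : {cell: initial quantity from precise facts}   (step 1)
    regions : {fact: [cells in region(fact)]}
    ASSUMPTION: count-style update rules; see the lead-in.
    """
    base = dict(q0)
    for cells in regions.values():
        for c in cells:
            base.setdefault(c, 0.0)   # cells with no precise facts start at 0
    q = dict(base)

    for _ in range(max_iters):        # step 2
        # For each imprecise fact r: normalizer over region(r).
        gamma = {r: sum(q[c] for c in cells) for r, cells in regions.items()}
        # For each cell c: initial value plus shares from overlapping facts.
        new_q = dict(base)
        for r, cells in regions.items():
            if gamma[r] == 0:
                continue
            for c in cells:
                new_q[c] += q[c] / gamma[r]
        change = max((abs(new_q[c] - q[c]) for c in q), default=0.0)
        q = new_q
        if change < tol:
            break

    weights = {}                      # step 3
    for r, cells in regions.items():
        total = sum(q[c] for c in cells)
        for c in cells:
            # Fall back to uniform weights when the region carries no mass.
            weights[(r, c)] = q[c] / total if total else 1.0 / len(cells)
    return weights


weights = iterative_allocation(
    {("MA", "F150"): 2.0, ("MA", "Sierra"): 1.0},
    {"p5": [("MA", "F150"), ("MA", "Sierra")]},
)
```

With a single imprecise fact the ratio between the two cells never changes, so the loop settles after two passes; overlapping facts are where repeated passes matter.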
Benefits of Iterative Allocation • Imprecise facts can be allocated in any order and the same allocation weights are obtained • Leverage this idea to obtain scalable allocation algorithms • Leads to Expectation Maximization (EM) framework for allocation • Final allocation weights have pleasing mathematical properties • See [BDJ+05] for details
Allocation Graph • Precise Cells: Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra) • Imprecise Facts: <MA,Truck> • [Figure: the fact grid for Truck models (F150, Sierra) and East states (MA, NY) shown next to the corresponding allocation graph]
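A small sketch of how such a graph could be built, assuming an edge joins an imprecise fact to every precise cell that lies inside its region. The `PARENT` map, helper names, and adjacency-list layout are illustrative choices, not taken from the slides.

```python
# Hypothetical child -> parent map covering both dimensions of the example.
PARENT = {
    "Civic": "Sedan", "Camry": "Sedan", "F150": "Truck", "Sierra": "Truck",
    "Sedan": "ALL", "Truck": "ALL",
    "MA": "East", "NY": "East", "TX": "West", "CA": "West",
    "East": "ALL", "West": "ALL",
}


def is_ancestor_or_self(general, specific):
    """True if `general` equals `specific` or lies above it in the hierarchy."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


def covers(region, cell):
    """True if the cell falls inside the region in every dimension."""
    return all(is_ancestor_or_self(g, s) for g, s in zip(region, cell))


def build_allocation_graph(precise_cells, imprecise_facts):
    """Return {fact_id: [cells it overlaps]} as an adjacency list."""
    return {
        fact_id: [c for c in precise_cells if covers(region, c)]
        for fact_id, region in imprecise_facts.items()
    }


precise_cells = [("NY", "F150"), ("NY", "Sierra"), ("MA", "F150"), ("MA", "Sierra")]
graph = build_allocation_graph(precise_cells, {"p5": ("MA", "Truck")})
```

Here the fact <MA,Truck> is linked only to the two MA cells, matching the node lists on the slide.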
Processing With Allocation Graph • Initialize each Q0(c) in cell c • Precise Cells: Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra) • Imprecise Facts: <MA,Truck> • [Figure: allocation graph annotated with 2 at Cell(MA,F150), 1 at Cell(MA,Sierra), 3 at <MA,Truck>, and edge weights 2/3 and 1/3]
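The edge weights in the figure are consistent with the following arithmetic, under the assumption that the 2 and 1 are the initial quantities Q0 of the two MA cells and the 3 is their sum, stored at the fact node. The symbol p_{c,r} for the allocation weight of fact r to cell c is introduced here for illustration.

```latex
\[
p_{c,r} \;=\; \frac{Q^{0}(c)}{\sum_{c' \in \mathrm{region}(r)} Q^{0}(c')},
\qquad
p_{(\mathrm{MA},\mathrm{F150}),\,r} = \frac{2}{2+1} = \frac{2}{3},
\qquad
p_{(\mathrm{MA},\mathrm{Sierra}),\,r} = \frac{1}{2+1} = \frac{1}{3}.
\]
```

The two weights sum to 1, so the fact <MA,Truck> is fully distributed across its region.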
Efficient Allocation Algorithms • Independent Algorithm • Requires multiple sorts of precise cells for each iteration • Optimizations based on re-using each sort as much as possible • Block Algorithm • Reduces the number of required sorts for precise cells to 1 • Optimizations based on increasing buffer utilization
Summary tables (imprecise facts grouped by level) • S1 <State,Category>: p6 <MA,Sedan>, p7 <MA,Truck> • S2 <State,ALL>: p8 <CA,ALL> • S3 <Region,Category>: p9 <East,Truck>, p10 <West,Sedan> • S4 <ALL,Model>: p11 <ALL,Civic>, p12 <ALL,Sierra> • S5 <Region,Model>: p13 <West,Civic>, p14 <West,Sierra> • Precise facts: p1 <MA,Civic>, p2 <MA,Sierra>, p3 <NY,F150>, p4 <CA,Civic>, p5 <CA,Sierra>
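The grouping on this slide can be reproduced by keying each imprecise fact on the hierarchy level of its two attributes. The sketch below uses the facts p6 to p14 from the slide; the level lookup tables and function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical lookup tables: dimension value -> hierarchy level name.
LOCATION_LEVEL = {
    "MA": "State", "NY": "State", "TX": "State", "CA": "State",
    "East": "Region", "West": "Region", "ALL": "ALL",
}
AUTO_LEVEL = {
    "Civic": "Model", "Camry": "Model", "F150": "Model", "Sierra": "Model",
    "Sedan": "Category", "Truck": "Category", "ALL": "ALL",
}


def level_vector(fact):
    """Map a fact <location, automobile> to its pair of hierarchy levels."""
    loc, auto = fact
    return (LOCATION_LEVEL[loc], AUTO_LEVEL[auto])


def group_by_summary_table(facts):
    """Group fact ids by level vector; each group is one summary table."""
    groups = defaultdict(list)
    for fact_id, fact in facts.items():
        groups[level_vector(fact)].append(fact_id)
    return dict(groups)


imprecise_facts = {
    "p6": ("MA", "Sedan"), "p7": ("MA", "Truck"), "p8": ("CA", "ALL"),
    "p9": ("East", "Truck"), "p10": ("West", "Sedan"),
    "p11": ("ALL", "Civic"), "p12": ("ALL", "Sierra"),
    "p13": ("West", "Civic"), "p14": ("West", "Sierra"),
}
summary_tables = group_by_summary_table(imprecise_facts)
```

Each of the five resulting groups corresponds to one of S1 to S5.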
Iteration-Aware Allocation • Optimizations for Independent and Block reduce work for single iteration • Problem: Each iteration of allocation is still expensive • Involves multiple scans of entire fact table • Not feasible for real data warehouses! • Can we do better?
Required Data For Allocating A Fact • Imprecise facts: p6 <MA,Sedan>, p7 <MA,Truck>, p8 <CA,ALL>, p9 <East,Truck>, p10 <West,Sedan>, p11 <ALL,Civic>, p12 <ALL,Sierra>, p13 <West,Civic>, p14 <West,Sierra> • Cells: c1 <MA,Civic>, c2 <MA,Sierra>, c3 <NY,F150>, c4 <CA,Civic>, c5 <CA,Sierra> • [Figure: allocation graph connecting these facts and cells]
Required Data For Allocating A Fact • Connected components in allocation graph can be processed independently • [Figure: the allocation graph over facts p6 to p14 and cells c1 to c5, redrawn as separate connected components]
Transitive Algorithm • Transitive Algorithm has two steps: • 1) Connected component identification step • 2) Process each connected component • Read component into memory • Perform all iterations of allocation for facts in component • If each component fits into memory, the I/O required by Transitive is independent of the number of iterations! • Components larger than the buffer are processed using the Block algorithm • In real datasets, all components were memory resident • Use concepts from Transitive Algorithm to develop EDB Maintenance Algorithm
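The component identification step can be illustrated with a union-find pass over the edges of the allocation graph. This is an in-memory sketch only: the slides do not say how the step is carried out on disk-resident data, and the graph below is a made-up example rather than the one in the figures.

```python
def connected_components(graph):
    """Split an allocation graph {fact: [cells]} into connected components.

    Nodes are tagged ("fact", id) or ("cell", id) so the two kinds cannot clash.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for fact, cells in graph.items():
        find(("fact", fact))                # register facts with no edges too
        for cell in cells:
            union(("fact", fact), ("cell", cell))

    components = {}
    for node in list(parent):
        components.setdefault(find(node), []).append(node)
    return list(components.values())


# Hypothetical graph: r1 and r2 share cell c2, r3 stands alone with c3.
example_graph = {"r1": ["c1", "c2"], "r2": ["c2"], "r3": ["c3"]}
components = connected_components(example_graph)
```

Each returned component could then be loaded on its own and iterated to convergence without touching the rest of the fact table, which is the property the slide highlights.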
Experimental Setup • Algorithms evaluated on several datasets • Real-world dataset: 798K facts, 4 dimensions • Used several synthetic datasets • Vary level of imprecision in the data • Percentage of imprecise facts • Severity of imprecision • Scalability (up to 5 million tuples) • Important parameter: Ratio of input table size to available memory • Memory limited to restricted buffer pool
Experiment 1a: Memory Resident Real Dataset
Experiment: Memory Resident (2) Synthetic Dataset (more imprecision)
Conclusions • Imprecision is a compelling real-world problem • Propose allocation as a solution • Allocation graph formalism • Basis for 3 scalable allocation algorithms • Independent, Block, Transitive • Transitive algorithm is quite intriguing • Performance is stable as the number of iterations increases • Connected components the algorithm identifies can be used in the proposed EDB maintenance algorithm