Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure Iosif Lazaridis, Sharad Mehrotra University of California, Irvine SIGMOD 2001, Santa Barbara
Talk Outline • Aggregate Queries • Motivation for Approximate Answering • Multi-Resolution Aggregate Tree (MRA-Tree) • Progressive Algorithm with Error Bounds • Experimental Evaluation • Summary and Future Work
Aggregate Queries • [figure: data points in region S; the query region Q contains the values {2, 7, 6}] • min_Q = 2, max_Q = 7, count_Q = 3, sum_Q = 2+7+6 = 15, avg_Q = 15/3 = 5
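As a concrete illustration (not part of the original slides), the five aggregates over the points inside a query rectangle Q can be computed exactly as below. The coordinates are hypothetical, chosen so that the values inside Q are {2, 7, 6} as on the slide:

```python
# Illustrative sketch: exact computation of the five aggregate functions
# over the data points that fall inside a query rectangle Q.
def in_rect(p, q):
    """True if 2-D point p lies in rectangle q = ((xlo, ylo), (xhi, yhi))."""
    (xlo, ylo), (xhi, yhi) = q
    return xlo <= p[0] <= xhi and ylo <= p[1] <= yhi

def aggregates(points, values, q):
    """Exact aggregates of the values whose points fall inside q."""
    vals = [v for p, v in zip(points, values) if in_rect(p, q)]
    return {"min": min(vals), "max": max(vals), "count": len(vals),
            "sum": sum(vals), "avg": sum(vals) / len(vals)}

# Hypothetical coordinates: the first three points lie inside Q.
points = [(0, 0), (1, 1), (2, 2), (9, 9), (9, 0), (0, 9)]
values = [2, 7, 6, 9, 3, 8]
q = ((0, 0), (3, 3))
print(aggregates(points, values, q))
# {'min': 2, 'max': 7, 'count': 3, 'sum': 15, 'avg': 5.0}
```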
Evaluating Aggregate Queries • Exact answering • Scan all points of D, checking each against Q • Retrieve the points in Q via a multi-dimensional index on D • Both the linear scan and the index scan can be very expensive • Approximate answering • Many applications (selectivity estimation, data analysis, visualization) do not require exact answers
Motivating Examples • Boss: "How many tanks are within 10 miles of me?" • "My boss needs to see the income aggregates in 10 minutes!"
Techniques for Approximate Aggregate Queries • Online estimation (Interactive) • Sampling • Offline estimation (Data Synopsis) • Sampling, Histograms, Wavelets • Our Technique: • Online estimator via a scan of a modified multi-dimensional index (MRA-Tree) • Allows incremental tradeoff of accuracy for response time, with guaranteed error bounds
Multi-Resolution Aggregate Tree (MRA-Tree) • An MRA-Tree can be instantiated with any of the popular multi-dimensional index trees (R-tree, quadtree, Hybrid tree, etc.) • A non-leaf node contains, for each of its subtrees, the four aggregates {MIN, MAX, COUNT, SUM} • A leaf node contains the actual data points • Tree operations are identical to those of the plain (non-MRA) tree, with the consideration that the aggregates must be maintained
MRA-Tree Example • [figure: a non-leaf node stores {min, max, count, sum} for each of its children; its leaf children hold the actual data points, e.g. {4, 5, 2, 6}, {3, 2, 4, 1}, {9, 9, 4, 6}]
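A minimal sketch of this node layout, with hypothetical `Leaf`/`NonLeaf` classes (the names and representation are illustrative, not the paper's code); the per-child aggregate entries are derived from the leaf values when a child is attached:

```python
# Hypothetical MRA-Tree node layout: a non-leaf node keeps, for each child
# subtree, the four aggregates {MIN, MAX, COUNT, SUM}; leaves keep raw values.
from dataclasses import dataclass, field

@dataclass
class Leaf:
    values: list  # the actual data values stored at this leaf

@dataclass
class NonLeaf:
    children: list = field(default_factory=list)    # Leaf or NonLeaf nodes
    child_aggs: list = field(default_factory=list)  # per-child aggregate dicts

    def add_child(self, child):
        vals = collect(child)
        self.children.append(child)
        self.child_aggs.append({"min": min(vals), "max": max(vals),
                                "count": len(vals), "sum": sum(vals)})

def collect(node):
    """All values in a subtree (used only when building the aggregates)."""
    if isinstance(node, Leaf):
        return node.values
    return [v for c in node.children for v in collect(c)]

root = NonLeaf()
root.add_child(Leaf([4, 5, 2, 6]))
root.add_child(Leaf([3, 2, 4, 1]))
print(root.child_aggs[0])  # {'min': 2, 'max': 6, 'count': 4, 'sum': 17}
```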
Progressive Algorithm Outline • We want • The best answer for a given time • The shortest time for a given precision of the answer • To refine an answer at will, trading time for precision • How we achieve it • Do a prioritized traversal of the nodes of the MRA-Tree • Maintain an estimate of the answer, E(agg_Q) • Maintain a 100%-confidence interval I = [L, H] such that L ≤ agg_Q ≤ H
Generic Algorithm (1) • Two sets of nodes: • N_P (partial contribution to the query) • N_C (complete contribution) • [figure: a node N may contain Q, be contained in Q, partially overlap Q, or be disjoint from Q]
Generic Algorithm (2) • Initialize N_P with the root • At each iteration: remove one node N from N_P and, for each N_child of its children • discard, if N_child is disjoint with Q • insert into N_P, if Q is contained in N_child or partially overlaps it • "insert" into N_C, if Q contains N_child (we only need to maintain agg_{N_C})
Generic Algorithm (3) • To instantiate the algorithm for {MIN, MAX, COUNT, SUM, AVG}, we need: • Error bounds: an interval I = [L, H] such that L ≤ agg_Q ≤ H • Traversal policy: which node from N_P to explore next? Minimize |I| • Estimation: provide an estimate of the answer, E(agg_Q)
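The generic algorithm, instantiated for COUNT, can be sketched as a prioritized traversal that expands the largest partial node first and yields guaranteed bounds [L, H] after every step. This is an illustrative sketch with an assumed node representation (dicts carrying a precomputed `overlap` relation to Q), not the paper's code; the example tree is chosen so the first interval matches the slide's I = [15, 33] with N_C counts {9, 6} and N_P counts {8, 10}:

```python
# Progressive COUNT sketch: prioritized traversal of an (assumed) MRA-Tree.
import heapq

def progressive_count(root):
    """Yield (L, H) with L <= count_Q <= H after each refinement step.
    Assumes the root partially overlaps Q."""
    n_c = 0                             # COUNT contributed with certainty
    heap = [(-root["count"], 0, root)]  # N_P as a max-heap on count
    tiebreak = 1                        # avoids comparing dicts on equal counts
    while heap:
        _, _, node = heapq.heappop(heap)
        for child in node["children"]:
            if child["overlap"] == "none":
                continue                       # disjoint with Q: discard
            if child["overlap"] == "full":
                n_c += child["count"]          # Q contains child: use aggregate
            elif child["children"]:
                heapq.heappush(heap, (-child["count"], tiebreak, child))
                tiebreak += 1                  # partial non-leaf: refine later
            else:
                n_c += child["in_q"]           # partial leaf: count points in Q
        yield n_c, n_c + sum(-c for c, _, _ in heap)  # bounds [L, H]

A = {"count": 9, "overlap": "full", "children": []}
B = {"count": 6, "overlap": "full", "children": []}
C = {"count": 8, "overlap": "partial", "children": [
    {"count": 3, "overlap": "partial", "in_q": 2, "children": []},
    {"count": 5, "overlap": "none", "children": []}]}
D = {"count": 10, "overlap": "partial", "children": [
    {"count": 4, "overlap": "full", "children": []},
    {"count": 6, "overlap": "none", "children": []}]}
root = {"count": 33, "overlap": "partial", "children": [A, B, C, D]}
print(list(progressive_count(root)))  # [(15, 33), (19, 27), (21, 21)]
```

The interval shrinks monotonically and closes to the exact answer once N_P is exhausted, which is the progressive time-for-precision tradeoff described above.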
MIN (and MAX) • Traversal: choose the N ∈ N_P with min_N = min_{N_P}, i.e., the partial node with the smallest min • Interval: min_{N_C} = min{4, 5} = 4, min_{N_P} = min{3, 9} = 3; L = min{min_{N_C}, min_{N_P}} = 3, H = min_{N_C} = 4; hence I = [3, 4] • Estimate: the lower bound, E(min_Q) = L = 3
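The slide's MIN interval can be reproduced in a few lines (illustrative only):

```python
# MIN bounds from the slide's example nodes.
nc_mins = [4, 5]  # minima of nodes fully contained in Q
np_mins = [3, 9]  # minima of nodes partially overlapping Q
L = min(min(nc_mins), min(np_mins))  # a partial node *may* contribute its min
H = min(nc_mins)                     # a contained node's min is certainly in Q
print([L, H])  # [3, 4]
```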
COUNT (and SUM) • Traversal: choose the N ∈ N_P with count_N ≥ count_M for all M ∈ N_P • Interval: count_{N_C} = 9+6 = 15, count_{N_P} = 8+10 = 18; L = count_{N_C} = 15, H = count_{N_C} + count_{N_P} = 33; hence I = [15, 33] • Estimate: E(count_Q) = L + 0.25·8 + 0.2·10 = 19, weighting each partial node's count by its overlap fraction with Q (25% and 20%)
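The COUNT bounds and the overlap-weighted estimate from the slide, reproduced as a sketch:

```python
# COUNT bounds and estimate from the slide's example nodes.
nc_count = 9 + 6                    # counts of nodes fully contained in Q
partial = [(8, 0.25), (10, 0.20)]   # (count, overlap fraction) per partial node
L = nc_count                        # contained points are certainly in Q
H = nc_count + sum(c for c, _ in partial)  # all partial points might be in Q
estimate = L + sum(c * f for c, f in partial)
print(L, H, estimate)  # 15 33 19.0
```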
AVG • Traversal: choose the N ∈ N_P maximizing count_N and max{(max_N − avg_{N_C}), (avg_{N_C} − min_N)} • Example nodes: A (partial): min = 5, max = 10, count = 5, sum = 35, with value distribution {5, 5, 5, 10, 10}; B (= N_C): count = 10, sum = 55 • Interval: current avg_{N_C} = 55/10 = 5.5; maximum possible: (55 + 2·10)/(10 + 2) = 6.25; minimum possible: (55 + 3·5)/(10 + 3) ≈ 5.38; hence I = [5.38, 6.25] • Estimate: E(avg_Q) = E(sum_Q)/E(count_Q)
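A sketch of the AVG bound computation, under the assumption (matching the slide's example) that the extreme cases pack as many of the partial node's unknown points as possible at its max (upper bound) or at its min (lower bound), consistent with its recorded sum. The helper name `avg_bounds` is hypothetical:

```python
# AVG bounds for one partial node A plus the fully contained set N_C.
def avg_bounds(nc_count, nc_sum, a_min, a_max, a_count, a_sum):
    """Bounds on avg_Q given N_C's (count, sum) and partial node A's
    (min, max, count, sum). Assumes integer multiplicities, a_max > a_min,
    and a_max above / a_min below the current average of N_C."""
    # Largest k with k*a_max + (a_count - k)*a_min <= a_sum:
    k_hi = (a_sum - a_count * a_min) // (a_max - a_min)
    hi = (nc_sum + k_hi * a_max) / (nc_count + k_hi)
    # Largest k with k*a_min + (a_count - k)*a_max >= a_sum:
    k_lo = (a_count * a_max - a_sum) // (a_max - a_min)
    lo = (nc_sum + k_lo * a_min) / (nc_count + k_lo)
    return lo, hi

# Slide's numbers: N_C has count 10, sum 55; A has min 5, max 10, count 5, sum 35.
lo, hi = avg_bounds(nc_count=10, nc_sum=55, a_min=5, a_max=10, a_count=5, a_sum=35)
print(round(lo, 2), round(hi, 2))  # 5.38 6.25
```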
Experiments • Synthetic datasets, 2–4D • Real datasets: 2D spatial (USGS) and 4D (UCI KDD Forest Cover) • MRA-quadtree and MRA-Rtree indices • We study • MRA-tree vs. "plain" tree • MRA-tree vs. online sampling • Accuracy of estimation • Scalability with database size
MRA-Quadtree (Error Reduction) • [figure: absolute and relative error reduction]
MRA-Rtree (2D, USGS): I/O Performance • [figure: I/O performance for varying DB size]
Estimation vs. Maximum Error (4D, Forest Cover, sel. 16% / axis)
MRA-Rtree vs. Online Sampling: Estimation Accuracy (4D, Forest Cover)
Summary • The MRA-Tree is a modified multi-dimensional index for approximate answering of aggregate queries • For exact answers • faster than the "plain" index • Advantages over offline estimators • Progressively improving answers • Error bounds • Advantages over sampling • Better estimates for the same I/O • The algorithm scales gracefully with database size
Future Work (QUASAR Project, UC Irvine) • Scalability with high dimensionality, by using a dedicated high-D index structure • Scalability in high update rate environments • Approximate query processing of general SQL queries using dedicated data structures, similar to MRA-tree