Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure Iosif Lazaridis, Sharad Mehrotra University of California, Irvine SIGMOD 2001, Santa Barbara
Talk Outline • Aggregate Queries • Motivation for Approximate Answering • Multi-Resolution Aggregate Tree (MRA-Tree) • Progressive Algorithm with Error Bounds • Experimental Evaluation • Summary and Future Work
Aggregate Queries • [figure: data points in region S; the query region Q contains the values {2, 7, 6}] • min_Q = 2, max_Q = 7, count_Q = 3, sum_Q = 2+7+6 = 15, avg_Q = 15/3 = 5
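As a concrete illustration (not part of the original slides), the five aggregates over the points inside a query rectangle Q can be computed exactly as below. The coordinates are hypothetical, chosen so that the values inside Q are {2, 7, 6} as on the slide:

```python
# Illustrative sketch: exact computation of the five aggregate functions
# over the data points that fall inside a query rectangle Q.
def in_rect(p, q):
    """True if 2-D point p lies in rectangle q = ((xlo, ylo), (xhi, yhi))."""
    (xlo, ylo), (xhi, yhi) = q
    return xlo <= p[0] <= xhi and ylo <= p[1] <= yhi

def aggregates(points, values, q):
    """Exact aggregates of the values whose points fall inside q."""
    vals = [v for p, v in zip(points, values) if in_rect(p, q)]
    return {"min": min(vals), "max": max(vals), "count": len(vals),
            "sum": sum(vals), "avg": sum(vals) / len(vals)}

# Hypothetical coordinates: the first three points lie inside Q.
points = [(0, 0), (1, 1), (2, 2), (9, 9), (9, 0), (0, 9)]
values = [2, 7, 6, 9, 3, 8]
q = ((0, 0), (3, 3))
print(aggregates(points, values, q))
# {'min': 2, 'max': 7, 'count': 3, 'sum': 15, 'avg': 5.0}
```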
Evaluating Aggregate Queries • Exact answering • Scan all points of D, checking each against Q • Retrieve the points in Q via a multi-dimensional index on D • Both the linear scan and the index scan can be very expensive • Approximate answering • Many applications (selectivity estimation, data analysis, visualization) do not require exact answers
Motivating Examples • Boss: "How many tanks are within 10 miles of me?" • "My boss needs to see the income aggregates in 10 minutes!"
Techniques for Approximate Aggregate Queries • Online estimation (Interactive) • Sampling • Offline estimation (Data Synopsis) • Sampling, Histograms, Wavelets • Our Technique: • Online estimator via a scan of a modified multi-dimensional index (MRA-Tree) • Allows incremental tradeoff of accuracy for response time, with guaranteed error bounds
Multi-Resolution Aggregate Tree (MRA-Tree) • An MRA-Tree can be instantiated with any of the popular multi-dimensional index trees (R-tree, quadtree, Hybrid tree, etc.) • A non-leaf node contains, for each of its subtrees, the four aggregates {MIN, MAX, COUNT, SUM} • A leaf node contains the actual data points • Tree operations are identical to those of the plain (non-MRA) tree, with the consideration that the aggregates must be maintained
MRA-Tree Example • [figure: a non-leaf node stores {min, max, count, sum} for each of its children; its leaf children hold the actual data points, e.g. {4, 5, 2, 6}, {3, 2, 4, 1}, {9, 9, 4, 6}]
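A minimal sketch of this node layout, with hypothetical `Leaf`/`NonLeaf` classes (the names and representation are illustrative, not the paper's code); the per-child aggregate entries are derived from the leaf values when a child is attached:

```python
# Hypothetical MRA-Tree node layout: a non-leaf node keeps, for each child
# subtree, the four aggregates {MIN, MAX, COUNT, SUM}; leaves keep raw values.
from dataclasses import dataclass, field

@dataclass
class Leaf:
    values: list  # the actual data values stored at this leaf

@dataclass
class NonLeaf:
    children: list = field(default_factory=list)    # Leaf or NonLeaf nodes
    child_aggs: list = field(default_factory=list)  # per-child aggregate dicts

    def add_child(self, child):
        vals = collect(child)
        self.children.append(child)
        self.child_aggs.append({"min": min(vals), "max": max(vals),
                                "count": len(vals), "sum": sum(vals)})

def collect(node):
    """All values in a subtree (used only when building the aggregates)."""
    if isinstance(node, Leaf):
        return node.values
    return [v for c in node.children for v in collect(c)]

root = NonLeaf()
root.add_child(Leaf([4, 5, 2, 6]))
root.add_child(Leaf([3, 2, 4, 1]))
print(root.child_aggs[0])  # {'min': 2, 'max': 6, 'count': 4, 'sum': 17}
```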
Progressive Algorithm Outline • We want • The best answer for a given time • The shortest time for a given precision of the answer • To refine an answer at will, trading time for precision • How we achieve it • Do a prioritized traversal of the nodes of the MRA-Tree • Maintain an estimate of the answer, E(agg_Q) • Maintain a 100%-confidence interval I = [L, H] such that L ≤ agg_Q ≤ H
Generic Algorithm (1) • Two sets of nodes: • N_P (partial contribution to the query) • N_C (complete contribution) • [figure: a node N may contain Q, be contained in Q, partially overlap Q, or be disjoint from Q]
Generic Algorithm (2) • Initialize N_P with the root • At each iteration: remove one node N from N_P and, for each N_child of its children • discard, if N_child is disjoint with Q • insert into N_P, if Q is contained in N_child or partially overlaps it • "insert" into N_C, if Q contains N_child (we only need to maintain agg_{N_C})
Generic Algorithm (3) • To instantiate the algorithm for {MIN, MAX, COUNT, SUM, AVG}, we need: • Error bounds: an interval I = [L, H] such that L ≤ agg_Q ≤ H • Traversal policy: which node from N_P to explore next? Minimize |I| • Estimation: provide an estimate of the answer, E(agg_Q)
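The generic algorithm, instantiated for COUNT, can be sketched as a prioritized traversal that expands the largest partial node first and yields guaranteed bounds [L, H] after every step. This is an illustrative sketch with an assumed node representation (dicts carrying a precomputed `overlap` relation to Q), not the paper's code; the example tree is chosen so the first interval matches the slide's I = [15, 33] with N_C counts {9, 6} and N_P counts {8, 10}:

```python
# Progressive COUNT sketch: prioritized traversal of an (assumed) MRA-Tree.
import heapq

def progressive_count(root):
    """Yield (L, H) with L <= count_Q <= H after each refinement step.
    Assumes the root partially overlaps Q."""
    n_c = 0                             # COUNT contributed with certainty
    heap = [(-root["count"], 0, root)]  # N_P as a max-heap on count
    tiebreak = 1                        # avoids comparing dicts on equal counts
    while heap:
        _, _, node = heapq.heappop(heap)
        for child in node["children"]:
            if child["overlap"] == "none":
                continue                       # disjoint with Q: discard
            if child["overlap"] == "full":
                n_c += child["count"]          # Q contains child: use aggregate
            elif child["children"]:
                heapq.heappush(heap, (-child["count"], tiebreak, child))
                tiebreak += 1                  # partial non-leaf: refine later
            else:
                n_c += child["in_q"]           # partial leaf: count points in Q
        yield n_c, n_c + sum(-c for c, _, _ in heap)  # bounds [L, H]

A = {"count": 9, "overlap": "full", "children": []}
B = {"count": 6, "overlap": "full", "children": []}
C = {"count": 8, "overlap": "partial", "children": [
    {"count": 3, "overlap": "partial", "in_q": 2, "children": []},
    {"count": 5, "overlap": "none", "children": []}]}
D = {"count": 10, "overlap": "partial", "children": [
    {"count": 4, "overlap": "full", "children": []},
    {"count": 6, "overlap": "none", "children": []}]}
root = {"count": 33, "overlap": "partial", "children": [A, B, C, D]}
print(list(progressive_count(root)))  # [(15, 33), (19, 27), (21, 21)]
```

The interval shrinks monotonically and closes to the exact answer once N_P is exhausted, which is the progressive time-for-precision tradeoff described above.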
MIN (and MAX) • Traversal: choose the N ∈ N_P with min_N = min_{N_P}, i.e., the partial node with the smallest min • Interval: min_{N_C} = min{4, 5} = 4, min_{N_P} = min{3, 9} = 3; L = min{min_{N_C}, min_{N_P}} = 3, H = min_{N_C} = 4; hence I = [3, 4] • Estimate: the lower bound, E(min_Q) = L = 3
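The slide's MIN interval can be reproduced in a few lines (illustrative only):

```python
# MIN bounds from the slide's example nodes.
nc_mins = [4, 5]  # minima of nodes fully contained in Q
np_mins = [3, 9]  # minima of nodes partially overlapping Q
L = min(min(nc_mins), min(np_mins))  # a partial node *may* contribute its min
H = min(nc_mins)                     # a contained node's min is certainly in Q
print([L, H])  # [3, 4]
```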
COUNT (and SUM) • Traversal: choose the N ∈ N_P with count_N ≥ count_M for all M ∈ N_P • Interval: count_{N_C} = 9+6 = 15, count_{N_P} = 8+10 = 18; L = count_{N_C} = 15, H = count_{N_C} + count_{N_P} = 33; hence I = [15, 33] • Estimate: E(count_Q) = L + 0.25·8 + 0.2·10 = 19, weighting each partial node's count by its overlap fraction with Q (25% and 20%)
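The COUNT bounds and the overlap-weighted estimate from the slide, reproduced as a sketch:

```python
# COUNT bounds and estimate from the slide's example nodes.
nc_count = 9 + 6                    # counts of nodes fully contained in Q
partial = [(8, 0.25), (10, 0.20)]   # (count, overlap fraction) per partial node
L = nc_count                        # contained points are certainly in Q
H = nc_count + sum(c for c, _ in partial)  # all partial points might be in Q
estimate = L + sum(c * f for c, f in partial)
print(L, H, estimate)  # 15 33 19.0
```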
AVG • Traversal: choose the N ∈ N_P maximizing count_N and max{(max_N − avg_{N_C}), (avg_{N_C} − min_N)} • Example nodes: A (partial): min = 5, max = 10, count = 5, sum = 35, with value distribution {5, 5, 5, 10, 10}; B (= N_C): count = 10, sum = 55 • Interval: current avg_{N_C} = 55/10 = 5.5; maximum possible: (55 + 2·10)/(10 + 2) = 6.25; minimum possible: (55 + 3·5)/(10 + 3) ≈ 5.38; hence I = [5.38, 6.25] • Estimate: E(avg_Q) = E(sum_Q)/E(count_Q)
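A sketch of the AVG bound computation, under the assumption (matching the slide's example) that the extreme cases pack as many of the partial node's unknown points as possible at its max (upper bound) or at its min (lower bound), consistent with its recorded sum. The helper name `avg_bounds` is hypothetical:

```python
# AVG bounds for one partial node A plus the fully contained set N_C.
def avg_bounds(nc_count, nc_sum, a_min, a_max, a_count, a_sum):
    """Bounds on avg_Q given N_C's (count, sum) and partial node A's
    (min, max, count, sum). Assumes integer multiplicities, a_max > a_min,
    and a_max above / a_min below the current average of N_C."""
    # Largest k with k*a_max + (a_count - k)*a_min <= a_sum:
    k_hi = (a_sum - a_count * a_min) // (a_max - a_min)
    hi = (nc_sum + k_hi * a_max) / (nc_count + k_hi)
    # Largest k with k*a_min + (a_count - k)*a_max >= a_sum:
    k_lo = (a_count * a_max - a_sum) // (a_max - a_min)
    lo = (nc_sum + k_lo * a_min) / (nc_count + k_lo)
    return lo, hi

# Slide's numbers: N_C has count 10, sum 55; A has min 5, max 10, count 5, sum 35.
lo, hi = avg_bounds(nc_count=10, nc_sum=55, a_min=5, a_max=10, a_count=5, a_sum=35)
print(round(lo, 2), round(hi, 2))  # 5.38 6.25
```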
Experiments • Synthetic datasets, 2–4D • Real datasets: 2D spatial (USGS) and 4D (UCI KDD Forest Cover) • MRA-quadtree and MRA-Rtree indices • We study • MRA-tree vs. "plain" tree • MRA-tree vs. online sampling • Accuracy of estimation • Scalability with database size
MRA-Quadtree (Error Reduction) • [figure: absolute and relative error reduction]
MRA-Rtree (2D, USGS): I/O Performance • [figure: I/O performance for varying DB size]
Estimation vs. Maximum Error (4D, Forest Cover, sel. 16% / axis)
MRA-Rtree vs. Online Sampling: Estimation Accuracy (4D, Forest Cover)
Summary • The MRA-Tree is a modified multi-dimensional index for approximate answering of aggregate queries • For exact answers • faster than the "plain" index • Advantages over offline estimators • Progressively improving answers • Error bounds • Advantages over sampling • Better estimates for the same I/O • The algorithm scales gracefully with database size
Future Work (QUASAR Project, UC Irvine) • Scalability with high dimensionality, by using a dedicated high-D index structure • Scalability in high update rate environments • Approximate query processing of general SQL queries using dedicated data structures, similar to MRA-tree