Spatio-Temporal Aggregation Using Sketches
Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias
Department of Computer Science, City University of Hong Kong, Boston University, Hong Kong University of Science and Technology
18 March 2004
Outline
• Applications and motivation
• Preliminaries
– Aggregate trees and sketch techniques
• Distinct spatio-temporal aggregation
• Performance study
• Extensions
• Conclusion
Spatio-Temporal Aggregate Query: Applications
• Traffic supervision systems
– Monitoring the number of vehicles in a district; the information can be used to identify traffic-jam areas, etc.
• Mobile computing applications
– Allocating bandwidth depending on the usage of each region
Example: A wireless company would like to know the number of cell phone users in a particular region during a specified period. It is also interesting to know the total number of phone calls made by all users who satisfy the first query.
Spatio-Temporal Aggregate Query
• Spatio-temporal applications require the retrieval of summarized information about moving objects.
• Given a query region (a rectangle qr) and a query interval qt, a spatio-temporal aggregate query retrieves information about the objects that appeared in qr during qt.
• Spatio-temporal count
– Returns the total number of qualifying objects.
• Spatio-temporal sum
– Each object is associated with a measure; the query outputs the sum of the measures of the qualifying objects.
• Existing approach: multi-tree structures based on R-trees and B-trees.
• Problem: if an object remains in the query region for several timestamps during the query interval, it is counted (or summed) multiple times in the result.
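The multiple-counting problem can be illustrated with a small Python sketch. The observation tuples and the function names below are hypothetical and only serve the illustration; they are not taken from the slides.

```python
# Hypothetical observations: (timestamp, region, object_id).
observations = [
    (1, "r1", "car_a"),
    (2, "r1", "car_a"),
    (3, "r1", "car_a"),
    (2, "r1", "car_b"),
]

def per_timestamp_count(obs, regions, t_start, t_end):
    """Add up one count per (timestamp, object) pair, as happens when
    pre-aggregated per-timestamp counts are summed."""
    return sum(1 for t, r, _ in obs if r in regions and t_start <= t <= t_end)

def distinct_count(obs, regions, t_start, t_end):
    """Count each qualifying object once, by enumerating object ids."""
    return len({o for t, r, o in obs if r in regions and t_start <= t <= t_end})

print(per_timestamp_count(observations, {"r1"}, 1, 3))  # car_a is counted three times
print(distinct_count(observations, {"r1"}, 1, 3))
```

Here car_a stays in r1 for three timestamps, so the summed count exceeds the number of distinct objects.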
Spatio-Temporal Aggregate Query (cont.)
• Motivation: distinct spatio-temporal aggregate queries enable a much richer range of decision-making queries.
• But: there is no way to exactly summarize distinct objects substantially better than by simply enumerating all of them.
• How to answer a "distinct aggregate query"? E.g., how many cars are present in a district?
• Solution: sketch techniques combined with spatio-temporal aggregation index trees.
Example
[Figure: regions r1 to r4, grouped under R-tree nodes R1 and R2, each with an aggregate value at timestamps 1 to 5, and a query rectangle q.]
Example query: retrieve the aggregate sum (during time T1-T3) of all rectangles that intersect q.
Preliminaries: Aggregate RB-tree (aRB-tree)
In the aRB-tree, the extents of all regions (in this case r1, r2, …, r4) are stored in an R-tree. Each (leaf or non-leaf) entry of the R-tree is associated with a pointer to a B-tree that stores historical aggregate data about the entry.
Preliminaries: Flajolet-Martin sketches
• Goal: small-space representation of a set of items.
• Prerequisite: let h be a random, binary hash function.
• Sketch of an item: for each unique item with ID x,
– for each integer 1 ≤ i ≤ k in turn, compute h(x, i);
– stop when h(x, i) = 1, and set bit i.
• Sketch of a union of items is the OR of their bitmaps.
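The insertion rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the bitmap length K = 32, the SHA-256 based stand-in for the binary hash h, and the function names are all assumptions.

```python
import hashlib

K = 32  # assumed bitmap length (the slides call it k)

def h(x, i):
    """Assumed stand-in for the random binary hash h(x, i): one bit from SHA-256."""
    return hashlib.sha256(f"{x}:{i}".encode()).digest()[0] & 1

def sketch_of_item(x, k=K):
    """Bitmap (as an int) of a single item; slide bit i is stored at int bit i-1."""
    for i in range(1, k + 1):
        if h(x, i) == 1:
            return 1 << (i - 1)  # stop at the first i with h(x, i) = 1
    return 0  # no position fired within k tries

def sketch_of_set(items, k=K):
    """Sketch of a union of items: the OR of the per-item bitmaps."""
    bitmap = 0
    for x in items:
        bitmap |= sketch_of_item(x, k)
    return bitmap
```

Because the hash depends only on the item ID, inserting the same item twice leaves the bitmap unchanged, which is what makes the sketch suitable for distinct counting.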
Preliminaries: Flajolet-Martin sketches (cont.)
Estimating COUNT
• Take the bitmap of a set of N items.
• Let j be the position of the leftmost zero in the bitmap (counting positions from 0).
• j is an estimator of log2(0.77 N).
• Example: S = 1 1 1 0 1, so j = 3. Best guess: COUNT ~ 11.
• Fixable drawback: the variance of the estimate is large.
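As a small worked example of the estimator, the following Python sketch inverts j ≈ log2(0.77 N) to obtain N ≈ 2^j / 0.77. Representing the bitmap as a string and the function names are assumptions made for illustration.

```python
PHI = 0.77  # correction constant quoted on the slide

def leftmost_zero(bits: str) -> int:
    """Position (from 0) of the first '0' in the bitmap; len(bits) if there is none."""
    pos = bits.find("0")
    return len(bits) if pos == -1 else pos

def estimate_count(bits: str) -> float:
    """Invert j ~ log2(PHI * N) to get N ~ 2**j / PHI."""
    return 2 ** leftmost_zero(bits) / PHI

print(leftmost_zero("11101"))   # j for the slide's example bitmap
print(estimate_count("11101"))  # 2**3 / 0.77
```

For the slide's bitmap 11101 this gives j = 3 and 8 / 0.77, about 10.4, in line with the slide's best guess of roughly 11.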
Preliminaries: Flajolet-Martin sketches (cont.)
• Standard variance reduction methods apply.
– Compute m independent bitmaps in parallel.
– Generate m independent estimates of N.
– Take the mean of the estimates.
• Provable tradeoffs between m and the variance of the estimator.
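The averaging step can be sketched as below, taking the slide literally (the mean of m per-bitmap estimates). The three example bitmaps are made up for illustration, and other ways of combining bitmaps exist; this is only one reading of the slide.

```python
from statistics import mean

PHI = 0.77  # correction constant quoted on the earlier slide

def estimate_one(bits: str) -> float:
    """Single-bitmap estimate 2**j / PHI, with j the position of the leftmost zero."""
    j = bits.find("0")
    if j == -1:
        j = len(bits)
    return 2 ** j / PHI

def estimate_mean(bitmaps):
    """Mean of the m independent per-bitmap estimates."""
    return mean(estimate_one(b) for b in bitmaps)

# Hypothetical sketch made of m = 3 bitmaps with j = 3, 2 and 4.
print(estimate_mean(["11101", "11011", "11110"]))
```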
Distinct Spatio-Temporal Aggregation
• Exact solution: if n is the number of distinct objects and T is the total number of timestamps in history, the exact solution requires Ω(n∙T) space.
• Existing aggregation approach: the aRB-tree stores only summarized data; information about individual objects is lost, so the problem cannot be solved.
• Our solution: combine the aRB-tree with the FM sketch technique!
– For each region ri and every timestamp t we maintain a sketch si(t) that captures the (ids of) objects in ri at t.
– Requires O(m∙R∙T∙log n) space, where R is the number of regions and m is an adjustable constant specifying the number of bitmaps used by one sketch (m determines the tradeoff between overhead and approximation accuracy).
System Architecture
The sketches can be stored in a two-dimensional array.
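A minimal sketch of how such an array could be queried, assuming it is indexed by region and timestamp and that each sketch is a single bitmap held as a Python int (both assumptions; the bitmap values are made up). The result sketch for a query is the OR of all qualifying sketches.

```python
def query_sketch(sketches, regions, t_start, t_end):
    """OR together sketches[i][t] for every region i in `regions`
    and every timestamp t with t_start <= t <= t_end."""
    result = 0
    for i in regions:
        for t in range(t_start, t_end + 1):
            result |= sketches[i][t]
    return result

# Hypothetical array: 2 regions x 3 timestamps, one 5-bit bitmap per cell.
sketches = [
    [0b11000, 0b11000, 0b10000],
    [0b00100, 0b00000, 0b01100],
]

print(bin(query_sketch(sketches, [0, 1], 0, 2)))
```

Scanning the array directly touches every cell in the query range; the index structure on the following slide avoids this by storing pre-combined sketches in tree nodes.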
Sketch Indexing Structures
[Figure: an R-tree for the spatial dimensions (nodes R1, R2 over regions r1 to r4), with one B-tree per R-tree entry storing <time, sketch> pairs, and an example query q with qt=(1,4).]
The sketch of a non-leaf entry in a B-tree equals the OR of all the sketches in its sub-tree.
Query Processing
• Similar to the query processing technique of the aRB-tree.
– Basic idea: the spatial and temporal search conditions are applied alternately, and the result sketch is updated incrementally.
• Can be improved by applying pruning techniques.
– Heuristic 1: Let RS be the current result sketch, and e a non-leaf B-tree entry whose associated sketch is se. Then the sub-tree of e can be pruned if (se OR RS) = RS.
– Heuristic 2: Given a set of entries that cannot be pruned by Heuristic 1, we visit their child nodes in descending order of the number of 1's in their sketches.
• And more heuristics!
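The two heuristics reduce to simple bit operations. The Python sketch below assumes each sketch is a single bitmap held as an int; the function names are hypothetical.

```python
def can_prune(se: int, rs: int) -> bool:
    """Heuristic 1: the sub-tree of e adds nothing new if (se OR RS) = RS."""
    return (se | rs) == rs

def visit_order(entry_sketches):
    """Heuristic 2: order entries by descending number of 1 bits in their sketches."""
    return sorted(entry_sketches, key=lambda s: bin(s).count("1"), reverse=True)

rs = 0b11100
print(can_prune(0b10100, rs))  # every bit of se is already set in RS
print(can_prune(0b00010, rs))  # se would set a new bit, so e must be visited
print([bin(s) for s in visit_order([0b10000, 0b11100, 0b11000])])
```

Visiting the densest sketches first tends to fill the result sketch quickly, which gives Heuristic 1 more chances to prune the remaining entries.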
Query Processing: Supporting Distinct Sum Queries
Extending FM sketches
• FM sketches can handle this:
– to insert a value of 500, perform 500 distinct item insertions.
• Our observation: we can simulate a large number of insertions into an FM sketch more efficiently.
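The straightforward approach (one insertion per unit of the value) can be sketched as follows. This shows only the naive method named on the slide, not the more efficient simulation, whose details are not given here. The hash stand-in, the bitmap length, and the (object id, counter) item encoding are assumptions.

```python
import hashlib

K = 32  # assumed bitmap length

def h(x, i):
    """Assumed stand-in for the random binary hash h(x, i)."""
    return hashlib.sha256(f"{x}:{i}".encode()).digest()[0] & 1

def insert_item(bitmap, x, k=K):
    """Set the bit of the first position i (1-based) where h(x, i) = 1."""
    for i in range(1, k + 1):
        if h(x, i) == 1:
            return bitmap | (1 << (i - 1))
    return bitmap

def insert_value(bitmap, object_id, value, k=K):
    """Naive distinct-sum insertion: `value` distinct items (object_id, 0..value-1)."""
    for c in range(value):
        bitmap = insert_item(bitmap, (object_id, c), k)
    return bitmap

once = insert_value(0, "bus_1", 500)
twice = insert_value(once, "bus_1", 500)
print(once == twice)  # re-inserting the same object and value changes nothing
```

Since the inserted items are derived from the object id, an object that appears at several timestamps contributes its value only once to the combined sketch.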
Performance
• Dataset settings
– Number of cities = 10,000
– Number of buses = 100,000
– History length = 1,00 timestamps
– Number of passengers for each bus: in [200,300]
– At each timestamp, each bus reports to its nearest city: <time t, city c, bus b, passenger # a>
• Each query has 2 parameters: spatial extent and interval length.
• A count query retrieves the number of distinct buses that report to cities in qr during qt, while a sum query returns the sum of these buses' passengers.
• Compare the sketch index to the relational approach: index the 4-tuple table <t,c,b,a> using a B-tree on the time column t.
Results (Space Consumption)
[Figure: size (megabytes) of the database and of the sketch index with 8, 16, and 32 bitmaps per sketch.]
The size of the sketch index could be further reduced by applying simple compression techniques!
Results (Sketch Pruning in Query) (a) Cost vs. qrlen (qtlen=10)
Results (Sketch Pruning in Query) (b) Cost vs. qtlen (qrlen=0.15)
Results (Accuracy of Approximate Results) (a) Error vs. qrlen (qtlen=10, count) [series: 8-bitmap, 16-bitmap, 32-bitmap]
Results (Accuracy of Approximate Results) (b) Error vs. qrlen (qtlen=10, sum) [series: 8-bitmap, 16-bitmap, 32-bitmap]
Results (Costs of Indexes) (a) Cost vs. qrlen (qtlen=10) [series: 8-bitmap, 16-bitmap, 32-bitmap]
Results (Costs of Indexes) (b) Cost vs. qtlen (qrlen=0.15) [series: 8-bitmap, 16-bitmap, 32-bitmap]
Extensions
• Approximating general moving data
– Problem: each object o reports its location <x,y> at each timestamp t, so the size of the database grows continuously: O(n∙T).
– Solution: impose a res×res regular grid over the data space and apply the sketch index by treating the grid cells as the finest aggregate granularity: O(res²∙T∙log n) [or O(T∙log n) when res is a constant].
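Mapping a reported location to its grid cell could look like the sketch below. It assumes the data space is normalized to the unit square and that cells are numbered row by row; both are assumptions, as is the function name.

```python
def grid_cell(x: float, y: float, res: int) -> int:
    """Map a location (x, y) in the unit square to a cell id in [0, res*res).

    Points on the upper boundary (x = 1.0 or y = 1.0) are clamped to the last cell.
    """
    col = min(int(x * res), res - 1)
    row = min(int(y * res), res - 1)
    return row * res + col

# With res = 4 there are 16 cells, each of which plays the role of a region.
print(grid_cell(0.3, 0.6, 4))
```

Each cell id then acts as a region ri, so the number of maintained sketches depends on res and T rather than on the number of objects.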
Conclusion
• We propose a sketch index that integrates traditional approximate counting techniques with spatio-temporal indexes for efficient distinct aggregate query processing in spatio-temporal databases.
• The sketch index consumes less space than a conventional database and processes queries an order of magnitude faster, with small aggregate error.
• Extensions and future work
– Other possible sketches
– More sophisticated algorithms for mining association rules