Efficient Skyline Computation on Vertically Partitioned Datasets. Dimitris Papadias, David Yang, Georgios Trimponias, CSE Department, HKUST, Hong Kong.
Outline
• Problem Statement
• Skyline Computation on Vertically Partitioned Datasets using Balke’s Algorithm
• Algorithms for Top-k Query Processing
• FM Sketches
• Putting Everything Together
A Motivating Example
• Consider a database containing information about hotels. The y-dimension represents the price of the room, whereas the x-dimension captures the distance of the room from the beach.
[Figure: hotel rooms plotted by distance (x-axis) and price (y-axis), highlighting the skyline objects, a point p, and the borders of p’s dominance region.]
Skyline Preliminaries [ICDE, 2001]
• Skylines are a useful tool in numerous disciplines, such as multidimensional decision making and data mining.
• Given a set of d-dimensional objects p_1, …, p_N, the skyline operator retrieves all objects that are not dominated by any other object in the set.
• An object p_i dominates another object p_j if it is not worse than p_j in all dimensions and better than p_j in at least one dimension.
• Properties:
  • The top-1 tuple according to any monotone preference function that assigns scores to tuples is in the skyline.
  • Conversely, for any skyline tuple, there exists a preference function according to which it is the top-1.
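The dominance definition above can be sketched in a few lines. This is a minimal illustration, assuming that smaller values are better in every dimension; the function names and the hotel data (distance, price) are hypothetical and not taken from the slides.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q: not worse in any dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Naive quadratic skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (distance to beach, price); smaller is better in both.
hotels = [(1.0, 90.0), (2.0, 60.0), (3.0, 70.0), (4.0, 40.0), (5.0, 50.0)]
result = skyline(hotels)
```

Here (3.0, 70.0) is dominated by (2.0, 60.0) and (5.0, 50.0) by (4.0, 40.0), so the remaining three hotels form the skyline.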
Problem Definition
• Compute the skyline when the dataset is vertically decomposed among a set of N servers.
• Goal: minimize the data that must be retrieved from each server.
• We assume wireless environments, where communication overhead constitutes the dominant factor in battery consumption.
• Consider mobile phone applications as real-world examples.
First Observations
• The global skyline may contain points that do not appear in the local skylines.
• Instead of transmitting all records over the network, avoid sending out points that are guaranteed to be dominated globally by an anchor point.
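The first observation can be checked on a tiny example. The sketch below assumes a hypothetical 4-dimensional dataset split into two 2-dimensional subspaces, with smaller values being better; the point (2, 2, 2, 2) is in the global skyline although it is dominated in both local skylines.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    return [p for p in points if not any(dominates(q, p) for q in points)]


def project(p: Point, dims: Sequence[int]) -> Point:
    """Restrict a point to the dimensions stored at one server."""
    return tuple(p[d] for d in dims)


# Hypothetical data: server 1 holds dimensions 0-1, server 2 holds dimensions 2-3.
points = [(1, 1, 9, 9), (9, 9, 1, 1), (2, 2, 2, 2)]
subspaces = [(0, 1), (2, 3)]

global_skyline = skyline(points)
local_skylines = [skyline([project(p, dims) for p in points]) for dims in subspaces]
```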
Balke’s Algorithm [EDBT, 2004]
• Assume that the d-dimensional database is vertically partitioned into d lists, one per dimension, each assigned to a different server. Each list stores its values in ascending order.
• Idea: perform sorted accesses on the d lists in a round-robin manner until some point p (the anchor) has been reached in every list.
• Points that have not shown up in any list by that moment can be safely pruned, since they are dominated by the anchor.
Example
• Consider a 2-dimensional database with the following two lists, L1 and L2.
[Lists L1 and L2: entries not recoverable, shown only as … in the source.]
• The anchor is the first point to be retrieved from both lists.
• The points that have not yet appeared in either list at that moment cannot be part of the skyline.
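Since the list contents of the example are not available, the following sketch replays the basic round-robin phase on hypothetical data. It assumes distinct values per dimension (no ties) and that smaller is better; `find_anchor` and the six points are illustrative names and values, not part of the original slides.

```python
from typing import Dict, List, Set, Tuple

SortedList = List[Tuple[str, float]]  # (point id, value), ascending by value


def find_anchor(lists: List[SortedList]) -> Tuple[str, Set[str]]:
    """Round-robin sorted access until one id has been seen in every list.

    Returns the anchor id and the ids seen so far (the skyline candidates).
    """
    seen_in: Dict[str, Set[int]] = {}
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            pid, _value = lst[depth]
            seen_in.setdefault(pid, set()).add(i)
            if len(seen_in[pid]) == len(lists):
                return pid, set(seen_in)
    raise ValueError("all lists must contain the same point ids")


# Hypothetical 2-dimensional points.
points = {"a": (1, 5), "b": (2, 2), "c": (4, 1), "d": (3, 4), "e": (5, 3), "f": (6, 6)}
lists = [
    sorted(((pid, v[dim]) for pid, v in points.items()), key=lambda e: e[1])
    for dim in range(2)
]

anchor, candidates = find_anchor(lists)
pruned = set(points) - candidates  # never accessed, dominated by the anchor
```

With this data the anchor is "b", found at depth two of both lists; "d", "e" and "f" are never accessed, and each of them is indeed dominated by (2, 2).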
Further Improvement
• Efficiency can be improved if, instead of visiting the lists in a round-robin manner, we access the most promising list with random accesses.
• As a result, each list is expanded only as far as necessary.
[Figure: lists L1 and L2, each containing the anchor P, with a group of entries marked “avoid visiting these points”.]
Setting
• Let N_1, ..., N_m be m servers storing the same dataset DB.
• For each record P ∈ DB, every server N_i maintains a local score s_i(P) and sorts all records in decreasing order of their local scores.
• A client wishes to obtain the k records of DB with the maximum global score s.
• The score is computed using a monotonic function f on the local scores, i.e., s(P) = f(s_1(P), ..., s_m(P)).
• Goal: minimize the required number of accesses.
Fagin’s Algorithm [PODS, 2001]
• Each server N_i performs sorted accesses in a round-robin manner and sends the client its next record together with the local score.
• When the first common record P_anc has been encountered at all servers, the client terminates the sorted accesses.
• It then obtains the missing local scores of the other encountered points through random accesses.
• The candidate with the highest global score is the top-1 result.
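The steps above can be sketched as follows. This is a simplified, single-process illustration: the servers are plain Python lists, random access is a dictionary lookup, f defaults to the sum, and the function name and scores are hypothetical. The sorted phase is generalized to stop once k records have been seen at all servers.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

ScoreList = List[Tuple[str, float]]  # (record id, local score), descending


def fagin_top_k(
    lists: List[ScoreList],
    k: int,
    f: Callable[[Sequence[float]], float] = sum,
) -> List[str]:
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_at: Dict[str, Set[int]] = {}
    seen_everywhere = 0
    depth = 0
    # Sorted phase: stop once k records have been seen at all m servers.
    while seen_everywhere < k and depth < len(lists[0]):
        for i, lst in enumerate(lists):
            pid, _score = lst[depth]
            servers = seen_at.setdefault(pid, set())
            servers.add(i)
            if len(servers) == m:
                seen_everywhere += 1
        depth += 1

    # Random-access phase: complete the scores of every encountered record.
    def global_score(pid: str) -> float:
        return f([table[pid] for table in lookup])

    return sorted(seen_at, key=global_score, reverse=True)[:k]


server_1 = [("a", 0.875), ("b", 0.75), ("c", 0.25), ("d", 0.125)]
server_2 = [("c", 1.0), ("b", 0.75), ("a", 0.25), ("d", 0.125)]
top_1 = fagin_top_k([server_1, server_2], k=1)
top_2 = fagin_top_k([server_1, server_2], k=2)
```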
Threshold Algorithm [PODS, 2001]
• TA uses an upper bound (threshold) on the global score to terminate earlier than FA.
• The client retrieves the local scores of newly encountered points with random accesses at the remaining servers, computes their global scores, and keeps the best score s_best.
• The threshold is equal to the sum of the local thresholds at each server, i.e., of the last local scores returned by sorted access.
• As long as the threshold exceeds s_best, TA continues the sorted accesses and keeps updating the threshold.
• Eventually, the top-1 point is returned.
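A minimal top-1 sketch of this stopping rule, under the same simplifying assumptions as before (in-memory lists, dictionary lookups as random accesses, f as the sum). The function name and the scores are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

ScoreList = List[Tuple[str, float]]  # (record id, local score), descending


def threshold_top_1(
    lists: List[ScoreList],
    f: Callable[[Sequence[float]], float] = sum,
) -> Tuple[Optional[str], float, int]:
    """Return (best id, best global score, number of sorted accesses)."""
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    best_id, best_score = None, float("-inf")
    sorted_accesses = 0
    for depth in range(len(lists[0])):
        last_seen = []  # local thresholds: last score seen at each server
        for lst in lists:
            pid, local = lst[depth]
            sorted_accesses += 1
            last_seen.append(local)
            total = f([table[pid] for table in lookup])
            if total > best_score:
                best_id, best_score = pid, total
        # No unseen record can score above f(last_seen), so stop once s_best reaches it.
        if best_score >= f(last_seen):
            break
    return best_id, best_score, sorted_accesses


server_1 = [("a", 0.875), ("b", 0.75), ("c", 0.25), ("d", 0.125)]
server_2 = [("c", 1.0), ("b", 0.75), ("a", 0.25), ("d", 0.125)]
best_id, best_score, accesses = threshold_top_1([server_1, server_2])
```

After the first round the threshold is 1.875 while the best score is 1.25, so the scan continues; after the second round both equal 1.5 and the scan stops.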
Best Position Algorithm [VLDB, 2007]
• BPA further improves TA by using a tighter threshold.
• Let bp_i be the best position at server N_i, i.e., the greatest position such that all points up to bp_i have been encountered through sorted or random accesses.
• The global threshold BP is equal to the sum of the local thresholds at the positions bp_i.
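The best-position threshold can be sketched as below. This is an illustration under stated assumptions, not the published implementation: positions are 0-based, a random access is assumed to reveal the position of the record in the accessed list, f is the sum, and all names and scores are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

ScoreList = List[Tuple[str, float]]  # (record id, local score), descending


def best_position_top_1(
    lists: List[ScoreList],
    f: Callable[[Sequence[float]], float] = sum,
) -> Tuple[Optional[str], float, int]:
    """Return (best id, best global score, number of sorted accesses)."""
    m = len(lists)
    position = [{pid: pos for pos, (pid, _s) in enumerate(lst)} for lst in lists]
    seen: List[Set[int]] = [set() for _ in lists]  # positions seen per server
    best_id, best_score = None, float("-inf")
    sorted_accesses = 0
    for depth in range(len(lists[0])):
        for i in range(m):
            pid, _local = lists[i][depth]
            sorted_accesses += 1
            local_scores = []
            for j in range(m):  # j == i is the sorted access, the rest are random
                pos = position[j][pid]
                seen[j].add(pos)
                local_scores.append(lists[j][pos][1])
            total = f(local_scores)
            if total > best_score:
                best_id, best_score = pid, total
        # Best position per server: end of the fully seen prefix.
        bp_scores = []
        for j in range(m):
            bp = 0
            while bp + 1 in seen[j]:
                bp += 1
            bp_scores.append(lists[j][bp][1])
        if best_score >= f(bp_scores):
            break
    return best_id, best_score, sorted_accesses


server_1 = [("a", 1.0), ("b", 0.5), ("c", 0.25), ("d", 0.125)]
server_2 = [("b", 1.0), ("a", 0.75), ("c", 0.25), ("d", 0.125)]
best_id, best_score, accesses = best_position_top_1([server_1, server_2])
```

On this data the first round already covers the first two positions of both lists, so the threshold drops to 0.5 + 0.75 = 1.25 and the scan stops after two sorted accesses. A TA-style threshold over the last sorted positions would still be 2.0 at that point.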
Flajolet / Martin sketches [JCSS ’85]
• Goal: estimate the number of distinct objects from a small-space representation of a set.
• The sketch of a union of items is the OR of their sketches.
• Insertion order and duplicates don’t matter!
Prerequisite: let h be a random, binary hash function.
Sketch of an item:
• For each unique item with ID x, and for each integer 1 ≤ i ≤ k in turn, compute h(x, i).
• Stop when h(x, i) = 1, and set bit i.
Flajolet / Martin sketches (cont.)
Estimating COUNT
• Take the sketch of a set of N items.
• Let j be the position of the leftmost zero in the sketch.
• j is an estimator of log2(0.77 N).
• Example: S = 1 1 1 0 1 gives j = 3. Best guess: COUNT ~ 11.
Fixable drawbacks:
• The estimate has a slight bias.
• The variance of the estimate is large.
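A single sketch can be illustrated as follows. Several choices here are assumptions made for the sketch: h(x, i) is read off as bit i of one 64-bit BLAKE2b digest of x, the sketch is stored as a Python integer whose lowest bit is bit 1, and j is taken as the number of ones before the leftmost zero, which matches j = 3 for S = 1 1 1 0 1.

```python
import hashlib
from typing import Iterable

K = 32  # sketch length in bits (hypothetical choice)


def item_sketch(x: object, seed: int = 0, k: int = K) -> int:
    """Sketch of one item: set bit i for the first i with h(x, i) = 1."""
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    bits = int.from_bytes(digest, "big")
    for i in range(1, k + 1):
        if (bits >> (i - 1)) & 1:  # plays the role of h(x, i) = 1
            return 1 << (i - 1)
    return 0  # no 1 among the first k hash bits (probability 2**-k)


def set_sketch(items: Iterable[object], seed: int = 0) -> int:
    """Sketch of a set: OR of the item sketches."""
    sketch = 0
    for x in items:
        sketch |= item_sketch(x, seed)
    return sketch


def leading_ones(sketch: int) -> int:
    """Number of consecutive set bits starting from bit 1."""
    j = 0
    while (sketch >> j) & 1:
        j += 1
    return j


def estimate_count(sketch: int) -> float:
    """Invert j ~ log2(0.77 N)."""
    return 2 ** leading_ones(sketch) / 0.77
```

Because the set sketch is a bitwise OR, merging two sketches gives the sketch of the union, and re-inserting an item changes nothing.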
Flajolet / Martin sketches (cont.)
• Standard variance reduction methods apply.
• Compute m independent sketches in parallel.
• Compute m independent estimates of N.
• Take the mean of the estimates.
• Provable tradeoffs between m and variance of the estimator.
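Variance reduction with m sketches might look like the sketch below. Note the swapped-in detail: instead of averaging the m estimates directly, this version averages the exponents j and exponentiates once, a common variant of the same idea. The m hash functions are simulated by seeding one BLAKE2b hash, and 0.77351 is the unrounded form of the constant 0.77; all names are hypothetical.

```python
import hashlib
from typing import Iterable, List

PHI = 0.77351  # correction constant, rounded to 0.77 in the text above


def lowest_one(x: object, seed: int) -> int:
    """Item sketch: isolate the lowest set bit of a seeded 64-bit hash."""
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    bits = int.from_bytes(digest, "big")
    return bits & -bits  # 0 only if the hash is all zeros


def build_sketches(items: Iterable[object], m: int) -> List[int]:
    """m independent sketches of the same set, one per seed."""
    sketches = [0] * m
    for x in items:
        for seed in range(m):
            sketches[seed] |= lowest_one(x, seed)
    return sketches


def leading_ones(sketch: int) -> int:
    j = 0
    while (sketch >> j) & 1:
        j += 1
    return j


def estimate_count(sketches: List[int]) -> float:
    """Average j over the sketches, then invert j ~ log2(PHI * N)."""
    mean_j = sum(leading_ones(s) for s in sketches) / len(sketches)
    return 2 ** mean_j / PHI


estimate = estimate_count(build_sketches(range(1000), m=64))
```

With 64 sketches the estimate for 1000 distinct items should land reasonably close to 1000, whereas a single sketch can only return a power of two divided by the constant.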
Application to COUNT in Sensor Databases
• Each sensor computes k independent sketches of itself (using its unique ID x).
• Each sensor computes a sketch of its value.
• Use a robust routing algorithm to route sketches up to the sink.
• Aggregate the k sketches via union (OR) en route.
• The sink then estimates the count.
[Figure: sensors route sketches S1, S2, S3, S4 toward the sink; partial unions such as S1∪S2∪S3 and S1∪S2∪S3∪S3 are formed en route.]
Problem Characteristics
• Each vertical partition may have arbitrary dimensionality, contrary to Balke’s setting.
• Anchor selection largely determines the total amount of transmitted data.
• VPS sorts on local dominance. In particular, the local dominance count dom_i(P) of a point P with respect to subspace D_i is the number of points dominated by P in D_i.
• Balke selects as the anchor the data point P with the maximal domSUM(P).
• We use a tighter upper bound for dom(P): domMIN(P), the minimum among all local dominance counts of P.
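The three quantities can be compared on synthetic data. The sketch below assumes that smaller values are better and that no two points tie in any subspace (true for the random floats used here); under that assumption every globally dominated point is also locally dominated in each subspace, so dom(P) ≤ domMIN(P) ≤ domSUM(P). The dataset and function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def project(p: Point, dims: Sequence[int]) -> Point:
    return tuple(p[d] for d in dims)


def local_dom(p: Point, points: List[Point], dims: Sequence[int]) -> int:
    """dom_i(P): number of points that P dominates inside one subspace."""
    pp = project(p, dims)
    return sum(dominates(pp, project(q, dims)) for q in points)


def global_dom(p: Point, points: List[Point]) -> int:
    """dom(P): number of points that P dominates in the full space."""
    return sum(dominates(p, q) for q in points)


rng = random.Random(0)
points = [tuple(rng.random() for _ in range(4)) for _ in range(50)]
subspaces = [(0, 1), (2, 3)]  # two servers with two dimensions each

stats = []  # (dom, domMIN, domSUM) per point
for p in points:
    counts = [local_dom(p, points, dims) for dims in subspaces]
    stats.append((global_dom(p, points), min(counts), sum(counts)))
```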
Anchor Selection
[Figure: three candidate anchors.]
• A: has maximal domMIN
• B: has maximal domSUM
• C: optimal anchor point
1st Optimization: Multiple Anchor Points
• The previous algorithm performs pruning with a single anchor P_anc. Specifically, a point P that is locally dominated by P_anc in all subspaces is not sent to the client.
• On the other hand, if P is incomparable with P_anc even in a single subspace D_i, it will be transmitted by the corresponding server N_i.
• We suggest that multiple anchor points can often achieve more effective pruning.
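The effect of additional anchors can be illustrated centrally. The pruning rule used here is an assumption for the sketch: a point is pruned when at least one single anchor locally dominates it in every subspace. How the servers coordinate this in the actual algorithm is not described above, and the data and names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def project(p: Point, dims: Sequence[int]) -> Point:
    return tuple(p[d] for d in dims)


def locally_dominated_everywhere(
    p: Point, anchor: Point, subspaces: Sequence[Sequence[int]]
) -> bool:
    """True if the anchor dominates p in every subspace."""
    return all(dominates(project(anchor, dims), project(p, dims)) for dims in subspaces)


def pruned(
    points: List[Point], anchors: List[Point], subspaces: Sequence[Sequence[int]]
) -> List[Point]:
    """Points that some single anchor dominates in all subspaces."""
    return [
        p for p in points
        if any(locally_dominated_everywhere(p, a, subspaces) for a in anchors)
    ]


subspaces = [(0,), (1,)]  # two servers, one dimension each
anchor_1, anchor_2 = (2, 6), (6, 2)
points = [(3, 7), (7, 3), (4, 4), (8, 8)]

with_one = pruned(points, [anchor_1], subspaces)
with_two = pruned(points, [anchor_1, anchor_2], subspaces)
```

The second anchor additionally prunes (7, 3), while (4, 4) is incomparable with both anchors and still has to be transmitted.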
2nd Optimization: Integration of Sketches
• So far, we have estimated the (expected) global dominance dom(P) of a point P using domMIN(P).
• This approach is biased towards points that have high local dominance counts in all subspaces, but dominate few records globally (A).
• Thus, we propose an unbiased approach that directly estimates the global dominance counts using sketches that approximately count the number of distinct objects.
• We assume that each server N_i has a local dominance sketch sk_i(P) for every point P, which aggregates all points that P dominates locally in D_i.
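The bias that motivates this optimization can be reproduced with exact counts, without any sketches. In the hypothetical 2-dimensional dataset below (one dimension per server, smaller is better), point A has a larger domMIN than point C, yet A dominates nothing globally while C dominates two points.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def project(p: Point, dims: Sequence[int]) -> Point:
    return tuple(p[d] for d in dims)


def dom_min(p: Point, points: List[Point], subspaces: Sequence[Sequence[int]]) -> int:
    """domMIN(P): smallest local dominance count over all subspaces."""
    return min(
        sum(dominates(project(p, dims), project(q, dims)) for q in points)
        for dims in subspaces
    )


def dom(p: Point, points: List[Point]) -> int:
    """dom(P): exact global dominance count."""
    return sum(dominates(p, q) for q in points)


subspaces = [(0,), (1,)]
A = (5, 5)      # beats many points in each dimension, but never the same ones
C = (7.5, 2.5)  # beats fewer points locally, but two of them in both dimensions
points = [A, C, (6, 1), (7, 2), (8, 3), (9, 4), (1, 6), (2, 7), (3, 8), (4, 9)]

a_stats = (dom_min(A, points, subspaces), dom(A, points))
c_stats = (dom_min(C, points, subspaces), dom(C, points))
```

Ranking by domMIN would prefer A, although C is the better anchor here; an estimate of the global count itself avoids this.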