270 likes | 566 Views
Parallel Skyline Queries. Foto Afrati Paraschos Koutris Dan Suciu Jeffrey Ullman. University of Washington. What is The Skyline?. A d-dimensional set R A point x dominates x ’ if forall k : x (k) ≤ x ’(k) The skyline of R are all non-dominated points of R. skyline.
E N D
Parallel Skyline Queries FotoAfrati ParaschosKoutris Dan Suciu Jeffrey Ullman University of Washington
What is The Skyline? • A d-dimensional set R • A point x dominatesx’ if forall k:x(k) ≤ x’(k) • The skyline of Rare all non-dominated points of R skyline domination
Contributions • We design algorithms for Skyline Queries based on two parallel models: • MP: perfectload balancing [Koutris, Suciu ‘11] • GMP: weakerload balancing [Afrati, Ullman ’10] • We present 3 algorithms with theoretical guarantees for: • #synchronization steps • load balance
Previous Approaches • Several efficient algorithms for skyline queries exist in the literature • Parallel algorithms use various partitionings: • Grid-based partitioning [WZFZAA ’06] • Random partitioning [CRZ ’07] • Angle-based space partitioning [VDK ’08] • Hyperplaneprojections [KYZ ’11] • Previous approaches typically require a logarithmic number of communication steps: our algorithms achieve 1 or 2 steps
Massively Parallel Models • P servers: R partitioned into R1,R2,…, RP • n = |R| • The algorithm alternates between communication and computation steps • MP model: each node holds O(n/P) data • GMP model: each node holds O(Pε* n/P) where 0 ≤ε< 1 • ε =0 : GMP = MP • ε =1 : GMP = sequential computation in one node
An Example • How do we compute set intersection in one step in the MP model? • Hash each value x (from R or S) to a server Intersection Q(x):-R(x),S(x) Communication Phase send tuple R(x) to server @h(x) send tuple S(x) to server @h(x) Computation Phase output a tuple only if it occurs twice
The Broadcast Step • In addition to regular communication steps, we allow broadcast steps: • the data exchanged is independent of n • Known results: • Q(x,y)=R(x),S(x,y) can be computed in 1 MP step iff a broadcast step is allowed [Koutris, Suciu ‘11] • Q(x,y)=R(x),S(x,y),T(y) cannot be computed in 1 MP step [Koutris, Suciu ‘11] , but can be in 1 GMP step with ε=1/2 [Afrati, Ullman ‘10]
Outline of our Approach • Broadcast • Grid-based partitioning into cells • Pre-processing the cells to compute the relaxed skyline • Communication • Careful distribution of the cells (with their data) to the servers • Computation: • Localcomputation of the skyline at each server
Algorithm: Local: each server evenly partitions its data to M buckets Broadcast: servers exchange MxP partition points Local: each server picks every P-thvalue as partition point Bucketizing • Partition Rinto M buckets across some dimension, such that each partition contains O(n/M) points • Equivalently, compute (M+1) partition points: -∞ = b0 , b1 , … , bM = +∞ M=P or P1/(d-1) bucketize across dimension 1 bucketize across dimension 2
Cells • A cell is an intersection of buckets from all dimensions • Every point belongs in exactly one cell • Every cell holds O(n/P) data (and not O(n/Pd) !!) In each cell, we can keep only candidates forskyline points candidate rejected
Outline of our Approach • Broadcast • Grid-based partitioning into cells • Pre-processing the cells to compute the relaxed skyline • Communication • Careful distribution of the cells (with their data) to the servers • Computation: • Localcomputation of the skyline at each server
Cells • We are interested in the non-empty cells • Any cell that is strictly dominated by another does not contribute to the skyline no points belong in the final skyline strict domination domination
Relaxed Skyline of Cells • The relaxed skyline consists of the non-empty cells that are not strictly dominated by non-empty cells • We focus on the relaxed skyline of non-empty cells relaxed skyline skyline
On Relaxed Skylines • To compute the skyline points of a cell B, we need to compare with cells that: • belong in the relaxed skyline • weakly dominate B (have one common coordinate) cell B
Outline of our Approach • Broadcast • Grid-based partitioning into cells • Pre-processing the cells to compute the relaxed skyline • Communication • Careful distribution of the cells (with their data) to the servers • Computation: • Localcomputation of the skyline at each server
A NaïveApproach • Try the following: • Partition into P buckets (M=P) • Allocatecells in the relaxed skyline to servers + cells that weakly dominate them: O(n/P) data per cell • Locally compute the skyline points • This works if the relaxed skyline is small • But the relaxed skyline can have as many as Ω(Pd-1)cells for dimension d
A 1-step Algorithm • Choose a coarserbucketization (<P buckets) • This gives a weak load-balancedalgorithmwith maximum load of O( (n/P) P(d-2)/(d-1) ) • ε = (d-2)/(d-1) (ε=0 implies GMP=MP) Corollary. For d=2 dimensions, we obtain a perfectly load balanced algorithm for MP
A 2-step Algorithm • Step 1: group the cells in the relaxed skyline by bucket for every dimension Server 1 Server 2 … … Server 2 Server 1
A 2-step Algorithm • For each bucket, compute the local skyline • A point is a skyline point iff it is a local skyline point in every one of the d buckets • Step 2: intersect the local skylines This point is in the skyline of the y-bucket, but not the x-bucket x-bucket y-bucket
A 1-Step Algorithm for 3D Key idea: to reject this point, we only need the minimum x-coordinatefrom cell B cell B
A 1-Step Algorithm for 3D • The observation reduces the number of points that need to be communicated • With smart partitioning, we can achieve perfect load-balancein 1 step • However, the property holds only for 2 and 3 dimensions
Conlusion 3 algorithms for Skyline Queries: • 2 step + perfect load balance • 1 step + some replication • 1 step + perfect load balance for d < 4 Open Questions • Can we compute the skyline in 1 step with perfect load balance for >3 dimensions? • A more general question: what classes of queries can we compute in the MP model with perfect or weaker load balance guarantees?
Interior Cells • Two cells are co-linear if they share exactly two coordinates • A cell i is interior if every colinear cell in Sr(J) belongs in the same hyperplane as i. Else, it is a corner cell. • Interior cells are easy to handle: we can send the whole plane to a single processor
Corner Cells • We group the corner cells into lines • Border cells are the minimal/maximal cells of each line • Fact: lines meet only on border cells • Grouping: each line is a group, a cell is assigned to the lexicographically first line it belongs to
Assigning the groups • We have two ways to assign groups to servers • The first is deterministic and greedily assigns a group to any server that is not overloaded (M=P) • The second is randomized and sends each group randomly to some server (M = P log P)
About the MP model • [KS11] A dichotomy result on Conjunctive Queries that can be computed in 1 step with perfect load balancing • Easy Queries: • Q(x,y,z) :- R(x,y) , S(y,z) • Q(x,y,z,) :- R(x), S(x,y), T(x,y,z) • Hard Queries: • Q(x,y) :- R(x), S(x,y), T(y) • Q(x,y) :- R(x), S(x), T(y)