Efficient Skyline Computation in MapReduce • Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu (Aalborg University) • Yongluan Zhou (University of Southern Denmark)
Skyline Query • Application: multi-criteria decision • Tuple dominance: t1 dominates t2 (t1 ⊰ t2) • Iff t1 is not worse than t2 in all dimensions, and • t1 is better than t2 in at least one dimension • Skyline query: • Given a dataset, returns all tuples that are not dominated by others
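The dominance test and the skyline definition above can be sketched in a few lines of Python. This is only an illustration: it assumes smaller values are better in every dimension (the slides do not fix a direction), and the function names and the sample (price, distance) tuples are made up for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """t1 dominates t2: not worse in all dimensions, better in at least one.

    Assumption: smaller is better in every dimension.
    """
    not_worse = all(a <= b for a, b in zip(t1, t2))
    better_somewhere = any(a < b for a, b in zip(t1, t2))
    return not_worse and better_somewhere


def skyline(data: List[Point]) -> List[Point]:
    """Naive quadratic skyline: keep tuples that no other tuple dominates."""
    return [t for t in data if not any(dominates(o, t) for o in data)]


# Hypothetical hotels as (price, distance to the beach).
hotels = [(50, 8), (60, 3), (80, 2), (70, 5), (90, 9)]
result = skyline(hotels)
print(result)  # [(50, 8), (60, 3), (80, 2)]
```

Here (70, 5) is dominated by (60, 3) and (90, 9) is dominated by (50, 8), so neither is part of the skyline.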
Scaling Skyline Computation • Customized solutions: • Require arbitrary inter-node communication • Need software stacks to harness a large cluster • Unproved scalability • Lack of fault tolerance • General MapReduce platforms • Availability of scalable systems, such as Hadoop • A strict communication/synchronization model
Challenges of Skyline Computation using MapReduce • To maximize parallelization • Push more work to mappers, i.e. let mappers filter out more non-skyline points • Ability to utilize multiple reducers • However, global skylines cannot be determined by local information • Without global information, Mappers have very limited capabilities to filter out non-skyline points
Grid Partitioning and Bit String Representation • [Figure: 3×3 grid with partitions numbered 0 to 8; rows from top to bottom: 2 5 8 / 1 4 7 / 0 3 6] • BSR = 011110100 • Partition dominance: pi ⊰ pj iff pi.max ⊰ pj.min
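A rough Python sketch of how such a bit string could be built. Several points are assumptions, not taken from the slides: data is normalized to [0, 1), smaller values are better, a bit is 1 when the partition is non-empty and not dominated by another non-empty partition, and partition dominance (pi.max ⊰ pj.min) is tested on grid indices as "strictly smaller index in every dimension". The helper names are hypothetical.

```python
from itertools import product
from typing import Iterable, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_of(t: Sequence[float], ppd: int) -> Cell:
    """Grid cell of a tuple, assuming every value lies in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in t)


def partition_dominates(ci: Cell, cj: Cell) -> bool:
    """Assumed index form of pi.max dominating pj.min:
    ci lies strictly below cj in every dimension."""
    return all(a < b for a, b in zip(ci, cj))


def bit_string(data: Iterable[Sequence[float]], ppd: int, dims: int) -> str:
    """One bit per partition; 1 = non-empty and not dominated (assumption)."""
    non_empty = {cell_of(t, ppd) for t in data}
    bits = []
    # product() enumerates cells so that cell (x, y) gets number x * ppd + y,
    # which matches the numbering of the 3x3 grid on the slide.
    for cell in product(range(ppd), repeat=dims):
        alive = cell in non_empty and not any(
            partition_dominates(other, cell) for other in non_empty
        )
        bits.append("1" if alive else "0")
    return "".join(bits)


# One made-up tuple at the center of every cell except cell 0.
data = [((x + 0.5) / 3, (y + 0.5) / 3)
        for x in range(3) for y in range(3) if (x, y) != (0, 0)]
bsr = bit_string(data, ppd=3, dims=2)
print(bsr)  # 011110100
```

With this input, partition 0 is empty, and partitions 5, 7 and 8 are dominated by partitions 1, 3 and 4 respectively, which reproduces the bit string shown on the slide.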
Determining Partitions Per Dimension (PPD) • PPD is too high → very few tuples in each partition and too many partitions • PPD is too low → too many tuples in each partition and less effective pruning • Idea: generate bit strings for PPD from 2 to […], • then choose the one with the most desirable number of tuples per partition
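One way to read the selection step, as a hedged Python sketch: try each candidate PPD, estimate the number of tuples per non-empty partition, and keep the PPD closest to a target. The upper bound `max_ppd`, the target `target_tpp`, and the use of a plain average over non-empty partitions are all assumptions for illustration; the slide does not state them.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cell_of(t: Sequence[float], ppd: int) -> Tuple[int, ...]:
    """Grid cell of a tuple, assuming every value lies in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in t)


def choose_ppd(data: List[Sequence[float]], max_ppd: int,
               target_tpp: float) -> int:
    """Return the PPD in 2..max_ppd whose average tuples per non-empty
    partition is closest to target_tpp (both parameters are hypothetical)."""
    best_ppd, best_gap = 2, float("inf")
    for ppd in range(2, max_ppd + 1):
        counts = Counter(cell_of(t, ppd) for t in data)
        tuples_per_partition = len(data) / len(counts)
        gap = abs(tuples_per_partition - target_tpp)
        if gap < best_gap:
            best_ppd, best_gap = ppd, gap
    return best_ppd


# 64 made-up tuples on a regular 8x8 lattice inside the unit square.
data = [((i + 0.5) / 8, (j + 0.5) / 8) for i in range(8) for j in range(8)]
chosen = choose_ppd(data, max_ppd=8, target_tpp=4)
print(chosen)  # 4
```

On this lattice, PPD = 4 yields 16 partitions with 4 tuples each, so it matches the target exactly.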
Multi-Reducer • The single reducer still performs significant work for detecting the global skyline • This limits the degree of parallelization • Idea: independent partition groups • Anti-Dominating Region (ADR): […] • Independent Partition Group (IPG): a set of partitions Pi is an IPG iff […] holds • One reducer is responsible for each IPG.
Generation of IPGs • Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR • Procedure: 1. Find a maximum partition pm 2. Generate IPG = {pm} ∪ pm.ADR 3. Remove pm and repeat from step 1
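The procedure can be sketched in Python as follows. Two assumptions are made here that the slides do not spell out: the ADR of a partition p is taken to be the set of other surviving partitions whose grid index is not larger than p's in any dimension (those that may hold tuples dominating tuples in p), and after a group is generated, all of its members are skipped when looking for the next maximum partition, so that each new group starts from a partition not yet covered. The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, ...]


def adr(p: Cell, live: Set[Cell]) -> Set[Cell]:
    """Assumed anti-dominating region of p among the surviving partitions."""
    return {q for q in live
            if q != p and all(a <= b for a, b in zip(q, p))}


def independent_groups(live_cells: Iterable[Cell]) -> List[Set[Cell]]:
    """Repeat: find a maximum partition pm, emit {pm} | pm.ADR."""
    live = set(live_cells)
    uncovered = set(live)
    groups = []
    while uncovered:
        # A maximum partition lies in no other partition's ADR.
        pm = next(p for p in sorted(uncovered, reverse=True)
                  if not any(p in adr(q, live) for q in uncovered))
        group = {pm} | adr(pm, live)
        groups.append(group)
        uncovered -= group
    return groups


# Surviving partitions 1, 2, 3, 4, 6 of a 3x3 grid (partition number = 3*x + y).
live = [(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
groups = independent_groups(live)
for g in groups:
    print(sorted(g))
```

For this input the sketch yields three groups, headed by partitions 6, 4 and 2. Partitions 1 and 3 each appear in two groups, so their tuples would be sent to more than one reducer.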
Implementation Issues • More independent groups than #reducers • Need to allocate them to the reducers, two options: • Load balancing • Minimizing duplicate data transmission • Elimination of duplicated skyline outputs • A grid partition appears in multiple IPGs • Designate one IPG as the responsible group • Load balancing
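As an illustration of the load-balancing option and of designating a responsible group, here is a small Python sketch. It is not the authors' method: the greedy "largest group to the least-loaded reducer" heuristic, the use of the partition count as the load estimate, and the "lowest group id owns the partition" rule are all assumptions chosen to keep the example short.

```python
import heapq
from typing import Dict, List, Set


def allocate(groups: List[Set[int]], num_reducers: int) -> Dict[int, List[int]]:
    """Greedy load balancing: biggest group first, to the least-loaded reducer.

    Load is approximated by the number of partitions in a group (assumption;
    tuple counts would be a more realistic measure).
    """
    heap = [(0, r) for r in range(num_reducers)]  # (load, reducer id)
    assignment: Dict[int, List[int]] = {r: [] for r in range(num_reducers)}
    order = sorted(range(len(groups)), key=lambda g: -len(groups[g]))
    for gid in order:
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(gid)
        heapq.heappush(heap, (load + len(groups[gid]), reducer))
    return assignment


def responsible_group(groups: List[Set[int]]) -> Dict[int, int]:
    """For each partition, pick one group that outputs its skyline tuples.

    Hypothetical rule: the group with the lowest id that contains it.
    """
    owner: Dict[int, int] = {}
    for gid, group in enumerate(groups):
        for partition in group:
            owner.setdefault(partition, gid)
    return owner


# Three made-up groups over partition numbers; partitions 1 and 3 are shared.
groups = [{6, 3}, {4, 1, 3}, {2, 1}]
assignment = allocate(groups, num_reducers=2)
owner = responsible_group(groups)
print(assignment)  # {0: [1], 1: [0, 2]}
print(owner)
```

Only the reducer handling the responsible group of a partition would emit that partition's skyline tuples; the other groups containing it use its tuples for dominance checks only.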
Experimental Setup • 13 commodity machines • Datasets with independent and anti-correlated distribution • Comparisons: • MR-BNL • MR-Angle
[Figure: varying #dimensions; independent data, cardinality 1×10⁵]
[Figure: varying #dimensions; anti-correlated data, cardinality 1×10⁵]
[Figure: varying cardinality; independent data; panels: 3 dimensions, 8 dimensions]
[Figure: varying cardinality; anti-correlated data; panels: 3 dimensions, 8 dimensions]
Summary • Grid partitioning and bit strings • Choose an appropriate number of partitions per dimension • Exploit independent groups to enable multiple reducers • Good for cases with a large number of skyline tuples • Merging independent groups • Eliminate duplicate outputs