Efficient Skyline Computation in MapReduce

Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu (Aalborg University)
Yongluan Zhou (University of Southern Denmark)

Skyline Query
• Application: multi-criteria decision making
• Tuple dominance: t1 dominates t2 (t1 ⊰ t2) iff
  • t1 is not worse than t2 in all dimensions, and
  • t1 is better than t2 in at least one dimension
• Skyline query: given a dataset, return all tuples that are not dominated by any other tuple
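The dominance test and the skyline definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes smaller values are better in every dimension, and the `hotels` data and the function names are made up for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """True iff t1 dominates t2 (assumption: smaller is better)."""
    not_worse = all(a <= b for a, b in zip(t1, t2))        # not worse anywhere
    strictly_better = any(a < b for a, b in zip(t1, t2))   # better somewhere
    return not_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep each tuple that no other tuple dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance) pairs.
hotels = [(50, 8), (60, 3), (80, 2), (70, 5), (90, 9)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 2)]
```

Here (70, 5) is dominated by (60, 3) and (90, 9) by (50, 8), so only three tuples remain.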

Scaling Skyline Computation
• Customized solutions:
  • Require arbitrary inter-node communication
  • Need software stacks to harness a large cluster
  • Unproven scalability
  • Lack of fault tolerance
• General MapReduce platforms:
  • Availability of scalable systems, such as Hadoop
  • A strict communication/synchronization model

MapReduce

Challenges of Skyline Computation Using MapReduce
• To maximize parallelization:
  • Push more work to the mappers, i.e. let mappers filter out more non-skyline points
  • Be able to utilize multiple reducers
• However, the global skyline cannot be determined from local information alone
  • Without global information, mappers have very limited ability to filter out non-skyline points

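Grid Partitioning and Bit String Representation
• Figure: 3×3 grid with partitions numbered 0 to 8 (columns 0 1 2, 3 4 5 and 6 7 8, each numbered bottom to top)
• BSR = 011110100
• Partition dominance: pi ⊰ pj iff pi.max ⊰ pj.min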
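A rough sketch of how such a bit string could be built is given below. Several things are assumptions rather than details from the slides: data is normalized to [0, 1), smaller is better, a bit is 1 only for a non-empty partition that no other non-empty partition dominates, and all function names are hypothetical. The sample points are made up so that the result matches the bit string shown on the slide.

```python
from typing import Iterable, Sequence, Tuple

Cell = Tuple[int, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Tuple dominance, assuming smaller is better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def cell_of(point: Sequence[float], ppd: int) -> Cell:
    """Grid cell of a point whose coordinates lie in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in point)


def cell_index(cell: Cell, ppd: int) -> int:
    """Number cells so the last dimension varies fastest (0..8 in a 3x3 grid)."""
    idx = 0
    for c in cell:
        idx = idx * ppd + c
    return idx


def partition_dominates(ci: Cell, cj: Cell) -> bool:
    """pi dominates pj iff pi.max dominates pj.min (corners in cell units)."""
    pi_max = tuple(c + 1 for c in ci)  # upper corner of pi
    return dominates(pi_max, cj)       # cj itself is the lower corner of pj


def bit_string(points: Iterable[Sequence[float]], ppd: int, dims: int) -> str:
    """One bit per partition: 1 if non-empty and not dominated, else 0."""
    occupied = {cell_of(p, ppd) for p in points}
    bits = ["0"] * (ppd ** dims)
    for cell in occupied:
        if not any(partition_dominates(other, cell) for other in occupied):
            bits[cell_index(cell, ppd)] = "1"
    return "".join(bits)


# Made-up points in partitions 1, 2, 3, 4, 6 and 8.
sample = [(0.1, 0.5), (0.2, 0.9), (0.5, 0.1), (0.4, 0.4), (0.8, 0.2), (0.9, 0.9)]
print(bit_string(sample, ppd=3, dims=2))  # 011110100
```

Partition 8 is non-empty in this sample but is dominated by partition 1, so its bit is cleared and its tuples can be dropped without any tuple-level comparison.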

Bit String Generation

Determining Partitions Per Dimension (PPD)
• PPD too high → very few tuples in each partition and too many partitions
• PPD too low → too many tuples in each partition and less effective pruning
• Idea: generate bit strings for a range of PPD values starting at 2, then choose the one with the most desirable number of tuples per partition
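The selection idea can be illustrated as follows. This is only a sketch under stated assumptions: the slides do not say what counts as "most desirable", so the code uses a hypothetical `target` number of tuples per non-empty partition, and the upper bound `max_ppd` is likewise a placeholder parameter. It also counts non-empty partitions directly instead of going through bit strings.

```python
from collections import Counter
from typing import List, Sequence


def cell_of(point: Sequence[float], ppd: int) -> tuple:
    """Grid cell of a point whose coordinates lie in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in point)


def tuples_per_partition(points: List[Sequence[float]], ppd: int) -> float:
    """Average number of tuples per non-empty partition for a given PPD."""
    counts = Counter(cell_of(p, ppd) for p in points)
    return len(points) / len(counts)


def choose_ppd(points: List[Sequence[float]], max_ppd: int, target: float) -> int:
    """Try PPD = 2..max_ppd and keep the value closest to the target density.

    Both max_ppd and target are assumptions made for this sketch.
    """
    candidates = range(2, max_ppd + 1)
    return min(candidates,
               key=lambda ppd: abs(tuples_per_partition(points, ppd) - target))


# 64 made-up points on a regular 8x8 lattice inside [0, 1)^2.
lattice = [(i / 8 + 1 / 16, j / 8 + 1 / 16) for i in range(8) for j in range(8)]
print(choose_ppd(lattice, max_ppd=6, target=4))  # 4
```

With this lattice, PPD = 4 puts exactly 4 tuples into each of the 16 partitions, so it is the closest match to the target of 4.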

Single Reducer

Multi-Reducer
• The single reducer still performs significant work to detect the global skyline
  • This limits the degree of parallelization
• Idea: independent partition groups
  • Anti-Dominating Region (ADR):
  • Independent Partition Group (IPG): a set of partitions Pi is an IPG iff holds
  • One reducer is responsible for each IPG

Multi-Reducer

Generation of IPGs
• Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR
• Procedure:
  1. Find a maximum partition pm
  2. Generate IPG = {pm} ∪ pm.ADR
  3. Remove pm and repeat from step 1
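The procedure can be sketched as below, with two assumptions that the transcript does not confirm. First, the ADR of a partition p is taken to be the other non-empty partitions whose grid coordinates are less than or equal to p's in every dimension (smaller is better). Second, maximality is judged against the full set of partitions, so "remove pm" is read as removing it from the list of candidates, and each maximum partition seeds exactly one group. The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, ...]


def in_adr(q: Cell, p: Cell) -> bool:
    """Assumed ADR test: q is in p.ADR if q != p and q <= p in every dimension."""
    return q != p and all(a <= b for a, b in zip(q, p))


def generate_ipgs(partitions: Iterable[Cell]) -> List[Set[Cell]]:
    """One group per maximum partition: IPG = {pm} united with pm.ADR."""
    parts = set(partitions)
    # Maximum partitions: those that lie in no other partition's ADR.
    candidates = sorted(p for p in parts if not any(in_adr(p, q) for q in parts))
    groups = []
    while candidates:
        pm = candidates.pop(0)                                      # steps 1 and 3
        groups.append({pm} | {q for q in parts if in_adr(q, pm)})   # step 2
    return groups


# Cells of partitions 1, 2, 3, 4 and 6 in a 3x3 grid (the 1-bits of 011110100).
surviving = [(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
for group in generate_ipgs(surviving):
    print(sorted(group))
# [(0, 1), (0, 2)]
# [(0, 1), (1, 0), (1, 1)]
# [(1, 0), (2, 0)]
```

In this example the cells (0, 1) and (1, 0) each end up in two groups, which is the situation that makes duplicate elimination necessary.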

Implementation Issues
• More independent groups than reducers
  • Need to allocate them to the reducers; two options:
    • Load balancing
    • Minimizing duplicate data transmission
• Elimination of duplicated skyline outputs
  • A grid partition can appear in multiple IPGs
  • Designate one IPG as the responsible group
  • Load balancing
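One possible way to handle both issues is sketched below. Neither heuristic is taken from the slides: the allocation uses a generic greedy "largest group first, least loaded reducer" rule, and the responsible group of a shared partition is simply the containing group that currently owns the fewest partitions. Group sizes and all names are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, ...]


def allocate_groups(group_sizes: List[int], num_reducers: int) -> Dict[int, int]:
    """Greedy load balancing: largest group first, to the least loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    assignment = {}
    for g in sorted(range(len(group_sizes)), key=lambda g: -group_sizes[g]):
        load, r = heapq.heappop(heap)
        assignment[g] = r
        heapq.heappush(heap, (load + group_sizes[g], r))
    return assignment


def responsible_groups(groups: List[Set[Cell]]) -> Dict[Cell, int]:
    """For each partition, pick the one group that outputs its skyline tuples."""
    owned = [0] * len(groups)  # partitions each group is responsible for so far
    responsible = {}
    for part in sorted(set().union(*groups)):
        containing = [g for g, members in enumerate(groups) if part in members]
        g = min(containing, key=lambda g: (owned[g], g))
        responsible[part] = g
        owned[g] += 1
    return responsible


# Three made-up groups sharing the cells (0, 1) and (1, 0).
groups = [{(0, 2), (0, 1)}, {(1, 1), (0, 1), (1, 0)}, {(2, 0), (1, 0)}]
print(allocate_groups([40, 70, 30], num_reducers=2))  # {1: 0, 0: 1, 2: 1}
print(responsible_groups(groups))
# {(0, 1): 0, (0, 2): 0, (1, 0): 1, (1, 1): 1, (2, 0): 2}
```

Every reducer still receives all partitions of its groups, but only the responsible group emits skyline tuples for a shared partition, so each tuple is output once.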

Experimental Setup
• 13 commodity machines
• Datasets with independent and anti-correlated distributions
• Comparisons:
  • MR-BNL
  • MR-Angle

Number of Dimensions (independent data, cardinality: 1×10^5)

Number of Dimensions (anti-correlated data, cardinality: 1×10^5)

Cardinality (independent data)
• Panels: 3 dimensions and 8 dimensions

Cardinality (anti-correlated data)
• Panels: 3 dimensions and 8 dimensions

Number of Reducers

Summary
• Grid partitioning and bit strings
  • Choose an appropriate number of partitions
• Exploit independent groups to enable multiple reducers
  • Good for cases with a large number of skyline tuples
  • Merging independent groups
  • Eliminate duplicate outputs
