Efficient Skyline Computation in MapReduce

Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu (Aalborg University); Yongluan Zhou (University of Southern Denmark)


Presentation Transcript


  1. Efficient Skyline Computation in MapReduce • Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu (Aalborg University) • Yongluan Zhou (University of Southern Denmark)

  2. Skyline Query • Application: multi-criteria decision making • Tuple dominance: t1 dominates t2 (t1 ⊰ t2) iff • t1 is not worse than t2 in any dimension, and • t1 is better than t2 in at least one dimension • Skyline query: given a dataset, return all tuples that are not dominated by any other tuple
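The dominance test and the skyline definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that smaller values are better in every dimension, and the hotel data and the names `dominates` and `skyline` are made up for the example.

```python
from typing import List, Sequence


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """True if t1 dominates t2 (assumption: smaller is better in every dimension)."""
    no_worse = all(a <= b for a, b in zip(t1, t2))
    strictly_better = any(a < b for a, b in zip(t1, t2))
    return no_worse and strictly_better


def skyline(data: List[Sequence[float]]) -> List[Sequence[float]]:
    """Naive quadratic skyline: keep every tuple that no other tuple dominates."""
    return [t for t in data if not any(dominates(o, t) for o in data)]


# Hypothetical hotels as (price, distance to the beach).
hotels = [(50, 8), (60, 3), (80, 2), (70, 5), (90, 9)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 2)]
```

Here (70, 5) is dominated by (60, 3) and (90, 9) by (50, 8), so neither is part of the skyline.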

  3. Scaling Skyline Computation • Customized solutions: • Require arbitrary inter-node communication • Need software stacks to harness a large cluster • Unproven scalability • Lack of fault tolerance • General MapReduce platforms • Availability of scalable systems, such as Hadoop • A strict communication/synchronization model

  4. MapReduce

  5. Challenges of Skyline Computation using MapReduce • To maximize parallelization • Push more work to mappers, i.e., let mappers filter out more non-skyline points • Ability to utilize multiple reducers • However, the global skyline cannot be determined from local information alone • Without global information, mappers have very limited ability to filter out non-skyline points
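The limitation described on this slide can be seen in a small plain-Python simulation (no Hadoop involved). It is only a sketch under assumptions: smaller values are better, each "mapper" sees one chunk of the data, and the names `map_phase` and `reduce_phase` are illustrative rather than part of any MapReduce API.

```python
def dominates(t1, t2):
    """Assumption: smaller is better in every dimension."""
    return all(a <= b for a, b in zip(t1, t2)) and any(a < b for a, b in zip(t1, t2))


def local_skyline(tuples):
    """Skyline of the tuples visible to one worker."""
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]


def map_phase(chunks):
    """Each simulated mapper filters its own chunk using local information only."""
    return [local_skyline(chunk) for chunk in chunks]


def reduce_phase(local_skylines):
    """A single simulated reducer merges all local skylines into the global one."""
    merged = [t for sky in local_skylines for t in sky]
    return local_skyline(merged)


chunks = [[(1, 9), (5, 5), (6, 6)], [(2, 2), (9, 1), (8, 8)]]
local = map_phase(chunks)
print(local)                # [[(1, 9), (5, 5)], [(2, 2), (9, 1)]]
print(reduce_phase(local))  # [(1, 9), (2, 2), (9, 1)]
```

The tuple (5, 5) survives its mapper because nothing in its own chunk dominates it, yet (2, 2) from the other chunk does. Only the reducer, which sees both local skylines, can discard it.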

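  6. Grid Partitioning and Bit String Representation • [Figure: 3×3 grid with partitions numbered column by column from the bottom left; rows from top to bottom: 2 5 8 / 1 4 7 / 0 3 6] • BSR = 011110100 • Partition dominance: pi ⊰ pj iff pi.max ⊰ pj.min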
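A rough sketch of how such a bit string could be built follows. Several points are assumptions rather than facts from the slides: values lie in [0, 1) with smaller being better, a bit is set when the partition is non-empty and not dominated by another non-empty partition, and the sample points are invented so that the result coincides with the bit string on the slide. The partition numbering follows the grid in the figure (column by column from the bottom left).

```python
def cell_of(t, ppd):
    """Grid cell of a tuple, assuming every value lies in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in t)


def index_of(cell, ppd):
    """Partition number; in 2D this is column * ppd + row, as in the slide's grid."""
    idx = 0
    for c in cell:
        idx = idx * ppd + c
    return idx


def partition_dominates(ci, cj):
    """pi dominates pj iff pi.max dominates pj.min (smaller is better).

    In cell coordinates, pi.max is ci + 1 and pj.min is cj in every dimension.
    """
    return (all(a + 1 <= b for a, b in zip(ci, cj))
            and any(a + 1 < b for a, b in zip(ci, cj)))


def bit_string(data, ppd):
    """Bit i is 1 iff partition i is non-empty and not dominated (assumed reading)."""
    occupied = {cell_of(t, ppd) for t in data}
    bits = ["0"] * (ppd ** len(data[0]))
    for cell in occupied:
        if not any(partition_dominates(o, cell) for o in occupied):
            bits[index_of(cell, ppd)] = "1"
    return "".join(bits)


# Invented points, one each in partitions 1, 2, 3, 4, 6 and 8.
points = [(0.1, 0.5), (0.2, 0.9), (0.4, 0.1), (0.5, 0.5), (0.8, 0.2), (0.9, 0.9)]
print(bit_string(points, ppd=3))  # 011110100
```

Partition 8 is non-empty here, but its bit is cleared because partition 1 dominates it: every tuple in partition 1 is better than every tuple in partition 8.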

  7. Bit String Generation

  8. Determining Partitions Per Dimension (PPD) • PPD is too high → very few tuples in each partition and too many partitions • PPD is too low → too many tuples in each partition and less effective pruning • Idea: generate bit strings for PPD from 2 to • then choose the one with the most desirable number of tuples per partition
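The selection idea on this slide might look roughly like the sketch below. The upper bound of the PPD range is not recoverable from the transcript, so `max_ppd` is a hypothetical caller-supplied parameter, as is `target` (the desired number of tuples per partition). For brevity the sketch counts non-empty partitions only and ignores pruning of dominated partitions; values are assumed to lie in [0, 1).

```python
def cell_of(t, ppd):
    """Grid cell of a tuple, assuming every value lies in [0, 1)."""
    return tuple(min(int(v * ppd), ppd - 1) for v in t)


def tuples_per_partition(data, ppd):
    """Average number of tuples per non-empty partition for a given PPD."""
    occupied = {cell_of(t, ppd) for t in data}
    return len(data) / len(occupied)


def choose_ppd(data, max_ppd, target):
    """Try PPD = 2 .. max_ppd (hypothetical bound) and keep the one whose
    average tuples per partition is closest to the target."""
    return min(range(2, max_ppd + 1),
               key=lambda ppd: abs(tuples_per_partition(data, ppd) - target))


# 400 evenly spread 2D points, purely illustrative.
data = [(i / 20, j / 20) for i in range(20) for j in range(20)]
print(choose_ppd(data, max_ppd=6, target=25))  # 4
```

With PPD = 4 the 400 points fall into 16 partitions of 25 tuples each, which hits the target exactly; lower values give fuller partitions and higher values give sparser ones.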

  9. Single Reducer

  10. Multi-Reducer • The single reducer still performs significant work to detect the global skyline, which limits the degree of parallelization • Idea: independent partition groups • Anti-Dominating Region (ADR): • Independent Partition Group (IPG): a set of partitions Pi is an IPG iff holds • One reducer is responsible for each IPG.

  11. Multi-Reducer

  12. Generation of IPGs • Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR • Procedure: 1. Find a maximum partition pm 2. Generate IPG = {pm} ∪ pm.ADR 3. Remove pm and repeat from step 1
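The procedure can be sketched as follows. The formal ADR definition did not survive in the transcript, so this sketch assumes that p.ADR is the set of other surviving partitions that could hold tuples dominating tuples in p, i.e. partitions whose cell coordinates are no larger than p's in every dimension (smaller is better). The slide only says to remove pm and repeat; skipping partitions already covered by an earlier group is an additional assumption made here to avoid redundant groups.

```python
def adr(p, cells):
    """Assumed anti-dominating region of p: other cells that are <= p in every dimension."""
    return {q for q in cells if q != p and all(a <= b for a, b in zip(q, p))}


def independent_groups(cells):
    """Repeatedly pick a maximum partition pm and emit {pm} | pm.ADR."""
    cells = set(cells)
    remaining = set(cells)
    groups = []
    while remaining:
        # A maximum partition is one that lies in no other partition's ADR.
        pm = next(p for p in sorted(remaining)
                  if not any(p in adr(q, cells) for q in remaining))
        group = {pm} | adr(pm, cells)
        groups.append(group)
        remaining -= group  # assumption: covered partitions need no group of their own
    return groups


# Surviving cells (column, row) of a 3x3 grid, i.e. partitions 1, 2, 3, 4 and 6.
cells = [(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
for g in independent_groups(cells):
    print(sorted(g))
# [(0, 1), (0, 2)]
# [(0, 1), (1, 0), (1, 1)]
# [(1, 0), (2, 0)]
```

Each group contains the ADR of every one of its members, so a reducer holding one group can decide the skyline status of that group's tuples without data from other groups. Cells (0, 1) and (1, 0) appear in two groups each, which is why duplicate outputs have to be handled.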

  13. Implementation Issues • More independent groups than #reducers • Need to allocate them to the reducers; two options: • Load balancing • Minimizing duplicate data transmission • Elimination of duplicated skyline outputs • A grid partition appears in multiple IPGs • Designate one IPG as the responsible group • Load balancing
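One possible reading of the load-balancing option and of the responsible-group rule is sketched below. The slides do not give the actual heuristics, so both functions are assumptions: groups are placed greedily, largest first, on the currently least-loaded reducer, and each shared partition is owned by the candidate group that owns the fewest partitions so far. Group names, tuple counts and partition numbers are invented.

```python
import heapq


def assign_groups(group_sizes, n_reducers):
    """Greedy load balancing: largest group first, onto the least-loaded reducer."""
    heap = [(0, r) for r in range(n_reducers)]  # (current load, reducer id)
    assignment = {}
    for g, size in sorted(group_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, r = heapq.heappop(heap)
        assignment[g] = r
        heapq.heappush(heap, (load + size, r))
    return assignment


def responsible_groups(groups):
    """For each partition pick exactly one group that outputs its skyline tuples."""
    owned = {g: 0 for g in groups}
    owner = {}
    for part in sorted({p for members in groups.values() for p in members}):
        candidates = [g for g, members in groups.items() if part in members]
        best = min(candidates, key=lambda g: (owned[g], g))
        owner[part] = best
        owned[best] += 1
    return owner


groups = {"A": {1, 2}, "B": {1, 3, 4}, "C": {3, 6}}  # partitions per group
sizes = {"A": 40, "B": 70, "C": 30}                  # hypothetical tuple counts
print(assign_groups(sizes, n_reducers=2))  # {'B': 0, 'A': 1, 'C': 1}
print(responsible_groups(groups))          # {1: 'A', 2: 'A', 3: 'B', 4: 'B', 6: 'C'}
```

Both reducers end up with 70 tuples. Partitions 1 and 3 each belong to two groups, but only the designated owner emits their skyline tuples, so no tuple is output twice.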

  14. Experimental Setup • 13 commodity machines • Datasets with independent and anti-correlated distribution • Comparisons: • MR-BNL • MR-Angle

  15. #Dimensions (independent data, cardinality: 1×10^5)

  16. #Dimensions (anti-correlated data, cardinality: 1×10^5)

  17. Cardinality (independent data) Dimensions: 3 Dimensions: 8

  18. Cardinality (Anti-corr. data) Dimensions: 3 Dimensions: 8

  19. Number of Reducers

  20. Summary • Grid partitioning and bit strings • Choose an appropriate number of partitions per dimension • Exploit independent groups to enable multiple reducers • Good for cases with a large number of skyline tuples • Merging independent groups • Eliminate duplicate outputs
