An Overview of Map-Reduce Research
Main Themes
• Designing Efficient Algorithms on Map-Reduce
• Extensions on Map-Reduce
• Modeling Map-Reduce Computation
Limitations
• Selective Access to Data
• High Communication Cost
• Redundant and Wasteful Processing
• Lack of Early Termination
• Lack of Iteration
• Quick Retrieval of Approximate Results
• Load Balancing
• Lack of Real-time and Interactive Processing
• Lack of Support for n-way Operations
Extensions on Map-Reduce
• Interactive Processing: Streaming, Pipelining, In-Memory Processing, Pre-computation. Systems: Dremel, Tenzing, BlinkDB, M3R, Shark
• Data Access: Indexing, Partitioning, Co-location, Data Layout. Systems: Co-Hadoop(*), Hadoop++, HAIL, LIAH, Llama, Cheetah
• Avoidance of Redundant Processing: Batch Processing of Queries, Result Materialization, Incremental Processing, Result Sharing. Systems: ReStore, InCoop, MRShare
• Processing n-way Operations: Spatial / Temporal Joins, Additional MR Phase, Redistribution of Keys, Record Duplication. Systems: Controlled-Replicate(*), RCCIS(*)
• Iterative Processing: Looping, Caching, Pipelining, Recursion, Incremental Processing. Systems: HaLoop, ReDoop, InCoop
• Query Optimization: Parameter Tuning, Plan Refinement, Operator Reordering, Code Analysis, Data Flow Optimization. Systems: HadoopDB, Clydesdale, Starfish, AQUA, Adaptive-MR(*)
• Processing Industry-Specific Data: Spatio-Temporal Data, Geo-Spatial Data, Agriculture / Oil & Gas / Energy. Systems: BLAST(*), Spatial-Hadoop, Hadoop-GIS
• Fair Work Allocation: Batching, Sampling, Re-partitioning. Systems: Skew-Tune, Skew-Reduce, Themis
• Early Termination: Sorting, Sampling. Systems: EARL, RanKloud
(*) Contributed by IBM
Designing Efficient Algorithms on Map-Reduce
• Joins
• Multi-way Joins
• Similarity Joins
• Theta Joins
• Spatial Joins
• Interval Joins
• Entity Resolution
• Graph Algorithms
• Machine Learning
• Computational Geometry
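As a point of reference for the join entries above, here is a minimal sketch of a reduce-side (repartition) equi-join, simulated in plain Python with no Hadoop involved. The slides do not prescribe this algorithm; the function names (`map_phase`, `shuffle`, `reduce_join`, `repartition_join`) and the "L"/"R" tags are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(records, tag, key_field):
    """Map: emit (join key, (source tag, record)) for every input record."""
    for rec in records:
        yield rec[key_field], (tag, rec)


def shuffle(pairs):
    """Shuffle: group all mapped values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(values):
    """Reduce: pair every left record with every right record for one key."""
    left = [rec for tag, rec in values if tag == "L"]
    right = [rec for tag, rec in values if tag == "R"]
    for l_rec in left:
        for r_rec in right:
            yield {**l_rec, **r_rec}


def repartition_join(left, right, key_field):
    """Run the simulated map, shuffle, and reduce steps for an equi-join."""
    pairs = list(map_phase(left, "L", key_field))
    pairs += list(map_phase(right, "R", key_field))
    joined = []
    for values in shuffle(pairs).values():
        joined.extend(reduce_join(values))
    return joined
```

Every record of both inputs crosses the shuffle once, which is why join design on map-reduce tends to focus on communication cost and on skewed keys that overload a single reducer.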
Modeling Computation on Map-Reduce
• Two main cost components
  • Time spent in communication from map tasks to reduce tasks
  • Time spent in computation within the reduce tasks
• These two components trade off against each other
• Given: an analytics problem, the input data, and the number of reduce tasks
• Question: what is the minimum communication cost that a map-reduce algorithm for that problem and input must incur?
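The trade-off between communication and per-reducer work can be made concrete with a small sketch. It assumes an all-pairs comparison problem (as in a similarity join) and a simple group-pair scheme: items are split into groups, and one reduce task handles each unordered pair of groups. The scheme and the function name `all_pairs_costs` are illustrative assumptions, not taken from the slides, and the scheme is not claimed to be optimal.

```python
from itertools import combinations


def all_pairs_costs(n_items, n_groups):
    """Count costs of the group-pair scheme for an all-pairs problem.

    Returns (communication, max_reducer_input, n_reducers), where
    communication is the number of (item, reducer) transfers.
    """
    if n_groups < 2:
        raise ValueError("the group-pair scheme needs at least 2 groups")
    # One reduce task per unordered pair of groups.
    reducer_input = {pair: 0 for pair in combinations(range(n_groups), 2)}
    communication = 0
    for item in range(n_items):
        group = item % n_groups  # round-robin assignment to groups
        for pair in reducer_input:
            if group in pair:  # the item is sent to every reducer covering its group
                reducer_input[pair] += 1
                communication += 1
    return communication, max(reducer_input.values()), len(reducer_input)


if __name__ == "__main__":
    for g in (2, 4, 6):
        comm, load, reducers = all_pairs_costs(1200, g)
        print(f"groups={g} reducers={reducers} communication={comm} max_input={load}")
```

In this scheme each item is replicated to `n_groups - 1` reducers, so total communication grows with the number of groups while the input handled by any single reducer shrinks. Finding the minimum communication for a fixed number of reduce tasks is the question the slide poses.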
Survey References
• A Survey of Large-Scale Analytical Query Processing in MapReduce
  Christos Doulkeridis and Kjetil Nørvåg
  The VLDB Journal, 23(3), 2014
• Distributed Data Management Using MapReduce
  Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu
  ACM Computing Surveys, 46(3), 2014