Towards Automatic Optimization of MapReduce Programs (Position Paper)

Shivnath Babu, Duke University


Presentation Transcript


  1. Towards Automatic Optimization of MapReduce Programs (Position Paper). Shivnath Babu, Duke University

  2. Roadmap • Call to action to improve automatic optimization techniques in MapReduce frameworks • Challenges & promising directions [Diagram labels: …, Pig, Hive, JAQL; Hadoop; HDFS]

  3. Lifecycle of a MapReduce Job • Map function • Reduce function • Run this program as a MapReduce job
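
As a rough illustration of what a map function and a reduce function look like, here is a minimal single-process Python sketch of word count. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the in-memory grouping step only stands in for the framework's shuffle and sort.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Toy driver: map every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):  # sorted to mimic the sort before the reduce phase
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    lines = [(0, "map reduce map"), (1, "reduce job")]
    print(run_job(lines, map_fn, reduce_fn))
```

A real framework would run many copies of the mapper and reducer in parallel over input splits; the toy driver keeps only the data flow.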

  5. Lifecycle of a MapReduce Job [Timeline diagram, time axis: Input Splits; Map Wave 1; Map Wave 2; Reduce Wave 1; Reduce Wave 2] How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
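
The "waves" in the timeline follow from simple arithmetic: when a job has more tasks than concurrent slots, the tasks run in successive batches. The sketch below assumes one map task per input split and tasks of roughly equal duration; the function names and all numbers are hypothetical and chosen only so that the result matches the two map waves and two reduce waves drawn on the slide.

```python
import math


def num_splits(input_bytes, split_bytes):
    """Number of input splits, assuming fixed-size splits."""
    return math.ceil(input_bytes / split_bytes)


def num_waves(num_tasks, num_slots):
    """Tasks run in waves when there are more tasks than concurrent slots."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return math.ceil(num_tasks / num_slots)


if __name__ == "__main__":
    # Illustrative numbers only: 10 GiB of input, 128 MiB splits,
    # 40 concurrent map slots, 20 reduce tasks on 10 reduce slots.
    splits = num_splits(10 * 1024**3, 128 * 1024**2)
    print("map tasks:", splits)
    print("map waves:", num_waves(splits, 40))
    print("reduce waves:", num_waves(20, 10))
```

Each of the inputs to these two functions is itself a configuration choice, which is the point of the question on the slide.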

  6. Job Configuration Parameters • 190+ parameters in Hadoop • Set manually or defaults are used • Are defaults or rules-of-thumb good enough?
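
To give a feel for how quickly the space of settings grows, the sketch below counts the points in a small grid. The keys are classic (0.20-era) Hadoop parameter names; the candidate values are assumptions picked for illustration, not recommendations, and `grid_size` is a hypothetical helper. The final figure treats all 190 parameters as if each had just two candidate values, which is a deliberate simplification since many parameters do not affect performance.

```python
from math import prod

# Candidate values are illustrative assumptions, not tuning advice.
candidate_values = {
    "mapred.reduce.tasks": [1, 10, 28, 300],
    "io.sort.mb": [100, 200],
    "io.sort.factor": [10, 100, 500],
    "io.sort.record.percent": [0.05, 0.15],
    "mapred.compress.map.output": [False, True],
}


def grid_size(space):
    """Number of distinct settings in a full grid over the given candidates."""
    return prod(len(values) for values in space.values())


if __name__ == "__main__":
    print("settings for 5 parameters:", grid_size(candidate_values))
    # Two candidate values for each of 190 parameters:
    print("digits in 2**190:", len(str(2**190)))
```

Five parameters with two to four candidates each already give 96 settings, and every additional parameter multiplies that count.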

  7. Experiments on EC2 and local clusters [Four plots; axis labels: Running time (seconds) on two plots, Running time (minutes) on two plots]

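  8. Illustrative Result: 50GB Terasort • 17-node cluster, 64+32 concurrent map+reduce slots [Chart of running time at four parameter settings; the parameter names and measured times did not survive extraction. Settings as listed: (10, 500, 0.15); (28, 10, 0.15); (300, 10, 0.15); (300, 500, 0.15). One setting is annotated "Based on popular rule-of-thumb".] • Performance at default and rule-of-thumb settings can be poor • Cross-parameter interactions are significant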

  9. Problem Space [Diagram labels: Space of execution choices; Performance objectives; Complexity; Job configuration parameters; Declarative HiveQL/Pig operations; Multi-job workflows; Cost in pay-as-you-go environment; Energy considerations] Current approaches: • Predominantly manual • Post-mortem analysis. Is this where we want to be?

  10. Can DB Query Optimization Technology Help? [Diagram: Query; Optimizer (• Enumerate • Cost • Search); Good plan; Database Execution Engine; Results. MapReduce job; Good setting of parameters; Hadoop; Results.] But: • MapReduce jobs are not declarative • No schema about the data • Impact of concurrent jobs & scheduling? • Space of parameters is huge. Can we: • Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years • Or innovate! • Exploit design & usage properties of MapReduce frameworks
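
The enumerate, cost, and search steps of a classical optimizer can be sketched for parameter settings as follows. This is a toy under loud assumptions: `toy_cost` is a made-up formula (not derived from Hadoop or from the talk), the parameter names `reduce_tasks` and `sort_factor` are hypothetical stand-ins, and exhaustive search is used only because the example space is tiny.

```python
from itertools import product


def enumerate_settings(space):
    """Enumerate: yield every combination of candidate values as a dict."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


def toy_cost(setting):
    """Cost: a made-up model, NOT a real Hadoop cost model.

    Fewer reducers means more work per reducer, more reducers means more
    startup overhead, and a small sort factor means more merge passes.
    """
    reducers = setting["reduce_tasks"]
    sort_factor = setting["sort_factor"]
    per_reducer_work = 1000.0 / reducers
    startup_overhead = 0.5 * reducers
    merge_cost = 200.0 / sort_factor
    return per_reducer_work + startup_overhead + merge_cost


def search(space, cost_fn):
    """Search: exhaustively pick the setting with the lowest estimated cost."""
    return min(enumerate_settings(space), key=cost_fn)


if __name__ == "__main__":
    space = {"reduce_tasks": [10, 28, 300], "sort_factor": [10, 500]}
    print(search(space, toy_cost))
```

With a huge parameter space, no schema, and concurrent jobs, both the cost function and the exhaustive search shown here would have to be replaced by something far more careful, which is the concern raised on the slide.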

  11. Spectrum of Query Optimizers • AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks • Insight: Predictability(RBO) >> Predictability(CBO) [Diagram: Conventional Optimizers: Rule-based; Cost models + statistics about data]

  12. Spectrum of Query Optimizers • AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks • Insight: Predictability(RBO) >> Predictability(CBO) [Diagram: Conventional Optimizers: Rule-based; Cost models + statistics about data. Tuning Optimizers (proactively try different plans). Learning Optimizers (learn from execution & adapt)]

  13. Spectrum of Query Optimizers • Exploit usage & design properties of MapReduce frameworks: • High ratio of repeated jobs to new jobs • Schema can be learned (e.g., Pig scripts) • Common sort-partition-merge skeleton • Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results) • Fine-grained and pluggable scheduler [Diagram: Conventional Optimizers: Rule-based; Cost models + statistics about data. Tuning Optimizers (proactively try different plans). Learning Optimizers (learn from execution & adapt)]
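
One way to picture how a high ratio of repeated jobs could be exploited is a small tuner that tries each candidate setting once per job and then reuses the fastest one. This is only a sketch of the general idea of tuning and learning optimizers, not the approach proposed in the talk; the class `RepeatedJobTuner`, its methods, and the parameter name `r` in the usage lines are all hypothetical.

```python
class RepeatedJobTuner:
    """Remember observed running times per (job, setting) and reuse the best."""

    def __init__(self, candidates):
        self.candidates = [dict(c) for c in candidates]
        self.history = {}  # job_id -> {frozen setting -> list of running times}

    @staticmethod
    def _freeze(setting):
        """Turn a settings dict into a hashable, order-independent key."""
        return tuple(sorted(setting.items()))

    def suggest(self, job_id):
        """Try each candidate once, then stick with the fastest on average."""
        seen = self.history.get(job_id, {})
        for candidate in self.candidates:
            if self._freeze(candidate) not in seen:
                return dict(candidate)
        best = min(seen, key=lambda key: sum(seen[key]) / len(seen[key]))
        return dict(best)

    def record(self, job_id, setting, seconds):
        """Store the running time observed for one execution of the job."""
        runs = self.history.setdefault(job_id, {})
        runs.setdefault(self._freeze(setting), []).append(seconds)


if __name__ == "__main__":
    tuner = RepeatedJobTuner([{"r": 10}, {"r": 28}])
    for observed_seconds in (100.0, 60.0):  # made-up measurements
        setting = tuner.suggest("nightly-report")
        tuner.record("nightly-report", setting, observed_seconds)
    print(tuner.suggest("nightly-report"))
```

The sketch ignores noise from concurrent jobs and changing input sizes, both of which a practical learning optimizer would need to account for.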

  14. Summary • Call to action to improve automatic optimization techniques in MapReduce frameworks • Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc. • Rich history to learn from • MapReduce execution creates unique opportunities/challenges
