Data Engineering How MapReduce Works

Presentation Transcript


  1. Data Engineering: How MapReduce Works (Shivnath Babu)

  2. Lifecycle of a MapReduce Job: write a Map function and a Reduce function, then run the program as a MapReduce job.

  3. Lifecycle of a MapReduce Job: write a Map function and a Reduce function, then run the program as a MapReduce job.
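
  The deck does not show the functions themselves, so here is a minimal, conceptual word-count sketch in Scala (the function names and the local "driver" loop are illustrative, not the Hadoop API): the map function turns one input line into (word, 1) pairs, and the reduce function sums the counts emitted for each word.

      // Conceptual map and reduce functions for word count (illustrative names).
      def mapFn(line: String): Seq[(String, Int)] =
        line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

      def reduceFn(word: String, counts: Iterable[Int]): (String, Int) =
        (word, counts.sum)

      // Simulating the framework locally: map every line, group by key, reduce.
      val lines = Seq("the quick brown fox", "the lazy dog jumps over the fox")
      val wordCounts = lines.flatMap(mapFn)
        .groupBy { case (word, _) => word }
        .map { case (word, pairs) => reduceFn(word, pairs.map(_._2)) }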

  4. Lifecycle of a MapReduce Job (timeline diagram: input splits are processed by Map Wave 1 and Map Wave 2, followed over time by Reduce Wave 1 and Reduce Wave 2).

  5. Job Submission

  6. Initialization

  7. Scheduling

  8. Execution

  9. Map Task

  10. Reduce Tasks

  11. Hadoop Distributed File System (HDFS)

  12. Lifecycle of a MapReduce Job (the timeline diagram from slide 4: input splits, Map Waves 1 and 2, Reduce Waves 1 and 2 over time). How are the number of splits, the number of map and reduce tasks, the memory allocated to tasks, etc., determined?
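
  These are job- and cluster-level settings rather than part of the Map/Reduce code. As a hedged sketch (standard Hadoop 2.x property names; defaults vary by cluster), the split size, per-task memory, and reduce count can be controlled like this:

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.mapreduce.Job

      val conf = new Configuration()
      // Splits: roughly one map task per input split; the split size is bounded
      // by the HDFS block size and these two properties.
      conf.setLong("mapreduce.input.fileinputformat.split.minsize", 64L * 1024 * 1024)
      conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)
      // Memory per task container (in MB) when running on YARN.
      conf.setInt("mapreduce.map.memory.mb", 2048)
      conf.setInt("mapreduce.reduce.memory.mb", 4096)

      val job = Job.getInstance(conf, "word count")
      // The number of reduce tasks is chosen explicitly by the job.
      job.setNumReduceTasks(4)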

  13. Shuffle (timeline diagram: Map Waves 1, 2, and 3 overlap with Reduce Waves 1 and 2). What if the number of reduces is increased to 9? (A second timeline shows the same Map Waves 1, 2, and 3 with Reduce Waves 1, 2, and 3.)
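
  Continuing the configuration sketch after slide 12, the slide's question maps to a one-line change (hedged: the three-wave picture assumes only three reduce slots are free at a time):

      // With 9 reduce tasks and, say, 3 reduce slots, the reduces run in 3 waves.
      job.setNumReduceTasks(9)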

  14. Spark Programming Model (architecture diagram): the user/developer writes a driver program (sc = new SparkContext; rDD = sc.textfile("hdfs://…"); rDD.filter(…); rDD.Cache; rDD.Count; rDD.map). The SparkContext talks to the Cluster Manager, which launches Executors with caches on the Worker Nodes; each Executor runs Tasks and reads data from HDFS Datanodes. Taken from http://www.slideshare.net/fabiofumarola1/11-from-hadoop-to-spark-12

  15. Spark Programming Model: RDD (Resilient Distributed Dataset) • Immutable data structure • In-memory (explicitly) • Fault tolerant • Parallel data structure • Controlled partitioning to optimize data placement • Can be manipulated using a rich set of operators. (Diagram: the same driver program as on the previous slide, written by the user/developer.)
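
  The driver-program pseudocode in the diagram can be written as runnable Scala roughly as follows (the application name, the local master URL, the HDFS path, and the filter predicate are placeholders):

      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf().setAppName("lifecycle-example").setMaster("local[*]")
      val sc = new SparkContext(conf)

      val rdd = sc.textFile("hdfs://namenode:8020/data/input.txt") // placeholder path
      val errors = rdd.filter(line => line.contains("ERROR"))      // transformation (lazy)
      errors.cache()                                               // keep in memory after first computation
      val n = errors.count()                                       // action: actually runs the job
      val firstWords = errors.map(_.split(" ")(0))                 // further transformation on the cached RDD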

  16. RDD • Programming interface: the programmer can perform three types of operations • Transformations • Create a new dataset from an existing one • Lazy in nature: they are executed only when some action is performed • Examples: Map(func), Filter(func), Distinct() • Actions • Return a value to the driver program, or export data to a storage system, after performing a computation • Examples: Count(), Reduce(func), Collect(), Take() • Persistence • For caching datasets in memory for future operations • Option to store on disk, in RAM, or mixed (storage level) • Examples: Persist(), Cache()
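
  A hedged sketch of the three kinds of operations on an existing SparkContext sc (the data and the storage level are arbitrary); note that nothing executes until the actions at the end:

      import org.apache.spark.storage.StorageLevel

      val nums = sc.parallelize(1 to 1000)                 // distributed dataset with partitions
      val evens = nums.filter(_ % 2 == 0)                  // transformation (lazy)
      val buckets = evens.map(_ / 10).distinct()           // more transformations (still lazy)

      buckets.persist(StorageLevel.MEMORY_AND_DISK)        // persistence: pick a storage level
      // buckets.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

      val total = buckets.count()                          // action: triggers the whole pipeline
      val sample = buckets.take(5)                         // another action, reusing the persisted data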

  17. How Spark works • RDD: a parallel collection with partitions • A user application can create RDDs, transform them, and run actions • This results in a DAG (Directed Acyclic Graph) of operators • The DAG is compiled into stages • Each stage is executed as a collection of tasks (one task per partition)
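
  One way to see the DAG that Spark builds is RDD.toDebugString, which prints the lineage with indentation marking shuffle (stage) boundaries. A small sketch on an existing SparkContext sc:

      val counts = sc.parallelize(Seq("a b", "b c", "a a"))
        .flatMap(_.split(" "))        // narrow transformations stay in one stage
        .map(word => (word, 1))
        .reduceByKey(_ + _)           // introduces a shuffle, hence a new stage
      println(counts.toDebugString)   // shows the two stages of the DAG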

  18. Summary of Components • Task: the fundamental unit of execution in Spark • Stage: a collection of tasks that run in parallel • DAG: the logical graph of RDD operations • RDD: a parallel dataset of objects with partitions

  19. Example sc.textFile("/wiki/pagecounts") (lineage: textFile; type: RDD[String])

  20. Example sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) (lineage: textFile, map; types: RDD[String], RDD[Array[String]])

  21. Example sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) (lineage: textFile, map, map; types: RDD[String], RDD[Array[String]], RDD[(String, Int)])

  22. Example sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) .reduceByKey(_ + _) (lineage: textFile, map, map, reduceByKey; types: RDD[String], RDD[Array[String]], RDD[(String, Int)], RDD[(String, Int)])

  23. Example sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) .reduceByKey(_ + _, 3) .collect() (lineage: textFile, map, map, reduceByKey, collect; types: RDD[String], RDD[Array[String]], RDD[(String, Int)], RDD[(String, Int)], Array[(String, Int)])
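
  Put together as a self-contained driver (the SparkConf setup and the output printing are not on the slides; the input format assumption, tab-separated lines with the page name in field 0 and a count in field 1, follows the slides):

      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setAppName("pagecounts").setMaster("local[*]"))
      val counts: Array[(String, Int)] =
        sc.textFile("/wiki/pagecounts")
          .map(line => line.split("\t"))
          .map(r => (r(0), r(1).toInt))
          .reduceByKey(_ + _, 3)       // 3 reduce partitions, as on slide 23
          .collect()                   // action: brings the result back to the driver
      counts.sortBy { case (_, n) => -n }.take(10).foreach(println)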

  24. Execution Plan Stages are sequences of RDDs that don't have a shuffle in between. (Diagram: Stage 1 = textFile, map, map; Stage 2 = reduceByKey, collect.)

  25. Execution Plan (diagram) Stage 1: read the HDFS split, apply both maps, start the partial reduce, write shuffle data (textFile, map, map, reduceByKey). Stage 2: read shuffle data, do the final reduce, send the result to the driver program (reduceByKey, collect).

  26. Stage Execution • Create a task for each partition in the input RDD • Serialize the tasks • Schedule and ship tasks to slaves. All of this happens internally (you don't have to do anything).

  27. Spark Executor (Slave) (diagram: each of the executor's cores, Core 1 through Core 3, repeatedly fetches input, executes a task, and writes its output).
