Data Engineering: How MapReduce Works. Shivnath Babu
Lifecycle of a MapReduce Job: the user writes a map function and a reduce function, then runs the program as a MapReduce job.
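The slides do not show the program itself; as a minimal sketch of what the two functions do for a word-count job (plain Scala with illustrative names and types, not the actual Hadoop API):

  // Map function: turn one input record into (key, value) pairs.
  def mapFn(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce function: fold all values that share a key into one result.
  def reduceFn(word: String, counts: Iterable[Int]): (String, Int) =
    (word, counts.sum)

  // Tiny local run of the data flow the MapReduce framework would orchestrate:
  val input   = Seq("the quick brown fox", "the lazy dog")
  val mapped  = input.flatMap(mapFn)                      // "map wave"
  val grouped = mapped.groupBy(_._1)                      // shuffle: group by key
  val reduced = grouped.map { case (w, ps) => reduceFn(w, ps.map(_._2)) }
  reduced.foreach(println)                                // e.g. (the,2), (fox,1), ...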
Lifecycle of a MapReduce Job (timeline figure): input splits feed map tasks that run in waves (Map Wave 1, Map Wave 2), followed by reduce tasks that also run in waves (Reduce Wave 1, Reduce Wave 2).
Lifecycle of a MapReduce Job (same timeline figure), with the question: how are the number of splits, the number of map and reduce tasks, the memory allocated to tasks, etc. determined?
Shuffle (timeline figure): map tasks run in three waves (Map Wave 1-3) and reduce tasks in two waves (Reduce Wave 1-2). What if the number of reduces is increased to 9? The second timeline shows the same three map waves, but the reduces now take three waves (Reduce Wave 1-3).
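The slides pose this as a question; as one concrete illustration (standard Hadoop MapReduce settings, written in Scala against the Hadoop Java API; all values are placeholders), the number of reduce tasks is set directly on the job, the number of map tasks follows from the input splits whose size can be bounded in the configuration, and per-task memory is also a job setting:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.mapreduce.Job

  val conf = new Configuration()
  // Map tasks: one per input split; the split size is derived from the HDFS
  // block size, bounded by these properties (fewer/larger or more/smaller splits).
  conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
  conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)
  // Per-task memory is also a per-job setting (placeholder values).
  conf.setInt("mapreduce.map.memory.mb", 2048)
  conf.setInt("mapreduce.reduce.memory.mb", 4096)

  val job = Job.getInstance(conf, "pagecount")
  // Reduce tasks are set directly on the job.
  job.setNumReduceTasks(9)

With, say, three reduce slots available per wave, setting the reduce count to 9 means the reduces complete in three waves, as the second timeline in the figure shows.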
Spark Programming Model (architecture figure): the user (developer) writes a driver program (sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)). The SparkContext connects to the cluster manager, which assigns work to worker nodes; each worker node runs an executor with a cache, and each executor runs tasks against data in HDFS datanodes. Taken from http://www.slideshare.net/fabiofumarola1/11-from-hadoop-to-spark-12
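A minimal sketch of the driver-side setup the figure describes (the master URL, application name, and HDFS path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // The driver program creates a SparkContext, which talks to the cluster
  // manager; executors on the worker nodes then run the tasks it schedules.
  val conf = new SparkConf()
    .setAppName("PageCounts")            // placeholder application name
    .setMaster("spark://master:7077")    // placeholder cluster-manager URL
  val sc = new SparkContext(conf)

  val rdd = sc.textFile("hdfs://namenode:8020/wiki/pagecounts") // placeholder path
  rdd.cache()          // keep partitions in executor memory after first use
  println(rdd.count()) // an action: triggers the actual computation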
Spark Programming Model: the user (developer) writes a driver program (sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)), which manipulates RDDs (Resilient Distributed Datasets): • Immutable data structure • In-memory (explicitly) • Fault tolerant • Parallel data structure • Controlled partitioning to optimize data placement (see the sketch below) • Can be manipulated using a rich set of operators
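To illustrate the controlled-partitioning point: a pair RDD can be given an explicit partitioner so that records with the same key are co-located. This sketch reuses the sc from above; the path and the partition count of 8 are placeholders:

  import org.apache.spark.HashPartitioner

  val pairs = sc.textFile("hdfs://namenode:8020/wiki/pagecounts") // placeholder path
    .map(_.split("\t"))
    .map(r => (r(0), r(1).toInt))

  val partitioned = pairs
    .partitionBy(new HashPartitioner(8)) // records hash to 8 fixed partitions by key
    .persist()                           // keep the partitioned layout in memory

  println(partitioned.getNumPartitions)  // 8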
RDD • Programming interface: the programmer can perform three types of operations:
• Transformations: create a new dataset from an existing one. Lazy in nature; they are executed only when some action is performed. Examples: map(func), filter(func), distinct()
• Actions: return a value to the driver program, or export data to a storage system, after performing a computation. Examples: count(), reduce(func), collect(), take()
• Persistence: cache datasets in memory for future operations, with the option to store on disk, in RAM, or mixed (storage level). Examples: persist(), cache() (a sketch of all three categories follows below)
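A small sketch showing one operation from each category (the HDFS path and the log format are assumptions made for the example):

  import org.apache.spark.storage.StorageLevel

  val lines = sc.textFile("hdfs://namenode:8020/data/events.log") // placeholder path

  // Transformations: lazy, they only build up the lineage / DAG.
  val errors = lines.filter(_.contains("ERROR"))
  val codes  = errors.map(_.split(" ")(0)).distinct()

  // Persistence: mark the RDD to be kept (here memory, spilling to disk).
  errors.persist(StorageLevel.MEMORY_AND_DISK)

  // Actions: force execution and return results to the driver.
  println(errors.count())          // number of error lines
  codes.take(5).foreach(println)   // first few distinct codes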
How Spark works • An RDD is a parallel collection with partitions. • A user application creates RDDs, transforms them, and runs actions. • This results in a DAG (Directed Acyclic Graph) of operators. • The DAG is compiled into stages. • Each stage is executed as a collection of tasks (one task per partition; see the sketch after the component summary).
Summary of Components • Task: the fundamental unit of execution in Spark • Stage: a collection of tasks that run in parallel • DAG: the logical graph of RDD operations • RDD: a parallel dataset of objects with partitions
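A quick way to see the task/partition relationship from the Spark shell (the partition count of 8 is an illustrative choice):

  // One task per partition: a stage over an RDD with 8 partitions runs 8 tasks.
  val data = sc.parallelize(1 to 1000000, 8)
  println(data.getNumPartitions)                             // 8
  // reduceByKey introduces a shuffle, so this job compiles into two stages.
  val hist = data.map(x => (x % 10, 1)).reduceByKey(_ + _)
  println(hist.count())                                      // action: runs both stages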
Example: sc.textFile("/wiki/pagecounts"). Lineage: textFile → RDD[String].
Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t")). Lineage: textFile → RDD[String] → map → RDD[Array[String]] (split returns an Array in Scala).
Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)). Lineage: textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)].
Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _). Lineage: textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)].
Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _, 3).collect(). Lineage: textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)] → collect → Array[(String, Int)] on the driver.
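Put together as a self-contained driver program (the HDFS path is a placeholder, and the pagecounts lines are assumed to be tab-separated page/count pairs, as on the slides):

  import org.apache.spark.{SparkConf, SparkContext}

  object PageCounts {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("PageCounts") // master comes from spark-submit
      val sc   = new SparkContext(conf)

      val counts = sc.textFile("hdfs://namenode:8020/wiki/pagecounts") // placeholder path
        .map(_.split("\t"))
        .map(r => (r(0), r(1).toInt))
        .reduceByKey(_ + _, 3)     // 3 reduce-side partitions, as on the slide
        .collect()                 // Array[(String, Int)] back on the driver

      counts.take(10).foreach(println)
      sc.stop()
    }
  }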
Execution Plan: stages are sequences of RDDs that don't have a shuffle in between. Stage 1: textFile → map → map, up to the shuffle introduced by reduceByKey; Stage 2: the final reduceByKey → collect.
Execution Plan (detail). Stage 1: read the HDFS split, apply both maps, start the partial reduce, write shuffle data. Stage 2: read shuffle data, do the final reduce, send the result to the driver program.
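The stage boundary is visible from the RDD lineage: building the same pipeline as above but stopping before collect(), toDebugString prints the operator DAG, and the shuffle introduced by reduceByKey marks where Stage 1 ends and Stage 2 begins (the path is again a placeholder):

  val perPage = sc.textFile("hdfs://namenode:8020/wiki/pagecounts")
    .map(_.split("\t"))
    .map(r => (r(0), r(1).toInt))
    .reduceByKey(_ + _)

  // In the printed lineage the ShuffledRDD appears at the top level (Stage 2)
  // and the textFile/map RDDs are indented below it (Stage 1).
  println(perPage.toDebugString)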
Stage Execution • Create a task for each partition in the input RDD • Serialize the tasks • Schedule and ship tasks to slaves. All of this happens internally (you don't have to do anything).
Spark Executor (Slave) (figure): three cores are shown, each looping through fetch input → execute task → write output for successive tasks.