Data Engineering: How MapReduce Works (Shivnath Babu)
Lifecycle of a MapReduce Job: the user writes a Map function and a Reduce function, then runs the program as a MapReduce job.
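To make the two user-supplied pieces concrete, here is a minimal word-count-style sketch in Scala (not from the original slides). The simplified signatures and the whitespace-separated input format are assumptions for illustration; the real Hadoop API wraps these functions in Mapper and Reducer classes.

```scala
object WordCountFunctions {
  // Map function: emit a (word, 1) pair for every word on one input line.
  def mapFunction(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce function: sum all counts the framework has grouped under one word.
  def reduceFunction(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)
}
```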
Lifecycle of a MapReduce Job (timeline): the input splits feed map tasks that run in waves (Map Wave 1, Map Wave 2), and their output feeds reduce tasks that also run in waves (Reduce Wave 1, Reduce Wave 2).
How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
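In Hadoop MapReduce these choices are largely driven by the input size and a handful of job configuration properties. A hedged sketch, assuming Hadoop 2.x property names; the values are illustrative only:

```scala
import org.apache.hadoop.conf.Configuration

// Illustrative Hadoop 2.x properties that drive these decisions.
// The numbers below are assumptions for the example, not recommendations.
val conf = new Configuration()

// The number of map tasks follows from input size / split size; the split size
// is typically the HDFS block size, bounded by these two properties.
conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L)
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

// The number of reduce tasks is set explicitly per job.
conf.setInt("mapreduce.job.reduces", 2)

// Memory allocated to each map and reduce task container (YARN).
conf.setInt("mapreduce.map.memory.mb", 1024)
conf.setInt("mapreduce.reduce.memory.mb", 2048)
```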
Shuffle: map output is shuffled to the reduce tasks. In the figure the job runs three map waves (Map Wave 1 to 3) and two reduce waves (Reduce Wave 1 and 2). What if the number of reduces is increased to 9? The second diagram shows a third reduce wave (Reduce Wave 3) becoming necessary.
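If the number of reduces were raised to 9, one common way is to set it programmatically on the job; a minimal sketch using the standard Hadoop Job API (the job name is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Ask the framework for 9 reduce tasks; with the reduce capacity shown in the
// figure, this would push the job into an additional reduce wave.
val job = Job.getInstance(new Configuration(), "pagecount") // placeholder job name
job.setNumReduceTasks(9)
```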
Spark Programming Model: the user (developer) writes a driver program that creates a SparkContext and builds RDDs (e.g., sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)). The SparkContext talks to the cluster manager, which starts executors on the worker nodes; each executor holds a cache and runs tasks, reading data from the HDFS datanodes. (Diagram taken from http://www.slideshare.net/fabiofumarola1/11-from-hadoop-to-spark-12)
Spark Programming Model: the driver program above manipulates an RDD (Resilient Distributed Dataset), which is:
• An immutable data structure
• In-memory (explicitly)
• Fault tolerant
• A parallel data structure
• Partitioned in a controlled way to optimize data placement
• Manipulated using a rich set of operators
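A minimal runnable version of the driver sketch above, assuming a local master URL and a placeholder HDFS path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The master URL and application name are placeholders for illustration.
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD from HDFS, transform it lazily, cache it, then run actions.
    val lines  = sc.textFile("hdfs://namenode:8020/data/input.txt") // placeholder path
    val errors = lines.filter(_.contains("ERROR")).cache()
    println(errors.count())                 // action: triggers execution
    println(errors.map(_.toUpperCase).count())

    sc.stop()
  }
}
```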
RDD Programming Interface: the programmer can perform three types of operations (see the sketch after this list).
• Transformations
  • Create a new dataset from an existing one.
  • Lazy in nature: they are executed only when some action is performed.
  • Examples: map(func), filter(func), distinct()
• Actions
  • Return a value to the driver program, or export data to a storage system, after performing a computation.
  • Examples: count(), reduce(func), collect(), take()
• Persistence
  • Caches datasets in memory for future operations.
  • Option to store on disk, in RAM, or mixed (storage level).
  • Examples: persist(), cache()
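A short sketch tying the three kinds of operations together; it assumes an existing SparkContext named sc, and the data is made up for illustration:

```scala
import org.apache.spark.storage.StorageLevel

// Transformations: lazy, each builds a new RDD from an existing one.
val nums     = sc.parallelize(Seq(1, 2, 2, 3, 4, 4, 5))
val doubled  = nums.map(_ * 2)              // map(func)
val evens    = doubled.filter(_ % 4 == 0)   // filter(func)
val distinct = evens.distinct()             // distinct()

// Persistence: keep the dataset around for future operations.
distinct.persist(StorageLevel.MEMORY_AND_DISK) // or simply distinct.cache()

// Actions: trigger execution and return a value to the driver.
println(distinct.count())                   // count()
println(distinct.reduce(_ + _))             // reduce(func)
println(distinct.collect().toList)          // collect()
println(distinct.take(2).toList)            // take()
```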
How Spark works
• An RDD is a parallel collection with partitions.
• A user application creates RDDs, transforms them, and runs actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a collection of tasks (one task per partition).
Summary of Components
• Task: the fundamental unit of execution in Spark.
• Stage: a collection of tasks that run in parallel.
• DAG: the logical graph of RDD operations.
• RDD: a parallel dataset of objects with partitions.
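To see the operator DAG (and where Spark will cut it into stages), one can print an RDD's lineage with toDebugString. A sketch assuming an existing SparkContext sc and a placeholder input path, using the same pagecount pipeline that the following slides build up:

```scala
// Build the pipeline and inspect its lineage; the shuffle introduced by
// reduceByKey is where Spark cuts the DAG into two stages.
val counts = sc.textFile("hdfs://namenode:8020/wiki/pagecounts") // placeholder path
  .map(_.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _)

println(counts.toDebugString) // prints the operator DAG and its stage boundaries
```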
Example (built up step by step):
• sc.textFile("/wiki/pagecounts") yields an RDD[String] (textFile).
• .map(line => line.split("\t")) yields an RDD[Array[String]] (map).
• .map(r => (r(0), r(1).toInt)) yields an RDD[(String, Int)] (map).
• .reduceByKey(_ + _) yields an RDD[(String, Int)] (reduceByKey).
• Adding a partition count and an action, .reduceByKey(_ + _, 3).collect() returns an Array[(String, Int)] to the driver (collect).
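Put together as a self-contained program (a sketch: the input path and the page<TAB>count record format are carried over from the slides, and the local master URL is an assumption for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PageCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pagecounts").setMaster("local[*]"))

    val result: Array[(String, Int)] =
      sc.textFile("/wiki/pagecounts")     // RDD[String]
        .map(_.split("\t"))               // RDD[Array[String]]
        .map(r => (r(0), r(1).toInt))     // RDD[(String, Int)]
        .reduceByKey(_ + _, 3)            // RDD[(String, Int)], 3 reduce partitions
        .collect()                        // Array[(String, Int)] on the driver

    result.take(10).foreach { case (page, count) => println(s"$page\t$count") }
    sc.stop()
  }
}
```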
Execution Plan: stages are sequences of RDD operations that do not have a shuffle in between. Stage 1 covers textFile, both maps, and the partial reduceByKey; Stage 2 covers the final reduceByKey and collect.
Execution Plan (expanded):
• Stage 1: read the HDFS split, apply both maps, start the partial reduce, and write the shuffle data.
• Stage 2: read the shuffle data, perform the final reduce, and send the result to the driver program.
Stage Execution
• Create a task for each partition of the input RDD.
• Serialize the tasks.
• Schedule and ship the tasks to the slaves.
All of this happens internally (you don't have to do anything).
Spark Executor (Slave): the executor runs tasks concurrently on its cores (Core 1, Core 2, Core 3); each task fetches its input, executes, and writes its output.