MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat To appear in OSDI 2004 (Operating Systems Design and Implementation)
Jeff Dean Sanjay Ghemawat
Introduction MapReduce is an important programming model for large-scale data-parallel applications
Motivation
- Parallel applications: widely used, often built as special-purpose programs
- Common functionality: parallelize the computation, distribute the data, handle failures
- Large-scale (big data) data processing
MapReduce?
- Programming model: parallel, generic, scalable
- Data: (key, value) pairs
- Implementation: clusters of commodity PCs
MapReduce?
# map(key, val) is a user-defined function run on each item in the input set; it emits new-key / new-val pairs
# reduce(key, vals) is a user-defined function run for each unique key emitted by map(); it emits the final output
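To make the map/reduce pair concrete, here is a minimal, self-contained C++ sketch of the classic word-count example. The global in-memory grouping and the driver in main() stand in for the real MapReduce library, whose API is not shown on these slides.

    // Single-process simulation of the map/reduce flow for word count.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    std::map<std::string, std::vector<int>> intermediate;  // stand-in for the shuffle: pairs grouped by key

    // map(key, value): emits <word, 1> for every word in the document.
    void Map(const std::string& doc_name, const std::string& contents) {
        (void)doc_name;  // the document name is unused in this simple example
        std::istringstream in(contents);
        std::string word;
        while (in >> word) intermediate[word].push_back(1);  // EmitIntermediate(word, 1)
    }

    // reduce(key, values): sums all counts emitted for one word.
    void Reduce(const std::string& word, const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        std::cout << word << "\t" << total << "\n";  // Emit(total)
    }

    int main() {
        Map("doc1", "the quick brown fox");
        Map("doc2", "the lazy dog and the fox");
        for (const auto& kv : intermediate) Reduce(kv.first, kv.second);
    }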
Examples
# Distributed grep (Global / Regular Expression / Print)
# Count of URL access frequency (from logs of web page requests): map emits <URL, 1>; reduce emits <URL, total count>
# Reverse web-link graph: map emits <target (linked URL), source (web page)>; reduce emits <target, list(source)>
Examples
# Term vector per host (a term vector is a list of <word, frequency> pairs): map emits <hostname, term vector>; reduce merges the vectors, throws away infrequent terms, and emits a final <hostname, term vector>
# Inverted index: map emits <word, document ID>; reduce emits <word, list(document ID)>
# Distributed sort: map emits <key, record>; reduce emits all pairs unchanged
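A similar sketch for the inverted-index example above, again simulating the shuffle with an in-memory map rather than the real library: Map emits <word, document ID>, and the grouped output per word is the list of documents containing it.

    // Single-process simulation of the inverted-index example.
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>

    std::map<std::string, std::set<std::string>> inverted_index;  // word -> sorted doc IDs

    void Map(const std::string& doc_id, const std::string& contents) {
        std::istringstream in(contents);
        std::string word;
        while (in >> word) inverted_index[word].insert(doc_id);  // EmitIntermediate(word, doc_id)
    }

    int main() {
        Map("doc1", "mapreduce simplifies data processing");
        Map("doc2", "large clusters processing data");
        for (const auto& kv : inverted_index) {
            std::cout << kv.first << ":";
            for (const auto& d : kv.second) std::cout << " " << d;  // Emit(word, list of doc IDs)
            std::cout << "\n";
        }
    }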
Typical cluster
# Machines: typically 100s or 1000s of dual-processor x86 machines running Linux, with 2-4 GB of memory
# Network: 100 megabits/second or 1 gigabit/second
# Storage: local IDE disks
# GFS: a distributed file system manages the data
# Job scheduling system: jobs are made up of tasks; a scheduler assigns tasks to machines
# Language: C++ library linked into user programs
Distributed execution (1)
#1
- Split the input files into M pieces (16 MB ~ 64 MB, controllable by the user via an optional parameter)
- Start up many copies of the program on a cluster of machines
#2
- Master (1): one of the copies of the program is special
- Workers (n): assigned work by the master
- Map tasks (M) / reduce tasks (R)
#3 Map workers
- A map task reads the contents of its input split
- Parses key/value pairs and passes them to the user-defined Map function
- Intermediate pairs are buffered in memory
#4 Map workers
- Periodically, the buffered pairs are written to local disk, partitioned into R regions (see the partitioning sketch below)
- The locations on local disk are passed back to the master
- The master is responsible for forwarding these locations to the reduce workers
#5 Reduce workers
- A reduce worker uses remote procedure calls to read the buffered data from the local disks of the map workers
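The R regions mentioned in step #4 come from a partitioning function; the paper's default is hash(key) mod R. A small sketch, using std::hash as a stand-in for the library's hash function:

    #include <functional>
    #include <iostream>
    #include <string>

    // An intermediate key goes to reduce partition hash(key) mod R.
    int Partition(const std::string& key, int R) {
        return static_cast<int>(std::hash<std::string>{}(key) % static_cast<size_t>(R));
    }

    int main() {
        std::cout << Partition("the", 5) << "\n";  // some partition index in [0, 5)
    }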
Distributed execution (2)
#6 Reduce workers
- The reduce worker sorts the intermediate data by key and iterates over it; for each unique intermediate key encountered, it passes the key and the corresponding values to the user's Reduce function
- The output of the Reduce function is appended to a final output file for this reduce partition
#7
- When all map tasks and reduce tasks have been completed, the master wakes up the user program
- At this point, the MapReduce call in the user program returns back to the user code
#8
- After successful completion, the output is available in R output files, one per reduce task, with file names as specified by the user
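A sketch of what step #6 looks like on one reduce worker, assuming the intermediate pairs have already been fetched: sort by key, then call the user's Reduce once per unique key. The names and types here are illustrative, not the library's.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, std::string>;

    void RunReducePhase(std::vector<KV> pairs,
                        void (*Reduce)(const std::string&, const std::vector<std::string>&)) {
        std::sort(pairs.begin(), pairs.end());          // group equal keys together
        for (size_t i = 0; i < pairs.size();) {
            size_t j = i;
            std::vector<std::string> values;
            while (j < pairs.size() && pairs[j].first == pairs[i].first)
                values.push_back(pairs[j++].second);    // collect all values for this key
            Reduce(pairs[i].first, values);             // one call per unique key
            i = j;
        }
    }

    int main() {
        std::vector<KV> pairs = {{"fox", "1"}, {"the", "1"}, {"the", "1"}};
        RunReducePhase(pairs, [](const std::string& k, const std::vector<std::string>& v) {
            std::cout << k << "\t" << v.size() << "\n";  // word count again
        });
    }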
Master Data Structures
# For each map and reduce task, the master stores its status: idle, in-progress, or completed
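A hedged sketch of the per-task bookkeeping this slide describes; the field names are invented for illustration. Beyond the three states, the paper also records, for each completed map task, the locations and sizes of its R intermediate file regions, which the master forwards to reduce workers.

    #include <cstdint>
    #include <string>
    #include <vector>

    enum class TaskState { kIdle, kInProgress, kCompleted };

    struct TaskInfo {
        TaskState state = TaskState::kIdle;
        std::string worker;                     // assigned machine, if not idle
        std::vector<std::string> region_paths;  // R intermediate regions (completed map tasks)
        std::vector<uint64_t> region_sizes;     // matching sizes, forwarded to reduce workers
    };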
Fault Tolerance
# Worker failure
- The master pings every worker periodically; tasks on a failed worker are rescheduled (see the sketch below)
- MapReduce is resilient to large-scale worker failures
# Master failure: the current implementation aborts the MapReduce computation
- It is easy to make the master write periodic checkpoints of the master data structures described above
- If the master task dies, a new copy can be started from the last checkpointed state
- Clients can check for this condition and retry the MapReduce operation if they desire
# Semantics in the presence of failures
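An illustrative sketch of the worker-failure rule: when a pinged worker stops responding, its in-progress tasks are reset to idle, and so are its completed map tasks, because their output lives on that machine's now-unreachable local disk. The types are invented for this sketch.

    #include <string>
    #include <vector>

    enum class TaskState { kIdle, kInProgress, kCompleted };
    struct Task { TaskState state; std::string worker; bool is_map; };

    void HandleWorkerFailure(std::vector<Task>& tasks, const std::string& dead) {
        for (Task& t : tasks) {
            if (t.worker != dead) continue;
            bool redo = (t.state == TaskState::kInProgress) ||
                        (t.state == TaskState::kCompleted && t.is_map);
            if (redo) { t.state = TaskState::kIdle; t.worker.clear(); }  // reschedule later
        }
    }

    int main() {
        std::vector<Task> tasks = {
            {TaskState::kCompleted, "w1", true},    // map output now unreachable
            {TaskState::kInProgress, "w1", false},  // was running on the failed worker
            {TaskState::kCompleted, "w2", true},    // unaffected
        };
        HandleWorkerFailure(tasks, "w1");           // first two go back to idle
    }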
Locality
# GFS storage conserves network bandwidth: GFS divides each file into 64 MB blocks and stores several copies of each block (typically 3 copies) on different machines, and the master tries to schedule a map task on a machine that holds a replica of its input data
# When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth
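A sketch of that locality preference, with invented names: prefer an idle worker that holds one of the GFS replicas of the input split, and fall back to any idle worker (which forces a remote read) only if none does.

    #include <iostream>
    #include <string>
    #include <vector>

    std::string PickWorker(const std::vector<std::string>& idle_workers,
                           const std::vector<std::string>& replica_hosts) {
        for (const std::string& w : idle_workers)
            for (const std::string& r : replica_hosts)
                if (w == r) return w;                                  // local read: no network traffic
        return idle_workers.empty() ? "" : idle_workers.front();      // remote fallback
    }

    int main() {
        std::cout << PickWorker({"w1", "w2", "w3"}, {"w2", "w7"}) << "\n";  // prints w2
    }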
Task Granularity
# Ideally, the numbers of map tasks (M) and reduce tasks (R) are much larger than the number of machines
- Improves dynamic load balancing
- Speeds up recovery when a worker fails
# Master overhead
- The master makes O(M+R) scheduling decisions and keeps O(M×R) state in memory
- There are practical bounds on how large M and R can be
- The O(M×R) state is small: roughly one byte per map/reduce task pair
# R is often constrained by users, because the output of each reduce task ends up in a separate output file
# In practice: MapReduce computations are often run with M = 200,000 and R = 5,000 on 2,000 worker machines
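Rough arithmetic on those figures, using the one-byte-per-pair estimate above: M × R = 200,000 × 5,000 = 10^9 map/reduce task pairs, so the master's O(M×R) bookkeeping comes to roughly 1 GB of memory.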
Backup Tasks
# A "straggler" is a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation
# When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks
# The task is marked as completed whenever either the primary or the backup execution completes
Combiner Function
[Diagram: the master coordinating map tasks and reduce tasks across nodes N1-N3, with labels for CPU performance and network traffic; partial merging on the map worker reduces the data sent to the reduce workers.]
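For word count, the combiner does the same merging as Reduce, but on the map worker, so each word crosses the network once per map task instead of once per occurrence. A minimal sketch; the function name and driver are illustrative.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Combine the <word, 1> pairs produced by one map task before they are
    // written to local disk and fetched over the network.
    std::map<std::string, int> Combine(const std::vector<std::string>& emitted_words) {
        std::map<std::string, int> partial;
        for (const std::string& w : emitted_words) ++partial[w];  // local partial sums
        return partial;
    }

    int main() {
        std::map<std::string, int> out = Combine({"the", "fox", "the", "the", "dog"});
        for (const auto& kv : out)
            std::cout << kv.first << "\t" << kv.second << "\n";   // dog:1 fox:1 the:3
    }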
Status Information
# The master runs an internal HTTP server and exports a set of status pages for human consumption
# They show how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, and processing rates
# The user can use this data to predict how long the computation will take
Conclusions
# First, the model is easy to use, even for programmers without experience with parallel and distributed systems
# Second, a large variety of problems are easily expressible as MapReduce computations
# Third, we have developed an implementation of MapReduce that scales to large clusters comprising thousands of machines
Lessons learned:
# First, restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant
# Second, network bandwidth is a scarce resource
# Third, redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss