A Dynamic MapReduce Scheduler for Heterogeneous Workloads. Chao Tian, Haojie Zhou, Yongqiang He, Li Zha. Presenter: 董耀文 (first-year master's student, Computer Science)
Outline • Background • Question? • So! • Related work • MapReduce procedure analysis • MR-Predict • Scheduling policies • Evaluation • Conclusion
Background • As the scale of the Internet keeps growing, enormous amounts of data need to be processed by many Internet service providers. • The MapReduce framework has become a leading solution: it is designed for large commodity clusters consisting of thousands of nodes built from commodity hardware.
Background • The performance of a parallel system like MapReduce is closely tied to its task scheduler. • The current scheduler in Hadoop uses a single queue and schedules jobs with an FCFS (first-come, first-served) policy. • Yahoo's capacity scheduler and Facebook's fair scheduler use multiple queues to allocate different resources in the cluster.
Background • In practice, different kinds of jobs often run simultaneously in a data center. These jobs place different workloads on the cluster, including I/O-bound and CPU-bound workloads.
Background • Hadoop's scheduler is not aware of these workload characteristics: it prefers to simultaneously run map tasks from the job at the top of the queue. • Because tasks from the same job usually share the same characteristics, this can reduce the throughput of the whole system and seriously hurt the productivity of the data center.
Question • How can the hardware utilization rate be improved when different kinds of workloads run on clusters under the MapReduce framework?
SO! • They design a new triple-queue scheduler consisting of a workload-prediction mechanism, MR-Predict, and three queues (a CPU-bound queue, an I/O-bound queue, and a wait queue). • They classify MapReduce workloads into three types, and MR-Predict automatically predicts the class of each newly arriving job based on this classification. • Jobs in the CPU-bound queue and the I/O-bound queue are scheduled separately so that the two types of workload run in parallel. • Their experiments show that this approach can increase system throughput by up to 30% (a minimal sketch of the scheduler follows).
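As a concrete illustration of this design, here is a minimal sketch of a triple-queue scheduler. All class and method names are my own invention, not the authors' Hadoop patch: new jobs wait until MR-Predict classifies them, then move to the CPU-bound or I/O-bound queue, and the scheduler alternates between the two typed queues so both workload types overlap on each node.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

public class TripleQueueScheduler {
    enum WorkloadType { CPU_BOUND, IO_BOUND }

    static final class Job {
        final String name;
        Job(String name) { this.name = name; }
    }

    private final Queue<Job> waitQueue = new ArrayDeque<>(); // unclassified jobs
    private final Queue<Job> cpuQueue  = new ArrayDeque<>(); // CPU-bound jobs
    private final Queue<Job> ioQueue   = new ArrayDeque<>(); // I/O-bound jobs
    private boolean pickIoNext = true;                       // alternate between queues

    /** New jobs wait until MR-Predict has observed their first map tasks. */
    public void submit(Job job) { waitQueue.add(job); }

    /** Move a job to its typed queue once MR-Predict labels it. */
    public void classify(Job job, WorkloadType type) {
        if (waitQueue.remove(job)) {
            (type == WorkloadType.IO_BOUND ? ioQueue : cpuQueue).add(job);
        }
    }

    /**
     * Pick the job whose next map task should run. Alternating between the
     * CPU-bound and I/O-bound queues runs the two workload types in
     * parallel, instead of FCFS launching many same-type tasks at once.
     */
    public Optional<Job> nextJob() {
        Queue<Job> first  = pickIoNext ? ioQueue : cpuQueue;
        Queue<Job> second = pickIoNext ? cpuQueue : ioQueue;
        pickIoNext = !pickIoNext;
        Job job = first.peek() != null ? first.peek() : second.peek();
        return Optional.ofNullable(job);
    }
}
```

The paper's scheduler manages resources per queue; the simple alternation above is only meant to show how pairing a CPU-bound job with an I/O-bound one keeps both the disks and the CPUs busy.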
Related work • Scheduling algorithms in parallel systems [11, …] • Applications have different workloads: large computation and I/O requirements [10]. • How I/O-bound jobs affect system performance [6]. • A gang-scheduling algorithm that runs CPU-bound and I/O-bound jobs in parallel to increase hardware utilization [7].
Related work • The scheduling problem in MapReduce has attracted much attention [2, 10]. • Yahoo and Facebook designed schedulers for Hadoop: the capacity scheduler [4] and the fair scheduler [5].
MapReduce procedure analysis • Map-shuffle phase • Read in the input data • Compute the map task • Store the output results on the local disk • Shuffle map-task output data out • Shuffle reduce input data in
MapReduce procedure analysis • Reduce-Compute phase • Tasks run the application logic (both phases are sketched below)
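A minimal model of these two phases, with field names of my own rather than Hadoop's internals: it only records where a map task's disk and network traffic comes from, since these are exactly the quantities that the MR-Predict test in the evaluation section is built on.

```java
// Illustrative model of the two phases above (names are my own, not
// Hadoop's). The map-shuffle phase is dominated by disk/network traffic;
// the reduce-compute phase is dominated by application-logic CPU work.
public final class MapReduceJobModel {
    // Map-shuffle phase: every step reads or writes the disk/network.
    final double inputMB;    // read the input split
    final double outputMB;   // store map output results on the local disk
    final double shuffleMB;  // shuffle map output out / reduce input in

    // Reduce-compute phase: tasks run the application logic (CPU work).
    final double reduceSeconds;

    MapReduceJobModel(double inputMB, double outputMB,
                      double shuffleMB, double reduceSeconds) {
        this.inputMB = inputMB;
        this.outputMB = outputMB;
        this.shuffleMB = shuffleMB;
        this.reduceSeconds = reduceSeconds;
    }

    /** Disk traffic of one map task during the map-shuffle phase (MB). */
    double mapShuffleTrafficMB() {
        return inputMB + outputMB + shuffleMB;
    }
}
```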
Evaluation • Environment • 6 nodes connected by Gigabit Ethernet • DELL 1950 • CPU: 2 × quad-core 2.0 GHz • Memory: 4 GB • Disk: 2 SATA disks • Input data: 15 GB • Map slots & reduce slots: 8 • DIOR: 31.2 MB/s (measured in Hadoop without the reduce phase)
Evaluation • Resource utilizations • TeraSort: total-order sort (sequential I/O) benchmark • 8 × (64 MB + 64 MB) / 8 s = 128 MB/s ≥ 31.2 MB/s (DIOR), so TeraSort is I/O-bound
Evaluation • Resource utilizations • Grep-Count: uses [.]* as the regular expression • 8 × (64 MB + 1 MB + 1 MB + SID) / 92 s ≈ 5.7 MB/s < 31.2 MB/s (DIOR), so Grep-Count is CPU-bound
Evaluation • Resource utilizations • WordCount: splits the input text into words, shuffles every word in the map phase, and counts each word's occurrences in the reduce phase • 8 × (64 MB + 64 MB + 64 MB + SID) / 35 s ≈ 43.9 MB/s ≥ 31.2 MB/s (DIOR), so WordCount is I/O-bound
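The three inequalities follow one pattern; here is a minimal, self-contained check in code, under my reading of the slides: with N = 8 map slots, per-task disk traffic D in MB, and task duration T in seconds, a job counts as I/O-bound when N × D / T ≥ DIOR = 31.2 MB/s. SID (the shuffled intermediate data) is assumed negligible and ignored here, and all names are illustrative.

```java
public final class DiorCheck {
    static final double DIOR_MB_PER_S = 31.2; // measured disk I/O rate
    static final int MAP_SLOTS = 8;

    /** I/O-bound test: aggregate disk traffic per second vs. DIOR. */
    static boolean isIoBound(double trafficMB, double taskSeconds) {
        return MAP_SLOTS * trafficMB / taskSeconds >= DIOR_MB_PER_S;
    }

    public static void main(String[] args) {
        // TeraSort: 64 MB in + 64 MB out over ~8 s per task
        System.out.println("TeraSort  I/O-bound? " + isIoBound(64 + 64, 8));       // true  (128 MB/s)
        // Grep-Count: 64 MB in + ~1 MB + ~1 MB over ~92 s per task
        System.out.println("GrepCount I/O-bound? " + isIoBound(64 + 1 + 1, 92));   // false (~5.7 MB/s)
        // WordCount: 64 MB in + 64 MB spill + 64 MB shuffle over ~35 s per task
        System.out.println("WordCount I/O-bound? " + isIoBound(64 + 64 + 64, 35)); // true  (~43.9 MB/s)
    }
}
```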
Evaluation • Triple-queue scheduler experiments • Each job runs five times, for a total of 15 jobs
Conclusion • The scheduler correctly distributes jobs into the appropriate queues in most situations. • The Triple-Queue Scheduler can • increase map-task throughput by up to 30% • reduce the makespan by 20%