The Hadoop Fair Scheduler

Matei Zaharia
Cloudera / Facebook / UC Berkeley

Outline
• Motivation / Hadoop usage at Facebook
• Fair scheduler basics
• Configuring the fair scheduler
• Future plans

Motivation
• Provide short response times to small jobs in a shared Hadoop cluster
• Improve utilization over private clusters / HOD (Hadoop On Demand)

Hadoop Usage at Facebook
• Data warehouse running Hive
• 600 machines, 4800 cores, 2.4 PB disk
• 3200 jobs per day
• 50+ engineers have used Hadoop

Facebook Data Pipeline
[Diagram; labeled components: Web Servers, Scribe Servers, Network Storage, Hive Queries, Analysts, Summaries, Hadoop Cluster, MySQL, Oracle RAC]

Facebook Job Types
• Production jobs: load data, compute statistics, detect spam, etc.
• Long experiments: machine learning, etc.
• Small ad-hoc queries: Hive jobs, sampling
GOAL: Provide fast response times for small jobs and guaranteed service levels for production jobs

Fair Scheduler Basics
• Group jobs into “pools”
• Assign each pool a guaranteed minimum share (split up among its jobs)
• Split excess capacity evenly between jobs
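The two-step split described above (minimum shares first, then excess capacity) can be sketched in a few lines of Python. This is only an illustration, not the scheduler's actual code: the function name `fair_shares` and the input layout are hypothetical, job names are assumed to be unique across pools, and the sketch ignores the case where a job needs fewer slots than it is offered.

```python
def fair_shares(total_slots, pools):
    """Return a {job: slots} mapping.

    pools maps a pool name to (min_share, [job names]).
    Hypothetical helper for illustration only.
    """
    shares = {}
    guaranteed = 0.0
    for min_share, jobs in pools.values():
        if not jobs:
            continue  # a pool with no jobs claims no slots
        # Step 1: the pool's minimum share is split up among its jobs.
        for job in jobs:
            shares[job] = min_share / len(jobs)
        guaranteed += min_share
    # Step 2: whatever is left over is split evenly between all jobs.
    excess = max(total_slots - guaranteed, 0.0)
    if shares:
        bonus = excess / len(shares)
        for job in shares:
            shares[job] += bonus
    return shares


example = fair_shares(100, {"ads": (10, ["a1", "a2"]), "default": (0, ["d1", "d2"])})
print(example)
```

With 100 slots, an "ads" pool guaranteed 10 slots and an unconfigured pool, each "ads" job ends up with 27.5 slots and each other job with 22.5.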

Pools
• Determined from a configurable job property
  • Default (before 0.20): mapred.queue.name
  • At Facebook: user.name (one pool per user)
• Unmarked jobs go into a “default pool”
• Pools have properties:
  • Minimum map slots
  • Minimum reduce slots
  • Limit on # of running jobs
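As a rough sketch of how a job could be mapped to a pool from a job property: the function `pool_for_job` and the literal pool name "default" are assumptions made for this example, and the job configuration is modeled as a plain dictionary rather than a real JobConf.

```python
DEFAULT_POOL = "default"  # assumed name for the "default pool"


def pool_for_job(jobconf, pool_name_property="mapred.queue.name"):
    """Pick a pool name from a dict of job properties (hypothetical helper)."""
    name = jobconf.get(pool_name_property, "").strip()
    # Jobs that do not set the property fall back to the default pool.
    return name if name else DEFAULT_POOL


print(pool_for_job({"user.name": "matei"}, "user.name"))
print(pool_for_job({"user.name": "matei"}))
```

The first call uses user.name as the pool property, so the job lands in a per-user pool; the second leaves the property unset, so the job falls through to the default pool.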

Scheduling Algorithm
• Divide each pool’s min share among its jobs
• Divide excess capacity among all jobs*
• When a slot needs to be assigned:
  • If there is any job below its min share, schedule it
  • Else schedule the job that we’ve been most unfair to (based on “deficit”)
* Fair schedulers from Hadoop 0.20 on will share equally between pools, not jobs; patch available at https://issues.apache.org/jira/browse/HADOOP-4789
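The slot-assignment rule can be sketched as below. This is a simplified model, not the scheduler's source: the `Job` fields and function names are hypothetical, the deficit is assumed to grow by (fair share minus running tasks) times elapsed time, and ties among jobs below their min share are assumed to be broken by deficit.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    running: int          # tasks currently running
    min_share: float      # slots guaranteed to this job
    fair_share: float     # slots the job should ideally hold
    deficit: float = 0.0  # accumulated unfairness (assumed definition)


def update_deficits(jobs, elapsed_ms):
    """Assumed rule: deficit grows while a job runs below its fair share."""
    for job in jobs:
        job.deficit += (job.fair_share - job.running) * elapsed_ms


def pick_job(jobs):
    """Choose which job gets the next free slot, or None if there are no jobs."""
    if not jobs:
        return None
    # Jobs below their min share take precedence over everything else.
    starved = [j for j in jobs if j.running < j.min_share]
    candidates = starved or jobs
    # Otherwise, serve the job we have been most unfair to.
    return max(candidates, key=lambda j: j.deficit)


jobs = [Job("a", 5, 2, 5), Job("b", 0, 1, 5), Job("c", 1, 0, 5)]
update_deficits(jobs, 1000)
print(pick_job(jobs).name)
```

Here job "b" is below its min share, so it is picked first; without it, job "c" would win because it has accumulated the largest deficit.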

Scheduler Dashboard
[Screenshots of the scheduler dashboard; callouts: Reassign pool, Change priority, FIFO mode (for testing)]

Additional Features
• Job weights for unequal sharing:
  • Based on priority (each level is 2x more)
  • Based on size (mapred.fairscheduler.sizebasedweight)
• Limits on # of running jobs:
  • Per user
  • Per pool
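To make "each level is 2x more" concrete, here is a small sketch of priority-based weights. The five level names follow Hadoop's job priorities; anchoring NORMAL at a weight of 1.0 and the helper names are assumptions for this example. Size-based weighting is not modeled.

```python
PRIORITIES = ["VERY_LOW", "LOW", "NORMAL", "HIGH", "VERY_HIGH"]


def priority_weight(priority):
    """Each level doubles the weight; NORMAL is assumed to be 1.0."""
    return 2.0 ** (PRIORITIES.index(priority) - PRIORITIES.index("NORMAL"))


def weighted_shares(total_slots, job_priorities):
    """Split slots between jobs in proportion to their weights."""
    weights = {job: priority_weight(p) for job, p in job_priorities.items()}
    total_weight = sum(weights.values())
    return {job: total_slots * w / total_weight for job, w in weights.items()}


shares = weighted_shares(70, {"a": "NORMAL", "b": "HIGH", "c": "VERY_HIGH"})
print(shares)
```

With 70 slots and one job at each of NORMAL, HIGH and VERY_HIGH, the weights are 1, 2 and 4, so the jobs receive 10, 20 and 40 slots.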

Installing the Scheduler
• Compile it:
  • ant package
• Place it on the classpath:
  • cp build/contrib/fairscheduler/*.jar lib
• Alternatively, add the JAR to HADOOP_CLASSPATH in conf/hadoop-env.sh

Configuration Files
• Hadoop config (conf/hadoop-site.xml)
  • Contains scheduler options, pointer to pools file
• Pools file (pools.xml)
  • Contains min share allocations and limits on pools
  • Reloaded every 15 seconds to allow reconfiguring pools at runtime

Minimal hadoop-site.xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/pools.xml</value>
</property>

Minimal pools.xml
<?xml version="1.0"?>
<allocations>
</allocations>

Configuring a Pool
<?xml version="1.0"?>
<allocations>
  <pool name="ads">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
  </pool>
</allocations>
• Any pools not configured in pools.xml will have minMaps=0 and minReduces=0

Setting Running Job Limits
<?xml version="1.0"?>
<allocations>
  <pool name="ads">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
  <user name="matei">
    <maxRunningJobs>1</maxRunningJobs>
  </user>
</allocations>

Default Jobs Limit for Users
<?xml version="1.0"?>
<allocations>
  <pool name="ads">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
  <user name="matei">
    <maxRunningJobs>1</maxRunningJobs>
  </user>
  <userMaxJobsDefault>10</userMaxJobsDefault>
</allocations>
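For checking a pools file before deploying it, a small reader along these lines may help. It is a sketch using Python's standard xml.etree.ElementTree, not the scheduler's own loader: the function names and the returned dictionary layout are made up for this example, and only the elements shown on these slides are handled.

```python
import xml.etree.ElementTree as ET


def _optional_int(text):
    """Convert element text to int, or return None when the element is absent."""
    return int(text) if text is not None else None


def parse_allocations(xml_text):
    """Read pools, per-user limits and the default user limit (illustrative)."""
    root = ET.fromstring(xml_text.strip())
    pools = {}
    for pool in root.findall("pool"):
        pools[pool.get("name")] = {
            # Missing minimum shares are treated as 0, as the slides describe.
            "minMaps": int(pool.findtext("minMaps", "0")),
            "minReduces": int(pool.findtext("minReduces", "0")),
            "maxRunningJobs": _optional_int(pool.findtext("maxRunningJobs")),
        }
    users = {}
    for user in root.findall("user"):
        users[user.get("name")] = _optional_int(user.findtext("maxRunningJobs"))
    return {
        "pools": pools,
        "users": users,
        "userMaxJobsDefault": _optional_int(root.findtext("userMaxJobsDefault")),
    }


EXAMPLE = """<?xml version="1.0"?>
<allocations>
  <pool name="ads">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
  <user name="matei">
    <maxRunningJobs>1</maxRunningJobs>
  </user>
  <userMaxJobsDefault>10</userMaxJobsDefault>
</allocations>
"""

config = parse_allocations(EXAMPLE)
print(config)
```

Running it on the example file yields one pool ("ads"), one per-user limit ("matei") and a default user limit of 10.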

Other hadoop-site.xml Properties
mapred.fairscheduler.assignmultiple:
• Assign both a map and a reduce on each heartbeat
• Improves ramp-up speed and throughput
• Recommendation: set to true

Other hadoop-site.xml Properties
mapred.fairscheduler.poolnameproperty:
• Which jobconf property to use to determine what pool a job is in
• Default: mapred.queue.name (queue name)
• Another useful option: user.name
• Can also make up your own, e.g. “project”

Other hadoop-site.xml Properties
mapred.fairscheduler.weightadjuster:
• Allows modifying job weights through a plugin class
• One useful example is provided: a new job booster that lets short jobs finish faster
<property>
  <name>mapred.fairscheduler.weightadjuster</name>
  <value>org.apache.hadoop.mapred.NewJobWeightBooster</value>
</property>
• Please see the README for details
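The idea behind a new-job booster can be sketched as a weight adjustment that applies only while a job is young. The function below is a hypothetical Python model, not the NewJobWeightBooster class itself; the boost factor of 3 and the five-minute window are illustrative values chosen for this example, so check the README for the real parameters.

```python
def boosted_weight(base_weight, job_age_ms, factor=3.0, duration_ms=5 * 60 * 1000):
    """Multiply the weight of a recently started job (illustrative values)."""
    if job_age_ms < duration_ms:
        # Young jobs get a larger share, so short jobs can finish sooner.
        return base_weight * factor
    return base_weight


print(boosted_weight(1.0, job_age_ms=10_000))
print(boosted_weight(1.0, job_age_ms=600_000))
```

A job that started 10 seconds ago is weighted three times higher than normal under these assumed settings, while a job that has run for 10 minutes keeps its base weight.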

Future Plans
• Share equally between pools, not jobs (Hadoop 0.20 release, HADOOP-4789)
• Preemption if a job is starved of its min or fair share for some timeout (HADOOP-4665)
• Locality wait optimization (HADOOP-4667)

Future Plans (continued)
• Simpler scheduling model (HADOOP-4803)
• FIFO pools (HADOOP-4803, HADOOP-5186)
• Delayed job initialization (HADOOP-5186)
• Scalability and operational improvements

Thanks!
• The Fair Scheduler is available in Hadoop 0.19; docs in src/contrib/fairscheduler/README
• Hadoop 0.17 and 0.18 versions at http://issues.apache.org/jira/browse/HADOOP-3746
matei@cloudera.com
