Power-Aware Parallel Job Scheduling
Maja Etinski, Julita Corbalan, Jesus Labarta, Mateo Valero
{maja.etinski,julita.corbalan,jesus.labarta,mateo.valero}@bsc.es
Power Consumption of Supercomputing Systems
• Striving for performance has led to enormous power dissipation of HPC centers (Top500 list)
[Chart: power consumption of leading Top500 systems, in kWatts]
Power Reduction Approaches in HPC
• Application level:
  - runtime systems:
    - exploit certain application characteristics (load imbalance, communication-intensive regions)
    - based on very fine-grained DVFS within the application
• System level:
  - turning off idle nodes:
    - resource allocation such that more nodes are completely idle
    - determining the number of online nodes
  - operating system power management via DVFS:
    - Linux governors: per core, unaware of the rest of the system
    - what about DVFS that takes the entire system workload into account?
Parallel Job Scheduling
• The job scheduler has a global view of the whole system
[Diagram: a job is submitted with its requirements to the HPC system; queued jobs wait in the wait queue; the job scheduler performs job scheduling and passes jobs to the resource manager]
DVFS and Job Scheduling
[Diagram: the same scheduling scheme extended with a power-aware component that assigns each job a CPU frequency based on goals/constraints]
Outline
• Parallel job scheduling:
  - a short introduction to parallel job scheduling
  - the EASY backfilling policy
• Power and run time modelling:
  - first we need to understand how frequency scaling affects CPU power dissipation and run time
• Energy-saving parallel job scheduling policies:
  - utilization-driven power-aware scheduling [M. Etinski, J. Corbalan, J. Labarta, M. Valero. Utilization Driven Power-Aware Parallel Job Scheduling. Energy Aware High Performance Computing Conference, Hamburg, September 2010]
  - BSLD-driven power-aware scheduling [M. Etinski, J. Corbalan, J. Labarta, M. Valero. BSLD Threshold Driven Power Management Policy for HPC Centers. IEEE International Parallel and Distributed Processing Symposium, HPPAC Workshop, Atlanta, GA, April 2010]
• Power budgeting:
  - how to maximize job performance under a given power budget? [M. Etinski, J. Corbalan, J. Labarta, M. Valero. Optimizing Job Performance Under a Given Power Constraint in HPC Centers. IEEE International Green Computing Conference, Chicago, IL, August 2010]
About Parallel Job Scheduling
• Parallel job scheduling can be seen as finding a free rectangle in the CPUs × time plane for the job being scheduled:
  - the FCFS policy was used in the beginning
  - backfilling policies were introduced to improve system utilization
• Job performance metrics (see the sketch below):
  - Response time: WaitTime(J) + RunTime(J)
  - Slowdown: (WaitTime(J) + RunTime(J)) / RunTime(J)
  - Bounded slowdown: max((WaitTime(J) + RunTime(J)) / max(Th, RunTime(J)), 1), where the threshold Th keeps very short jobs from dominating the metric
[Diagram: jobs 1-6 packed as rectangles in the CPUs × time plane; a job's wait time precedes its run time]
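For concreteness, here is a minimal Python sketch of these three metrics; the default threshold of 10 s is an assumption, since the slides do not fix Th:

```python
def response_time(wait, run):
    """Response time: total time from submission to completion."""
    return wait + run

def slowdown(wait, run):
    """Slowdown: response time relative to run time."""
    return (wait + run) / run

def bounded_slowdown(wait, run, th=10.0):
    """Bounded slowdown (BSLD): slowdown with a lower bound of 1,
    using max(th, run) in the denominator so that very short jobs
    do not produce huge slowdown values."""
    return max((wait + run) / max(th, run), 1.0)

# Example: a job that waited 100 s and ran 50 s
print(bounded_slowdown(100.0, 50.0))  # -> 3.0
```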
The EASY Backfilling Policy
• Jobs are executed in FCFS order except when the first job in the wait queue cannot start
• Users have to submit an estimate of the job's run time: the requested time
• When the first job in the wait queue cannot start, a reservation is made for it based on the requested times of the running jobs
• A job is executed before previously arrived ones only if it does not delay the first job in the wait queue (see the sketch below)
[Diagram: on arrival of Job 5, MakeJobReservation(Job5) reserves CPUs for it; Job 6 arrives later and BackfillJob(Job6) starts it immediately because it does not delay Job 5's reservation]
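A compact sketch of this reservation-plus-backfilling logic, assuming a simple machine model with a single free-CPU count; the Job fields and the start callback are illustrative, not Alvio's interface:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cpus: int
    requested_time: float  # the user's run-time estimate

def easy_schedule(wait_queue, running, free_cpus, now, start):
    """wait_queue: FIFO list of Jobs; running: list of (est_finish_time, cpus)
    pairs for running jobs; start(job, time) dispatches a job.
    Returns the number of CPUs still free."""
    # 1. Start jobs strictly in FCFS order while they fit.
    while wait_queue and wait_queue[0].cpus <= free_cpus:
        job = wait_queue.pop(0)
        start(job, now)
        free_cpus -= job.cpus
    if not wait_queue:
        return free_cpus
    # 2. The head job does not fit: make a reservation (shadow time) for it,
    #    based on the requested times of the running jobs.
    head, avail = wait_queue[0], free_cpus
    shadow_time, extra = float("inf"), 0
    for est_finish, cpus in sorted(running):
        avail += cpus
        if avail >= head.cpus:
            shadow_time, extra = est_finish, avail - head.cpus
            break
    # 3. Backfill later jobs only if they do not delay the reservation:
    #    either they finish before the shadow time, or they fit into the
    #    CPUs that remain spare at the shadow time.
    for job in list(wait_queue[1:]):
        if job.cpus > free_cpus:
            continue
        if now + job.requested_time <= shadow_time:
            pass                      # returns its CPUs before the shadow time
        elif job.cpus <= extra:
            extra -= job.cpus         # consumes part of the spare CPUs
        else:
            continue
        wait_queue.remove(job)
        start(job, now)
        free_cpus -= job.cpus
    return free_cpus
```

In the slide's example, step 2 gives Job 5 its reservation and step 3 backfills Job 6, since Job 6 finishes before Job 5's shadow time.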
Power Model
• CPU power is one of the main components of system power
• It consists of dynamic and static power:
  - P_cpu = P_dynamic + P_static
  - P_dynamic = A c f V^2,  P_static = α V
• The fraction X of static power in total CPU power is a model parameter:
  - P_static(V_top) = X (P_static(V_top) + P_dynamic(f_top, V_top))   (X = 25% in our experiments)
• The average activity factor is assumed to be the same for all jobs (2.5 times higher than the idle activity)
• Idle processors: two scenarios, they either consume no power or consume power at the lowest frequency
• DVFS gear set: [table of frequency/voltage pairs lost in extraction; an illustrative gear set appears in the sketch below]
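A small sketch of this power model. The gear set, the folded activity coefficient A·c, and the per-gear voltages are placeholders, since the slide's gear table did not survive extraction; the alpha calibration follows the X = 25% static-fraction definition above:

```python
# Hedged sketch of the CPU power model: P = A*c*f*V^2 + alpha*V.
# The gears (f in GHz, V in volts) are illustrative, not the slide's table.
GEARS = [(1.4, 0.95), (1.6, 1.00), (1.8, 1.05), (2.0, 1.10), (2.2, 1.15)]
F_TOP, V_TOP = GEARS[-1]
X = 0.25  # fraction of static power in total CPU power at the top gear

def dynamic_power(activity_c, f, v):
    return activity_c * f * v ** 2

def static_power(alpha, v):
    return alpha * v

def alpha_for_fraction(activity_c, x=X):
    # Solve alpha*V_top = x * (alpha*V_top + A*c*F_top*V_top^2) for alpha:
    # alpha = x * P_dyn(f_top, V_top) / ((1 - x) * V_top)
    p_dyn_top = dynamic_power(activity_c, F_TOP, V_TOP)
    return x * p_dyn_top / ((1 - x) * V_TOP)

activity_c = 10.0  # arbitrary units; A and c folded into one coefficient
alpha = alpha_for_fraction(activity_c)
for f, v in GEARS:
    p = dynamic_power(activity_c, f, v) + static_power(alpha, v)
    print(f"{f:.1f} GHz: {p:.1f} (arbitrary power units)")
```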
Time Model
• The dependence of execution time on frequency is captured by the following model:
  F(f, β) = T(f) / T(f_top) = β (f_top / f - 1) + 1
  [Hsu, Feng. A Power-Aware Run-Time System for High-Performance Computing. SC'05]
• β is assumed to follow these normal distributions by job size:

  Number of CPUs             β distribution
  less than or equal to 4    N(0.5, 0.01)
  between 4 and 32           N(0.4, 0.01)
  more than 32               N(0.3, 0.064)

• The global application β depends on the communication/computation ratio
• Two β scenarios:
  - β is known in advance (at the moment of scheduling)
  - β is not known in advance (at the moment of scheduling the worst case, β = 1, is assumed)
[Chart: T(f)/T(f_top) versus frequency for β = 0.3, 0.5, 0.7; a sketch of the model follows]
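The model is straightforward to encode; the sketch below assumes a 2.2 GHz nominal frequency purely for illustration:

```python
# Sketch of the Hsu/Feng dilation model: T(f)/T(f_top) = beta*(f_top/f - 1) + 1.
def runtime_at(f, f_top, t_at_top, beta):
    """Predicted run time at frequency f, given the run time at f_top.
    beta = 1: fully CPU-bound (time scales inversely with frequency);
    beta = 0: insensitive to frequency (e.g. communication-bound)."""
    return t_at_top * (beta * (f_top / f - 1.0) + 1.0)

# Example: a 1000 s job at 2.2 GHz slowed down to 1.4 GHz.
for beta in (0.3, 0.5, 0.7, 1.0):
    print(beta, runtime_at(1.4, 2.2, 1000.0, beta))
# beta = 1.0 gives 1000 * (2.2/1.4) ≈ 1571 s; beta = 0.3 gives ≈ 1171 s.
```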
Utilization-Driven Policy
• The frequency is assigned once, at job start time, for the entire job execution, based on system utilization
• Utilization is computed for each interval of length T
• An additional control over system load, WQ_threshold:
  - if there are more than WQ_threshold jobs in the wait queue, no frequency scaling is applied
  - otherwise, a job started during interval k runs at a frequency F_k determined by the previous interval's utilization U_{k-1} (see the sketch below)
[Figure: step function mapping U_{k-1} to F_k, with levels f_lower, f_upper, and f_top separated by the thresholds U_lower and U_upper]
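A sketch of the resulting frequency choice, using the threshold values from the evaluation that follows (U_lower = 50%, U_upper = 80%, f_lower = 1.4 GHz, f_upper = 2.0 GHz); the exact boundary handling and the 2.2 GHz nominal frequency are assumptions:

```python
F_TOP = 2.2  # GHz; assumed nominal frequency

def select_frequency(prev_utilization, wq_len, wq_threshold,
                     u_lower=0.50, u_upper=0.80,
                     f_lower=1.4, f_upper=2.0, f_top=F_TOP):
    """Map the previous interval's utilization to a CPU frequency."""
    if wq_len > wq_threshold:          # queue too long: do not scale
        return f_top
    if prev_utilization < u_lower:     # lightly loaded: lowest gear
        return f_lower
    if prev_utilization < u_upper:     # moderately loaded: middle gear
        return f_upper
    return f_top                       # highly loaded: nominal frequency
```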
Evaluation
• The Alvio simulator, a C++ event-driven parallel job scheduling simulator, was extended for these experiments
• Policy parameters:
  - utilization thresholds: U_lower = 50%, U_upper = 80%
  - reduced frequencies: f_lower = 1.4 GHz, f_upper = 2.0 GHz
  - utilization computation interval: T = 10 min
  - wait queue length threshold: WQ_threshold = 0, 4, 16, no limit
• Metric of job performance: bounded slowdown (BSLD), evaluated with the job's run time at its assigned frequency f
Workloads
• Five workloads from production use have been simulated (Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload):
  - Cornell Theory Center (CTC): large jobs with a relatively low level of parallelism
  - San Diego Supercomputing Center (SDSC): fewer sequential jobs than CTC, similar run-time distribution
  - Lawrence Livermore National Lab: small to medium size jobs
  - Lawrence Livermore National Lab (LLNLThunder): large parallel jobs
  - San Diego Supercomputing Center (SDSCBlue): no sequential jobs
Results: Normalized CPU Energy
• With short wait-queue thresholds, results are very similar for both idle-energy scenarios
• Savings reach up to 12% for the workloads that are not highly loaded
[Chart: normalized CPU energy per workload and WQ_threshold value]
Results: Normalized Performance
• High performance penalty in the least conservative case for the highly loaded workload
• The WQ_threshold has almost no impact
• There is an increase in the number of backfilled jobs
[Chart: normalized performance per workload and WQ_threshold value]
Average Frequency: SDSCBlue
[Chart: average CPU frequency over time for the SDSCBlue workload]
BSLD-Driven Policy
• The frequency is assigned based on the job's predicted performance
• Lower frequency -> longer execution time -> worse job performance metric
• BSLD_threshold controls the allowable performance penalty ("target BSLD")
• To run at a lower frequency f, a job has to satisfy the BSLD condition at f: the job's predicted BSLD at frequency f must be lower than BSLD_threshold (see the sketch below)
[Flowchart: for job Ji, if WQ_size > WQ_threshold, run Ji at F_top; otherwise start from f = F_lowest, find an allocation, and move to the next higher frequency until the allocation satisfies the BSLD condition or f = F_top; run Ji at the resulting frequency f]
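Combining the flowchart with the time model gives a short sketch of the frequency selection; the gear list and the Th constant are illustrative:

```python
# Pick the lowest gear whose predicted BSLD stays under the threshold.
GEARS_GHZ = [1.4, 1.6, 1.8, 2.0, 2.2]  # illustrative; last entry is F_top

def predicted_bsld(wait, run_at_top, f, f_top, beta, th=10.0):
    """Predicted BSLD at frequency f, using the dilation model above."""
    run_f = run_at_top * (beta * (f_top / f - 1.0) + 1.0)
    return max((wait + run_f) / max(th, run_f), 1.0)

def choose_frequency(wait, run_at_top, beta, bsld_threshold,
                     wq_len, wq_threshold, gears=GEARS_GHZ):
    f_top = gears[-1]
    if wq_len > wq_threshold:
        return f_top                   # queue too long: no scaling
    for f in gears[:-1]:               # lowest to highest reduced gear
        if predicted_bsld(wait, run_at_top, f, f_top, beta) < bsld_threshold:
            return f                   # lowest gear meeting the target
    return f_top                       # no reduced gear qualifies
```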
Results: Normalized CPU Energy
• Normalized energies in the two idle-energy scenarios behave in the same way
• Average savings in the most aggressive case: 5% to 23%
• The per-workload difference in savings between the most conservative and the most aggressive threshold combinations ranges from 5% (SDSC) to 15% (LLNLThunder)
• WQ_threshold controls DVFS aggressiveness much better than BSLD_threshold
• BSLD_threshold has a stronger impact when WQ_threshold is higher
Results: Average BSLD
• Strong impact on performance in the most aggressive case
• The impact of WQ_threshold is higher than that of BSLD_threshold
• BSLD_threshold has a stronger impact when WQ_threshold is higher
• The decrease in performance is proportional to the energy savings
[Chart: average BSLD per workload and threshold combination; baseline averages include 1, 1.08, 4.66, 5.15, and 24.91]
Results: Reduced Jobs (out of 5000)
• Performance depends on the number of jobs run at reduced frequency
• It depends on the frequencies used as well
• Notably, the performance of jobs run at the nominal frequency was affected too
• When the load is very high (SDSC), no DVFS is applied; to apply it, the thresholds would have to be set to higher values
Results: Wait Time
• Main problem observed: high impact on wait time
[Chart: zoom of SDSCBlue wait time behavior]
PB-Guided Policy: How DVFS Can Improve Overall Job Performance
• There is a penalty in run time due to frequency scaling, but more jobs can run simultaneously under the same power budget
[Diagram: without DVFS, jobs J1-J5 run at f_top and the power budget forces some of them to wait; with DVFS, jobs run at f_lower so that more of J1-J5 fit under the power budget at the same time]
Power Budgeting: PB-Guided Policy
• Frequency assignment is guided by the predicted job performance and the current power draw
• The BSLD is predicted when selecting a frequency
• BSLD condition: a job satisfies the BSLD condition at reduced frequency f if its predicted BSLD at f is lower than the current value of the BSLD threshold
• The policy is power conservative: a job will be scheduled at the lowest frequency at which both the BSLD condition and the power limit are satisfied
• The closer the current power draw P_current is to the power budget, the higher the BSLD threshold; and the higher the BSLD threshold, the lower the frequency that will be selected (see the sketch below)
[Figure: BSLD threshold as a function of P_current, rising between P_lower and P_upper of the power budget]
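A sketch of this power-dependent threshold. The slide only shows the threshold growing between P_lower and P_upper, so the linear interpolation and the CTC-based default values are assumptions:

```python
def bsld_threshold(p_current, power_budget,
                   p_lower=0.6, p_upper=0.9,
                   bsld_lower=4.66, bsld_upper=9.32):
    """p_lower/p_upper are fractions of the power budget (0.6 and 0.9 in the
    evaluation). bsld_lower/bsld_upper follow the CTC setting below, where
    BSLD_lower = avg(BSLD) without budgeting and BSLD_upper = 2 * BSLD_lower."""
    frac = p_current / power_budget
    if frac <= p_lower:
        return bsld_lower            # far from the budget: strict threshold
    if frac >= p_upper:
        return bsld_upper            # near the budget: permissive threshold
    t = (frac - p_lower) / (p_upper - p_lower)
    return bsld_lower + t * (bsld_upper - bsld_lower)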
Power Budgeting: PB-Guided Policy
• A job can be scheduled with one of two functions; the lowest frequency that satisfies the job will be selected
• BSLD condition: the predicted BSLD at the candidate frequency must stay below the current threshold
• Power limit: the power budget must not be violated during the entire job execution

MakeJobReservation(J)
  scheduled <- false
  shiftInTime <- 0
  nextFinishJob <- next(OrderedRunningQueue)
  while (!scheduled) {
    // try reduced frequencies from the lowest upwards
    f <- FlowestReduced
    while (f < Fnominal) {
      Alloc <- findAllocation(J, currentTime + shiftInTime, f)
      if (satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f)) {
        schedule(J, Alloc); scheduled <- true; break
      }
      f <- nextHigherFrequency
    }
    // at the nominal frequency only the power limit has to hold
    if (!scheduled and f == Fnominal) {
      Alloc <- findAllocation(J, currentTime + shiftInTime, Fnominal)
      if (satisfiesPowerLimit(Alloc, J, Fnominal)) {
        schedule(J, Alloc); break
      }
    }
    // no feasible start yet: shift the candidate start to the next job completion
    shiftInTime <- FinishTime(nextFinishJob) - currentTime
    nextFinishJob <- next(OrderedRunningQueue)
  }

BackfillJob(J)
  f <- FlowestReduced
  while (f < Fnominal) {
    Alloc <- TryToFindBackfilledAllocation(J, f)
    if (correct(Alloc) and satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f)) {
      schedule(J, Alloc); break
    }
    f <- nextHigherFrequency
  }
  if (f == Fnominal) {
    Alloc <- TryToFindBackfilledAllocation(J, Fnominal)
    if (correct(Alloc) and satisfiesPowerLimit(Alloc, J, Fnominal))
      schedule(J, Alloc)
  }
Evaluation
• Policy parameters:
  - power budget thresholds: P_lower = 0.6, P_upper = 0.9
  - BSLD threshold values used: BSLD_lower = avg(BSLD) without power budgeting, BSLD_upper = 2 * BSLD_lower
• Power budget: set to 80% of the total CPU power consumed by the whole system when running at F_nominal (see the sketch below)
• Four workloads from production use have been simulated:

  Workload (# CPUs)     Jobs (K)   Avg BSLD   Utilization   Over PB
  CTC (430)             20 - 25    4.66       70%           72%
  SDSC (128)            40 - 45    24.91      85%           95%
  SDSCBlue (1152)       20 - 25    5.15       69%           74%
  LLNLThunder (4008)    20 - 25    1          80%           89%
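For illustration, deriving such a budget is a one-liner; the 60 W per-CPU figure below is an invented example, not a measured value:

```python
def power_budget(n_cpus, p_cpu_at_nominal, fraction=0.8):
    """80% of the CPU power the whole machine draws at F_nominal."""
    return fraction * n_cpus * p_cpu_at_nominal

# Example: SDSCBlue's 1152 CPUs at an assumed 60 W per CPU at F_nominal
print(power_budget(1152, 60.0))  # -> 55296.0 W
```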
Baseline Power Budgeting Policy
• Power limited without DVFS:
  - no job will start if it would violate the budget, even though there are available processors
  - this case is equivalent to EASY scheduling on a smaller machine
[Diagram: Jobs 1-3 are running; Jobs 4, 5, and 6 arrive; a reservation is made for Job 5 and Job 6 is considered for backfilling, but Job 6 cannot start because of the power budget although processors are free]
Results: Performance
• Oracle case: β values are assumed to be known at scheduling time
• The PB-guided policy shows better performance for all workloads!
• Average wait time decreases with DVFS under a power constraint
[Chart: job performance per workload]
Results: Normalized CPU Energy (idle = 0)
• Oracle case: β values are assumed to be known at scheduling time
[Chart: normalized CPU energy per workload]
Utilization Over Time
[Chart: system utilization over time]
Power Budget Consumed
[Chart: fraction of the power budget consumed over time]
Comparison of Unknown and Known β
• Avg. BSLD, avg. wait time, and avg. energy values are normalized with respect to the corresponding baseline values (EASY backfilling with a power limit and without DVFS)
[Chart: normalized metrics for the unknown-β and known-β scenarios]
Conclusions
• The energy-performance trade-off must be handled carefully: DVFS does not only lengthen job run time, it can also significantly increase job wait time and further degrade job performance
• The trade-off needs to be made at the job scheduling level, since it affects jobs in the wait queue and only the scheduler can estimate the potential negative impact on queued jobs
• Applying DVFS to highly loaded workloads (SDSC) leads to a very high performance penalty
• Parallel job scheduling policies can be designed to maximize job performance under a given power constraint
• DVFS can improve performance in power-constrained HPC centers: using lower CPU frequencies allows more jobs to run simultaneously
• It is not necessary to know β values in advance; moreover, assuming the worst case at scheduling time can give better performance than knowing them in advance
Thank you for your attention!