Fairness in Job Scheduling on CPlant

Fairness in Job Scheduling on CPlant. Vitus Leung Sandia National Labs Gerald Sabin RNET Technologies, Inc P. Sadayappan The Ohio-State University.


Presentation Transcript


  1. Fairness in Job Scheduling on CPlant • Vitus Leung, Sandia National Labs • Gerald Sabin, RNET Technologies, Inc • P. Sadayappan, The Ohio-State University • Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Table Of Contents • Introduction to Job Scheduling • The original C-Plant/Ross scheduler • Improving fairness • Simulation Environment • Results • Conclusion • Questions

  3. Introduction to Job Scheduling • Independent parallel jobs • Job specifies number of nodes and expected runtime • Jobs run on a parallel machine with a fixed number of nodes (C-Plant Ross) • Examples: PBS, MAUI, …

  4. Introduction to Job Scheduling • Primary focus of research in job scheduling has been to increase utilization and improve desired user metrics • Very little research so far has addressed “fairness” in job scheduling • CPlant scheduler uses a “fair-share” measure to order jobs in the queue: How fair is the scheduler?

  5. Assessing Fairness • Possible approach: For each job, find the number of jobs with higher usage-count that are serviced while the job waits • Problem: How to account for “benign” back-filling that uses slots in the schedule not usable by this job? • Proposed approach: Assign a “fair-start” time to each job when it is submitted • by generating a non-backfilling, in-order schedule based on fairness priority • if the actual start time does not exceed the fair-start time, the job is considered fairly treated, else unfairly treated
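
The fair-start idea on this slide could be sketched roughly as follows. This is a minimal sketch and not the authors' simulator: the names `fair_start_times` and `treated_fairly`, the tuple layouts, and the use of user-estimated runtimes are assumptions made for illustration. Jobs are placed strictly in fairness-priority order with no backfilling, and a job counts as fairly treated when its actual start does not exceed the computed time.

```python
import heapq


def fair_start_times(total_nodes, running, queue, now=0.0):
    """Build a non-backfilling, in-order schedule.

    running: list of (end_time, nodes) for jobs already on the machine.
    queue:   list of (job_id, nodes, est_runtime), already sorted by
             fairness priority (hypothetical layout).
    Returns {job_id: fair_start_time}.
    """
    free = total_nodes - sum(n for _, n in running)
    releases = list(running)          # min-heap keyed on end time
    heapq.heapify(releases)
    t = now                           # start times never move backwards
    starts = {}
    for job_id, nodes, runtime in queue:
        if nodes > total_nodes:
            raise ValueError(f"{job_id} needs more nodes than the machine has")
        # Advance time until enough nodes have been released.
        while free < nodes:
            end, n = heapq.heappop(releases)
            t = max(t, end)
            free += n
        starts[job_id] = t
        free -= nodes
        heapq.heappush(releases, (t + runtime, nodes))
    return starts


def treated_fairly(actual_start, fair_start):
    """A job is fairly treated if it starts no later than its fair-start time."""
    return actual_start <= fair_start
```

In the example used for testing, a one-node job queued behind a four-node job gets a fair-start time after the wide job, even though a backfilling scheduler could start it earlier. Starting it earlier is the "benign" backfilling the slide mentions, and it is never counted as unfair.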

  6. Introduction to Job Scheduling (options) • Reservation Depth: number of jobs which are reserved/“blocked” • 0 (No Guarantee Backfilling) • 1 (Aggressive/EASY Backfilling) • Unlimited (Conservative Backfilling) • Queue Priority: sorting of waiting jobs • FCFS, SJF, LJF, Fairness • Starvation Limits/Selective Reservations (provide “artificial” starvation limits) • Wait time limit • Usage (node*hours) • How many jobs can starve (per system or per user?) • Only “fair” jobs can starve?

  7. Depth of Reservation: Conservative Backfilling • Queue sorted in priority order • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q5]

  8. Depth of Reservation: Conservative Backfilling • Queue sorted in priority order • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q5]

  9. Depth of Reservation: Conservative Backfilling • Queue sorted in priority order • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q5] • No job is delayed by a later-arriving job • Higher priority jobs have a better chance of backfilling • Guaranteed starvation free, with bounded delays

  10. Depth of Reservation: Aggressive Backfilling • Queue sorted in priority order • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q3]

  11. Depth of Reservation: Aggressive Backfilling • Queue sorted in priority order • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q5]

  12. Depth of Reservation: Aggressive Backfilling • Queue sorted in priority order • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q5] • Possibility for longer narrow jobs to start • All but the first job can be continually unfairly delayed • Starvation free (assuming progress in queue priority) but unbounded delays
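
The aggressive (EASY) rule, where only the head of the queue holds a reservation, could be sketched as the check below. This is a sketch of the textbook EASY test rather than the CPlant code. The function name `easy_can_backfill` and the argument layout are assumptions for illustration. A candidate may start now if it fits in the free nodes and either finishes before the head job's reserved start (the "shadow" time) or uses only nodes the head job will not need at that time.

```python
def easy_can_backfill(free_now, running, head_nodes,
                      cand_nodes, cand_runtime, now=0.0):
    """Return True if a candidate job may backfill under EASY.

    free_now:   nodes idle right now.
    running:    list of (end_time, nodes) for running jobs.
    head_nodes: nodes requested by the job at the head of the queue.
    """
    # Find the shadow time: the earliest time the head job can start.
    avail = free_now
    shadow = now
    for end, nodes in sorted(running):
        if avail >= head_nodes:
            break
        avail += nodes
        shadow = end
    if avail < head_nodes:
        raise ValueError("head job can never fit on this machine")
    extra = avail - head_nodes        # nodes left over at the shadow time

    if cand_nodes > free_now:
        return False                  # does not fit right now
    finishes_in_time = now + cand_runtime <= shadow
    uses_spare_nodes = cand_nodes <= extra
    return finishes_in_time or uses_spare_nodes
```

Because only the head job is protected, any job further back in the queue can be pushed later by a backfilled job. That is the unbounded delay noted on the slide.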

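  13. Depth of Reservation: No Guarantee Backfilling • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q3]

  14. Depth of Reservation: No Guarantee Backfilling • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q4]

  15. Depth of Reservation: No Guarantee Backfilling • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q5]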

  16. Depth of Reservation: No Guarantee Backfilling • [Figure: schedule diagram, processors vs. time, with running jobs R1, R2 and queued jobs Q1 to Q5] • First job which fits is selected • Starvation is a problem • All jobs can be continually unfairly delayed • Possibly good utilization

  17. Queuing Priority • FCFS • “fair” on a per job basis • guarantees a static queue order • SJF/LJF/WJF… • Reorder jobs for backfilling order • Attempt to improve average user metrics and utilization by sorting jobs in an “intelligent” way • Newly arriving jobs can move other jobs back in the queue • Possibility of starvation unless all jobs have static reservations • Fair Share • Reorders jobs • Attempts to improve user “fairness”
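
The queue orderings listed on this slide differ only in their sort key, which the sketch below tries to make concrete. The `Job` record, the policy names, and the `usage` mapping (accumulated usage per user) are assumptions made for illustration. The source does not give the exact fair-share formula, so lower accumulated usage simply sorts first here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    user: str
    submit: float        # submission time
    nodes: int
    est_runtime: float   # user-supplied runtime estimate


def order_queue(jobs, policy, usage=None):
    """Return the waiting jobs sorted according to the chosen policy."""
    usage = usage or {}
    keys = {
        # First come, first served: static order by submission time.
        "FCFS": lambda j: (j.submit,),
        # Shortest / longest job first, ties broken by submission time.
        "SJF": lambda j: (j.est_runtime, j.submit),
        "LJF": lambda j: (-j.est_runtime, j.submit),
        # Fair share (assumed form): users with less usage go first.
        "FAIRSHARE": lambda j: (usage.get(j.user, 0.0), j.submit),
    }
    return sorted(jobs, key=keys[policy])
```

Only the FCFS key is independent of later arrivals and of other users' behaviour. This is why the other orderings can move a waiting job back in the queue.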

  18. “Starvation” Thresholds • Scheduler changes normal policy for a “starving job” when some threshold is crossed, e.g. wait-time of 1 day • Selective reservations for a starving job • Attempt to eliminate starvation under a scheduling policy which is not starvation free • Not needed if the policy is starvation free • Many free variables which need tweaking (and are not dynamic) • Can adversely affect fairness for other jobs

  19. “Starvation” cont. • When is a job starving? • Exceeded wait time? • Exceeded slowdown? • What value of wait time/slowdown? • Can a user who has used more than their “fair share” be considered starving? • What binary limit do you place on fair share? • How many starving jobs get a reservation? • Per user or per system?

  20. Table Of Contents • Introduction to Job Scheduling • The original C-Plant/Ross scheduler • Increasing fairness • Simulation Environment • Results • Conclusion • Questions

  21. C-Plant scheduler • No Guarantee backfilling • Fair share queue priority (decaying node-hours) • Jobs with a waittime > 24/72 hours are considered starving and: • Are placed in a virtual queue by receiving a higher priority than non-starving jobs • Are sorted in FCFS order instead of by fairshare • Head of queue has a reservation (aggressive backfilling)
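
The queue order described on this slide could be sketched as below. This is a sketch under stated assumptions: the names `decayed_usage` and `cplant_queue_order`, the tuple layout, and the decay factor of 0.5 per accounting period are hypothetical, since the source says only that usage is measured in decaying node-hours. Starving jobs form a virtual queue ahead of everything else and are ordered FCFS. The remaining jobs are ordered by their owner's decayed usage.

```python
def decayed_usage(previous_usage, new_node_hours, decay=0.5):
    """Age the old usage, then add this period's node-hours (assumed form)."""
    return previous_usage * decay + new_node_hours


def cplant_queue_order(jobs, usage, now, starvation_limit_hours=24.0):
    """Sort waiting jobs: starving jobs first (FCFS), the rest by fair share.

    jobs:  list of (job_id, user, submit_hour).
    usage: {user: decayed node-hours}.
    """
    def key(job):
        job_id, user, submit = job
        if now - submit > starvation_limit_hours:
            return (0, submit, 0.0)                    # virtual starvation queue
        return (1, usage.get(user, 0.0), submit)       # fair-share order
    return sorted(jobs, key=key)
```

In the test case, a job from the heaviest user jumps ahead of light users once it has waited longer than the limit. This is the effect the following slides describe as unfairness introduced by the starvation queue.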

  22. C-Plant scheduler (implications) • Jobs do not necessarily run in fair share order • Allows for unfair use of the machine • No Guarantee Backfilling • “Starvation Queue”/FCFS order • Unbounded wait times and starvation forces system admins to start jobs manually • “Good” utilization and average user metrics

  23. Table Of Contents • Introduction to Job Scheduling • The original C-Plant/Ross scheduler • Increasing fairness • Simulation Environment • Results • Conclusion • Questions

  24. Suggestions to Improve Fairness • Runtime Limitation • Cap runtimes at 72 hours • Improves fairness by allowing “preemption” • Improve user metrics by allowing “preemption” • Scripts have been developed to help ease the burden on the user • Minimal impact on fair long jobs expected
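
One way to picture the runtime cap: a long job becomes a chain of segments, each no longer than the cap, with the scheduler free to run other work between them. The helper below is only an illustration. It assumes the application can checkpoint and restart at segment boundaries, and `split_runtime` is a hypothetical name rather than one of the scripts mentioned on the slide.

```python
def split_runtime(total_hours, cap_hours=72.0):
    """Split a requested runtime into consecutive segments of at most cap_hours."""
    if total_hours <= 0 or cap_hours <= 0:
        raise ValueError("runtimes must be positive")
    segments = []
    remaining = total_hours
    while remaining > cap_hours:
        segments.append(cap_hours)
        remaining -= cap_hours
    segments.append(remaining)        # final, possibly shorter, segment
    return segments
```

A 200-hour request becomes three submissions, and each boundary is a point where the queue is re-evaluated. That re-evaluation is the "preemption" effect the slide refers to.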

  25. Suggestions to Improve Fairness • Increase starvation limit from 24 to 72 (or greater?) • Reduces “unfairness” due to FCFS queue • Does not address lack of fairness due to no guarantees • Prevents jobs from starving forever • Minimal impact on standard average user metrics and utilization

  26. Suggestions to Improve Fairness • Do not allow a “starving reservation” for users who are “hogging” the machine • Introduces fairness to the “virtual” starvation queue • Very minimal impact on standard user metrics and utilization • Only tracks usage through system time windows • Simple change to existing scheduler, minimal impact to normal users
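
The rule on this slide amounts to one extra condition on entry to the starvation queue, as in the sketch below. The definition of "hogging" used here (usage above a multiple of the mean usage across users) and the names `may_enter_starvation_queue` and `hog_factor` are assumptions made for illustration. The slides leave the actual threshold open.

```python
def may_enter_starvation_queue(wait_hours, user_usage, all_usages,
                               limit_hours=24.0, hog_factor=2.0):
    """Decide whether a waiting job may receive a starvation reservation.

    all_usages: usage figures for every user in the accounting window.
    A user counts as "hogging" (assumed definition) when their usage
    exceeds hog_factor times the mean usage.
    """
    if wait_hours <= limit_hours:
        return False                  # not starving yet
    mean_usage = sum(all_usages) / len(all_usages)
    return user_usage <= hog_factor * mean_usage
```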

  27. Suggestions to Improve Fairness • Conservative Backfilling • Eliminates starvation • Queue still sorted by “fair-share”, fairness still matters • Deterministic worst case start time upon submittal of each job • FCFS “feel”, each job receives an initial reservation in arrival order • An unfair job can still delay a fair job during backfilling
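
Conservative backfilling can be sketched as giving every job, in queue order, a reservation at the earliest slot that does not disturb any reservation already made. The code below is a simplified illustration, not the simulator used in the study. The names `earliest_start` and `conservative_schedule` and the tuple layouts are assumptions. It relies on the fact that the earliest feasible start is always either the current time or the end of an existing reservation.

```python
def earliest_start(total_nodes, reservations, nodes, runtime, now=0.0):
    """Earliest time a job fits for its whole runtime.

    reservations: list of (start, end, nodes) already committed.
    """
    def used_at(t):
        return sum(n for s, e, n in reservations if s <= t < e)

    def fits(t):
        # Usage only rises at reservation starts, so checking t and every
        # start inside the job's window is enough.
        points = [t] + [s for s, _, _ in reservations if t < s < t + runtime]
        return all(used_at(p) + nodes <= total_nodes for p in points)

    candidates = sorted({now} | {e for _, e, _ in reservations if e > now})
    for t in candidates:
        if fits(t):
            return t
    raise ValueError("job needs more nodes than the machine has")


def conservative_schedule(total_nodes, queue, now=0.0):
    """queue: list of (job_id, nodes, est_runtime) in reservation order."""
    reservations, starts = [], {}
    for job_id, nodes, runtime in queue:
        t = earliest_start(total_nodes, reservations, nodes, runtime, now)
        reservations.append((t, t + runtime, nodes))
        starts[job_id] = t
    return starts
```

Each start time is fixed when the job is processed. This gives the deterministic worst-case start time noted on the slide, and also means a job processed earlier keeps its slot even if a more "fair" job arrives later.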

  28. Suggestions to Improve Fairness • Conservative backfilling with dynamic reservations • Removes FCFS “feel” from conservative backfilling • A job can never delay a more “fair” job • Starvation is possible • User has control • If the user does not submit more jobs (or adds artificial dependencies), progress in the queue is guaranteed, eliminating starvation • Implements the spirit of the fair share policy • No unfair job will ever delay a more fair job

  29. Table Of Contents • Introduction to Job Scheduling • The original C-Plant/Ross scheduler • Increasing fairness • Simulation Environment • Results • Conclusion • Questions

  30. Simulation Environment • Event driven simulator • Actual CPlant traces are used as input to the simulator • CPlant/Ross • December 02 – June 03
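
An event-driven simulator of this kind can be pictured as a priority queue of arrival and completion events, with the scheduling policy invoked after each event. The sketch below is a deliberately small illustration and not the simulator used in the study. It assumes a trace of `(submit, job_id, nodes, runtime)` tuples and a plain FCFS policy without backfilling, and `simulate_fcfs` is a hypothetical name.

```python
import heapq

FINISH, ARRIVE = 0, 1     # finishes are processed before arrivals at equal times


def simulate_fcfs(total_nodes, trace):
    """Replay a trace and return {job_id: start_time} under strict FCFS."""
    events = [(submit, ARRIVE, job_id, nodes, runtime)
              for submit, job_id, nodes, runtime in trace]
    heapq.heapify(events)
    free = total_nodes
    waiting = []              # FIFO queue of (job_id, nodes, runtime)
    starts = {}
    while events:
        time, kind, job_id, nodes, runtime = heapq.heappop(events)
        if kind == FINISH:
            free += nodes
        else:
            waiting.append((job_id, nodes, runtime))
        # Scheduling pass: start jobs strictly in order while the head fits.
        while waiting and waiting[0][1] <= free:
            jid, n, r = waiting.pop(0)
            free -= n
            starts[jid] = time
            heapq.heappush(events, (time + r, FINISH, jid, n, r))
    return starts
```

Replacing the scheduling pass with a different policy (for example a backfilling rule) while keeping the event loop unchanged is the usual way to compare policies on the same trace.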

  31. Table Of Contents • Introduction to Job Scheduling • The original C-Plant/Ross scheduler • Increasing fairness • Simulation Environment • Results • Conclusion • Questions

  32. Results • Original CPlant/Ross policy • 24 hour starvation, any job can starve, no max runtime • Small tweaks • 72 hour starvation, any job can starve, no max runtime • 24 hour starvation, unfair jobs cannot starve, no max runtime • 24 hour starvation, any job can starve, 72 hour max runtime • 72 hour starvation, unfair jobs cannot starve, 72 hour max runtime

  33. Results • Reduce % jobs which miss fair start time • Loss of capacity is generally slightly lower • Combining all three “enhancements” shows the most improvement
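
The headline metric on this slide, the percentage of jobs that miss their fair-start time, could be computed from simulation output roughly as below. The function name and the two dictionaries (actual and fair-start times keyed by job id) are assumptions for illustration. No numbers from the study are reproduced here.

```python
def percent_missing_fair_start(actual, fair):
    """Percentage of jobs whose actual start is later than their fair-start time.

    actual, fair: {job_id: time}, with every job in `fair` present in `actual`.
    """
    if not fair:
        raise ValueError("no jobs to evaluate")
    missed = [job for job in fair if actual[job] > fair[job]]
    return 100.0 * len(missed) / len(fair)
```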

  34. Results • “Heavy” users with high wait times benefit • Reduce “extreme” wait time for mid-range users • A very light user actually gets worse • Still seems unfair

  35. Results • “Heavy” users with high wait times benefit • Reduce “extreme” wait time for mid-range users • A very light user actually gets worse • Still seems unfair

  36. Results • Fundamental changes • Conservative backfilling • Conservative backfilling with 72 hr runtime limits • Conservative backfilling with dynamic reservations • Conservative backfilling with dynamic reservations and 72hr runtime limits

  37. Results • Possible to further reduce percent of jobs which miss fair start time • Conservative backfilling (static) can be bad for fairness • Small increase (~3%) in loss of capacity (for dynamic reservations)

  38. Results • Heavy users are appropriately penalized • Light users are given better treatment • Medium users can still perform worse than heavy users

  39. Results • Heavy users are appropriately penalized • Light users are given better treatment • Medium users can still perform worse than heavy users

  40. Results • Previously unfair users improve the most • No dramatic increase in wait time • Most users improve

  41. Results • Previously unfair users improve the most • No dramatic increase in wait time • Most users improve

  42. Results • Previously unfair users improve the most • No dramatic increase in wait time • Most users improve

  43. Conclusions • Proposed a new way of quantitatively assessing how well a fair-share policy is implemented by a scheduler • The original scheduling policy causes unfair treatment of about 10% of jobs • Effects of several possible changes to scheduling policy were evaluated through simulations • Change of starvation threshold (from 24 to 72 hours) • Imposition of maximum time limit for jobs • Disallowing “unfair” jobs from starvation-queue • Use of reservations for all jobs (conservative back-fill variations) instead of starvation-queue mechanism • Simulations show that modifications can reduce unfairness to under 3% of jobs • Several issues for further investigation

  44. Future Work • More detailed analysis • Trade-off between fairness and average response time • Extent of unfairness experienced by different job categories • “Robustness” of scheduler under high-load and user-hogging scenarios • Perform analysis on other Sandia traces (West/Alaska) to find any possible inconsistencies and trace-dependent results • Determine desirable scheduling strategy for the “Institutional Cluster” • Improve utilization while maintaining fairness • Slack based backfilling • Selective reservations • Generate blocking without backfilling • Use expansion factor/wait time to influence priority

  45. Future Work • Effects of limiting job submissions • Number of jobs; node-hour limits • User/Admin control over fairness • Allow a more flexible priority to take into account user needs • User defined checkpointing • Allow users to inform scheduler of checkpoint and send a signal • Checkpointed jobs can achieve lower turnaround time by taking advantage of currently unused cycles (improve utilization) • Transparency • Give user estimates (how accurate can we be under different scheduling policies)

  46. Acknowledgments • Thanks to Jeanette Johnston for the discussions regarding the current policy and possible improvements • Thanks to Jon Stearly for going out of his way to discuss “fairness” and for getting the raw CPlant logs

  47. Questions?
