Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling. M. Islam, P. Balaji, G. Sabin and P. Sadayappan. Computer Science and Engineering, Ohio State University; Mathematics and Computer Science, Argonne National Laboratory; RNet Technologies.
Job Schedulers Today • Publicly Usable Supercomputer Centers • Becoming increasingly common (OSC, SDSC, etc.) • Jobs submitted with resource requirements • CPUs, Memory, Estimated Runtime • Scheduler maps the requirements of the jobs to available resources • If resources are available, the job is scheduled immediately • Else, it is queued and scheduled to execute at a later time • Several job schedulers exist today: PBS, Maui, Silver • Independent Parallel Job Scheduling Model • Dynamically arriving Independent Parallel Jobs • Popular model in most supercomputers
Simple Job Scheduler Model [Figure: a User submits jobs to the Job Scheduler, which holds a Reservation Queue and an Execution Queue and tracks the Processors' Status over a Processor Space of six processors, P1 to P6] • Job J1: 2 processors, 1 hour • Job J2: 5 processors, 1 hour • Job J3: 4 processors, 1 hour
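The "schedule immediately if resources are available, else queue" rule can be sketched as follows. This is a minimal illustration only: the class and method names are hypothetical, and the sketch tracks only processor counts (not memory or runtime). The job sizes and the six-processor machine are taken from the slide above.

```python
from collections import deque


class SimpleScheduler:
    """Toy scheduler (hypothetical): run a job now if enough processors are free."""

    def __init__(self, total_procs):
        self.free_procs = total_procs
        self.running = []      # names of jobs currently executing
        self.queue = deque()   # names of jobs waiting for processors

    def submit(self, name, procs):
        """Start the job immediately if it fits, otherwise queue it."""
        if procs <= self.free_procs:
            self.free_procs -= procs
            self.running.append(name)
            return "running"
        self.queue.append(name)
        return "queued"


sched = SimpleScheduler(total_procs=6)
states = [sched.submit("J1", 2), sched.submit("J2", 5), sched.submit("J3", 4)]
print(states, "free processors:", sched.free_procs)
```

In this sketch J2 waits because only four processors are free when it arrives, while the smaller J3 still fits.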
Two Dimensional Scheduling Grid [Figure: grid of Processors versus Time, marked at Current Time, showing Running Jobs and the Job Queue for jobs J1 to J6]
Previous Research in Job Scheduling • Significant prior research on best-effort scheduling • Optimizations proposed for different metrics • Utilization (U): what fraction of the resources is actually utilized • U = Resource Used / Resource Provided • Response Time (RT): time from submission to completion • RT = Job’s completion time – Job’s arrival time • Slowdown (SD): how much slower a job completes on the shared system than on a dedicated system • SD = Job’s Response Time / Job’s Runtime • Prioritization: Static (user or group based) and Dynamic (how long the job was in the queue) • NERSC cluster provides static prioritization based on job cost
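The three metrics above are simple ratios and differences. The sketch below computes them directly from the slide's definitions; the function names and the sample numbers are illustrative assumptions, with time measured in hours and resources in processor-hours.

```python
def utilization(used_proc_hours, provided_proc_hours):
    """U = Resource Used / Resource Provided."""
    return used_proc_hours / provided_proc_hours


def response_time(arrival, completion):
    """RT = job's completion time - job's arrival time."""
    return completion - arrival


def slowdown(arrival, completion, runtime):
    """SD = job's response time / job's runtime (1.0 means no waiting)."""
    return response_time(arrival, completion) / runtime


# Hypothetical job: arrives at hour 0, waits 2 hours, then runs for 1 hour.
print(response_time(0.0, 3.0))   # 3.0
print(slowdown(0.0, 3.0, 1.0))   # 3.0
print(utilization(60.0, 100.0))  # 0.6
```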
QoS in Job Scheduling • Users can request guarantees on turnaround time • E.g., submit a job before leaving work at 5pm and request a deadline of 8am the next morning • Two Components for QoS in Job Scheduling • Job Scheduling Component [islam03:qops] • Admission Control: Can we meet the specified deadline? • Once admitted, a job cannot miss its specified deadline • Revenue Management • Appropriate charging model • Urgent jobs cost more than non-urgent jobs • Need to prioritize jobs such that the incoming revenue is maximized • [islam03:qops] “QoPS: A QoS based scheme for Parallel Job Scheduling”, M. Islam, P. Balaji, P. Sadayappan and D. K. Panda. Published in JSSPP ’03 and LNCS ’04.
Opportunity Cost in Job Scheduling [Figure: grid of Processors versus Time, marked at Current Time, with labels Running Jobs; J1, J2, J3; J4 ($10) with deadline D4; J5 ($500) with deadline D5] • By scheduling J4, we lost the future opportunity to schedule the more expensive job J5 • J4 has an opportunity cost of at least $500
Problem Statement • When the user submits a job, she pays an explicit cost • However, the system also pays an implicit opportunity cost • Accepting a job is beneficial if its explicit cost is greater than its opportunity cost • How do we determine the opportunity cost? • It depends on future jobs, so there is no way to know it in advance • How do we design a predictive algorithm to estimate the opportunity cost of a job?
Presentation Layout • Introduction and Motivation • Background on QoPS and QoS Cost Models • Minimizing Opportunity Cost with Value-aware QoPS • Dynamic “Self-learning” Value-aware QoPS • Performance Results • Conclusions
QoPS: QoS for Parallel Job Scheduling • Advance Reservation (before QoPS) • Before QoPS, the only way to guarantee a turnaround time • Execution time window statically decided upfront • Resources underutilized due to fragmentation • If resources become available early, the job can’t be rescheduled • Primary Goals of QoPS: • Provide admission control • When a new job arrives: • Reorder existing jobs to find feasible schedules • Select the best feasible schedule • Ensure deadline guarantees for the accepted jobs • A later arriving job cannot force an existing job to miss its deadline!
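The admission-control idea above (reorder queued jobs, accept the new job only if every deadline still holds) can be illustrated with a deliberately simplified sketch. This is not the QoPS search itself: it assumes all processors are idle at time 0, uses plain list scheduling without backfilling, only tries inserting the new job at each queue position, and returns the first feasible order rather than the best one. The names `Job`, `completion_times` and `admit` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Job:
    name: str
    procs: int       # processors requested (assumed <= total processors)
    runtime: float   # estimated runtime in hours
    deadline: float  # absolute deadline in hours from now


def completion_times(order: List[Job], total_procs: int) -> Dict[str, float]:
    """List-schedule jobs in the given order; return each job's finish time."""
    free_at = [0.0] * total_procs  # time at which each processor becomes free
    done = {}
    for job in order:
        free_at.sort()
        start = free_at[job.procs - 1]  # earliest time job.procs processors are free
        end = start + job.runtime
        for i in range(job.procs):
            free_at[i] = end
        done[job.name] = end
    return done


def admit(queue: List[Job], new_job: Job, total_procs: int) -> Optional[List[Job]]:
    """Return a feasible order containing new_job, or None to reject it."""
    if new_job.procs > total_procs:
        return None
    # Try the least disruptive position (end of queue) first.
    for pos in range(len(queue), -1, -1):
        candidate = queue[:pos] + [new_job] + queue[pos:]
        done = completion_times(candidate, total_procs)
        if all(done[j.name] <= j.deadline for j in candidate):
            return candidate
    return None


queue = [Job("J1", 2, 1.0, 2.0), Job("J2", 4, 1.0, 3.0)]
order = admit(queue, Job("J3", 2, 1.0, 1.0), total_procs=4)
print([j.name for j in order] if order else "rejected")
```

Because already admitted jobs are part of every candidate order that is checked, a new job is never accepted at the expense of an existing deadline.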
Supercomputer Cost Model • Most supercomputer centers today do not provide QoS • Jobs are scheduled in a best-effort manner • Thus, no special cost models for QoS either • Some supercomputers provide prioritization (e.g., NERSC) • Different queues of jobs exist • More expensive queues get higher priority • For QoS-driven supercomputers, a new model is required • Provider-centric: Supercomputer center determines the charge • User-centric: User offers the price / bid
Market-based User-centric Cost Model • User offers a price to the system • Market-based bidding system • Proposed by Culler and Chase • Price offered reduces with time (decay factor) • Offered price touches zero at the job deadline time [Figure: Revenue versus Time; revenue starts at Maximum Revenue and falls to zero at the Deadline]
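A decaying price of this kind can be sketched as a function of completion time. The slide only states that the price reduces with time and reaches zero at the deadline; the linear shape, the function name `offered_price`, and its parameters are assumptions made for illustration.

```python
def offered_price(max_revenue, submit_time, deadline, completion_time):
    """Price paid if the job completes at completion_time.

    Assumes a linear decay from max_revenue at submission to zero at the
    deadline; the actual decay curve may differ.
    """
    if completion_time >= deadline:
        return 0.0
    if completion_time <= submit_time:
        return float(max_revenue)
    remaining = (deadline - completion_time) / (deadline - submit_time)
    return max_revenue * remaining


# Hypothetical job: offers $100, submitted at hour 0, deadline at hour 10.
for t in (0, 5, 10, 12):
    print(t, offered_price(100, 0, 10, t))
```

Under this assumption, the decay factor is the slope `max_revenue / (deadline - submit_time)`: a tighter deadline for the same maximum price gives a steeper curve.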
Value-aware QoPS (VQoPS) • Job acceptance is based on two criteria: • The deadline should be achievable (evaluated using QoPS) • The job should provide enough revenue to offset a statically assumed opportunity cost • Opportunity cost = product of a fixed opportunity cost factor (OC-Factor) and the size of the job (i.e., number of processor-hours requested) • Large jobs (more nodes or longer running) have a higher opportunity cost since they can potentially impact more later-arriving jobs • The OC-Factor has to be tuned by the system administrator based on the expected workload! • Complicated to evaluate • Difficult to adapt if the workload changes
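The two acceptance criteria can be written down directly. In the sketch below the function names are hypothetical, the deadline check is passed in as a boolean (in VQoPS it would come from QoPS), and the job sizes and OC-Factor value are made-up numbers; only the $10 and $500 prices echo the example jobs J4 and J5.

```python
def static_opportunity_cost(oc_factor, procs, runtime_hours):
    """Opportunity cost = OC-Factor x job size in processor-hours."""
    return oc_factor * procs * runtime_hours


def vqops_accept(deadline_feasible, offered_revenue, oc_factor, procs, runtime_hours):
    """Accept only if the deadline is achievable and revenue offsets the cost."""
    cost = static_opportunity_cost(oc_factor, procs, runtime_hours)
    return deadline_feasible and offered_revenue >= cost


OC_FACTOR = 5.0  # dollars per processor-hour, an assumed administrator setting

# A $10 job of 4 processor-hours has an assumed opportunity cost of $20: rejected.
print(vqops_accept(True, 10.0, OC_FACTOR, procs=4, runtime_hours=1.0))
# A $500 job of 5 processor-hours has an assumed opportunity cost of $25: accepted.
print(vqops_accept(True, 500.0, OC_FACTOR, procs=5, runtime_hours=1.0))
```

With an OC-Factor of 0 the second criterion always passes, so the sketch reduces to the deadline check alone.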
VQoPS: An Example Scenario [Figure: grid of Processors versus Time, marked at Current Time, with labels Running Jobs; J1, J2, J3; J4 ($10) with deadline D4, marked “Less than static opportunity cost (C)”; J5 ($500) with deadline D5] • By not scheduling J4, we retained the future opportunity to schedule the more expensive job J5 • Choosing the right OC-Factor is important for the scheme to be effective
VQoPS performance for different traces • No single static OC-Factor is best for all cases. • Best OC-Factor is dependent on trace characteristics.
Dynamic “Self-learning” Value-aware QoPS (DVQoPS) • Estimate the OC-Factor dynamically for best revenue gain • The best OC-Factor depends on • System load • Relative frequency of urgent jobs • Relative price of urgent jobs • DVQoPS uses a history-based adaptive technique to account for all of these factors • Perform what-if simulations by rolling back over recent history to find the best OC-Factor
What-if Simulations in DVQoPS [Figure: timeline split into rollback intervals; in each interval the candidate OC-Factors O1, O2, O3, ..., ON are simulated and the winner is used for the next interval] • O3 gave us the best revenue: pick O3 • O2 gave us the best revenue: pick O2 • We dynamically pick the OC-Factor that gave the best revenue in the previous roll-back interval
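The selection step can be sketched as "replay the last interval once per candidate OC-Factor and keep the winner". The replay model below is a toy stand-in for a real what-if simulation: it assumes a fixed budget of processor-hours for the interval and ignores deadlines and timing. All names (`replay_revenue`, `pick_oc_factor`, `capacity`) and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

JobRecord = Tuple[float, float]  # (offered revenue, size in processor-hours)


def replay_revenue(oc_factor: float, history: List[JobRecord], capacity: float) -> float:
    """Toy what-if run: replay past arrivals in order under one OC-Factor."""
    remaining = capacity
    revenue = 0.0
    for price, size in history:
        # Accept only jobs that fit and that offset their assumed opportunity cost.
        if size <= remaining and price >= oc_factor * size:
            remaining -= size
            revenue += price
    return revenue


def pick_oc_factor(candidates: Sequence[float], history: List[JobRecord],
                   capacity: float) -> float:
    """Return the candidate OC-Factor with the highest replayed revenue."""
    return max(candidates, key=lambda oc: replay_revenue(oc, history, capacity))


# A cheap job arrives first, then an expensive one; both cannot fit.
history = [(10.0, 4.0), (500.0, 5.0)]
print(pick_oc_factor([0.0, 5.0, 200.0], history, capacity=6.0))
```

Here an OC-Factor of 0 admits the cheap job and blocks the expensive one, while 200 rejects both, so the middle candidate wins the replay.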
Balancing Sensitivity and Stability • Sensitivity: Too long a rollback window loses sensitivity to small changes in the workload • Stability: Too short a rollback window loses stability and causes the results to be noisy • Need to calculate the rollback window dynamically [Figure: Impact of Rollback Window Size]
Simulation Setup • Two categories of jobs • Urgent Jobs • Normal Jobs • Job Mixes (Urgent, Normal): • (80%, 20%), (50%, 50%), (20%, 80%) • Urgency factor: • Urgent Job Revenue = URG_FACT x Normal Job Revenue • URG_FACT values used: 10, 5, 2 • URG_FACT determines the height and steepness of the cost model curve
Impact of Job Mix (% of Urgent Jobs) • DVQoPS performs within 2-3% of the best VQoPS implementation
Service Differentiation and Job Urgency • DVQoPS provides an appropriate amount of service differentiation depending on the cost difference • As job urgency increases, higher VQoPS values perform better • DVQoPS automatically adjusts itself
Impact of Inaccurate User Estimates • Overall improvement in revenue drops considerably • Inaccurate estimates result in a lot of wastage due to strict provisioning • DVQoPS still performs within 2% of the best VQoPS implementation • 15% better than QoPS
Concluding Remarks and Future Work • QoS in Scheduling is a new concept with growing interest • Schemes such as QoPS (our previous work) that provide deadlines exist, but they do not deal with system revenue • In this paper, we analyzed the behavior of systems when a cost model is introduced • System dynamism adds a new parameter “Opportunity Cost” which makes the issue unpredictable • We presented two schemes, VQoPS and DVQoPS, which analyze Opportunity cost and minimize its impact • Simulations show up to 200% better performance in some cases • Future Work: Integrating QoS and prioritization and incorporating the code into standard schedulers
Thank You! Contacts: • M. Islam: islammo@cse.ohio-state.edu • P. Balaji: balaji@mcs.anl.gov • G. Sabin: gsabin@rnet-tech.com • P. Sadayappan: saday@cse.ohio-state.edu • Web pointers: http://www.mcs.anl.gov/~balaji
QoPS: An Example Scenario [Figure: sequence of candidate orderings of the queued jobs J1 to J6 and the new job JN, explored with MAX_ALLOWED_VIOLATION = 2; CURRENT_VIOLATION is shown at 0 and then at 1]
Rollback Interval • The effective rollback interval is re-estimated once every MAX_ROLLBACK_INTERVAL (e.g., 128 hr) • MaxRevenue = Revenue(currentSchedule) • For each testInterval in {1 hr, 4 hr, 16 hr, 64 hr, 128 hr} • Run a what-if simulation by rolling back testInterval • Revenue = revenue of the resulting schedule • If Revenue > MaxRevenue • MaxRevenue = Revenue • Effective Rollback Interval = testInterval • End for
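The loop above translates almost line for line into code. In this sketch the what-if simulation is passed in as a callable, since its internals are not given on the slide; `what_if_revenue`, the function name, and the sample revenues are hypothetical. Returning `None` when no interval beats the current schedule is also an assumption (the caller would presumably keep its previous interval).

```python
from typing import Callable, Optional, Sequence, Tuple


def effective_rollback_interval(
    current_revenue: float,
    what_if_revenue: Callable[[int], float],
    test_intervals: Sequence[int] = (1, 4, 16, 64, 128),  # hours, from the slide
) -> Tuple[Optional[int], float]:
    """Return (best interval in hours or None, best revenue seen)."""
    max_revenue = current_revenue
    best_interval = None
    for interval in test_intervals:
        # Roll back by `interval` hours and re-simulate the schedule.
        revenue = what_if_revenue(interval)
        if revenue > max_revenue:
            max_revenue = revenue
            best_interval = interval
    return best_interval, max_revenue


# Made-up what-if revenues for each candidate interval.
fake_results = {1: 90.0, 4: 120.0, 16: 150.0, 64: 140.0, 128: 100.0}
print(effective_rollback_interval(100.0, fake_results.__getitem__))
```

With these made-up numbers the 16-hour window is selected, which fits the trade-off described earlier: very short and very long windows both do worse.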