210 likes | 238 Views
Explore new scheduling techniques to automate application-level scheduling using performance models. Learn about drawbacks of current workflow schedulers and innovative solutions presented at the VGrADS Workshop in Sep. 2005.
E N D
New Workflow Scheduling TechniquesPresentation: Anirban Mandal VGrADS Workshop @UCSD, Sep 2005
Outline • Drawbacks of Workflow Scheduler v.0 • Middle-Out Scheduling • Scheduling onto systems with batch queues • Scheduling onto Abstract Resource Classes Premise: Automating good application level scheduling using performance models by taking advantage of vgES features
While all available components not mapped For each (component, resource) pair ECT(c,r) = rank(c,r) + EAT(r) End For each Run min-min, max-min and sufferage Store mapping End while Top-Down Scheduling For each heuristic Until all components mapped Map available components to resources Select mapping with minimum makespan Top-Down
Drawbacks of Workflow Scheduler v.0 • Top-Down Workflow Scheduler suffers from • Myopia • Top-down traversal implies no look ahead • Potential of poor mapping of critical steps for decisions taken higher up in the workflow • Assumption of instant resource availability • Many systems have batch queue front ends • Have to wait before job starts • Scaling problems • Scheduling onto individual nodes pose scaling problems in large resource environments - Issue raised at the site-visit
Ryan’s talk Addressing the Drawbacks • We address the drawbacks as follows • Myopia • Middle-Out Scheduling • Schedule critical step first and propagate mapping up and down • Assumption of instant resource availability • Incorporating batch queue wait times to take scheduling decisions (Joint work: Rice+UCSB) • Scaling problems • Using a two-level scheduling strategy - explicit resource pruning using vgDL/other means and then scheduling (Joint Work: Rice+UCSD+Hawaii) • Scheduling onto abstract resource classes / clusters
Middle-Out Scheduling Key step Top-Down Middle-Out
Middle-Out Scheduling: Results • Compared makespans for middle-out vs. top-down scheduling • Resource set: 5 clusters [2 Opteron clusters and 3 Itanium clusters] • 6 resource-topology scenarios : combination of Opteron clusters close, normal and far with Fast and Slow Itaniums - {(Opteron close, Fast Itanium), ..} • Application: Actual EMAN DAG with 3 different communication-to-computation ratios (CCR): 0.1, 1 and 10 • Used known performance model values for computational components • Varied file sizes to obtain desired CCR for each pair of synchronization points
Middle-Out Scheduling: Results • CCR: 0.1 • Computation 10 times the communication • Fast Itanium makes top-down scheduler to “get stuck” at the Itanium clusters • Since key computation step is scheduled on both the Opteron clusters, makespan depends on the Opteron connectivity • In the slow Itanium case, the top-down scheduler “got lucky” • Gain from middle-out scheduling not much
Middle-Out Scheduling: Results • CCR: 1 • Equal communication and computation • Fast Itanium makes top-down scheduler to “get stuck” at the Itanium clusters • Since key computation step is scheduled on both the Opteron clusters, makespan depends on the Opteron connectivity • In the slow Itanium case, the top-down scheduler “got lucky”
Middle-Out Scheduling: Results • CCR: 10 • Communication 10 times the computation • Fast Itanium makes top-down scheduler to “get stuck” at the Itanium clusters • Since key computation step is scheduled on both the Opteron clusters, makespan depends on the Opteron connectivity • In the slow Itanium case, the top-down scheduler “got lucky”
Middle-Out Scheduling: Results • With increasing communication, the middle-out scheduler performs better when the top-down scheduler gets stuck
Outline • Drawbacks of Workflow Scheduler v.0 • Middle-Out Scheduling • Scheduling onto systems with batch queues • Scheduling onto Abstract Resource Classes
Scheduling onto Batch-Queue Systems • Incorporated Point-value predictions for batch queue wait times • Slight modification to the top-down scheduler • At every scheduling step, take into account the estimated time the job has to wait in the queue in the estimated completion time for the job [ECT(c,r) in the algorithm] • Keep track of the queue wait times for each cluster and the number of nodes that correspond to the queue wait time • With each mapping, update the estimated availability time [EAT in the algorithm] with the queue wait time, as required Joint work with Dan Nurmi and Rich Wolski
Scheduling onto Batch-Queue Systems: Example Cluster 1 Cluster 0 Input DAG R1 R0 R2 R3 Queue Wait Time [Cluster 0] = 20 # nodes for this wt. time = 1 Queue Wait Time [Cluster 1] = 10 # nodes for this wt. time = 2 T
Scheduling onto Batch-Queue Systems: Example Cluster 1 Cluster 0 Input DAG R1 R0 R2 R3 Queue Wait Time [Cluster 0] = 20 # nodes for this wt. time = 1 Queue Wait Time [Cluster 1] = 10 # nodes for this wt. time = 2 T
Outline • Drawbacks of Workflow Scheduler v.0 • Middle-Out Scheduling • Scheduling onto systems with batch queues • Scheduling onto Abstract Resource Classes • Addressing the scaling problem • Modify scheduler to schedule onto clusters instead of individual nodes
Scheduling onto Clusters • Input: • Workflow DAG with restricted structure - nodes at the same level do the same computation • Set of available Clusters (numNodes, arch, CPU speed etc.) and inter-cluster network connectivity • Per-node performance models for each cluster • Output: • Mapping: for each level the number of instances mapped to each cluster • Objective: • Minimize makespan at each step
Scheduling onto Clusters: Modeling • Abstract modeling of mapping problem for a DAG level • Given: • N instances • M clusters • r1..rM nodes/cluster • t1..tM - rank value per node per cluster (incorporates both computation and communication) • Aim: • To find a partition (n1, n2,… nM) of N such that overall time is minimized with n1+n2+..nM = N • Analytical solution: • No ‘obvious’ solution because of the discrete nature
Scheduling onto Clusters • Iterative approach to solve the problem • Addresses the scaling issue For each instance, i from 1 to N For each cluster, j from 1 to M Tentatively map i onto j Record makespan for each j by taking care of round(j) End For each Find cluster, p with minimum makespan increase Map i to p Update round(p), numMapped(p) End For each O(#instances * #clusters)
Middle-Out Scheduling Key step Top-Down Middle-Out