A Characterization of Approaches to Parallel Job Scheduling
Gerald Sabin, Rajkumar Kettimuthu, Arun Rajan, P. Sadayappan
Supported in part by Sandia National Laboratory

Problem Addressed
• Scheduling of parallel jobs in a heterogeneous grid environment
• Each site has a homogeneous cluster of processors, but processors at different sites have different speeds
• Much of the research on scheduling in heterogeneous systems has focused on independent sequential jobs
• Research on parallel job scheduling has concentrated primarily on the homogeneous context
• The algorithms used for scheduling sequential tasks on heterogeneous systems are too computationally complex to extend to parallel jobs
• We extend the techniques used for parallel job scheduling in a homogeneous context to the heterogeneous context

Simulation Environment
• Heterogeneous sites, with a homogeneous cluster of processors at each site
• A 5000-job subset of either the trace from the 430-node Cornell Theory Center (CTC) system or the trace from the 128-node IBM SP2 system at the San Diego Supercomputer Center (SDSC)
• NAS Parallel Benchmarks 2.0 used to model the heterogeneous runtimes

| Benchmark | SGI Origin 2000 | IBM SP (WN/66) | Cray T3E 900 | IBM SP (P2SC 160 MHz) |
|---|---|---|---|---|
| IS Class B (8 Nodes) | 23.3 | 22.6 | 16.3* | 17.7 |
| MG Class B (8 Nodes) | 35.5 | 34.3 | 25.3 | 17.2* |
| MG Class B (256 Nodes) | 1.3147 | 2.2724 | 1.8 | 1.1* |
| LU Class B (256 Nodes) | 20.328* | 94.893 | 35.6 | 24.2 |

* Denotes best runtime for the job

Metrics
• We use the following metrics for evaluating the proposed schemes:
  • Average Slowdown
  • Average Turnaround Time
  • Utilization
  • Effective Utilization

Backfilling
• A later-arriving job is allowed to leapfrog previously queued jobs

Aggressive vs. Conservative
[Figure: example backfilling schedules, processors vs. time]
• Conservative
  • Every job is given a reservation when it enters the system, and a job is allowed to backfill only if it does not violate any of the previous reservations
• EASY (aggressive)
  • Only the job at the head of the queue is given a reservation, and a job is allowed to backfill if it does not violate this reservation

Conservative vs. Aggressive
• Jobs are processed in arrival order by the meta-scheduler
• In a heterogeneous context, the site where the job starts the earliest may not be the best site
• To obtain the completion time of a job at a particular site, conservative backfilling has to be employed at the local site
• Conservative performs better than aggressive in all cases, quite the opposite of the homogeneous context
• The improvement comes from better backfilling: the dynamic removal of replicated jobs creates holes in each site's schedule, and each site has more jobs with which to attempt a backfill

Restricted Multi-Site Reservations
• Jobs are processed in arrival order by the meta-scheduler
• Greedy assigns each job to the site with the lowest instantaneous load
• Greedy-MR (Multiple Requests) submits each job to all sites
• When the job starts at a site, the other instances are removed
• We have shown this mechanism to be effective in a homogeneous context (HPDC ’02)
• However, only a slight improvement is seen in a heterogeneous context
• When network bandwidth is limited, each job can be submitted to a smaller number of sites; with fewer reservations, the multi-reservation scheduler still realizes a substantial fraction of the benefit of scheduling each job at all sites
• Select the sites to submit to more accurately: instead of using the instantaneous load, we query each site to determine the job's earliest completion time
• When completion time is the criterion, submitting to fewer sites can be almost as effective as submitting to all sites

Efficacy Based Queues
• Explicitly take efficacy into account to improve the effective utilization
• Use efficacy as the priority order for the jobs in the reserved and idle queues
• Starvation free
• Effective utilization increases and turnaround time decreases, in spite of the decrease in raw utilization

Conclusions and Future Work
• Improvements in turnaround time and effective utilization for parallel job scheduling in a heterogeneous grid environment have been demonstrated through simulation
• Next Steps:
  • Incorporate these changes into the Silver/Maui Scheduler
  • Deploy and evaluate the scheduler first on our research clusters, and then at the Ohio Supercomputer Center
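The difference between the Conservative and EASY rules described under Backfilling can be illustrated with a small sketch. It is not the simulator used for the poster: it looks at a single scheduling instant on one site, assumes runtime estimates are exact, and the names `Job`, `earliest_start`, `conservative_starts` and `easy_starts_now` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    procs: int    # processors requested
    runtime: int  # estimated runtime (assumed exact in this sketch)


def min_free(reservations, total, t0, t1):
    """Smallest number of free processors at any time in [t0, t1)."""
    # Usage can only rise at the start of a reservation, so it is enough
    # to check t0 and every reservation start inside the window.
    points = [t0] + [s for (s, _e, _p) in reservations if t0 < s < t1]
    return min(
        total - sum(p for (s, e, p) in reservations if s <= t < e)
        for t in points
    )


def earliest_start(reservations, total, job, now=0):
    """Earliest time the job fits without disturbing any reservation."""
    if job.procs > total:
        raise ValueError("job needs more processors than the site has")
    # Capacity is only released when a reservation ends, so the earliest
    # feasible start is either `now` or one of those end times.
    candidates = sorted({now} | {e for (_s, e, _p) in reservations if e > now})
    for t in candidates:
        if min_free(reservations, total, t, t + job.runtime) >= job.procs:
            return t


def conservative_starts(queue, total):
    """Conservative: every job receives a reservation, in arrival order."""
    reservations, starts = [], {}
    for job in queue:
        t = earliest_start(reservations, total, job)
        reservations.append((t, t + job.runtime, job.procs))
        starts[job.name] = t
    return starts


def easy_starts_now(queue, total):
    """EASY: only the first job that cannot start gets a reservation."""
    reservations, started, head_reserved = [], [], False
    for job in queue:
        t = earliest_start(reservations, total, job)
        if t == 0:
            # Fits now without delaying the head job's reservation.
            reservations.append((0, job.runtime, job.procs))
            started.append(job.name)
        elif not head_reserved:
            reservations.append((t, t + job.runtime, job.procs))
            head_reserved = True
        # Any other job simply waits, with no reservation.
    return started


# Illustrative queue on a hypothetical 4-processor site.
queue = [Job("A", 2, 10), Job("B", 3, 10), Job("C", 4, 10), Job("D", 1, 25)]
print(conservative_starts(queue, 4))  # {'A': 0, 'B': 10, 'C': 20, 'D': 30}
print(easy_starts_now(queue, 4))      # ['A', 'D']
```

In this made-up queue, EASY lets D start immediately because D does not delay the head job B, even though D pushes back C. Conservative refuses the same backfill because C already holds a reservation at time 20.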
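The two site-selection criteria compared on the poster, instantaneous load versus earliest completion time, can be contrasted in a few lines. This is a sketch under stated assumptions: the `Site` record and both selection functions are hypothetical, the runtimes are the LU Class B (256 Nodes) row of the benchmark table, and the load and start-time values are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    load: float            # instantaneous load (invented values)
    earliest_start: float  # start time reported by the local scheduler (invented)
    runtime: float         # runtime of this job at the site (LU Class B row)


def completion_time(site):
    """Earliest completion of the job at this site."""
    return site.earliest_start + site.runtime


def greedy_by_load(sites):
    """Greedy: pick the single site with the lowest instantaneous load."""
    return min(sites, key=lambda s: s.load)


def k_best_by_completion(sites, k):
    """Restricted multi-site: submit only to the k sites that would
    finish the job earliest."""
    return sorted(sites, key=completion_time)[:k]


sites = [
    Site("SGI Origin 2000", load=0.9, earliest_start=30.0, runtime=20.328),
    Site("IBM SP (WN/66)", load=0.2, earliest_start=0.0, runtime=94.893),
    Site("Cray T3E 900", load=0.6, earliest_start=10.0, runtime=35.6),
    Site("IBM SP (P2SC 160 MHz)", load=0.7, earliest_start=15.0, runtime=24.2),
]

print(greedy_by_load(sites).name)
print([s.name for s in k_best_by_completion(sites, k=2)])
```

With these invented loads, the least-loaded site is also the slowest one for this job, so the load-based choice finishes last, while the completion-time ranking prefers the two sites that are busier but faster.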
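The poster lists its metrics without defining them. The sketch below uses the conventional definitions as an assumption: turnaround time is completion minus arrival, slowdown is turnaround divided by runtime, and utilization is busy processor-time divided by available processor-time over the span of the schedule. Effective utilization is omitted because its definition is not given here. The `FinishedJob` record is a hypothetical type for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinishedJob:
    arrival: float
    start: float
    end: float
    procs: int


def turnaround(job):
    """Time from submission to completion (wait time plus runtime)."""
    return job.end - job.arrival


def slowdown(job):
    """Turnaround time relative to the job's actual runtime."""
    return turnaround(job) / (job.end - job.start)


def average(values):
    values = list(values)
    return sum(values) / len(values)


def utilization(jobs, total_procs):
    """Busy processor-time over available processor-time, measured from
    the first arrival to the last completion."""
    span = max(j.end for j in jobs) - min(j.arrival for j in jobs)
    busy = sum(j.procs * (j.end - j.start) for j in jobs)
    return busy / (total_procs * span)


# Two illustrative jobs on a hypothetical 4-processor site.
jobs = [FinishedJob(0, 0, 10, 2), FinishedJob(0, 10, 20, 4)]
print(average(turnaround(j) for j in jobs))  # 15.0
print(average(slowdown(j) for j in jobs))    # 1.5
print(utilization(jobs, total_procs=4))      # 0.75
```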