A Dynamic Space Sharing Method For Resource Management • Gabriel Mateescu, Research Computing Support Group, National Research Council Canada, Gabriel.Mateescu@nrc.ca • HPCS 2001 presentation, Windsor, Ontario, June 19-20, 2001
Agenda • Motivation • Outline of the Approach • Job Taxonomy • Pseudocode • Evaluation
Motivation • Continuously increasing demand for computing resources is met by clusters and by distributed- or shared-memory supercomputers • Dual objective: • optimize resource utilization and achieve high throughput • provide quality of service to users: low turn-around time • Batch scheduling based on space sharing and static partitioning has limited scalability • Main contribution: a method for achieving both high resource utilization and low turn-around time
The Problem • Parallel supercomputer/cluster shared among a number of divisions • Dual objective: • optimize resource utilization and achieve high throughput • provide quality of service to users: low turn-around time • Batch scheduling based on space sharing and static partitioning has limited scalability • Main contribution: a method for achieving both high resource utilization and low turn-around time
Example • [Figure: a parallel computer partitioned between the Biotech Department and the Physics Department. Legend: node (CPU + memory), partition boundary, job requests]
Outline of the Approach • Dynamic space sharing method for batch job scheduling • Partition the resources into a set of dedicated queues • Dedicated queues own resources • Free resources can be borrowed by pending jobs for which there are not enough per-queue resources • Borrowed resources are grouped in a shared queue • Borrowed resources can be reclaimed by the lending queue • Reclaiming is done by checkpointing the jobs which hold borrowed resources
Outline • The sum of the resources assigned to jobs in a dedicated queue does not exceed the resource limits of the queue • The difference between the total amount of resources and the resources currently assigned to the dedicated queues represents opportunity for scheduling jobs for which there are not enough per-queue resources • Each user belongs to a group and each group is authorized to submit jobs to some dedicated queues as well as to the shared queue
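The two accounting rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `DedicatedQueue`, its fields `limits` and `assigned`, and the helper `system_free` are hypothetical names that do not appear in the slides, and borrowing by the shared queue is left out to keep the invariant visible.

```python
from dataclasses import dataclass, field


@dataclass
class DedicatedQueue:
    """Hypothetical model of a dedicated queue that owns resources."""
    name: str
    limits: dict                                  # resource type -> amount owned
    assigned: dict = field(default_factory=dict)  # resource type -> amount held by jobs

    def free(self):
        # Resources owned by the queue and not assigned to any of its jobs.
        return {r: cap - self.assigned.get(r, 0) for r, cap in self.limits.items()}

    def allocate(self, request):
        # Enforce the invariant: assigned resources never exceed the queue limits.
        free = self.free()
        if any(n > free.get(r, 0) for r, n in request.items()):
            raise ValueError(f"request exceeds the limits of queue {self.name}")
        for r, n in request.items():
            self.assigned[r] = self.assigned.get(r, 0) + n


def system_free(queues):
    # Total resources minus resources assigned in the dedicated queues:
    # the pool that jobs lacking per-queue resources could draw on.
    total = {}
    for q in queues:
        for r, n in q.free().items():
            total[r] = total.get(r, 0) + n
    return total
```

A job asking for more than its own queue has free is rejected by `allocate`, yet it may still be schedulable if `system_free` covers the request; that gap is what the shared queue exploits.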
Dedicated Resources • Job 1 in queue 1: 1 x resource 1 + 2 x resource 2 • Job 2 in queue 2: 2 x resource 1 + 1 x resource 2 • [Figure: resource 1 and resource 2 partitioned between queue 1 and queue 2]
Borrowed Resources • Job 3 in queue 1: 1 x resource 1 + 2 x resource 2 • [Figure: resource 1 and resource 2 partitioned between queue 1 and queue 2]
Resource Reclaiming • Job 4 in queue 2: 1 x resource 1 + 2 x resource 2 • [Figure: resource 1 and resource 2 partitioned between queue 1 and queue 2]
Paths of a Job • [Figure: flow diagram with the nodes new job, submit queue, three dedicated queues, shared queue, and finished job]
Job Taxonomy • master job: has resource requirements which can be satisfied from the free resources available to its queue • fittable job: has resource requirements which can be satisfied by reclaiming some resources that its queue has lent to the shared queue • movable job: has resource requirements which exceed the amount of resources owned by its queue and not already allocated to jobs; however, the requirements of such a job may be satisfied from the system-wide free resources • blocked job: there are not enough resources, either owned by its queue or available in other queues, to satisfy the job's requirements, or the job is not checkpointable
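The taxonomy can be read as a decision procedure. The sketch below is one possible reading, not the author's implementation: `JobType`, `fits`, `classify`, and all parameter names are hypothetical. It also assumes that the checkpointability condition applies to movable jobs, since those run on borrowed resources and may later be preempted; the slides do not state this explicitly.

```python
from enum import Enum


class JobType(Enum):
    MASTER = "master"
    FITTABLE = "fittable"
    MOVABLE = "movable"
    BLOCKED = "blocked"


def fits(request, available):
    # True if every requested amount is covered by what is available.
    return all(n <= available.get(r, 0) for r, n in request.items())


def classify(request, queue_free, queue_lent, system_free, checkpointable):
    """Classify a pending job.

    queue_free:  resources owned by the job's queue, neither allocated nor lent
    queue_lent:  resources owned by the job's queue, currently borrowed
                 by jobs in the shared queue
    system_free: free resources across all queues
    """
    if fits(request, queue_free):
        return JobType.MASTER
    # Reclaiming everything the queue has lent would make the job fit.
    reclaimable = {r: queue_free.get(r, 0) + queue_lent.get(r, 0)
                   for r in set(queue_free) | set(queue_lent)}
    if fits(request, reclaimable):
        return JobType.FITTABLE
    # Assumption: only checkpointable jobs may run on borrowed resources.
    if checkpointable and fits(request, system_free):
        return JobType.MOVABLE
    return JobType.BLOCKED
```

The order of the tests matters: a job is treated as master whenever possible, so preemption and borrowing are used only when the queue's own free resources are insufficient.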
Job State Transition Diagram • [Figure: state diagram with the states new job, pending job, master job, fittable job, movable job, slave job, running master, running slave, and finished job; transition labels include "enough per-queue resources", "start job", and "preempt slave"]
Pseudocode
scheduler() {
    queues = sort_dedicated_queues();
    while (scheduling_is_on) {
        // dispatch newly submitted jobs to their dedicated queues
        new_jobs = get_jobs_in_submit_queue();
        dispatch_to_dedicated_queue(new_jobs);
        foreach queue in (queues) {
            jobs = get_pending_jobs(queue);
            order_jobs(jobs);
            foreach job in (jobs) {
                type = get_job_type(job);
                resources = get_job_resources(job);
                if (type == master || type == fittable) {
                    if (type == fittable) {
                        // preempt slave jobs holding resources lent by this queue
                        victim_jobs = reclaim(resources);
                        re_queue(victim_jobs);
                    }
                    allocate_resources(resources);
                    start_job(job);
                } else if (type == movable) {
                    // run as a slave job on system-wide free resources
                    ok = system_resources(resources);
                    if (ok) {
                        move_to_shared_queue(job);
                        start_job(job);
                    }
                }
            }
        }
    }
}
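To make one pass of the loop concrete, here is a minimal runnable sketch in Python. It is an illustration under assumptions, not the original scheduler: the classes `Job`, `Queue`, `Slave` and the functions `borrow`, `reclaim`, `schedule_pass` are hypothetical, checkpointing is reduced to re-queuing the preempted job in its home queue, borrowing takes free resources greedily in queue order, and job ordering within a queue is simply submission order.

```python
from dataclasses import dataclass, field


def fits(req, avail):
    return all(n <= avail.get(r, 0) for r, n in req.items())

def add(a, b):
    return {r: a.get(r, 0) + b.get(r, 0) for r in set(a) | set(b)}

def sub(a, b):
    return {r: n - b.get(r, 0) for r, n in a.items()}

@dataclass
class Job:
    name: str
    request: dict
    checkpointable: bool = True

@dataclass
class Queue:
    name: str
    owned: dict
    used: dict = field(default_factory=dict)     # held by this queue's master jobs
    lent: dict = field(default_factory=dict)     # borrowed by slave jobs
    pending: list = field(default_factory=list)

    def free(self):
        return sub(sub(self.owned, self.used), self.lent)

@dataclass
class Slave:
    job: Job
    home: Queue
    loans: dict                                  # lender queue name -> resources

def borrow(request, queues):
    # Greedily take free resources from the queues, recording each loan.
    loans, need = {}, dict(request)
    for q in queues:
        free = q.free()
        take = {r: min(n, free.get(r, 0)) for r, n in need.items()}
        take = {r: n for r, n in take.items() if n > 0}
        if take:
            q.lent = add(q.lent, take)
            loans[q.name] = take
            need = sub(need, take)
    return loans

def reclaim(q, request, queues, slaves):
    # Preempt slaves holding resources lent by q until the request fits.
    by_name = {x.name: x for x in queues}
    for s in list(slaves):
        if fits(request, q.free()):
            break
        if q.name in s.loans:
            for lender, res in s.loans.items():
                by_name[lender].lent = sub(by_name[lender].lent, res)
            slaves.remove(s)
            s.home.pending.append(s.job)         # stands in for checkpoint + re-queue

def schedule_pass(queues, slaves):
    started = []
    for q in queues:
        for job in list(q.pending):
            if fits(job.request, q.free()):                      # master job
                pass
            elif fits(job.request, add(q.free(), q.lent)):       # fittable job
                reclaim(q, job.request, queues, slaves)
            else:
                pool = {}
                for x in queues:
                    pool = add(pool, x.free())
                if job.checkpointable and fits(job.request, pool):  # movable job
                    slaves.append(Slave(job, q, borrow(job.request, queues)))
                    q.pending.remove(job)
                    started.append((job.name, "slave"))
                continue                                         # otherwise blocked
            q.used = add(q.used, job.request)
            q.pending.remove(job)
            started.append((job.name, "master"))
    return started
```

With two queues owning 4 CPUs each, a 3-CPU job followed by another 3-CPU job in the first queue yields one master and one slave; a later 4-CPU job in the second queue is fittable, so the slave is preempted and returned to its home queue's pending list.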
Job Statistics • SGI Origin 2000 with 108 CPUs and 48 GB of main memory • Resources are partitioned among six dedicated queues defined for six groups of users • Average system load, including short interactive jobs: ~94 • Total jobs running: 33, CPUs allocated: 103, memory: 39 GB • Slave jobs running: 11, CPUs allocated: 22, memory: 11 GB • Jobs waiting: 3 • Checkpoints/day per slave job: ~1.5
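The figures above translate into utilization ratios with simple arithmetic. The snippet below only restates the numbers on the slide; it assumes that the slave-job counts are included in the totals, which the slide does not say explicitly. Under that assumption, about 95% of the CPUs and about 81% of the memory are allocated, and slave jobs account for one third of the running jobs and about 21% of the allocated CPUs.

```python
# Figures taken from the slide.
total_cpus, total_mem_gb = 108, 48
used_cpus, used_mem_gb = 103, 39
slave_cpus, slave_mem_gb = 22, 11
running_jobs, slave_jobs = 33, 11

cpu_utilization = used_cpus / total_cpus       # fraction of CPUs allocated
mem_utilization = used_mem_gb / total_mem_gb   # fraction of memory allocated

# Assumption: slave jobs are counted within the totals above.
slave_cpu_share = slave_cpus / used_cpus       # share of allocated CPUs that is borrowed
slave_job_share = slave_jobs / running_jobs    # share of running jobs that are slaves

print(f"CPU utilization: {cpu_utilization:.3f}")
print(f"Memory utilization: {mem_utilization:.4f}")
print(f"Slave share of allocated CPUs: {slave_cpu_share:.3f}")
print(f"Slave share of running jobs: {slave_job_share:.3f}")
```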
Advantages • Combines the advantages of space-sharing and time-sharing scheduling • Space sharing allocates resources for the duration of the job and gives predictable execution time • Time sharing improves resource utilization • We combine space sharing with job preemption • Jobs to preempt are selected according to the current usage of the resources, rather than a static job priority
Evaluation • Complexity: O(J N R + J log J), where J = number of pending or slave jobs, N = number of supernodes, R = number of types of resources • Reduces the waiting time of jobs by harnessing resources not used by the dedicated queues • Reduces job execution time by reserving resources for all but the slave jobs • No job that fits in a dedicated queue can be prevented from running by a slave job
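One plausible reading of the complexity bound, offered here as an assumption since the slides do not derive it, is that the J log J term comes from ordering the jobs and the J N R term from checking each job against every supernode and every resource type. The hypothetical function `feasibility_scan` below counts the elementary checks to make the two terms visible.

```python
def feasibility_scan(jobs, nodes):
    """jobs: list of (priority, request) pairs; nodes: list of free-resource dicts.

    Returns a feasibility flag per job (in priority order) and the number
    of per-node, per-resource checks performed.
    """
    ordered = sorted(jobs, key=lambda j: j[0])       # O(J log J)
    checks = 0
    feasible = []
    for _, request in ordered:                       # J jobs
        total = {}
        for node in nodes:                           # N supernodes
            for r in request:                        # at most R resource types
                total[r] = total.get(r, 0) + node.get(r, 0)
                checks += 1
        feasible.append(all(total.get(r, 0) >= n for r, n in request.items()))
    return feasible, checks
```

For J jobs that each request all R resource types, the scan performs exactly J x N x R checks, which dominates the sort unless N and R are very small.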