Rescheduling Sathish Vadhiyar
Rescheduling Motivation
• Heterogeneity and contention can cause an application’s performance to vary over time
• Rescheduling decisions in response to changes in resource performance are necessary
  - Performance degradation of the running applications
  - Availability of “better” resources
Modeling the Cost of Redistribution
• Cthreshold depends on:
  - Model accuracy
  - Load dynamics of the system

Redistribution Cost Model for Jacobi 2D
• Emax – average iteration time of the processor that is farthest behind
• Cdev – processor performance deviation variable
Experiments
• 8 processors were used
• A loading event consisting of a parallel program was introduced 3 minutes after Jacobi started
• The number of tasks in the loading event was varied
• Cthreshold – 15 seconds
Malleable Jobs
• Parallel Jobs
  - Rigid – only one set of processors
  - Moldable – flexible during job start, but cannot be reconfigured during execution
  - Malleable – flexible during job start as well as during execution
Rescheduling in GrADS
• Performance-oriented migration framework
• Tightly coupled policies for suspension and migration
• Takes into account load characteristics and remaining execution times
• Migration of an application depends on:
  - The amount of increase or decrease in loads on the system
  - The time of the application execution when load is introduced into the system
  - The performance benefits that can be obtained due to migration
Components:
• Migrator
• Contract Monitor
• Rescheduler
SRS Checkpointing Library
• End application instrumented with user-level checkpointing library
• Enables reconfiguration of executing applications across distinct domains
• Allows fault tolerance
• Uses IBP (Internet Backplane Protocol) for storage and retrieval of checkpoints
• Needs Runtime Support System (RSS) – an auxiliary daemon that is started with the parallel application
• Simple API
  - SRS_Init()
  - SRS_Restart_Value()
  - SRS_Register()
  - SRS_Check_Stop()
  - SRS_Read()
  - SRS_Finish()
  - SRS_StoreMap(), SRS_DistributeFunc_Create(), SRS_DistributeMap_Create()
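SRS Internals
[Figure: the MPI application, linked with SRS, polls the Runtime Support System (RSS) for a STOP signal. On start, checkpoints are written to IBP storage; on restart, they are read back from IBP with possible redistribution.]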
SRS API: original code and SRS-instrumented code

Original code:

    /* begin code */
    MPI_Init()
    /* initialize data */
    loop{
    }
    MPI_Finalize()

SRS-instrumented code:

    /* begin code */
    MPI_Init()
    SRS_Init()
    restart_value = SRS_Restart_Value()
    if(restart_value == 0){
        /* initialize data */
    }
    else{
        SRS_Read("data", data, BLOCK, NULL)
    }
    SRS_Register("data", data, SRS_INT, data_size, BLOCK, NULL)
    loop{
        stop_value = SRS_Check_Stop()
        if(stop_value == 1){
            exit();
        }
    }
    SRS_Finish()
    MPI_Finalize()
SRS Example – Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size/size;

    /* rank 0 initializes the global array */
    if(rank == 0){
        for(i=0; i<global_size; i++){
            global_A[i] = i;
        }
    }

    /* block-distribute the array across all processes */
    MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);

    iter_start = 0;
    for(i=iter_start; i<global_size; i++){
        proc_number = i/local_size;   /* owner of global element i */
        local_index = i%local_size;   /* its position in the owner's block */
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }

    MPI_Finalize();
SRS Example – Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size/size;

    restart_value = SRS_Restart_Value();
    if(restart_value == 0){
        /* fresh start: initialize and distribute the data */
        if(rank == 0){
            for(i=0; i<global_size; i++){
                global_A[i] = i;
            }
        }
        MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
        iter_start = 0;
    }
    else{
        /* restart: read the checkpointed data and loop counter */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }

    /* register the data to be checkpointed; the loop counter is i */
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
SRS Example – Modified Code (Contd.)

    for(i=iter_start; i<global_size; i++){
        /* stop cleanly if the RSS has signalled a STOP */
        stop_value = SRS_Check_Stop();
        if(stop_value == 1){
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }

    SRS_Finish();
    MPI_Finalize();
Components (Continued)
Contract Monitor:
• Monitors the progress of the end application
• Tolerance limits specified to the contract monitor
  - Upper contract limit – 2.0
  - Lower contract limit – 0.7
• When it receives the actual execution time for an iteration from the application:
  - Calculates the ratio between actual and predicted times
  - Adds it to the average ratio
  - Adds it to the last_5_avg
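The ratio bookkeeping described above can be sketched as follows. This is a minimal illustration and not the GrADS implementation: the `RatioStats` struct and the `ratio_stats_*` function names are hypothetical, and it assumes that last_5_avg means the mean of the five most recent ratios.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 5  /* assumed meaning of "last 5": the five most recent ratios */

/* Hypothetical bookkeeping for actual/predicted ratios (names are illustrative). */
typedef struct {
    double sum;             /* sum of all ratios seen so far */
    size_t count;           /* number of ratios seen so far */
    double window[WINDOW];  /* ring buffer holding the most recent ratios */
} RatioStats;

void ratio_stats_init(RatioStats *s) {
    s->sum = 0.0;
    s->count = 0;
    for (size_t k = 0; k < WINDOW; k++) s->window[k] = 0.0;
}

/* Record one iteration and return its actual/predicted ratio. */
double ratio_stats_add(RatioStats *s, double actual, double predicted) {
    assert(predicted > 0.0);
    double ratio = actual / predicted;
    s->window[s->count % WINDOW] = ratio;  /* overwrite the oldest entry */
    s->sum += ratio;
    s->count++;
    return ratio;
}

/* Mean of all ratios seen so far (0.0 if none). */
double ratio_stats_average(const RatioStats *s) {
    return s->count ? s->sum / (double)s->count : 0.0;
}

/* Mean of the most recent ratios, at most WINDOW of them (0.0 if none). */
double ratio_stats_last5_avg(const RatioStats *s) {
    size_t n = s->count < WINDOW ? s->count : WINDOW;
    if (n == 0) return 0.0;
    double total = 0.0;
    for (size_t k = 0; k < n; k++) total += s->window[k];
    return total / (double)n;
}
```

A caller would invoke `ratio_stats_add` once per reported iteration and then compare `ratio_stats_average` against the contract limits.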
Contract Monitor
• If average ratio > upper contract limit
  - Contact rescheduler
  - Request for rescheduling
  - Receive reply
  - If reply is “SORRY. CANNOT RESCHEDULE”:
    - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
    - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
    - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
    - prev_predicted_time = new_predicted_time

Contract Monitor
• If average ratio < lower contract limit
  - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
  - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
  - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
  - prev_predicted_time = new_predicted_time
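The two limit checks can be sketched as a single decision function. This is only an illustration: the `ContractAction` enum and `contract_check` are hypothetical names, and the formulas for adjusting the limits are deliberately left out because the slides do not state them.

```c
#include <assert.h>

/* Hypothetical outcome of comparing the average ratio with the contract limits. */
typedef enum {
    CONTRACT_OK,                  /* ratio within [lower, upper]: nothing to do */
    CONTRACT_REQUEST_RESCHEDULE,  /* ratio above the upper limit: contact the rescheduler */
    CONTRACT_ADJUST_PREDICTION    /* ratio below the lower limit: recompute prediction and limits */
} ContractAction;

/* Decide what the contract monitor should do for the current average ratio. */
ContractAction contract_check(double avg_ratio, double upper_limit, double lower_limit) {
    assert(lower_limit < upper_limit);
    if (avg_ratio > upper_limit) return CONTRACT_REQUEST_RESCHEDULE;
    if (avg_ratio < lower_limit) return CONTRACT_ADJUST_PREDICTION;
    return CONTRACT_OK;
}
```

With the limits given earlier (2.0 and 0.7), an average ratio of 2.5 would trigger a rescheduling request, while 0.5 would trigger only a recalculation of the predicted time and limits.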
Rescheduler
• A metascheduling service
• Operates in 2 modes
  - On request: when the contract monitor requests rescheduling, i.e. during performance degradation
  - Opportunistic: periodically queries the Database Manager for recently completed GrADS applications and migrates executing applications to make use of freed resources
Application and Metascheduler Interactions
[Flowchart. Stages: User (problem parameters) → Resource Selection (initial list of machines) → Permission Service (requesting permission; abort if permission is denied) → Application Specific Scheduling (application-specific schedule) → Contract Development by the Contract Negotiator (contract approved?) → Application Launching (problem parameters, final schedule) → Application Completion? If the application completed, exit; if the application was stopped, wait for the restart signal. "Get new resource information" edges lead back into the earlier stages.]
Rescheduler Architecture
[Diagram. The Application Manager launches the application and, if it was stopped, waits for the restart signal and gets new resource information. The application reports its execution time to the Contract Monitor, which sends a request for migration to the Rescheduler. The Rescheduler sends a STOP signal to the Runtime Support System (RSS), which the application queries for the STOP signal. The Database Manager stores STOP and RESUME.]
Experiments and Results: Rescheduling on Request
• Different problem sizes of ScaLAPACK QR
• msc – fast machines; opus – slow machines
• Initial set of resources consisted of 4 msc and 8 opus machines
• The performance model always chose the 4 msc machines for the application run
• 5 minutes into the application run, artificial load was introduced on the 4 msc machines
• The application migrated from UT to UIUC
[Figure legend: Rescheduling, No rescheduling. Annotation: the rescheduler decided not to reschedule for size 8000. Wrong decision!]
Rescheduling Depending on Amount of Load
• ScaLAPACK QR problem size – 12000
• Load introduced 20 minutes after application start
• The amount of load was varied
[Figure legend: No rescheduling, Rescheduling. Annotation: the rescheduler decided not to reschedule. Wrong decision!]

Rescheduling Depending on Load Introduction Time
• ScaLAPACK QR problem size – 12000
• Same load introduced at different points of application execution
[Figure legend: No rescheduling, Rescheduling. Annotation: the rescheduler decided not to reschedule. Wrong decision!]
Experiments and Results: Opportunistic Rescheduling
• Two problems
  - 1st problem, size 14000, executing on 6 msc machines
  - 2nd problem of varying sizes
• 2nd problem introduced 2 minutes after the start of the 1st problem
• Initial set of resources for the 2nd problem consisted of 6 msc machines and 2 opus machines
• Due to the presence of the 1st problem, the 2nd problem had to use both the msc and opus machines, and hence involved Internet bandwidth
• After the 1st problem completes, the 2nd problem can be rescheduled to use only the msc machines
[Figure labels: Rescheduling, No rescheduling, Large problem]
Dynamic Prediction of Rescheduling Cost
• When making a rescheduling decision, the rescheduler contacts the RSS and obtains the application's current data distributions
• Forms old and new data maps
• Based on the maps and current NWS information, predicts the redistribution cost
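A rough sketch of how a redistribution cost could be predicted from an old and a new data map is given below. It is not the GrADS rescheduler's model. It assumes a simple block distribution, a caller-supplied bandwidth table standing in for NWS measurements, transfers performed one after another, and no latency term. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* First global index owned by rank r when n elements are block-distributed
 * over p ranks; rank r owns [block_start(r), block_start(r + 1)). */
static size_t block_start(size_t n, size_t p, size_t r) {
    return (n * r) / p;
}

/* Number of elements that old rank r_old must send to new rank r_new,
 * i.e. the overlap of the two ranks' index ranges. */
size_t elements_moved(size_t n, size_t p_old, size_t r_old,
                      size_t p_new, size_t r_new) {
    assert(p_old > 0 && p_new > 0 && r_old < p_old && r_new < p_new);
    size_t lo_old = block_start(n, p_old, r_old);
    size_t hi_old = block_start(n, p_old, r_old + 1);
    size_t lo_new = block_start(n, p_new, r_new);
    size_t hi_new = block_start(n, p_new, r_new + 1);
    size_t lo = lo_old > lo_new ? lo_old : lo_new;
    size_t hi = hi_old < hi_new ? hi_old : hi_new;
    return hi > lo ? hi - lo : 0;
}

/* Predicted redistribution time in seconds, assuming serialized transfers.
 * bandwidth[r_old * p_new + r_new] is in bytes per second (assumed input,
 * e.g. taken from a network monitoring service). */
double predict_redistribution_seconds(size_t n, size_t elem_bytes,
                                      size_t p_old, size_t p_new,
                                      const double *bandwidth) {
    double total = 0.0;
    for (size_t r_old = 0; r_old < p_old; r_old++) {
        for (size_t r_new = 0; r_new < p_new; r_new++) {
            size_t moved = elements_moved(n, p_old, r_old, p_new, r_new);
            if (moved > 0) {
                double bw = bandwidth[r_old * p_new + r_new];
                assert(bw > 0.0);
                total += (double)(moved * elem_bytes) / bw;
            }
        }
    }
    return total;
}
```

Summing the transfer times gives a pessimistic estimate; a model that allows transfers to overlap would instead take a maximum over senders or receivers.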
Dynamic Prediction of Rescheduling Cost
[Figure: application started on 4 msc machines and restarted on 8 opus machines]
References / Sources / Credits
• Shao, G., Wolski, R. and Berman, F. “Predicting the Cost of Redistribution in Scheduling”. Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing.
• Vadhiyar, S. and Dongarra, J. “Performance Oriented Migration Framework for the Grid”. Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), pp 130-137, May 2003, Tokyo, Japan.
• Kale, L. V., Kumar, S. and DeSouza, J. “A Malleable-Job System for Timeshared Parallel Machines”. 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 21-24, 2002, Berlin, Germany.
• See Cactus migration thorn
• See opportunistic migration by Huedo
GridWay
• Migration:
  - When performance degradation happens
  - When “better” resources are discovered
  - When requirements change
  - Owner decision
  - Remote resource failure
• Rescheduling done at discovery interval
• Performance degradation evaluator program executed at monitoring interval
Components
• Request manager
• Dispatch manager
• Submission manager – prologing, submitting, canceling, epiloging
• Performance monitor
• Application specific components
  - Resource selector
  - Performance degradation evaluator
  - Prolog
  - Wrapper
  - Epilog
Opportunistic Job Migration
• Factors
  - Performance of new host
  - Remaining execution time of application
  - Proximity of new resource to the needed data
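One way to combine these factors is a simple gain test, sketched below. It is a hypothetical rule and not GridWay's actual policy: it assumes the new host's performance is summarized as a speedup factor, that data proximity shows up as a data transfer time, and that all times are in seconds. The name `should_migrate` is illustrative.

```c
#include <assert.h>

/* Hypothetical migration test: migrate only if finishing on the new host,
 * including the cost of moving, is expected to be faster than staying.
 *   remaining_on_current - estimated remaining time on the current host
 *   speedup_on_new       - relative performance of the new host (> 0)
 *   migration_overhead   - time to checkpoint, resubmit and restart
 *   data_transfer_time   - time to move needed data near the new host */
int should_migrate(double remaining_on_current, double speedup_on_new,
                   double migration_overhead, double data_transfer_time) {
    assert(speedup_on_new > 0.0);
    double remaining_on_new = remaining_on_current / speedup_on_new
                            + migration_overhead + data_transfer_time;
    return remaining_on_new < remaining_on_current;
}
```

Under this rule a job that is nearly finished is never migrated, because the fixed migration overhead exceeds any time saved.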
Dynamic Space Sharing on Clusters of Non-dedicated Workstations (Chowdhury et al.)
• Dynamic reconfiguration – an application-level approach for dynamic reconfiguration of grid-based iterative applications
SRS Overhead
[Figure annotations: worst-case overhead – 15%; worst-case SRS overhead of all results – 36%]

SRS Data Redistribution Cost
[Figure: started on 8 msc machines; restarted on 8 opus and 2 msc machines]
Modified GrADS Architecture
[Architecture diagram. Components shown: User, Grid Routine / Application Manager, Resource Selector, MDS, NWS, Permission Service, Performance Modeler, Contract Developer, Contract Negotiator, App Launcher, Application, Contract Monitor, Rescheduler, Database Manager, RSS.]
Another Approach: AMPI
• AMPI – MPI implementation on top of Charm++
• Processes implemented as user-level threads
• Charm++ provides a load balancing framework and migrates threads
• The load balancing framework accepts a processor map
• Parallel job started on all processors in the system
• Allocates work only to processors in the processor map, i.e. threads/objects are assigned to processors in the processor map
Rescheduling
• When the processor map changes:
  - Threads are migrated to the new set of processors in the processor map
  - Skeleton processes are left behind on the vacated processors
  - A skeleton forwards messages to threads/objects previously housed on its processor
• The new processor map is conveyed to the load balancing framework by the adaptive job scheduler

Overhead
• Shrink or expand time depends on:
  - Per-process data that has to be transferred
  - Number of processors involved
Adaptive Job Scheduler
• Variant of the dynamic equipartitioning strategy
• Each job specifies the minimum and maximum number of processors that it can run on
• The scheduler recalculates the number of processors assigned to each running job
  - Running jobs and the new job are first assigned their minimum requirement
  - The leftover processors are divided equally among all the jobs
• The new job is placed in a queue if it cannot be allocated its minimum requirement
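The allocation steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the scheduler from the cited work: the `Job` struct and `equipartition` are hypothetical names, leftover processors are handed out one at a time in round-robin order, and a job is assumed never to receive more than its stated maximum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical job description for the sketch. */
typedef struct {
    int min_procs;  /* fewest processors the job can run on */
    int max_procs;  /* most processors the job can use */
    int assigned;   /* output: processors assigned by the scheduler */
} Job;

/* Assign processors to all jobs. Returns 1 on success, or 0 if the minimum
 * requirements do not fit (the caller would then queue the newest job and
 * retry without it). On failure, no assignment is changed. */
int equipartition(Job *jobs, size_t njobs, int total_procs) {
    int needed = 0;
    for (size_t i = 0; i < njobs; i++) {
        assert(jobs[i].min_procs >= 1 && jobs[i].min_procs <= jobs[i].max_procs);
        needed += jobs[i].min_procs;
    }
    if (needed > total_procs) return 0;

    /* Step 1: every job gets its minimum requirement. */
    for (size_t i = 0; i < njobs; i++) jobs[i].assigned = jobs[i].min_procs;

    /* Step 2: share the leftover processors round-robin, skipping jobs
     * that have already reached their maximum. */
    int left = total_procs - needed;
    int progress = 1;
    while (left > 0 && progress) {
        progress = 0;
        for (size_t i = 0; i < njobs && left > 0; i++) {
            if (jobs[i].assigned < jobs[i].max_procs) {
                jobs[i].assigned++;
                left--;
                progress = 1;
            }
        }
    }
    return 1;
}
```

When the leftover count is not a multiple of the number of eligible jobs, round-robin gives the extra processors to the jobs that appear earliest in the array; a real scheduler might break such ties differently.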
Scheduling
• The same strategy is followed when jobs complete
• The scheduler conveys its decision to jobs as a bit-vector
• Jobs perform thread migration
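A bit-vector processor map could look like the sketch below. The encoding is an assumption made for illustration: one bit per processor in a single 64-bit word, with jobs given contiguous processor ranges. The `procmap_*` names are hypothetical and are not part of Charm++ or AMPI.

```c
#include <assert.h>
#include <stdint.h>

/* Build a map in which bit p is set when the job may use processor p.
 * This sketch covers at most 64 processors and a contiguous range. */
uint64_t procmap_range(unsigned first, unsigned count) {
    assert(first + count <= 64);
    if (count == 0) return 0;
    uint64_t mask = (count == 64) ? UINT64_MAX : ((UINT64_C(1) << count) - 1);
    return mask << first;
}

/* Returns 1 if processor p is in the map, 0 otherwise. */
int procmap_contains(uint64_t map, unsigned p) {
    assert(p < 64);
    return (int)((map >> p) & 1u);
}

/* Processors the job must vacate when its map changes from old_map to
 * new_map; threads on these processors would have to migrate. */
uint64_t procmap_vacated(uint64_t old_map, uint64_t new_map) {
    return old_map & ~new_map;
}
```

For example, shrinking a job from processors 0-3 to processors 0-1 yields a vacated set containing processors 2 and 3.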