Dynamic Load Balancing and Job Replication in a Global-Scale Grid Environment: A Comparison

IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 2, February 2009
Menno Dobber, Student Member, IEEE, Rob van der Mei, and Ger Koole
Presented by Chen, Ting-Wei

Index
• Introduction
• Preliminaries
• Experimental Setup
• Experimental Results
• Conclusions

Introduction (cont.)
• Dynamics of grid environments
• Dynamic Load Balancing (DLB)
• Job Replication (JR)
• Easy-to-measure statistic Y
• Corresponding threshold value Y*
• If Y > Y*, DLB outperforms JR
• If Y < Y*, JR outperforms DLB

Introduction (cont.)
• Easy-to-implement approach
• Makes dynamic decisions about whether to use DLB or JR
• Two types of investigation allow accurate verification
• Trace-driven simulation
• Real implementation
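The threshold rule behind the dynamic decision can be written in a few lines. This is a minimal sketch, not the authors' implementation: the function name `choose_strategy` is hypothetical, how Y and Y* are obtained is left outside the sketch, and the slides do not say what happens when Y equals Y*, so the tie is resolved in favour of JR purely as an assumption.

```python
def choose_strategy(y: float, y_star: float) -> str:
    """Pick a strategy from the measured statistic Y and the threshold Y*.

    Assumption: a tie (Y == Y*) falls back to JR; the slides only cover
    the strict cases Y > Y* and Y < Y*.
    """
    return "DLB" if y > y_star else "JR"


if __name__ == "__main__":
    print(choose_strategy(1.2, 1.0))  # statistic above the threshold
    print(choose_strategy(0.5, 1.0))  # statistic below the threshold
```

Calling the function again whenever a fresh value of Y is available gives the dynamic behaviour described on the slide.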

Introduction (cont.)
• Real implementation
• To acquire more knowledge about DLB
• Trace-driven simulations
• Require detailed knowledge about the processes
• Take less time
• Allow more extensive analyses

Introduction (cont.)
• Analyze and compare the effectiveness of ELB (Equal Load Balancing), DLB, and JR
• Using trace-driven simulations
• With traces gathered from a global-scale grid testbed

Preliminaries (cont.)
• Bulk Synchronous Processing (BSP)
• The problem can be divided into subproblems, or jobs
• I iterations, P jobs, P processors
• Each processor receives one job per iteration
• After computing their jobs, all processors send their data and wait for each other's data before the next iteration starts
• The standard BSP program is implemented according to the ELB principle
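Because every processor waits for the others at the end of an iteration, the slowest job determines how long the iteration takes. The sketch below illustrates this under stated assumptions: the function name `bsp_runtime`, the fixed per-iteration `comm_time`, and the sample numbers are all illustrative and do not come from the paper.

```python
from typing import Sequence


def bsp_runtime(job_times: Sequence[Sequence[float]], comm_time: float = 0.0) -> float:
    """Total runtime of a BSP run with one job per processor per iteration.

    job_times[i][p] is the time processor p needs for its job in iteration i.
    Each iteration lasts as long as its slowest job, plus an assumed fixed
    communication time.
    """
    return sum(max(iteration) + comm_time for iteration in job_times)


if __name__ == "__main__":
    # Two iterations on three processors (made-up job times in seconds).
    times = [[10, 12, 11], [10, 30, 11]]
    print(bsp_runtime(times))  # 12 + 30 = 42
```

The second iteration shows why ELB suffers on a dynamic grid: one slow processor delays all the others.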

Preliminaries (cont.)
• Implementation of ELB

Preliminaries (cont.)
• Dynamic Load Balancing (DLB)
• DLB starts by executing an iteration in the same way as BSP
• At the end of each iteration, the processors predict their processing speed for the next iteration
• One processor is selected to be the DLB scheduler
• After every N iterations, the processors send their predictions to this scheduler

Preliminaries (cont.)
• The scheduler calculates the “optimal” load distribution
• It sends the relevant information to each processor
• All processors redistribute the load
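The slides do not say how the “optimal” distribution is computed. One common choice, used here only as an assumption, is to give each processor a share of the load proportional to its predicted speed, so that all processors are predicted to finish at the same time. The name `dlb_shares` is hypothetical.

```python
from typing import List, Sequence


def dlb_shares(predicted_speeds: Sequence[float]) -> List[float]:
    """Split the load proportionally to the predicted processor speeds.

    Assumption: share / speed is then equal for every processor, which
    equalizes the predicted finish times. The paper may define its optimum
    differently.
    """
    total = sum(predicted_speeds)
    if total <= 0:
        raise ValueError("predicted speeds must sum to a positive value")
    return [speed / total for speed in predicted_speeds]


if __name__ == "__main__":
    # The second processor is predicted to be twice as fast as the others.
    print(dlb_shares([1.0, 2.0, 1.0]))  # [0.25, 0.5, 0.25]
```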

Preliminaries (cont.)
• Implementation of DLB

Preliminaries (cont.)
• Job Replication (JR)
• Two copies of a job
• R copies of all P jobs are distributed to the P processors
• When a processor has finished one of the copies, it sends a message to the other processors
• The other processors can then kill the job and start the next job
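Under this scheme a job is done as soon as its fastest copy is done. The following sketch, with the hypothetical name `first_finisher` and made-up times, picks the replica that finishes first; it assumes all copies of the job start at the same moment and ignores the cost of the kill message.

```python
from typing import Sequence, Tuple


def first_finisher(replica_times: Sequence[float]) -> Tuple[int, float]:
    """Return (index, time) of the replica that completes first.

    Assumption: all replicas start simultaneously, so the job's effective
    time is the minimum of the replica times.
    """
    winner = min(range(len(replica_times)), key=lambda i: replica_times[i])
    return winner, replica_times[winner]


if __name__ == "__main__":
    # Two copies of one job: the second processor finishes first.
    print(first_finisher([14.0, 9.5]))  # (1, 9.5)
```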

Preliminaries (cont.)
• Implementation of JR

Experimental Setup (cont.)
• Data-Collection Procedure

Experimental Setup (cont.)
• On a completely available Pentium 4, 3.0-GHz processor, the computations in a job would take 10,000 ms
• Set one: average job time of 72,500 ms
  • Distributed within the USA
  • More coherence between the generated datasets
• Set two: average job time of 65,000 ms
  • Shows more burstiness and larger differences between the average job times on the processors
  • Globally distributed

Experimental Setup (cont.)
• Trace-driven simulation analyses (parameter settings missing from the source)

Experimental Setup (cont.)
• Simulation Details
• Trace-driven DLB simulations
• Assume a linear relation between job sizes and job times in BSP
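The linearity assumption means a recorded job time can be rescaled when DLB assigns a processor more or less than an equal share of the load. A minimal sketch, assuming the trace was recorded with the equal share 1/P; the name `scaled_job_time` and the numbers are illustrative only.

```python
def scaled_job_time(trace_job_time: float, load_fraction: float, num_processors: int) -> float:
    """Rescale a traced job time for a different share of the total load.

    Assumption: the trace was recorded with an equal share 1/P per
    processor, and job time grows linearly with job size.
    """
    equal_share = 1.0 / num_processors
    return trace_job_time * (load_fraction / equal_share)


if __name__ == "__main__":
    # A processor that took 10,000 ms for a quarter of the load (P = 4)
    # is given half of the load instead.
    print(scaled_job_time(10_000.0, 0.5, 4))  # 20000.0
```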

Experimental Setup (cont.)
• DLB simulation
• Randomly select a resource set
• Apply the DES-based prediction
• Derive the IT
• Derive the runtime of the DLB run
• Derive the expected runtime of a DLB run
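The prediction step can be illustrated with plain single exponential smoothing. This is a stand-in, not the DES-based predictor itself: the fixed smoothing factor `alpha`, the function name `smooth_predict`, and the sample job times are assumptions made for the example.

```python
from typing import Sequence


def smooth_predict(observed: Sequence[float], alpha: float = 0.5) -> float:
    """Predict the next value from past observations by exponential smoothing.

    Assumption: a fixed alpha in (0, 1]; the DES-based predictor in the
    paper may choose its parameters differently.
    """
    if not observed:
        raise ValueError("at least one observation is required")
    prediction = observed[0]
    for value in observed[1:]:
        prediction = alpha * value + (1 - alpha) * prediction
    return prediction


if __name__ == "__main__":
    # Job times of one processor over three iterations (made-up values).
    print(smooth_predict([10.0, 20.0, 10.0], alpha=0.5))  # 12.5
```

A larger `alpha` makes the prediction follow recent job times more closely; a smaller one smooths out bursts.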

Experimental Setup (cont.)
• JR simulation
• Same as step one of the DLB simulation
• Divide the set of processors into execution groups
• Derive the effective job times for all P processors
• Derive the IT by repeating step two R times
• Derive the runtime of the R-JR run by repeating step three
• Derive the expected runtime of an R-JR run on P processors
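The grouping and effective-job-time steps can be sketched as follows. Several points are assumptions not stated on the slides: groups are formed from R consecutive processors, all processors in a group work on the same job copy in each step, the effective time of a step is the fastest time in the group, and groups synchronize only at the end of the iteration. The name `jr_iteration_time` and the trace values are hypothetical.

```python
from typing import Sequence


def jr_iteration_time(step_times: Sequence[Sequence[float]], replicas: int) -> float:
    """Iteration time (IT) of an R-JR run from traced job times.

    step_times[s][p] is the traced job time of processor p in step s.
    Assumptions: consecutive processors form groups of size `replicas`;
    a group's step time is the minimum within the group; the iteration
    ends when the slowest group has finished all of its steps.
    """
    if len(step_times) != replicas:
        raise ValueError("an R-JR iteration needs exactly R steps")
    num_processors = len(step_times[0])
    if num_processors % replicas != 0:
        raise ValueError("P must be divisible by R")
    group_totals = []
    for start in range(0, num_processors, replicas):
        total = sum(min(step[start:start + replicas]) for step in step_times)
        group_totals.append(total)
    return max(group_totals)


if __name__ == "__main__":
    # R = 2, P = 4: groups are processors (0, 1) and (2, 3).
    trace = [[10, 14, 20, 12], [16, 9, 11, 13]]
    print(jr_iteration_time(trace, replicas=2))  # max(10 + 9, 12 + 11) = 23
```

Summing this value over all iterations would give the runtime of one R-JR run, and averaging over many randomly selected resource sets would give the expected runtime.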

Experimental Setup (cont.)
• Dynamic Selection Method
• Analysis

Experimental Results (cont.)
• Simulate the runtimes of DLB for different numbers of processors with sets one and two
• Simulate runs of BSP parallel applications that use JR, and analyze the expected speedups for different numbers of processors, numbers of replicas, data sets, and CCR (communication-to-computation ratio) values

Experimental Results (cont.)
• Compare the runtimes and speedups of ELB, DLB, and JR
• Simulate the speedups of the proposed selection method
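A speedup figure of this kind is a ratio of runtimes. The sketch below assumes that the ELB runtime is the baseline, which the slides do not state explicitly; the name `speedup` and the numbers are illustrative.

```python
def speedup(baseline_runtime: float, method_runtime: float) -> float:
    """Speedup of a method relative to a baseline runtime.

    Assumption: the baseline is the ELB run on the same resource set;
    values above 1 mean the method is faster than the baseline.
    """
    if method_runtime <= 0:
        raise ValueError("runtime must be positive")
    return baseline_runtime / method_runtime


if __name__ == "__main__":
    # Made-up runtimes in seconds: baseline 100 s, method 80 s.
    print(speedup(100.0, 80.0))  # 1.25
```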

Experimental Results (cont.)
• DLB

Experimental Results (cont.)
• Job Replication

Experimental Results (cont.)
• Comparison of ELB, DLB, and JR
• Runtimes of DLB and JR with CCR 0.01

Experimental Results (cont.)
• Speedups of DLB and JR with sets of 40 and 90 data sets with CCR 0.01

Experimental Results (cont.)
• Statistic Y against ITs of DLB and JR

Experimental Results (cont.)
• Speedup of the selection method, DLB, and JR

Conclusions
• Made an extensive assessment and comparison of DLB and JR
• If Y > Y*, DLB outperforms JR
• If Y < Y*, JR outperforms DLB
• Proposed the so-called DLB/JR method

Outlook
• Bring the results to a higher level of realism
• Use mathematical techniques to provide a more solid foundation
• Determine the optimal number of job replicas needed to obtain the best speedup

Thanks for your attention
