Dynamic Multi Phase Scheduling for Heterogeneous Clusters

20th International Parallel and Distributed Processing Symposium • 25-29 April 2006
Florina M. Ciorba†, Theodore Andronikos†, Ioannis Riakiotakis†, Anthony T. Chronopoulos‡ and George Papakonstantinou†
† National Technical University of Athens • Computing Systems Laboratory
‡ University of Texas at San Antonio
cflorina@cslab.ece.ntua.gr • www.cslab.ece.ntua.gr

Outline
• Introduction
• Notation
• Some existing self-scheduling algorithms
• Dynamic self-scheduling for dependence loops
• Implementation and test results
• Conclusions
• Future work
IPDPS 2006

Introduction
• Motivation for dynamically scheduling loops with dependencies:
  • Existing dynamic algorithms cannot cope with dependencies, because they lack inter-slave communication
  • Static algorithms are not always efficient
  • Applied in their original form to loops with dependencies, dynamic algorithms yield either a serial or an invalid execution

Notation
• Algorithmic model:
    FOR (i1=l1; i1<=u1; i1++)
      FOR (i2=l2; i2<=u2; i2++)
        …
          FOR (in=ln; in<=un; in++)
            Loop Body
          ENDFOR
        …
      ENDFOR
    ENDFOR
• Perfectly nested loops
• Constant flow data dependencies
• General program statements within the loop body
• J – index space of an n-dimensional uniform dependence loop
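
As a concrete instance of this model, the sketch below runs a hypothetical 2-dimensional loop in which every iteration reads the values produced at the constant distances (1, 0) and (0, 1) behind it. The bounds, the dependence vectors `DEPS`, the boundary value and the loop body are illustrative assumptions, not taken from the slides.

```python
# Hypothetical 2-D uniform dependence loop: each iteration reads the values
# produced at the constant distances (1, 0) and (0, 1) behind it.
DEPS = [(1, 0), (0, 1)]          # assumed dependence vectors


def run_loop(u1, u2):
    """Execute the nest in lexicographic order over J = [0, u1) x [0, u2)."""
    a = [[0] * u2 for _ in range(u1)]
    for i1 in range(u1):
        for i2 in range(u2):
            # Predecessors that fall outside J contribute the boundary value 1.
            preds = [a[i1 - d1][i2 - d2] if i1 >= d1 and i2 >= d2 else 1
                     for d1, d2 in DEPS]
            a[i1][i2] = sum(preds)
    return a
```

Because the dependence vectors are constant, the same access pattern repeats at every point of J; this is what makes the loop "uniform" and lets a scheduler reason about chunk boundaries in advance.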

Notation
• u1 – synchronization dimension, un – scheduling dimension
• – set of dependence vectors
• PE – processing element
• P1, ..., Pm – slaves
• N – number of scheduling steps
• Ci – chunk size at the i-th scheduling step
• Vi – size (iteration-wise) of Ci along the scheduling dimension un
• VPk – virtual computing power of slave Pk
• Qk – number of processes in the run-queue of slave Pk
• Ak – available computing power of slave Pk
• A – total available computing power of the cluster

Some existing self-scheduling algorithms
[Figure: index space with axes u1 and u2, partitioned along u2 into chunks ..., Ci-1, Ci, Ci+1, ... of sizes V1, ..., Vi-1, Vi, Vi+1, ..., VN; labelled CSS, TSS and DTSS]
• Three self-scheduling algorithms:
  • CSS – Chunk Self-Scheduling: Ci = constant
  • TSS – Trapezoid Self-Scheduling: Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×m) and the last chunk is L = 1
  • DTSS – Distributed TSS: Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×A) and the last chunk is L = 1
• CSS and TSS were devised for homogeneous systems
• DTSS improves on TSS for heterogeneous systems by selecting the chunk sizes according to:
  • the virtual computing power of the slaves, VPk
  • the number of processes in the run-queue of each PE, Qk
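
The CSS and TSS rules can be sketched as follows. The slide gives only F, L and the recurrence Ci = Ci-1 – D; the expressions used here for the number of steps N = ⌈2·total/(F+L)⌉ and the decrement D = ⌊(F–L)/(N–1)⌋ follow the usual definition of TSS and, together with the rounding choices and the function names, should be read as assumptions.

```python
import math


def css_chunks(total, chunk):
    """CSS: every chunk has the same size (the last one may be smaller)."""
    return [min(chunk, total - start) for start in range(0, total, chunk)]


def tss_chunks(total, m, last=1):
    """TSS: chunk sizes decrease linearly from F = total/(2m) towards L."""
    first = max(last, math.ceil(total / (2 * m)))              # F
    steps = math.ceil(2 * total / (first + last))              # N (assumed)
    dec = (first - last) // (steps - 1) if steps > 1 else 0    # D (assumed)
    chunks, remaining, size = [], total, first
    while remaining > 0:
        c = min(max(size, last), remaining)   # never below L, never past the end
        chunks.append(c)
        remaining -= c
        size -= dec
    return chunks
```

For example, `tss_chunks(1000, 4)` starts with chunks of 125, 117 and 109 iterations, whereas `css_chunks(1000, 125)` hands out eight equal chunks.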

Some existing self-scheduling algorithms
• |J| = 5000×10000
• m = 10 slaves
• CSS and TSS give the same chunk sizes in both dedicated and non-dedicated systems
• DTSS adjusts the chunk sizes to match the different Ak of the slaves
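
To see why DTSS reacts to load while TSS does not, the sketch below compares the first chunk F under both rules. It assumes Ak = VPk/Qk and A = ΣAk, a reading that is consistent with the notation and with the remark that an extra process halves Ak, but that is not stated explicitly on the slides; the cluster description and the iteration count are hypothetical.

```python
def available_power(vp, q):
    """A_k, assumed here to be VP_k / Q_k (q >= 1 processes in the run-queue)."""
    return vp / q


def first_chunk(total, power):
    """F = total / (2 * power); power is m for TSS and A for DTSS."""
    return total / (2 * power)


# Hypothetical cluster: (VP_k, Q_k) for each slave; the second slave is loaded.
slaves = [(1.5, 1), (1.5, 2), (0.5, 1), (0.5, 1)]
A = sum(available_power(vp, q) for vp, q in slaves)   # total available power

f_tss = first_chunk(1300, len(slaves))   # depends only on the number of slaves
f_dtss = first_chunk(1300, A)            # changes when VP_k or Q_k change
```

With these numbers TSS always starts from 162.5 iterations, while DTSS starts from 200 and would start from a different value as soon as any Qk changes.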

More notation
• SP – synchronization point
• M – number of SPs inserted along the synchronization dimension u1
• H – interval (iteration-wise) between two SPs along u1; H is the same for every chunk
• SCi,j – the set of iterations of Ci between SPj-1 and SPj
• Ci = Vi × M × H
• Current slave – the slave assigned chunk Ci
• Previous slave – the slave assigned chunk Ci-1
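
The relation Ci = Vi × M × H can be checked with a small sketch that splits one chunk into its pieces SCi,j. It assumes a 2-dimensional index space, one SP every H iterations along u1, and M = ⌈|u1|/H⌉; the function names and the half-open range representation are illustrative.

```python
import math


def sync_points(u1_size, h):
    """Number of SPs M when one SP is placed every h iterations along u1."""
    return math.ceil(u1_size / h)


def sub_chunks(u1_size, h, start, v):
    """Split the chunk [start, start + v) along u2 into pieces SC_{i,j} along u1.

    Returns a list of ((u1_lo, u1_hi), (u2_lo, u2_hi)) half-open ranges.
    """
    m = sync_points(u1_size, h)
    return [((j * h, min((j + 1) * h, u1_size)), (start, start + v))
            for j in range(m)]


def iterations(pieces):
    """Total number of iterations covered by a list of sub-chunks."""
    return sum((a_hi - a_lo) * (b_hi - b_lo)
               for (a_lo, a_hi), (b_lo, b_hi) in pieces)
```

When H divides the extent of u1 exactly, `iterations(sub_chunks(...))` equals Vi × M × H; otherwise the last piece is shorter and the product is an upper bound.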

Self-scheduling with synchronization
• Chunks are formed along the scheduling dimension, here u2
• SPs are inserted along the synchronization dimension, u1
• Phase 1: Apply a self-scheduling algorithm to the scheduling dimension
• Phase 2: Insert synchronization points along the synchronization dimension

The inter-slave communication scheme
[Figure: chunks Ci-1, Ci, Ci+1 assigned to slaves Pk-1, Pk, Pk+1, with synchronization points SPj, SPj+1, SPj+2; the legend marks the communication sets and the sets of points (SCi-1,j+1, SCi,j+1) computed at moments t and t+1]
• Ci-1 is assigned to Pk-1, Ci to Pk and Ci+1 to Pk+1
• When Pk reaches SPj+1, it sends to Pk+1 only the data Pk+1 requires (i.e., the iterations imposed by the existing dependence vectors)
• Afterwards, Pk receives from Pk-1 the data required for the current computation
• Slaves do not reach an SP at the same time, which leads to a wavefront execution
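
The wavefront pattern can be illustrated with a simple timing model. The sketch assumes that every chunk runs on its own slave, that each piece SCi,j takes one time unit, and that communication is free; none of these simplifications comes from the slides.

```python
def wavefront_schedule(num_chunks, m):
    """Earliest unit time step at which each SC_{i,j} can be computed.

    SC_{i,j} needs SC_{i,j-1} (same slave) and SC_{i-1,j} (previous slave).
    """
    t = [[0] * m for _ in range(num_chunks)]
    for i in range(num_chunks):
        for j in range(m):
            ready = max(t[i][j - 1] if j > 0 else 0,
                        t[i - 1][j] if i > 0 else 0)
            t[i][j] = ready + 1
    return t
```

Under these assumptions SCi,j completes at step i + j + 1 (counting i and j from 0), so pieces on the same anti-diagonal are computed simultaneously by different slaves.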

Dynamic Multi-Phase Scheduling DMPS(x)
INPUT:
(a) An n-dimensional dependence nested loop.
(b) The choice of algorithm: CSS, TSS or DTSS.
(c) If CSS is chosen, the chunk size Ci.
(d) The synchronization interval H.
(e) The number of slaves m; in case of DTSS, the virtual power VPk of every slave.
Master:
Initialization:
(M.a) Register slaves. In case of DTSS, slaves report their Ak.
(M.b) Calculate F, L, N, D for TSS and DTSS. For CSS use the given Ci.
While there are unassigned iterations do:
(M.1) If a request arrives, put it in the queue.
(M.2) Pick a request from the queue, and compute the next chunk size using CSS, TSS or DTSS.
(M.3) Update the current and previous slave ids.
(M.4) Send the id of the current slave to the previous one.
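
The master's main loop can be sketched as a sequential simulation. The sketch replaces message passing with plain lists: `requests` is a hypothetical arrival order of slave ids and `chunk_sizes` stands for whatever CSS, TSS or DTSS would produce; a real implementation would exchange these over MPI.

```python
from collections import deque


def master(requests, chunk_sizes):
    """Serve queued requests in order; return (assignments, notifications).

    requests    -- slave ids in the order their requests arrive (hypothetical)
    chunk_sizes -- chunk sizes produced by CSS, TSS or DTSS
    """
    queue = deque(requests)                 # (M.1) pending requests
    sizes = deque(chunk_sizes)
    assignments, notifications = [], []
    previous = None
    while sizes and queue:
        current = queue.popleft()           # (M.2) pick a request
        assignments.append((current, sizes.popleft()))
        if previous is not None:            # (M.4) tell previous its send-to id
            notifications.append((previous, current))
        previous = current                  # (M.3) update the slave ids
    return assignments, notifications
```

Each notification pair (previous, current) is what lets the slave holding chunk Ci-1 learn where to send its boundary data, since chunk owners are not known in advance.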

Dynamic Multi-Phase Scheduling DMPS(x)
Slave Pk:
Initialization:
(S.a) Register with the master. In case of DTSS, report Ak.
(S.b) Compute M according to the given H.
(S.1) Send a request to the master.
(S.2) Wait for the reply; if a chunk is received from the master, go to (S.3), else go to OUTPUT.
(S.3) While the next SP is not reached, compute chunk i.
(S.4) If the id of the send-to slave is known, go to (S.5), else go to (S.6).
(S.5) Send the computed data to the send-to slave.
(S.6) Receive data from the receive-from slave and go to (S.3).
OUTPUT:
Master: If there are no more chunks to be assigned, terminate.
Slave Pk: If no more tasks come from the master, terminate.
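
The slave's inner loop over one chunk, steps (S.3) to (S.6), can be traced with the sketch below. It only lists the order of actions; the function name, the message strings, and the choice to skip the receive after the last SP are assumptions made for illustration.

```python
def slave_trace(m, send_to=None, receive_from=None):
    """List the actions of one slave while it processes a chunk with m SPs."""
    trace = []
    for j in range(1, m + 1):
        trace.append(f"compute SC[{j}]")              # (S.3) up to the next SP
        if send_to is not None:                       # (S.4) send-to id known?
            trace.append(f"send SC[{j}] data to P{send_to}")        # (S.5)
        if receive_from is not None and j < m:        # (S.6) then back to (S.3)
            trace.append(f"recv data from P{receive_from}")
    return trace
```

For instance, `slave_trace(2, send_to=3, receive_from=1)` alternates compute, send and receive in the order described by the steps above, while a slave with no neighbours simply computes its pieces one after another.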

Dynamic Multi-Phase Scheduling DMPS(x)
• Advantages of DMPS(x):
  • Can take any self-scheduling algorithm as input, without modification
  • Phase 2 is independent of Phase 1
  • Phase 1 deals with the heterogeneity and load variation in the system
  • Phase 2 deals with minimizing the inter-slave communication cost
  • Suitable for any type of heterogeneous system

Implementation and testing setup
• The algorithms are implemented in C and C++
• MPI is used for master-slave and inter-slave communication
• The heterogeneous system consists of 10 machines:
  • 4 Intel Pentium III, 1266 MHz with 1 GB RAM (called zealots), assumed to have VPk = 1.5 (one of them is the master)
  • 6 Intel Pentium III, 500 MHz with 512 MB RAM (called kids), assumed to have VPk = 0.5
• The interconnection network is Fast Ethernet, at 100 Mbit/s
• Dedicated system: all machines are dedicated to running the program and no other load is introduced during the execution
• Non-dedicated system: at the beginning of the program’s execution, a resource-intensive process is started on some of the slaves, halving their Ak

Implementation and testing setup
• System configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6
• Three series of experiments for both dedicated and non-dedicated systems, for m = 3, 4, 5, 6, 7, 8, 9 slaves:
  • DMPS(CSS)
  • DMPS(TSS)
  • DMPS(DTSS)
• Two real-life applications: heat equation, Floyd-Steinberg computation
• Speedup Sp is computed from TPi, the serial execution time on slave Pi (1 ≤ i ≤ m), and TPAR, the parallel execution time on m slaves
• In the plots of Sp, VP is used instead of m on the x-axis

Performance results – Heat equation

Performance results – Floyd-Steinberg

Interpretation of the results
• Dedicated system:
  • as expected, all algorithms perform better on a dedicated system than on a non-dedicated one
  • DMPS(TSS) slightly outperforms DMPS(CSS) for parallel loops, because it provides better load balancing
  • DMPS(DTSS) outperforms both other algorithms because it explicitly accounts for the system’s heterogeneity
• Non-dedicated system:
  • DMPS(DTSS) stands out even more, since the other algorithms cannot handle extra load variations
  • The speedup for DMPS(DTSS) increases in all cases
• H must be chosen so as to keep the comm/comp ratio < 1 for every test case
• Even then, small variations in the value of H do not significantly affect the overall performance

Conclusions
• Loops with dependencies can now be dynamically scheduled on heterogeneous dedicated and non-dedicated systems
• Distributed algorithms efficiently compensate for the system’s heterogeneity for loops with dependencies, especially in non-dedicated systems

Future work
• Establish a model for predicting the optimal synchronization interval H and minimizing the communication
• Extend all other self-scheduling algorithms so that they can handle loops with dependencies and account for the system’s heterogeneity

Thank you
Questions?
