Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes
Maria Athanasaki, Evangelos Koukis, Nectarios Koziris
National Technical University of Athens, School of Electrical and Computer Engineering, Computing Systems Laboratory

Previous work
• M. Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris, "Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces", SuperComputing Conference on High Performance Networking and Computing (SC2002), Baltimore, Maryland, November 16-22, 2002.
• G. Goumas, A. Sotiropoulos and N. Koziris, "Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping", Proceedings of the 2001 International Parallel and Distributed Processing Symposium (IPDPS2001), IEEE Press, San Francisco, California, April 2001.

Overview
• Tiling for parallelization
• Non-overlapping vs. Overlapping execution scheme
• Grouping
• Application on a cluster of SMPs with a fixed number of nodes
• Experimental-Simulation Results

Nested For-Loops
    for (i1 = l1; i1 <= u1; i1++)
      for (i2 = l2; i2 <= u2; i2++)
        ...
          for (in = ln; in <= un; in++) {
            Loop Body
          }

Dependence Vectors
    for (i1 = 0; i1 <= 7; i1++)
      for (i2 = 0; i2 <= 7; i2++)
        A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
Each iteration reads its predecessors along i1 and along i2, so the dependence vectors are (1,0) and (0,1).
[Figure: iteration space with axes i1 and i2]

Tiling
[Figure: iteration space (axes i1, i2) partitioned into tiles, with the tiles assigned to Processor 0 and Processor 1]
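
The tiling pictured above can be sketched in C for the 2-D example loop. This is a minimal sketch, not the authors' code: it assumes rectangular tiles of a caller-chosen size `t1` x `t2`, and it pads the array with a boundary row and column (index 0) so that `i1-1` and `i2-1` stay in bounds. The function names are hypothetical.

```c
#include <assert.h>

#define N 8   /* iteration space is N x N, as in the 8 x 8 example */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled reference sweep. Row 0 and column 0 hold boundary values. */
void sweep_untiled(long a[N + 1][N + 1])
{
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}

/* Same computation, traversed tile by tile with t1 x t2 rectangular tiles.
 * Both dependence vectors, (1,0) and (0,1), are non-negative, so visiting
 * tiles in lexicographic order respects every dependence. */
void sweep_tiled(long a[N + 1][N + 1], int t1, int t2)
{
    assert(t1 > 0 && t2 > 0);
    for (int b1 = 1; b1 <= N; b1 += t1)            /* tile origin along i1 */
        for (int b2 = 1; b2 <= N; b2 += t2)        /* tile origin along i2 */
            for (int i1 = b1; i1 <= min_int(b1 + t1 - 1, N); i1++)
                for (int i2 = b2; i2 <= min_int(b2 + t2 - 1, N); i2++)
                    a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}
```

In a parallel run each tile would be executed by one processor as a unit, with data exchanged only at tile boundaries; the sequential traversal here only demonstrates that the tiled order computes the same values.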

Non-Overlapping Scheme
[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]

Non-Overlapping vs. Overlapping Scheme
[Figure: processors P0 to P3 under the non-overlapping and the overlapping scheme]

Overlapping Scheme
[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]
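
The trade-off between the two schemes can be illustrated with a deliberately simplified cost model. It is an assumption of this sketch, not a formula from the slides: every time step handles one tile, a non-overlapping step pays for communication and computation in sequence, and an overlapping step lasts as long as the slower of the two. The struct and function names are hypothetical, and the step counts are left as inputs.

```c
#include <assert.h>

/* Hypothetical per-step costs of one tile, in arbitrary time units. */
struct step_cost {
    double comp;   /* computing one tile */
    double comm;   /* receiving inputs and sending results for one tile */
};

/* Non-overlapping: each step communicates and computes one after the other. */
double total_non_overlapping(long steps, struct step_cost c)
{
    assert(steps >= 0 && c.comp >= 0 && c.comm >= 0);
    return (double)steps * (c.comp + c.comm);
}

/* Overlapping: communication proceeds while the CPU computes, so a step
 * lasts as long as the slower of the two activities. */
double total_overlapping(long steps, struct step_cost c)
{
    assert(steps >= 0 && c.comp >= 0 && c.comm >= 0);
    double step = c.comp > c.comm ? c.comp : c.comm;
    return (double)steps * step;
}
```

Under this model an overlapping schedule can afford somewhat more time steps than a non-overlapping one and still finish earlier, as long as the per-step saving outweighs the extra steps.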

Generalization to SMPs – “Grouping”
[Figure: tile space grouped onto SMP0 to SMP3, each with CPU0 and CPU1]

Example: Grouping + Non-Overlapping Communication Scheme
Scheduling vector Π = (1,0)
[Figure: Tile Space and Group Space, mapped onto SMP node0 and SMP node1]

Example: Grouping + Overlapping Communication Scheme
Scheduling vector Π = (1,1)
[Figure: Tile Space and Group Space, mapped onto SMP node0 and SMP node1]
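
One way to read the two scheduling vectors above is as linear schedules: the time step of a group with coordinates g is the dot product Π · g. The sketch below assumes that reading, a two-dimensional group space whose coordinates start at 0, and non-negative vector components; the function names are hypothetical.

```c
#include <assert.h>

/* Time step of group g = (g[0], g[1]) under scheduling vector pi: t = pi . g */
int time_step(const int pi[2], const int g[2])
{
    return pi[0] * g[0] + pi[1] * g[1];
}

/* Number of time steps needed for a group space of g1_count x g2_count
 * groups. With non-negative pi the first group runs at step 0 and the
 * last one at pi . (g1_count - 1, g2_count - 1). */
int schedule_length(const int pi[2], int g1_count, int g2_count)
{
    assert(g1_count > 0 && g2_count > 0);
    assert(pi[0] >= 0 && pi[1] >= 0);
    int last[2] = { g1_count - 1, g2_count - 1 };
    return time_step(pi, last) + 1;
}
```

For a 4 x 2 group space, Π = (1,0) gives 4 time steps and Π = (1,1) gives 5: under this reading the overlapping example uses one extra step per additional group along the second dimension.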

Scheduling onto a Fixed Number of SMPs
• Dynamic Scheduling by the Operating System
  • Run-time overhead of creating a large number of processes
  • Context switching slows down the execution
• Static Scheduling at Compile Time

Scheduling onto a Fixed Number of SMPs
• Cyclic Assignment Schedule
• Mirror Assignment Schedule
• Cluster Assignment Schedule
• Retiling

Cyclic Assignment
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: tile rows divided into chunks; CPU labels in slide order: SMP1 (CPU1, CPU0), SMP0 (CPU1, CPU0), then the same pattern again for the next chunk]

Cyclic Assignment – Non-Overlapping Communication
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cyclic Assignment – Overlapping Communication
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cyclic Assignment – Communication
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: communication between chunks, with the same CPU labels as in the assignment figure]
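
The cyclic pattern in the figures above can be written as a small mapping function. This is a sketch under assumptions: tile rows along the mapped dimension are numbered from 0, row 0 goes to SMP0 CPU0, and the names `cpu_id` and `cyclic_owner` are hypothetical.

```c
#include <assert.h>

/* Identifies one CPU of the cluster. */
struct cpu_id {
    int node;   /* SMP node index */
    int cpu;    /* CPU index within the node */
};

/* Cyclic assignment: rows are dealt out round-robin over all CPUs, filling
 * the CPUs of one node before moving on to the next node. */
struct cpu_id cyclic_owner(int row, int nodes, int cpus_per_node)
{
    assert(row >= 0 && nodes > 0 && cpus_per_node > 0);
    int slot = row % (nodes * cpus_per_node);   /* position within the chunk */
    struct cpu_id id = { slot / cpus_per_node, slot % cpus_per_node };
    return id;
}
```

With 2 nodes of 2 CPUs each, rows 0 to 3 form the first chunk and row 4 wraps around to SMP0 CPU0, so the last row of one chunk and the first row of the next always sit on different nodes.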

Mirror Assignment
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: tile rows divided into chunks; CPU labels in slide order: SMP0 (CPU0, CPU1), SMP1 (CPU0, CPU1), then mirrored for the next chunk: SMP1 (CPU1, CPU0), SMP0 (CPU1, CPU0)]

Mirror Assignment – Non-Overlapping Communication
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Mirror Assignment – Overlapping Communication
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Mirror Assignment – Communication
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: communication between chunks, with the same CPU labels as in the assignment figure]
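
The mirrored pattern in the figures above differs from a round-robin deal only in that every second chunk is traversed in reverse. The sketch below assumes rows numbered from 0 with row 0 on SMP0 CPU0; `cpu_id` and `mirror_owner` are hypothetical names.

```c
#include <assert.h>

/* Identifies one CPU of the cluster. */
struct cpu_id {
    int node;   /* SMP node index */
    int cpu;    /* CPU index within the node */
};

/* Mirror assignment: even chunks are assigned in ascending CPU order,
 * odd chunks in descending order, so the CPU order is reflected at
 * every chunk boundary. */
struct cpu_id mirror_owner(int row, int nodes, int cpus_per_node)
{
    assert(row >= 0 && nodes > 0 && cpus_per_node > 0);
    int total = nodes * cpus_per_node;
    int chunk = row / total;
    int slot = row % total;
    if (chunk % 2 == 1)
        slot = total - 1 - slot;   /* reverse the order in odd chunks */
    struct cpu_id id = { slot / cpus_per_node, slot % cpus_per_node };
    return id;
}
```

With 2 nodes of 2 CPUs each, rows 3 and 4 both land on SMP1 CPU1, so under this mapping the data crossing a chunk boundary stays on one CPU.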

Cluster Assignment
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: neighbouring tiles clustered into a larger “TILE” (GROUPS and TILES), one cluster per CPU: SMP0 (CPU0, CPU1), SMP1 (CPU0, CPU1)]

Cluster Assignment – Non-Overlapping Communication
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cluster Assignment – Overlapping Communication
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cluster Assignment – Communication
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: communication between GROUPS and TILES on SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
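
A cluster assignment like the one pictured above can be sketched as a block distribution: each CPU receives one contiguous run of tile rows. The block size (the row count divided by the CPU count, rounded up), the row numbering from 0, and the names `cpu_id` and `cluster_owner` are assumptions of this sketch.

```c
#include <assert.h>

/* Identifies one CPU of the cluster. */
struct cpu_id {
    int node;   /* SMP node index */
    int cpu;    /* CPU index within the node */
};

/* Cluster assignment: consecutive rows form one cluster per CPU. */
struct cpu_id cluster_owner(int row, int rows, int nodes, int cpus_per_node)
{
    assert(nodes > 0 && cpus_per_node > 0);
    assert(rows > 0 && row >= 0 && row < rows);
    int total = nodes * cpus_per_node;
    int block = (rows + total - 1) / total;   /* rows per cluster, rounded up */
    int slot = row / block;                   /* index of the owning CPU */
    struct cpu_id id = { slot / cpus_per_node, slot % cpus_per_node };
    return id;
}
```

With 8 rows on 2 nodes of 2 CPUs each, rows 0 and 1 go to SMP0 CPU0, rows 2 and 3 to SMP0 CPU1, and so on, so neighbouring rows exchange data across a node boundary only once, between rows 3 and 4.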

Retiling
Retiling on 2 SMP nodes with 2 CPUs each
New tiles retain the computation volume of an old tile.
[Figure: old tiles and new tiles, mapped onto SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Retiling – Non-Overlapping Communication
Retiling on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Retiling – Overlapping Communication
Retiling on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Retiling – Communication
Retiling on 2 SMP nodes with 2 CPUs each
[Figure: communication between new tiles on SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
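
One possible reading of "retaining computation volume of a tile" is sketched below: the tile is stretched along the mapped dimension until there are exactly as many tile rows as CPUs, and shrunk along the other dimension by the same factor so that its area is unchanged. That reading, the choice of i2 as the mapped dimension, the exact-divisibility requirement, and the names `tile` and `retile` are all assumptions of this sketch.

```c
#include <assert.h>

/* Rectangular tile of a 2-D iteration space. */
struct tile {
    int along_i1;   /* tile extent along i1 */
    int along_i2;   /* tile extent along i2, the dimension split across CPUs */
};

/* Reshape a tile so that old_rows rows of old tiles become total_cpus rows
 * of new tiles, while along_i1 * along_i2 stays constant. */
struct tile retile(struct tile old, int old_rows, int total_cpus)
{
    assert(old.along_i1 > 0 && old.along_i2 > 0);
    assert(old_rows > 0 && total_cpus > 0);
    /* Both scalings must be exact for the volume to be preserved. */
    assert((old.along_i2 * old_rows) % total_cpus == 0);
    assert((old.along_i1 * total_cpus) % old_rows == 0);

    struct tile t;
    t.along_i2 = old.along_i2 * old_rows / total_cpus;   /* one row per CPU */
    t.along_i1 = old.along_i1 * total_cpus / old_rows;   /* same area as before */
    return t;
}
```

For example, 8 rows of 10 x 10 tiles on 4 CPUs become 4 rows of 5 x 20 tiles, each still covering 100 iterations.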

Experimental Platform
• Linux SMP (Symmetric Multi-Processors) Cluster
• 2 nodes
• 1 GB RAM
• 2 Pentium III 1266 MHz
• Myrinet high performance interconnect
• GM low level message passing system

The Myrinet interconnect
• User-level Networking
• Based on the GM message passing interface
• All message exchange using DMA
• Directly to/from pinned userspace buffers
• Communication is offloaded to the NIC
• Programmable NIC
• LANai RISC processor @ 133-333 MHz
• 2-8 MB SRAM
• 2+2 Gbps full duplex fiber links

GM Architecture
• Consists of three main parts
  • User library
  • Kernel driver
  • Firmware on NIC
• OS bypass design
• Regions of NIC memory mapped into the virtual memory of a process
[Figure: Application and GM Library in user space, GM kernel module in the kernel, GM firmware on the NIC]

Sending and Receiving messages over Myrinet/GM
[Figure: sending and receiving application, each with a buffer and an event queue on the host; on each NIC a send queue, a receive queue, host DMA, the LANai, and send and receive DMA]

Initial Code
    for (i = 1; i <= X; i++)
      for (j = 1; j <= Y; j++)
        for (k = 1; k <= Z; k++) {
          A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
        }

Experimental results
[Figure: two panels, Non Overlapping Execution Scheme and Overlapping Execution Scheme; x-axis: Height of Iteration Space (500 to 3500); y-axis: Speedup / # processors (0.5 to 1); curves: cyclic, mirror, cluster, retile]

Simulation results
[Figure: two panels, Non Overlapping Execution Scheme and Overlapping Execution Scheme; x-axis: Height of Iteration Space (0 to 20000); y-axis: Speedup / # processors (0.3 to 1); curves: cyclic, mirror, cluster, retile]

Simulation results
[Figure: two panels, Non Overlapping Execution Scheme and Overlapping Execution Scheme; x-axis: Height of Iteration Space (0 to 20000); y-axis: Speedup / # processors (0.3 to 1); curves: cyclic, mirror, cluster, retile]
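
The y-axis of the plots above, Speedup / # processors, is the quantity usually called parallel efficiency. The helper below is a small sketch that assumes speedup is measured as sequential time over parallel time; the slides do not state the baseline, and the function name is hypothetical.

```c
#include <assert.h>

/* Parallel efficiency: speedup divided by the number of processors.
 * t_seq and t_par are execution times in the same unit. */
double efficiency(double t_seq, double t_par, int procs)
{
    assert(t_seq >= 0.0 && t_par > 0.0 && procs > 0);
    double speedup = t_seq / t_par;
    return speedup / procs;
}
```

A value of 1 means every processor is kept fully busy; on the 4-CPU platform described above, a run that is twice as fast as the sequential one would score 0.5.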

Advantages - Disadvantages