Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes

Maria Athanasaki, Evangelos Koukis, Nectarios Koziris
National Technical University of Athens, School of Electrical and Computer Engineering, Computing Systems Laboratory

Previous work
• M. Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris, "Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces", SuperComputing Conference on High Performance Networking and Computing (SC2002), Baltimore, Maryland, November 16-22, 2002.
• G. Goumas, A. Sotiropoulos, N. Koziris, "Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping", Proceedings of the 2001 International Parallel and Distributed Processing Symposium (IPDPS2001), IEEE Press, San Francisco, California, April 2001.

Overview
• Tiling for parallelization
• Non-overlapping vs. overlapping execution scheme
• Grouping
• Application on a cluster of SMPs with a fixed number of nodes
• Experimental and simulation results

Nested For-Loops
    for (i1=l1; i1<=u1; i1++)
      for (i2=l2; i2<=u2; i2++)
        …
          for (in=ln; in<=un; in++) {
            Loop Body
          }

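Dependence Vectors
    for (i1=0; i1<=7; i1++)
      for (i2=0; i2<=7; i2++)
        A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
Each iteration reads the values produced at (i1-1, i2) and (i1, i2-1), so the dependence vectors are (1,0) and (0,1).
[Figure: iteration space with axes i1 and i2 showing the dependence vectors]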

Tiling
[Figure: iteration space (axes i1, i2) divided into tiles, with tiles assigned to Processor 0 and Processor 1]
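As a rough illustration of the tiling slide, the following C sketch runs the example loop nest tile by tile using square tiles. The names `run_tiled` and `run_untiled` and the `tile` parameter are hypothetical. The loops start at 1 and treat row 0 and column 0 as boundary values so that `A[i1-1]` and `A[i2-1]` stay in bounds, which is an adjustment to the 0..7 bounds on the slide. Since both dependences of this loop point in non-negative directions, visiting the tiles in lexicographic order should give the same result as the original loop.

```c
#include <assert.h>

#define N 8   /* iteration space is N x N, as in the 0..7 example */

/* Hypothetical helper: runs the example loop nest tile by tile.
 * Row 0 and column 0 of A hold boundary values, so indices start at 1.
 * Returns the number of tiles visited. */
int run_tiled(long A[N][N], int tile)
{
    assert(tile > 0);
    int tiles = 0;
    for (int t1 = 1; t1 < N; t1 += tile) {        /* tile origin along i1 */
        for (int t2 = 1; t2 < N; t2 += tile) {    /* tile origin along i2 */
            int e1 = (t1 + tile < N) ? t1 + tile : N;   /* clip at the edge */
            int e2 = (t2 + tile < N) ? t2 + tile : N;
            for (int i1 = t1; i1 < e1; i1++)
                for (int i2 = t2; i2 < e2; i2++)
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
            tiles++;
        }
    }
    return tiles;
}

/* Reference: the same loop nest without tiling. */
void run_untiled(long A[N][N])
{
    for (int i1 = 1; i1 < N; i1++)
        for (int i2 = 1; i2 < N; i2++)
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}
```

Calling `run_tiled(A, 2)` and `run_untiled(B)` on two identically initialized arrays is one way to check that the tiled order leaves the computed values unchanged.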

Non-Overlapping Scheme
[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]

Non-Overlapping vs. Overlapping Scheme
[Figure: side-by-side diagrams for processors P0 to P3 under each scheme]

Overlapping Scheme
[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]

Generalization to SMPs - "Grouping"
[Figure: tiles grouped onto SMP0 to SMP3, each node with CPU0 and CPU1]

Example: Grouping + Non-Overlapping Communication Scheme
Scheduling vector Π=(1,0)
[Figure: tile space and group space, with groups mapped to SMP node0 and SMP node1]

Example: Grouping + Overlapping Communication Scheme
Scheduling vector Π=(1,1)
[Figure: tile space and group space, with groups mapped to SMP node0 and SMP node1]
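One way to read the scheduling vectors on the two grouping examples is as linear schedules, assuming the usual convention that a point j = (j1, j2) executes at time step Π·j. The sketch below only evaluates that dot product. The function names are hypothetical, and whether the coordinates are taken in tile space or in group space is left to the caller.

```c
#include <assert.h>

/* Time step assigned to point (j1, j2) by scheduling vector (p1, p2):
 * the dot product of the two vectors. */
int time_step(int p1, int p2, int j1, int j2)
{
    return p1 * j1 + p2 * j2;
}

/* Number of time steps for an n1 x n2 space whose first point is (0,0),
 * assuming non-negative scheduling vector components. */
int schedule_length(int p1, int p2, int n1, int n2)
{
    assert(p1 >= 0 && p2 >= 0 && n1 > 0 && n2 > 0);
    return time_step(p1, p2, n1 - 1, n2 - 1) + 1;
}
```

Under this reading, Π=(1,0) gives every point with the same first coordinate the same time step, while Π=(1,1) advances along diagonals, so a 4 x 2 space takes 4 steps with the first vector and 5 with the second.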

Scheduling onto a Fixed Number of SMPs
• Dynamic Scheduling by the Operating System
  • Run-time overhead for generating a large number of processes
  • Context switching slows down the execution
• Static Scheduling at Compile Time

Scheduling onto a Fixed Number of SMPs
• Cyclic Assignment Schedule
• Mirror Assignment Schedule
• Cluster Assignment Schedule
• Retiling

Cyclic Assignment
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: two chunks of tiles; in each chunk the tiles are labeled SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1), in the same order]

Cyclic Assignment - Non-Overlapping Communication
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cyclic Assignment - Overlapping Communication
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cyclic Assignment - Communication
Cyclic assignment on 2 SMP nodes with 2 CPUs each
[Figure: communication between the two chunks of tiles assigned to SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
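The cyclic assignment pictured above can be sketched as a modulo mapping. This is a minimal sketch, assuming tiles are numbered 0, 1, 2, … along the dimension that is distributed across CPUs, and that a chunk is one full pass over all CPUs. The function names and the "slot" numbering (CPUs counted across nodes) are hypothetical.

```c
#include <assert.h>

/* Global CPU slot (0 .. nodes*cpus_per_node-1) for tile row `row` under a
 * cyclic assignment: every chunk hands out rows to the CPUs in the same order. */
int cyclic_slot(int row, int nodes, int cpus_per_node)
{
    assert(row >= 0 && nodes > 0 && cpus_per_node > 0);
    return row % (nodes * cpus_per_node);
}

/* Index of the chunk that contains tile row `row`. */
int chunk_of(int row, int nodes, int cpus_per_node)
{
    assert(row >= 0 && nodes > 0 && cpus_per_node > 0);
    return row / (nodes * cpus_per_node);
}

/* Split a global slot into (SMP node, CPU within node). */
int slot_node(int slot, int cpus_per_node) { return slot / cpus_per_node; }
int slot_cpu(int slot, int cpus_per_node)  { return slot % cpus_per_node; }
```

For 2 nodes with 2 CPUs each, rows 0 to 3 form chunk 0 and go to SMP0/CPU0, SMP0/CPU1, SMP1/CPU0, SMP1/CPU1; row 4 starts chunk 1 and wraps back to SMP0/CPU0.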

Mirror Assignment
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: two chunks of tiles; the order of SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1) in one chunk is reversed in the next]

Mirror Assignment - Non-Overlapping Communication
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Mirror Assignment - Overlapping Communication
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Mirror Assignment - Communication
Mirror assignment on 2 SMP nodes with 2 CPUs each
[Figure: communication between the two mirrored chunks of tiles]
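The mirror assignment can be sketched in the same style: even-numbered chunks use the CPUs in ascending order and odd-numbered chunks in descending order, so the boundary between two chunks stays on the same CPU. This is an illustrative reading of the figure, with a hypothetical function name and the same hypothetical slot numbering (node = slot / cpus_per_node, CPU = slot % cpus_per_node).

```c
#include <assert.h>

/* Global CPU slot for tile row `row` under a mirror assignment:
 * the CPU order is reversed in every second chunk. */
int mirror_slot(int row, int nodes, int cpus_per_node)
{
    assert(row >= 0 && nodes > 0 && cpus_per_node > 0);
    int total = nodes * cpus_per_node;   /* CPUs in the whole cluster */
    int chunk = row / total;             /* which pass over the CPUs */
    int pos   = row % total;             /* position inside the chunk */
    return (chunk % 2 == 0) ? pos : total - 1 - pos;
}
```

With 2 nodes of 2 CPUs, rows 0 to 3 map to slots 0, 1, 2, 3 and rows 4 to 7 map to slots 3, 2, 1, 0, so rows 3 and 4 are neighbors on the same CPU.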

Cluster Assignment
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: tiles clustered into larger "TILE"s, one per CPU of SMP0 and SMP1]

Cluster Assignment
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: tiles and groups, assigned to SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cluster Assignment - Non-Overlapping Communication
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cluster Assignment - Overlapping Communication
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Cluster Assignment - Communication
Cluster assignment on 2 SMP nodes with 2 CPUs each
[Figure: communication between tiles and groups on SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
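A cluster assignment can be approximated as a block distribution, where each CPU receives one contiguous run of tiles. The sketch below assumes that reading of the figure; the function name, the `total_rows` parameter, and the slot numbering (node = slot / cpus_per_node, CPU = slot % cpus_per_node) are hypothetical.

```c
#include <assert.h>

/* Global CPU slot for tile row `row` when `total_rows` rows are split into
 * contiguous clusters, one cluster per CPU. */
int cluster_slot(int row, int total_rows, int nodes, int cpus_per_node)
{
    assert(nodes > 0 && cpus_per_node > 0);
    assert(row >= 0 && row < total_rows);
    int total = nodes * cpus_per_node;                    /* CPUs in the cluster */
    int rows_per_cpu = (total_rows + total - 1) / total;  /* ceiling division */
    return row / rows_per_cpu;
}
```

For 8 tile rows on 2 nodes with 2 CPUs each, rows 0 and 1 go to slot 0, rows 2 and 3 to slot 1, and so on. When the row count is not a multiple of the CPU count, the last CPUs receive fewer rows or none.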

Retiling
Retiling on 2 SMP nodes with 2 CPUs each
• Old tiles are replaced by new tiles, retaining the computation volume of a tile
[Figure: old tiles and new tiles, with the new tiles assigned to SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Retiling - Non-Overlapping Communication
Retiling on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Retiling - Overlapping Communication
Retiling on 2 SMP nodes with 2 CPUs each
[Figure: schedule over time t for SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]

Retiling - Communication
Retiling on 2 SMP nodes with 2 CPUs each
[Figure: communication between the new tiles on SMP0 (CPU0, CPU1) and SMP1 (CPU0, CPU1)]
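The retiling idea can be illustrated with a small calculation: pick a new tile height so that the number of tile rows equals the number of CPUs, then adjust the width so that the tile keeps its computation volume. This is only a sketch under assumptions: 2D rectangular tiles, "height" meaning the extent along the dimension distributed across CPUs, and exact divisibility. The `tile_shape` type and `retile` function are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int height;   /* extent along the dimension distributed across CPUs */
    int width;    /* extent along the other dimension */
} tile_shape;

/* New tile shape with exactly one row of tiles per CPU and the same
 * volume (height * width) as the old tile. Assumes exact divisibility. */
tile_shape retile(int space_height, int old_height, int old_width,
                  int nodes, int cpus_per_node)
{
    int total = nodes * cpus_per_node;
    assert(total > 0 && space_height > 0 && old_height > 0 && old_width > 0);
    assert(space_height % total == 0);

    tile_shape t;
    t.height = space_height / total;             /* one tile row per CPU */
    long volume = (long)old_height * old_width;  /* volume to retain */
    assert(volume % t.height == 0);
    t.width = (int)(volume / t.height);
    return t;
}
```

For an iteration space of height 16 on 2 nodes with 2 CPUs each, an old 2 x 8 tile becomes a 4 x 4 tile: four tile rows, one per CPU, each tile still covering 16 iterations.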

Experimental Platform
• Linux SMP (Symmetric Multi-Processors) Cluster
• 2 nodes
• 1 GB RAM
• 2 Pentium III 1266 MHz
• Myrinet high performance interconnect
• GM low level message passing system

The Myrinet interconnect
• User-level Networking
• Based on the GM message passing interface
• All message exchange using DMA
• Directly to/from pinned userspace buffers
• Communication is offloaded to the NIC
• Programmable NIC
• LANai RISC processor @ 133-333 MHz
• 2-8 MB SRAM
• 2+2 Gbps full duplex fiber links

GM Architecture
• Consists of three main parts
• User library
• Kernel driver
• Firmware on NIC
• OS bypass design
• Regions of NIC memory mapped to the VM of a process
[Figure: layers of the GM stack: application and GM library (user), GM kernel module (kernel), GM firmware (NIC)]

Sending and Receiving messages over Myrinet/GM
[Figure: sending and receiving applications, each with a buffer and an event queue on the host; each NIC with send and receive queues, host DMA, the LANai processor, and send and receive DMA]

Initial Code
    for (i=1; i<=X; i++)
      for (j=1; j<=Y; j++)
        for (k=1; k<=Z; k++) {
          A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
        }

Experimental results
[Figure: two plots, Non Overlapping Execution Scheme and Overlapping Execution Scheme. x-axis: Height of Iteration Space (500 to 3500); y-axis: Speedup / # processors (0.5 to 1); curves: cyclic, mirror, cluster, retile]

Simulation results
[Figure: two plots, Non Overlapping Execution Scheme and Overlapping Execution Scheme. x-axis: Height of Iteration Space (0 to 20000); y-axis: Speedup / # processors (0.3 to 1); curves: cyclic, mirror, cluster, retile]

Advantages - Disadvantages
