Delivering High Performance to Parallel Applications Using Advanced Scheduling

Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
Overview
• Introduction
• Background
• Code Generation
  • Computation/Data Distribution
  • Communication Schemes
  • Summary
• Experimental Results
• Conclusions – Future Work

Parallel Computing 2003
Introduction
Motivation:
• A lot of theoretical work has been done on arbitrary tiling, but actual experimental results are lacking
• There is no complete method to generate code for non-rectangular tiles
Introduction
Contribution:
• A complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
• Simulation of blocking and non-blocking communication primitives
• Experimental evaluation of the proposed scheduling scheme
Background
Algorithmic model:

FOR j1 = min1 TO max1 DO
  …
  FOR jn = minn TO maxn DO
    Computation(j1, …, jn);
  ENDFOR
  …
ENDFOR

• Perfectly nested loops
• Constant flow data dependencies (D)
Background
Tiling:
• A popular loop transformation
• Groups iterations into atomic units
• Enhances locality on uniprocessors
• Enables coarse-grain parallelism on distributed memory systems
• Valid tiling matrix H: H·d ≥ 0 for every dependence vector d in D
Tiling Transformation
Example:

FOR j1 = 0 TO 11 DO
  FOR j2 = 0 TO 8 DO
    A[j1,j2] := A[j1-1,j2] + A[j1-1,j2-1];
  ENDFOR
ENDFOR
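As a concrete illustration of what the tiling transformation does to this loop, a minimal C sketch of the example nest and its 3×3 rectangularly tiled counterpart follows. The halo row/column at index 0 (standing in for the j = -1 accesses) and the boundary initialisation are assumptions for the sake of a self-contained example; lexicographic tile order is legal here because H·d ≥ 0 holds for both dependences (1,0) and (1,1).

```c
#include <string.h>

#define N1 12   /* j1 = 0..11 in the slide, shifted to 1..12 here */
#define N2 9    /* j2 = 0..8, shifted to 1..9 */
#define TILE 3

/* Original loop nest: A[j1][j2] = A[j1-1][j2] + A[j1-1][j2-1].
   Row/column 0 is a halo standing in for the out-of-bounds accesses. */
static void compute_untiled(double A[N1 + 1][N2 + 1]) {
    for (int j1 = 1; j1 <= N1; j1++)
        for (int j2 = 1; j2 <= N2; j2++)
            A[j1][j2] = A[j1 - 1][j2] + A[j1 - 1][j2 - 1];
}

/* Same computation after 3x3 rectangular tiling (P = diag(3,3)):
   the two outer loops enumerate tiles, the two inner loops the
   iterations inside each tile. */
static void compute_tiled(double A[N1 + 1][N2 + 1]) {
    for (int t1 = 1; t1 <= N1; t1 += TILE)
        for (int t2 = 1; t2 <= N2; t2 += TILE)
            for (int j1 = t1; j1 < t1 + TILE && j1 <= N1; j1++)
                for (int j2 = t2; j2 < t2 + TILE && j2 <= N2; j2++)
                    A[j1][j2] = A[j1 - 1][j2] + A[j1 - 1][j2 - 1];
}

/* Sanity check: the tiled nest performs the identical computation,
   only in a reordered (but dependence-preserving) sequence. */
static int tiled_matches_untiled(void) {
    double A[N1 + 1][N2 + 1], B[N1 + 1][N2 + 1];
    memset(A, 0, sizeof A);
    memset(B, 0, sizeof B);
    for (int j2 = 0; j2 <= N2; j2++)
        A[0][j2] = B[0][j2] = 1.0;   /* arbitrary boundary values */
    compute_untiled(A);
    compute_tiled(B);
    return memcmp(A, B, sizeof A) == 0;
}
```

Because every operation reads values that are final in both orders, the two versions produce bit-identical arrays.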
Rectangular Tiling Transformation

      [ 3  0 ]          [ 1/3   0  ]
  P = [ 0  3 ]      H = [  0   1/3 ]
Non-rectangular Tiling Transformation

      [ 3  3 ]          [ 1/3  -1/3 ]
  P = [ 0  3 ]      H = [  0    1/3 ]
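Both tiling matrices above can be checked against the standard validity condition H·d ≥ 0 (componentwise, for every dependence column d), which guarantees that no tile depends on a lexicographically later tile. A small C sketch, using the example loop's dependences (1,0) and (1,1):

```c
#include <stdbool.h>

/* A tiling matrix H is valid iff H*d >= 0 (componentwise) for every
   dependence vector d: dependences must never cross tile boundaries
   "backwards". */
static bool tiling_is_valid(const double H[2][2],
                            const int deps[][2], int ndeps) {
    for (int k = 0; k < ndeps; k++)
        for (int i = 0; i < 2; i++)
            if (H[i][0] * deps[k][0] + H[i][1] * deps[k][1] < 0.0)
                return false;
    return true;
}

/* Dependences of the example loop: (1,0) and (1,1). */
static const int D[2][2] = {{1, 0}, {1, 1}};

/* H = P^-1 for the rectangular tiling P = [3 0; 0 3] ... */
static const double H_rect[2][2] = {{1.0 / 3, 0.0}, {0.0, 1.0 / 3}};
/* ... and for the non-rectangular tiling P = [3 3; 0 3]. */
static const double H_skew[2][2] = {{1.0 / 3, -1.0 / 3}, {0.0, 1.0 / 3}};
```

For the skewed matrix, H·(1,1) = (0, 1/3) and H·(1,0) = (1/3, 0), so both tilings are valid for this loop.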
Why Non-rectangular Tiling?
• Reduces communication (8 communication points vs. 6 in the example)
• Enables more efficient scheduling schemes (6 time steps vs. 5)
Computation Distribution
We map all tiles along the longest dimension to the same processor because this:
• reduces the number of processors required
• simplifies message-passing
• reduces total execution time when overlapping computation with communication
Computation Distribution (figure: chains of tiles along j1 assigned to processors P1–P3)
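In the 2D figure this mapping rule has a very simple form: the processor owning a tile is determined by the tile coordinates with the longest dimension dropped. A minimal sketch (the function name is hypothetical, not from the slides):

```c
/* Map tile (t1, t2) to a processor: all tiles along the longest
   tile-space dimension (here t1) belong to the same processor, so
   the owner is determined by t2 alone, as in the P1..P3 figure. */
static int owner_of_tile(int t1, int t2) {
    (void)t1;    /* t1 only selects the position within the chain */
    return t2;   /* one processor per chain of tiles */
}
```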
Data Distribution
• Computer-owns rule: each processor owns the data it computes
• Arbitrary convex iteration space, arbitrary tiling
• Rectangular local iteration and data spaces
Communication Schemes
• With whom do I communicate? (figure: neighbouring processors P1–P3 in tile space)
• What do I send?
Blocking Scheme (figure: pipelined execution over P1–P3, 12 time steps)
Non-blocking Scheme (figure: pipelined execution over P1–P3, 6 time steps)
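The gap between the two schemes can be reproduced with a simple cost model (an assumption, not from the slides): computing one tile and transmitting one message each take one time step; under the blocking scheme a tile occupies two steps (compute, then send), while the non-blocking scheme overlaps the send with the next computation, so a tile occupies one step. With the 3 processors and 4 tiles per processor suggested by the figures, this model yields exactly the 12 and 6 time steps shown:

```c
/* Completion step of the last tile in a pipelined schedule:
   P processors with T tiles each; a tile may start once the previous
   tile on the same processor is done AND the neighbour's matching
   tile has arrived. `cost` is the number of time steps one tile
   occupies: 2 for the blocking scheme (compute + non-overlapped
   send), 1 for the non-blocking scheme (send overlapped). */
static int schedule_steps(int P, int T, int cost) {
    int end[P][T];   /* completion step of tile t on processor p */
    for (int p = 0; p < P; p++)
        for (int t = 0; t < T; t++) {
            int prev_own = (t > 0) ? end[p][t - 1] : 0;
            int prev_nbr = (p > 0) ? end[p - 1][t] : 0;
            end[p][t] = (prev_own > prev_nbr ? prev_own : prev_nbr) + cost;
        }
    return end[P - 1][T - 1];
}
```

For P = 3, T = 4 this gives schedule_steps(3, 4, 2) = 12 and schedule_steps(3, 4, 1) = 6, matching the two figures.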
Code Generation Summary

Sequential Code
  → Dependence Analysis, Tiling Transformation
Sequential Tiled Code
  → Parallelization: Computation Distribution, Data Distribution, Communication Primitives
Parallel SPMD Code

Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme
Code Summary – Blocking Scheme (code listing)
Code Summary – Non-blocking Scheme (code listing)
Experimental Results
• 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel 2.4.20)
• MPICH v1.2.5 (--with-device=p4, --with-comm=shared)
• g++ compiler v2.95.4 (-O3)
• FastEthernet interconnect
• 2 micro-kernel benchmarks (3D):
  • Gauss Successive Over-Relaxation (SOR)
  • Texture Smoothing Code (TSC)
• Simulation of communication schemes
SOR
• Iteration space: M × N × N
• Dependence matrix:
• Rectangular tiling:
• Non-rectangular tiling:
SOR (experimental results figures)
TSC
• Iteration space: T × N × N
• Dependence matrix:
• Rectangular tiling:
• Non-rectangular tiling:
TSC (experimental results figures)
Conclusions
• Automatic code generation for arbitrarily tiled iteration spaces can be efficient
• High performance can be achieved by means of:
  • a suitable tiling transformation
  • overlapping computation with communication
Future Work
• Application of the methodology to imperfectly nested loops and non-constant dependencies
• Investigation of hybrid programming models (MPI+OpenMP)
• Performance evaluation on advanced interconnection networks (SCI, Myrinet)
Questions?
http://www.cslab.ece.ntua.gr/~ndros