Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Introduction • Motivation: • SMP clusters • Hybrid programming models • Existing hybrid approaches: mostly fine-grain MPI-OpenMP paradigms • Existing hybrid approaches: mostly DOALL parallelization EuroPVM/MPI 2003
Introduction • Contribution: • 3 programming models for the parallelization of nested loop algorithms • pure MPI • fine-grain hybrid MPI-OpenMP • coarse-grain hybrid MPI-OpenMP • Advanced hyperplane scheduling • minimizes synchronization needs • overlaps computation with communication EuroPVM/MPI 2003
Introduction
Algorithmic Model:
  FOR j_0 = min_0 TO max_0 DO
    …
      FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
        Computation(j_0, …, j_{n-1});
      ENDFOR
    …
  ENDFOR
• Perfectly nested loops
• Constant flow data dependencies
EuroPVM/MPI 2003
Introduction Target Architecture: SMP clusters EuroPVM/MPI 2003
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Pure MPI Model • Tiling transformation groups iterations into atomic execution units (tiles) • Pipelined execution • Overlapping computation with communication • Makes no distinction between inter-node and intra-node communication EuroPVM/MPI 2003
Pure MPI Model Example: FOR j1=0 TO 9 DO FOR j2=0 TO 7 DO A[j1,j2]:=A[j1-1,j2] + A[j1,j2-1]; ENDFOR ENDFOR EuroPVM/MPI 2003
Pure MPI Model
[Figure: the tiled j1 x j2 iteration space mapped onto 4 MPI processes, one per CPU (NODE0/NODE1, CPU0/CPU1), executed as a pipeline]
EuroPVM/MPI 2003
Pure MPI Model
  tile_0 = nod_0;
  …
  tile_{n-2} = nod_{n-2};
  FOR tile_{n-1} = 0 TO … DO
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    Compute(tile);
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  ENDFOR
EuroPVM/MPI 2003
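As a concrete illustration of the pure MPI tile loop, here is a minimal, self-contained C/MPI sketch for the 2D example above. It is not the paper's code: the 1D row decomposition, the sizes X, Y, the tile width TW and the single halo row are assumptions, and only the send of each tile's boundary is overlapped with the following tile's computation, whereas the paper's schedule (built on the transformed tile space) also overlaps the receive.

/* Minimal pipelined pure-MPI sketch for the 2D example.  Assumptions:
 * 1D row decomposition along j1, X divisible by the number of processes,
 * Y divisible by TW, one halo row per process, zero boundary values. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

enum { X = 512, Y = 8192, TW = 128 };            /* illustrative sizes */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = X / nprocs;                       /* local j1 rows           */
    double *A   = calloc((size_t)(rows + 1) * Y, sizeof *A); /* row 0: halo */
    double *snd = malloc(TW * sizeof *snd);      /* pack buffer             */
    int up = rank - 1, down = rank + 1;          /* pipeline neighbours     */
    MPI_Request sreq = MPI_REQUEST_NULL;

    for (int t = 0; t < Y / TW; t++) {           /* sweep the j2 tiles      */
        int j0 = t * TW;
        if (up >= 0)                             /* halo row for tile t     */
            MPI_Recv(&A[j0], TW, MPI_DOUBLE, up, t,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= rows; i++)          /* compute tile t          */
            for (int j = j0; j < j0 + TW; j++)
                if (j > 0)
                    A[i * Y + j] = A[(i - 1) * Y + j] + A[i * Y + j - 1];

        if (down < nprocs) {                     /* pack and forward the    */
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);  /* last local row; the     */
            memcpy(snd, &A[rows * Y + j0], TW * sizeof *snd);
            MPI_Isend(snd, TW, MPI_DOUBLE, down, t,   /* Isend overlaps the */
                      MPI_COMM_WORLD, &sreq);         /* next tile's work   */
        }
    }
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    free(A); free(snd);
    MPI_Finalize();
    return 0;
}

Each process blocks in MPI_Recv until its upstream neighbour has computed the same tile, which produces the pipelined (wavefront) execution sketched above.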
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Hyperplane Scheduling • Implements coarse-grain parallelism assuming inter-tile data dependencies • Tiles are organized into data-independent subsets (groups) • Tiles of the same group can be concurrently executed by multiple threads • Barrier synchronization between threads EuroPVM/MPI 2003
Hyperplane Scheduling
[Figure: tile groups (hyperplanes) of the j1 x j2 tile space mapped onto 2 MPI nodes x 2 OpenMP threads (NODE0/NODE1, CPU0/CPU1)]
EuroPVM/MPI 2003
Hyperplane Scheduling
  #pragma omp parallel
  {
    group_0 = nod_0;
    …
    group_{n-2} = nod_{n-2};
    tile_0 = nod_0 * m_0 + th_0;
    …
    tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
    FOR (group_{n-1}) {
      tile_{n-1} = group_{n-1} - …;
      if (0 <= tile_{n-1} <= …)
        compute(tile);
      #pragma omp barrier
    }
  }
EuroPVM/MPI 2003
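A minimal OpenMP-only sketch of this hyperplane schedule over a 2D tile grid: group g contains all tiles with ti + tj = g, which carry no mutual dependencies and can run concurrently, and a barrier separates consecutive groups. The tile counts TI and TJ, the one-tile-row-per-thread assignment and compute_tile() are illustrative assumptions, not the paper's code.

/* Hyperplane (wavefront) schedule over a 2D tile grid with OpenMP only.
 * Assumes one tile row per thread: run with OMP_NUM_THREADS >= TI. */
#include <omp.h>
#include <stdio.h>

enum { TI = 4, TJ = 16 };                      /* illustrative tile counts */

static void compute_tile(int ti, int tj)       /* stand-in for the tile body */
{
    printf("thread %d computes tile (%d,%d)\n", omp_get_thread_num(), ti, tj);
}

int main(void)
{
    #pragma omp parallel
    {
        int ti = omp_get_thread_num();         /* this thread's tile row   */

        for (int g = 0; g < TI + TJ - 1; g++) {/* hyperplane groups        */
            int tj = g - ti;                   /* this thread's tile in g  */
            if (ti < TI && tj >= 0 && tj < TJ) /* does the tile exist?     */
                compute_tile(ti, tj);
            #pragma omp barrier                /* wait for the whole group */
        }
    }
    return 0;
}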
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Fine-grain Model • Incremental parallelization of the computationally intensive parts • Relatively straightforward transition from pure MPI • Threads are (re)spawned at each computation phase • Inter-node communication kept outside of the multi-threaded part • Thread synchronization through the implicit barrier of the omp parallel directive EuroPVM/MPI 2003
Fine-grain Model
  FOR (group_{n-1}) {
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  }
EuroPVM/MPI 2003
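A compilable skeleton of the fine-grain hybrid scheme follows (GROUPS, TW, the neighbour ranks and the valid()/compute_tile() helpers are illustrative stand-ins, not the paper's code): MPI calls stay outside the parallel region, so MPI_THREAD_FUNNELED support suffices here, and a fresh thread team is forked for every group, with the implicit barrier of omp parallel providing the per-group synchronization.

/* Fine-grain hybrid skeleton: MPI outside the parallel region,
 * threads (re)spawned per group. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

enum { GROUPS = 16, TW = 128 };

static int  valid(int group, int tid) { return tid <= group; } /* placeholder */
static void compute_tile(int group, int tid) { (void)group; (void)tid; /* tile body */ }

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *snd = calloc(TW, sizeof *snd), *rcv = calloc(TW, sizeof *rcv);
    int up = rank - 1, down = rank + 1;

    for (int g = 0; g < GROUPS; g++) {
        MPI_Request req[2]; int nreq = 0;
        if (down < nprocs)          /* would carry the previous group's data */
            MPI_Isend(snd, TW, MPI_DOUBLE, down, g, MPI_COMM_WORLD, &req[nreq++]);
        if (up >= 0)                /* halo needed by an upcoming group      */
            MPI_Irecv(rcv, TW, MPI_DOUBLE, up, g, MPI_COMM_WORLD, &req[nreq++]);

        #pragma omp parallel        /* team forked for this group only       */
        {
            int tid = omp_get_thread_num();
            if (valid(g, tid))
                compute_tile(g, tid);
        }                           /* implicit barrier of omp parallel      */

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        /* unpacking rcv into the halo region would go here */
    }
    free(snd); free(rcv);
    MPI_Finalize();
    return 0;
}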
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Coarse-grain Model • SPMD paradigm • Requires more programming effort • Threads are only spawned once • Inter-node communication inside multi-threaded part (requires MPI_THREAD_MULTIPLE) • Thread synchronization through explicit barrier (omp barrier directive) EuroPVM/MPI 2003
Coarse-grain Model
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR (group_{n-1}) {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, nod);
        MPI_Isend(snd_buf, dest(nod));
        MPI_Irecv(recv_buf, src(nod));
      }
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall;
        Unpack(recv_buf, tile_{n-1} + 1, nod);
      }
      #pragma omp barrier
    }
  }
EuroPVM/MPI 2003
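The corresponding coarse-grain skeleton (again with illustrative stand-ins for the sizes, neighbours and helpers): a single parallel region spans the whole sweep, the master thread issues the MPI calls inside it, the other threads' tile computations overlap with the transfers, and an explicit omp barrier closes every group. MPI_THREAD_MULTIPLE is requested because the slide calls for it, although only the master communicates in this sketch.

/* Coarse-grain hybrid skeleton: one parallel region for the whole sweep,
 * MPI issued from the master thread, explicit barrier per group. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

enum { GROUPS = 16, TW = 128 };

static int  valid(int group, int tid) { return tid <= group; } /* placeholder */
static void compute_tile(int group, int tid) { (void)group; (void)tid; /* tile body */ }

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* the slides require MPI_THREAD_MULTIPLE support for this model */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *snd = calloc(TW, sizeof *snd), *rcv = calloc(TW, sizeof *rcv);
    int up = rank - 1, down = rank + 1;
    MPI_Request req[2];                 /* shared, touched by master only */
    int nreq = 0;

    #pragma omp parallel                /* threads are spawned only once  */
    {
        int tid = omp_get_thread_num();

        for (int g = 0; g < GROUPS; g++) {
            #pragma omp master          /* communication: master thread   */
            {
                nreq = 0;
                if (down < nprocs)
                    MPI_Isend(snd, TW, MPI_DOUBLE, down, g, MPI_COMM_WORLD, &req[nreq++]);
                if (up >= 0)
                    MPI_Irecv(rcv, TW, MPI_DOUBLE, up, g, MPI_COMM_WORLD, &req[nreq++]);
            }                           /* no implied barrier: compute overlaps */

            if (valid(g, tid))
                compute_tile(g, tid);

            #pragma omp master
            {
                MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
                /* unpacking rcv into the halo region would go here */
            }
            #pragma omp barrier         /* explicit per-group synchronization */
        }
    }
    free(snd); free(rcv);
    MPI_Finalize();
    return 0;
}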
Summary: Fine-grain vs Coarse-grain EuroPVM/MPI 2003
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Experimental Results • 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20) • MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared) • Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static) • FastEthernet interconnection • ADI micro-kernel benchmark (3D) EuroPVM/MPI 2003
Alternating Direction Implicit (ADI) • Unitary data dependencies • 3D Iteration Space (X x Y x Z) EuroPVM/MPI 2003
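The toy loop nest below only illustrates the dependence pattern the benchmark exposes, i.e. unit flow dependencies along all three dimensions of the X x Y x Z iteration space; the actual ADI update is more involved, and toy sizes replace the experimental ones.

/* Toy 3D sweep showing the unit flow dependencies of the ADI benchmark. */
#include <stdio.h>
#include <stdlib.h>

enum { X = 8, Y = 8, Z = 16 };          /* the runs used e.g. 512 x 512 x 8192 */

int main(void)
{
    double (*A)[Y][Z] = calloc(X, sizeof *A);    /* zero boundary planes */

    for (int j = 0; j < Y; j++)                  /* one non-trivial boundary */
        for (int k = 0; k < Z; k++)
            A[0][j][k] = 1.0;

    for (int i = 1; i < X; i++)                  /* unit dependencies along  */
        for (int j = 1; j < Y; j++)              /* all three dimensions     */
            for (int k = 1; k < Z; k++)
                A[i][j][k] = A[i - 1][j][k] + A[i][j - 1][k] + A[i][j][k - 1];

    printf("A[%d][%d][%d] = %g\n", X - 1, Y - 1, Z - 1, A[X - 1][Y - 1][Z - 1]);
    free(A);
    return 0;
}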
ADI – 4 nodes EuroPVM/MPI 2003
ADI – 4 nodes • X < Y • X > Y EuroPVM/MPI 2003
ADI X=512 Y=512 Z=8192 – 4 nodes EuroPVM/MPI 2003
ADI X=128 Y=512 Z=8192 – 4 nodes EuroPVM/MPI 2003
ADI X=512 Y=128 Z=8192 – 4 nodes EuroPVM/MPI 2003
ADI – 2 nodes EuroPVM/MPI 2003
ADI – 2 nodes • X < Y • X > Y EuroPVM/MPI 2003
ADI X=128 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=256 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=512 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=512 Y=256 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=512 Y=128 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=128 Y=512 Z=8192 – 2 nodes (computation vs. communication breakdown) EuroPVM/MPI 2003
ADI X=512 Y=128 Z=8192 – 2 nodes (computation vs. communication breakdown) EuroPVM/MPI 2003
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Conclusions • Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm • Hybrid models can be competitive with the pure MPI paradigm • The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated to implement • Programming efficiently in OpenMP is not easier than programming efficiently in MPI EuroPVM/MPI 2003
Future Work • Application of the methodology to real applications and benchmarks • Load balancing for the coarse-grain model • Performance evaluation on advanced interconnection networks (SCI, Myrinet) • Generalization as a compiler technique EuroPVM/MPI 2003
Questions? http://www.cslab.ece.ntua.gr/~ndros EuroPVM/MPI 2003