Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Introduction • Motivation: • SMP clusters • Hybrid programming models • Existing hybrid approaches: mostly fine-grain MPI-OpenMP paradigms • Existing hybrid approaches: mostly DOALL parallelization EuroPVM/MPI 2003
Introduction • Contribution: • 3 programming models for the parallelization of nested loop algorithms • pure MPI • fine-grain hybrid MPI-OpenMP • coarse-grain hybrid MPI-OpenMP • Advanced hyperplane scheduling • minimizes synchronization needs • overlaps computation with communication EuroPVM/MPI 2003
Introduction
Algorithmic Model:
  FOR j_0 = min_0 TO max_0 DO
    …
      FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
        Computation(j_0, …, j_{n-1});
      ENDFOR
    …
  ENDFOR
• Perfectly nested loops
• Constant flow data dependencies
EuroPVM/MPI 2003
Introduction Target Architecture: SMP clusters EuroPVM/MPI 2003
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Pure MPI Model • Tiling transformation groups iterations into atomic execution units (tiles) • Pipelined execution • Overlapping computation with communication • Makes no distinction between inter-node and intra-node communication EuroPVM/MPI 2003
Pure MPI Model Example: FOR j1=0 TO 9 DO FOR j2=0 TO 7 DO A[j1,j2]:=A[j1-1,j2] + A[j1,j2-1]; ENDFOR ENDFOR EuroPVM/MPI 2003
Pure MPI Model
[Figure: the tiled j1 x j2 iteration space mapped onto 4 MPI processes, one per CPU (NODE0/NODE1, CPU0/CPU1), executed as a pipeline]
EuroPVM/MPI 2003
Pure MPI Model
  tile_0 = nod_0;
  …
  tile_{n-2} = nod_{n-2};
  FOR tile_{n-1} = 0 TO … DO
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    Compute(tile);
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  ENDFOR
EuroPVM/MPI 2003
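As a concrete illustration of the pure MPI tile loop, here is a minimal, self-contained C/MPI sketch for the 2D example above. It is not the paper's code: the 1D row decomposition, the sizes X, Y, the tile width TW and the single halo row are assumptions, and only the send of each tile's boundary is overlapped with the following tile's computation, whereas the paper's schedule (built on the transformed tile space) also overlaps the receive.

/* Minimal pipelined pure-MPI sketch for the 2D example.  Assumptions:
 * 1D row decomposition along j1, X divisible by the number of processes,
 * Y divisible by TW, one halo row per process, zero boundary values. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

enum { X = 512, Y = 8192, TW = 128 };            /* illustrative sizes */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = X / nprocs;                       /* local j1 rows           */
    double *A   = calloc((size_t)(rows + 1) * Y, sizeof *A); /* row 0: halo */
    double *snd = malloc(TW * sizeof *snd);      /* pack buffer             */
    int up = rank - 1, down = rank + 1;          /* pipeline neighbours     */
    MPI_Request sreq = MPI_REQUEST_NULL;

    for (int t = 0; t < Y / TW; t++) {           /* sweep the j2 tiles      */
        int j0 = t * TW;
        if (up >= 0)                             /* halo row for tile t     */
            MPI_Recv(&A[j0], TW, MPI_DOUBLE, up, t,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= rows; i++)          /* compute tile t          */
            for (int j = j0; j < j0 + TW; j++)
                if (j > 0)
                    A[i * Y + j] = A[(i - 1) * Y + j] + A[i * Y + j - 1];

        if (down < nprocs) {                     /* pack and forward the    */
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);  /* last local row; the     */
            memcpy(snd, &A[rows * Y + j0], TW * sizeof *snd);
            MPI_Isend(snd, TW, MPI_DOUBLE, down, t,   /* Isend overlaps the */
                      MPI_COMM_WORLD, &sreq);         /* next tile's work   */
        }
    }
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    free(A); free(snd);
    MPI_Finalize();
    return 0;
}

Each process blocks in MPI_Recv until its upstream neighbour has computed the same tile, which produces the pipelined (wavefront) execution sketched above.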
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Hyperplane Scheduling • Implements coarse-grain parallelism assuming inter-tile data dependencies • Tiles are organized into data-independent subsets (groups) • Tiles of the same group can be concurrently executed by multiple threads • Barrier synchronization between threads EuroPVM/MPI 2003
Hyperplane Scheduling
[Figure: tile groups (hyperplanes) of the j1 x j2 tile space mapped onto 2 MPI nodes x 2 OpenMP threads (NODE0/NODE1, CPU0/CPU1)]
EuroPVM/MPI 2003
Hyperplane Scheduling
  #pragma omp parallel
  {
    group_0 = nod_0;
    …
    group_{n-2} = nod_{n-2};
    tile_0 = nod_0 * m_0 + th_0;
    …
    tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
    FOR (group_{n-1}) {
      tile_{n-1} = group_{n-1} - …;
      if (0 <= tile_{n-1} <= …)
        compute(tile);
      #pragma omp barrier
    }
  }
EuroPVM/MPI 2003
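A minimal OpenMP-only sketch of this hyperplane schedule over a 2D tile grid: group g contains all tiles with ti + tj = g, which carry no mutual dependencies and can run concurrently, and a barrier separates consecutive groups. The tile counts TI and TJ, the one-tile-row-per-thread assignment and compute_tile() are illustrative assumptions, not the paper's code.

/* Hyperplane (wavefront) schedule over a 2D tile grid with OpenMP only.
 * Assumes one tile row per thread: run with OMP_NUM_THREADS >= TI. */
#include <omp.h>
#include <stdio.h>

enum { TI = 4, TJ = 16 };                      /* illustrative tile counts */

static void compute_tile(int ti, int tj)       /* stand-in for the tile body */
{
    printf("thread %d computes tile (%d,%d)\n", omp_get_thread_num(), ti, tj);
}

int main(void)
{
    #pragma omp parallel
    {
        int ti = omp_get_thread_num();         /* this thread's tile row   */

        for (int g = 0; g < TI + TJ - 1; g++) {/* hyperplane groups        */
            int tj = g - ti;                   /* this thread's tile in g  */
            if (ti < TI && tj >= 0 && tj < TJ) /* does the tile exist?     */
                compute_tile(ti, tj);
            #pragma omp barrier                /* wait for the whole group */
        }
    }
    return 0;
}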
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Fine-grain Model • Incremental parallelization of the computationally intensive parts • Relatively straightforward transition from pure MPI • Threads are (re)spawned at each computation phase • Inter-node communication kept outside of the multi-threaded part • Thread synchronization through the implicit barrier of the omp parallel directive EuroPVM/MPI 2003
Fine-grain Model
  FOR (group_{n-1}) {
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  }
EuroPVM/MPI 2003
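A compilable skeleton of the fine-grain hybrid scheme follows (GROUPS, TW, the neighbour ranks and the valid()/compute_tile() helpers are illustrative stand-ins, not the paper's code): MPI calls stay outside the parallel region, so MPI_THREAD_FUNNELED support suffices here, and a fresh thread team is forked for every group, with the implicit barrier of omp parallel providing the per-group synchronization.

/* Fine-grain hybrid skeleton: MPI outside the parallel region,
 * threads (re)spawned per group. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

enum { GROUPS = 16, TW = 128 };

static int  valid(int group, int tid) { return tid <= group; } /* placeholder */
static void compute_tile(int group, int tid) { (void)group; (void)tid; /* tile body */ }

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *snd = calloc(TW, sizeof *snd), *rcv = calloc(TW, sizeof *rcv);
    int up = rank - 1, down = rank + 1;

    for (int g = 0; g < GROUPS; g++) {
        MPI_Request req[2]; int nreq = 0;
        if (down < nprocs)          /* would carry the previous group's data */
            MPI_Isend(snd, TW, MPI_DOUBLE, down, g, MPI_COMM_WORLD, &req[nreq++]);
        if (up >= 0)                /* halo needed by an upcoming group      */
            MPI_Irecv(rcv, TW, MPI_DOUBLE, up, g, MPI_COMM_WORLD, &req[nreq++]);

        #pragma omp parallel        /* team forked for this group only       */
        {
            int tid = omp_get_thread_num();
            if (valid(g, tid))
                compute_tile(g, tid);
        }                           /* implicit barrier of omp parallel      */

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        /* unpacking rcv into the halo region would go here */
    }
    free(snd); free(rcv);
    MPI_Finalize();
    return 0;
}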
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Coarse-grain Model • SPMD paradigm • Requires more programming effort • Threads are only spawned once • Inter-node communication inside multi-threaded part (requires MPI_THREAD_MULTIPLE) • Thread synchronization through explicit barrier (omp barrier directive) EuroPVM/MPI 2003
Coarse-grain Model
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR (group_{n-1}) {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, nod);
        MPI_Isend(snd_buf, dest(nod));
        MPI_Irecv(recv_buf, src(nod));
      }
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall;
        Unpack(recv_buf, tile_{n-1} + 1, nod);
      }
      #pragma omp barrier
    }
  }
EuroPVM/MPI 2003
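The corresponding coarse-grain skeleton (again with illustrative stand-ins for the sizes, neighbours and helpers): a single parallel region spans the whole sweep, the master thread issues the MPI calls inside it, the other threads' tile computations overlap with the transfers, and an explicit omp barrier closes every group. MPI_THREAD_MULTIPLE is requested because the slide calls for it, although only the master communicates in this sketch.

/* Coarse-grain hybrid skeleton: one parallel region for the whole sweep,
 * MPI issued from the master thread, explicit barrier per group. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

enum { GROUPS = 16, TW = 128 };

static int  valid(int group, int tid) { return tid <= group; } /* placeholder */
static void compute_tile(int group, int tid) { (void)group; (void)tid; /* tile body */ }

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* the slides require MPI_THREAD_MULTIPLE support for this model */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *snd = calloc(TW, sizeof *snd), *rcv = calloc(TW, sizeof *rcv);
    int up = rank - 1, down = rank + 1;
    MPI_Request req[2];                 /* shared, touched by master only */
    int nreq = 0;

    #pragma omp parallel                /* threads are spawned only once  */
    {
        int tid = omp_get_thread_num();

        for (int g = 0; g < GROUPS; g++) {
            #pragma omp master          /* communication: master thread   */
            {
                nreq = 0;
                if (down < nprocs)
                    MPI_Isend(snd, TW, MPI_DOUBLE, down, g, MPI_COMM_WORLD, &req[nreq++]);
                if (up >= 0)
                    MPI_Irecv(rcv, TW, MPI_DOUBLE, up, g, MPI_COMM_WORLD, &req[nreq++]);
            }                           /* no implied barrier: compute overlaps */

            if (valid(g, tid))
                compute_tile(g, tid);

            #pragma omp master
            {
                MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
                /* unpacking rcv into the halo region would go here */
            }
            #pragma omp barrier         /* explicit per-group synchronization */
        }
    }
    free(snd); free(rcv);
    MPI_Finalize();
    return 0;
}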
Summary: Fine-grain vs Coarse-grain EuroPVM/MPI 2003
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Experimental Results • 8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20) • MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared) • Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static) • FastEthernet interconnection • ADI micro-kernel benchmark (3D) EuroPVM/MPI 2003
Alternating Direction Implicit (ADI) • Unitary data dependencies • 3D Iteration Space (X x Y x Z) EuroPVM/MPI 2003
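The toy loop nest below only illustrates the dependence pattern the benchmark exposes, i.e. unit flow dependencies along all three dimensions of the X x Y x Z iteration space; the actual ADI update is more involved, and toy sizes replace the experimental ones.

/* Toy 3D sweep showing the unit flow dependencies of the ADI benchmark. */
#include <stdio.h>
#include <stdlib.h>

enum { X = 8, Y = 8, Z = 16 };          /* the runs used e.g. 512 x 512 x 8192 */

int main(void)
{
    double (*A)[Y][Z] = calloc(X, sizeof *A);    /* zero boundary planes */

    for (int j = 0; j < Y; j++)                  /* one non-trivial boundary */
        for (int k = 0; k < Z; k++)
            A[0][j][k] = 1.0;

    for (int i = 1; i < X; i++)                  /* unit dependencies along  */
        for (int j = 1; j < Y; j++)              /* all three dimensions     */
            for (int k = 1; k < Z; k++)
                A[i][j][k] = A[i - 1][j][k] + A[i][j - 1][k] + A[i][j][k - 1];

    printf("A[%d][%d][%d] = %g\n", X - 1, Y - 1, Z - 1, A[X - 1][Y - 1][Z - 1]);
    free(A);
    return 0;
}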
ADI – 4 nodes EuroPVM/MPI 2003
ADI – 4 nodes • X < Y • X > Y EuroPVM/MPI 2003
ADI X=512 Y=512 Z=8192 – 4 nodes EuroPVM/MPI 2003
ADI X=128 Y=512 Z=8192 – 4 nodes EuroPVM/MPI 2003
ADI X=512 Y=128 Z=8192 – 4 nodes EuroPVM/MPI 2003
ADI – 2 nodes EuroPVM/MPI 2003
ADI – 2 nodes • X < Y • X > Y EuroPVM/MPI 2003
ADI X=128 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=256 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=512 Y=512 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=512 Y=256 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=512 Y=128 Z=8192 – 2 nodes EuroPVM/MPI 2003
ADI X=128 Y=512 Z=8192 – 2 nodes (computation vs. communication breakdown) EuroPVM/MPI 2003
ADI X=512 Y=128 Z=8192 – 2 nodes (computation vs. communication breakdown) EuroPVM/MPI 2003
Overview • Introduction • Pure MPI Model • Hybrid MPI-OpenMP Models • Hyperplane Scheduling • Fine-grain Model • Coarse-grain Model • Experimental Results • Conclusions – Future Work EuroPVM/MPI 2003
Conclusions • Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm • Hybrid models can be competitive with the pure MPI paradigm • The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated to implement • Programming efficiently in OpenMP is not easier than programming efficiently in MPI EuroPVM/MPI 2003
Future Work • Application of the methodology to real applications and benchmarks • Load balancing for the coarse-grain model • Performance evaluation on advanced interconnection networks (SCI, Myrinet) • Generalization as a compiler technique EuroPVM/MPI 2003
Questions? http://www.cslab.ece.ntua.gr/~ndros EuroPVM/MPI 2003