What is required for "standard" distributed parallel programming model?
Mitsuhisa Sato, Taisuke Boku, and Jinpil Lee, University of Tsukuba
My Background and Position
• OpenMP
  • A standard parallel programming model and API for shared memory multiprocessors
  • Extends the base language (Fortran/C/C++) with directives or pragmas
  • Incremental parallel programming; the sequential semantics are preserved when the directives are ignored
  • Allows a range of programming styles
  • For scientific applications; support for loop-based parallelism
  • Target: small-scale (~16 processors) to medium-scale (~64 processors)
  • The first draft was published in 1997, and the standard is now being accepted for the multi-core era
• Omni OpenMP compiler project (now inactive)
  • The project was done in the Real World Computing Partnership (RWCP, ~2002)
  • Research objectives
    • Portable implementation of OpenMP for SMPs
    • Design and implementation of a cluster-enabled OpenMP for PC/WS/SMP clusters
      • Supports seamless programming from SMPs to clusters
      • Uses a page-based software distributed shared memory system
  • Free and open source, released since 1998
Agenda
• OpenMPD: a directive-based programming model for distributed memory
• What is required for a "standard" distributed parallel programming model?
OpenMPD: a directive-based programming model for distributed memory
• Objectives
  • Provide a simple and "easy-to-understand" programming model for distributed memory
    • OpenMP is only for shared memory, not for distributed memory
  • Support data parallelization and typical parallelization patterns by adding directives similar to OpenMP (inspired by OpenMP)
Features of OpenMPD
• Directive-based programming model for distributed memory systems
  • C programming language (Fortran) + directives
• Explicit communication and synchronization
  • All actions are taken by directives, which keeps performance tuning "easy to understand"
  • Supports typical communication patterns: scatter/gather, reduction, neighbor communication, …
• Directives describe typical data parallelization
  • Array distribution, data synchronization, …
• Highly portable implementation by translation to MPI
  • The compiler translates the directives into parallel code using MPI functions
Code Example

int array[YMAX][XMAX];
#pragma ompd distvar(var = array; dim = 2)            /* data distribution */

main(){
    int i, j, res;
    res = 0;
#pragma ompd for affinity(array) reduction(+:res)     /* work sharing and data synchronization */
    for(i = 0; i < 10; i++)
        for(j = 0; j < 10; j++){
            array[i][j] = func(i, j);
            res += array[i][j];
        }
}

The directives are simply added to the serial code: incremental parallelization.
The same code written in MPI

int array[YMAX][XMAX];

main(int argc, char **argv){
    int i, j, res, temp_res, dx, llimit, ulimit, size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    dx = YMAX/size;
    llimit = rank * dx;
    if(rank != (size - 1)) ulimit = llimit + dx;
    else ulimit = YMAX;

    temp_res = 0;
    for(i = llimit; i < ulimit; i++)
        for(j = 0; j < 10; j++){
            array[i][j] = func(i, j);
            temp_res += array[i][j];
        }

    MPI_Allreduce(&temp_res, &res, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
}
Array data distribution
• Each processor computes on a different region of the array (see the block-decomposition sketch below)
• #pragma ompd distvar(var=list; dim=num; sleeve=size)
• In the current implementation, the whole array is replicated on each node
  (Figure: array[] divided into blocks assigned to CPU0 through CPU3)
• A reference to data assigned to another node requires synchronization on the data: either a sync on the whole array or a sync on the sleeve area
• The programmer chooses which sync is required
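To make the block decomposition concrete, here is a minimal sketch in plain C of how row ranges of array[YMAX] might be assigned to nodes. block_range() is a hypothetical helper introduced only for illustration, not part of OpenMPD; it mirrors the llimit/ulimit computation in the hand-written MPI code above.

/* Minimal sketch (not OpenMPD code): a 1-D block distribution assigning
   contiguous row ranges of array[YMAX] to the nodes CPU0..CPU3.
   block_range() is a hypothetical helper for illustration only. */
#include <stdio.h>

#define YMAX 100

static void block_range(int rank, int size, int n, int *lo, int *hi)
{
    int dx = n / size;                        /* base block length per node */
    *lo = rank * dx;
    *hi = (rank == size - 1) ? n : *lo + dx;  /* last node takes the remainder */
}

int main(void)
{
    int size = 4;                             /* pretend there are 4 nodes */
    for (int rank = 0; rank < size; rank++) {
        int lo, hi;
        block_range(rank, size, YMAX, &lo, &hi);
        printf("CPU%d owns rows [%d, %d)\n", rank, lo, hi);
    }
    return 0;
}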
Data synchronization of array (Gather)
• Gather operation to distribute the data to every node
  • #pragma ompd gather(var=list)
• Executes communication to get the data assigned to other nodes (see the MPI sketch below)
  (Figure: the blocks of array[] owned by CPU0 through CPU3 are collected so that every node can access correct data by local access)
• The easiest way to synchronize, but the communication is expensive!
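As an illustration of what the gather directive could translate to, here is a minimal MPI sketch, assuming a 1-D block distribution of the rows, a fully replicated array on every node (as described above), and YMAX divisible by the number of processes. ompd_gather_array() is a hypothetical name, not an OpenMPD runtime function, and the actual translation may differ.

/* Minimal sketch (assumption, not the actual OpenMPD translation):
   gather a row-block-distributed array[YMAX][XMAX] into the full replica
   held on every node, using MPI_Allgather with MPI_IN_PLACE.
   Assumes YMAX is divisible by the number of processes. */
#include <mpi.h>

#define YMAX 100
#define XMAX 100

int array[YMAX][XMAX];

void ompd_gather_array(void)      /* hypothetical helper name */
{
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int rows = YMAX / size;       /* rows owned by each rank */
    /* Each rank contributes its own block in place; afterwards every rank
       holds the fully synchronized array and can use local access only. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  array, rows * XMAX, MPI_INT, MPI_COMM_WORLD);
}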
Data synchronization of array (Sleeve)
• Exchanges data only on the "sleeve" region
  • If only neighbor data is needed, communicating just the sleeve area is enough
  • Example: b[i] = array[i-1] + array[i+1]
• #pragma ompd distvar(var = array; dim = 1; sleeve = 1)
  (Figure: the programmer specifies the sleeve region of array[] explicitly; each of CPU0 through CPU3 exchanges only its boundary elements)
• Unlike the gather operation, communication on the sleeve is cheap
• The user has to specify the sleeve region and its size
• Directive: #pragma ompd sync_sleeve(var=array) (see the MPI sketch below)
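A possible MPI realization of the sleeve exchange is sketched below for a 1-D distribution with sleeve = 1 over a replicated double array. ompd_sync_sleeve() and the lo/hi bounds are illustrative names; the actual OpenMPD translation may differ.

/* Minimal sketch (assumption, not the actual OpenMPD translation):
   exchange a sleeve of width 1 with the left and right neighbors.
   lo/hi are the bounds of the locally owned elements of the replicated
   array; the sleeve cells to be filled are array[lo-1] and array[hi]. */
#include <mpi.h>

void ompd_sync_sleeve(double *array, int lo, int hi)   /* hypothetical helper */
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    double dummy;   /* target for receives from MPI_PROC_NULL (no data arrives) */
    double *left_sleeve  = (left  != MPI_PROC_NULL) ? &array[lo - 1] : &dummy;
    double *right_sleeve = (right != MPI_PROC_NULL) ? &array[hi]     : &dummy;

    /* send my first owned element to the left, receive my right sleeve */
    MPI_Sendrecv(&array[lo],     1, MPI_DOUBLE, left,  0,
                 right_sleeve,   1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my last owned element to the right, receive my left sleeve */
    MPI_Sendrecv(&array[hi - 1], 1, MPI_DOUBLE, right, 1,
                 left_sleeve,    1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}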
Parallel Execution of "for" loop
• Execute the for loop to compute on the array
  • Example: for(i = 2; i <= 10; i++)
• The loop is executed in parallel with affinity to the array distribution:
  #pragma ompd for affinity(array) (see the loop-clipping sketch below)
  (Figure: the data region computed by the for loop is split across CPU0 through CPU3 according to the array distribution)
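To show what the affinity clause implies, here is a minimal sketch of the loop-range clipping each node would perform for the loop on this slide. owned_loop(), lo, and hi are illustrative names, not OpenMPD runtime symbols.

/* Minimal sketch (assumption): with "#pragma ompd for affinity(array)",
   each node executes only the iterations of "for(i = 2; i <= 10; i++)"
   whose index falls inside its locally owned row range [lo, hi). */
static void owned_loop(int lo, int hi)        /* hypothetical helper */
{
    int start = (2  > lo)     ? 2  : lo;      /* max(2, lo)      */
    int end   = (10 < hi - 1) ? 10 : hi - 1;  /* min(10, hi - 1) */
    for (int i = start; i <= end; i++) {
        /* the loop body touches only rows this node owns */
        (void)i;
    }
}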
Experimental Results
• Constant speed-up with moderate scalability
• Performance is degraded by the lack of multi-dimensional array distribution
Related Work
• OpenMP
  • Only for shared memory
• Unified Parallel C
  • A PGAS (Partitioned Global Address Space) language
• Co-Array Fortran
  • Also PGAS
• The latter two provide alternative programming models to MPI for distributed memory
• OpenWP?
Future Work and Plan for OpenMPD
• Multi-dimensional array distribution and nested parallel loop execution
• Integration of PGAS features for more flexible communication patterns and data distributions
  • The current OpenMPD supports only typical cases
• Remote memory access (one-sided communication); see the MPI sketch below
  • Each node should allocate only its part of the assigned data
  • Address translation is required
• Support for hybrid programming with OpenMP within a node on SMP/multicore clusters, even together with MPI!
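For the remote memory access item above, here is a minimal sketch of plain MPI-2 one-sided communication of the kind such an extension could build on. This is ordinary MPI RMA, not an OpenMPD feature, and it needs at least two processes to run.

/* Minimal sketch: MPI one-sided communication (RMA). Each rank exposes its
   local[] buffer through an MPI window; rank 0 then reads rank 1's data with
   MPI_Get without rank 1 posting a matching receive. Plain MPI-2, not OpenMPD.
   Run with at least two processes. */
#include <mpi.h>

#define N 100

int main(int argc, char **argv)
{
    int local[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Win_create(local, N * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_fence(0, win);                 /* open an access epoch */
    if (rank == 0) {
        int remote[N];
        MPI_Get(remote, N, MPI_INT, 1, 0, N, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                 /* complete all RMA operations */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}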
Agenda
• OpenMPD: a directive-based programming model for distributed memory
• What is required for a "standard" distributed parallel programming model?
Message Passing Model (MPI)
• The message passing model was the dominant programming model in the past.
  • … Yes.
• Message passing is the dominant programming model today.
  • … Unfortunately, yes…
• Will OpenMP be a programming model for future systems?
  • … I hope so, but it is not perfect.
  • OpenMP is only for the shared memory model.
  • (I think) some features for performance tuning are missing: data mapping, scalability, I/O, …
For application programmers
• Are programmers satisfied with MPI?
  • Yes…? Many programmers write MPI.
• Is MPI enough for parallelizing scientific programs?
  • The application programmer's concern is to get their answers faster!!
  • An automatic parallelizing compiler would be the best, but … many problems remain.
"Life is too short for MPI" (from the WOMPAT2001 T-shirt message)
• Simple N-body problem
• With MPI, the programmer writes data partitioning, scheduling, and communication (broadcast, reduction): it takes several hours.
• With OpenMP, just put #pragma omp parallel for at the loops: it takes just a few tens of minutes!!!

#pragma omp parallel for private(j, p, q, ax, ay, az, dx, dy, dz, X, f)
for(i = 0; i < n_particles; i++) {
    p = &particles[i];
    ax = 0.0; ay = 0.0; az = 0.0;
    for(j = 0; j < n_particles; j++){
        if(i == j) continue;
        q = &particles[j];
        dx = p->x - q->x;
        dy = p->y - q->y;
        dz = p->z - q->z;
        X = dx * dx + dy * dy + dz * dz;
        if (X < b2) {
            f = q->m * (X - a2) * (X - b2);
            ax += f * dx;
            ay += f * dy;
            az += f * dz;
        }
    }
    p->ax = ax; p->ay = ay; p->az = az;
}

#pragma omp parallel for private(p)
for(i = 0; i < n_particles; i++){
    p = &particles[i];
    p->x += p->vx * DT;
    p->y += p->vy * DT;
    p->z += p->vz * DT;
    p->vx += p->ax * DT;
    p->vy += p->ay * DT;
    p->vz += p->az * DT;
}
Parallel programming languages
pC++, SISAL, NESL, Cilk, pHaskell, Prolog, Orca, mpC, C*, Dataparallel C, Jade, HPC++, MPC++, HPF, Linda, Mentat, Fortran M, Occam, APL, SAL, Split-C, Fortran D, V, Charm++, CODE, ZPL, Fortran X3H5, …
• Programming language design reflects its model.
• So far, many parallel programming languages have been proposed in the computer science community.
• Are they actually used by application users?
• Where have they gone?
• What is missing in them?
Think about MPI, …
• Why was MPI accepted and so successful?
  • Portability: most parallel computing platforms can run MPI programs (even SMPs).
    • There is much free and portable software, such as MPICH.
  • Education: the MPI standard allows many programmers to learn MPI parallel programming.
    • At university
    • From books
Discussion
• The demand for parallel programming is increasing!!
  • Low-cost PC clusters
  • SMPs in PC boxes
  • On-chip multiprocessors, … multiprocessors even in PDAs now!
• Of course, a clear and excellent modeling concept, good performance, … many factors are important!
• Standardization and education are important for widespread use.
  • Standardization enables good education.
  • It must be available on many platforms.
Discussion
• The cost of parallelization is also important for acceptance by application programmers.
  • It must be easy to migrate from an original sequential program.
  • What application programmers need to learn must be small.
• We plan to organize a group for a "standard" parallel programming language for petaflops systems.
  • It will be supported by RIKEN.
  • We will try to find funding for development.
  • It should be international.
• For a standard, the "agreement" process is more important than "advanced" ideas.
  • Standardization and Education