What is required for "standard" distributed parallel programming model? Mitsuhisa Sato, Taisuke Boku, and Jinpil Lee (University of Tsukuba)
My Background and Position • OpenMP • A standard parallel programming model and API for shared memory multiprocessors • Extends the base language (Fortran/C/C++) with directives or pragmas • Incremental parallel programming: sequential semantics are preserved when the directives are ignored • Allows a range of programming styles • For scientific applications: support for loop-based parallelism • Target: small-scale (~16 processors) to medium-scale (~64 processors) • The first draft was published in 1997; the standard is now gaining acceptance in the multi-core era. • Omni OpenMP compiler project (… now inactive) • Carried out in the Real World Computing Partnership (RWCP, ~2002) • Research objectives • Portable implementation of OpenMP for SMPs • Design and implementation of cluster-enabled OpenMP for PC/WS/SMP clusters • Support seamless programming from SMPs to clusters • Using a page-based software distributed shared memory system • Free and open source, released since 1998
Agenda • OpenMPD : directive-based programming model for distributed memory • What is required for "standard" distributed parallel programming model?
OpenMPD : directive-based programming model for distributed memory • Objectives • Provide a simple and “easy-to-understand” programming model for distributed memory • OpenMP is only for shared memory, not for distributed memory • Support data parallelization and typical parallelization patterns by adding directives similar to those of OpenMP (inspired by OpenMP)
Features of OpenMPD • Directive-based programming model for distributed memory systems • C programming language (Fortran) + directives • Explicit communication and synchronization • Every action is triggered by a directive, which keeps the program “easy-to-understand” for performance tuning • Supports typical communication patterns • Scatter/gather, reduction, neighbor communication, … • “Directives” describe typical data parallelization • Array distribution, data synchronization, … • Highly portable implementation by translation to MPI • The compiler translates the directives into parallel code that uses MPI functions
Code Example
int array[YMAX][XMAX];
/* data distribution */
#pragma ompd distvar(var = array; dim = 2)

main(){
    int i, j, res;
    res = 0;
    /* work sharing and data synchronization */
#pragma ompd for affinity(array) reduction(+:res)
    for(i = 0; i < 10; i++)
        for(j = 0; j < 10; j++){
            array[i][j] = func(i, j);
            res += array[i][j];
        }
}
• Directives are added to the serial code: incremental parallelization
The same code written in MPI
#include <mpi.h>

int array[YMAX][XMAX];

int main(int argc, char **argv){
    int i, j, res, temp_res, dx, llimit, ulimit, size, rank;

    MPI_Init(&argc, &argv);   /* MPI_Init takes pointers to argc and argv */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* block distribution of the rows; the last rank takes the remainder */
    dx = YMAX / size;
    llimit = rank * dx;
    if(rank != (size - 1)) ulimit = llimit + dx;
    else ulimit = YMAX;

    temp_res = 0;
    for(i = llimit; i < ulimit; i++)
        for(j = 0; j < 10; j++){
            array[i][j] = func(i, j);
            temp_res += array[i][j];
        }

    /* combine the partial sums of all ranks into res */
    MPI_Allreduce(&temp_res, &res, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
Array data distribution • Each processor computes on a different region of the array • #pragma ompd distvar(var=list; dim=num; sleeve=size) • In the current implementation, the whole array is replicated on each node • A reference to data assigned to another node requires synchronization on the data: sync on the whole array, or sync on the sleeve area only • The programmer chooses which sync is required [Figure: array[] divided into blocks assigned to CPU0, CPU1, CPU2, CPU3]
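The block distribution that `distvar` describes can be sketched in plain C. This is a minimal sketch, not OpenMPD's actual implementation: the type `block_range` and the function `block_owned` are hypothetical names, and the arithmetic simply mirrors the `llimit`/`ulimit` computation of the hand-written MPI example, where the last rank absorbs the remainder.

```c
#include <assert.h>

/* Half-open index range [lo, hi) owned by one process. */
typedef struct {
    int lo;
    int hi;
} block_range;

/*
 * Hypothetical helper (not part of OpenMPD): block distribution of n
 * indices over `size` processes, following the llimit/ulimit arithmetic
 * of the MPI example. The last rank takes whatever is left over.
 */
block_range block_owned(int rank, int size, int n)
{
    assert(size > 0 && rank >= 0 && rank < size && n >= 0);

    int dx = n / size;          /* rows per process, rounded down */
    block_range r;
    r.lo = rank * dx;
    r.hi = (rank != size - 1) ? r.lo + dx : n;
    return r;
}
```

With n = 10 and four processes, ranks 0 to 2 own two indices each and rank 3 owns the remaining four, so the load is uneven whenever n is not a multiple of the process count.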
Data synchronization of array (Gather) • The gather operation distributes data to every node • #pragma ompd gather(var=list) • Executes communication to get the data assigned to other nodes • The easiest way to synchronize → but communication is expensive! • After the gather, correct data can be read by local access [Figure: array[] on CPU0, CPU1, CPU2, CPU3]
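The effect of a gather on replicated arrays can be illustrated with a serial simulation. This is a sketch under stated assumptions, not the OpenMPD runtime: `NPROC`, `N`, and `simulate_gather` are hypothetical, the "nodes" are rows of a 2-D array in one process, and a real implementation would move the data with MPI collectives.

```c
#include <assert.h>
#include <string.h>

#define NPROC 4   /* assumed number of simulated nodes */
#define N 8       /* assumed array length, divisible by NPROC */

/*
 * Serial simulation of a gather over replicated arrays (hypothetical).
 * replica[p] is node p's full-size copy of the array. Before the call
 * only the block owned by p is up to date in replica[p]; afterwards
 * every replica holds every owner's block.
 */
void simulate_gather(int replica[NPROC][N])
{
    assert(N % NPROC == 0);

    int chunk = N / NPROC;
    for (int owner = 0; owner < NPROC; owner++) {
        int lo = owner * chunk;
        for (int dst = 0; dst < NPROC; dst++) {
            if (dst != owner) {
                /* copy the owner's block into the other node's replica */
                memcpy(&replica[dst][lo], &replica[owner][lo],
                       (size_t)chunk * sizeof(int));
            }
        }
    }
}
```

In this simulation every node receives N - N/NPROC elements per gather, which grows with the array size; that is the cost the slide warns about.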
Data synchronization of array (Sleeve) • Exchange data only on the “sleeve” region • If only neighboring data is needed, communication can be limited to the sleeve area • Example: b[i] = array[i-1] + array[i+1] • #pragma ompd distvar(var = array; dim = 1; sleeve = 1) • Unlike the gather operation, communication on the sleeve is cheap • The programmer specifies the sleeve region explicitly, with its size • Directive: #pragma ompd sync_sleeve(var=array) [Figure: array[] on CPU0, CPU1, CPU2, CPU3 with sleeve regions at the block boundaries]
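What a sleeve synchronization has to move can also be shown with a serial simulation. Again this is only a sketch: `NPROC`, `CHUNK`, `SLEEVE`, the buffer layout, and `simulate_sync_sleeve` are assumptions made for illustration and are not taken from OpenMPD, which would exchange the boundary cells between neighboring nodes with message passing.

```c
#include <assert.h>

#define NPROC 4    /* assumed number of simulated nodes */
#define CHUNK 3    /* assumed number of elements owned by each node */
#define SLEEVE 1   /* sleeve width, as in the slide's example */

/*
 * Serial simulation of a sleeve exchange (hypothetical).
 * Layout of local[p]: [left sleeve | CHUNK owned cells | right sleeve].
 * Each node copies the boundary cells owned by its neighbors into its
 * own sleeves. Only owned cells are read and only sleeve cells are
 * written, so the order of the copies does not matter.
 */
void simulate_sync_sleeve(int local[NPROC][CHUNK + 2 * SLEEVE])
{
    assert(SLEEVE <= CHUNK);

    for (int p = 0; p < NPROC; p++) {
        for (int s = 0; s < SLEEVE; s++) {
            if (p > 0) {
                /* left sleeve <- last owned cells of node p-1 */
                local[p][s] = local[p - 1][CHUNK + s];
            }
            if (p < NPROC - 1) {
                /* right sleeve <- first owned cells of node p+1 */
                local[p][SLEEVE + CHUNK + s] = local[p + 1][SLEEVE + s];
            }
        }
    }
}
```

After the exchange, a stencil such as b[i] = array[i-1] + array[i+1] can be evaluated on every owned cell from local data only, and each node has received at most 2 * SLEEVE elements regardless of the array size.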
Parallel Execution of “for” loop • Execute a for loop that computes on the array • Data region to be computed by the for loop: for(i=2; i <=10; i++) • Execute the for loop in parallel with affinity to the array distribution: #pragma ompd for affinity(array) [Figure: array[] distributed over CPU0, CPU1, CPU2, CPU3]
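Executing a loop with affinity to the array distribution amounts to intersecting the loop's index range with the block each process owns. The following is a minimal sketch of that idea, assuming the same block distribution as the MPI example; `iter_range` and `local_iterations` are hypothetical names and this is not the code the OpenMPD compiler generates.

```c
#include <assert.h>

/* Half-open iteration range [lo, hi); lo == hi means no iterations. */
typedef struct {
    int lo;
    int hi;
} iter_range;

/*
 * Hypothetical helper: the iterations of
 *     for (i = first; i <= last; i++)
 * that `rank` executes when an array of n elements is block
 * distributed over `size` processes (last rank takes the remainder).
 */
iter_range local_iterations(int first, int last, int rank, int size, int n)
{
    assert(size > 0 && rank >= 0 && rank < size && n >= 0);

    int dx = n / size;
    int own_lo = rank * dx;
    int own_hi = (rank != size - 1) ? own_lo + dx : n;

    iter_range r;
    r.lo = (first > own_lo) ? first : own_lo;
    r.hi = (last + 1 < own_hi) ? last + 1 : own_hi;
    if (r.hi < r.lo) {
        r.hi = r.lo;            /* this rank owns none of the loop range */
    }
    return r;
}
```

For the slide's loop for(i=2; i <=10; i++) and an assumed array of 12 elements on four processes, rank 0 executes only i = 2, ranks 1 and 2 execute three iterations each, and rank 3 executes i = 9 and i = 10.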
Experimental Results • Constant speed-up with moderate scalability • Performance is degraded by the lack of multi-dimensional array distribution
Related Work • OpenMP • Only for shared memory • Unified Parallel C • PGAS (Partitioned Global Address Space) language • Co-Array Fortran • Also PGAS • The two above provide programming models for distributed memory as alternatives to MPI • OpenWP?
Future Work and Plan for OpenMPD • Multi-dimensional array distribution and nested parallel loop execution • Integration of PGAS features for more flexible communication patterns and data distribution • Current OpenMPD supports only typical cases. • Remote memory access (one-sided communication) • Only the assigned part of the data should be allocated on each node • Address translation is required. • Support for hybrid programming with OpenMP within a node on clusters of SMP/multicore nodes, even with MPI!
Agenda • OpenMPD : directive-based programming model for distributed memory • What is required for "standard" distributed parallel programming model?
Message Passing Model (MPI) • Message passing was the dominant programming model in the past. • …. Yes. • Message passing is the dominant programming model today. • … Unfortunately, yes… • Will OpenMP be a programming model for future systems? • … I hope so, but it is not perfect. • OpenMP is only for the shared memory model. • (I think) some features for performance tuning are missing • data mapping, scalability, IO…
For application programmers • Are programmers satisfied with MPI? • yes…? Many programmers write MPI programs. • Is MPI enough for parallelizing scientific programs? • Application programmers' concern is to get their answers faster!! • An automatic parallelizing compiler would be best, but … many problems remain.
“Life is too short for MPI” (from WOMPAT2001 T-shirt message)
• Simple N-body problem
• MPI: data partitioning, scheduling, and communication (broadcast, reduction) must all be written by hand. It takes several hours with MPI.
• OpenMP: just put #pragma omp parallel for on the loops!!! It takes just a few tens of minutes!!!

/* temporaries are assumed to be declared outside the loop, so each thread needs a private copy */
#pragma omp parallel for private(j, p, q, ax, ay, az, dx, dy, dz, X, f)
for(i = 0; i < n_particles; i++) {
    p = &particles[i];
    ax = 0.0; ay = 0.0; az = 0.0;
    for(j = 0; j < n_particles; j++){
        if(i == j) continue;
        q = &particles[j];
        dx = p->x - q->x; dy = p->y - q->y; dz = p->z - q->z;
        X = dx * dx + dy * dy + dz * dz;
        if (X < b2) {
            f = q->m * (X - a2) * (X - b2);
            ax += f * dx; ay += f * dy; az += f * dz;
        }
    }
    p->ax = ax; p->ay = ay; p->az = az;
}

#pragma omp parallel for private(p)
for(i = 0; i < n_particles; i++){
    p = &particles[i];
    p->x += p->vx * DT; p->y += p->vy * DT; p->z += p->vz * DT;
    p->vx += p->ax * DT; p->vy += p->ay * DT; p->vz += p->az * DT;
}
Parallel programming languages pC++ SISAL NESL Cilk pHaskell Prolog Orca mpC C* dataparallel C Jade HPC++ mpc++ HPF Linda Mentat Fortran M Occam APL SAL Split-C Fortran D V Charm++ CODE ZPL Fortran X3H5 ….. • A programming language's design reflects its model. • So far, many parallel programming languages have been proposed in the computer science community. • Are they actually used by application users? • Where have they gone? • What is missing in them?
Think about MPI, … • Why was MPI accepted and so successful? • Portability: most parallel computing platforms can run MPI programs (even SMPs). • Much free and portable software is available, such as MPICH. • Education: the MPI standard allows many programmers to learn MPI parallel programming. • In universities • From books
Discussion • The demand for parallel programming is increasing!! • Low-cost PC clusters • SMP in a PC box. • On-chip multiprocessors, … multiprocessors even in PDAs now! • Of course, … a clear and excellent modeling concept, good performance, … many factors are important! • Standardization and education are important for widespread use. • Standardization enables good education. • It must be available on many platforms.
Discussion • The cost of parallelization is also important for acceptance by application programmers. • It must be easy to move from an original sequential program. • What application programmers need to learn must be small. • We plan to organize a group for a “standard” parallel programming language for petaflops systems • Will be supported by RIKEN • Trying to find funding for development • Should be international. • For the standard, the “agreement” process is more important than “advanced” ideas. • Standardization and Education