Portable Multi-Level Parallel Programming
March 4th 2008, Simula, Oslo
Gerhard Zumbusch
Institut für Angewandte Mathematik
Friedrich-Schiller-Universität Jena
Parallel Programming?
• Applications will increasingly need to be concurrent if they want to fully exploit continuing exponential CPU throughput gains.
• Therefore single-threaded programs are not likely to get any faster, except for benefits from further cache size growth (…).
• Finally, programming languages and systems will increasingly be forced to deal well with concurrency.
Herb Sutter: "The free lunch is over" (Dr. Dobb's Journal 30(3), 2005)
or: massively parallel in capability computing
Parallel programming for free: instruction parallelism
• memory layout: data close to registers
• programming model:
  • use an optimising compiler
  • loop unrolling, instruction re-ordering if needed
Sequential code (a manual unrolling sketch follows below)
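To illustrate the kind of transformation the optimising compiler applies automatically, here is a minimal sketch of manual loop unrolling; the loop and array names are illustrative only and not taken from the talk.

// 4-way manual unrolling of a simple update loop: the four statements per
// iteration are independent, so the processor can overlap their instructions.
void axpy(int n, double a, const double *x, double *y) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {        // four independent updates per iteration
        y[i]   += a * x[i];
        y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];
        y[i+3] += a * x[i+3];
    }
    for (; i < n; i++)                  // remainder loop
        y[i] += a * x[i];
}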
Data-Parallel Programming
Programming models:
• vector instructions
• thread-parallel
• MPI message passing
• Cell with DMA block transfers
• mixed and hybrid models
• OpenMP and HPF parallel loops (see the OpenMP sketch below)
Running example:
for (int i=1; i<n; i++) {
  x(i) += ( y(i+1) + y(i-1) )*.5;
  e += sqr( y(i) );
}
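A minimal sketch of the same loop written as an OpenMP parallel for-loop, one of the models listed above; it assumes x and y are plain C arrays rather than the DistArray class used later, and sqr is a simple squaring helper.

static inline double sqr(double v) { return v * v; }

// OpenMP version of the running example: the iteration space is split across
// threads and the partial sums for e are combined with a reduction clause.
void relax(int n, double *x, const double *y, double &e) {
    double e_loc = 0.0;
    #pragma omp parallel for reduction(+:e_loc)
    for (int i = 1; i < n; i++) {
        x[i] += (y[i+1] + y[i-1]) * .5;
        e_loc += sqr(y[i]);
    }
    e += e_loc;
}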
Data-Parallel Programming: Vector Processor
• examples: SSE, AltiVec extensions
• data layout: contiguous data blocks
• programming models:
  • optimising compiler
  • special instructions (intrinsics)
SSE:
…
float half = .5;
_mm_store_ps(&x[i],
  _mm_mul_ps(_mm_load1_ps(&half),
    _mm_add_ps(_mm_loadu_ps(&y[i+1]),
               _mm_loadu_ps(&y[i-1]))));
…
AltiVec:
…
float *y0 = &y[i+1], *y1 = &y[i-1];
vec_st(vec_madd(vec_splats(.5f),
         vec_add(vec_perm(vec_ld(0,y0), vec_ld(16,y0), vec_lvsl(0,y0)),
                 vec_perm(vec_ld(0,y1), vec_ld(16,y1), vec_lvsl(0,y1))),
         vec_splats(0.f)),
       0, &x[i]);
…
Data-Parallel Programming: Symmetric Multi-Processing
• data layout: read shared data, write private data
• programming models:
  • threads (Pthreads, Java threads, Win threads, …)
  • lazy evaluation (Concur, Cilk)
  • for-loops (OpenMP, Fortran arrays)
thread:
void *sub1(void *arg) {
  ...
  double e_local = 0;
  for (int i=n_local0; i<n_local1; i++) {
    x(i) += ( y(i+1) + y(i-1) )*.5;
    e_local += sqr( y(i) );
  }
  vec->e = e_local;
}
main:
for (int p=0; p<p_threads; p++)
  pthread_create(&threads[p], threadAttr, sub1, (void *)vec[p]);
for (int p=0; p<p_threads; p++) {
  pthread_join(threads[p], NULL);
  e += vec[p]->e;
}
Data-Parallel Programming: Distributed Memory
• data layout: local memory only, manage data transfer explicitly
• programming models:
  • message passing (MPI-1), (Fortran arrays), SGI shmem
  • MPI-2, BSP, UPC, X10, …
process p:
if (p_left) MPI_Send(&y(n_local0), 1, MPI_DOUBLE, p_left,...);
if (p_right) {
  MPI_Recv(&y(n_local1+1), 1, MPI_DOUBLE, p_right,...);
  MPI_Send(&y(n_local1), 1, MPI_DOUBLE, p_right,...);
}
if (p_left) MPI_Recv(&y(n_local0-1), 1, MPI_DOUBLE, p_left,...);
double e_local = 0;
for (int i=n_local0; i<n_local1; i++) {
  x(i) += ( y(i+1) + y(i-1) )*.5;
  e_local += sqr( y(i) );
}
MPI_Allreduce(&e_local, &e, 1, MPI_DOUBLE, MPI_SUM,...);
Multi-Core Processor
8+1 processor cores: IBM/Sony Cell BE
Data-Parallel Programming: Cell Broadband Architecture
• data layout:
  • global memory
  • parts copied into local SPU memory (256 KB), user-controlled DMA block transfers
• programming model:
  • special library calls
  • Fortran arrays (?)
SPU:
int main(unsigned long long id, addr64 argp, addr64 envp) {
  mfc_get(...); mfc_read_tag_status_all();
  ...
  double e_local = 0;
  for (int i=n_local0; i<n_local1; i++) {
    x(i) += ( y(i+1) + y(i-1) )*.5;
    e_local += sqr( y(i) );
  }
  mfc_put(...); mfc_read_tag_status_all();
}
CPU thread:
void *sub1(void *arg) { spe_context_run(...); }
CPU main:
for (int p=0; p<spe; p++) {
  spe_context_create(...);
  spe_program_load(...);
  pthread_create(...);
}
for (int p=0; p<spe; p++) {
  pthread_join(...);
  spe_context_destroy(...);
  e += vec[p]->e;
}
(see the double-buffering sketch below)
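Since the slide elides the DMA arguments, here is a separate, hedged sketch of the user-controlled DMA block transfers with double buffering (the variant measured later on the Playstation 3 slide). The tile size, tags, the effective address ea_y and the trivial compute kernel are assumptions for illustration; only the calls mfc_get, mfc_write_tag_mask and mfc_read_tag_status_all from spu_mfcio.h are standard Cell SDK intrinsics.

#include <stdint.h>
#include <spu_mfcio.h>

#define TILE 1024                                  /* doubles per tile, fits the 256 KB local store */
static volatile double buf[2][TILE] __attribute__((aligned(128)));
static double e_local = 0.0;

static void process(volatile double *y, int m) {   /* placeholder compute kernel */
    for (int i = 0; i < m; i++) e_local += y[i] * y[i];
}

static void stream(uint64_t ea_y, int ntiles) {
    int cur = 0;
    mfc_get(buf[cur], ea_y, TILE * sizeof(double), cur, 0, 0);
    for (int k = 0; k < ntiles; k++) {
        int next = cur ^ 1;
        if (k + 1 < ntiles)                        /* prefetch the next tile while working on this one */
            mfc_get(buf[next], ea_y + (uint64_t)(k + 1) * TILE * sizeof(double),
                    TILE * sizeof(double), next, 0, 0);
        mfc_write_tag_mask(1 << cur);              /* wait only for the current tile's DMA tag */
        mfc_read_tag_status_all();
        process(buf[cur], TILE);
        cur = next;
    }
}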
Automatic code generation
application library:
Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray<double> x(g), y(g);
double e = 0;
application-specific language extensions:
ForEach(int i, it,
  x(i) += ( y(i+1) + y(i-1) )*.5;
  e += sqr( y(i) );
)
data dependence analysis:
• modified g++ 4.2
• analysis based on the Tree SSA representation
• result: load y(i-1), y(i), y(i+1); store x(i); reduce add e
code generation targets: sequential code, vectorised code, thread-parallel code, MPI-parallel code, MPI+thread-parallel code, parallel+vector Cell processor code
(a library-only fallback sketch follows below)
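For comparison, a hypothetical sketch of what a library-only fallback for this iterator interface could look like when no source-to-source transformation is applied: the loop simply runs sequentially, with no dependence analysis and no parallelism. The class, the member names lo/hi and the C++11 lambda are assumptions for brevity; the real Grid1IteratorSub/ForEach are handled by the modified g++ pass described above.

struct Grid1IteratorSubSketch {
    int lo, hi;                                   // iterate over i = lo, ..., hi-1
    Grid1IteratorSubSketch(int l, int h) : lo(l), hi(h) {}
};

template <class Body>
void for_each_sketch(const Grid1IteratorSubSketch &it, Body body) {
    for (int i = it.lo; i < it.hi; ++i)           // sequential fallback only
        body(i);
}

// usage, mirroring the slide's example (x, y as plain arrays, e as a scalar):
void example(int n, double *x, const double *y, double &e) {
    Grid1IteratorSubSketch it(1, n);
    for_each_sketch(it, [&](int i) {
        x[i] += (y[i+1] + y[i-1]) * .5;
        e += y[i] * y[i];
    });
}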
Distributed-Grid summary
int n = 64;
Grid1 *g = new Grid1(0, n+1);
Grid1IteratorSub it(1, n, g);
DistArray1<double> x(g), y(g);
double e = 0.;
ForEach(int i, it,
  ‘ x(i) += ( y(i+1) + y(i-1) )*.5; e += sqr( x(i) ); ’
)
• auto-detected dependence: y(i+1), y(i-1)
• array references: local node read/write, neighbour nodes read only
• shared memory: schedule sub-grids
• distributed memory: owner computes local sub-grid, exchange ghost data (message passing)
Relaxation Scheme
• solve the linear equation system iteratively
• n data items
• O(n) sequential arithmetic operations per iteration
• parallel version: (block) Jacobi iteration (see the sketch below)
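A hedged sketch of one Jacobi sweep for the 1D model problem -u'' = f (with the mesh width absorbed into f); the second array xnew and the right-hand side f are assumptions added for illustration, not part of the talk's code.

// One (block) Jacobi sweep: every new value depends only on old values,
// so the whole sweep parallelises trivially.
void jacobi_sweep(int n, const double *x, double *xnew, const double *f) {
    #pragma omp parallel for
    for (int i = 1; i < n; i++)
        xnew[i] = (x[i-1] + x[i+1] + f[i]) * .5;
}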
Multigrid Relaxation
• solve the linear equation system iteratively
• O(n) arithmetic operations per iteration
• constant number of iterations (see the V-cycle sketch below)
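A schematic V-cycle sketch to show where the grid-transfer loops on the next slide fit in; the helper names and the level convention are assumptions for illustration, not the talk's code (the helpers are empty stubs here, a real solver supplies the grid data and loops).

static void smooth(int level)                { /* a few Jacobi/Gauss-Seidel sweeps */ (void)level; }
static void restrict_residual(int level)     { /* fine grid -> coarser grid        */ (void)level; }
static void prolongate_correction(int level) { /* coarser grid -> fine grid        */ (void)level; }
static void solve_coarsest(int level)        { /* direct solve on the tiny grid    */ (void)level; }

// Multigrid V-cycle: relax, move the residual to the coarser grid, solve there
// recursively, interpolate the correction back, relax again.
static void vcycle(int level, int coarsest) {
    if (level == coarsest) { solve_coarsest(level); return; }
    smooth(level);                      // pre-smoothing
    restrict_residual(level);           // transfer the residual down
    vcycle(level + 1, coarsest);        // recurse on the coarser problem
    prolongate_correction(level);       // interpolate the correction back up
    smooth(level);                      // post-smoothing
}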
int n = 64;
Grid1 *g = new Grid1(0, n+1);
Grid1 *gf = new Grid1(0, 2*n+1, g, &f);
DistArray1<double> x(g);
DistArray1<double> z(gf);
Grid1IteratorSub it(1, n, g);
ForEach(int i, it,
  ‘ x(i) = z(2*i)*.5 + ( z(2*i-1) + z(2*i+1) )*.25; ’
)
ForEach(int i, it,
  ‘ z(2*i) = x(i); z(2*i+1) = ( x(i) + x(i+1) )*.5; ’
)
• data dependence: z(2*i-1), x(i+1)
• static communication pattern, mapping fine -> coarse grid
• grid: memory alignment
Multigrid: MPI + Pthreads
3D multigrid, finite differences, structured nested grids, V(0,1) cycle (~Fapin, NAS benchmarks)
4 x AMD dual-core Opteron 1.8 GHz, g++ 64 bit, Linux, Pthreads and/or MPICH
Multigrid: MPI + Pthreads
1 or 2 processes per cluster node (MPI or Pthreads)
3D multigrid, finite differences, uniform grid, 513³ grid points (NAS Fapin)
Intel dual-core cluster, 1 Gbit/s Ethernet, g++ 64 bit, Linux, MPICH (and Pthreads)
Multigrid solver on the Cell processor
single/double buffer, scalar/vector code
3D multigrid, finite differences, uniform grid, 129³ grid points (NAS Fapin)
Sony Playstation 3, Linux, xlC 8.2
Tree code: top down
TopDownIterator<tree> down(root);
ForEach(tree *b, down,
  ‘ for (int i=0; i<4; i++)
      if (b->child[i]) b->child[i]->l += b->l; ’
)
• shared memory: coarse tree sequential, one thread per sub-tree
• distributed memory: replicated coarse tree, distributed fine sub-trees
Tree code: bottom up
BottomUpIterator<tree> up(root);
ForEach(tree *b, up,
  ‘ for (int i=0; i<4; i++)
      if (b->child[i]) b->m += b->child[i]->m; ’
)
data dependence analysis:
• load child[]
• load child[]->m
• load, store this->m
• data exchange: variable m
Tree code: local neighbours
Require( list<tree*> inter, fetch );
double x0, x1;
int fetch(tree *b) { return (x0==b->x1) || (x1==b->x0); }
TopDownIterator<tree> down(root);
ForEach(tree *b, down,
  ‘ for (list<tree*>::const_iterator i = b->inter.begin(); i != b->inter.end(); i++)
      b->l += log(abs(b->x - (*i)->x)) * (*i)->m; ’
)
• data dependence analysis: ->x, ->m
• distributed memory: additional data exchange, sub-trees within geometrical neighbourhood
• fetch: superset of all possible neighbours
Tree code: MPI + Pthreads
2D adaptive fast multipole method, 20 complex coefficient Laurent series, 2·10⁶ particles (~Splash-2)
4 x AMD dual-core Opteron 1.8 GHz, g++ 64 bit, Linux, Pthreads and/or MPICH
Tree code: MPI + Pthreads
2 processes per cluster node (MPI or Pthreads)
2D adaptive fast multipole method, 20 complex coefficient Laurent series (~Splash-2)
Intel dual-core cluster, 1 Gbit/s Ethernet, g++ 64 bit, Linux, MPICH (and Pthreads)
Conclusion
• Tree & grid iterators for numerical codes:
  • code annotation & library → data dependence analysis at compile time → automatic parallelization
• For shared memory, distributed memory, multi-core and mixed parallel target architectures
• Domain-specific parallel programming styles vs. parallel libraries vs. parallel languages?