
Presentation Transcript


  1. Programming with StarSs Rosa M. Badia Computer Sciences Research Dept. BSC

  2. Agenda: StarSs overview; SMPSs; SMPSs examples; Single node hands-on; Hybrid model MPI/SMPSs; Programming examples; MPI/SMPSs hands-on. Slides available at: http://marsa.ac.upc.edu/prace/course_material_starss and /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial_PRACE_starss.ppt

  3. StarSs overview

  4. StarSs: a “node”-level programming model (C/Fortran + directives) that integrates nicely into hybrid MPI/StarSs and gives natural support for heterogeneity. Programmability: incremental parallelization/restructuring; abstracting/separating algorithmic issues from resources; disciplined programming. Portability: the “same” source code runs on “any” machine; optimized task implementations result in better performance; a “single source” for the maintained version of an application. Performance: asynchronous (data-flow) execution and locality awareness; an intelligent runtime, specific for each type of target platform, automatically extracts and exploits parallelism and matches computations to resources. The StarSs family includes GridSs, CellSs, CompSs (Java), SMPSs, ClusterSs, GPUSs and ClearSpeedSs.

  5. History / Strategy: PERMAS ~1994, NANOS ~1996, GridSs ~2002, CellSs ~2006, CompSs ~2007, SMPSs V1 ~2007, NANOS++ ~2008, GPUSs ~2009, SMPSs V2 ~2009.

  6. StarSs: a sequential program …

void vadd3 (float A[BS], float B[BS], float C[BS]);
void scale_add (float sum, float A[BS], float B[BS]);
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)   // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)   // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)   // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)   // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)   // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

  7. StarSs: … taskified … Dependences are computed at task instantiation time. (Figure: task dependence graph; color/number give the order of task instantiation; some antidependences covered by flow dependences are not drawn.)

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)   // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)   // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)   // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)   // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)   // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

  8. StarSs: … and executed in a data-flow model. Decouple how we write from how it is executed. (Figure: the same task graph, split into a "Write" and an "Execute" view; color/number give a possible order of task execution.)

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)   // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)   // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)   // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)   // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)   // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

  9. StarSs: the potential of data access information • Flat global address space seen by the programmer • Flexibility to dynamically traverse the dataflow graph, “optimizing”: concurrency (critical path) and memory accesses (data transfers performed by the runtime) • Opportunities for: prefetch, reuse, eliminating antidependences (renaming), replication management • Coherency/consistency handled by the runtime
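To make the renaming point concrete, a minimal sketch (not on the slide) reusing the vadd3 task declared earlier: the first loop only reads B and the second loop only writes it, so the runtime can give the second loop a renamed copy of each B block and run both loops concurrently instead of serializing on the write-after-read dependence.

for (i=0; i<N; i+=BS)              // reads B[i]   (C = A + B)
   vadd3 (&A[i], &B[i], &C[i]);
for (i=0; i<N; i+=BS)              // rewrites B[i] (B = D + E): antidependence only
   vadd3 (&D[i], &E[i], &B[i]);
// With renaming, the second loop writes fresh copies of the B blocks and does
// not wait for the readers; the runtime tracks which version is live.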

  10. StarSs: … reductions

#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS], float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS], float B[BS]);
#pragma css task input(A) inout(sum) reduction(sum)
void accum (float A[BS], float *sum);

for (i=0; i<N; i+=BS)   // C=A+B
   vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS)   // sum(C[i])
   accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS)   // B=sum*A
   scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS)   // A=C+D
   vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS)   // E=G+F
   vadd3 (&G[i], &F[i], &E[i]);

(Figure: the resulting task graph; color/number give a possible order of task execution with the reduction.)

  11. StarSs: just a few directives

#pragma css task [input(parameters)] \
                 [output(parameters)] \
                 [inout(parameters)] \
                 [target device([cell, smp, cuda])] \
                 [implements(task_name)] \
                 [reduction(parameters)] \
                 [highpriority]
#pragma css wait on (data_address)
#pragma css barrier
#pragma css mutex lock (variable)
#pragma css mutex unlock (variable)

parameters: parameter [, parameter]*
parameter: variable_name{[dimension]}*
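The target device and implements clauses do not appear in the later examples (apart from target device(cuda) on the GPUSs slide), so here is a hedged sketch that follows the grammar above; the kernel names and the SMP/CUDA pairing are illustrative assumptions, not from the slides:

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS]) target device(smp)
void block_mult (float *C, float *A, float *B);

// Alternative implementation of the same task for a GPU; the runtime may pick
// either version depending on the available resources (illustrative only).
#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS]) \
                 target device(cuda) implements(block_mult)
void block_mult_cuda (float *C, float *A, float *B);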

  12. StarSs Syntax. Examples: task selection

#pragma css task input(A, B) inout(C)
void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B ) { ..

#pragma css task input(A[size][size], B[size][size], size) inout(C[size][size])
void block_addmultiply( float *C, float *A, float *B, int size ) { ..

Note: task annotations can be placed before the function code or before the function declaration, in a header file.

Examples: waiting for data

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ...
...
are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
#pragma css wait on (sq_error)
if (sq_error > 0.0000001) { ...

  13. StarSs Syntax. Examples: reductions

#pragma css task input (vec, n) inout(results) reduction (results)
void sum_task(int *vec, int n, int *results)
{
   int i;
   int local_sol = 0;
   for (i = 0; i < n; i++)
      local_sol += vec[i];
#pragma css mutex lock (results)
   *results = *results + local_sol;
#pragma css mutex unlock (results)
}

The parameter used in the mutex can be anything that identifies the access uniquely:

#pragma css task input (vec, n, id) inout(results) reduction (results)
void sum_task(int *vec, int n, int *results, int id)
{
   int i;
   int local_sol = 0;
   for (i = 0; i < n; i++)
      local_sol += vec[i];
#pragma css mutex lock (id)
   *results = *results + local_sol;
#pragma css mutex unlock (id)
}
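The slide shows only the task; a possible calling loop (an assumption added for illustration, not on the slide) launches one sum_task per block and reads the reduced value after synchronizing on it:

int i, result = 0;
for (i = 0; i < N; i += BS)
   sum_task (&vec[i], BS, &result);   // one task per block, all reduce into result
#pragma css wait on (result)
printf ("sum = %d\n", result);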

  14. StarSs Syntax. MPI tasks

#pragma css task input(partner, bufsend) output(bufrecv) target(comm_thread)
void communication(int partner, double bufsend[BLOCK_SIZE], double bufrecv[BLOCK_SIZE])
{
   int ierr;
   MPI_Request request[2];
   MPI_Status status[2];
   ierr = MPI_Isend(bufsend, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[0]);
   ierr = MPI_Irecv(bufrecv, BLOCK_SIZE, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &request[1]);
   MPI_Waitall(2, request, status);
}
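A brief usage sketch (not on the slide; the surrounding loop and the compute tasks are assumptions): because the task is marked target(comm_thread), its instances run on a dedicated communication thread, so the message exchange can overlap with the ordinary compute tasks generated in the same iteration.

for (iter = 0; iter < NITER; iter++) {
   pack_halo (bufsend);                        /* ordinary SMPSs task (assumed) */
   communication (partner, bufsend, bufrecv);  /* executed by the communication thread */
   update_halo (bufrecv);                      /* ordinary SMPSs task (assumed) */
}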

  15. StarSs Syntax in Fortran

subroutine example()
   ...
   interface
      !$CSS TASK
      subroutine block_add_multiply(C, A, B, BS)
         implicit none
         integer, intent (in) :: BS
         real, intent (in) :: A(BS,BS), B(BS,BS)
         real, intent (inout) :: C(BS,BS)
      end subroutine
   end interface
   ...
   !$CSS START
   ...
   call block_add_multiply(C, A, B, BLOCK_SIZE)
   ...
   !$CSS FINISH
   ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
   ...
end subroutine

  16. StarSs Syntax in Fortran: with MPI

subroutine example()
   ...
   interface
      !$css task TARGET(COMM_THREAD)
      subroutine checksum(i, u1, d1, d2, d3)
         implicit none
         integer, intent(in) :: i, d1, d2, d3
         double complex, intent(in) :: u1(d1*d2*d3)
      end subroutine
   end interface
   ...

  17. SMPSs

  18. SMPSs implementation. (Figure: the runtime drawn by analogy with an out-of-order processor pipeline: the main thread acts as the front end that fetches, decodes, renames and issues task instances, and the slave threads act as the functional units that execute and retire them.) J.M. Perez et al., “A Dependency-Aware Task-Based Programming Environment for Multi-Core Architectures”, Cluster 2008.

  19. SMPSs: compiler phase. (Figure: app.c is processed by the SMPSs-CC code translation tool (mcc), which produces app.tasks (the task list) and smpss-cc_app.c; the latter is compiled with a native C compiler (gcc, icc, ...) into smpss-cc_app.o, and the pieces are packed into app.o.)

  20. SMPSs: linker phase. (Figure: the app.o objects are unpacked into smpss-cc_app.o, app-adapters.c and the app.tasks list; the SMPSs-CC glue code generator uses app.tasks to produce exec-adapters.c and exec-registration.c; these are compiled with the native C compiler (gcc, icc, ...), and the linker combines smpss-cc_app.o, exec-adapters.o, exec-registration.o and libSMPSS.so into the executable exec.)

  21. SMPSs Programming examples

  22. Programming examples

  23. Simple examples. (Slide callouts: note where the information on size and address comes from, and that the barriers are there for timing purposes.)

#pragma css task input(A, B) inout(C)
void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void dgemm( float *C, float *A, float *B ) { ...}

main() {
   ...
   #pragma css barrier
   t = time();
   dgemm(A,B,C);
   dgemm(D,E,F);
   dgemm(C,F,G);
   dgemm(C,E,H);
   #pragma css barrier
   t = time() - t;
   printf ("result G = %f; time = %d\n", G[0][0], t);
}

(Figure: task dependence graph of the four dgemm calls.)

  24. Simple examples

#pragma css task input(A, B) inout(C)
void dgemm( float C[N][N], float A[N][N], float B[N][N] ) { ...}

main() {
   ...
   dgemm(A,B,C);
   dgemm(D,E,F);
   dgemm(C,F,G);
   dgemm(A,D,H);
   dgemm(C,H,I);
   #pragma css wait on (F)
   printf ("result F = %f\n", F[0][0]);
   dgemm(H,G,C);
   #pragma css barrier
   printf ("result C = %f\n", C[0][0]);
}

(Figure: task dependence graph of the six dgemm calls.)

  25. MxM on a matrix stored by blocks. (Figure: an NB x NB grid of BS x BS blocks.)

int main (int argc, char **argv)
{
   int i, j, k;
   ...
   initialize(A, B, C);
   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            mm_tile( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void mm_tile ( float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
   int i, j, k;
   for (i=0; i < BS; i++)
      for (j=0; j < BS; j++)
         for (k=0; k < BS; k++)
            C[i][j] += A[i][k] * B[k][j];
}

Will work on matrices of any size. Will work on any number of cores/devices.
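The slide does not show how A, B and C are declared; one possible layout (an assumption for illustration, not from the slide) is an NB x NB array of pointers, each pointing to one contiguous BS x BS tile, which matches the C[i][j] indexing used in main. NB and BS are taken to be compile-time constants, as in the slides.

#include <stdlib.h>

/* Each entry points to one contiguous BS x BS tile (hypothetical layout). */
static float (*A[NB][NB])[BS], (*B[NB][NB])[BS], (*C[NB][NB])[BS];

static void allocate_blocked (float (*M[NB][NB])[BS])
{
   for (int i = 0; i < NB; i++)
      for (int j = 0; j < NB; j++)
         M[i][j] = calloc (1, sizeof (float[BS][BS]));   /* zero-initialized tile */
}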

  26. MxM @ CellSs: leverage existing kernels, libraries, … (Figure: the same NB x NB grid of BS x BS blocks.)

int main (int argc, char **argv)
{
   int i, j, k;
   ...
   initialize(A, B, C);
   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            dgemm_64x64( C[i][j], A[i][k], B[k][j]);
}

#ifdef SPU_CODE
#include <blas_s.h>
#endif
#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void dgemm_64x64(float *C, float *A, float *B);

Excerpt of the hand-tuned SPU kernel behind the task:

static void dgemm_64x64(volatile float *blkC, volatile float *blkA, volatile float *blkB)
{
   unsigned int i;
   volatile float *ptrA, *ptrB, *ptrC;
   vector float a0, a1, a2, a3;
   vector float a00, a01, a02, a03;
   vector float a10, a11, a12, a13;
   ...
   for (i=0; i<M; i+=4) {
      ptrA = &blkA[i*M];
      ptrB = &blkB[0];
      ptrC = &blkC[i*M];
      a0 = *((volatile vector float *)(ptrA));
      a1 = *((volatile vector float *)(ptrA+M));
      a2 = *((volatile vector float *)(ptrA+2*M));
      a3 = *((volatile vector float *)(ptrA+3*M));
      a00 = spu_shuffle(a0, a0, pat0);
      a01 = spu_shuffle(a0, a0, pat1);
      a02 = spu_shuffle(a0, a0, pat2);
      a03 = spu_shuffle(a0, a0, pat3);
      a10 = spu_shuffle(a1, a1, pat0);
      a11 = spu_shuffle(a1, a1, pat1);
      ...
      a33 = spu_shuffle(a3, a3, pat3);
      Loads4RegSetA(0); Ops4RegSetA(); Loads4RegSetB(8);
      StageCBA(0,0); StageACB(8,0); StageBAC(16,0);
      StageCBA(24,0); StageACB(32,0); StageBAC(40,0); StageMISC(0,0);
      StageCBAmod(0,4); StageACB(8,4); StageBAC(16,4);
      StageCBA(24,4); StageACB(32,4); StageBAC(40,4); StageMISC(4,4);
      StageCBAmod(0,8); StageACB(8,8);
      ...

#define StageCBA(OFFSET,OFFB) \
{ \
   ALIGN8B; \
   SPU_FMA(c0_0B,a00,b0_0B,c0_0B); c0_0C = *((volatile vector float *)(ptrC+OFFSET+16)); \
   SPU_FMA(c1_0B,a10,b0_0B,c1_0B); c1_0C = *((volatile vector float *)(ptrC+M+OFFSET+16)); \
   SPU_FMA(c2_0B,a20,b0_0B,c2_0B); c2_0C = *((volatile vector float *)(ptrC+2*M+OFFSET+16)); \
   SPU_FMA(c3_0B,a30,b0_0B,c3_0B); SPU_LNOP; \
   SPU_FMA(c0_1B,a00,b0_1B,c0_1B); c3_0C = *((volatile vector float *)(ptrC+3*M+OFFSET+16)); \
   SPU_FMA(c1_1B,a10,b0_1B,c1_1B); c0_1C = *((volatile vector float *)(ptrC+OFFSET+20)); \
   SPU_FMA(c2_1B,a20,b0_1B,c2_1B); c1_1C = *((volatile vector float *)(ptrC+M+OFFSET+20)); \
   SPU_FMA(c3_1B,a30,b0_1B,c3_1B); SPU_LNOP; \
   SPU_FMA(c0_0B,a01,b1_0B,c0_0B); c2_1C = *((volatile vector float *)(ptrC+2*M+OFFSET+20)); \
   ...

  27. MxM @ GPUSs, using a CUBLAS kernel. (Figure: the same NB x NB grid of BS x BS blocks.)

int main (int argc, char **argv)
{
   int i, j, k;
   ...
   initialize(A, B, C);
   for (i=0; i < NB; i++)
      for (j=0; j < NB; j++)
         for (k=0; k < NB; k++)
            mm_tile( C[i][j], A[i][k], B[k][j], BS);
}

#pragma css task input(A[NB][NB], B[NB][NB], NB) inout(C[NB][NB]) \
                 target device(cuda)
void mm_tile (float *C, float *A, float *B, int NB)
{
   unsigned char NT = 'N';
   float DONE = 1.0;
   // C = A * B + C on the device, using the CUBLAS sgemm kernel
   cublasSgemm (NT, NT, NB, NB, NB, DONE, A, NB, B, NB, DONE, C, NB);
}

  28. Cholesky factorization. A common matrix operation, used for example to solve the normal equations in linear least-squares problems. It computes a lower triangular matrix L from a symmetric positive definite matrix A: Cholesky(A) = L, with L · Lt = A. There are different possible implementations, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking), and it can be decomposed into block operations.
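A tiny worked instance, added here only to make the definition concrete (not on the slide):

A = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}, \qquad
L = \begin{pmatrix} 2 & 0 \\ 1 & \sqrt{2} \end{pmatrix}, \qquad
L\,L^{t} = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix} = A .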

  29. Cholesky factorization. In each iteration the red and blue blocks of the figure are updated: SPOTRF computes the Cholesky factorization of the diagonal block; STRSM computes the column panel; SSYRK computes the row panel; SGEMM updates the rest of the matrix.

  30. Cholesky factorization (DIM x DIM blocks of 64 x 64 elements)

main (){
   ...
   for (int j = 0; j < DIM; j++){
      for (int k = 0; k < j; k++){
         for (int i = j+1; i < DIM; i++){
            // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t
            css_sgemm_tile( A[i][k], A[j][k], A[i][j] );
         }
      }
      for (int i = 0; i < j; i++){
         // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t
         css_ssyrk_tile( A[j][i], A[j][j] );
      }
      // Cholesky factorization of A[j,j]
      css_spotrf_tile( A[j][j] );
      for (int i = j+1; i < DIM; i++){
         // A[i,j] <- A[i,j] = X * (A[j,j])^t
         css_strsm_tile( A[j][j], A[i][j] );
      }
   }
   ...
   for (int i = 0; i < DIM; i++) {
      for (int j = 0; j < DIM; j++) {
         #pragma css wait on (A[i][j])
         print_block(A[i][j]);
      }
   }
   ...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])
void spotrf_tile(float *A)

  31. Cholesky factorization

  32. N Queens. Compute the number of solutions to the problem of placing N queens on an N x N board so that no queen attacks any other.

  33. N Queens: sequential version

void nqueens(int n, int j, char *a, int *results)
{
   int i;
   if (n == j) {
      /* good solution found: keep the first one and count it */
      if (!find_solution){
         find_solution = 1;
         memcpy(sol, a, n * sizeof(char));
      }
      (*results)++;
   } else {
      /* try each possible position for queen <j> */
      for (i = 0; i < n; i++) {
         a[j] = i;
         if (ok(j + 1, a)) {
            nqueens(n, j + 1, a, results);
         }
      }
   }
}
...
nqueens (n, 0, a, &total_res);
...

int ok(int n, char *a)
{
   int i, j;
   char p, q;
   for (i = 0; i < n; i++) {
      p = a[i];
      for (j = i + 1; j < n; j++) {
         q = a[j];
         if (q == p || q == p - (j - i) || q == p + (j - i))
            return 0;
      }
   }
   return 1;
}

  34. Queens strategy in SMPSs. (Figure: the search tree is traversed sequentially down to a CUT level; each subtree below that level becomes a Task, which counts the Solutions it finds.)

  35. N Queens • Manual memory allocation

void nqueens(int n, int j, char *a, int depth)
{
   int i;
   char *b;
   for (i = 0; i < n; i++) {
      a[j] = i;
      if (ok(j + 1, a)) {
         if (depth < task_depth) {
            nqueens(n, j + 1, a, depth + 1);
         } else {
            b = malloc (BOARD_SIZE * sizeof(char));
            memcpy (b, a, (j + 1) * sizeof(char));
            nqueens_ser_task(n, j + 1, b, &total_res);
         }
      }
   }
}

#pragma css task input (n, j, a[n]) inout (results)\
            reduction (results)
void nqueens_ser_task(int n, int j, char *a, int *results)
{
   int i;
   int local_sols = 0;
   if (n == j) {
#pragma css mutex lock (&find_solution)
      if (!find_solution){
         find_solution = 1;
         memcpy(sol, a, n * sizeof(char));
      }
#pragma css mutex unlock (&find_solution)
      local_sols++;
   } else {
      for (i = 0; i < n; i++) {
         a[j] = i;
         if (ok(j + 1, a)) {
            nqueens_ser_task(n, j + 1, a, &local_sols);
         }
      }
   }
#pragma css mutex lock (results)
   *results = *results + local_sols;
#pragma css mutex unlock (results)
}

  36. N Queens • Memory allocation: automatic; the runtime renames the array b

void nqueens(int n, int j, char *a, char *b, int depth)
{
   int i;
   for (i = 0; i < n; i++) {
      a[j] = i;
      if (ok(j + 1, a)) {
         add_queen_task(b, j, i, n);
         if (depth < task_depth) {
            nqueens(n, j + 1, a, b, depth + 1);
         } else {
            nqueens_ser_task(n, j + 1, b, &total_res);
         }
      }
   }
}

#pragma css task input (j, i, n)\
            inout (a[n]) highpriority
void add_queen_task(char *a, int j, int i, int n)
{
   a[j] = i;
}

#pragma css task input (n, j, a[n]) inout (results)\
            reduction (results)
void nqueens_ser_task(int n, int j, char *a, int *results)
{
   int i;
   int local_sols = 0;
   if (n == j) {
#pragma css mutex lock (&find_solution)
      if (!find_solution){
         find_solution = 1;
         memcpy(sol, a, n * sizeof(char));
      }
#pragma css mutex unlock (&find_solution)
      local_sols++;
   } else {
      for (i = 0; i < n; i++) {
         a[j] = i;
         if (ok(j + 1, a)) {
            nqueens_ser_task(n, j + 1, a, &local_sols);
         }
      }
   }
#pragma css mutex lock (results)
   *results = *results + local_sols;
#pragma css mutex unlock (results)
}

  37. Stream. Stream is one of the HPC Challenge benchmark suite (http://icl.cs.utk.edu/hpcc/): a simple synthetic benchmark that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernels. The original OpenMP version inserts barriers after each operation has processed all elements of the arrays. Two StarSs versions are shown, with and without barriers; without barriers, correctness is guaranteed by the data-dependence analysis. Kernels: Initialize a, b, c; Copy: c = a; Scale: b = s*c; Add: c = a + b; Triad: a = b + s*c.

  38. Stream: mimicking the original version

main (){
#pragma css start
   tuned_initialization();
#pragma css barrier
   scalar = 3.0;
   for (k=0; k<NTIMES; k++){
      tuned_STREAM_Copy();
#pragma css barrier
      tuned_STREAM_Scale(scalar);
#pragma css barrier
      tuned_STREAM_Add();
#pragma css barrier
      tuned_STREAM_Triad(scalar);
#pragma css barrier
   }
#pragma css finish
}

#pragma css task input (a) output (c)
void copy_task(double a[BSIZE], double c[BSIZE])
{
   int j;
   for (j=0; j < BSIZE; j++)
      c[j] = a[j];
}

void tuned_STREAM_Copy()
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      copy_task (&a[j], &c[j]);
}

#pragma css task input (c, scalar) output (b)
void scale_task (double b[BSIZE], double c[BSIZE], double scalar)
{
   int j;
   for (j=0; j < BSIZE; j++)
      b[j] = scalar*c[j];
}

void tuned_STREAM_Scale(double scalar)
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      scale_task (&b[j], &c[j], scalar);
}

  39. Stream

#pragma css task input (a, b) output (c)
void add_task (double a[BSIZE], double b[BSIZE], double c[BSIZE])
{
   int j;
   for (j=0; j < BSIZE; j++)
      c[j] = a[j]+b[j];
}

void tuned_STREAM_Add()
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      add_task(&a[j], &b[j], &c[j]);
}

#pragma css task input (b, c, scalar) output (a)
void triad_task (double a[BSIZE], double b[BSIZE], double c[BSIZE], double scalar)
{
   int j;
   for (j=0; j < BSIZE; j++)
      a[j] = b[j]+scalar*c[j];
}

void tuned_STREAM_Triad(double scalar)
{
   int j;
   for (j=0; j<N; j+=BSIZE)
      triad_task (&a[j], &b[j], &c[j], scalar);
}

  40. Stream: version without barriers. The SMPSs/CellSs data-dependence analysis guarantees correctness.

main (){
#pragma css start
   tuned_initialization();
#pragma css barrier
   scalar = 3.0;
   for (k=0; k<NTIMES; k++){
      tuned_STREAM_Copy();
      tuned_STREAM_Scale(scalar);
      tuned_STREAM_Add();
      tuned_STREAM_Triad(scalar);
   }
#pragma css finish
}

(Figure: dependence graph across the Init, Copy, Scale, Add and Triad tasks; dependences link tasks within the same iteration and into the next iteration.)

  41. Molecular dynamics: Argon simulation. Simulates the mobility of Argon atoms in the gas state, in a constant volume at T = 300 K. All electrostatic forces exerted on each atom by all the others are considered (Fi), and Newton's second law is then applied to each atom: Fi = m*ai. The initial velocities are random but reasonable for argon atoms at 300 K. To maintain a constant temperature throughout the process, the Berendsen algorithm is applied.
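In the code on the next two slides this appears as a per-step velocity rescaling (a simplified Berendsen-style thermostat; the full Berendsen factor also involves a coupling time constant, which the slides omit):

T_{inst} = \frac{1}{N}\sum_{i=1}^{N}\frac{m\,v_i^{2}}{3\,k_B}, \qquad
\lambda = \sqrt{\frac{T}{T_{inst}}}, \qquad
v_i \leftarrow \lambda\, v_i .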

  42. Molecular dynamics: Argon simulation (tasks declared through an interface block)

program argon
   ...
   interface
      !$CSS TASK
      subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
         implicit none
         integer, intent(in) :: BSIZE, ii, jj
         real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
         real, intent(inout), dimension(BSIZE) :: vx, vy, vz
      end subroutine
      !$CSS TASK
      subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
         implicit none
         integer, intent(in) :: BSIZE
         real, intent(in) :: lam1
         real, intent(inout), dimension(BSIZE) :: vx, vy, vz
         real, intent(inout), dimension(BSIZE) :: x, y, z
      end subroutine
      !$CSS TASK
      subroutine v_mod(BSIZE, v, vx, vy, vz)
         implicit none
         integer, intent(in) :: BSIZE
         real, intent(out) :: v(BSIZE)
         real, intent(in), dimension(BSIZE) :: vx, vy, vz
      end subroutine
   end interface
   ...
   !$CSS START
   do step=1,niter
      do ii=1, N, BSIZE
         do jj=1, N, BSIZE
            call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
         enddo
      enddo
      do jj=1, N, BSIZE
         call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
      enddo
      !$CSS BARRIER
      tins=0.e0
      do i=1,N
         tins=mkg*v(i)**2/3.e0/kb+tins
      enddo
      tins=tins/N
      lam1=sqrt(t/tins)
      do ii=1, N, BSIZE
         call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
      enddo
   enddo
   !$CSS FINISH
   ...
end program

  43. Molecular dynamics: Argon simulation (tasks annotated at the subroutine definitions)

program argon
   ...
   !$CSS START
   do step=1,niter
      do ii=1, N, BSIZE
         do jj=1, N, BSIZE
            call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
         enddo
      enddo
      do jj=1, N, BSIZE
         call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
      enddo
      !$CSS BARRIER
      tins=0.e0
      do i=1,N
         tins=mkg*v(i)**2/3.e0/kb+tins
      enddo
      tins=tins/N
      lam1=sqrt(t/tins)
      do ii=1, N, BSIZE
         call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
      enddo
   enddo
   !$CSS FINISH
end

!$CSS TASK
subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
   ! subroutine code
end subroutine

!$CSS TASK
subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
   ! subroutine code
end subroutine

!$CSS TASK
subroutine v_mod(BSIZE, v, vx, vy, vz)
   ! subroutine code
end subroutine

  44. Hands-on. Copy files from: /gpfs/scratch/bsc19/bsc19776/TutorialPRACE/tutorial.tar.gz

  45. Extensions to OpenMP

  46. Extensions to OpenMP tasking: dependences (not all arguments need to appear in a directionality clause), heterogeneous devices, different implementations of a task, separation of dependences and transfers. E. Ayguadé et al., “A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures”, LNCS, IWOMP 2009; IJPP 2010.

  47. Inline directives: save manual outlining! Tasks have no name, so no multiple implementations. (Example on the slide: Sparse LU, data stored by blocks.)
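A hedged sketch of what the inline (non-outlined) style could look like for the blocked LU example named on the slide; the clause spelling mirrors the StarSs directives and the kernel names (lu0, fwd, bdiv, bmod) are illustrative assumptions, so the proposal's actual syntax may differ:

for (k = 0; k < NB; k++) {
#pragma omp task inout(A[k][k])
   lu0 (A[k][k]);                                 // factorize the diagonal block in place
   for (j = k+1; j < NB; j++)
#pragma omp task input(A[k][k]) inout(A[k][j])
      fwd (A[k][k], A[k][j]);                     // update the row panel
   for (i = k+1; i < NB; i++)
#pragma omp task input(A[k][k]) inout(A[i][k])
      bdiv (A[k][k], A[i][k]);                    // update the column panel
   for (i = k+1; i < NB; i++)
      for (j = k+1; j < NB; j++)
#pragma omp task input(A[i][k], A[k][j]) inout(A[i][j])
         bmod (A[i][k], A[k][j], A[i][j]);        // trailing-submatrix update
}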

  48. Annotated function declaration: ALL instances become tasks. (Example on the slide: heterogeneous Cholesky.)
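A hedged sketch of the declaration-annotated style for the heterogeneous Cholesky named on the slide: annotating the declaration makes every call a task instance. The kernel names follow the earlier Cholesky slide, while the device split and the exact clause spelling are assumptions that may differ from the proposal.

#pragma omp task input(A[64][64], B[64][64]) inout(C[64][64]) target device(cuda)
void sgemm_tile (float *A, float *B, float *C);   // every call becomes a GPU task

#pragma omp task inout(A[64][64]) target device(smp)
void spotrf_tile (float *A);                      // every call becomes an SMP task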

  49. Array sections

  50. Hybrid MPI/SMPSs
