Shared Memory Parallelism
Programming in the SGI Environment
Parallel Programming on SGI
• Automatic parallelization with APO (the Auto-Parallelizing Option)
  • Fortran and C compiler option: -apo
  • HPF: pghpf (Portland Group), xHPF (third party)
• Parallelization by inserting OpenMP compiler directives (compiled with the -mp option)
  • Fortran: !$omp parallel, !$omp do
  • C: #pragma omp parallel
• Low-level parallelism
  • POSIX standard Pthreads
  • SGI library sproc()
  • UNIX native fork()
  • Sequent library m_fork()
• Message passing
  • PVM, MPI
  • shmem
• Setting the number of parallel threads (man pe_environ)
  • SMP (APO and OpenMP, sproc): setenv OMP_NUM_THREADS n
  • MPI: mpirun -np n prog
SMP Programming
• The parallelism in the code:
  • can be found by the compiler
  • can be asserted by the programmer with directives that help the compiler analyze the code
    • the compiler analyzes data access rules based on the additional information presented
  • can be forced by the programmer with directives/pragmas (see the C sketch below)
    • the compiler has to be told about data access rules (shared/private)
    • explicit synchronization is often necessary to access the shared data
  • can be forced by the programmer with library calls
    • POSIX Pthreads
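To make the directive-based approach concrete, here is a small compilable C/OpenMP sketch (written for these notes, not taken from the original slides; the array and variable names are invented). The data access rules are stated explicitly, and access to the shared accumulator is synchronized with a critical section.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void)
    {
        static double a[N];       /* shared: one copy for all threads */
        double sum = 0.0;         /* shared accumulator */
        int i;

        /* Data access rules stated explicitly: a shared, i private. */
        #pragma omp parallel for default(none) shared(a) private(i)
        for (i = 0; i < N; i++)
            a[i] = 0.001 * i;

        /* Access to the shared accumulator must be synchronized explicitly. */
        #pragma omp parallel for default(none) shared(a, sum) private(i)
        for (i = 0; i < N; i++) {
            double t = a[i];      /* local to the loop body: private */
            #pragma omp critical
            sum += t;
        }

        printf("sum = %g\n", sum);
        return 0;
    }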
Automatic Parallelization
% setenv OMP_NUM_THREADS 4
Automatically Parallelizable Code
• linear assignments:
      DO I=1,N
        A(I) = B(I) + X(I)
      ENDDO
• reductions:
      DO I=1,N
        S = S + X(I)
      ENDDO
• triangular assignments:
      DO I=1,N
        DO J=I,N
          A(J,I) = B(J,I) + X(I)
        ENDDO
      ENDDO
• NOT parallelizable:
  • early exits (see the sketch below)
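A minimal illustration of the early-exit case (invented for these notes, not from the original slides): the search loop below cannot be auto-parallelized because the break makes the number of executed iterations depend on the data.

    #include <stdio.h>

    /* Early exit: the loop may stop at any iteration, so the compiler
     * cannot divide the iterations among threads without changing the result. */
    int find_first_negative(const double *x, int n)
    {
        for (int i = 0; i < n; i++) {
            if (x[i] < 0.0)
                return i;        /* early exit defeats auto-parallelization */
        }
        return -1;
    }

    int main(void)
    {
        double x[5] = { 1.0, 2.0, -3.0, 4.0, -5.0 };
        printf("first negative at index %d\n", find_first_negative(x, 5));
        return 0;
    }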
Compiler Assertions
• Assertions help the compiler to deal with data dependencies that cannot be analyzed based on program semantics alone:
• C*$* (NO) CONCURRENTIZE / #pragma (no) concurrentize
  • (not) to apply parallel optimizations; the scope is the entire compilation unit
• C*$* ASSERT DO (SERIAL) / #pragma serial
• C*$* ASSERT DO (CONCURRENT) / #pragma concurrent
  • apply the stated behaviour to the loop (nest) immediately following the directive
• C*$* ASSERT PERMUTATION(A) / #pragma permutation(a)
  • similar to the IVDEP directive; asserts that the indirect addressing with index array A in the following loop nest will not alias to the same storage location
• C*$* ASSERT CONCURRENT CALL / #pragma concurrent call
  • asserts that the following loop nest can be parallel despite subroutine invocations that cannot be analyzed for side effects
• the -LNO:ignore_pragmas compiler option disables these directives
Assertion Examples

      do …
        do …
c*$* assert do (serial)
          do …
            do …
            enddo
          enddo
        enddo
      enddo

  • the compiler will not parallelize the outer loops, since the inner loop must be serial
  • the compiler might parallelize the inner loops
• The compiler analyzes the code given the information in the assertion and does automatic parallelization if possible
  • even with the c*$* assert do (concurrent) directive, the compiler will not make loops parallel where it finds obvious data dependencies
• If it does not work, the next step is to introduce the parallelization directive with explicit rules for variable usage given by the user (e.g. OpenMP semantics)

C*$* assert concurrent call
      do I=1,n
        call do_parallel_work(I)
      enddo

C*$* assert permutation(i1)
      do k=1,nsite
        A(i1(k)) = A(i1(k)) + force*B(k)*B(k)
      enddo
Manual Parallelization
• When parallelizing a program (or part of it) manually, one needs to consider the following:
  • what is the highest level of parallelism (coarse-grain vs. fine-grain parallelism)
  • which data should be shared and which private
  • how the data should be distributed between processors (a special optimization for CC-NUMA programming)
  • how the work should be distributed between processors (or select automatic work distribution by the compiler)
  • whether threads can access shared data simultaneously, so that race conditions can occur
Creation of Parallel Regions
• With OpenMP, the syntax is:

C$OMP PARALLEL [clause[,clause]…]
      ……… code to run in parallel
C$OMP END PARALLEL

      #pragma omp parallel [clause[,clause]…]
      {
        /* this code is run in parallel by OMP_NUM_THREADS threads */
      }

• clause specifies the data association rules (default, shared, private) for each variable (array, common block)
• internally, the compiler generates a subroutine call for the body of the parallel code; the data definitions determine how variables are passed to that subroutine
• Old syntax: C$DOACROSS, #pragma parallel, !$par parallel … !$par end parallel
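A compilable C version of the parallel-region skeleton above (a sketch written for these notes; nthreads and id are illustrative names):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int nthreads = 0;                 /* shared: written inside a single construct */
        int id;                           /* private: each thread gets its own copy */

        #pragma omp parallel default(none) shared(nthreads) private(id)
        {
            id = omp_get_thread_num();    /* executed once by every thread */
            #pragma omp single
            nthreads = omp_get_num_threads();
            printf("hello from thread %d\n", id);
        }

        printf("parallel region was executed by %d threads\n", nthreads);
        return 0;
    }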
Data Status Association
• In a serial program:
  • global data
    • variables at a fixed, well-known address (global variables)
    • subroutine arguments, common blocks, externals
  • local data
    • variables with an address known only to a single block of code
      • automatic variables, (re-)created on the stack
    • variables with an address allocated at load time (static)
      • all DATA, SAVE, -static compiled programs
    • dynamic variables allocated at run time
      • malloc, ALLOCATE, etc.
• A parallel program has additional qualifications:
  • shared data
    • variables that can be accessed by every thread
  • private data
    • variables with an address known only to a single thread
  • volatile data
    • special variables that can be changed outside of the program
  • synchronization data
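A small C sketch (invented for these notes, not from the slides) of how these storage classes map onto shared and private data inside a parallel region:

    #include <stdio.h>
    #include <omp.h>

    int global_counter = 0;              /* global: one copy, shared by all threads */

    void work(void)
    {
        int automatic_var = 0;           /* stack: each thread has its own copy (private) */
        static int static_var = 0;       /* static: one fixed address, shared by all threads */

        automatic_var++;                 /* safe: private to the calling thread */

        #pragma omp atomic
        static_var++;                    /* shared: needs synchronization */

        #pragma omp atomic
        global_counter++;                /* shared: needs synchronization */
    }

    int main(void)
    {
        #pragma omp parallel
        work();

        printf("global_counter = %d\n", global_counter);
        return 0;
    }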
Data Status Transition in Parallel Programs
• In parallel programs the following transitions take place:
  • It is safe to work with automatic local variables allocated on the stack. In a parallel region each thread of execution gets its own private stack, so these variables are automatically private.
  • Especially hazardous are static local variables. Since these live at a single fixed address (in the data segment, not on a per-thread stack), that address is the same in every thread, i.e. they are shared.
  • Dynamic variables are mostly created on the heap and are therefore shared in a parallel program. However, when the allocation happens from within a parallel region, each thread allocates its own copy and the data is effectively private.
  • OpenMP syntax allows the status of each variable to be declared explicitly. However, down a call hierarchy the variables assume their default status as specified above.
Volatile Variables
• Volatile variables get special attention from the compiler:
  • data that gets updated may be kept in a register as part of compiler optimization
  • a write to shared data in one thread may therefore not be visible to another thread until the register is flushed to memory
  • this is usually incorrect behaviour for shared flags and can lead to a deadlock or incorrect results
  • variables whose updates must propagate immediately to all threads have to be declared volatile (see the sketch below)
  • the compiler performs no optimizations on volatile variables, nor on the code around them, so that every write to a volatile variable is flushed to memory
  • there is a performance penalty associated with volatile variables
• Syntax:
  • Fortran: VOLATILE var-list
  • C: volatile "type" var-list
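A sketch of the typical use (invented for these notes): one thread spins on a flag that another thread sets. Without volatile the compiler could keep the flag in a register, and the waiting thread might never observe the update. Note that volatile alone does not order the payload write relative to the flag on all architectures; a memory barrier may also be needed in real code.

    #include <stdio.h>
    #include <omp.h>

    volatile int ready = 0;        /* volatile: every read/write goes to memory */
    int payload = 0;

    int main(void)
    {
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {
                payload = 42;      /* produce the data ...                    */
                ready = 1;         /* ... then publish the flag               */
            } else {
                while (!ready)     /* spin until the flag becomes visible     */
                    ;              /* volatile forces a fresh load every time */
                printf("payload = %d\n", payload);
            }
        }
        return 0;
    }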
Synchronization Variables
• To be found in man sync(3F):
  • call synchronize
    • full memory barrier: memory references will not cross it
  • atomic read-modify-write intrinsics that perform the j OP k operation and return either the old or the new value:
      i = fetch_and_OP(j,k)        <- atomically: i = j, then j = j OP k  (returns the old value)
      i = OP_and_fetch(j,k)        <- atomically: j = j OP k, then i = j  (returns the new value)
      l = compare_and_swap(j,k,i)  <- atomically: (j==k) ? j=i, l=1 : l=0
      i = lock_test_and_set(j,k)   <- atomically: i = j, j = k
      call lock_release(i)         <- atomically: i = 0
  • OP stands for the functions: add, sub, or, and, xor, nand
  • i, j, k, l can be INTEGER*4 or INTEGER*8 variables
  • all these functions are guaranteed to be atomic and will be performed on globally visible arguments
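For comparison, here is a portable sketch of the same fetch-and-op and compare-and-swap ideas using C11 atomics. This is an analogy written for these notes, not the SGI sync(3F) interface.

    #include <stdio.h>
    #include <stdatomic.h>

    int main(void)
    {
        atomic_int j = 10;

        /* fetch-and-add: returns the old value, like i = fetch_and_add(j,k) */
        int old = atomic_fetch_add(&j, 5);

        /* compare-and-swap: if j == expected, set j = desired and return true */
        int expected = 15;
        int swapped = atomic_compare_exchange_strong(&j, &expected, 99);

        printf("old = %d, swapped = %d, j = %d\n", old, swapped, atomic_load(&j));
        return 0;
    }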
Workload Distribution
(figure: SCHEDULE(STATIC,6) vs. SCHEDULE(GUIDED,4), 26 iterations on 4 processors)
• Load imbalance is the condition when some processors do more work than others; load imbalance leads to bad scalability
• Load imbalance can be diagnosed by profiling the program and comparing the execution times of the code in different threads
• In OpenMP, the compiler accepts directives for work distribution (see the sketch below):
  • C$OMP DO SCHEDULE(type[,chunk]) where type is
    • STATIC: iterations are divided into equal pieces ahead of time (default)
    • DYNAMIC: iterations are assigned to processors dynamically, as they finish; this requires synchronization after every chunk iterations
    • GUIDED: pieces reduce exponentially in size with each dispatched piece
    • RUNTIME: schedule determined by the environment variable OMP_SCHEDULE; with RUNTIME it is illegal to specify chunk
• The compiler also accepts the -mp_schedtype=type switch
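A minimal C sketch (not from the slides) showing how the schedule clause is attached to a loop; a dynamic schedule helps when iterations have very different costs, as in the artificial example below.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double sum = 0.0;

        /* Iteration cost grows with i, so a dynamic schedule balances the load. */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:sum)
        for (int i = 0; i < 26; i++) {
            double work = 0.0;
            for (int k = 0; k < (i + 1) * 1000; k++)
                work += 1.0 / (k + 1.0);
            sum += work;
        }

        printf("sum = %f\n", sum);
        return 0;
    }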
Example: Parallel Region in FORTRAN
• Compiler directives are enabled with the -mp or -apo switch; with -apo automatic parallelization is requested
• In this example all loops can be made parallel automatically (-apo)

      SUBROUTINE MXM(A,B,C,L,M,N,N1)
      DIMENSION A(M,L), B(N1,M), C(N1,L)
      DO 100 K=1,L
      DO 100 I=1,N
        C(I,K) = 0.0
 100  CONTINUE
C$OMP PARALLEL DO
      DO 110 K=1,L
      DO 110 I=1,N
      DO 110 J=1,M
        C(I,K) = C(I,K) + B(I,J)*A(J,K)
 110  CONTINUE
      END

• the first loop nest (without a directive) will be run serially, unless -apo is specified
• default variable association rules: all variables are shared (i.e. A, B, C are shared)
• induction variables I, J, K will be found automatically and assigned private scope
Parallelization Examples/3
• Code with shared data (see the sketch after this slide for one fix):

      common //a(im,jm),b(im,jm),c(im,jm),d(im,jm),f(im)
      do j=1,jm
        call dowork(j)
      enddo
      end

      subroutine dowork(j)
      common //a(im,jm),b(im,jm),c(im,jm),d(im,jm),f(im)
      do I=1,im
        f(I) = (a(I,j) + b(I,j))*3.141592*j
        c(I,j) = f(I)**3
      enddo
      end

• Making the j loop parallel as it stands will lead to incorrect results, because all threads will use the same address for the scratch array f
• Three solutions are possible:
  • the scratch array f should be declared local in the subroutine
  • f should be isolated in a special common block /scratch/, which
    • could be declared c$omp threadprivate(/scratch/), or
    • could be linked with the flag -Wl,-Xlocal,scratch_
  • f could be extended to f(*,"numthreads") with one column per thread
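A C sketch of the first fix (written for these notes; the array sizes and names are invented): the scratch array is made an automatic variable of the subroutine, so it lives on each thread's private stack.

    #include <stdio.h>

    #define IM 8
    #define JM 4

    double a[IM][JM], b[IM][JM], c[IM][JM];   /* shared global data */

    /* Fix: the scratch array is automatic (stack), so every thread
     * works with its own private copy. */
    void dowork(int j)
    {
        double f[IM];
        for (int i = 0; i < IM; i++) {
            f[i] = (a[i][j] + b[i][j]) * 3.141592 * (j + 1);
            c[i][j] = f[i] * f[i] * f[i];
        }
    }

    int main(void)
    {
        #pragma omp parallel for
        for (int j = 0; j < JM; j++)
            dowork(j);

        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }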
False Sharing
(figure: processors P0 and P1 sharing one cache line that holds a[0]…a[9]; each loads elements it never uses)
• Bad data or work distribution can lead to false sharing
• example:

      for (i = myid; i < n; i += nthr) {
        a[i] = b[i] + c[i];
      }

  • processor P0 does only even iterations
  • processor P1 does only odd iterations
  • both processors read and write the same cache lines
  • each write invalidates the cache line in the peer processor
  • each processor runs the cache coherency protocol on the cache lines, so the cache lines ping-pong from one cache to another (see the sketch below for the usual fix)
(figure: performance vs. number of processors, flat "false sharing" curve vs. rising normal behaviour)
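A sketch of the usual fix (invented for these notes): give each thread one contiguous block of iterations, so that different threads write into different cache lines except at chunk boundaries.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    double a[N], b[N], c[N];

    int main(void)
    {
        /* Block distribution: thread t works on one contiguous chunk,
         * avoiding the cache-line ping-pong of the interleaved version. */
        #pragma omp parallel
        {
            int nthr  = omp_get_num_threads();
            int myid  = omp_get_thread_num();
            int chunk = (N + nthr - 1) / nthr;
            int lo    = myid * chunk;
            int hi    = lo + chunk < N ? lo + chunk : N;

            for (int i = lo; i < hi; i++)
                a[i] = b[i] + c[i];
        }

        printf("a[0] = %f\n", a[0]);
        return 0;
    }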
Scalability & Data Distribution • The Scalability of parallel programs might be affected by: • Amdahl law for the parallel part of the code • plot elapsed run times as function of number of processors • Load balancing and work distribution to the processors • profiling the code will reveal significant run times for libmp library routines • compare perfex for event counter 21 (fp operations) for different threads • False sharing of the data • perfex high value of cache invalidation events (event counter 31: store/prefetch exclusive to shared block in scache) • Excessive synchronization costs • perfex high value for store conditionals (event counter 4) • When cache misses account for only a small percentage of the run time, the program makes good use of the memory and its performance will not be affected by data placement • diagnose significant memory load with perfex statistics: high value of memory traffic • Program scalability will be improved with data placement if there is: • memory contention on one node • excessive cache traffic
Why Distribute?
(figure: SGI nodes, each with R1x000 processors and L2 caches connected through a Bedrock ASIC to local memory/directory)
• CC-NUMA architecture optimization. If:
  • all data is initialized on one node
    • memory bottleneck
  • all shared data is on one node
    • serialization of the cache coherency traffic
  • data is used from a different node
    • large interconnect traffic
• average network latency on the SGI 3800 to memory beyond 16 processors: ~(1/2 log2(p/16) + 1) * 50 ns
SGI Additions to OpenMP
• Data distribution extensions:
  • IRIX dplace command
  • setenv _DSM_PLACEMENT FIRST_TOUCH|ROUND_ROBIN
  • compiler directives:
    • !$sgi distribute a(*,BLOCK,CYCLIC(5))
    • !$sgi distribute_reshape a(*,BLOCK,…)
Default Distribution: "First Touch"
• IRIX strategy for data placement:
  • respect program distribution directives
  • allocate on the node where the data is initialized (first touched)
  • allocate on a node where memory is available

      PARAMETER (PAGESIZE=16384, SIZEDBLE=8)
      REAL, ALLOCATABLE :: WORK(:)       !! distribution directive not available
      READ *, MX
      ALLOCATE(WORK(MX))
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(WORK,MX)
      ID = OMP_GET_THREAD_NUM()
      NP = OMP_GET_NUM_THREADS()
      NPAGES = MX*SIZEDBLE/PAGESIZE
      DO I=ID,NPAGES-NP+1,NP             !! round-robin distribution of pages
        WORK(1+I*PAGESIZE/SIZEDBLE) = 0.0
      ENDDO
!$OMP END PARALLEL
Data Placement Policies
(figure: nodes with two cpus, a HUB, and local memory, illustrating the three placement scenarios below)
• Initialization with the "first touch" policy on a single processor
  • all data on a single node
  • bottleneck in access to that node
• Initialization with the "first touch" policy with multiple processors (see the sketch below)
  • data is distributed naturally
  • each processor has local data to run on
  • minimal data exchange between nodes
  • page edge effects
• Initialization with the "round robin" policy with a single processor
  • circular page placement on the nodes
  • on average, all pages are remote
  • no single bottleneck
  • significant traffic between nodes
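A C sketch (written for these notes; the array size is arbitrary) of parallel first-touch initialization: each thread initializes the chunk it will later compute on, so under the first-touch policy those pages land on that thread's node.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 20)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        if (!a || !b) return 1;

        /* Parallel initialization: with first touch, each thread's chunk of
         * pages is allocated on the node where that thread runs. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = i * 0.5;
        }

        /* The compute loop uses the same static schedule, so each thread
         * mostly touches memory that is local to its node. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 2.0 * b[i];

        printf("a[N-1] = %f\n", a[N - 1]);
        free(a);
        free(b);
        return 0;
    }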
Example: Data Distribution
• To avoid the bottleneck resulting from memory allocation on a single node:
  • perform the initialization in parallel, such that each processor initializes the data it is likely to access later for calculation
  • use explicit data placement compiler directives (e.g. C$DISTRIBUTE)
  • set the environment variable for data placement (setenv _DSM_PLACEMENT)
  • use the dplace command (with appropriate options) to start the program execution

      REAL*8 A(N),B(N),C(N),D(N)
C$DISTRIBUTE A(BLOCK),B(BLOCK),C(BLOCK),D(BLOCK)
      DO I=1,N
        A(I) = 0.0
        B(I) = I/2
        C(I) = I/3
        D(I) = I/7
      ENDDO
C$OMP PARALLEL DO
      DO I=1,N
        A(I) = B(I) + C(I) + D(I)
      ENDDO
Data Initialization: Example
• F90 allocatable arrays should be initialized explicitly to place the pages:

      PARAMETER (PAGESIZE=16384, SIZEDBLE=8)
      REAL, ALLOCATABLE :: WORK(:)
      READ *, MXRVAR
      ALLOCATE(WORK(MXRVAR))
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(WORK,MXRVAR)
      ID = OMP_GET_THREAD_NUM()
      NP = OMP_GET_NUM_THREADS()
      NBLOCK = MXRVAR/NP
      NPAGES = MXRVAR*SIZEDBLE/PAGESIZE
      NLOWER = ID*NBLOCK
      NUPPER = NLOWER + NBLOCK
      NLOWER = NLOWER + 1
      NUPPER = MIN(NUPPER,MXRVAR)
c     round-robin distribution of pages:
      DO I=ID,NPAGES-NP+1,NP
        WORK(1+I*PAGESIZE/SIZEDBLE) = 0.0
      ENDDO
!$OMP END PARALLEL
Data Distribution Directives
• The compiler supports the following data distribution directives:
• !$SGI DISTRIBUTE A(shape[,shape]…) ONTO(n[,m]…) / #pragma distribute …
  • define a regular distribution of an array with the given shape
  • the shape can be
    • BLOCK
    • CYCLIC(k)
    • * (default: do not distribute)
  • the placement of the pages will not violate Fortran/C rules for data usage and therefore the exact shape is not guaranteed; in particular there is no padding at page boundaries
• !$SGI DISTRIBUTE_RESHAPE A(shape[,shape]…) ONTO(n[,m]…) / #pragma distribute_reshape …
  • define a reshaped distribution of an array with the given shape, i.e. an exact mapping of the data to the given shape; the shape parameters are as for the DISTRIBUTE directive
• !$SGI DYNAMIC A / #pragma dynamic …
  • permit dynamic redistribution of the array A
• !$SGI REDISTRIBUTE A(shape[,shape]…) / #pragma redistribute …
  • executable statement; forces redistribution of a (previously distributed) array A
• !$SGI PAGE_PLACE A(…) / #pragma page_place …
  • specify distribution by pages
• !$SGI AFFINITY(i)=[DATA(A(I))|THREAD(I)] / #pragma affinity …
  • the affinity clause specifies that iteration i should execute on the processor whose memory contains element A(i), or which runs thread i
Data Distribution: Examples/1
• Regular data distributions to 4 processors:
  1) !$SGI DISTRIBUTE X(BLOCK)
  2) X(CYCLIC) or X(CYCLIC(1))
  3) X(CYCLIC(3))
  4) Y(*,BLOCK)
        Real*8 Y(1000,1000)
c$sgi distribute Y(*,block)
     each individual portion is a contiguous piece of size 8*10^6/P bytes, i.e. for 4 processors it is 2 MB per piece
  5) Y(BLOCK,*)
        Real*8 Y(1000,1000)
c$sgi distribute Y(block,*)
     each individual portion is still of size 8*10^6/P bytes, but each contiguous piece is only 8*10^3/P bytes, i.e. for 4 processors it is 2 KB (< 16 KB page size)
  6) Y(BLOCK,BLOCK)
• If the requested distribution incurs significant page boundary effects (example: (5)), reshaping of the array data is necessary
Data Distributions: Examples/2
• Regular data distributions on 6 processors; the ONTO clause changes the aspect ratio:
  7) Y(BLOCK,BLOCK)
  8) Y(BLOCK,BLOCK) ONTO (2,3)
  9) Y(BLOCK,BLOCK) ONTO (1,*)
  (the ONTO arguments should be constant integers)
• A regular data distribution can be re-distributed:

C$SGI DISTRIBUTE A(BLOCK,BLOCK,BLOCK) ONTO (2,2,8)
C$SGI DYNAMIC A
      …
c - executable statement, can appear anywhere in the code:
C$SGI REDISTRIBUTE A(BLOCK,BLOCK,BLOCK) ONTO (1,1,32)
Data Reshaping Restrictions/1
• The DISTRIBUTE_RESHAPE directive declares that the program makes no assumptions about the storage layout of the array
• Array reshaping is legal only under certain conditions:
  • an array can be declared with only one of these options:
    • default (no (re-)shaping)
    • distribute
    • distribute_reshape
  • it will keep the declared attribute for the duration of the program
    • a reshaped array cannot be re-distributed
  • a reshaped array cannot be equivalenced to another array
    • explicitly, through an EQUIVALENCE statement
    • implicitly, through multiple declarations of a common block
  • if an array in a common block is reshaped, all declarations of that common block must:
    • contain an array at the same offset within the common block
    • declare that array with the same number of dimensions and the same size
    • specify the same reshaped distribution for the array
Data Reshaping Restrictions/2
• when a reshaped array is passed as an actual argument to a subroutine:
  • if the reshaped array is passed as a whole (i.e. call sub(A)), the formal parameter must match exactly the number of dimensions and their sizes
  • if only elements of the reshaped array are passed (i.e. call sub(A(0)) or call sub(A(I))), the called procedure treats the incoming parameter as non-distributed
  • for elements of a reshaped array, the declared bounds on the formal parameter must not exceed the size of the distributed array portion passed in as the actual argument. Example:

      Real*8 A(1000)
c$ distribute_reshape A(cyclic(5))
      do I=1,1000,5
        call mysub(A(I))
      enddo
      end

      subroutine mysub(X)
      real*8 X(5)
      …
      end

• the compiler performs a consistency check on the legality of the reshaping:
  • -MP:dsm=ON (default)
  • -MP:check_reshape={ON|OFF} (default OFF): generation of run-time consistency checks
  • -MP:clone={ON|OFF} (default ON): automatically generate clones of procedures called with reshaped parameters
Data Reshaping Restrictions/3
• Consider also the following restrictions:
  • BLOCK DATA (i.e. statically) initialized data cannot be reshaped
  • alloca(3) or malloc(3) arrays cannot be reshaped
  • F90: a dynamically sized automatic array can be distributed, but an F90 allocatable array cannot
  • I/O for a reshaped array cannot be mixed with namelist I/O or a function call in the same I/O statement
  • a common block containing a reshaped array cannot be linked -Xlocal
Data Distribution: Affinity
• Usage of the AFFINITY clause: iterations are scheduled such that each processor does the iterations over data that is placed in the memory near to that processor
• data form:

      REAL*8 A(N),B(N),Q
!$SGI DISTRIBUTE A(BLOCK),B(BLOCK)
C INITIALIZATION:
      DO I=1,N
        A(I) = 1 - I/2
        B(I) = -10 + 0.01*(I*I)
      ENDDO
C REAL WORK:
      DO IT=1,NITERS
        Q = 0.01*IT
C$OMP PARALLEL DO
C$SGI AFFINITY(I)=DATA(A(I))
        DO I=1,N
          A(I) = A(I) + Q*B(I)
        ENDDO
      ENDDO

• thread form: identical, except that the affinity directive reads

C$SGI AFFINITY(I)=THREAD(I/(N/P))
The dplace Program
• dplace - shell level tool for data placement:
  • declare the number of threads to use
  • declare the number of memories to use
  • specify the topology of the memories
  • specify the distribution of threads on to memories
  • specify the distribution of virtual addresses on to memories
  • specify text/data/stack page sizes
  • specify the migration policy and threshold
  • dynamically migrate virtual address ranges between memories
  • dynamically move threads from one node to another
  • see man dplace(1)
• Note: dplace switches off the _DSM_… environment variables

  dplace [-place file] [-verbose] [-mustrun] [-data_pagesize size] [-stack_pagesize size] [-text_pagesize size] [-migration threshold] program arguments
dplace: Placement File
• The file must include:
  • the number of memories being used and their topology:
    • memories M in topology [none|cube|cluster|physical near]
  • the number of threads:
    • threads N
  • the assignment of threads to memories:
    • run thread n on memory m [using cpu k]
    • distribute threads [t0:t1[:dt]] across memories [m0:m1[:dm]] [block [m]] | [cyclic [k]]
  • optional specifications for policy:
    • policy stack|data|text pagesize N [k|K]
    • policy migration n [%]
    • mode verbose [on|off|toggle]
• Example:

  memories ($OMP_NUM_THREADS+1)/2 in topology cube
  threads $OMP_NUM_THREADS
  distribute threads across memories
Data Distribution: page_place
• Useful for irregular and dynamically allocated data structures:
  • #pragma page_place (A, index-range, thread-id)
  • C$SGI PAGE_PLACE (A, index-range, thread-id)
• Note: page_place is useful only for the initial placement of the data; it has no effect if the pages have already been created

      Parameter (n=8*1024*1024)
      real*8 a(n)
      p=1
c$    p = omp_get_max_threads()
c-    distribute using block mapping
      npp=(n + p-1)/p      ! number of elements per processor
c$omp parallel do
      do I=0,p-1
c$sgi page_place (a(1+I*npp), npp*8, I)
      enddo
Environment Variables: Process
• OMP_NUM_THREADS (default min(8, #cpu))
  • sets the number of threads to use during execution. If OMP_DYNAMIC is set to TRUE, this is the maximum number of threads
• OMP_DYNAMIC TRUE|FALSE (default TRUE)
  • enables or disables dynamic adjustment of the number of parallel threads to the load on the machine. The adjustment can take place at the start of each parallel region. Setting OMP_DYNAMIC to FALSE will switch MPC_GANG ON
• MPC_GANG ON|OFF (default OFF)
  • with gang scheduling all parallel threads are started simultaneously and the time quantum is twice as long as normal. To obtain higher priority for parallel jobs it is often necessary to use:
    • setenv OMP_DYNAMIC FALSE
    • setenv MPC_GANG OFF
• _DSM_MUSTRUN (default not set)
  • locks each thread to the CPU where it happened to be started
• _DSM_PPM (default 2/4 processes per node)
  • if _DSM_PPM is set to 1, only one process will be started on each node
Environment Variables: Size
• MP_SLAVE_STACKSIZE (default 16 MB; 4 MB for more than 64 threads)
  • stack size of the slave processes. The stack space is private; an adjustment of this size might be necessary if the machine has constraints on memory
• MP_STACK_OVERFLOW (default ON)
  • stack size checking. Some programs overflow their allocated stack while still functioning correctly; this variable has to be set to OFF in that case
• PAGESIZE_STACK (default 16 KB)
• PAGESIZE_DATA (default 16 KB)
• PAGESIZE_TEXT (default 16 KB)
  • specify the page size for the stack, data and text segments. Sizes can be specified in KBytes or MBytes in powers of 2: 64 KB, 256 KB, 1 MB, 4 MB. If the system does not have the requested pages available, it might silently choose a smaller page size
Environment Variables: DSM
• _DSM_PLACEMENT (default FIRST_TOUCH)
  • memory allocation policy for the stack, data and text segments. This variable can be set to the following values:
    • FIRST_TOUCH: allocate pages close to the process which requested them
    • ROUND_ROBIN: allocate pages alternating over the nodes where the processes run
• _DSM_OFF (default not set)
  • data placement directives will not be honoured if _DSM_OFF is set to ON. In particular, this is the mechanism the dplace program uses to force the specified data distribution in an executable, e.g.
    • dplace -place file myprogram
  • will set _DSM_OFF to ON and proceed with the specifications in file
• _DSM_VERBOSE (default not set)
  • when set, reports memory allocation decisions to stdout
• see the man pe_environ(5) page for other useful environment variables
Case Study: Maxwell Code
• For scalar optimization, NX, NY, NZ have been adjusted to uneven values and the arrays H, E have been padded to avoid cache thrashing
• Now parallelization can be attempted to further increase performance on a parallel machine:
  • automatic parallelization with the -apo compiler option
  • manual parallelization approach
• Study the performance implications of the different parallelization approaches

      REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ)   !Electric field
      REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ)   !Magnetic field
      …
      DO K=2,NZ-1
        DO J=2,NY-1
          DO I=2,NX-1
            HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                         +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
            HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
     &                         +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
            HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
     &                         +(EX(I,J,K)-EX(I,J-1,K))*CHDY
          ENDDO
        ENDDO
      ENDDO

here NX=NY=NZ = 32, 64, 128, 256 (i.e. with REAL*4 elements: 0.8 MB, 6.3 MB, 50 MB, 403 MB)
Case Study: Maxwell Code Automatic Parallelization
• The Maxwell program is automatically parallelized with: f77 -apo -O3 maxwell.f
• There are 3 parallel sections:
  1. Initialization:

      DO K=1,NZ
        DO J=1,NY
          DO I=1,NX
            HX(I,J,K)=1E-3/(I+J+K)
            EX(I,J,K)=1E-3/(I+J+K)
            ….
          ENDDO
        ENDDO
      ENDDO

  2. Calculation of H:

      DO K=2,NZ-1
        DO J=2,NY-1
          DO I=2,NX-1
            HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                         +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
            HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
     &                         +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
            HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
     &                         +(EX(I,J,K)-EX(I,J-1,K))*CHDY
          ENDDO
        ENDDO
      ENDDO

  3. Calculation of E:

      DO K=2,NZ-1
        DO J=2,NY-1
          DO I=2,NX-1
            EX(I,J,K)=EX(I,J,K)-(HZ(I,J,K)-HZ(I,J-1,K))*CHDY
     &                         +(HY(I,J,K)-HY(I,J,K-1))*CHDZ
            EY(I,J,K)=EY(I,J,K)-(HX(I,J,K)-HX(I,J,K-1))*CHDZ
     &                         +(HZ(I,J,K)-HZ(I-1,J,K))*CHDX
            EZ(I,J,K)=EZ(I,J,K)-(HY(I,J,K)-HY(I-1,J,K))*CHDX
     &                         +(HX(I,J,K)-HX(I,J-1,K))*CHDY
          ENDDO
        ENDDO
      ENDDO

• There are no data dependencies in any of the loops
• The compiler divides the k-loop iterations among the N processors
Scalability of Auto-Parallelization
(figure: speedup vs. number of processors for the auto-parallelized Maxwell code; callouts:)
• super-linear speedup when the data fits into the cache
• slow-down from load imbalance when the k-loop range is too small
• the parallel loop is the k loop, DO K=2,NZ-1, therefore the performance peaks at the "perfect" points 62/2 and 126/2
Case Study: Workload Distribution
• Parallelization with directives:

c$omp parallel do
c$sgi nest(k,j)

  placed in front of the properly nested loop nest
Case Study: Data Placement
• Parallel initialization can be forced to be done serially with the directive c*$* assert do (serial)
• Data distribution directive: c$distribute
• Data placement with an environment variable: setenv _DSM_PLACEMENT ROUND_ROBIN
  • there is no single bottleneck, but most data is remote
• Since the default page placement strategy is "first touch", with serial initialization all pages end up in the memory of the single node where the program starts; data traffic from/to that single node limits the scalability
(figure callouts: "write-back traffic", "most data is in cache")
Parallel Thread Creation
• There are several mechanisms to create a parallel processing stream:
  • library calls:
    • shared address space: sproc() (shared memory fork), m_fork() (simpler sproc), pthread_create (POSIX threads), start_pes (shmem init routine)
    • disjoint address space: fork() (UNIX), MPI_Init() (MPI start up)
  • compiler level (compiler directives): C$DOACROSS (SGI old syntax), c$par parallel (ANSI X3H5 spec), !$omp parallel / #pragma omp parallel (OpenMP specs)
  • automatic parallelization (-apo)
• At low level, all shared memory parallelization passes through the sproc call to create threads of control with a global address space (a Pthreads sketch follows below)
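A minimal POSIX threads sketch (written for these notes; the worker function and argument layout are invented) of creating threads of control that share the global address space:

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4

    static double partial[NTHREADS];   /* global address space: visible to all threads */

    static void *worker(void *arg)
    {
        long id = (long)arg;           /* each thread receives its own id */
        partial[id] = id * 10.0;       /* distinct slots, so no synchronization needed */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);

        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        for (int i = 0; i < NTHREADS; i++)
            printf("partial[%d] = %f\n", i, partial[i]);
        return 0;
    }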
Compile Time Actions
• Each parallel region (either found automatically or via a directive):
  • is converted into a subroutine call executed by each parallel thread
    • the subroutine contains the execution body of the parallel region
    • the arguments to the subroutine contain scheduling parameters (loop ranges)
  • the new subroutine name:
    • is derived as _mpdo_sub_#
    • will appear in the profile
  • data in the parallel region:
    • private data: local (automatic) variables in the newly created subroutine
    • shared data: passed by well-known address to the new subroutine
• Original code:

      Subroutine daxpy(n,a,x,y)
      real*8 a,x(n),y(n)
      do I=1,n
        y(I) = y(I)+a*x(I)
      enddo
      end

• Compiler-generated (schematic) code:

      Subroutine daxpy(n,a,x,y)
      real*8 a,x(n),y(n)
      integer*4 args
      _LOCAL_NTRIP = n/__mp_sug_numthreads_func$()
      _LOCAL_START = 1+__mp_my_thread_id()*_LOCAL_NTRIP
      _INCR = 1
      args(1) = _LOCAL_START
      args(2) = _LOCAL_NTRIP
      args(3) = _INCR
      args(4) = 0
      call sproc$(_mpdo_daxpy_1,%VAL(PR_SALL),args)
      END

      subroutine _mpdo_daxpy_1(_LOCAL_START,_LOCAL_NTRIP,_INCR,_INF)
      do I=_LOCAL_START,_LOCAL_NTRIP
        y(I) = y(I)+a*x(I)
      enddo
      end
Run Time Actions
(figure: master process p0 and slaves p1…pn-1; sproc() / mp_setup() or the first !$omp parallel creates the shared memory process group; the master calls __mp_simple_sched(par_loop) and __mp_wait_for_completion() and synchronizes in __mp_barrier_master(); the slaves execute par_loop, meet in __mp_barrier_nthreads() and return to __mp_slave_wait_for_work(); the master then continues serial execution)
• When the parallel threads are spawned:
  • they spin in __mp_slave_wait_for_work()
• When the master enters a parallel region:
  • it passes the address of the _mpdo subroutine to the slaves
  • at the end of the parallel region all threads synchronize
  • the slaves go back to spinning in __mp_slave_wait_for_work()
  • the master continues with serial execution
Timing and Profiling an MP Program
• To obtain timing information for an MP program:
  • setenv OMP_NUM_THREADS n
  • time a.out
• example:
  • n=1 time output: 9.678u 0.091s 0:09.86 98.9% 0+0k 0+1io 0pf+0w
  • n=2 time output: 10.363u 0.121s 0:05.72 183.2% 0+0k 3+8io 3pf+0w
  • from this output we derive the following performance data:

      processors   User+System   Elapsed
      1            9.77          9.86
      2            10.48         5.72

  • therefore the parallel overhead is 10.48/9.77 - 1 ≈ 7.3%, and the speedup is S(2) = 9.86/5.72 ≈ 1.72
• Use SpeedShop to obtain profiling information for an MP program:
  • setenv OMP_NUM_THREADS n
  • ssrun -exp pcsamp a.out
  • each thread will create a separate (binary) profiling file
  • foreach file (a.out.pcsamp.*)
      prof $file
    end
MP SpeedShop Example Output
• Slave execution:
      103  1s    ( 20.6)  1s    ( 20.6)  divfree (./CWIROS:DIVFREE.f)
       96  0.96s ( 19.2)  2s    ( 39.7)  ROWKAP (./CWIROS:KAPNEW.f)
       67  0.67s ( 13.4)  2.7s  ( 53.1)  ITER (./CWIROS:CHEMIE.f)
       40  0.4s  (  8.0)  3.1s  ( 61.1)  FLUXY (./CWIROS:KAPNEW.f)
       35  0.35s (  7.0)  3.4s  ( 68.1)  __expf (.. fexp.c)
       30  0.3s  (  6.0)  3.7s  ( 74.1)  __mp_barrier_nthreads (.. libmp.so)
       26  0.26s (  5.2)  4s    ( 79.2)  FLUXX (./CWIROS:KAPNEW.f)
       24  0.24s (  4.8)  4.2s  ( 84.0)  KAPPA (./CWIROS:KAPNEW.f)
       23  0.23s (  4.6)  4.4s  ( 88.6)  __mp_slave_wait_for_work (.. libmp.so)
       11  0.11s (  2.2)  4.5s  ( 90.8)  TWOSTEP (./CWIROS:CHEMIE.f)
• Master execution:
      124  1.2s  ( 24.0)  1.2s  ( 24.0)  divfree (./CWIROS:DIVFREE.f)
       93  0.93s ( 18.0)  2.2s  ( 42.1)  ROWKAP (./CWIROS:KAPNEW.f)
       76  0.76s ( 14.7)  2.9s  ( 56.8)  ITER (./CWIROS:CHEMIE.f)
       36  0.36s (  7.0)  3.3s  ( 63.8)  FLUXY (./CWIROS:KAPNEW.f)
       27  0.27s (  5.2)  3.6s  ( 69.0)  __expf (.. libm/fexp.c)
       23  0.23s (  4.5)  3.8s  ( 73.4)  FLUXX (./CWIROS:KAPNEW.f)
       22  0.22s (  4.3)  4s    ( 77.7)  KAPPA (./CWIROS:KAPNEW.f)
       15  0.15s (  2.9)  4.2s  ( 80.6)  __mp_barrier_master (.. libmp.so)
       15  0.15s (  2.9)  4.3s  ( 83.5)  diff (./CWIROS:DIFF.f)
        9  0.09s (  1.7)  4.4s  ( 85.3)  TWOSTEP (./CWIROS:CHEMIE.f)
• compare the times (seconds) for each important subroutine to determine the load balance
• evaluate the contribution of libmp: pure overhead