Auto-Parallelizing Option
John Matrow, M.S., System Administrator/Trainer
WSU High Performance Computing Center (HiPeCC)
Outline • Compiler * Options * Output • Incomplete Optimization * Fails to detect that a loop is safe to parallelize * Parallelizes the wrong loop * Unnecessarily parallelizes a loop • Strategies for Assisting APO
Auto-Parallelizing Option (APO) The MIPSpro Auto-Parallelizing Option (APO) from SGI automatically detects and exploits parallelism in Fortran 77, Fortran 90, C, and C++ programs.
SGI MIPSpro compilers • APO • IPA (interprocedural analysis) • LNO (loop nest optimization)
Syntax • f77/cc: -apo[{list|keep}] [-mplist] [-On] • f90/CC: -apo[{list|keep}] [-On]
Syntax • -apo list: produce a .l file, a listing of those parts of the program that can run in parallel and those that cannot • -apo keep: produce .l, .w2c.c, .m, and .anl files; do not use with -mplist • -mplist: generate the equivalent parallelized program, for f77 in a .w2f.f file or for C in a .w2c.c file • -On: optimization level; -O3 (aggressive) is recommended
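As a concrete illustration, a hypothetical command line (assuming a Fortran 77 source file named sub.f) that compiles at the recommended optimization level and requests a parallelization listing:

      f77 -apo list -O3 -c sub.f

This produces sub.o plus a .l file describing which loops were parallelized and which were not.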
Link If you link separately, you must have one of the following on the command line: • The -apo flag • The -mp option
Interprocedural Analysis (IPA) • Procedure inlining • Identification of global constants • Dead function elimination • Dead variable elimination • Dead call elimination • Interprocedural alias analysis • Interprocedural constant propagation
Loop Nest Optimization (LNO) • Loop interchange • Loop fusion • Loop fission • Cache blocking and outer loop unrolling • LNO runs when you use the -O3 option
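A minimal sketch of one LNO transformation, loop interchange; the array A and its bounds are illustrative, not taken from the course material:

C Fortran stores arrays column-major, so consecutive values of the
C first subscript are adjacent in memory.
C Before interchange: the inner J loop walks A(I,J) with stride N.
      DO I = 1, N
         DO J = 1, N
            A(I,J) = 0.0
         END DO
      END DO
C After interchange: the inner I loop walks A(I,J) with unit stride.
      DO J = 1, N
         DO I = 1, N
            A(I,J) = 0.0
         END DO
      END DO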
Sample source

      SUBROUTINE sub(arr, n)
      REAL*8 arr(n)
      DO i = 1, n
         arr(i) = arr(i) + arr(i-1)
      END DO
      DO i = 1, n
         arr(i) = arr(i) + 7.0
         CALL foo(a)
      END DO
      DO i = 1, n
         arr(i) = arr(i) + 7.0
      END DO
      END
Sample APO listing

Parallelization log for Subprogram sub_
3: Not Parallel
   Array dependence from arr on line 4 to arr on line 4.
6: Not Parallel
   Call foo on line 8
10: PARALLEL (Auto) _mpdo_sub_1
Sample source listing

C PARALLEL DO will be converted to SUBROUTINE _mpdo_sub_1
C$OMP PARALLEL DO private(i), shared(a)
      DO I = 1, 10000, 1
         a(I) = 0.0
      END DO
Running Your Program • Environment variable used to specify the number of threads: OMP_NUM_THREADS • Example: setenv OMP_NUM_THREADS 4
Running Your Program • Environment variable that controls whether the run-time system may adjust the number of threads dynamically (as available): OMP_DYNAMIC • Default: TRUE • Example (disabling the default behavior): setenv OMP_DYNAMIC FALSE
Incomplete Optimization • Fails to detect that a loop is safe to parallelize • Parallelizes the wrong loop • Unnecessarily parallelizes a loop
Failing to Parallelize Safe Loops Does NOT parallelize loops containing: • Data dependencies* • Function calls • GO TO statements* • Problematic array subscripts • Conditionally assigned temporary nonlocal variables • Unanalyzable pointer usage (C/C++) (* not discussed here)
Function Calls You can tell APO to ignore dependencies of function calls by using • Fortran: C*$* ASSERT CONCURRENT CALL • C/C++: #pragma concurrent call
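A minimal sketch of the Fortran form; the routine name safe_work and the loop body are hypothetical, and the assertion is only correct if the routine really has no cross-iteration reads or writes of shared data:

C*$* ASSERT CONCURRENT CALL
      DO i = 1, n
C Each call touches only its own element, so iterations are independent
         CALL safe_work(arr(i))
      END DO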
Problematic Array Subscripts Too complicated: • Indirect array references: A(IB(I)) = . . . • Unanalyzable subscripts; allowable subscript elements are literal constants, variables, and their products, sums, and differences • Subscripts that rely on hidden knowledge: A(I) = A(I+M)
Conditionally Assigned Temporary Nonlocal Variables

      SUBROUTINE S1(A, B)
      COMMON T
      DO I = 1, N
         IF (B(I)) THEN
            T = . . .
            A(I) = A(I) + T
         END IF
      END DO
      CALL S2()
      END

Because T lives in COMMON and may be read by S2 after the loop, its final value matters; but T is assigned only on some iterations, so APO cannot privatize it and must keep the loop serial.
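One possible fix, sketched under the assumption that S2 does not actually read T: make T an ordinary local variable so it can be privatized. The assignment to T is a hypothetical stand-in for the elided one.

      SUBROUTINE S1(A, B, N)
      INTEGER N, I
      REAL*8 A(N), T
      LOGICAL B(N)
C T is now local, so each parallel iteration can use a private copy
      DO I = 1, N
         IF (B(I)) THEN
            T = 2.0D0 * A(I)
            A(I) = A(I) + T
         END IF
      END DO
      CALL S2()
      END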
Unanalyzable Pointer Usage (C/C++) • Arbitrary pointer dereferences • Arrays of arrays: use p[n][n] instead of **p • Loops bounded by pointer comparisons • Aliased parameter information: use the __restrict type qualifier to say arrays do not overlap
Parallelizing the Wrong Loop • Inner Loops • Small Trip Counts • Poor Data Locality
Inner Loops • APO tries to parallelize the outermost loop, after possibly interchanging loops to make a more promising one outermost • If the outermost loop cannot be parallelized, APO parallelizes an inner loop if possible • An inner loop usually ends up parallelized because the outer loop hit one of the "Failing to Parallelize Safe Loops" problems discussed earlier • It is usually advantageous to modify the code so that the outermost loop is the one parallelized
Small Trip Counts • Loops with small trip counts generally run faster when they are not parallelized • Use an assertion: C*$* ASSERT DO PREFER (Fortran) or #pragma prefer (C/C++), as in the sketch below • Or use manual parallelization directives
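A minimal sketch of the Fortran assertion, assuming a loop whose trip count is known to be tiny (the array name and loop body are hypothetical):

C*$* ASSERT DO PREFER (SERIAL)
      DO i = 1, 4
         a(i) = a(i) * 2.0
      END DO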
Poor Data Locality

      DO I = 1, N
         . . . A(I) . . .
      END DO
      DO I = N, 1, -1
         . . . A(I) . . .
      END DO

The second loop traverses A in the opposite direction, so with the usual static scheduling the processor that works on A(I) in the second loop is not the one whose cache already holds it from the first loop.
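If the algorithm does not actually require the reversal (an assumption), running both loops in the same direction lets each processor reuse the data it touched in the first loop; the concrete bodies below are hypothetical stand-ins for the elided ones:

      DO I = 1, N
         A(I) = A(I) + 1.0
      END DO
      DO I = 1, N
         A(I) = A(I) * 0.5
      END DO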
Poor Data Locality

      DO I = 1, N
         DO J = 1, N
            A(I,J) = B(J,I) + . . .
         END DO
      END DO
      DO I = 1, N
         DO J = 1, N
            B(I,J) = A(J,I) + . . .
         END DO
      END DO

Here the first nest writes A by rows while the second nest reads A by columns (and similarly for B), so the data each processor needs in the second nest mostly sits in other processors' caches.
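One possible restructuring, assuming the outer loop is the one parallelized in both nests: interchange the second nest so that each processor's outer-index range touches the same rows of A (and columns of B) it touched in the first nest. This is a sketch, not compiler output; the "+ 1.0" bodies stand in for the elided expressions.

C First nest: the processor owning outer index I writes row I of A
      DO I = 1, N
         DO J = 1, N
            A(I,J) = B(J,I) + 1.0
         END DO
      END DO
C Interchanged second nest: the processor owning outer index J reads
C row J of A, the row it wrote locally in the first nest
      DO J = 1, N
         DO I = 1, N
            B(I,J) = A(J,I) + 1.0
         END DO
      END DO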
Incurring Unnecessary Parallelization Overhead • Unknown Trip Counts • Nested parallelism
Unknown Trip Counts • If the trip count is not known (and sometimes even if it is), APO parallelizes the loop conditionally • It generates code for both a parallel and a sequential version • APO can avoid running in parallel if the loop turns out to have a small trip count • The run-time choice also considers the number of processors available, the parallelization overhead, and the amount of work inside the loop
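Conceptually, the generated code behaves like the sketch below. This is an illustration using an OpenMP IF clause with an arbitrary threshold, not actual APO output; Example 1 later shows a real generated test.

C Run in parallel only when the work plausibly exceeds the overhead;
C the threshold 1000 and the loop body are purely illustrative
C$OMP PARALLEL DO if(N .GT. 1000), private(I), shared(A, N)
      DO I = 1, N
         A(I) = A(I) + 1.0
      END DO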
Nested Parallelism

      SUBROUTINE CALLER
      DO I = 1, N
         CALL SUB
      END DO
      END

      SUBROUTINE SUB
      DO I = 1, N
         . . .
      END DO
      END

If the loop in CALLER is parallelized, parallelizing the loop in SUB as well only adds overhead.
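One way to avoid the nested overhead, sketched using a directive from the next section: leave CALLER's loop eligible for parallelization and force SUB's loop to stay serial. The argument list and loop body here are hypothetical.

      SUBROUTINE SUB(A, N)
      INTEGER N, I
      REAL*8 A(N)
C*$* ASSERT DO (SERIAL)
      DO I = 1, N
         A(I) = A(I) + 1.0
      END DO
      END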
Strategies for Assisting APO • Modify code to avoid coding practices that will not analyze well • Manual parallelization options [OpenMP] • Use APO directives to give APO more information about code
Compiler Directives for Automatic Parallelization • C*$* [NO] CONCURRENTIZE • C*$* ASSERT DO (CONCURRENT|SERIAL) • C*$* ASSERT CONCURRENT CALL • C*$* ASSERT PERMUTATION (array_name) • C*$* ASSERT DO PREFER (CONCURRENT|SERIAL)
Compiler Directives • The following affect compilation even if -apo is not specified: C*$* ASSERT DO (CONCURRENT), C*$* ASSERT CONCURRENT CALL, C*$* ASSERT PERMUTATION • -LNO:ignore_pragmas causes APO to ignore all directives, assertions, and pragmas
C*$* NO CONCURRENTIZE • Place inside a subroutine to prevent its parallelization • Place outside any subroutine to affect all subroutines in the file • C*$* CONCURRENTIZE placed inside a subroutine overrides a C*$* NO CONCURRENTIZE placed outside of it
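A sketch of the override pattern; the subroutine names and loop bodies are hypothetical:

C*$* NO CONCURRENTIZE
C File level: by default, do not parallelize routines in this file

      SUBROUTINE stays_serial(A, N)
      INTEGER N, I
      REAL*8 A(N)
      DO I = 1, N
         A(I) = 0.0D0
      END DO
      END

      SUBROUTINE may_go_parallel(A, N)
C*$* CONCURRENTIZE
C Routine-level override: this routine may be parallelized
      INTEGER N, I
      REAL*8 A(N)
      DO I = 1, N
         A(I) = 1.0D0
      END DO
      END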
C*$* ASSERT DO (CONCURRENT) • Tells APO to ignore array dependencies • Applying it to an inner loop may cause that loop to be made outermost by loop interchange • Does not affect CALL • Ignored if obvious real dependencies are found • If multiple loops can be parallelized, it causes APO to prefer the loop immediately following the assertion
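A minimal sketch reusing the "hidden knowledge" case from the Problematic Array Subscripts slide, assuming the programmer knows M is always at least N, so reads and writes never overlap:

C*$* ASSERT DO (CONCURRENT)
      DO I = 1, N
C Safe only because M .GE. N, which APO cannot see on its own
         A(I) = A(I+M)
      END DO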
C*$* ASSERT DO (SERIAL) • Do not parallelize the loop following the assertion • APO may parallelize another loop in the same nest • The parallelized loop may be either inside or outside the designated sequential loop
C*$* ASSERT CONCURRENT CALL • Applies to the loop that immediately follows it and to all loops nested inside that loop • A subroutine inside the loop cannot read from a location that is written to during another iteration (shared) • A subroutine inside the loop cannot write to a location that is read from or written to during another iteration (shared)
C*$* ASSERT PERMUTATION • C*$* ASSERT PERMUTATION (array_name) tells APO that array_name is a permutation array: every element of the array has a distinct value • The array can thus be used for indirect addressing • Affects every loop in the subroutine, even those appearing ahead of it
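A minimal sketch using the indirect-reference example from earlier (array names hypothetical); the assertion is only safe if IB really contains no repeated values:

C*$* ASSERT PERMUTATION (IB)
      DO I = 1, N
C No two iterations write the same element of A, because IB never
C repeats a value
         A(IB(I)) = A(IB(I)) + B(I)
      END DO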
C*$* ASSERT DO PREFER • C*$* ASSERT DO PREFER (CONCURRENT) instructs APO to parallelize the following loop if it is safe to do so • With nested loops, if it is not safe, APO uses heuristics to choose among loops that are safe • If applied to an inner loop, APO may make it the outer loop • If applied to multiple loops, APO uses heuristics to choose one of the specified loops
C*$* ASSERT DO PREFER • C*$* ASSERT DO PREFER (SERIAL) is essentially the same as C*$* ASSERT DO (SERIAL) • Used in cases with small trip counts • Used in cases with poor data locality
Example 1: AddOpac.f

      do nd=1,ndust
         if( lgDustOn1(nd) ) then
            do i=1,nupper
               dstab(i) = dstab(i) + dstab1(i,nd) * dstab3(nd)
               dstsc(i) = dstsc(i) + dstsc1(i,nd) * dstsc2(nd)
            end do
         endif
      end do

408: Not Parallel
     Array dependence from DSTAB on line 412 to DSTAB on line 412.
     Array dependence from DSTSC on line 413 to DSTSC on line 413.
Example 1: AddOpac.f C*$* ASSERT DO (CONCURRENT) before the outer DO resulted in:

      DO ND = 1, 20, 1
        IF(LGDUSTON3(ND)) THEN
C PARALLEL DO will be converted to SUBROUTINE __mpdo_addopac_10
C$OMP PARALLEL DO if(((DBLE(__mp_sug_numthreads_func$()) *((DBLE(
C$&   __mp_sug_numthreads_func$()) * 1.23D+02) + 2.6D+03)) .LT.((DBLE(
C$&   NUPPER0) * DBLE((__mp_sug_numthreads_func$() + -1))) * 6.0D00))),
C$&   private(I6), shared(DSTAB2, DSTABUND0, DSTAB3, DSTSC2, DSTSC3, ND,
C$&   NUPPER0)
          DO I6 = 1, NUPPER0, 1
            DSTAB2(I6) = (DSTAB2(I6) +(DSTABUND0(ND) * DSTAB3(I6, ND)))
            DSTSC2(I6) = (DSTSC2(I6) +(DSTABUND0(ND) * DSTSC3(I6, ND)))
          END DO
        ENDIF
      END DO
Example 2: BiDiag.f

135: Not Parallel
     Array dependence from DESTROY on line 166 to DESTROY on line 137.
     Array dependence from DESTROY on line 166 to DESTROY on line 144.
     Array dependence from DESTROY on line 174 to DESTROY on line 166.
     Array dependence from DESTROY on line 166 to DESTROY on line 166.
     Array dependence from DESTROY on line 144 to DESTROY on line 166.
     Array dependence from DESTROY on line 137 to DESTROY on line 166.
     Array dependence from DESTROY on line 166 to DESTROY on line 174.
     <more of same>
Example 2: BiDiag.f

C$OMP PARALLEL DO PRIVATE(ns, nej, nelec, max, ratio)
      do i=IonLow(nelem),IonHigh(nelem)-1
         . . .
C$OMP CRITICAL
         destroy(nelem,max) = destroy(nelem,max) +
     1      PhotoRate(nelem,i,ns,1) *
     1      vyield(nelem,i,ns,nej) * ratio
C$OMP END CRITICAL
Example 3: ContRate.f

78: Not Parallel
    Scalar dependence on XMAXSUB.
    Scalar XMAXSUB without unique last value.
    Scalar FREQSUB without unique last value.
    Scalar OPACSUB without unique last value.

Solution: same as the previous example
Exercises • Copy ~jmatrow/openmp/apo*.f • Compile and examine the .list file • Each program requires one change: apo1.f needs an assertion; apo2.f needs an OpenMP directive