An Evaluation of Auto-Scoping in OpenMP
Michael Voss, Eric Chiu, Patrick Chow, Catherine Wong and Kevin Yuen
ECE Department, University of Toronto

An Overview of Auto-scoping
• Dieter an Mey proposed auto-scoping as an extension to OpenMP (www.cOMPunity.org)
• Relieves users of the burden of explicit scoping
  • error prone
  • tedious
• A compromise between explicit and automatic parallelization
  • analysis is similar to automatic parallelization
  • successful in 1 of 2 scientific programs
WOMPAT 2004

C$OMP PARALLEL DO SHARED(A,B)
C$OMP&PRIVATE(I,J)
      DO I = 1,100
        DO J = 1,100
          A(I,J) = A(J,I) + B(I,J)
        ENDDO
      ENDDO
C$OMP END PARALLEL DO

Using DEFAULT(AUTO)
C$OMP PARALLEL DO
C$OMP&DEFAULT(AUTO)
      DO I = 1,100
        DO J = 1,100
          A(I,J) = A(J,I) + B(I,J)
        ENDDO
      ENDDO
C$OMP END PARALLEL DO

Outline of Talk
• Introduction
• Implementing DEFAULT(AUTO) in Polaris
• An evaluation of DEFAULT(AUTO) in Polaris
  • comparison with EA Sun Studio 9 F95 compiler
• A discussion of runtime support
• Related Work
• Conclusion

Implementing DEFAULT(AUTO) in Polaris
• Polaris is an auto-parallelizer for Fortran 77
• Supports a range of advanced techniques
  • The Range Test
  • The Omega Test
  • Array and Scalar Privatization
  • Array and Scalar Reduction Recognition
  • Induction Variable Substitution
  • Interprocedural Constant Propagation
  • Most Interprocedural Optimization by Inlining

Polaris as an OMP to OMP Translator
[Figure: Polaris pipeline diagram. Stages shown: Polaris Parser; DDtest pass, Reduction pass, Privatization pass, …; OpenMP Backend; Moerae Backend. Inputs and outputs shown: Fortran 77, Fortran 77 + OpenMP, Fortran 77 + Moerae calls. Legend: original automatic parallelization path; OpenMP to explicitly threaded code path; new OpenMP to OpenMP path.]

Supporting DEFAULT(AUTO)
• Parse DEFAULT(AUTO)
• React appropriately to user directives
  • selective loop parallelization
  • no changes without AUTO directive
  • user scoping overrides Polaris scoping
  • can parallelize loops that cannot be fully auto-scoped
• Limitations
  • only regions with PARALLEL DO semantics
  • bails out on general parallel regions

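The directive-handling rules on this slide can be read as a small merge step between user clauses and compiler analysis. The Python sketch below is illustrative only: the function name `resolve_scopes`, the dictionary layout, and the use of `None` for "analysis could not scope this variable" are assumptions made for the example, not Polaris internals.

```python
def resolve_scopes(variables, user_scopes, analysis_scopes, default_auto):
    """Combine user clauses with compiler analysis for one parallel region.

    Hypothetical helper, not Polaris code. Returns (scopes, unresolved),
    or None when the region has no DEFAULT(AUTO) and is left unchanged.
    """
    if not default_auto:
        return None  # rule: no changes without the AUTO directive

    scopes, unresolved = {}, []
    for var in variables:
        if var in user_scopes:
            scopes[var] = user_scopes[var]        # rule: user scoping overrides
        elif analysis_scopes.get(var) is not None:
            scopes[var] = analysis_scopes[var]    # scope derived by analysis
        else:
            unresolved.append(var)                # analysis gave up on var
    return scopes, unresolved


# A region where analysis fails on D, but the user supplied SHARED(D).
variables = ["A", "D", "T", "I"]
user = {"D": "SHARED"}
analysis = {"A": "SHARED", "D": None, "T": "PRIVATE", "I": "PRIVATE"}

scopes, unresolved = resolve_scopes(variables, user, analysis, True)
untouched = resolve_scopes(variables, user, analysis, False)
print(scopes, unresolved, untouched)
```

In this sketch a region is still fully scoped when the user fills in exactly the variables the analysis could not handle, which is one way to read "can parallelize loops that cannot be fully auto-scoped".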
Example 1: No explicit scoping (input with DEFAULT(AUTO))
!$OMP PARALLEL DEFAULT(AUTO)
      DO N = 1,7
        DO M = 1,7
!$OMP DO
          DO L = LSS(itsub),LEE(itsub)
            I = IG(L)
            J = JG(L)
            K = KG(L)
            LIJK = L2IJK(L)
            RHS(L,M) = RHS(L,M)
     +        - FJAC(LIJK,LM00,M,N)*DQCO(i-1,j,k,n,NB)*FM00(L)
     +        - FJAC(LIJK,LP00,M,N)*DQCO(i+1,j,k,n,NB)*FP00(L)
     +        - FJAC(LIJK,L0M0,M,N)*DQCO(i,j-1,k,n,NB)*F0M0(L)
     +        - FJAC(LIJK,L0P0,M,N)*DQCO(i,j+1,k,n,NB)*F0P0(L)
          ENDDO
!$OMP END DO NOWAIT
        ENDDO
      ENDDO
!$OMP END PARALLEL

Example 1: No explicit scoping (Polaris output)
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(M,L,N)
      DO n = 1, 7, 1
       DO m = 1, 7, 1
!$OMP DO
        DO l = lss(itsub), lee(itsub), 1
         rhs(l, m) = rhs(l, m)+(-dqco(ig(l), (-1)+jg(l), kg(l), n, nb))*
     *f0m0(l)*fjac(l2ijk(l), l0m0, m, n)+(-dqco(ig(l), 1+jg(l), kg(l), n
     *, nb))*f0p0(l)*fjac(l2ijk(l), l0p0, m, n)+(-dqco((-1)+ig(l), jg(l)
     *, kg(l), n, nb))*fjac(l2ijk(l), lm00, m, n)*fm00(l)+(-dqco(1+ig(l)
     *, jg(l), kg(l), n, nb))*fjac(l2ijk(l), lp00, m, n)*fp00(l)
        ENDDO
!$OMP END DO NOWAIT
       ENDDO
      ENDDO
!$OMP END PARALLEL

Example 2: Explicit scoping (input with SHARED(D) and DEFAULT(AUTO))
      SUBROUTINE RECURSION(n,k,a,b,c,d,e,f,g,h,s)
      REAL*8 A(*),B(*),C(*),D(*),E(*),F(*),G(*),H(*)
      REAL*8 T,S
      INTEGER N,K,I
      S = 0.0D0
C$OMP PARALLEL SHARED(D)
C$OMP+DEFAULT(AUTO)
C$OMP DO
      DO I = 1,N
        T = F(I) + G(I)
        A(I) = B(I) + C(I)
        D(I+K) = D(I) + E(I)
        H(I) = H(I) * T
        S = S + H(I)
      END DO
C$OMP END DO
C$OMP END PARALLEL
      END

Example 2: Explicit scoping (Polaris output)
      SUBROUTINE recursion(n, k, a, b, c, d, e, f, g, h, s)
      DOUBLE PRECISION a, b, c, d, e, f, g, h, s, t
      INTEGER*4 i, k, n
      DIMENSION a(*), b(*), c(*), d(*), e(*), f(*), g(*), h(*)
      s = 0.0D0
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(T,I)
!$OMP DO
!$OMP+REDUCTION(+:s)
      DO i = 1, n, 1
       t = f(i)+g(i)
       a(i) = b(i)+c(i)
       d(i+k) = d(i)+e(i)
       h(i) = h(i)*t
       s = h(i)+s
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
      RETURN
      END

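The scoping Polaris emits for Example 2 (PRIVATE(T,I), REDUCTION(+:s), everything else shared) can be mimicked with a toy classifier. The Python sketch below is a deliberately simplified stand-in, not the Polaris analysis: the statement encoding, the function name `classify`, and the rules are assumptions for illustration. It handles only straight-line loop bodies, treats subscripts as opaque strings (so K is not tracked), and ignores whether a private scalar is live after the loop.

```python
def classify(body, index):
    """Toy scoping of one loop body; returns {name: scope or None}."""
    scopes = {index: "PRIVATE"}          # the loop index is private
    first_access, written, stmts_touching = {}, set(), {}
    array_subs, written_arrays = {}, set()

    for n, (lhs, op, rhs) in enumerate(body):
        # Reads on the right-hand side happen before the write to lhs.
        accesses = [(ref, "read") for ref in rhs] + [(lhs, "write")]
        for (name, sub), kind in accesses:
            if sub is None:                       # scalar reference
                first_access.setdefault(name, kind)
                stmts_touching.setdefault(name, set()).add(n)
                if kind == "write":
                    written.add(name)
            else:                                 # array reference
                array_subs.setdefault(name, set()).add(sub)
                if kind == "write":
                    written_arrays.add(name)

    for name, kind in first_access.items():
        if name not in written:
            scopes[name] = "SHARED"               # read-only scalar
        elif kind == "write":
            scopes[name] = "PRIVATE"              # defined before any use
        else:
            # Read then written: accept only the form  s = s op expr
            # in a single statement with an associative operator.
            only = stmts_touching[name]
            ok = len(only) == 1 and body[next(iter(only))][1] in ("+", "*")
            scopes[name] = "REDUCTION" if ok else None

    for name, subs in array_subs.items():
        # Read-only arrays, or arrays touched only at the loop index,
        # are safe to share; anything else is left unresolved.
        if name not in written_arrays or subs == {index}:
            scopes[name] = "SHARED"
        else:
            scopes[name] = None
    return scopes


# Loop body of Example 2 as (lhs, operator, rhs references);
# a reference is (name, subscript), with subscript None for scalars.
I = "I"
body = [
    (("T", None),  "+", [("F", I), ("G", I)]),
    (("A", I),     "+", [("B", I), ("C", I)]),
    (("D", "I+K"), "+", [("D", I), ("E", I)]),
    (("H", I),     "*", [("H", I), ("T", None)]),
    (("S", None),  "+", [("S", None), ("H", I)]),
]
scopes = classify(body, "I")
print(scopes)
```

Under these toy rules D is the only variable left unresolved, because it is written at `I+K` and read at `I`. That is at least consistent with the input slide, where D is the one variable the user scopes explicitly.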
Evaluation of DEFAULT(AUTO)
• Fortran 77 benchmarks from SPEC OpenMP
  • removed all explicit scoping
  • added DEFAULT(AUTO) to all regions
  • used Omni OpenMP compiler as backend (-O2)
• Explicit speedup vs. auto-scope speedup
  • four-processor Xeon server
  • 1.8 GHz processors, 16 GBytes main memory
  • Hyperthreaded, but only used 1 thread per CPU
• Also used EA Sun Studio 9 Fortran 95 compiler
  • supports DEFAULT(__AUTO)
  • report number of regions auto-scoped

Performance of Auto-scoping
Sun results are for the Early Access version of the Sun Microsystems Studio 9 Fortran 95 compiler.

Discussion
• Many regions were not fully analyzable
  • Polaris could not fully inline the regions
  • several regions were general parallel regions
• Early Access Sun Studio 9 compiler
  • auto-scoped fewer regions in general
  • missed important regions in Swim and Mgrid
  • regions could be parallelized but not auto-scoped
• Sun compiler could auto-scope some regions that Polaris could not
  • can analyze general parallel regions

A general parallel region from Wupwise
Polaris fails but the Sun compiler succeeds
C$OMP PARALLEL DEFAULT(AUTO)
      LSCALE = ZERO
      LSSQ = ONE
C$OMP DO
      DO IX = 1, 1 + (N - 1) *INCX, INCX
        IF (DBLE (X(IX)) .NE. ZERO) THEN
          ...
          LSSQ = ONE + LSSQ* (LSCALE / TEMP) ** 2
          LSCALE = TEMP
        END IF
        ...
      END DO
C$OMP END DO
C$OMP CRITICAL
      IF (SCALE .LT. LSCALE) THEN
        SSQ = ((SCALE / LSCALE) ** 2) * SSQ + LSSQ
        SCALE = LSCALE
      ELSE
        SSQ = SSQ + ((LSCALE / SCALE) ** 2) * LSSQ
      END IF
C$OMP END CRITICAL
C$OMP END PARALLEL

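The pattern in this region (private LSCALE and LSSQ accumulated per thread, then folded into shared SCALE and SSQ inside the critical section) can be checked with a small sequential simulation. The Python sketch below is an illustration under stated assumptions: the loop body is elided on the slide, so the `else` branch of the per-element update is the usual scaled sum-of-squares step rather than the benchmark's code, the input is real-valued, the shared pair is assumed to start at (0, 1), and "threads" are simply chunks processed one after another.

```python
import math


def local_scaled_ssq(chunk):
    """Per-thread work: returns the private pair (lscale, lssq)."""
    lscale, lssq = 0.0, 1.0                  # LSCALE = ZERO, LSSQ = ONE
    for x in chunk:
        if x != 0.0:
            temp = abs(x)
            if lscale < temp:
                # This branch matches the two statements visible on the slide.
                lssq = 1.0 + lssq * (lscale / temp) ** 2
                lscale = temp
            else:
                # Assumed: the slide elides this part of the loop body.
                lssq += (temp / lscale) ** 2
    return lscale, lssq


def combine(scale, ssq, lscale, lssq):
    """Transcription of the CRITICAL section: fold one private pair in."""
    if scale < lscale:
        ssq = ((scale / lscale) ** 2) * ssq + lssq
        scale = lscale
    else:
        ssq = ssq + ((lscale / scale) ** 2) * lssq
    return scale, ssq


# Three simulated threads, each with a non-zero chunk of the data.
x = [3.0, -4.0, 12.0, 1.0, 2.0, -2.0]
chunks = [x[0:2], x[2:4], x[4:6]]

scale, ssq = 0.0, 1.0                        # assumed initial shared values
for chunk in chunks:
    lscale, lssq = local_scaled_ssq(chunk)
    scale, ssq = combine(scale, ssq, lscale, lssq)

norm = scale * math.sqrt(ssq)
print(norm, math.sqrt(sum(v * v for v in x)))
```

Each pair represents the quantity scale² · ssq, so the combine step adds two partial sums of squares without forming the squares directly. Scoping this region requires seeing that LSCALE and LSSQ are written before the work-shared loop by every thread, while SCALE and SSQ are only updated under mutual exclusion, which goes beyond PARALLEL DO semantics.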
Runtime Support for Auto-scoping
• add speculate directive for regions that cannot be auto-scoped
  • applies to very few regions in SPEC OpenMP
  • requires interprocedural marking of reads/writes
  • only 2 regions not auto-scoped can be fully analyzed
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(U51K,U41K,U31K,Q,U21K,M,K,I,U41,U31KM1,U51KM1,U21KM1)
!$OMP+PRIVATE(U41KM1,TMP,J)
!$OMP+SPECULATE(UTMP,RTMP)
!$OMP DO
!$OMP+LASTPRIVATE(FLUX2)
      DO j = jst, jend, 1
        ...
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
(a region from the RHS subroutine of Applu)

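The "marking of reads/writes" that a SPECULATE clause would rely on can be pictured with a shadow structure that records which iteration touches each array element. The Python sketch below is a generic simplification in the spirit of runtime dependence testing, not the runtime proposed in the talk: the names `MarkedArray` and `speculate` are hypothetical, iterations run sequentially, and there is no rollback or privatization.

```python
class MarkedArray:
    """Array wrapper that records which iteration reads or writes each element."""

    def __init__(self, data):
        self.data = list(data)
        self.readers = [set() for _ in self.data]
        self.writers = [set() for _ in self.data]
        self.iteration = None            # set by the driver before each iteration

    def __getitem__(self, i):
        self.readers[i].add(self.iteration)
        return self.data[i]

    def __setitem__(self, i, value):
        self.writers[i].add(self.iteration)
        self.data[i] = value

    def independent(self):
        """True if no element written by one iteration is touched by another."""
        for r, w in zip(self.readers, self.writers):
            if w and len(r | w) > 1:
                return False
        return True


def speculate(body, array, iterations):
    """Run body(i, array) per iteration, then inspect the recorded marks."""
    for i in iterations:
        array.iteration = i
        body(i, array)
    return array.independent()


def elementwise(i, a):
    a[i] = a[i] + 1.0                    # each iteration touches only a[i]


def recurrence(i, a):
    a[i + 1] = a[i] + 1.0                # iteration i+1 reads what i wrote


ok = speculate(elementwise, MarkedArray([0.0] * 8), range(8))
bad = speculate(recurrence, MarkedArray([0.0] * 8), range(7))
print(ok, bad)
```

A real speculative runtime would execute the iterations in parallel and fall back to serial execution when the marks reveal a conflict. The sketch only shows why every access to the speculated variables has to be instrumented, including accesses inside called subroutines.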
Related Work
• DEFAULT(AUTO) proposed by Dieter an Mey
• Many commercial and research auto-parallelizers
  • Polaris, SUIF, CAPO, …
  • Perform parallelization and scoping
• The EA Sun Studio 9 Fortran 95 Compiler
  • paper also here at WOMPAT
  • thanks to Yuan Lin for pointing me to it
• Runtime dependence testing
  • Saltz, Rauchwerger, …

Conclusion
• Implemented DEFAULT(AUTO) in Polaris
  • created full OpenMP to OpenMP translator
  • added facilities for auto-scoping
• Evaluated implementation
  • 2 of 5 benchmarks fully auto-scoped
  • remainder showed significant loss of speedup
  • results different from EA Sun compiler
  • performance not portable across compilers
• Discussed speculative parallelization support

Conclusion (cont.)
• Combination of loop and region analyzer
  • Polaris auto-scoped more regions
  • Sun compiler can handle general regions
• Performance may not be portable across compilers
  • never is but…
  • sacrifice performance for convenience
  • perhaps a useful tool during manual parallelization
• Future work
  • general region support in Polaris