The Portland Group, Inc.
Brent Leback • brent.leback@pgroup.com • www.pgroup.com
HPC User Forum, Broomfield, CO • September 2009
High Level Languages for Clusters
• Many failures in this area, academically and commercially
  • Lack of Supply?
  • Lack of Standards?
  • Bad/Buggy Implementations?
  • Lack of Generality?
  • Lack of Performance?
• CAF is headed for the Fortran Standard (?) (!) (see the coarray sketch below)
  • Is it a good idea?
  • Is it mature enough to standardize?
  • Will anyone in attendance use it?
• Given our experience with HPF, PGI will be conservative on this front
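For readers who have not seen CAF, here is a minimal sketch of the coarray style under discussion; the program name, variable names, and printed text are purely illustrative and are not taken from any PGI material.

  program caf_sketch
    implicit none
    integer :: val[*]                  ! a coarray: one copy per image (process)
    val = this_image()                 ! each image stores its own image index
    sync all                           ! barrier across all images
    if (this_image() == 1 .and. num_images() > 1) then
       ! one-sided remote read of the copy held by image 2
       print *, 'image 1 reads the value held by image 2:', val[2]
    end if
  end program caf_sketch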
Performance Across Platforms: PGI Unified Binary
• PGI Unified Binary has been available since 2005
  • A single x64 binary includes optimized code sequences for multiple target processor cores
  • The -tp switch specifies the target processor type; a number of AMD and Intel processor families are currently supported
  • Especially important to ISVs
  • AVX support is in progress
• PGI Unified Binary now also supports combined accelerated/non-accelerated binaries
  • A single x64 binary recognizes the presence of a GPU and runs the PGI-accelerated versions there if available
  • The -ta switch specifies the target accelerator; currently only -ta=nvidia is supported
  • Use -ta=nvidia,host to generate code for both cases
• The target-processor and target-accelerator switches can be used together (an illustrative build sketch follows this slide); today, Intel 64, AMD64, and NVIDIA is the full gamut
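As an illustration of how these switches combine, here is a hedged build sketch; the driver name (pgf90) and the -tp value are assumptions about the PGI toolchain of the time, while -ta=nvidia,host comes directly from the slide, and the subroutine itself is illustrative.

  ! Assumed build line for a single unified CPU+GPU binary (driver name and -tp value are illustrative):
  !   pgf90 -tp x64 -ta=nvidia,host -o scale scale.f90
  ! The resulting binary runs the accelerated region on an NVIDIA GPU when one is
  ! detected at run time and falls back to the host code path otherwise.
  subroutine scale(a, x, n)
    integer :: n, i
    real :: a, x(n)
  !$acc region
    do i = 1, n
       x(i) = a * x(i)
    end do
  !$acc end region
  end subroutine scale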
PGI Accelerator Compilers

Fortran source:

  SUBROUTINE SAXPY (A,X,Y,N)
  INTEGER N
  REAL A,X(N),Y(N)
!$ACC REGION
  DO I = 1, N
    X(I) = A*X(I) + Y(I)
  ENDDO
!$ACC END REGION
  END

compile →

Auto-generated GPU code:

  typedef struct dim3{ unsigned int x,y,z; }dim3;
  typedef struct uint3{ unsigned int x,y,z; }uint3;
  extern uint3 const threadIdx, blockIdx;
  extern dim3 const blockDim, gridDim;
  static __attribute__((__global__)) void pgicuda(
      __attribute__((__shared__)) int tc,
      __attribute__((__shared__)) int i1,
      __attribute__((__shared__)) int i2,
      __attribute__((__shared__)) int _n,
      __attribute__((__shared__)) float* _c,
      __attribute__((__shared__)) float* _b,
      __attribute__((__shared__)) float* _a )
  {
    int i; int p1; int _i;
    i = blockIdx.x * 64 + threadIdx.x;
    if( i < tc ){
      _a[i+i2-1] = ((_c[i+i2-1]+_c[i+i2-1])+_b[i+i2-1]);
      _b[i+i2-1] = _c[i+i2];
      _i = (_i+1);
      p1 = (p1-1);
    }
  }

Host x64 asm file:

  saxpy_:
    …
    movl (%rbx), %eax
    movl %eax, -4(%rbp)
    call __pgi_cu_init
    …
    call __pgi_cu_function
    …
    call __pgi_cu_alloc
    …
    call __pgi_cu_upload
    …
    call __pgi_cu_call
    …
    call __pgi_cu_download
    …

link → Unified a.out → execute
… no change to existing makefiles, scripts, IDEs, programming environment, etc.
Supporting Heterogeneous Cores: PGI Accelerator Model
• Minimal changes to the language: directives/pragmas, in the same vein as vector or OpenMP parallel directives. As simple as
    !$ACC REGION
    <your Fortran kernel here>
    !$ACC END REGION
  (a slightly fuller sketch follows this slide)
• Minimal library calls: usually none
• Standard x64 toolchain: no changes to makefiles, linkers, build process, standard libraries, or other tools
• Not a "platform": binaries will execute on any compatible x64+GPU hardware system
• Performance feedback: learn from and leverage the success of vectorizing compilers in the 1970s and 1980s
• Incremental program migration: puts migration decisions in the hands of developers
• PGI Unified Binary technology: ensures continued portability to non-GPU-enabled targets
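A slightly fuller sketch of the directive style described above; the kernel (a dense matrix-vector product) and all names are illustrative, and only the !$acc region / !$acc end region pair comes from the model itself.

  subroutine matvec(a, x, y, n)
    integer :: n, i, j
    real :: a(n,n), x(n), y(n)
  !$acc region                         ! the compiler maps the loops below onto the GPU
    do i = 1, n
       y(i) = 0.0
       do j = 1, n
          y(i) = y(i) + a(i,j) * x(j)
       end do
    end do
  !$acc end region
  end subroutine matvec

Removing the two directive lines leaves ordinary Fortran that still compiles and runs on the host, which is the incremental-migration property emphasized above.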
Programmer Productivity: Compiler-to-Programmer Feedback
[Diagram: a feedback loop between the HPC user, the HPC code, and the PGI compiler; the user applies directives, options, and restructuring, while CCFF and PGPROF trace data carry x64/accelerator performance information back to the user.]
• CCFF provides: how/when a function was compiled, IPA optimizations, profile-feedback runtime values, information on vectorization and parallelization, compute intensity, and missed opportunities
Supporting Third Parties
• PGI 9.0 supports OpenMP 3.0 for Fortran and C/C++
  • OpenMP 3.0 tasks are supported in all languages (an illustrative task sketch follows this slide)
  • OpenMP runtime overhead, as measured by the EPCC benchmark, is lower than our competition's
• PGI is currently working with the OpenMP committee to investigate support for an accelerator programming model as part of OpenMP and/or another standards body
  • Michael Wolfe is our OpenMP representative
• IMSL and NAG are already supported with PGI compilers; we are enabling them to migrate incrementally to heterogeneous manycore
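To illustrate the OpenMP 3.0 task support mentioned above, here is a minimal Fortran sketch; the loop bounds and printed text are illustrative.

  program task_demo
    use omp_lib
    implicit none
    integer :: i
  !$omp parallel
  !$omp single
    do i = 1, 4
  !$omp task firstprivate(i)           ! each iteration becomes an independent task
       print *, 'task', i, 'executed by thread', omp_get_thread_num()
  !$omp end task
    end do
  !$omp end single
  !$omp end parallel
  end program task_demo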
Availability and Additional Information
• PGI Accelerator Programming Model: supported for x64+NVIDIA Linux targets in the PGI 9.0 Fortran and C compilers, available now
• PGI CUDA Fortran: supports explicit programming of x64+NVIDIA targets and will be available in a production release of the PGI Fortran 95/03 compiler, currently scheduled for release in November 2009
• Other GPU and accelerator targets: being studied by PGI and may be supported in the future as the necessary low-level software infrastructure (e.g. OpenCL) becomes more widely available
• Further information: see www.pgroup.com/accelerate for a detailed specification of the PGI Accelerator model, an FAQ, and related articles and white papers
• CCFF: the Common Compiler Feedback Format is described at www.pgroup.com/resources/ccff.htm