First experiences with Porting COSMO code to GPU using F2C-ACC Fortran to CUDA compiler
Cristiano Padrin (CASPUR), Piero Lanucara (CASPUR), Alessandro Cheloni (CNMCA)
The GPU explosion • A huge amount of computing power: exponential growth with respect to “standard” multicore CPUs
Jazz: Fermi GPU Cluster at CASPUR
• 192 cores Intel X5650 @ 2.67 GHz
• 14336 cores on 32 Fermi C2050
• QDR IB interconnect
• 1 TB RAM, 200 TB IB storage
• 14.3 Tflops peak, 10.1 Tflops Linpack, 785 MFlops/W
• CASPUR awarded as CUDA Research Center for 2010-2011
• The Jazz cluster is currently number 5 of the Little Green500 List
Introduction
• The problem: porting large, legacy Fortran applications to GPGPU architectures.
• CUDA is the de-facto standard, but only for C/C++ codes.
• There is no standard yet: several GPU Fortran compilers exist, both commercial (CAPS HMPP, PGI Accelerator and CUDA Fortran) and freely available (F2C-ACC), …
• Our choice: F2C-ACC (Govett), a directive-based compiler from NOAA
How F2C-ACC participates in make
filename.f90 → [F2C-ACC] → filename.m4 → [m4] → filename.cu → [nvcc] → filename.o
$(F2C) $(F2COPT) filename.f90
$(M4) filename.m4 > filename.cu
$(NVCC) -c $(NVCC_OPT) -I$(INCLUDE) filename.cu
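The chain on this slide can be written as a small set of pattern rules; this is a minimal sketch, assuming F2C, F2COPT, M4, NVCC, NVCC_OPT and INCLUDE are defined elsewhere in the Makefile (the variable names and commands follow the slide, the rule layout is ours):

```make
# Fortran source -> m4 intermediate (F2C-ACC translation)
%.m4: %.f90
	$(F2C) $(F2COPT) $<

# m4 intermediate -> CUDA source
%.cu: %.m4
	$(M4) $< > $@

# CUDA source -> object file
%.o: %.cu
	$(NVCC) -c $(NVCC_OPT) -I$(INCLUDE) $< -o $@
```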
F2C-ACC Workflow
• F2C-ACC translates Fortran code, with user-added directives, into CUDA (it relies on the m4 macro processor for inter-language dependencies)
• Some hand coding may be needed (see results)
• Debugging and optimization tips (e.g. thread/block synchronization, out of memory, coalescing, occupancy, …) have to be applied manually
• Compile and link against the CUDA libraries to create an executable to run
Porting the Microphysics
• In POMPA task 6 we are exploring “the possibilities of a simple porting of specific physics or dynamics kernels to GPUs”.
• During the last workshop in Manno at CSCS, two different approaches emerged to deal with the problem: one based on PGI Accelerator directives and the other based on the F2C-ACC tool.
• The study has been done on the stand-alone Microphysics program optimized for GPU with PGI by Xavier Lapillonne, and available on HPCforge.
Reference Code Structure
In the Microphysics program, the two nested do-loops over space inside subroutine hydci_pp were identified as the part to be accelerated via PGI directives.
[Diagram: file mo_gscp_dwd.f90 contains module mo_gscp_dwd, with subroutines hydci_pp_init, hydci_pp (the part accelerated via PGI directives), satad, and the elemental functions; the main program and further modules live in other files.]
Reference Code Structure
Simplified workflow of hydci_pp:
presettings → two nested do-loops over “i” and “k” (computing: the accelerated part) → “satad” → update of some global outputs → …
Modified Code Structure
• We proceeded to accelerate the same part of the code via F2C-ACC directives.
• Due to limitations of the current F2C-ACC release, the code structure has been partly modified, while the workflow has been left unchanged.
• The part of the code to be accelerated remains the same, but it has been extracted from subroutine hydci_pp, and a separate file containing a new subroutine has been created for it: accComp.f90.
[Diagram: module mo_gscp_dwd in file mo_gscp_dwd.f90 keeps hydci_pp_init and hydci_pp; the part accelerated via F2C-ACC directives lives in subroutine accComp in file accComp.f90.]
Modified Code Structure: why?
The major limitations that drove the code changes are:
• Modules are (for now) not supported → the necessary variables are passed to the called subroutines, and called subroutines/functions are included in the file.
• The F2C-ACC “--kernel” option isn't carefully tested → elemental functions and subroutines (“satad”) are inlined.
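The shape of the extracted file can be pictured as follows. This is an illustrative sketch, not the actual COSMO source: the argument list is invented, and the directive spelling (!ACC$REGION, !ACC$DO PARALLEL, !ACC$DO VECTOR) follows the F2C-ACC documentation and should be checked against the release in use:

```fortran
! accComp.f90 -- hypothetical sketch of the extracted subroutine.
! Module variables are passed explicitly (modules are not supported),
! and the former call to satad is replaced by its inlined body.
subroutine accComp(ie, ke, t, qv, qc)
  implicit none
  integer, intent(in)    :: ie, ke
  real,    intent(inout) :: t(ie,ke), qv(ie,ke), qc(ie,ke)
  integer :: i, k

!ACC$REGION(<ke>,<ie>,<t:inout,qv:inout,qc:inout>) BEGIN
!ACC$DO PARALLEL(1)
  do i = 1, ie
!ACC$DO VECTOR(1)
    do k = 1, ke
      ! ... microphysics computation; inlined satad body goes here ...
    end do
  end do
!ACC$REGION END
end subroutine accComp
```

Since the directives are Fortran comments, the same source still compiles with a plain Fortran compiler for the CPU reference run.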
Modified Code Structure: Host / Device View
[Diagram: module mo_gscp_dwd, with hydci_pp_init and hydci_pp, stays on the CPU; data is copied in to the GPU, where subroutine accComp executes the part accelerated via F2C-ACC directives; the results are copied out back to the CPU.]
Results
The check.dat file produced by the run of the model built with F2C-ACC shows a closer match than the check.dat file produced with PGI. In particular, we can see the comparison for one iteration between the F2C-ACC version and the Fortran version:

Comparing files …
# field     nt  nd  n_err  mean R_er  max R_er  max A_er  (  i,  j,  k)
1 t          1   3   8430    2.2E-16   6.2E-16   1.7E-13  ( 16, 58, 42)
8 tinc_lh    1   3   5681    2.6E-03   1.1E-01   1.1E-13  ( 13, 53, 47)
Conclusions
• First results are encouraging: the performance of the F2C-ACC Microphysics is quite good.
• F2C-ACC pros:
  • Directive based (incremental parallelization): readable, only one source code to maintain
  • “Adjustable” CUDA code is generated: portability and efficiency
• F2C-ACC cons:
  • Ongoing project: it is an «application specific Fortran-to-CUDA compiler for performance evaluation», with for-now limited support for some advanced Fortran features (e.g. modules)
  • Check for correctness: intrinsics (e.g. reductions) and advanced Fermi features (e.g. FMA support) are not «automatically» handled by the F2C-ACC compiler