Experiences parallelising the mixed C-Fortran Sussix BPM post-processor
H. Renshall, BE Dept associate, Jan 2012
Using appendix material from CERN-ATS-Note-2011-052 MD (July 2011)
BE-NAG Meeting: Parallelisation of C-Fortran SUSSIX
The Problem: SUSSIX is a FORTRAN program for post-processing turn-by-turn beam position monitor (BPM) data. It computes the frequency, amplitude and phase of tunes and resonant lines to high precision using an interpolated FFT. Analysis of such data is a vital component of many linear and non-linear dynamics measurements. For LHC BPM data the beta-beating team has implemented a specific version, sussix4drive, in the CCC; it is run through the C steering code Drive_God_lin. Analysing all LHC BPMs, however, is a major real-time computational bottleneck in the control room, and this has prevented truly on-line study of the BPM data. In response, an effort has been under way to reduce the wall-clock time of the C and Fortran codes by parallelising them, with a factor of 10 as the target.
Solutions considered: Since the application runs on dedicated servers in the CCC, the obvious technique is to exploit current multi-core hardware: 24 or 48 cores are now typical. The first idea was to use a parallelised FFT from the NAG fsl6i2dcl library for SMP and multicore, together with the Intel 64-bit Fortran compiler and the Intel Math Kernel Library recommended by NAG. As a learning exercise, various NAG installation validation examples of enhanced routines, including multi-dimensional FFTs, were run on a fairly idle 16-core lxplus machine. As the number of cores was increased, all took about the same real time but used more user CPU time. This is not surprising, since the examples take only milliseconds, comparable to the overhead of launching a new thread.
The Sussix application calls cfft (D704 in the CERN program library), which maps onto NAG c06ecf, a routine that has not yet been enhanced. On a simple test case c06ecf gave the same numerical results as cfft but was 10% slower, probably because of extra housekeeping and extra numerical controls. At the same time, profiling the Sussix application with gprof showed that only 7.5% of the total CPU time was spent in cfft, with less than 10 msec per individual call, so little or no real-time speedup could be expected from a parallelised version. The profile showed that 70% of the CPU time was spent in a function calcr, which searches for the maxima of the Fourier spectra and executes a compact reverse inner loop over the number of turns of BPM data a very large number of times.
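The profiling argument above can be made quantitative with Amdahl's law. The law is not named in the original slides; it is added here only as an illustration. A minimal sketch in C, where the helper name amdahl_speedup is hypothetical and not part of SUSSIX:

```c
#include <assert.h>

/* Amdahl's law: upper bound on the overall speedup when a fraction p
   (between 0 and 1) of the serial run time is spread perfectly over
   n cores and the remaining 1 - p stays serial.
   Illustrative helper only, not taken from the SUSSIX sources. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With p = 0.075, the share of CPU time measured in cfft, amdahl_speedup(0.075, 24) is about 1.08, so even an ideally parallel FFT on 24 cores would gain under 10%. This ignores thread start-up cost, which the slides note is comparable to the run time of a single call.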
This reverse loop over maxd, the number of LHC turns measured by an individual BPM, could not be improved. In a real case maxd is typically 1000 and the loop is executed 10 million times:

      double complex zp,zpp,zv
      zpp=zp(maxd)
      do np=maxd-1,1,-1
        zpp=zpp*zv+zp(np)
      enddo

It was decided to try to parallelise using OpenMP, as NAG does, in the implementation supported by the Intel compiler. Examining the granularity revealed that the highest level of independent code execution was the processing of individual BPM data.
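Read as written, the inner loop is Horner's rule for evaluating a polynomial with complex coefficients zp at the point zv. Each pass needs the zpp value from the previous pass, which is one plausible reason the loop itself offered no room for improvement. A sketch of the same recurrence in C, with 0-based indexing; the function name horner_sum is an assumption, not a routine from the original code:

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Evaluates zp[0] + zp[1]*zv + ... + zp[maxd-1]*zv^(maxd-1) by Horner's
   rule, mirroring the reverse Fortran loop (which indexes from 1).
   Hypothetical helper for illustration only. */
double complex horner_sum(const double complex *zp, size_t maxd,
                          double complex zv)
{
    assert(zp != NULL && maxd >= 1);
    double complex zpp = zp[maxd - 1];
    /* Visits indices maxd-2 down to 0; each step depends on the last. */
    for (size_t np = maxd - 1; np-- > 0; ) {
        zpp = zpp * zv + zp[np];
    }
    return zpp;
}
```

For coefficients 1, 2, 3 and zv = 2 the function returns 1 + 2*2 + 3*4 = 17.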
The pure FORTRAN offline version was parallelised first, by adding OpenMP directives around the main BPM loop. The data for each BPM are in a separate file:

!$OMP PARALLEL DO PRIVATE(n,iunit,filename,nturn)
!$OMP& SHARED(isix,ntot,iana,iconv,nt1,nt2,narm,istune,etune,tunex,
!$OMP& tuney,tunez,nsus,idam,ntwix,ir,imeth,nrc,eps,nline,
!$OMP& lr,mr,kr,idamx,ifin,isme,iusme,inv,iinv,icf,iicf)
      do n=1,ntot   ! Parallel loop over all BPMs (typically 500)
        call datspe(iunit,idam,ir,nt1,nt2,nturn,imeth,narm,iana)
        call ordres(eps,narm,nrc,idam,n,nturn)
      enddo
!$OMP END PARALLEL DO

In addition, !$OMP THREADPRIVATE directives were added for all non-shareable variables in the called subroutine trees.
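The same pattern, a parallel loop over independent BPMs plus thread-private copies of state held in the called routines, can be sketched in C. Everything here is a stand-in: analyse_bpm plays the role of the datspe/ordres call tree, and scratch_sum plays the role of a Fortran variable that needed THREADPRIVATE. Without OpenMP enabled at compile time the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* File-scope scratch state, comparable to a Fortran COMMON or SAVEd
   variable. threadprivate gives each OpenMP thread its own copy. */
static double scratch_sum;
#pragma omp threadprivate(scratch_sum)

/* Hypothetical per-BPM analysis: here simply the mean of the samples. */
static double analyse_bpm(const double *turns, size_t nturn)
{
    scratch_sum = 0.0;            /* would race if shared between threads */
    for (size_t t = 0; t < nturn; t++) {
        scratch_sum += turns[t];
    }
    return scratch_sum / (double)nturn;
}

/* data holds nbpm rows of nturn samples; result receives one value per
   BPM. Iterations are independent, so they may run in any order. */
void analyse_all(const double *data, size_t nbpm, size_t nturn,
                 double *result)
{
    assert(data != NULL && result != NULL && nturn > 0);
    long n;
    #pragma omp parallel for
    for (n = 0; n < (long)nbpm; n++) {
        result[n] = analyse_bpm(data + (size_t)n * nturn, nturn);
    }
}
```

Each thread writes only to its own result[n], so the output array needs no locking. Only the scratch state inside the called routine has to be made private.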
This gave good scaling up to 10 cores on a non-dedicated 16-core lxplus machine (reported at the 24th ICE section meeting of 2011), so it was worth extending to the target mixed C and Fortran version run in the control room. There the BPM data are read into memory from a single file and the BPM loop is run from the C code, which uses a different but similar OpenMP syntax and gives the same scaling result:

#pragma omp parallel private(i,ii,ij,kk)
#pragma omp for
for (i=pickstart; i<=maxcounthv; i++) {
    sussix4drivenoise_(&doubleToSend[0], &tune[0], &amplitude[0]);
    #pragma omp critical
    {
        /* here I/O C-code in the loop needing sequential execution */
    }
}

The Fortran datspe and ordres call trees were unchanged.
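The role of the critical section can be shown with a small self-contained sketch: the per-BPM computation runs in parallel, while the step that touches shared output is serialised. The record type, the function name collect_results and the placeholder tune value are all assumptions made for illustration; the real loop calls the Fortran routine and performs file I/O instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record produced for one BPM. */
typedef struct {
    int    bpm;    /* index of the BPM */
    double tune;   /* placeholder for an analysis result */
} bpm_result;

/* Computes one record per BPM in parallel and appends it to a shared
   array inside a critical section. Returns the number of records
   written; out must have room for nbpm entries. */
size_t collect_results(int nbpm, bpm_result *out)
{
    assert(nbpm >= 0 && (out != NULL || nbpm == 0));
    size_t count = 0;             /* shared between all threads */
    int i;
    #pragma omp parallel for
    for (i = 0; i < nbpm; i++) {
        bpm_result r;             /* private: declared inside the loop */
        r.bpm  = i;
        r.tune = 0.25 + 0.001 * i;   /* stand-in for the Fortran call */

        #pragma omp critical
        {
            out[count] = r;       /* one thread at a time */
            count++;
        }
    }
    return count;
}
```

Every record is written exactly once, but the position of each record in out depends on the order in which threads reach the critical section, so it can differ from run to run.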
The OpenMP directives multi-thread the code, and the threads are then mapped onto the physical cores of a multi-core machine. The run-time environment variable OMP_NUM_THREADS tells OpenMP how many threads, and hence cores, it may use for an execution, which makes the scaling easy to measure. Since the order in which individual BPMs are processed is arbitrary, the results file is post-processed by a unix sort as part of the application, to give the same results as a non-parallel execution. A test case of real 1000-turn LHC BPM data, analysed to find 160 lines, was run on a reserved 24-core machine, cs-ccr-spareb7, in the CCC. A normal run of this test case takes about 50 seconds on this machine. The observed wall-time speedup of C-Fortran Sussix as a function of the number of cores (from E. Maclean) is shown on the final slide.
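The application restores a deterministic order by running unix sort on the results file. As an in-memory analogue, and not the method used in the original code, the same effect can be sketched with the C library qsort, ordering records by BPM index. The record type and function names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result record, written in arbitrary thread order. */
typedef struct {
    int    bpm;    /* index of the BPM */
    double tune;   /* placeholder for an analysis result */
} tune_record;

/* qsort comparator: ascending BPM index, written without subtraction
   so it cannot overflow. */
static int by_bpm(const void *a, const void *b)
{
    const tune_record *x = a;
    const tune_record *y = b;
    return (x->bpm > y->bpm) - (x->bpm < y->bpm);
}

/* Puts the records into the order a serial run would have produced. */
void sort_results(tune_record *rec, size_t n)
{
    assert(rec != NULL || n == 0);
    if (n > 1) {
        qsort(rec, n, sizeof rec[0], by_bpm);
    }
}
```

Sorting after the parallel region keeps the loop body free of any ordering constraint, so threads never wait for one another merely to write results in sequence.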
A factor of about 10 improvement in the real computation time has been achieved for this test case, saturating at 12 cores, probably because of memory bandwidth limits. For the study of amplitude detuning reported in CERN-ATS-Note-2011-052 the parallelised C-Fortran SUSSIX was used within the beta-beat GUI, and the target tenfold real-time reduction was verified in practice. This technique could be of interest to other applications.