Jacobi solver status
Lucian Anton, Saif Mulla, Stef Salvini
CCP_ASEARCH meeting, October 8, 2013, Daresbury
Outline
• Code structure
  • Front end
  • Numerical kernels
  • Data collection
• Performance data
  • Intel SB
  • Xeon Phi
  • BlueGene/Q
  • GPU
Code structure
• Read input from command line
  • Grid sizes, length of an iteration block, number of iteration blocks, …
  • Algorithm to use
  • Output format (header, test iterations, …)
• Initialize the grid with an eigenvector (eigenmode) of the Jacobi smoother
• Run several iteration blocks
• Collect min, max and average times
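A minimal sketch of this driver structure (not the actual homb source: the 1D jacobi_sweep below is only a placeholder for the real 3D kernel, and the parameter handling is reduced to a single argument):

/* Illustrative driver skeleton: read a parameter, run blocks of iterations,
   collect min/mean/max block times. */
#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static void jacobi_sweep(int n, const double *u, double *w) {
  int i;
  #pragma omp parallel for
  for (i = 1; i < n - 1; ++i)            /* stand-in for the 3D smoother */
    w[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

int main(int argc, char *argv[]) {
  int n = (argc > 1) ? atoi(argv[1]) : 100;   /* grid size */
  int niter = 5, biter = 10, ib, it;          /* iterations per block, number of blocks */
  double tmin = DBL_MAX, tmax = 0.0, tsum = 0.0;
  double *u = calloc(n, sizeof *u), *w = calloc(n, sizeof *w), *tmp;

  for (ib = 0; ib < biter; ++ib) {
    double t = omp_get_wtime();
    for (it = 0; it < niter; ++it) {
      jacobi_sweep(n, u, w);
      tmp = u; u = w; w = tmp;                /* swap old and new grids */
    }
    t = omp_get_wtime() - t;
    if (t < tmin) tmin = t;
    if (t > tmax) tmax = t;
    tsum += t;
  }
  printf("minTime %.3e meanTime %.3e maxTime %.3e\n", tmin, tsum / biter, tmax);
  free(u); free(w);
  return 0;
}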
Build model
Uses a generic Makefile + platform/*.inc files:

F90 := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
       source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && mpiifort
CC  := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
       source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && icc
LANG = C

ifdef USE_MIC
  FMIC = -mmic
endif
ifdef USE_MPI
  FMPI = -DUSE_MPI
endif
ifdef USE_DOUBLE_PRECISION
  DOUBLE = -DUSE_DOUBLE_PRECISION
endif
ifdef USE_VEC1D
  VEC1D = -DUSE_VEC1D
endif
#FC = module add intel/comp intel/mpi && mpiifort
Command line parameters

arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -help
Usage:
[-ng <grid-size-x> <grid-size-y> <grid-size-z>]
[-nb <block-size-x> <block-size-y> <block-size-z>]
[-np <num-proc-x> <num-proc-y> <num-proc-z>]
[-niter <num-iterations>] [-biter <iterations-block-size>]
[-malign <memory-alignment>]
[-v] [-t] [-pc]
[-model <model_name> [num-waves] [threads-per-column]]
[-nh] [-help]

arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -model help
possible values for model parameter:
baseline
baseline-opt
blocked
wave num-waves threads-per-column
basegpu
optgpu
Note for wave model: if threads-per-column == 0 diagonal wave kernel is used.
README file
A full explanation of the command line options is provided in the README:

The following flags can be used to set the grid sizes and other run parameters:
-ng <nx> <ny> <nz>   set the global grid sizes
-nb <bx> <by> <bz>   set the computational block size, relevant only for the blocked model.
Notes:
1) no sanity checks are done, you are on your own.
2) for the blocked model the OpenMP parallelism is done over computational blocks.
   One must ensure that there is enough work for all threads by setting suitable block sizes.
Correctness check
The -t flag checks whether the norm ratio is close to the Jacobi smoother eigenvalue:

arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -t -niter 7
Correctness check
iteration, norm ratio, deviation from eigenvalue
0 6.36918e+01 6.26966e+01
1 9.95185e-01 2.55054e-08
2 9.95185e-01 1.50473e-08
3 9.95185e-01 2.57243e-08
4 9.95185e-01 3.27436e-08
5 9.95185e-01 1.96427e-08
6 9.95185e-01 3.17978e-08
# Last norm 6.187368259733268e+01
#==========================================================#
# NThs  Nx  Ny  Nz  NITER  minTime    meanTime   maxTime
#==========================================================#
   8    33  33  33    1    1.299e-04  1.487e-04  1.690e-04
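A small sketch of the value the -t check compares against, assuming the standard 7-point smoother, initialisation with its lowest-frequency eigenmode, and grid sizes that include the boundary points (these are assumptions for illustration, not read from the code):

/* Expected norm ratio for a 33^3 grid: each sweep scales the eigenmode by
   lambda, so the printed deviation is |norm ratio - lambda|. */
#include <math.h>
#include <stdio.h>

int main(void) {
  const double pi = acos(-1.0);
  int nx = 33, ny = 33, nz = 33;      /* grid sizes of the example run above */
  double lambda = (cos(pi / (nx - 1)) +
                   cos(pi / (ny - 1)) +
                   cos(pi / (nz - 1))) / 3.0;
  printf("expected norm ratio %.6e\n", lambda);   /* ~9.95185e-01 */
  return 0;
}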
Algorithms
• Baseline: three nested loops iterating over the grid
  • OpenMP parallelism applied to the outermost loop
  • if-condition eliminated from the inner loop
• Blocked iterations
• Wave iterations
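A minimal sketch of the baseline variant just described: three nested loops, OpenMP on the outermost loop, and loop bounds chosen so that no if-condition is needed inside the inner loop. The row-major layout and the IDX macro are illustrative, not the code's actual data structures.

/* Baseline Jacobi sweep over the interior points of an nx*ny*nz grid. */
#include <stddef.h>

#define IDX(i, j, k, nx, ny) ((size_t)(k) * (nx) * (ny) + (size_t)(j) * (nx) + (i))

static const double sixth = 1.0 / 6.0;

void jacobi_baseline(int nx, int ny, int nz,
                     const double *restrict u, double *restrict w) {
  int i, j, k;
  #pragma omp parallel for private(i, j)
  for (k = 1; k < nz - 1; ++k)
    for (j = 1; j < ny - 1; ++j)
      for (i = 1; i < nx - 1; ++i)
        w[IDX(i, j, k, nx, ny)] = sixth *
          (u[IDX(i - 1, j, k, nx, ny)] + u[IDX(i + 1, j, k, nx, ny)] +
           u[IDX(i, j - 1, k, nx, ny)] + u[IDX(i, j + 1, k, nx, ny)] +
           u[IDX(i, j, k - 1, nx, ny)] + u[IDX(i, j, k + 1, nx, ny)]);
}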
Algorithms: wave details
(diagram: wavefronts sweeping the grid in the Z–Y plane, with "new" and "old" grid values on either side of each wave)
Algorithms: helping vectorisation
The inner loop can be replaced with an easier-to-vectorize function:

// 1D loop that helps the compiler to vectorize
static void vec_oneD_loop(const int n, const Real uNorth[], const Real uSouth[],
                          const Real uWest[], const Real uEast[],
                          const Real uBottom[], const Real uTop[], Real w[]) {
  int i;
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
#ifdef __IBMC__
#pragma ibm independent_loop
#endif
  for (i = 0; i < n; ++i)
    w[i] = sixth * (uNorth[i] + uSouth[i] + uWest[i] + uEast[i] + uBottom[i] + uTop[i]);
}
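For illustration, a possible call site: the inner i-loop of the sweep becomes one call per (j, k) row of nx-2 interior points. The direction labels and the IDX macro follow the baseline sketch above and are assumptions, not the code's actual calling convention.

/* Replace the inner loop with one vec_oneD_loop call per grid row. */
for (k = 1; k < nz - 1; ++k)
  for (j = 1; j < ny - 1; ++j)
    vec_oneD_loop(nx - 2,
                  &u[IDX(1, j - 1, k, nx, ny)],   /* uNorth  (j - 1) */
                  &u[IDX(1, j + 1, k, nx, ny)],   /* uSouth  (j + 1) */
                  &u[IDX(0, j,     k, nx, ny)],   /* uWest   (i - 1) */
                  &u[IDX(2, j,     k, nx, ny)],   /* uEast   (i + 1) */
                  &u[IDX(1, j, k - 1, nx, ny)],   /* uBottom (k - 1) */
                  &u[IDX(1, j, k + 1, nx, ny)],   /* uTop    (k + 1) */
                  &w[IDX(1, j,     k, nx, ny)]);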
Algorithms: CUDA
• Base laplace3D (from Mike's lecture notes)
• Shared memory in the XY plane
• … more to come
Data collection
With such a large parameter space we have a big-ish data problem. Bash script + gnuplot:

index=0
for exe in $exe_list
do
  for model in $model_list
  do
    for nth in $threads_list
    do
      export OMP_NUM_THREADS=$nth
      for ((linsize=10; linsize <= max_linsize; linsize += step))
      do
        biter=$(((10*max_linsize)/linsize))
        niter=5
        if [ "$model" = wave ]
        then
          nwave="$biter $((nth<biter?nth:biter))"
          echo "model $model $nwave"
        else
          nwave=""
        fi
        if [ "$blk_x" -eq 0 ] ; then blk_xt=$linsize ; else blk_xt=$blk_x ; fi
        if [ "$blk_y" -eq 0 ] ; then blk_yt=$linsize ; else blk_yt=$blk_y ; fi
        if [ "$blk_z" -eq 0 ] ; then blk_zt=$linsize ; else blk_zt=$blk_z ; fi
        echo "./"$exe" -ng $linsize $linsize $linsize -nb $blk_xt $blk_yt $blk_zt -model $model $nwave
SandyBridge baseline (performance plots)
SB: blocked and wave (performance plots)
BGQ (performance plots)
Xeon Phi vs SandyBridge (performance plots)
Fermi data (performance plots)
Conclusions & To do
• We have an integrated set of Jacobi smoother algorithms
  • OpenMP, CUDA, MPI (almost)
• Flexible build system
• Run parameters can be selected from the command line and via preprocessor flags
• Correctness check
• Scripted data collection
• README file
• Tested on several systems (iDataPlex, BGQ, Emerald, …, Mac OS laptop)
• GPU needs further improvements
• …