260 likes | 504 Views
Outline. Speeding up Matlab Computations Symmetric Multi-Processing with Matlab Accelerating Matlab computations with GPUs Running Matlab in distributed memory environments Using the Parallel Computing Toolbox Using the Matlab Distributed Compute Engine Server Using pMatlab
E N D
Outline Speeding up Matlab Computations • Symmetric Multi-Processing with Matlab • Accelerating Matlab computations with GPUs • Running Matlab in distributed memory environments • Using the Parallel Computing Toolbox • Using the Matlab Distributed Compute Engine Server • Using pMatlab Mixing Matlab and Fortran/C code • Compiling MEX code from C/Fortran • Turning Matlab routines into C code
Symmetric Multi-Processing By default Matlab uses all cores on a given node for operations that can be threaded Arrays and matrices • Basic information: ISFINITE, ISINF, ISNAN, MAX, MIN • Operators: +, -, .*, ./, .\, .^, *, ^, \ (MLDIVIDE), / (MRDIVIDE) • Array operations: PROD, SUM • Array manipulation: BSXFUN, SORT Linear algebra • Matrix Analysis: DET, RCOND • Linear Equations: CHOL, INV, LDL, LINSOLVE, LU, QR • Eigenvalues and singular values: EIG, HESS, SCHUR, SVD, QZ Elementary math • Trigonometric: ATAN2, COS, CSC, HYPOT, SEC, SIN, TAN, including variants for radians, degrees, hyperbolics, and inverses. • Exponential: EXP, POW2, SQRT • Complex: ABS • Rounding and remainder: CEIL, FIX, FLOOR, MOD, REM, ROUND • LOG, LOG2, LOG10, LOG1P, EXPM1, SIGN, BITAND, BITOR, BITXOR Special Functions • ERF, ERFC, ERFCINV, ERFCX, ERFINV, GAMMA, GAMMALN Data Analysis • CONV2, FILTER, FFT and IFFT of multiple columns or long vectors, FFTN, IFFTN
Symmetric Multi-Processing To be sure you only use the resources you request, you should either request an entire node and all of the CPU’s... Or request a single cpu and specify that Matlab should only use a single thread salloc –t 60 -c 24 module load matlab srunmatlab salloc –t 60 module load matlab srunmatlab-singleCompThread
Using GPUs with Matlab Matlab can use GPUs to do calculations, provided a GPU is available on the node Matlab is running on. We can query the connected GPUs from within Matlabusing And obtain a list of GPU supported functions using salloc -t 60 --gres=gpu:1 module load matlab module load cuda matlab gpuDeviceCount() gpuDevice() methods('gpuArray')
Using GPUs with Matlab So there is a 2D FFT – but no Hilbert function... We could do the log and abs functions on the GPU as well. Distribute data to GPU H=hilb(1000); H_=gpuArray(H); Z_=fft2(H_); Z=gather(Z_); imagesc(log(abs(Z))); FFT performed on GPU Gather data from GPU onto CPU H=hilb(1000); H_=gpuArray(H); Z_=fft2(H_); imagesc(gather(log(abs(Z_)));
Using GPUs with Matlab For our example, doing the FFT on the GPU is 7x faster. (4x if you include moving the data to the GPU and back) >> H=hilb(5000); >> tic; A=gather(gpuArray(H)); toc Elapsed time is 0.161166 seconds. >> tic; A=gather(fft2(gpuArray(H))); toc Elapsed time is 0.348159 seconds. >> tic; A=fft2(H); toc Elapsed time is 1.210464 seconds.
Using GPUs with Matlab Matlab has no built in hilb() function that can run on the GPU – but we can write our own function(kernel) in cuda – and save it to hilbert.cu And compile it with nvcc to generate a Parallel Thread eXecution file __global__ void HilbertKernel( double * const out, size_tconstnumRows, size_tconstnumCols) { constintrowIdx = blockIdx.x * blockDim.x + threadIdx.x; constintcolIdx = blockIdx.y * blockDim.y + threadIdx.y; if ( rowIdx >= numRows ) return; if ( colIdx >= numCols ) return; size_tlinearIdx = rowIdx + colIdx*numRows; out[linearIdx] = 1.0 / (double)(1+rowIdx+colIdx) ; } nvcc -ptxhilbert.cu
Using GPUs with Matlab We have to initialize the kernel and specify the grid size before executing the kernel. The default for matlab kernel’s is 1 thread per block, but we could create fewer blocks that were each 10 x 10 threads. H_=gpuArray.zeros(1000); hilbert_kernel=parallel.gpu.CUDAKernel('hilbert.ptx','hilbert.cu'); hilbert_kernel.GridSize=size(H_); H_=feval(hilbert_kernel, H_, 1000,1000); Z_=fft2(H_); imagesc(gather(log(abs(Z_)))); hilbert_kernel.ThreadBlockSize=[10,10,1]; hilbert_kernel.GridSize=[100,100];
Parallel Computing Toolbox As an alternative you can also use the Parallel Computing Toolbox. This supports parallelism via MPI https://info.circ.rochester.edu/BlueHive/Software/Mathematics/matlab.html You should create a pool that is the same size as the number of processors you requested in your job submission. Matlab also sells licenses for using a Distributed Computing Server which allows for matlabpools that use more than just the local node.
Parallel Computing Toolbox • You can achieve parallelism in several ways: • parfor loops – execute for loops in parallel • smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’) • pmode – interactive version of smpd • distributed arrays – very similar to gpuArrays.
Parallel Computing Toolbox • You can achieve parallelism in several ways: • parfor loops – execute for loops in parallel • smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’) • pmode – interactive version of smpd • distributed arrays – very similar to gpuArrays. matlabpool(4) parforn=1:100 H=hilb(n); Z=fft2(H); f=figure('Visible','off');imagesc(log(abs(Z))); print('-dpdf','-r300',sprintf('%s%03d%s','fig1-batch_',n,'.pdf')); end matlabpool close
Parallel Computing Toolbox • You can achieve parallelism in several ways: • parfor loops – execute for loops in parallel • smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’) • pmode – interactive version of smpd • distributed arrays – very similar to gpuArrays. matlabpool(4) spmd for n=drange(1:100) H=hilb(n); Z=fft2(H); f=figure('Visible','off');imagesc(log(abs(Z))); end end matlabpool close matlabpool(4) spmd for n=labindex:numlabs:100 H=hilb(n); Z=fft2(H); f=figure('Visible','off');imagesc(log(abs(Z))); end end matlabpool close
Parallel Computing Toolbox • You can achieve parallelism in several ways: • parfor loops – execute for loops in parallel • smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’) • pmode – interactive version of smpd • distributed arrays – very similar to gpuArrays. pmode start 4 pmodelab2client H 3 H3 H3 pmode close n=labindex; H=hilb(n); Z=fft2(H); f=figure('Visible','off'); imagesc(log(abs(Z))); print('-dpdf','-r300',sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));
Parallel Computing Toolbox • You can achieve parallelism in several ways: • parfor loops – execute for loops in parallel • smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’) • pmode – interactive version of smpd • distributed arrays – very similar to gpuArrays. Example using distributed arrays matlabpool(8) H=hilb(1000); H_=distributed(H); Z_=fft(fft(H_,[],1),[],2); Z=gather(Z_); imagesc(log(abs(Z))); matlabpool close Example using gpuArray H=hilb(1000); H_=gpuArray(H); Z_=fft2(H_); Z=gather(Z_); imagesc(log(abs(Z)));
Parallel Computing Toolbox What about building hilbert matrix in parallel? matlabpool(4) spmd codist=codistributor1d(1,[250,250,250,250],[1000,1000]); [i_lo, i_hi]=codist.globalIndices(1); H_local=zeros(250,1000); for i=i_lo:i_hi for j=1:1000 H_local(i-i_lo+1,j)=1/(i+j-1); end end H_ = codistributed.build(H_local, codist); end Z_=fft(fft(H_,[],1),[],2); Z=gather(Z_); imagesc(log(abs(Z))); matlabpool close Define partition Get local indices in x-direction Initialize local array with Hilbert values. Allocate space for local part Assemble codistributed array Now it's distributed like before!
Using pMatlab • pMatlab is an alternative method to get distributed matlab functionality without relying on Matlab’s Distributed Computing Server. • It is built on top of MapMPI (an MPI implementation for matlab – written in matlab - that uses file I/O for communication) • It supports various operations on distributed arrays (up to 4D) • Remapping, aggregating, finding non-zero entries, transposing, ghosting • Elementary math functions (trig, exponential, complex, remainder/rounding) • 2D Convolutions, FFTs, Discrete Cosine Transform • FFT's need to be properly mapped (cannot be distributed along transform dimension). • It does not have as much functionality as the parallel computing toolbox – but it does support ghosting and more flexible partitioning!
Using pMatlab Since pMatlab works by launching other Matlab instances – we need them to startup with pMatlab functionality. To do so we need to add a few lines to our startup.m file in our matlab path. addpath('/software/pMatlab/MatlabMPI/src'); addpath('/software/pMatlab/src'); rehash; pMatlabGlobalsInit;
Running pMatlab in Batch Mode To submit a job in batch mode we need to create a batch script And a Matlab script to launch the pMatlab script #SBATCH -N Matlab #SBATCH -p standard #SBATCH –t 60 #SBATCH –N 2 #SBATCH --ntasks-per-node=8 module load matlab matlab -nodisplay -r "pmatlab_launcher" sample_script.pbs pmatlab_launcher.m nProcs=getenv('SLURM_NTASKS); [sreturn, machines]=system('nodelist'); machines=regexp(machines, '\n', 'split'); machines=machines(1:size(machines,2)-1); eval(pRUN('pmatlab_script',nProcs,machines));
Running pMatlab in Batch Mode And finally we have our pmatlab script. Do y-fftand do x-fft. Z_ has different map Allocate and populate local portion of array with Hilbert matrix values Copy local values into distributed array Collect resulting matrix onto 'leader' Indices for local portion of array map for distributing array Distributed matrix constructor Xmap=map([Np 1],{},0:Np-1); H_=zeros(1000,1000,Xmap); [I1,I2]=global_block_range(H_); H_local=zeros(I1(2)-I1(1)+1,I2(2)-I2(1)+1); for i=I1(1):I1(2) for j=I2(1):I2(2) H_local(i-I1(1)+1,j-I2(1)+1)=1/(i+j-1); end end H_=put_local(H_,H_local); Z_=fft(fft(H_,[],2),[],1); Z=agg(Z_); if (pMATLAB.my_rank == pMATLAB.leader) f=figure('Visible','off'); imagesc(log(abs(Z))); print('-dpdf','-r300', 'fig1.pdf'); end Plot result from 'leader' matlab process X = put_local(X, fft(local(X),[],2)); Z = transpose_grid(X); Z = put_local(Z, fft(local(Z),[],1)); pmatlab_script.m
Compiling Mex Code C, C++, or Fortran routines can be called from within Matlab. subroutine compute(h, n) integer :: n real(8) :: h(n,n) integer :: i,j do i=1,n do j=1,n h(i,j)=1d0/(i+j-1d0) end do end do end subroutine compute #include "fintrf.h" subroutine mexfunction(nlhs, plhs, nrhs, prhs) mwPointer :: plhs(*), prhs(*) integer :: nlhs, nrhs mwPointer :: mxGetPr mwPointer :: mxCreateDoubleMatrix real(8) :: mxGetScalar mwPointer :: pr_out integer :: n n = nint(mxGetScalar(prhs(1))) plhs(1) = mxCreateDoubleMatrix(n,n, 0) pr_out = mxGetPr(plhs(1)) call compute(%VAL(pr_out),n) end subroutine mexfunction mexhilbert.F90 >> H=hilbert(10)
Turning Matlab code into C First we create a log_abs_fft_hilb.m function And then we run This will produce a mex file that we can test. We could have specified the type of 'n' in our matlab function function result = log_abs_fft_hilb(n) result=log(abs(fft2(hilb(n)))); >> codegenlog_abs_fft_hilb.m –args {uint32(0)} >> A=log_abs_fft_hilb_mex(uint32(16));>> B=log_abs_fft_hilb(16); >> max(max(abs(A-B))) ans = 8.8818e-16 function result = log_abs_fft_hilb(n) assert(isa(n,'uint32')); result=log(abs(fft2(hilb(n))));
Turning Matlab code into C Now we can also export a static library that we can link to: This will create a subdirectory codegen/lib/log_abs_fft_hilb that will have the source files '.c and .h' as well as a compiled object files '.o' and a library 'log_abs_fft_hilb.a' The source files are portable to any platform with a 'C' compiler (ieBlueStreak). We can rebuild the library on BlueStreak by running >> codegenlog_abs_fft_hilb.m -configcoder.config('lib') -args {'uint32(0)'} mpixlc –c *.c arrcslog_abs_fft_hilb.a *.o
Turning Matlab code into C To use the function, we still need to write a main subroutine that links to it. This requires working with matlab's variable types (which include dynamically resizable arrays) Return value of Matlab function Output result in column major order Free up memory associated with return array Argument to Matlab function Initialize Matlab array to have rank 2 Call matlab function Matlab type definitions #include "stdio.h" #include "rtwtypes.h" #include "log_abs_fft_hilb_types.h" void main() { uint32_T n=64; emxArray_real_T*result; int32_T i,j; emxInit_real_T(&result, 2); log_abs_fft_hilb(n, result); for(i=0;i<result->size[0];i++) { for(j=0;j<result->size[1];j++) { printf("%f",result->data[i+result->size[0]*j]); } printf("\n"); } emxFree_real_T(&result); }
Turning Matlab code into C And here is another example of calling 2D fft's on real data void main() { int32_T q0; int32_T i; int32_T n=8; emxArray_creal_T *result; emxArray_real_T *input; emxInit_creal_T(&result, 2); emxInit_real_T(&input, 2); q0 = input->size[0] * input->size[1]; input->size[0]=n; input->size[1]=n; emxEnsureCapacity((emxArray__common *)input, q0, (int32_T)sizeof(real_T)); for(j=0;j<input->size[1];j++ { for(i=0;i<input->size[0];i++) { input->data[i+input->size[0]*j]=1.0 / (real_T)(i+j+1); } } my_fft(input, result); for(i=0;i<result->size[0];i++) { for(j=0;j<result->size[1];j++) { printf("[% 10.4f,% 10.4f] ", result->data[i+result->size[0]*j].re, result->data[i+result->size[0]*j].im); } printf("\n"); } emxFree_creal_T(&result); emxFree_real_T(&input); }
Turning Matlab code into C Exported FFT's only work on vectors of length 2N Error checking is disabled in exported C code Mex code will have the same functionality as exported C code, but will also have error checking. It will warn about doing FFT's on arbitrary length vectors, etc... Always test your mex code!
Matlab code is not that different from C code #include <stdio.h> #include <math.h> #include <complex.h> #include <fftw3.h> void main() { int n=4096; inti,j; double complex temp[n][n], input[n][n]; double result[n][n]; fftw_plan p; p=fftw_plan_dft_2d(n, n, &input[0][0], &temp[0][0], FFTW_FORWARD, FFTW_ESTIMATE); for (i=0;i<n;i++){ for(j=0;j<n;j++) { input[i][j]=(double complex)(1.0/(double)(i+j+1)); } } fftw_execute(p); for (i=0;i<n;i++){ for(j=0;j<n;j++) { result[i][j]=log(cabs(temp[i][j])); } } for (i=0;i<n;i++){ for(j=0;j<n;j++) { printf("%f ",result[i][j]); } } fftw_destroy_plan(p); } Or you can write your own 'C' code that uses open source mathematical libraries (iefftw).