High Performance Computing with MATLAB
Kadin Tseng
Scientific Computing and Visualization Group, Boston University
Outline
Performance Issues
• Memory Access
• Vectorization
• Compiler
• Other Considerations
Parallel MATLAB
Memory Access
Memory access patterns often affect computational performance. Here are some effective ways to enhance performance:
• Allocate array memory before using it
• Order for-loops to match the array storage layout
• Compute and save arrays in place wherever possible
Allocate Array
• Allocate array memory before using it.
MATLAB is designed primarily as an interactive, user-friendly environment, so it does not require memory to be pre-allocated. Often, however, array sizes are known a priori. Pre-allocating an array ensures that all of its elements are stored in a single contiguous block from the start, rather than the array being resized each time it grows.

Without pre-allocation:
    n = 5000;
    x(1) = 1;
    for i = 2:n
        x(i) = 2*x(i-1);
    end
Wallclock time = 0.0153 seconds

With pre-allocation:
    n = 5000;
    x = ones(n,1);
    x(1) = 1;
    for i = 2:n
        x(i) = 2*x(i-1);
    end
Wallclock time = 0.0002 seconds

The timing data were recorded on Katana. Actual times can vary significantly depending on the processor.
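Wallclock figures like those above can be measured with tic and toc. The following is a minimal sketch rather than the author's benchmark script: the variable names, the problem size, and the recurrence (chosen so that the values stay bounded) are assumptions made for illustration.

```matlab
n = 50000;                      % assumed problem size

% Version 1: the array grows by one element per iteration
clear xGrow
tic
xGrow(1) = 1;
for i = 2:n
    xGrow(i) = 0.5*xGrow(i-1) + 1;
end
tGrow = toc;

% Version 2: the array is pre-allocated before the loop
tic
xPre = zeros(1, n);
xPre(1) = 1;
for i = 2:n
    xPre(i) = 0.5*xPre(i-1) + 1;
end
tPre = toc;

fprintf('growing: %.4f s, pre-allocated: %.4f s\n', tGrow, tPre);
```

Both versions produce the same values; only the elapsed times differ, and the gap depends on the MATLAB release and the machine.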
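For-loop Ordering
• Best if the inner-most for-loop runs over the left-most index of the array, the next loop over the next index, and so on. MATLAB stores arrays in column-major order, so this ordering walks through memory contiguously.
• For a multi-dimensional array, x(i,j), the 1D representation of the same array, x(k), is inherently contiguous.

Inner loop over columns (strided access):
    n = 5000;
    x = zeros(n);
    for i = 1:n        % rows
        for j = 1:n    % columns
            x(i,j) = i+(j-1)*n;
        end
    end
Wallclock time = 0.88 seconds

Inner loop over rows (contiguous access):
    n = 5000;
    x = zeros(n);
    for j = 1:n        % columns
        for i = 1:n    % rows
            x(i,j) = i+(j-1)*n;
        end
    end
Wallclock time = 0.48 seconds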
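The relation between x(i,j) and its 1D representation x(k) can be checked directly. This is a small sketch with an assumed 4 x 4 array: in column-major storage the linear index is k = i + (j-1)*n, which is exactly the value the loops above store in each element.

```matlab
n = 4;                          % small size, chosen for illustration
x = zeros(n);
for j = 1:n                     % columns (outer loop)
    for i = 1:n                 % rows (inner loop, contiguous in memory)
        x(i,j) = i + (j-1)*n;   % store each element's own linear index
    end
end

k = sub2ind(size(x), 3, 2);     % linear index of element (3,2)
fprintf('x(3,2) = %d, k = %d, x(k) = %d\n', x(3,2), k, x(k));
```

Because each element holds its own linear index, x(:) is simply 1, 2, ..., n^2, which shows that walking down a column visits consecutive memory locations.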
Compute In-place
• Computing and saving an array in place improves performance.

Result stored in a new array:
    x = randn(10000);
    tic
    y = x.^2;
    toc
Wallclock time = 1.23 seconds

Result stored in place:
    x = randn(10000);
    tic
    x = x.^2;
    toc
Wallclock time = 0.49 seconds
Other Considerations
• Use a function m-file instead of a script m-file whenever reasonable.
    • A script m-file is loaded into memory and evaluated one line at a time. Subsequent uses require reloading.
    • A function m-file is compiled into pseudo-code and is loaded once. Subsequent calls to the function are faster because no reloading is needed.
• Avoid using virtual memory. Physical memory is much faster.
• Avoid passing large matrices to a function and modifying only a handful of elements.
• Use the MATLAB profiler (profile) to identify "hot spots" for performance enhancement.
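As an illustration of the last point, here is a minimal profiler session. The workload (random matrices passed to inv and svd) is an assumption chosen only to give the profiler something to measure; substitute your own code between profile on and profile off.

```matlab
profile on                      % start collecting timing data

A = rand(300);                  % assumed workload, for illustration only
for k = 1:20
    B = inv(A) * A;
    s = svd(B);
end

profile off                     % stop collecting
stats = profile('info');        % struct holding the collected data

% For the interactive report, run:
%   profile viewer
```

The report lists the time spent in each function and line, which helps decide where vectorization or pre-allocation is worth the effort.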
Vectorization
• For-loops in MATLAB can, in general, be expensive, especially if the loop count is large or the loops are nested.
• Without array allocation, for-loops are very costly.
• From a performance standpoint, a compact vector representation should generally be used in place of for-loops. Here is an example.

With a for-loop:
    i = 0;
    for t = 0:.01:10
        i = i + 1;
        y(i) = sin(t);
    end
Wallclock time = 0.0045 seconds

Vectorized:
    t = 0:.01:10;
    y = sin(t);
Wallclock time = 0.0005 seconds
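Loops that carry a dependence from one iteration to the next can sometimes be vectorized too. As a sketch, the doubling recurrence x(i) = 2*x(i-1) from the Allocate Array slide can be written with cumprod. The small n used here is an assumption made so that the values stay well within double precision.

```matlab
n = 30;                         % assumed size, kept small on purpose

% Loop form of the recurrence x(i) = 2*x(i-1)
xLoop = ones(n,1);
for i = 2:n
    xLoop(i) = 2*xLoop(i-1);
end

% Vectorized form: cumulative product of [1 2 2 ... 2]
xVec = cumprod([1; 2*ones(n-1,1)]);

fprintf('results identical: %d\n', isequal(xLoop, xVec));
```

Not every recurrence has such a built-in counterpart; when none exists, a pre-allocated loop remains a reasonable choice.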
Compiler
A MATLAB compiler, mcc, is available.
• It compiles m-files into C code, object libraries, or stand-alone executables.
• A stand-alone executable generated with mcc can run on compatible platforms without an installed MATLAB or a MATLAB license.
• Many MATLAB general and toolbox licenses are available at BU. Occasionally, MATLAB access may be denied because all licenses are checked out. Running a stand-alone executable requires NO license and no waiting.
• Some compiled codes may run more efficiently than m-files because they are not run in interpretive mode.
• A stand-alone executable lets you share your program without revealing the source.
http://scv.bu.edu/documentation/tutorials/MATLAB/compiler/
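A typical invocation from the MATLAB prompt might look like the following. This is a sketch under stated assumptions: myanalysis.m is a hypothetical function m-file on the MATLAB path, and the MATLAB Compiler must be installed and licensed on the machine doing the build.

```matlab
% Build a stand-alone executable from a function m-file.
% "myanalysis.m" is a hypothetical file name used for illustration.
mcc -m myanalysis.m
```

The -m option requests a stand-alone application. The target machine generally needs the matching MATLAB runtime libraries installed, so see the tutorial linked above for site-specific details.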
Is Parallel MATLAB the way to go?
• Even in the best case, it can't compete with C/Fortran with MPI/OpenMP.
• It is an acceptable compromise if
    • Converting your MATLAB code to C/Fortran requires too much effort and you don't have the time or inclination to do it.
    • A "big" job typically takes hours, rather than days, to run on a single processor.
    • You strongly prefer the relative ease and efficiency of programming a research code in MATLAB.
    • The appropriate multiprocessing MATLAB paradigm is at your disposal.
Multiprocessing MATLAB
1. MatlabMPI
2. pMatlab
3. SCV's parallel MATLAB
4. Distributed Computing Toolbox
5. Star-P
1. MatlabMPI
MatlabMPI is a parallel MATLAB package developed at Lincoln Lab in Lexington, MA.
• It does not require or make use of a high speed interconnect for communication among cluster nodes. Instead, it relies on the network file system being visible to, or shared by, all processors. Message passing is then achieved through I/O to the file system.
• It has a small, basic set of utility routines that mimic the functionality of those in the Message Passing Interface (MPI). While MPI routines send and receive messages over a high speed interconnect, the routines in this package accomplish the same tasks via file I/O.
• It is good for "embarrassingly parallel" codes that require only infrequent communication.
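The idea of message passing through a shared file system can be illustrated with plain save and load. This sketch does not use MatlabMPI's own routines: the file naming scheme and the ".done" flag file are assumptions made for illustration, and both the "sender" and the "receiver" run in a single MATLAB session using a temporary directory as a stand-in for the shared file system.

```matlab
commDir = tempname;             % stand-in for a shared directory
mkdir(commDir);
msgFile  = fullfile(commDir, 'msg_from0_to1_tag7.mat');   % hypothetical naming
doneFile = [msgFile '.done'];   % flag: message fully written

% "Send" (would run on rank 0): write the data, then create the flag file
data = magic(4);
save(msgFile, 'data');
fclose(fopen(doneFile, 'w'));

% "Receive" (would run on rank 1): poll for the flag, then read the data
while ~exist(doneFile, 'file')
    pause(0.01);
end
received = load(msgFile, 'data');

rmdir(commDir, 's');            % clean up the temporary directory
```

Every message costs at least one file write, one file read, and some polling, which is why this style of communication suits codes that exchange data only rarely.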
2. pMatlab
pMatlab is a parallel MATLAB package also developed at Lincoln Lab in Lexington, MA. It is built on top of MatlabMPI.
• As such, it inherits all the properties of MatlabMPI. It can be thought of as a set of higher-level wrapper functions that insulate programmers from the lower-level function calls needed to perform parallel tasks.
• It is good for embarrassingly parallel algorithms with a very modest amount of communication.
3. SCV's parallel MATLAB
SCV has a very simple parallel MATLAB package that, like MatlabMPI, is based on the shared network file system concept.
• It is subject to most of the same restrictions as MatlabMPI. However, there are two departures:
    1. It requires only one batch script and two function m-files to be added to your code.
    2. These include a barrier function to synchronize work performed on multiprocessing nodes. This is typically required for codes that contain both serial and parallel sections.
• It is good for embarrassingly parallel algorithms with a very modest amount of communication.
• Email or call Kadin if you want to use any of the above three packages.
An example is given next.
SCV parallel MATLAB – Example 1
    % This example demonstrates the use of multiprocessors to compute C = A + B (matrix size is N^2)
    % Decomposition along columns; can also be decomposed along rows, or both.
    % C(:, range(rank)) = A(:, range(rank)) + B(:, range(rank))
    % In the above, range(rank) is the range of columns as a function of the processor rank
    % range(rank) = rank*n+1:rank*n+n (0<=rank<=nproc-1; n=N/nproc)
    % For simplicity, N is assumed to be divisible by nproc
    N = 8;                                      % size of global matrix A
    I = (1:N)';                                 % generate column vector
    A = I(:, ones(1,N))*10 + I(:, ones(1,N))';  % generate A on current (and all) processes
    [pbegin, pend, rank, nproc] = parallel_info(N);   % query for parallel info
    % rank (0<=rank<=nproc-1) is the current MATLAB process
    n = N/nproc;                                % distributed column size of matrix B
    b = I(:, ones(1,n))*10;                     % generate N x n matrix b (local B)
    c = A(:, pbegin:pend) + b                   % compute local c from A and local b
    save matrix_c;                              % each current dir has its own individual copy of c
SCV parallel MATLAB Example 1 (cont'd)
    % Run barrier to synchronize all processors
    ierr = barrier(rank, nproc);
    % Finally, perform (serial) gather on c of all ranks into C on rank 0
    if (rank == 0)
        C = zeros(N);        % allocate C
        C(:,1:n) = c;        % start with c from rank 0, which is already in memory
        for k = 1:nproc-1
            i = n*k+1;       % beginning location at which c will be inserted
            j = n*k+n;       % end location
            fk = ['../' num2str(k) '/matrix_c'];   % file name of c on process k
            load(fk, 'c');
            C(:,i:j) = c;
        end
        save('../matrixC', 'C');   % save C to parent dir
    end
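To see that the column decomposition reassembles the full result, Example 1 can be emulated serially. This sketch is an assumption-laden stand-in: it replaces the calls to parallel_info and barrier with a loop over ranks in a single MATLAB session, fixes nproc at 4, and applies range(rank) = rank*n+1:rank*n+n exactly as stated in the example's comments.

```matlab
N = 8;                           % size of global matrix
nproc = 4;                       % assumed number of processes
n = N/nproc;                     % columns per process
I = (1:N)';
A = I(:, ones(1,N))*10 + I(:, ones(1,N))';
C = zeros(N);

for rank = 0:nproc-1             % emulate each process in turn
    cols = rank*n+1 : rank*n+n;  % range(rank): columns owned by this rank
    b = I(:, ones(1,n))*10;      % local N x n piece of B
    c = A(:, cols) + b;          % local result
    C(:, cols) = c;              % "gather" into the global matrix
end

B = I(:, ones(1,N))*10;          % global B, for comparison only
fprintf('decomposed result matches A + B: %d\n', isequal(C, A + B));
```

In the real package, each rank would execute only its own loop body and rank 0 would assemble C from the saved files.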
… parallel MATLAB Example 1 – batch script
    #!/bin/csh
    # Example SGE script for running parallel MATLAB jobs on Katana
    # Submit job with the command: qsub batch_sge.scv
    # "#$ qsub_option" is interpreted by qsub as if "qsub_option" was passed to qsub on commandline.
    # Set hard runtime (wallclock) limit, default is 2 hours. Format: -l h_rt=HH:MM:SS
    #$ -l h_rt=2:00:00
    # Merge stderr into the stdout file to reduce clutter.
    #$ -j y
    # Invoke Parallel Environment for N processors. No default value, it must be specified.
    # For MATLAB apps, DO NOT select omp
    #$ -pe 1_per_node 4
    # end of qsub options
    # By default, the script is executed in the directory from which it was submitted
    # with qsub. You might want to change directories before invoking mpirun ...
    cd $PWD
    # running the following script generates multiple concurrent copies of MATLAB
    # Use addpath in startup.m to add path to all necessary matlab m-files
    # batch_sge and sge_matlab should live in either $HOME/bin or $PWD
    sge_matlab $PWD scv_matlab_example.m
SCV parallel MATLAB Example 2
The airplane is represented with patches of quadrilateral elements, and the integral formulation is discretized to yield
[equation not recovered from the slide]
where ψ is the known Neumann boundary condition and φ is the unknown to be solved for.
4. Distributed Computing Toolbox
The MathWorks offers the Distributed Computing Toolbox (DCT), a parallel MATLAB package that uses the cluster's high speed interconnect for inter-processor communication.
• At present, DCT is not available on SCV machines.
5. Star-P
Star-P is a parallel MATLAB product of Interactive Supercomputing, Inc. It resembles the pMatlab package in that it enables parallel MATLAB while shielding programmers from most of the lower-level parallel programming.
• Like the MathWorks' DCT, Star-P uses a high speed interconnect for inter-processor communication.
• At present, this package is not available on SCV machines.
Useful SCV Info
• SCV home page (http://scv.bu.edu/)
• Resource Applications (https://acct.bu.edu/SCF)
• Help
    • Web-based tutorials (http://scv.bu.edu/) (MPI, OpenMP, MATLAB, IDL, Graphics tools)
    • HPC consultations by appointment
        • Kadin Tseng (kadin@bu.edu)
        • Doug Sondak (sondak@bu.edu)
    • help@twister.bu.edu, help@cootie.bu.edu