What is MPI ?
• A standard defined for a “Message Passing Interface” between parallel processors (CPUs)
• A communications interface for Fortran, C or C++ (possibly other languages)
• Definitions apply across different platforms (can mix Unix, Mac, etc.)
• Parallelization of the code is explicit - recognized and defined by the user
• Memory can be
  • Shared between CPUs
  • Distributed among CPUs, OR
  • A hybrid of these
• The number of CPUs allowed is not pre-defined, but is fixed in any one application
• The required number of CPUs is defined by the user at job startup and does not undergo runtime optimization.
B. Meadows, Dalitz Mixing, Oct 24th, 2007
How Efficient is MPI ?
• The best you can do is to speed up a job by a factor equal to the number of CPUs involved.
• Factors limiting this:
  • Poor synchronization between CPUs due to unbalanced loads
  • Sections of code that cannot be vectorized (this sets the hard limit on the speed-up; see the bound sketched below)
  • Signalling delays.
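The slides do not give a formula for this limit, but it is the familiar Amdahl-type bound. A minimal sketch, assuming a fraction s of the run time is serial (cannot be vectorized) and the rest parallelizes perfectly over N CPUs:

  % Hedged sketch, not from the original slides: Amdahl-type bound on the speed-up.
  % s = serial (non-vectorized) fraction of the run time, N = number of CPUs.
  \[
    \mathrm{Speedup}(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,}
    \;\xrightarrow[\;N\to\infty\;]{}\; \frac{1}{s}
  \]

For example, with s = 0.05 and N = 8 the best attainable speed-up is about 5.9, not 8 (numbers chosen here for illustration).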
Ways to Implement MPI in ML Fitting
Two main alternatives:
• A: Vectorize FCN - the user routine that evaluates f(x) = -2 Σ ln W
• B: Vectorize MINUIT itself
• Alternative A has been used in previous BaBar analyses
  • E.g. the mixing analysis of D0 → K+π−
• Alternative B is reported here (done by DYAEB and tested by BTM)
• An advantage of B over A is that the vectorization is implemented outside the user’s code.
• Vectorizing FCN may not be efficient if an integral is computed on each call, unless the integral evaluation is also vectorized.
Vectorize FCN
• The log-likelihood always includes a sum over the n events (or bins):
  f = -2 ln L = -2 Σ(i=1..n) ln Wi
• Vectorize the computation of this sum in 2 steps (“Scatter-Gather”), as in the sketch below:
  • Scatter: Divide up the events (or bins) among the N CPUs. Each CPU k computes its partial sum fk = -2 Σ(its subset) ln Wi
  • Gather: Re-combine the N CPUs: f = Σ(k=1..N) fk
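A minimal Fortran sketch of this Scatter-Gather sum, not the actual BaBar FCN: the names FCN_PARTIAL, EVDATA, NEVT and the density function WDENS are chosen here for illustration. Each process sums -2 ln W over its own share of the events, and MPI_REDUCE adds the partial sums on process 0.

C     Hedged sketch (not the original FCN).  Each process handles events
C     i1..i2 only, then the partial sums are added on process 0.
      Subroutine FCN_PARTIAL(NEVT, EVDATA, X, FVAL)
      Implicit none
      Include 'mpif.h'
      Integer NEVT, MPIerr, MPIrank, MPIprocs
      Integer i, nper, i1, i2
      Double precision EVDATA(NEVT), X(*), FVAL, FPART
      Double precision WDENS
      External WDENS
C
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank,  MPIerr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)
C
C     “Scatter”: this process handles events i1..i2 only
      nper = (NEVT - 1)/MPIprocs + 1
      i1   = 1 + nper*MPIrank
      i2   = MIN(NEVT, i1 + nper - 1)
C
      FPART = 0.0D0
      DO 10 i = i1, i2
         FPART = FPART - 2.0D0*LOG( WDENS(EVDATA(i), X) )
   10 Continue
C
C     “Gather”: add the partial sums; the total lands in FVAL on process 0
      call MPI_REDUCE(FPART, FVAL, 1, MPI_DOUBLE_PRECISION,
     A                MPI_SUM, 0, MPI_COMM_WORLD, MPIerr)
      End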
Vectorize FCN
• The computation of the normalization integral also needs to be vectorized.
• Since it is also usually a sum (over bins), this can be done in the same Scatter-Gather way: each CPU sums its share of the bins and the partial integrals are then added together.
Vectorize MINUIT
• Several algorithms are available in MINUIT:
  • MIGRAD (variable-metric algorithm): finds the local minimum and the error matrix at that point
  • SIMPLEX (Nelder-Mead method): direct-search simplex method, no derivatives
  • SEEK (MC method): random search – virtually obsolete
• Most often used is MIGRAD – so focus on that
• It is easily vectorized, but the result may not be the highest efficiency
Vectorize MIGRAD
• WARNING: This is not very efficient when the number of parameters is comparable to the number of CPUs.
• Approximate gain per iteration (a worked example follows):
  Gain ~ (2*NPAR + 4) / (2*NPAR/NCPU + 4)
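A hedged numerical illustration of this formula (the parameter counts are chosen here, not taken from the slides): with NPAR = 40 free parameters and NCPU = 8 processes,

  % Worked example of the gain formula above (illustrative numbers only).
  \[
    \mathrm{Gain} \;\approx\; \frac{2\cdot 40 + 4}{\,2\cdot 40/8 + 4\,}
    \;=\; \frac{84}{14} \;=\; 6.0
  \]

i.e. only a factor 6 from 8 CPUs, and the ratio gets worse as NPAR drops toward NCPU.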
One iteration in MIGRAD
• Compute the function f and gradient g at the current position x
• Use the current curvature metric (estimate of the inverse Hessian) to compute the step direction (sketched below)
• Take a (large) step along that direction
• Compute the function and gradient there, then interpolate (cubic) back to the local minimum along the line (may need to iterate)
• If satisfactory, improve the curvature metric
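The step formulas on this slide did not survive extraction; the following is a minimal sketch of a standard variable-metric (quasi-Newton) iteration of the kind MIGRAD implements, with the notation chosen here (V = running estimate of the inverse Hessian, g = gradient), and with a DFP-type rank-2 update shown purely for illustration:

  % Hedged sketch of one variable-metric iteration (notation chosen here).
  % x = current parameters, g = gradient of f at x, V = inverse-Hessian estimate.
  \[
    d = -\,V\,g, \qquad
    x' = x + \alpha\,d \quad(\text{line search / cubic interpolation in }\alpha)
  \]
  \[
    V' = V + \frac{\delta\,\delta^{T}}{\delta^{T}\gamma}
           - \frac{V\gamma\,\gamma^{T}V}{\gamma^{T}V\gamma},
    \qquad \delta = x' - x,\ \ \gamma = g' - g
    \quad(\text{one common rank-2 update})
  \]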
One iteration in MIGRAD
• Most of the time is spent in computing the gradient.
• Numerical evaluation of the gradient requires 2 FCN calls per parameter (see the formula below).
• Vectorize this computation in two steps (“Scatter-Gather”):
  • Scatter: Divide up the parameters xi among the N CPUs. Each CPU computes the derivatives G(i) for its own subset of parameters.
  • Gather: Re-combine the results from the N CPUs.
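A minimal sketch of the two-call numerical derivative assumed here, written as a central difference (the step sizes h_i are chosen by MINUIT; the exact prescription is not reproduced here):

  % Hedged sketch: two FCN calls per parameter via a central difference.
  \[
    g_i \;=\; \frac{\partial f}{\partial x_i}
         \;\approx\; \frac{f(x + h_i e_i) - f(x - h_i e_i)}{2\,h_i},
    \qquad i = 1,\dots,\mathrm{NPAR}
  \]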
Running MPI
[Diagram: timeline of the processes through the “Start”, “Scatter” and “Gather” phases. CPU 0 runs through all phases, while CPU 1, CPU 2, … work only on their scattered portion and otherwise wait.]
Running MPI
• Run the program with
  mpirun -np N <job>
  which submits N identical copies of the job, one per CPU (you can also specify the IP addresses of the machines to use).
• Each CPU must have all the data it needs to compute f(x)
• So you need to structure each job to be able to run the parts you wish it to do:
  • Any set up (read in events, etc.)
  • The parts that are vectorized
  • BUT only the parts you want:
    • Make each worker wait for a signal when its vectorized part is done (see the sketch below).
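One way to arrange this “wait for a signal” is a small command loop in every process except rank 0. This is a hedged sketch only: the routine name WORKER_LOOP, the command codes in ICMD and the fixed size of X are all chosen here, not taken from the analysis code, and FCN_PARTIAL refers to the earlier sketch.

C     Hedged sketch of a worker loop (names and command codes chosen here).
C     Rank 0 broadcasts a command; the workers either evaluate their share
C     of the likelihood sum or shut down.
      Subroutine WORKER_LOOP(NEVT, EVDATA, NPAR)
      Implicit none
      Include 'mpif.h'
      Integer NEVT, NPAR, ICMD, MPIerr
      Double precision EVDATA(NEVT), FVAL
      Double precision X(100)                 ! assumes NPAR <= 100 in this sketch
C
   10 Continue
C        Wait for the next command from process 0 (1 = evaluate, 0 = stop)
         call MPI_BCAST(ICMD, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, MPIerr)
         If (ICMD .eq. 0) Return
C        Receive the current parameters, compute this process's partial sum
         call MPI_BCAST(X, NPAR, MPI_DOUBLE_PRECISION, 0,
     A                  MPI_COMM_WORLD, MPIerr)
         call FCN_PARTIAL(NEVT, EVDATA, X, FVAL)
      GO TO 10
      End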
Initialization of MPI

      Program FIT_Kpipi
C
C-    Maximum likelihood fit of D -> Kpipi Dalitz plot.
C
      Implicit none
      Save
      External fcn
      Include 'mpif.h'
      Integer MPIerr, MPIrank, MPIprocs, MPIflag
C
      MPIerr  = 0
      MPIrank = 0
      MPIprocs= 1
      MPIflag = 1
      call MPI_INIT(MPIerr)                                 ! Initialize MPI
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank,  MPIerr)  ! Which one am I ?
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)  ! Get number of CPUs
C
C     … call MINUIT, etc.
C
      call MPI_FINALIZE(MPIerr)
      End
Use of Scatter-Gather Mechanism in MNDERI (Fortran)

C     Distribute the parameters from proc 0 to everyone
   33 call MPI_BCAST(X, NPAR+1, MPI_DOUBLE_PRECISION, 0,
     A               MPI_COMM_WORLD, MPIerr)
C     …
C     Use scatter-gather mechanism to compute a subset of the
C     derivatives in each process:
      nperproc = (NPAR-1)/MPIprocs + 1
      iproc1   = 1 + nperproc*MPIrank
      iproc2   = MIN(NPAR, iproc1+nperproc-1)
      call MPI_SCATTER(GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                 GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                 0, MPI_COMM_WORLD, MPIerr)
C
C     Loop over this process's variable parameters
      DO 60 i = iproc1, iproc2
C        … compute G(i)
   60 Continue
C
C     Wait until everyone is done, collecting all derivatives on proc 0:
      call MPI_GATHER(GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                0, MPI_COMM_WORLD, MPIerr)
C
C     Everyone but proc 0 goes back to await the next set of parameters
      If (MPIrank .ne. 0) GO TO 33
C
C     … Continue computation (CPU 0 only)