Parallel Maximum Likelihood Fitting Using MPI
Brian Meadows, U. Cincinnati and David Aston, SLAC
What is MPI ?
• “Message Passing Interface” - a standard defined for passing messages between processors (CPUs)
• Communications interface to Fortran, C or C++ (maybe others)
• Definitions apply across different platforms (can mix Unix, Mac, etc.)
• Parallelization of code is explicit - recognized and defined by the user
• Memory can be
  • Shared between CPUs
  • Distributed among CPUs OR
  • A hybrid of these
• The number of CPUs allowed is not pre-defined, but is fixed in any one application
• The required number of CPUs is defined by the user at job startup and does not undergo runtime optimization
How Efficient is MPI ?
• The best you can do is speed up a job by a factor equal to the number of physical CPUs involved.
• Factors limiting this:
  • Poor synchronization between CPUs due to unbalanced loads
  • Sections of code that cannot be vectorized
  • Signalling delays
• NOTE - it is possible to request more CPUs than physically exist
  • This will produce some overhead in processing, though!
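The limit set by code that cannot be vectorized can be made quantitative with Amdahl's law (not on the original slide, added here as a reminder): if a fraction p of the run time is parallelizable over N CPUs, the best possible speed-up is

      S(N) = 1 / [ (1 - p) + p/N ]

so, for example, with p = 0.95 the speed-up can never exceed a factor of 20, however many CPUs are used.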
Running MPI
• Run the program with mpirun -np N <job>, which submits N identical jobs to the system (you can also specify IP addresses for distributed CPUs)
• The OS in each machine allocates physical CPUs dynamically as usual.
• Each job
  • is given an ID (0 to N-1) which it can access
  • needs to be in an identical environment to the others
• Users can use this ID to label a main job (“JOB0” for example) and the remaining “satellite” jobs.
Fitting with MPI
• For a fit, each job should be structured to be able to run the parts it is required to do:
  • Any set-up (read in events, etc.)
  • The parts that are vectorized (e.g. its group of events or parameters)
• One job needs to be identified as the main one (“JOB0”) and must do everything, farming out groups of events or parameters to the others.
• Each satellite job must send results (“signals”) back to JOB0 when done with its group, and await a return “signal” from JOB0 before it starts again (see the sketch below).
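The skeleton below is a minimal sketch of this structure in Fortran with MPI, not the code used in the analysis; the parameter array pars, the flag iflag and the dummy arithmetic standing in for the likelihood sum are invented for illustration. JOB0 (rank 0) broadcasts a continue/stop flag and the current parameters, every job adds up its own share, and MPI_REDUCE combines the partial sums on JOB0; the satellites simply loop until the stop flag arrives.

      Program FIT_SKELETON
C     Minimal sketch (hypothetical) of the JOB0 / satellite structure.
      Implicit none
      include 'mpif.h'
      Integer ierr, rank, nprocs, iflag, icall
      Double precision pars(2), psum, total
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
C     ... every job would read its events and do its own set-up here ...
      If (rank .eq. 0) Then
C       JOB0: in the real fit MINUIT calls FCN here; three dummy
C       "FCN calls" stand in for it
        Do icall = 1, 3
          iflag = 1
          pars(1) = icall
          pars(2) = 2*icall
          call MPI_BCAST(iflag, 1, MPI_INTEGER, 0,
     &                   MPI_COMM_WORLD, ierr)
          call MPI_BCAST(pars, 2, MPI_DOUBLE_PRECISION, 0,
     &                   MPI_COMM_WORLD, ierr)
C         stand-in for JOB0's own share of the likelihood sum
          psum = pars(1) + pars(2)
          call MPI_REDUCE(psum, total, 1, MPI_DOUBLE_PRECISION,
     &                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)
          print *, 'call', icall, ' combined sum =', total
        End Do
C       tell the satellites the fit is over
        iflag = 0
        call MPI_BCAST(iflag, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      Else
C       satellites: wait for a signal, do their share, return it, repeat
   10   call MPI_BCAST(iflag, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
        If (iflag .ne. 0) Then
          call MPI_BCAST(pars, 2, MPI_DOUBLE_PRECISION, 0,
     &                   MPI_COMM_WORLD, ierr)
C         stand-in for this job's share of the likelihood sum
          psum = pars(1) + pars(2)
          call MPI_REDUCE(psum, total, 1, MPI_DOUBLE_PRECISION,
     &                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)
          GO TO 10
        End If
      End If
      call MPI_FINALIZE(ierr)
      End

In the real fit, MINUIT runs only on JOB0 and the broadcasts happen inside the vectorized routine (see the MNDERI listing at the end of this talk).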
How MPI Runs
[Diagram: “Scatter-Gather” running - mpirun starts N copies of the job; CPU 0 runs continuously while CPUs 1, 2, … alternate between work and “Wait”, through the “Start”, “Scatter” and “Gather” phases.]
Ways to Implement MPI in Maximum Likelihood Fitting
Two main alternatives:
  A. Vectorize FCN - which evaluates f(x) = -2 Σ ln W
  B. Vectorize MINUIT (which finds the best parameters)
• Alternative A has been used in previous BaBar analyses
  • E.g. the mixing analysis of D0 → K+π-
• Alternative B is reported here (done by DYAEB and tested by BTM)
• An advantage of B over A is that the vectorization is implemented outside the user’s code.
• Vectorizing FCN may not be efficient if an integral is computed on each call, unless the integral evaluation is also vectorized.
Vectorize FCN
• The log-likelihood always includes a sum:
      f(x) = -2 ln L = -2 Σ_i ln W(x_i),  i = 1 … n
  where n = number of events (or bins).
• Vectorize the computation of this sum in 2 steps (“Scatter-Gather”):
  • Scatter: Divide up the events (or bins) among the N CPUs. CPU j computes its partial sum
      S_j = Σ ln W(x_i)  (sum over the events i in group j)
  • Gather: Re-combine the N CPUs:
      f(x) = -2 ( S_1 + S_2 + … + S_N )
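A minimal sketch of the Scatter-Gather sum itself, assuming every job already holds the full event sample in memory; the event array x, the stand-in PDF W (a unit Gaussian) and the sample size NEVT are invented for illustration:

      Program FCN_SUM
C     Sketch: each CPU sums ln W over its own group of events,
C     then the partial sums are combined on CPU 0 (hypothetical example).
      Implicit none
      include 'mpif.h'
      Integer ierr, rank, nprocs, i, i1, i2, nper
      Integer NEVT
      Parameter (NEVT = 100000)
      Double precision x(NEVT), W, Sj, Stot, f
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
C     every job holds the full event sample (stand-in values here)
      Do i = 1, NEVT
        x(i) = -1.0d0 + 2.0d0*dble(i)/dble(NEVT)
      End Do
C     Scatter: this job takes events i1..i2
      nper = (NEVT - 1)/nprocs + 1
      i1 = 1 + nper*rank
      i2 = MIN(NEVT, i1 + nper - 1)
      Sj = 0.0d0
      Do i = i1, i2
C       W = value of the normalized PDF at event i; unit Gaussian stand-in
        W = exp(-0.5d0*x(i)*x(i)) / sqrt(2.0d0*3.14159265358979d0)
        Sj = Sj + log(W)
      End Do
C     Gather: combine the partial sums on CPU 0
      call MPI_REDUCE(Sj, Stot, 1, MPI_DOUBLE_PRECISION, MPI_SUM,
     &                0, MPI_COMM_WORLD, ierr)
      If (rank .eq. 0) Then
        f = -2.0d0*Stot
        print *, '-2 lnL =', f
      End If
      call MPI_FINALIZE(ierr)
      End

The normalization integral discussed on the next slide can be accumulated in the same way, as a partial sum over each job's bins combined with a second MPI_REDUCE.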
Vectorize FCN
• The computation of the integral (the normalization of W) also needs to be vectorized.
  • This is usually a sum (over bins), so it can be done in the same way.
• Main advantage of this method:
  • Assuming the function evaluation dominates the CPU cycles, the gain coefficient is close to 1.0, independent of the number of CPUs or parameters.
• Main disadvantage:
  • It requires that the user code each application appropriately.
Vectorize MINUIT
• There are several algorithms in MINUIT:
  • MIGRAD (variable metric algorithm)
    • Finds the local minimum and the error matrix at that point
  • SIMPLEX (Nelder-Mead method)
    • Derivative-free simplex search (not the linear-programming simplex method)
  • SEEK (MC method)
    • Random search - virtually obsolete
• MIGRAD is used most often - so focus on that
• It is easily vectorized, but the result may not be at the highest efficiency
One iteration in MIGRAD
• Compute the function F and gradient g at the current position x.
• Use the current curvature metric V (the estimate of the inverse Hessian) to compute the step direction:
      δ = -V g
• Take a (large) step along δ.
• Compute the function and gradient there, then (cubic) interpolate back to the local minimum along the step (may need to iterate).
• If satisfactory, improve the curvature metric V.
One iteration in MIGRAD
• Most of the time is spent in computing the gradient.
• Numerical evaluation of the gradient requires 2 FCN calls per parameter:
      g_i = [ F(x + d_i e_i) - F(x - d_i e_i) ] / (2 d_i)
  where e_i is the unit vector for parameter i and d_i is a small step in that parameter.
• Vectorize this computation in two steps (“Scatter-Gather”):
  • Scatter: Divide up the parameters x_i among the N CPUs. Each CPU computes the g_i for its own subset.
  • Gather: Re-combine the N CPUs.
Vectorize MIGRAD
• This is less efficient the smaller the number of parameters.
• It works well if NPAR is comparable to the number of CPUs:
      Gain ~ NCPU*(NPAR + 2) / (NPAR + 2*NCPU),   Max. Gain = NCPU
• For 105 parameters, a factor 3.7 was gained with 4 CPUs.
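A quick check of this formula (added here, not on the original slide): with NPAR = 105 and NCPU = 4, Gain ~ 4*(105 + 2) / (105 + 2*4) = 428/113 ≈ 3.8, consistent with the measured factor of 3.7.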
Initialization of MPI

      Program FIT_Kpipi
C
C-    Maximum likelihood fit of D -> Kpipi Dalitz plot.
C
      Implicit none
      Save
      external fcn
      include 'mpif.h'
C     declarations required under Implicit none
      Integer MPIerr, MPIrank, MPIprocs, MPIflag
      MPIerr  = 0
      MPIrank = 0
      MPIprocs= 1
      MPIflag = 1
      call MPI_INIT(MPIerr)                                 ! Initialize MPI
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank, MPIerr)   ! Which one am I ?
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)  ! Get number of CPUs
C     ... call MINUIT, etc.
      call MPI_FINALIZE(MPIerr)
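The job would then be launched with mpirun as described earlier, e.g. mpirun -np 4 fit_kpipi (the executable name and CPU count here are only illustrative).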
Use of Scatter-Gather Mechanism in MNDERI (Fortran)

C     Distribute the parameters from proc 0 to everyone
   33 call MPI_BCAST(X, NPAR+1, MPI_DOUBLE_PRECISION, 0,
     A               MPI_COMM_WORLD, MPIerr)
      …
C     Use scatter-gather mechanism to compute subset of derivatives
C     in each process:
      nperproc = (NPAR-1)/MPIprocs + 1
      iproc1   = 1 + nperproc*MPIrank
      iproc2   = MIN(NPAR, iproc1+nperproc-1)
      call MPI_SCATTER(GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                 GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                 0, MPI_COMM_WORLD, MPIerr)
C
C     Loop over variable parameters
      DO 60 i = iproc1, iproc2
         … compute G(I)
   60 End Do
C
C     Wait until everyone is done:
      call MPI_GATHER(GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                0, MPI_COMM_WORLD, MPIerr)
C     Everyone but proc 0 goes back to await the next set of parameters
      If (MPIrank .ne. 0) GO TO 33
C     … Continue computation (CPU 0 only)
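Note the control flow: statement 33 is the satellites' waiting point. After each gradient computation, every process except proc 0 jumps back to the MPI_BCAST and blocks there until MIGRAD, running only on proc 0, broadcasts the next parameter vector - this is the “signal” mechanism described earlier, implemented entirely inside MNDERI so that the user's FCN needs no changes.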