Introduction to MPI, OpenMP, Threads
Gyan Bhanot, gyan@ias.edu, gyan@us.ibm.com
IAS Course 10/12/04 and 10/13/04
Download the tar file from clustermgr.csb.ias.edu: ~gyan/course/all.tar.gz
It has many MPI codes plus .doc files with information on optimization and parallelization for the IAS cluster.
P655 Cluster
Type "qcpu" to get machine specs.
IAS Cluster Characteristics (qcpu, pmcycles)
IBM P655 cluster. Each node has its own copy of AIX, which is IBM's Unix OS.
clustermgr: 2-CPU POWER4, 1200 MHz; 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B L1 data cache line size; 1536 KB L2 cache; data TLB: size = 1024, associativity = 4; instruction TLB: size = 1024, associativity = 4.
node1 to node6: 8 CPUs/node, POWER4 P655, 1500 MHz; 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B data cache line size; 1536 KB L2 cache; data TLB: size = 1024, associativity = 4; instruction TLB: size = 1024, associativity = 4.
Distributed-memory architecture, shared memory within each node.
Shared file system: GPFS, lots of disk space.
Run pingpong tests to determine latency and bandwidth.
/*----------------------*/
/* Parallel hello world */
/*----------------------*/
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    int taskid, ntasks;
    double pi;

    /*------------------------------------*/
    /* establish the parallel environment */
    /*------------------------------------*/
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /*------------------------------------*/
    /* say hello from each MPI task       */
    /*------------------------------------*/
    printf("Hello from task %d.\n", taskid);

    if (taskid == 0) pi = 4.0*atan(1.0);
    else             pi = 0.0;

    /*------------------------------------*/
    /* do a broadcast from node 0 to all  */
    /*------------------------------------*/
    MPI_Bcast(&pi, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("node %d: pi = %.10lf\n", taskid, pi);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return(0);
}
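To build and run this on the cluster, something like the following should work (a sketch only: mpcc is IBM's MPI C compiler script on AIX, and the host file name hf mirrors the poe example on the last slide, but the exact names depend on the local installation):

mpcc hello.c -o hello -lm
poe hello -procs 4 -hfile hf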
OUTPUT FROM hello.c on 4 processors

Hello from task 0.
node 0: pi = 3.1415926536
Hello from task 1.
Hello from task 2.
Hello from task 3.
node 1: pi = 3.1415926536
node 2: pi = 3.1415926536
node 3: pi = 3.1415926536

1. Why is the order messed up?
2. What would you do to fix it?
Answer:

1. The control flow on different processors is not ordered: each task runs its own copy of the executable independently, so each writes its output independently of the others, which makes the combined output unordered.

2. To fix it:

export MP_STDOUTMODE=ordered

Then the output will look like the following:

Hello from task 0.
node 0: pi = 3.1415926536
Hello from task 1.
node 1: pi = 3.1415926536
Hello from task 2.
node 2: pi = 3.1415926536
Hello from task 3.
node 3: pi = 3.1415926536
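MP_STDOUTMODE is specific to IBM's POE environment. A portable alternative (a minimal sketch, assuming the taskid and ntasks variables from hello.c) is to serialize the printing with a barrier loop so each task prints only on its turn:

/* portable ordered output: task 0 prints first, then task 1, and so on */
void ordered_print(int taskid, int ntasks, const char *msg)
{
    for (int turn = 0; turn < ntasks; turn++) {
        if (turn == taskid) {
            printf("%s", msg);
            fflush(stdout);          /* flush before the next task takes its turn */
        }
        MPI_Barrier(MPI_COMM_WORLD); /* everyone waits for the current task */
    }
}

This orders the writes on the tasks themselves; how strictly the lines appear in order at the terminal still depends on how the runtime merges the output streams.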
Pingpong Code on 4 procs of P655 cluster

/* This program times blocking send/receives, and reports the */
/* latency and bandwidth of the communication system. It is   */
/* designed to run with an even number of MPI tasks.          */
msglen =   32000 bytes, elapsed time = 0.3494 msec
msglen =   40000 bytes, elapsed time = 0.4000 msec
msglen =   48000 bytes, elapsed time = 0.4346 msec
msglen =   56000 bytes, elapsed time = 0.4490 msec
msglen =   64000 bytes, elapsed time = 0.5072 msec
msglen =   72000 bytes, elapsed time = 0.5504 msec
msglen =   80000 bytes, elapsed time = 0.5503 msec
msglen =  100000 bytes, elapsed time = 0.6499 msec
msglen =  120000 bytes, elapsed time = 0.7484 msec
msglen =  140000 bytes, elapsed time = 0.8392 msec
msglen =  160000 bytes, elapsed time = 0.9485 msec
msglen =  240000 bytes, elapsed time = 1.2639 msec
msglen =  320000 bytes, elapsed time = 1.5975 msec
msglen =  400000 bytes, elapsed time = 1.9967 msec
msglen =  480000 bytes, elapsed time = 2.3739 msec
msglen =  560000 bytes, elapsed time = 2.7295 msec
msglen =  640000 bytes, elapsed time = 3.0754 msec
msglen =  720000 bytes, elapsed time = 3.4746 msec
msglen =  800000 bytes, elapsed time = 3.7441 msec
msglen = 1000000 bytes, elapsed time = 4.6994 msec

latency = 50.0 microseconds
bandwidth = 212.79 MBytes/sec
(approximate values for MPI_Isend/MPI_Irecv/MPI_Waitall)

3. How do you find the Bandwidth and Latency from this data? (One way is sketched below.)
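One way to answer (a sketch of the usual approach, not necessarily how the pingpong code itself computes its estimates): model the time per message as t(n) ≈ latency + n/bandwidth. The time for very small messages then approaches the latency (the listing above starts at 32000 bytes, so the 50 microsecond figure presumably comes from smaller messages not shown in this excerpt), and the slope of t versus n at large n is 1/bandwidth. For example, between 800000 and 1000000 bytes the slope is (4.6994 - 3.7441) msec / 200000 bytes ≈ 4.8 ns/byte, i.e. roughly 210 MBytes/sec, consistent with the reported 212.79. A least-squares fit over all the (msglen, time) pairs does the same thing more systematically; a minimal sketch (fit_latency_bandwidth is a hypothetical helper, not part of the course code):

/* least-squares fit of t = a + b*n over m (msglen, time) samples:          */
/* a estimates the latency (seconds), 1/b estimates the bandwidth (bytes/s) */
void fit_latency_bandwidth(const double *n, const double *t, int m,
                           double *latency, double *bandwidth)
{
    double sn = 0.0, st = 0.0, snn = 0.0, snt = 0.0;
    for (int i = 0; i < m; i++) {
        sn  += n[i];
        st  += t[i];
        snn += n[i]*n[i];
        snt += n[i]*t[i];
    }
    double b = (m*snt - sn*st) / (m*snn - sn*sn);  /* slope: seconds per byte */
    double a = (st - b*sn) / m;                    /* intercept: startup cost */
    *latency   = a;
    *bandwidth = 1.0 / b;
}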
5. Monte Carlo to Compute π

Main Idea
• Consider the unit square with an embedded circle
• Generate random points inside the square
• Out of N trials, m points land inside the circle
• Then π ≈ 4m/N
• Error ~ 1/√N
• Simple to parallelize
Modeling method: THROW MANY DARTS; FRACTION INSIDE CIRCLE = π/4
[Figure: unit square with axes running from 0 to 1, showing the embedded circle.]
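Before looking at the MPI version on the following slides, note that the core estimator is just a loop over random points. A minimal serial sketch (using drand48 here for brevity; the course code uses the MersenneTwister generator instead):

#include <stdio.h>
#include <stdlib.h>   /* drand48, srand48 */

int main(void)
{
    long trials = 6400000, hits = 0;

    srand48(1357);                      /* arbitrary seed for this sketch */
    for (long t = 0; t < trials; t++) {
        double x = drand48();           /* uniform in [0,1) */
        double y = drand48();
        if (x*x + y*y < 1.0) hits++;    /* dart landed inside the circle */
    }
    printf("pi estimate = %.5f\n", 4.0*(double)hits/(double)trials);
    return 0;
}

The MPI version below simply splits the trials across workers and sums the hits with MPI_Reduce.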
#include <stdio.h>
#include <math.h>
#include <mpi.h>
#include "MersenneTwister.h"

void mcpi(int, int, int);
int monte_carlo(int, int);

//=========================================
// Main Routine
//=========================================
int main(int argc, char * argv[])
{
    int ntasks, taskid, nworkers;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    if (taskid == 0) {
        printf(" #cpus #trials pi(est) err(est) err(abs) time(s) Mtrials/s\n");
    }

    /*--------------------------------------------------*/
    /* do monte-carlo with a variable number of workers */
    /*--------------------------------------------------*/
    for (nworkers = ntasks; nworkers >= 1; nworkers = nworkers/2) {
        mcpi(nworkers, taskid, ntasks);
    }

    MPI_Finalize();
    return 0;
}
//============================================================
// Routine to split tasks into groups and distribute the work
//============================================================
void mcpi(int nworkers, int taskid, int ntasks)
{
    MPI_Comm comm;
    int worker, my_hits, total_hits, my_trials;
    int total_trials = 6400000;
    double tbeg, tend, elapsed, rate;
    double pi_estimate, est_error, abs_error;

    /*---------------------------------------------*/
    /* make a group consisting of just the workers */
    /*---------------------------------------------*/
    if (taskid < nworkers) worker = 1;
    else                   worker = 0;

    MPI_Comm_split(MPI_COMM_WORLD, worker, taskid, &comm);
    if (worker) {
        /*------------------------------------------*/
        /* divide the work among all of the workers */
        /*------------------------------------------*/
        my_trials = total_trials / nworkers;

        MPI_Barrier(comm);
        tbeg = MPI_Wtime();

        /* each worker gets a unique seed, and works independently */
        my_hits = monte_carlo(taskid, my_trials);

        /* add the hits from each worker to get total_hits */
        MPI_Reduce(&my_hits, &total_hits, 1, MPI_INT, MPI_SUM, 0, comm);

        tend = MPI_Wtime();
        elapsed = tend - tbeg;
        rate = 1.0e-6*double(total_trials)/elapsed;

        /* report the results including elapsed times and rates */
        if (taskid == 0) {
            pi_estimate = 4.0*double(total_hits)/double(total_trials);
            est_error   = pi_estimate/sqrt(double(total_hits));
            abs_error   = fabs(M_PI - pi_estimate);
            printf("%6d %9d %9.5lf %9.5lf %9.5lf %8.3lf %9.2lf\n",
                   nworkers, total_trials, pi_estimate, est_error,
                   abs_error, elapsed, rate);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
}
//=========================================
// Monte Carlo worker routine: return hits
//=========================================
int monte_carlo(int taskid, int trials)
{
    int hits = 0;
    int xseed, yseed;
    double xr, yr;

    xseed = 1    * (taskid + 1);
    yseed = 1357 * (taskid + 1);
    MTRand xrandom( xseed );
    MTRand yrandom( yseed );

    for (int t = 0; t < trials; t++) {
        xr = xrandom();
        yr = yrandom();
        if ( (xr*xr + yr*yr) < 1.0 ) hits++;
    }

    return hits;
}
Run the code in ~gyan/course/src/mpi/pi:

poe pi -procs 4 -hfile hf

(using one node, many processors)

 #cpus   #trials   pi(est)  err(est)  err(abs)  time(s)  Mtrials/s  Speedup
     4   6400000   3.14130   0.00140   0.00029    0.134     47.77     3.98
     2   6400000   3.14144   0.00140   0.00016    0.267     23.96     1.997
     1   6400000   3.14187   0.00140   0.00027    0.533     12.00     1.0

Speedup is the one-CPU time divided by the p-CPU time, e.g. 0.533/0.134 ≈ 3.98 on 4 CPUs.