Five Common Defect Types in Parallel Computing
Prepared for Applied Parallel Computing, Prof. Alan Edelman
Taiga Nakamura, University of Maryland
Introduction
• Debugging and testing parallel programs is hard
• What kinds of mistakes do programmers make?
• How can defects be prevented, or found and fixed effectively?
• Hypothesis: knowing about common defects will reduce time spent debugging
• Here: five common defect types in parallel programming (drawn from last year's classes)
• The examples are in C/MPI; similar defect types likely occur in UPC, CAF, and F/MPI
• Your feedback is solicited (by both us and UMD)!
Defect 1: Language Usage
• Example:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        FILE *fp;
        int status;
        status = MPI_Init(NULL, NULL);
        if (status != MPI_SUCCESS) { return -1; }
        fp = fopen(...);
        if (fp == NULL) { return -1; }
        ...
        fclose(fp);
        MPI_Finalize();
        return 0;
    }

• MPI_Init(NULL, NULL) is valid in MPI 2.0 only; in MPI 1.1 it had to be MPI_Init(&argc, &argv)
• MPI_Finalize must be called by all processors in every execution path; here the early return after a failed fopen skips it
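A minimal corrected sketch of the program above, assuming "input.dat" stands in for the elided file name: MPI_Init takes &argc and &argv so the code also works under MPI 1.1, and MPI_Finalize is reached on every exit path.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        FILE *fp;
        if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
            return -1;                     /* MPI never started; nothing to finalize */
        }
        fp = fopen("input.dat", "r");      /* hypothetical file name */
        if (fp == NULL) {
            MPI_Finalize();                /* finalize before the early exit */
            return -1;
        }
        /* ... work ... */
        fclose(fp);
        MPI_Finalize();
        return 0;
    }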
Use of Language Features
• MPI keywords used in Conjugate Gradient implementations in C/C++ (15 students): 24 functions, 8 constants
• Advanced language features are not necessarily used
• Try to understand a few basic language features thoroughly
Defect 1: Language Usage
• Erroneous use of parallel language features
  – e.g., inconsistent data types between send and recv, usage of memory copy functions in UPC (see the sketch below)
• Simple mistakes in understanding; very common in novices
• Compile-time defects (wrong number or type of parameters, etc.) can be found easily
• Some defects surface only under specific conditions
  – e.g., number of processors, input values, hardware/software environment
• Advice:
  – check unfamiliar language features carefully
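A minimal sketch of the send/recv type-mismatch case mentioned above (hypothetical buffer names and values): the sender transmits MPI_DOUBLE but the receiver posts MPI_FLOAT, so the program compiles cleanly and fails only at run time.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        double data[4] = {1.0, 2.0, 3.0, 4.0};
        float  recv_buf[4];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            if (rank == 0) {
                MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                /* Defect: MPI_FLOAT does not match the sender's MPI_DOUBLE.
                   This compiles fine; at run time MPI typically reports a
                   truncation error or the values arrive as garbage. */
                MPI_Recv(recv_buf, 4, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
                printf("received %f\n", recv_buf[0]);
            }
        }
        MPI_Finalize();
        return 0;
    }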
Defect 2: Space Decomposition
• Example: Game of Life; the xsize × ysize board is divided among processors into horizontal strips, and neighboring processors exchange boundary rows with send/recv

Serial main loop:

    /* Main loop */
    for (y = 0; y < ysize; y++) {
        for (x = 0; x < xsize; x++) {
            c = count(buffer, xsize, ysize, x, y);
            /* update buffer ... */
        }
    }

Parallel version:

    MPI_Comm_size(MPI_COMM_WORLD, &np);
    ysize /= np;
    /* MPI_Send, MPI_Recv ... */
    /* Main loop */
    for (y = 0; y < ysize; y++) {
        for (x = 0; x < xsize; x++) {
            c = count(buffer, xsize, ysize, x, y);
            /* update buffer ... */
        }
    }

• ysize may not be divisible by np
• The loop boundaries must be changed (there are other approaches too)
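A hedged sketch of one way to deal with the divisibility problem noted above (one of the "other approaches", not the only fix): give the first ysize % np processors one extra row. Variable names my_rows and my_start are hypothetical helpers.

    /* Sketch: block decomposition that tolerates ysize not divisible by np. */
    int np, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int base = ysize / np;                     /* rows every processor gets */
    int rem  = ysize % np;                     /* leftover rows             */
    int my_rows  = base + (rank < rem ? 1 : 0);
    int my_start = rank * base + (rank < rem ? rank : rem);   /* global index of first local row */

    for (int y = 0; y < my_rows; y++) {
        int gy = my_start + y;                 /* global row index, if needed */
        for (int x = 0; x < xsize; x++) {
            /* c = count(buffer, xsize, my_rows, x, y); update buffer ... */
        }
    }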
Defect 2: Space Decomposition
• Incorrect mapping between the problem space and the program's memory space
• The mapping in the parallel version can differ from that in the serial version
  – the array origin is different on every processor
  – additional memory space for communication can complicate the mapping logic
• Symptoms:
  – segmentation fault (if an array index goes out of range)
  – incorrect, or only slightly incorrect, output
Defect 3: Side-Effects of Parallelization
• Example: Monte Carlo approximation of pi

Serial version:

    srand(time(NULL));
    for (i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y < 1) ++k;
    }
    return k / (double)n;

Parallel version:

    int np;
    status = MPI_Comm_size(MPI_COMM_WORLD, &np);
    ...
    srand(time(NULL));
    for (i = 0; i < n; i += np) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y < 1) ++k;
    }
    status = MPI_Reduce( ... MPI_SUM ... );
    ...
    return k / (double)n;

Two side-effects of parallelizing this code:
1. All processors might use the same pseudo-random sequence, spoiling the independence of the samples
2. Hidden serialization in rand() causes a performance bottleneck
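A hedged sketch of one common remedy for issue 1 (not necessarily the course's recommended fix): seed each rank differently so at least the streams are distinct. The seed formula and variable names are illustrative only; distinct seeds do not guarantee statistically independent streams.

    /* Sketch: rank-dependent seeding.  Fragment only: assumes the surrounding
       main/MPI_Init from the example, plus <stdlib.h> and <time.h>. */
    int np, rank;
    long k = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    srand((unsigned)time(NULL) + 1234u * (unsigned)rank);  /* each rank gets its own seed */
    for (long i = rank; i < n; i += np) {                  /* each rank takes every np-th sample */
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y < 1) ++k;
    }

    long total = 0;
    MPI_Reduce(&k, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    /* on rank 0, pi is approximately 4.0 * total / (double)n */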
Defect 3: Side-Effects of Parallelization
• Example: File I/O

    FILE *fp = fopen(...);
    if (fp != NULL) {
        while (...) { fscanf(...); }
        fclose(fp);
    }

• The filesystem may become a performance bottleneck if all processors access the same file simultaneously
• Schedule I/O carefully, or let a "master" processor do all the I/O (a sketch follows)
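A hedged sketch of the "master does all I/O" advice, with a hypothetical file name and data layout: rank 0 reads the file and broadcasts the contents, so the other ranks never touch the filesystem. Real code would also broadcast an error flag when fopen fails.

    /* Sketch: rank 0 reads the input, everyone else gets it via MPI_Bcast.
       Fragment: assumes MPI_Init has been called and <stdio.h> is included. */
    #define NVALS 1024
    double vals[NVALS];
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        FILE *fp = fopen("input.dat", "r");        /* hypothetical file name */
        if (fp != NULL) {
            for (int i = 0; i < NVALS; i++) fscanf(fp, "%lf", &vals[i]);
            fclose(fp);
        }
    }
    MPI_Bcast(vals, NVALS, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* data reaches all ranks */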
Defect 3: Side-Effects of Parallelization
• Typical parallel programs contain only a few parallel primitives; the rest of the code is a sequential program running in parallel
• Ordinary serial constructs can cause correctness or performance defects when they are used in a parallel context
• Advice:
  – don't focus only on the parallel code
  – check that the serial code works on one processor, but remember that the defect may surface only in a parallel context
Defect 4: Performance
• Example: load balancing

    myN = N / (numProc - 1);
    if (myRank != 0) {
        for (i = 0; i < myN; i++) {
            if (...) { ++myHits; }
        }
    }
    MPI_Reduce(&myHits, &totalHits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

• The rank 0 "master" processor just waits while the other "worker" processors execute the loop
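A hedged sketch of a more balanced division (one option among several): let every rank, including rank 0, take a share of the N iterations. The variable names follow the example above; the elided loop body is unchanged.

    /* Sketch: all numProc ranks share the work, including rank 0. */
    myN = N / numProc + (myRank < N % numProc ? 1 : 0);   /* spread the remainder too */
    for (i = 0; i < myN; i++) {
        if (...) { ++myHits; }                            /* same per-iteration work as before */
    }
    MPI_Reduce(&myHits, &totalHits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);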
Defect 4: Performance
• Example: scheduling (each rank owns rows y1..y2 of the board and exchanges boundary rows with its neighbors)

    if (rank != 0) {
        MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
        MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
    }
    if (rank != (size-1)) {
        MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
    }

• The exchanges serialize into a chain:
  #1 Send → #0 Recv → #0 Send → #1 Recv
  #2 Send → #1 Recv → #1 Send → #2 Recv
  #3 Send → #2 Recv → #2 Send → #3 Recv
• Communication therefore requires O(size) time; a "correct" solution takes O(1) (see the sketch below)
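A hedged sketch of an O(1)-step exchange using MPI_Sendrecv, so every rank talks to both neighbors without waiting for the chain above (one of several ways to do it; MPI_PROC_NULL turns the missing-neighbor cases into no-ops).

    /* Sketch: constant-time boundary exchange.  Assumes board, y1, y2,
       Xsize, rank, size, tag as in the example, and that the local buffer
       has valid ghost rows at y1-1 and y2+1. */
    int up   = (rank != 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank != size - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Status status;

    /* exchange with the previous rank: send my first row, receive its last row */
    MPI_Sendrecv(board[y1],   Xsize, MPI_CHAR, up,   tag,
                 board[y1-1], Xsize, MPI_CHAR, up,   tag,
                 MPI_COMM_WORLD, &status);
    /* exchange with the next rank: send my last row, receive its first row */
    MPI_Sendrecv(board[y2],   Xsize, MPI_CHAR, down, tag,
                 board[y2+1], Xsize, MPI_CHAR, down, tag,
                 MPI_COMM_WORLD, &status);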
Defect 4: Performance
• Scalability problem: the processors are not actually working in parallel, even though the program output is correct
• Perfect parallelization is often difficult; you need to judge whether the execution speed is acceptable
• Symptoms: sub-linear scalability, performance far below expectations (e.g., most of the time spent waiting), unbalanced amounts of computation
• Load balance may depend on the input data
• Advice:
  – make sure all processors are "working" in parallel
  – a profiling tool might help
Defect 5: Synchronization
• Example: deadlock (same strip decomposition as before)

    MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
    MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
    MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
    MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);

• An obvious example of deadlock (you can't avoid noticing this): every rank blocks on its first Recv
  #0 Recv → deadlock
  #1 Recv → deadlock
  #2 Recv → deadlock
Defect 5: Synchronization
• Example: deadlock (a subtler variant)

    MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
    MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
    MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
    MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);

• This may work, but it can cause deadlock with some implementations and parameters:
  #0 Send → deadlock if MPI_Send is blocking
  #1 Send → deadlock if MPI_Send is blocking
  #2 Send → deadlock if MPI_Send is blocking
• A "correct" solution could (1) alternate the order of send and recv, (2) use MPI_Bsend with sufficient buffer size, (3) use MPI_Sendrecv, or (4) use MPI_Isend/Irecv (see http://www.mpi-forum.org/docs/mpi-11-html/node41.html); a sketch of option (1) follows
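A hedged sketch of option (1), alternating the order of send and recv by rank parity, so that in each neighbor pair one side sends while the other receives and no blocking send waits on an unposted receive. Variable names follow the example; this is one arrangement among several that work.

    /* Sketch: parity-ordered exchange.  Assumes board, y1, y2, Xsize,
       rank, size, tag, status as in the example. */
    if (rank % 2 == 0) {
        /* even ranks: exchange with the next rank first, sending first in each pair */
        if (rank != size-1) {
            MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
            MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        }
        if (rank != 0) {
            MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
            MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
        }
    } else {
        /* odd ranks: mirror order, receiving first in each pair */
        if (rank != 0) {
            MPI_Recv(board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
            MPI_Send(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD);
        }
        if (rank != size-1) {
            MPI_Recv(board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
            MPI_Send(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD);
        }
    }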
Defect 5: Synchronization
• Example: barriers

    for (...) {
        MPI_Isend(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &request);
        MPI_Recv (board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Isend(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &request);
        MPI_Recv (board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);
    }

• Synchronization (e.g., MPI_Barrier) is needed at each iteration; otherwise a rank can run ahead and reuse the boundary rows while the previous iteration's non-blocking sends are still pending
• But too many barriers can cause a performance problem
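A hedged sketch of one way to add the missing per-iteration synchronization: complete each non-blocking send with MPI_Waitall before the buffers are reused, plus an explicit barrier as the slide suggests. The loop bound niter is a hypothetical stand-in for the elided loop header, and whether the barrier is strictly necessary depends on the rest of the loop.

    /* Sketch: complete the pending sends each iteration before reusing the rows.
       Assumes board, y1, y2, Xsize, rank, size, tag as above; boundary ranks
       would guard the calls for the missing neighbor (or use MPI_PROC_NULL). */
    MPI_Request req[2];
    MPI_Status  st[2], status;

    for (int iter = 0; iter < niter; iter++) {       /* niter: hypothetical iteration count */
        MPI_Isend(board[y1],   Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &req[0]);
        MPI_Recv (board[y2+1], Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &status);
        MPI_Isend(board[y2],   Xsize, MPI_CHAR, rank+1, tag, MPI_COMM_WORLD, &req[1]);
        MPI_Recv (board[y1-1], Xsize, MPI_CHAR, rank-1, tag, MPI_COMM_WORLD, &status);

        /* ... update the local part of the board ... */

        MPI_Waitall(2, req, st);                     /* sends complete; rows safe to reuse */
        MPI_Barrier(MPI_COMM_WORLD);                 /* per-iteration synchronization */
    }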
Defect 5: Synchronization
• A well-known defect type in parallel programming: races and deadlocks
• Some of these defects can be very subtle
• Use of asynchronous (non-blocking) communication can lead to more synchronization defects
• Symptoms: the program hangs, or produces incorrect or non-deterministic output
• This particular example derives from insufficient understanding of the language specification
• Advice:
  – make sure that all communications are correctly coordinated
Summary
• This is a first cut at understanding common defects in parallel programming