Dealing with MPI Bugs at Scale: Best Practices, Automatic Detection, Debugging, and Formal Verification
Organization • Overview of Usage Errors • Tool Overview • Automatic Error Detection with MUST • Coffee Break • MPI Debugging with DDT • Formal Analysis with ISP • Summary and Comparison • Lunch Break • Hands-On Part 1 • Coffee Break • Hands-On Part 2 • Advanced Topics
MPI was designed to support performance • Complex standard with many operations • Includes non-blocking and collective operations • Can specify messaging choices precisely • Library not required to detect non-compliant usage • Many erroneous or unsafe actions • Incorrect arguments • Resource errors • Buffer usage • Type matching errors • Deadlock • Includes concept of “unsafe” sends
Incorrect Arguments • Incorrect arguments manifest during: • Compilation (type mismatch) • Runtime (crash in MPI or unexpected behavior) • Porting (only manifests for some MPIs/systems) • Example (C): MPI_Send (buf, count, MPI_INTEGER,…);
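A minimal C sketch of the call above (buf, count, dest, and tag are placeholder names): passing the Fortran datatype MPI_INTEGER for a C int buffer is the kind of argument error that may only surface on some MPI implementations; MPI_INT is the matching C datatype.

    int buf[4] = {1, 2, 3, 4};
    int count = 4, dest = 1, tag = 0;
    /* Questionable: MPI_INTEGER describes a Fortran INTEGER, not a C int */
    MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD);
    /* Portable: the C datatype matches the buffer's element type */
    MPI_Send(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);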
Resource Tracking Errors • Many MPI features require resource allocations • Communicators • Data types • Requests • Groups, Error Handlers, Reduction Operations • Simple “MPI_Op leak” example: MPI_Op_create (..., &op); MPI_Finalize ()
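A sketch of the leak above and its fix; my_sum stands for a hypothetical user reduction function of type MPI_User_function, and MPI_Init is assumed to have been called:

    MPI_Op op;
    MPI_Op_create(my_sum, 1 /* commutative */, &op);
    /* ... use op in MPI_Reduce / MPI_Allreduce ... */
    MPI_Op_free(&op);   /* omitting this line is the "MPI_Op leak" shown above */
    MPI_Finalize();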
Dropped and Lost Requests • Two resource errors with message requests • Leaked by creator (i.e., never completed) • Never matched by src/dest (dropped request) • Simple “lost request” example: MPI_Irecv (..., &req); MPI_Irecv (..., &req); MPI_Wait (&req,…)
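One way to repair the fragment above, as a sketch (buf0, buf1, count, and tag are placeholders): keep one request handle per nonblocking call and complete all of them, e.g. with MPI_Waitall.

    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Irecv(buf0, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(buf1, count, MPI_INT, 1, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);   /* every request is completed; none is leaked or dropped */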
Buffer Usage Errors • Buffers passed to MPI_Isend, MPI_Irecv, … • Must not be written to until MPI_Wait is called • Must not be read for non-blocking receive calls • Example: MPI_Irecv (buf, ..., &request); temp = f(buf[i]); MPI_Wait (&request, ...);
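The safe pattern, sketched with placeholder names (buf, count, src, tag, f, i): the receive buffer is only touched after the request completes.

    MPI_Request request;
    MPI_Irecv(buf, count, MPI_INT, src, tag, MPI_COMM_WORLD, &request);
    /* ... unrelated work, but buf must not be read or written here ... */
    MPI_Wait(&request, MPI_STATUS_IGNORE);
    temp = f(buf[i]);   /* reading the receive buffer is legal only after completion */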
MPI Type Matching • Three kinds of MPI type matching • Send buffer type and MPI send data type • MPI send type and MPI receive type • MPI receive type and receive buffer type • Similar requirements for collective operations • Buffer type <=> MPI type matching • Requires compiler support • MPI_BOTTOM, MPI_LB & MPI_UB complicate it • Not provided by our tools
Basic MPI Type Matching Example • MPI standard provides support for heterogeneity • Endian-ness • Data formats • Limitations • Simple example code: Task 0: MPI_Send(1, MPI_INT) | Task 1: MPI_Recv(8, MPI_BYTE) • Do the types match? • Buffer type <=> MPI type: Yes • MPI send type <=> MPI receive type? • NO! • Common misconception
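A sketch of this pair in C (rank is assumed to come from MPI_Comm_rank): even when the byte counts line up, the send and receive type signatures do not match.

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        unsigned char raw[8];
        /* The byte count may "fit", but MPI_BYTE does not match MPI_INT */
        MPI_Recv(raw, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }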
Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 1: Task 0: MPI_Send(1, T1) | Task 1: MPI_Recv(1, T2) • Yes: MPI supports partial receives • Allows efficient algorithms • double <=> double; char <=> char
Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 2: Task 0: MPI_Send(1, T2) | Task 1: MPI_Recv(2, T1) • Yes: • double <=> double • char <=> char • double <=> double
Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 3: Task 0: MPI_Send(2, T1) | Task 1: MPI_Recv(2, T2), and Task 0: MPI_Send(2, T2) | Task 1: MPI_Recv(4, T1) • No! What happens? Nothing good!
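For reference, one way such datatypes could be constructed (a sketch; the struct layouts S1/S2 and the handles T1/T2 are illustrative, and extent/padding handling via MPI_Type_create_resized is omitted):

    #include <stddef.h>   /* offsetof */
    typedef struct { double d; char c; } S1;             /* corresponds to T1 */
    typedef struct { double d; char c; double d2; } S2;  /* corresponds to T2 */
    MPI_Datatype T1, T2;
    int          bl1[2] = {1, 1}, bl2[3] = {1, 1, 1};
    MPI_Aint     dsp1[2] = {offsetof(S1, d), offsetof(S1, c)};
    MPI_Aint     dsp2[3] = {offsetof(S2, d), offsetof(S2, c), offsetof(S2, d2)};
    MPI_Datatype typ1[2] = {MPI_DOUBLE, MPI_CHAR};
    MPI_Datatype typ2[3] = {MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE};
    MPI_Type_create_struct(2, bl1, dsp1, typ1, &T1);
    MPI_Type_create_struct(3, bl2, dsp2, typ2, &T2);
    MPI_Type_commit(&T1);
    MPI_Type_commit(&T2);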
Basic MPI Deadlocks • Unsafe or erroneous MPI programming practices • Code results depend on: • MPI implementation limitations • User input parameters • Classic example code: Task 0: MPI_Send; MPI_Recv | Task 1: MPI_Send; MPI_Recv • Assume the application uses "Thread funneled"
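A sketch of the unsafe pattern between two ranks, plus one common fix (sbuf, rbuf, and N are placeholders; rank comes from MPI_Comm_rank):

    /* Unsafe: both ranks send first; deadlocks once the messages exceed internal buffering */
    int peer = (rank == 0) ? 1 : 0;
    MPI_Send(sbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* One safe alternative: let MPI pair the send and the receive */
    MPI_Sendrecv(sbuf, N, MPI_INT, peer, 0,
                 rbuf, N, MPI_INT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);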
Deadlock Visualization • How to visualize/analyze deadlocks? • Common approach: waiting-for graphs (WFGs) • One node for each rank • Rank X waits for rank Y => node X has an arc to node Y • Consider the previous example: Task 0: MPI_Send; MPI_Recv | Task 1: MPI_Send; MPI_Recv • Visualization: nodes "0: MPI_Send" and "1: MPI_Send" with arcs in both directions • Deadlock criterion: cycle (for simple cases)
Deadlocks with MPI Collectives • Erroneous MPI programming practice • Simple example code: Tasks 0, 1 & 2: MPI_Bcast; MPI_Barrier | Task 3: MPI_Barrier; MPI_Bcast • Possible code results: • Deadlock • Correct message matching • Incorrect message matching • Mysterious error messages • "Wait-for" every task in communicator
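As a C sketch (rank from MPI_Comm_rank; buf and the root are placeholders), the mismatched ordering looks like this:

    if (rank == 3) {
        MPI_Barrier(MPI_COMM_WORLD);                      /* task 3: barrier first */
        MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* tasks 0-2: bcast first */
        MPI_Barrier(MPI_COMM_WORLD);
    }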
Deadlock Visualization – Collectives • What about collectives? • A rank calling a collective waits for all tasks to issue a matching call • One arc to each task that did not issue a matching call • Consider the previous example: Tasks 0, 1 & 2: MPI_Bcast; MPI_Barrier | Task 3: MPI_Barrier; MPI_Bcast • Visualization: nodes "0: MPI_Bcast", "1: MPI_Bcast", "2: MPI_Bcast", and "3: MPI_Barrier", with arcs from each MPI_Bcast node to the MPI_Barrier node and back • Deadlock criterion: still a cycle
Deadlocks with Partial Completions • What about "any" and "some"? • MPI_Waitany/Waitsome and wild-card source receives (MPI_ANY_SOURCE) have special semantics • Wait for at least one of a set of ranks, but not necessarily all • Different from the "waits for all" semantic • Example: Task 0: MPI_Recv(from:1) | Task 1: MPI_Recv(from:ANY) | Task 2: MPI_Recv(from:1) • What happens: • No call can progress, deadlock • 0 waits for 1; 1 waits for either 0 or 2; 2 waits for 1
Schedule Dependent Deadlocks • What about more complex interleavings? • Non-deterministic applications • Interleaving determines what calls match or are issued • Can manifest as bugs that only occur "sometimes" • Example: Task 0: MPI_Send(to:1); MPI_Barrier | Task 1: MPI_Recv(from:ANY); MPI_Recv(from:2); MPI_Barrier | Task 2: MPI_Send(to:1); MPI_Barrier • What happens: • Case A: • Recv (from:ANY) matches send from task 0 • All calls complete • Case B: • Recv (from:ANY) matches send from task 2 • Deadlock: Recv (from:2) can never be matched
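A compact C sketch of this pattern (payloads and tags are arbitrary), which a runtime or formal tool can drive into either interleaving:

    int x = rank, a, b;
    if (rank == 0 || rank == 2) {
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&a, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Deadlocks whenever the wildcard receive above consumed task 2's message */
        MPI_Recv(&b, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);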
Deadlock Visualization – Any Semantic • How to visualize partial completions? • "Waits for all of" wait type => "AND" semantic • "Waits for any of" wait type => "OR" semantic • Each type gets one type of arc • AND: solid arcs • OR: dashed arcs • Visualization for the simple example (Task 0: MPI_Recv(from:1) | Task 1: MPI_Recv(from:ANY) | Task 2: MPI_Recv(from:1)): solid (AND) arcs from "0: MPI_Recv" and "2: MPI_Recv" to "1: MPI_Recv"; dashed (OR) arcs from "1: MPI_Recv" to the other two nodes
Deadlock Criterion • Deadlock criterion for AND + OR • Cycles are necessary but not sufficient • A weakened form of a knot (OR-knot) is the actual criterion • Tools can detect it and visualize the core of the deadlock • Some examples (small WFGs over ranks 0-3): • An OR-knot (which is also a knot): deadlock • A cycle but no OR-knot: not deadlocked • An OR-knot that is not a knot: deadlock
Avoiding Bugs • Comment your code • Confirm consistency with asserts • Consider a verbose mode of your application • Use static analyzers and memory checkers: lint, memcheck (Valgrind) • Use unit testing, or at least provide test cases • Set up nightly builds • MPI Testing Tool: => http://www.open-mpi.org/projects/mtt/ • CTest & Dashboards: => http://www.vtk.org/Wiki/CMake_Testing_With_CTest
Tool Overview – Content • Approaches • Runtime Tools • Debuggers • Tutorial Tools
Tools Overview – Approaches • Debuggers: • Helpful to pinpoint any error • Finding the root cause may be hard • Won't detect sleeping errors, e.g.: if (rank == 1023) crash (); (only works with less than 1024 tasks) • E.g.: gdb, TotalView, DDT • Static analysis: • Compilers and source analyzers • Typically: type and expression errors, e.g.: MPI_Recv (buf, 5, MPI_INT, -1, 123, MPI_COMM_WORLD, &status); with "-1" instead of "MPI_ANY_SOURCE" • E.g.: MPI-Check • Model checking: • Requires a model of your application • State explosion possible • E.g.: MPI-Spin
Tools Overview – Approaches (2) • Runtime error detection: • Inspect MPI calls at runtime • Limited to the timely interleaving that is observed • Causes overhead during the application run • E.g.: Intel Trace Analyzer, Umpire, Marmot, MUST • Example: Task 0: MPI_Send(to:1, type=MPI_INT) | Task 1: MPI_Recv(from:0, type=MPI_FLOAT) => type mismatch detected at runtime
Tools Overview – Approaches (3) • Formal verification: • Extension of runtime error detection • Explores ALL possible interleavings • Detects errors that only manifest in some runs • Possibly many interleavings to explore • E.g.: ISP • Example: Task 0: MPI_Send (to:1); MPI_Barrier () | Task 1: MPI_Recv (from:ANY); MPI_Recv (from:0); MPI_Barrier () | Task 2: spend_some_time(); MPI_Send (to:1); MPI_Barrier () • Deadlock if MPI_Send(to:1)@0 matches MPI_Recv(from:ANY)@1
Tools Overview – Runtime Tools Interpose Between Application and MPI Library • An MPI wrapper library intercepts all MPI calls • Checks analyze the intercepted calls • Local checks -> information from one task • Non-local checks -> information from multiple tasks • Tools can modify or time the actual MPI calls • Layering: the application calls MPI_X(…), the correctness tool's wrapper runs its checks and then calls PMPI_X(…), which enters the MPI library
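A minimal sketch of this interposition technique using the standard PMPI profiling interface; the wrapper and its check are illustrative, not MUST's actual implementation (MPI-3 style const buffer argument assumed):

    #include <mpi.h>
    #include <stdio.h>

    /* Wrapper intercepts MPI_Send: run a (trivial) local check, then forward to the real call */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        if (count < 0)
            fprintf(stderr, "MPI_Send called with a negative count (%d)\n", count);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);   /* real MPI entry point */
    }

Linked ahead of the MPI library (or preloaded), such a wrapper sees every MPI_Send before the library does.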
Tools Overview – Runtime Tools, Workflow • Attach tool to target application • Link library to application • Configure tool • Enable/disable correctness checks • Enable potential integrations (e.g. with debugger) • Run application • Usually a regular mpiexec/mpirun • Non-local checks may require extra resources • Analyze correctness report • Should be available even if the application crashes • Correct bugs and rerun for verification
Tools Overview – Debuggers • Debuggers allow users to take control of processes in an application: • Controlling execution • Interrupting execution – pausing • Set "breakpoints" to pause at a specific line of code • Step line by line, dive into functions • Instant stop at crash locations • Examining state – data and process location • Novel ways to display processes and data to simplify scale • Used for a variety of programming models • GPUs, PGAS, MPI, OpenMP, scalar …
Tools Overview – Debuggers • Debuggers also do more than report state: • Memory debugging detects some common errors automatically: • Reads/writes beyond array bounds • Freeing a pointer twice • Memory leaks (not freeing a pointer) • Complementary to other runtime tools • Marmot, Umpire, MUST, Intel MPI Checker • DAMPI, ISP, … • Integration can pinpoint errors faster
Tools Overview – Debuggers, Workflow • Does the application crash, yield incorrect output, or just hang? • Recompile the application • Add the "-g" flag to the compiler options • Run the application inside the debugger • Usually via a graphical interface – select the number of tasks, application path, and arguments – then "run" • Example: Crashing processes • Let the application run until it crashes • The debugger jumps to the line of code and the processes that crashed • Fix the problem!
Tool Overview – Tutorial Tools • Tools in this tutorial • A runtime error detection tool: MUST • A formal verification tool: ISP • A parallel debugger: DDT • Next: • Introduction to each tool • Workflow for MPI error detection with all tools
MUST – Content • Overview • Features • Infrastructure • Usage • Example • Operation Modes • Scalability
MUST – Overview • MPI runtime error detection tool • Successor of the Marmot and Umpire tools • Marmot provides various local checks • Umpire provides non-local checks • First goal: merge of functionality • Second goal: improved scalability • Open source (BSD license) • Release: At this Supercomputing! • Use of infrastructure for scalability + extensibility: the MUST checks run on top of GTI (Generic Tool Infrastructure), which in turn builds on the PnMPI base infrastructure
MUST – Features • Local checks: • Integer validation • Integrity checks (pointers valid, etc.) • Operation, Request, Communicator, Datatype, Group usage • Resource leak detection • Memory overlap checks • Non-local checks: • Collective verification • Lost message detection • Type matching (For P2P and collectives) • Deadlock detection (with root cause visualization)
MUST – Infrastructure • Extra processes are used to offload checks • Communication in a tree-based overlay network (TBON) • Checks can be distributed • Figure: application processes 0-3 feed a tree of extra tool processes/threads that run the checks
MUST – Usage • Compile and link application as usual • Link against the shared version of the MPI library • Replace "mpiexec" with "mustrun" • E.g.: mustrun -np 4 myApp.exe input.txt output.txt • Inspect "MUST_Output.html" in the run directory • "MUST_Deadlock.dot" exists in case of deadlock • Visualize: dot -Tps MUST_Deadlock.dot -o deadlock.ps • The mustrun script will use an extra process for non-local checks (invisible to the application) • I.e.: "mustrun -np 4 …" will need resources for 5 tasks • Make sure to allocate the extra task for batch jobs
MUST – Example • MPI_Init(&argc,&argv); • MPI_Comm_rank (MPI_COMM_WORLD, &rank); • MPI_Comm_size (MPI_COMM_WORLD, &size); • //1) Create a datatype • MPI_Type_contiguous (2, MPI_INT, &newType); • MPI_Type_commit (&newType); • //2) Use MPI_Sendrecv to perform a ring communication • MPI_Sendrecv ( • sBuf, 1, newType, (rank+1)%size, 123, • rBuf, sizeof(int)*2, MPI_BYTE, (rank-1+size) % size, 123, • MPI_COMM_WORLD, &status); • //3) Use MPI_Send and MPI_Recv to perform a ring communication • MPI_Send ( sBuf, 1, newType, (rank+1)%size, 456, MPI_COMM_WORLD); • MPI_Recv ( rBuf, sizeof(int)*2, MPI_BYTE, (rank-1+size) % size, 456, MPI_COMM_WORLD, &status); • MPI_Finalize ();
MUST – Example (2) • Runs without any apparent issue with OpenMPI • Are there any errors? • Detect usage errors with MUST: • mpicc vihps8_2011.c -o vihps8_2011.exe • mustrun -np 4 vihps8_2011.exe • firefox MUST_Output.html
MUST – Example (3) • First error: type mismatch
MUST – Example (4) • Second error: send-send deadlock
MUST – Example (5) • Visualization of deadlock (MUST_Deadlock.dot)
MUST – Example (6) • Third error: Leaked datatype
MUST – Operation Modes • MUST causes overhead at runtime • Default: • MUST expects a crash at any time • Blocking communication is used to ensure error detection • This can cause high overheads • If your application doesn't crash: • Add "--must:nocrash" to the mustrun command • Uses aggregated non-blocking communication • Provides a substantial speed-up • More options: "mustrun --help"
MUST – Scalability • Local checks scalable • Overhead depends on the application's MPI ratio • Some local checks are expensive, e.g.: buffer overlaps • Slowdown of overlap detection for SPEC MPI2007 (chart omitted)
MUST – Scalability (2) • Non-local checks are currently centralized • Overhead depends on the total number of MPI calls • Scalability: ~100 tasks • Slowdown of non-local checks for SPEC MPI2007 (chart omitted)
MUST – Summary • Automatic runtime error detection tool • Open source: http://tinyurl.com/zihmust • Successor of Marmot and Umpire • Features: • Various local checks – scalable • Deadlock detection, type matching, collective verification – not yet scalable • Usage: • Replace “mpiexec” by “mustrun” • Inspect “MUST_Output.html”