Dealing with MPI Bugs at Scale: Best Practices, Automatic Detection, Debugging, and Formal Verification
Organization • Overview of Usage Errors • Tool Overview • Automatic Error Detection with MUST • Coffee Break • MPI Debugging with DDT • Formal Analysis with ISP • Summary and Comparison • Lunch Break • Hands-On Part 1 • Coffee Break • Hands-On Part 2 • Advanced Topics
MPI was designed to support performance • Complex standard with many operations • Includes non-blocking and collective operations • Can specify messaging choices precisely • Library not required to detect non-compliant usage • Many erroneous or unsafe actions • Incorrect arguments • Resource errors • Buffer usage • Type matching errors • Deadlock • Includes concept of “unsafe” sends
Incorrect Arguments • Incorrect arguments manifest during: • Compilation (type mismatch) • Runtime (crash in MPI or unexpected behavior) • Porting (only manifests for some MPIs/systems) • Example (C): MPI_Send (buf, count, MPI_INTEGER,…);
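A minimal C sketch of the call above (buf, count, dest, and tag are placeholder names): passing the Fortran datatype MPI_INTEGER for a C int buffer is the kind of argument error that may only surface on some MPI implementations; MPI_INT is the matching C datatype.

    int buf[4] = {1, 2, 3, 4};
    int count = 4, dest = 1, tag = 0;
    /* Questionable: MPI_INTEGER describes a Fortran INTEGER, not a C int */
    MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD);
    /* Portable: the C datatype matches the buffer's element type */
    MPI_Send(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);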
Resource Tracking Errors • Many MPI features require resource allocations • Communicators • Data types • Requests • Groups, Error Handlers, Reduction Operations • Simple “MPI_Op leak” example: MPI_Op_create (..., &op); MPI_Finalize ()
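A sketch of the leak above and its fix; my_sum stands for a hypothetical user reduction function of type MPI_User_function, and MPI_Init is assumed to have been called:

    MPI_Op op;
    MPI_Op_create(my_sum, 1 /* commutative */, &op);
    /* ... use op in MPI_Reduce / MPI_Allreduce ... */
    MPI_Op_free(&op);   /* omitting this line is the "MPI_Op leak" shown above */
    MPI_Finalize();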
Dropped and Lost Requests • Two resource errors with message requests • Leaked by creator (i.e., never completed) • Never matched by src/dest (dropped request) • Simple “lost request” example: MPI_Irecv (..., &req); MPI_Irecv (..., &req); MPI_Wait (&req,…)
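One way to repair the fragment above, as a sketch (buf0, buf1, count, and tag are placeholders): keep one request handle per nonblocking call and complete all of them, e.g. with MPI_Waitall.

    MPI_Request reqs[2];
    MPI_Status  stats[2];
    MPI_Irecv(buf0, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(buf1, count, MPI_INT, 1, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);   /* every request is completed; none is leaked or dropped */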
Buffer Usage Errors • Buffers passed to MPI_Isend, MPI_Irecv, … • Must not be written to until MPI_Wait is called • Must not be read for non-blocking receive calls • Example: MPI_Irecv (buf, ..., &request); temp = f(buf[i]); MPI_Wait (&request, ...);
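The safe pattern, sketched with placeholder names (buf, count, src, tag, f, i): the receive buffer is only touched after the request completes.

    MPI_Request request;
    MPI_Irecv(buf, count, MPI_INT, src, tag, MPI_COMM_WORLD, &request);
    /* ... unrelated work, but buf must not be read or written here ... */
    MPI_Wait(&request, MPI_STATUS_IGNORE);
    temp = f(buf[i]);   /* reading the receive buffer is legal only after completion */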
MPI Type Matching • Three kinds of MPI type matching • Send buffer type and MPI send data type • MPI send type and MPI receive type • MPI receive type and receive buffer type • Similar requirements for collective operations • Buffer type <=> MPI type matching • Requires compiler support • MPI_BOTTOM, MPI_LB & MPI_UB complicate it • Not provided by our tools
Basic MPI Type Matching Example • MPI standard provides support for heterogeneity • Endian-ness • Data formats • Limitations • Simple example code: Task 0: MPI_Send(1, MPI_INT) | Task 1: MPI_Recv(8, MPI_BYTE) • Do the types match? • Buffer type <=> MPI type: Yes • MPI send type <=> MPI receive type? • NO! • Common misconception
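A sketch of this pair in C (rank is assumed to come from MPI_Comm_rank): even when the byte counts line up, the send and receive type signatures do not match.

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        unsigned char raw[8];
        /* The byte count may "fit", but MPI_BYTE does not match MPI_INT */
        MPI_Recv(raw, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }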
Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 1: Task 0: MPI_Send(1, T1) | Task 1: MPI_Recv(1, T2) • Yes: MPI supports partial receives • Allows efficient algorithms • double <=> double; char <=> char
Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 2: Task 0: MPI_Send(1, T2) | Task 1: MPI_Recv(2, T1) • Yes: • double <=> double • char <=> char • double <=> double
Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 3: Task 0: MPI_Send(2, T1) | Task 1: MPI_Recv(2, T2), and Task 0: MPI_Send(2, T2) | Task 1: MPI_Recv(4, T1) • No! What happens? Nothing good!
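For reference, one way such datatypes could be constructed (a sketch; the struct layouts S1/S2 and the handles T1/T2 are illustrative, and extent/padding handling via MPI_Type_create_resized is omitted):

    #include <stddef.h>   /* offsetof */
    typedef struct { double d; char c; } S1;             /* corresponds to T1 */
    typedef struct { double d; char c; double d2; } S2;  /* corresponds to T2 */
    MPI_Datatype T1, T2;
    int          bl1[2] = {1, 1}, bl2[3] = {1, 1, 1};
    MPI_Aint     dsp1[2] = {offsetof(S1, d), offsetof(S1, c)};
    MPI_Aint     dsp2[3] = {offsetof(S2, d), offsetof(S2, c), offsetof(S2, d2)};
    MPI_Datatype typ1[2] = {MPI_DOUBLE, MPI_CHAR};
    MPI_Datatype typ2[3] = {MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE};
    MPI_Type_create_struct(2, bl1, dsp1, typ1, &T1);
    MPI_Type_create_struct(3, bl2, dsp2, typ2, &T2);
    MPI_Type_commit(&T1);
    MPI_Type_commit(&T2);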
Basic MPI Deadlocks • Unsafe or erroneous MPI programming practices • Code results depend on: • MPI implementation limitations • User input parameters • Classic example code: Task 0: MPI_Send; MPI_Recv | Task 1: MPI_Send; MPI_Recv • Assume the application uses "Thread funneled"
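A sketch of the unsafe pattern between two ranks, plus one common fix (sbuf, rbuf, and N are placeholders; rank comes from MPI_Comm_rank):

    /* Unsafe: both ranks send first; deadlocks once the messages exceed internal buffering */
    int peer = (rank == 0) ? 1 : 0;
    MPI_Send(sbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* One safe alternative: let MPI pair the send and the receive */
    MPI_Sendrecv(sbuf, N, MPI_INT, peer, 0,
                 rbuf, N, MPI_INT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);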
Deadlock Visualization • How to visualize/analyze deadlocks? • Common approach: waiting-for graphs (WFGs) • One node for each rank • Rank X waits for rank Y => node X has an arc to node Y • Consider the previous example: Task 0: MPI_Send; MPI_Recv | Task 1: MPI_Send; MPI_Recv • Visualization: nodes "0: MPI_Send" and "1: MPI_Send" with arcs in both directions • Deadlock criterion: cycle (for simple cases)
Deadlocks with MPI Collectives • Erroneous MPI programming practice • Simple example code: Tasks 0, 1 & 2: MPI_Bcast; MPI_Barrier | Task 3: MPI_Barrier; MPI_Bcast • Possible code results: • Deadlock • Correct message matching • Incorrect message matching • Mysterious error messages • "Wait-for" every task in communicator
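As a C sketch (rank from MPI_Comm_rank; buf and the root are placeholders), the mismatched ordering looks like this:

    if (rank == 3) {
        MPI_Barrier(MPI_COMM_WORLD);                      /* task 3: barrier first */
        MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* tasks 0-2: bcast first */
        MPI_Barrier(MPI_COMM_WORLD);
    }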
Deadlock Visualization – Collectives • What about collectives? • A rank calling a collective waits for all tasks to issue a matching call • One arc to each task that did not issue a matching call • Consider the previous example: Tasks 0, 1 & 2: MPI_Bcast; MPI_Barrier | Task 3: MPI_Barrier; MPI_Bcast • Visualization: nodes "0: MPI_Bcast", "1: MPI_Bcast", "2: MPI_Bcast", and "3: MPI_Barrier", with arcs from each MPI_Bcast node to the MPI_Barrier node and back • Deadlock criterion: still a cycle
Deadlocks with Partial Completions • What about "any" and "some"? • MPI_Waitany/Waitsome and wild-card source receives (MPI_ANY_SOURCE) have special semantics • Wait for at least one of a set of ranks, but not necessarily all • Different from the "waits for all" semantic • Example: Task 0: MPI_Recv(from:1) | Task 1: MPI_Recv(from:ANY) | Task 2: MPI_Recv(from:1) • What happens: • No call can progress, deadlock • 0 waits for 1; 1 waits for either 0 or 2; 2 waits for 1
Schedule Dependent Deadlocks • What about more complex interleavings? • Non-deterministic applications • Interleaving determines what calls match or are issued • Can manifest as bugs that only occur "sometimes" • Example: Task 0: MPI_Send(to:1); MPI_Barrier | Task 1: MPI_Recv(from:ANY); MPI_Recv(from:2); MPI_Barrier | Task 2: MPI_Send(to:1); MPI_Barrier • What happens: • Case A: • Recv (from:ANY) matches send from task 0 • All calls complete • Case B: • Recv (from:ANY) matches send from task 2 • Deadlock: Recv (from:2) can never be matched
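A compact C sketch of this pattern (payloads and tags are arbitrary), which a runtime or formal tool can drive into either interleaving:

    int x = rank, a, b;
    if (rank == 0 || rank == 2) {
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&a, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Deadlocks whenever the wildcard receive above consumed task 2's message */
        MPI_Recv(&b, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);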
Deadlock Visualization – Any Semantic • How to visualize partial completions? • "Waits for all of" wait type => "AND" semantic • "Waits for any of" wait type => "OR" semantic • Each type gets one type of arc • AND: solid arcs • OR: dashed arcs • Visualization for the simple example (Task 0: MPI_Recv(from:1) | Task 1: MPI_Recv(from:ANY) | Task 2: MPI_Recv(from:1)): solid (AND) arcs from "0: MPI_Recv" and "2: MPI_Recv" to "1: MPI_Recv"; dashed (OR) arcs from "1: MPI_Recv" to the other two nodes
Deadlock Criterion • Deadlock criterion for AND + OR • Cycles are necessary but not sufficient • A weakened form of a knot (OR-knot) is the actual criterion • Tools can detect it and visualize the core of the deadlock • Some examples (small WFGs over ranks 0-3): • An OR-knot (which is also a knot): deadlock • A cycle but no OR-knot: not deadlocked • An OR-knot that is not a knot: deadlock
Avoiding Bugs • Comment your code • Confirm consistency with asserts • Consider a verbose mode of your application • Use static analyzers and memory checkers: lint, memcheck (Valgrind) • Use unit testing, or at least provide test cases • Set up nightly builds • MPI Testing Tool: => http://www.open-mpi.org/projects/mtt/ • CTest & Dashboards: => http://www.vtk.org/Wiki/CMake_Testing_With_CTest
Tool Overview – Content • Approaches • Runtime Tools • Debuggers • Tutorial Tools
Tools Overview – Approaches • Debuggers: • Helpful to pinpoint any error • Finding the root cause may be hard • Won't detect sleeping errors, e.g.: if (rank == 1023) crash (); (only works with less than 1024 tasks) • E.g.: gdb, TotalView, DDT • Static analysis: • Compilers and source analyzers • Typically: type and expression errors, e.g.: MPI_Recv (buf, 5, MPI_INT, -1, 123, MPI_COMM_WORLD, &status); with "-1" instead of "MPI_ANY_SOURCE" • E.g.: MPI-Check • Model checking: • Requires a model of your application • State explosion possible • E.g.: MPI-Spin
Tools Overview – Approaches (2) • Runtime error detection: • Inspect MPI calls at runtime • Limited to the timely interleaving that is observed • Causes overhead during the application run • E.g.: Intel Trace Analyzer, Umpire, Marmot, MUST • Example: Task 0: MPI_Send(to:1, type=MPI_INT) | Task 1: MPI_Recv(from:0, type=MPI_FLOAT) => type mismatch detected at runtime
Tools Overview – Approaches (3) • Formal verification: • Extension of runtime error detection • Explores ALL possible interleavings • Detects errors that only manifest in some runs • Possibly many interleavings to explore • E.g.: ISP • Example: Task 0: MPI_Send (to:1); MPI_Barrier () | Task 1: MPI_Recv (from:ANY); MPI_Recv (from:0); MPI_Barrier () | Task 2: spend_some_time(); MPI_Send (to:1); MPI_Barrier () • Deadlock if MPI_Send(to:1)@0 matches MPI_Recv(from:ANY)@1
Tools Overview – Runtime Tools Interpose Between Application and MPI Library • An MPI wrapper library intercepts all MPI calls • Checks analyze the intercepted calls • Local checks -> information from one task • Non-local checks -> information from multiple tasks • Tools can modify or time the actual MPI calls • Layering: the application calls MPI_X(…), the correctness tool's wrapper runs its checks and then calls PMPI_X(…), which enters the MPI library
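A minimal sketch of this interposition technique using the standard PMPI profiling interface; the wrapper and its check are illustrative, not MUST's actual implementation (MPI-3 style const buffer argument assumed):

    #include <mpi.h>
    #include <stdio.h>

    /* Wrapper intercepts MPI_Send: run a (trivial) local check, then forward to the real call */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        if (count < 0)
            fprintf(stderr, "MPI_Send called with a negative count (%d)\n", count);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);   /* real MPI entry point */
    }

Linked ahead of the MPI library (or preloaded), such a wrapper sees every MPI_Send before the library does.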
Tools Overview – Runtime Tools, Workflow • Attach tool to target application • Link library to application • Configure tool • Enable/disable correctness checks • Enable potential integrations (e.g. with debugger) • Run application • Usually a regular mpiexec/mpirun • Non-local checks may require extra resources • Analyze correctness report • Should be available even if the application crashes • Correct bugs and rerun for verification
Tools Overview – Debuggers • Debuggers allow users to take control of processes in an application: • Controlling execution • Interrupting execution – pausing • Set "breakpoints" to pause at a specific line of code • Step line by line, dive into functions • Instant stop at crash locations • Examining state – data and process location • Novel ways to display processes and data to simplify scale • Used for a variety of programming models • GPUs, PGAS, MPI, OpenMP, scalar …
Tools Overview – Debuggers • Debuggers also do more than report state: • Memory debugging detects some common errors automatically: • Reads/writes beyond array bounds • Freeing a pointer twice • Memory leaks (not freeing a pointer) • Complementary to other runtime tools • Marmot, Umpire, MUST, Intel MPI Checker • DAMPI, ISP, … • Integration can pinpoint errors faster
Tools Overview – Debuggers, Workflow • Does the application crash, yield incorrect output, or just hang? • Recompile the application • Add the "-g" flag to the compiler options • Run the application inside the debugger • Usually via a graphical interface – select the number of tasks, application path, and arguments – then "run" • Example: Crashing processes • Let the application run until it crashes • The debugger jumps to the line of code and the processes that crashed • Fix the problem!
Tool Overview – Tutorial Tools • Tools in this tutorial • A runtime error detection tool: MUST • A formal verification tool: ISP • A parallel debugger: DDT • Next: • Introduction to each tool • Workflow for MPI error detection with all tools
MUST – Content • Overview • Features • Infrastructure • Usage • Example • Operation Modes • Scalability
MUST – Overview • MPI runtime error detection tool • Successor of the Marmot and Umpire tools • Marmot provides various local checks • Umpire provides non-local checks • First goal: merge of functionality • Second goal: improved scalability • Open source (BSD license) • Release: At this Supercomputing! • Use of infrastructure for scalability + extensibility: the MUST checks run on top of GTI (Generic Tool Infrastructure), which in turn builds on the PnMPI base infrastructure
MUST – Features • Local checks: • Integer validation • Integrity checks (pointers valid, etc.) • Operation, Request, Communicator, Datatype, Group usage • Resource leak detection • Memory overlap checks • Non-local checks: • Collective verification • Lost message detection • Type matching (For P2P and collectives) • Deadlock detection (with root cause visualization)
MUST – Infrastructure • Extra processes are used to offload checks • Communication in a tree-based overlay network (TBON) • Checks can be distributed • Figure: application processes 0-3 feed a tree of extra tool processes/threads that run the checks
MUST – Usage • Compile and link application as usual • Link against the shared version of the MPI library • Replace "mpiexec" with "mustrun" • E.g.: mustrun -np 4 myApp.exe input.txt output.txt • Inspect "MUST_Output.html" in the run directory • "MUST_Deadlock.dot" exists in case of deadlock • Visualize: dot -Tps MUST_Deadlock.dot -o deadlock.ps • The mustrun script will use an extra process for non-local checks (invisible to the application) • I.e.: "mustrun -np 4 …" will need resources for 5 tasks • Make sure to allocate the extra task for batch jobs
MUST – Example • MPI_Init(&argc,&argv); • MPI_Comm_rank (MPI_COMM_WORLD, &rank); • MPI_Comm_size (MPI_COMM_WORLD, &size); • //1) Create a datatype • MPI_Type_contiguous (2, MPI_INT, &newType); • MPI_Type_commit (&newType); • //2) Use MPI_Sendrecv to perform a ring communication • MPI_Sendrecv ( • sBuf, 1, newType, (rank+1)%size, 123, • rBuf, sizeof(int)*2, MPI_BYTE, (rank-1+size) % size, 123, • MPI_COMM_WORLD, &status); • //3) Use MPI_Send and MPI_Recv to perform a ring communication • MPI_Send ( sBuf, 1, newType, (rank+1)%size, 456, MPI_COMM_WORLD); • MPI_Recv ( rBuf, sizeof(int)*2, MPI_BYTE, (rank-1+size) % size, 456, MPI_COMM_WORLD, &status); • MPI_Finalize ();
MUST – Example (2) • Runs without any apparent issue with OpenMPI • Are there any errors? • Detect usage errors with MUST: • mpicc vihps8_2011.c -o vihps8_2011.exe • mustrun -np 4 vihps8_2011.exe • firefox MUST_Output.html
MUST – Example (3) • First error: type mismatch
MUST – Example (4) • Second error: send-send deadlock
MUST – Example (5) • Visualization of deadlock (MUST_Deadlock.dot)
MUST – Example (6) • Third error: Leaked datatype
MUST – Operation Modes • MUST causes overhead at runtime • Default: • MUST expects a crash at any time • Blocking communication is used to ensure error detection • This can cause high overheads • If your application doesn't crash: • Add "--must:nocrash" to the mustrun command • Uses aggregated non-blocking communication • Provides a substantial speed-up • More options: "mustrun --help"
MUST – Scalability • Local checks scalable • Overhead depends on the application's MPI ratio • Some local checks are expensive, e.g.: buffer overlaps • Slowdown of overlap detection for SPEC MPI2007 (chart omitted)
MUST – Scalability (2) • Non-local checks are currently centralized • Overhead depends on the total number of MPI calls • Scalability: ~100 tasks • Slowdown of non-local checks for SPEC MPI2007 (chart omitted)
MUST – Summary • Automatic runtime error detection tool • Open source: http://tinyurl.com/zihmust • Successor of Marmot and Umpire • Features: • Various local checks – scalable • Deadlock detection, type matching, collective verification – not yet scalable • Usage: • Replace “mpiexec” by “mustrun” • Inspect “MUST_Output.html”