
Dealing with MPI Bugs at Scale: Best Practices, Automatic Detection, Debugging, and Formal Verification


Presentation Transcript


  1. Dealing with MPI Bugs at Scale: Best Practices, Automatic Detection, Debugging, and Formal Verification

  2. Organization • Overview of Usage Errors • Tool Overview • Automatic Error Detection with MUST • Coffee Break • MPI Debugging with DDT • Formal Analysis with ISP • Summary and Comparison • Lunch Break • Hands-On Part 1 • Coffee Break • Hands-On Part 2 • Advanced Topics

  3. Overview of Usage Errors

  4. MPI was designed to support performance • Complex standard with many operations • Includes non-blocking and collective operations • Can specify messaging choices precisely • Library not required to detect non-compliant usage • Many erroneous or unsafe actions • Incorrect arguments • Resource errors • Buffer usage • Type matching errors • Deadlock • Includes concept of “unsafe” sends

  5. Incorrect Arguments • Incorrect arguments manifest during: • Compilation (Type mismatch) • Runtime (Crash in MPI or unexpected behavior) • During porting (Only manifests for some MPIs/Systems) • Example (C): MPI_Send (buf, count, MPI_INTEGER,…);
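
A minimal sketch of the datatype mistake from this slide; the destination, tag, and communicator arguments, which the slide elides with "…", are filled in here only for illustration:

    #include <mpi.h>

    void send_ints(int *buf, int count, int dest)
    {
        /* Erroneous: MPI_INTEGER is the Fortran integer type; using it for a
           C int buffer is non-compliant and may only fail on some MPIs/systems. */
        /* MPI_Send(buf, count, MPI_INTEGER, dest, 0, MPI_COMM_WORLD); */

        /* Correct: use the C datatype that matches the buffer. */
        MPI_Send(buf, count, MPI_INT, dest, 0, MPI_COMM_WORLD);
    }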

  6. Resource Tracking Errors • Many MPI features require resource allocations • Communicators • Data types • Requests • Groups, Error Handlers, Reduction Operations • Simple “MPI_Op leak” example: MPI_Op_create (..., &op); MPI_Finalize ()
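
A sketch of the "MPI_Op leak" and its fix; the reduction function my_sum_fn is an illustrative stand-in, not from the slides:

    #include <mpi.h>

    /* Illustrative user-defined reduction: element-wise integer sum. */
    static void my_sum_fn(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        int i, *a = (int *)in, *b = (int *)inout;
        (void)dt;
        for (i = 0; i < *len; ++i)
            b[i] += a[i];
    }

    void op_lifetime(void)
    {
        MPI_Op op;
        MPI_Op_create(my_sum_fn, 1 /* commutative */, &op);

        /* ... use op in MPI_Reduce / MPI_Allreduce ... */

        /* Without this call the handle leaks and a tool such as MUST
           reports it at MPI_Finalize. */
        MPI_Op_free(&op);
    }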

  7. Dropped and Lost Requests • Two resource errors with message requests • Leaked by creator (i.e., never completed) • Never matched by src/dest (dropped request) • Simple “lost request” example: MPI_Irecv (..., &req); MPI_Irecv (..., &req); MPI_Wait (&req,…)
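
A sketch of the "lost request" pattern and a corrected variant; buffer, source, and tag values are assumptions added for illustration:

    #include <mpi.h>

    /* Erroneous: the first request handle is overwritten and never completed. */
    void lost_request_bug(int *buf1, int *buf2, int src)
    {
        MPI_Request req;
        MPI_Status  status;

        MPI_Irecv(buf1, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &req);
        MPI_Irecv(buf2, 1, MPI_INT, src, 1, MPI_COMM_WORLD, &req); /* leaks request #1 */
        MPI_Wait(&req, &status);
    }

    /* Correct: keep every request and complete all of them. */
    void lost_request_fixed(int *buf1, int *buf2, int src)
    {
        MPI_Request reqs[2];
        MPI_Status  stats[2];

        MPI_Irecv(buf1, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(buf2, 1, MPI_INT, src, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, stats);
    }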

  8. Buffer Usage Errors • Buffers passed to MPI_Isend, MPI_Irecv, … • Must not be written to until MPI_Wait is called • Must not be read for non-blocking receive calls • Example: MPI_Irecv (buf, ..., &request); temp = f(buf[i]); MPI_Wait (&request, ...);
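
A sketch of the buffer-usage rule for non-blocking receives; names and message parameters are assumptions for illustration:

    #include <mpi.h>

    void irecv_buffer_rule(int *buf, int n, int src)
    {
        MPI_Request request;
        MPI_Status  status;
        int temp;

        MPI_Irecv(buf, n, MPI_INT, src, 0, MPI_COMM_WORLD, &request);

        /* Erroneous: reading (or writing) buf here races with the incoming
           message; the value observed is undefined. */
        /* temp = buf[0]; */

        MPI_Wait(&request, &status);

        /* Correct: access the buffer only after the request completed. */
        temp = buf[0];
        (void)temp;
    }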

  9. MPI Type Matching • Three kinds of MPI type matching • Send buffer type and MPI send data type • MPI send type and MPI receive type • MPI receive type and receive buffer type • Similar requirements for collective operations • Buffer type <=> MPI type matching • Requires compiler support • MPI_BOTTOM, MPI_LB & MPI_UB complicate it • Not provided by our tools

  10. Basic MPI Type Matching Example • MPI standard provides support for heterogeneity • Endian-ness • Data formats • Limitations • Simple example code: Task 0: MPI_Send(1, MPI_INT) | Task 1: MPI_Recv(8, MPI_BYTE) • Do the types match? • Buffer type <=> MPI type: Yes • MPI send type <=> MPI receive type? NO! • Common misconception
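
A sketch of the slide's example in plain C, assuming two ranks; tags and buffers are illustrative:

    #include <mpi.h>

    void byte_vs_int(int rank)
    {
        int  value = 42;
        char raw[8];
        MPI_Status status;

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* MPI_BYTE does not match MPI_INT, even though 8 bytes would
               "fit" the data: send and receive datatypes must match. */
            MPI_Recv(raw, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
        }
    }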

  11. Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 1: Task 0: MPI_Send(1, T1) | Task 1: MPI_Recv(1, T2) • Yes: MPI supports partial receives • Allows efficient algorithms • double <=> double; char <=> char

  12. Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 2: Task 0: MPI_Send(1, T2) | Task 1: MPI_Recv(2, T1) • Yes: • double <=> double • char <=> char • double <=> double

  13. Derived MPI Type Matching Example • Consider MPI derived types corresponding to: • T1: struct {double, char} • T2: struct {double, char, double} • Do these types match? • Example 3: Task 0: MPI_Send(2, T1); MPI_Send(2, T2) | Task 1: MPI_Recv(2, T2); MPI_Recv(4, T1) • No! What happens? Nothing good!
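
A sketch of Example 3 with the derived types built explicitly, assuming two ranks; buffer names and tags are illustrative. The mismatch is in the type signatures, not the byte counts:

    #include <stddef.h>
    #include <mpi.h>

    typedef struct { double d; char c; } T1;            /* {double, char}         */
    typedef struct { double d; char c; double d2; } T2; /* {double, char, double} */

    void example3(int rank)
    {
        int          lens[3]  = {1, 1, 1};
        MPI_Aint     disp1[2] = {offsetof(T1, d), offsetof(T1, c)};
        MPI_Aint     disp2[3] = {offsetof(T2, d), offsetof(T2, c), offsetof(T2, d2)};
        MPI_Datatype el1[2]   = {MPI_DOUBLE, MPI_CHAR};
        MPI_Datatype el2[3]   = {MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE};
        MPI_Datatype ty1, ty2;
        T1 sbuf1[2], rbuf1[4];
        T2 sbuf2[2], rbuf2[2];
        MPI_Status status;

        MPI_Type_create_struct(2, lens, disp1, el1, &ty1);
        MPI_Type_create_struct(3, lens, disp2, el2, &ty2);
        MPI_Type_commit(&ty1);
        MPI_Type_commit(&ty2);

        /* Note: real code would also resize the extents to sizeof(T1)/sizeof(T2)
           with MPI_Type_create_resized; omitted since the point here is the
           signature mismatch. */
        if (rank == 0) {
            MPI_Send(sbuf1, 2, ty1, 1, 0, MPI_COMM_WORLD); /* sends d,c,d,c        */
            MPI_Send(sbuf2, 2, ty2, 1, 1, MPI_COMM_WORLD); /* sends d,c,d,d,c,d    */
        } else if (rank == 1) {
            /* Expects d,c,d,d,c,d: mismatch at element 4 (char sent, double expected). */
            MPI_Recv(rbuf2, 2, ty2, 0, 0, MPI_COMM_WORLD, &status);
            /* Expects d,c,d,c,d,c,d,c: also mismatched against the sent signature. */
            MPI_Recv(rbuf1, 4, ty1, 0, 1, MPI_COMM_WORLD, &status);
        }

        MPI_Type_free(&ty1);
        MPI_Type_free(&ty2);
    }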

  14. Basic MPI Deadlocks • Unsafe or erroneous MPI programming practices • Code results depend on: • MPI implementation limitations • User input parameters • Classic example code: Task 0: MPI_Send; MPI_Recv | Task 1: MPI_Send; MPI_Recv • Assume application uses “Thread funneled”
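
A sketch of the classic head-to-head exchange between two ranks; message parameters are illustrative:

    #include <mpi.h>

    /* Whether this completes depends on MPI-internal buffering, which is
       why such sends are called "unsafe". */
    void head_to_head(int rank, double *sbuf, double *rbuf, int n)
    {
        int peer = (rank == 0) ? 1 : 0;
        MPI_Status status;

        MPI_Send(sbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD); /* may block forever */
        MPI_Recv(rbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &status);

        /* Safe alternatives: MPI_Sendrecv, or MPI_Isend/MPI_Irecv followed
           by MPI_Waitall. */
    }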

  15. Deadlock Visualization • How to visualize/analyze deadlocks? • Common approach: waiting-for graphs (WFGs) • One node for each rank • Rank X waits for rank Y => node X has an arc to node Y • Consider the previous example: Task 0: MPI_Send; MPI_Recv | Task 1: MPI_Send; MPI_Recv • Visualization: nodes “0: MPI_Send” and “1: MPI_Send” each have an arc to the other • Deadlock criterion: cycle (for simple cases)

  16. Deadlocks with MPI Collectives • Erroneous MPI programming practice • Simple example code: Tasks 0, 1, & 2: MPI_Bcast; MPI_Barrier | Task 3: MPI_Barrier; MPI_Bcast • Possible code results: • Deadlock • Correct message matching • Incorrect message matching • Mysterious error messages • “Wait-for” every task in communicator
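
A sketch of the mismatched collective order, assuming 4 ranks and root 0 (both assumptions added for illustration):

    #include <mpi.h>

    void mismatched_collectives(int rank, int *buf)
    {
        if (rank != 3) {
            MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        } else {
            /* Erroneous: all ranks must issue collectives on a communicator
               in the same order; task 3 reverses it. */
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Bcast(buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        }
    }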

  17. Deadlock Visualization – Collectives • What about collectives? • Rank calling collective waits for all tasks to issue a matching call • One arc to each task that did not call a matching call • Consider the previous example: Tasks 0, 1, & 2: MPI_Bcast; MPI_Barrier | Task 3: MPI_Barrier; MPI_Bcast • Visualization: nodes “0: MPI_Bcast”, “1: MPI_Bcast”, and “2: MPI_Bcast” each have an arc to “3: MPI_Barrier”, which in turn has arcs back to all of them • Deadlock criterion: still a cycle

  18. Deadlocks with Partial Completions • What about “any” and “some”? • MPI_Waitany/Waitsome and wild-card source receives (MPI_ANY_SOURCE) have special semantics • Wait for at least one of a set of ranks but not necessarily all • Different from “waits for all” semantic • Example: Task 0: MPI_Recv(from:1) | Task 1: MPI_Recv(from:ANY) | Task 2: MPI_Recv(from:1) • What happens: • No call can progress, Deadlock • 0 waits for 1; 1 waits for either 0 or 2; 2 waits for 1

  19. Schedule Dependent Deadlocks • What about more complex interleavings? • Non-deterministic applications • Interleaving determines what calls match or are issued • Can manifest as bugs that only occur “sometimes” • Example: Task 0: MPI_Send(to:1); MPI_Barrier | Task 1: MPI_Recv(from:ANY); MPI_Recv(from:2); MPI_Barrier | Task 2: MPI_Send(to:1); MPI_Barrier • What happens: • Case A: • Recv (from:ANY) matches send from task 0 • All calls complete • Case B: • Recv (from:ANY) matches send from task 2 • Deadlock
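
A sketch of this wildcard race, assuming exactly 3 ranks; buffers and tags are illustrative:

    #include <mpi.h>

    void wildcard_race(int rank)
    {
        int msg = rank, r1, r2;
        MPI_Status status;

        if (rank == 0 || rank == 2) {
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Case A: the wildcard receive matches the send from rank 0;
               the next receive matches rank 2 and everything completes.
               Case B: it matches the send from rank 2; the next receive
               then has no remaining sender -> deadlock. */
            MPI_Recv(&r1, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(&r2, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &status);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }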

  20. Deadlock Visualization – Any Semantic • How to visualize partial completions? • “Waits for all of” wait type => “AND” semantic • “Waits for any of” wait type => “OR” semantic • Each type gets one type of arc • AND: solid arcs • OR: dashed arcs • Visualization for simple example: Task 0: MPI_Recv(from:1) | Task 1: MPI_Recv(from:ANY) | Task 2: MPI_Recv(from:1) • WFG: solid arcs from “0: MPI_Recv” and “2: MPI_Recv” to “1: MPI_Recv”; dashed arcs from “1: MPI_Recv” to both “0: MPI_Recv” and “2: MPI_Recv”

  21. Deadlock Criterion • Deadlock criterion for AND + OR • Cycles are necessary but not sufficient • A weakened form of a knot (OR-Knot) is the actual criterion • Tools can detect it and visualize the core of the deadlock • Some examples: • An OR-Knot (which is also a knot, Deadlock) • Cycle but no OR-Knot (Not Deadlocked) • OR-Knot but not a knot (Deadlock)

  22. Avoiding Bugs • Comment your code • Confirm consistency with asserts (see the sketch below) • Consider a verbose mode of your application • Use static and memory checkers: lint, memcheck (Valgrind) • Use unit testing, or at least provide test cases • Set up nightly builds • MPI Testing Tool: => http://www.open-mpi.org/projects/mtt/ • Ctest & Dashboards: => http://www.vtk.org/Wiki/CMake_Testing_With_CTest
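
A small, hypothetical sketch of assert-based consistency checks (function and variable names are invented for illustration):

    #include <assert.h>
    #include <mpi.h>

    /* Cheap checks that document assumptions and abort early in debug
       builds; they disappear when compiled with -DNDEBUG. */
    void setup_exchange(MPI_Comm comm, int nlocal)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        assert(size >= 2 && "algorithm needs at least two ranks");
        assert(nlocal > 0 && "empty local domain is not supported");
        assert(rank >= 0 && rank < size);

        /* ... actual communication setup ... */
    }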

  23. Tool Overview

  24. Tool Overview – Content • Approaches • Runtime Tools • Debuggers • Tutorial Tools

  25. Tools Overview – Approaches • Debuggers: • Helpful to pinpoint any error • Finding the root cause may be hard • Won’t detect sleeping errors • E.g.: gdb, TotalView, DDT • Static Analysis: • Compilers and source analyzers • Typically: type and expression errors • E.g.: MPI-Check • Model checking: • Requires a model of your application • State explosion possible • E.g.: MPI-Spin • Slide examples: MPI_Recv (buf, 5, MPI_INT, -1, 123, MPI_COMM_WORLD, &status); uses “-1” instead of “MPI_ANY_SOURCE”; and if (rank == 1023) crash(); a sleeping error in code that only works with fewer than 1024 tasks

  26. Tools Overview – Approaches (2) • Runtime error detection: • Inspect MPI calls at runtime • Limited to the timely interleaving that is observed • Causes overhead during application run • E.g.: Intel Trace Analyzer, Umpire, Marmot, MUST • Slide example: Task 0: MPI_Send(to:1, type=MPI_INT) | Task 1: MPI_Recv(from:0, type=MPI_FLOAT) => Type mismatch

  27. Tools Overview – Approaches (3) • Formal verification: • Extension of runtime error detection • Explores ALL possible interleavings • Detects errors that only manifest in some runs • Possibly many interleavings to explore • E.g.: ISP • Slide example: Task 0: MPI_Send (to:1); MPI_Barrier () | Task 1: MPI_Recv (from:ANY); MPI_Recv (from:0); MPI_Barrier () | Task 2: spend_some_time(); MPI_Send (to:1); MPI_Barrier () • Deadlock if MPI_Send(to:1)@0 matches MPI_Recv(from:ANY)@1

  28. Tools Overview – Runtime Tools • Interpose between application and MPI library • An MPI wrapper library intercepts all MPI calls • Checks analyze the intercepted calls • Local checks -> information from one task • Non-local checks -> information from multiple tasks • Tools can modify or time the actual MPI calls • Call chain: Application calls MPI_X(…) -> Correctness Tool calls PMPI_X(…) -> MPI
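
A minimal sketch of the PMPI interposition mechanism, assuming an MPI-3 const-correct prototype; the specific check shown is illustrative, not MUST's actual implementation:

    #include <stdio.h>
    #include <mpi.h>

    /* The wrapper library defines MPI_Send, performs its checks, and
       forwards to PMPI_Send; tools like MUST, Marmot, and Umpire generate
       such wrappers for the whole MPI API. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        /* Example of a local check: a negative count is invalid. */
        if (count < 0)
            fprintf(stderr, "correctness tool: MPI_Send called with count < 0\n");

        /* Forward the intercepted call to the actual MPI library. */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }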

  29. Tools Overview – Runtime Tools, Workflow • Attach tool to target application • Link library to application • Configure tool • Enable/disable correctness checks • Enable potential integrations (e.g. with debugger) • Run application • Usually a regular mpiexec/mpirun • Non-local checks may require extra resources • Analyze correctness report • Should be available even if the application crashes • Correct bugs and rerun for verification

  30. Tools Overview – Debuggers • Debuggers allow users to take control of processes in an application: • Controlling execution • Interrupting execution – pausing • Set “breakpoints” to pause at specific line of code • Step line by line, dive into functions • Instant stop at crash locations • Examining state – data and process location • Novel ways to display processes and data to simplify scale • Used for variety of models • GPUs, PGAS, MPI, OpenMP, scalar …

  31. Tools Overview – Debuggers • Debuggers also do more than report state: • Memory debugging detects some common errors automatically: • Read/write beyond array bounds • Freeing a pointer twice • Memory leaks (not freeing a pointer) • Complementary to other runtime tools • Marmot, Umpire, MUST, Intel MPI Checker • DAMPI, ISP, … • Integration can pinpoint errors faster

  32. Tools Overview – Debuggers, Workflow • Does the application crash, yield incorrect output, or just hang? • Recompile the application • Add “-g” flag to compiler options • Run the application inside the debugger • Usually via graphical interface – select number of tasks, application path, arguments – and “run” • Example: Crashing processes • Let application run until it crashes • Debugger leaps to line of code and processes that crashed • Fix the problem!

  33. Tool Overview – Tutorial Tools • Tools in this tutorial • A runtime error detection tool: MUST • A formal verification tool: ISP • A parallel debugger: DDT • Next: • Introduction to each tool • Workflow to MPI error detection with all tools

  34. Automatic Error Detection with MUST

  35. MUST – Content • Overview • Features • Infrastructure • Usage • Example • Operation Modes • Scalability

  36. MUST – Overview • MPI runtime error detection tool • Successor of the Marmot and Umpire tools • Marmot provides various local checks • Umpire provides non-local checks • First goal: merge of functionality • Second goal: improved scalability • Open source (BSD license) • Release: At this Supercomputing! • Infrastructure for scalability + extensibility, layered as: Checks (MUST) on top of GTI (Generic Tool Infrastructure) for tool infrastructure & generation, on top of the PnMPI base infrastructure

  37. MUST – Features • Local checks: • Integer validation • Integrity checks (pointers valid, etc.) • Operation, Request, Communicator, Datatype, Group usage • Resource leak detection • Memory overlap checks • Non-local checks: • Collective verification • Lost message detection • Type matching (For P2P and collectives) • Deadlock detection (with root cause visualization)

  38. MUST – Infrastructure • Extra processes are used to offload checks • Communication in a tree-based overlay network (TBON) • Checks can be distributed • Figure: application processes 0–3 feed checks to extra tool processes/threads arranged in the tree overlay network

  39. MUST – Usage • Compile and link application as usual • Link against the shared version of the MPI library • Replace “mpiexec” with “mustrun” • E.g.: mustrun –np 4 myApp.exe input.txt output.txt • Inspect “MUST_Output.html” in run directory • “MUST_Deadlock.dot” exists in case of deadlock • Visualize: dot –Tps MUST_Deadlock.dot –o deadlock.ps • The mustrun script will use an extra process for non-local checks (Invisible to application) • I.e.: “mustrun –np 4 …” will need resources for 5 tasks • Make sure to allocate the extra task for batch jobs

  40. MUST – Example • MPI_Init(&argc,&argv); • MPI_Comm_rank (MPI_COMM_WORLD, &rank); • MPI_Comm_size (MPI_COMM_WORLD, &size); • //1) Create a datatype • MPI_Type_contiguous (2, MPI_INT, &newType); • MPI_Type_commit (&newType); • //2) Use MPI_Sendrecv to perform a ring communication • MPI_Sendrecv ( • sBuf, 1, newType, (rank+1)%size, 123, • rBuf, sizeof(int)*2, MPI_BYTE, (rank-1+size) % size, 123, • MPI_COMM_WORLD, &status); • //3) Use MPI_Send and MPI_Recv to perform a ring communication • MPI_Send ( sBuf, 1, newType, (rank+1)%size, 456, MPI_COMM_WORLD); • MPI_Recv ( rBuf, sizeof(int)*2, MPI_BYTE, (rank-1+size) % size, 456, MPI_COMM_WORLD, &status); • MPI_Finalize ();

  41. MUST – Example (2) • Runs without any apparent issue with Open MPI • Are there any errors? • Detect usage errors with MUST: • mpicc vihps8_2011.c –o vihps8_2011.exe • mustrun –np 4 vihps8_2011.exe • firefox MUST_Output.html

  42. MUST – Example (3) • First error: Type mismatch

  43. MUST – Example (4) • Second error: send-send deadlock

  44. MUST – Example (5) • Visualization of deadlock (MUST_Deadlock.dot)

  45. MUST – Example (6) • Third error: Leaked datatype

  46. MUST – Operation Modes • MUST causes overhead at runtime • Default: • MUST expects a crash at any time • Blocking communication is used to ensure error detection • This can cause high overheads • If your application doesn’t crash: • Add “--must:nocrash” to the mustrun command • Uses aggregated non-blocking communication • Provides substantial speed up • More options: “mustrun –help”

  47. MUST – Scalability • Local checks scalable • Overhead depends on the application’s MPI ratio • Some local checks are expensive, e.g.: buffer overlaps • Slowdown of overlap detection for SPEC MPI2007 shown on the slide

  48. MUST – Scalability (2) • Non-local checks currently centralized • Overhead depends on the total number of MPI calls • Scalability: ~100 tasks • Slowdown of non-local checks for SPEC MPI2007 shown on the slide

  49. MUST – Summary • Automatic runtime error detection tool • Open source: http://tinyurl.com/zihmust • Successor of Marmot and Umpire • Features: • Various local checks – scalable • Deadlock detection, type matching, collective verification – not yet scalable • Usage: • Replace “mpiexec” by “mustrun” • Inspect “MUST_Output.html”

  50. MPI Debugging with DDT
