This talk explores the need to eliminate concurrency bugs from HPC programs, with a focus on MPI's future in petascale computing and multicore systems. It discusses pitfalls and challenges of MPI programming and highlights the potential for formal analysis methods to enhance program correctness. Supported by the School of Computing at the University of Utah.
In-Situ Model Checking of MPI Parallel Programs • Ganesh Gopalakrishnan • Joint work with Salman Pervez, Michael DeLisi, Anh Vo, Sarvani Vakkalanka, Subodh Sharma, Yu Yang, Robert Palmer, Mike Kirby, Guodong Li, Geof Sawaya (http://www.cs.utah.edu/formal_verification) • School of Computing, University of Utah • Supported by: Microsoft HPC Institutes, NSF CNS 0509379
Overview of the U of U -- SoC • Timeline: Department of Computer Science founded 1965; School of Computing created in 2000 • Research Expenditures: 2005 - $9,041,298; 2006 - $7,176,590 • Faculty: 35 regular faculty; 5 research faculty; 6 adjunct faculty • Undergraduate Population: 300 enrolled in computer science major; 110 enrolled in computer engineering major • Graduate Population: 64 in master's program; 104 in Ph.D program • Undergraduate Degrees: Bachelor's of Science in Computer Science; Bachelor's of Science in Computer Engineering; Bachelor's/Master's (BS/MS) • Graduate Degrees: Master's in Computer Science; Non-Thesis Master's in Computer Science; Master's in Computing (Computer Engineering, Graphics and Visualization, Information Management, Robotics); Ph.D. in Computer Science; Ph.D. in Computing (Computer Engineering, Graphics and Visualization, Robotics, Scientific Computing) • Research Areas: Computer Graphics and Visualization; Computer Systems; Information Management; Natural Language Processing and Machine Learning; Program Analysis, Algorithms and Formal Methods; Robotics; Scientific Computing; VLSI and Computer Architecture
Computing is at an inflection point (photo courtesy of Intel)
Context and Motivation for this talk • Computing is at an inflection point • Superscalar high-frequency CPUs unsuitable • Hundreds of “90s style” CPUs will be packaged in a socket • Many desktops will have one socket, with large in-socket DRAM • Others will employ many sockets and a memory hierarchy • Software development is the critical issue • Expected to impact processors at all levels • Multicore node for a supercomputing cluster • Cisco’s 188 Tensilica processor routing chip • Parallel computing will be practised (and needs to be taught) at all levels • Threads are recognized to be highly permissive • Need libraries (e.g. OpenMP) where annotations will direct parallelization of sequential code • Many parallel programming paradigms will be necessary • “Esoteric ideas” (transactional memory) are in fact going to be in silicon soon • The interplay between debugging, correctness by construction, and performance • Deterministic replay is going to be crucial • User-scriptable schedulers will be needed • Formal methods that are shown to work in the real world are ESSENTIAL
Formal Analysis Methods • Finite state modeling and analysis • Decision procedures and constraint solving • Runtime verification methods • … (thousands of other approaches)
MPI is the de-facto standard for programming cluster machines (Image courtesy of Steve Parker, CSAFE, Utah) (BlueGene/L - Image courtesy of IBM / LLNL) Our focus: Eliminate Concurrency Bugs from HPC Programs !
What is MPI’s future vis-à-vis petascale computing and multicores arriving? • Still amazingly bright! • Largest number of apps still run under MPI • Alternatives are DARPA HPC languages • Fortress - Cray • X10 – IBM • PGAS languages • UPC, Titanium • Specialized libraries • MPI alone (as it stands) is insufficient • Fault Tolerance • Intra-socket and Inter-socket communications separated • MPI for inter- and ? (OpenMP + other libraries) for intra • Other Petascale and Multicore challenges need significant attention • Hard errors, inability to checkpoint frequently • Power consumption
What does MPI help with? • Helps parallelize expression evaluation, e.g., f o g o h (x) • Compute h(x) on P1 • Start g() on P2 • Fire up f on P1 • Use sends, receives, barriers, … to perform communications and synchronizations in a parallel setting
MPI Pitfalls • Programmer expectation: Integration of a region
  // Add up integrals calculated by each process
  if (my_rank == 0) {
      total = integral;
      for (source = 0; source < p; source++) {
          MPI_Recv(&integral, 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
          total = total + integral;
      }
  } else {
      MPI_Send(&integral, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
  }
Pitfalls (contd.) • Bug! Mismatched send/recv causes deadlock • Receives posted by rank 0: p0:fr 0, p0:fr 1, p0:fr 2 • Sends posted by the other ranks: p1:to 0, p2:to 0, p3:to 0 • The loop starts at source = 0, so rank 0 first blocks on a receive from itself (p0:fr 0), which no send ever matches
  // Add up integrals calculated by each process
  if (my_rank == 0) {
      total = integral;
      for (source = 0; source < p; source++) {
          MPI_Recv(&integral, 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
          total = total + integral;
      }
  } else {
      MPI_Send(&integral, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
  }
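The mismatch can be checked mechanically by pairing the receives rank 0 posts with the sends the other ranks issue. The following is a minimal Python sketch of that bookkeeping, not MPI code: the function name `unmatched_receives` is hypothetical, and it assumes every non-zero rank sends exactly one message to rank 0 (i.e. `dest` is 0).

```python
def unmatched_receives(p, first_source):
    """Return the sources rank 0 waits on that never send to it.

    p: number of processes; first_source: initial value of the loop variable.
    Assumption: ranks 1..p-1 each send one message to rank 0.
    """
    senders = set(range(1, p))          # ranks that execute the else-branch
    expected = range(first_source, p)   # one blocking receive per loop iteration
    return [src for src in expected if src not in senders]


print(unmatched_receives(4, 0))  # loop starting at source = 0
print(unmatched_receives(4, 1))  # loop starting at source = 1
```

With four processes, a loop starting at 0 leaves the receive from rank 0 unmatched, while a loop starting at 1 leaves nothing unmatched. Because the receives are blocking and issued in order, a single unmatched receive at the front of the loop is enough to stall the whole program.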
Reasons for our interest in MPI verification • Widely felt need • MPI is used on expensive machines for critical simulations • Potential for wider impact • MPI is a success as a standard • What’s good for MPI may be good for OpenMP, Cuda, Shmem, … • Working in a less crowded but important area • Shared memory software verification is a crowded field • Verifying library based software is relatively unexplored • MPI verification is potentially far more tractable and scalable than shared-memory thread program verification
Differences between MPI and Shared Memory / Thread Parallel programs • MPI: processes with local state communicate by copying; not much dynamic process creation; control / data dependencies are well confined (often to rank variables and such); simple aliasing • Shared memory / threads: processes share global state and heap; synchronization uses locks, signals, notifies, waits; PThread programs may spawn children dynamically; pervasive decoding of “data” (e.g. through heap storage); aliasing relations may flow considerably across pointer chains and across procedure calls
Some high-level features of MPI • Organized as library (API) of functions • Over 300 functions in MPI-2 • Most MPI programs use about a dozen functions • A different dozen is used across various programs
MPI programming and optimization • MPI includes Message Passing, Shared Memory, and I/O • We consider C++ MPI programs, largely focussing on msg passing • MPI programs are usually written by hand • Automated generation has been proposed and still seems attractive • Many MPI programs do evolve • Re-tuning after porting to a new cluster, etc. • Porting skews timing, exposing bugs • Correctness expectation varies • Some are throw-away programs; others are long-lasting libraries • Code correctness is our emphasis – not model fidelity
Why is MPI Program Verification hard? • The MPI library is Complex • MPI user programs bury simple bug-patterns amidst thousands of lines of code • Runtime considerations can be complex
MPI Library Complexity: Collision of features • Rendezvous mode • Blocking mode • Non-blocking mode • Reliance on system buffering • User-attached buffering • Restarts/Cancels of MPI Operations • Send • Receive • Send / Receive • Send / Receive / Replace • Broadcast • Barrier • Reduce • Non Wildcard receives • Wildcard receives • Tag matching • Communication spaces • An MPI program is an interesting (and legal) combination of elements from these spaces
MPI Library Complexity: 1-sided ops • MPI has shared memory (called “one-sided”) • Nodes open shared region thru a “collective” • One process manages the region (“owner”) • Ensures serial access of the window • Within a lock/unlock, a process does puts/gets • There are more functions such as “accumulate” besides puts / gets • The puts/gets are not program-ordered !
Some MPI bug patterns • All examples are from “Dynamic Software Testing of MPI Applications with Umpire” (Vetter and de Supinski)
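Deadlock Pattern: Insufficient Buffering • P0: s(P1); r(P1); • P1: s(P0); r(P0); • Both processes send first; without system buffering neither send can complete, so neither receive is ever reached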
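Whether this pattern deadlocks depends on how much buffering the system provides. The sketch below is a toy Python simulator written for illustration, not a model of any real MPI implementation: `runs_to_completion` and its `buffer_slots` parameter are hypothetical names, and `buffer_slots = 0` is assumed to stand for rendezvous behavior.

```python
def runs_to_completion(programs, buffer_slots):
    """Simulate blocking sends/receives; return True if every rank finishes.

    programs: one list per rank of ("s", peer) or ("r", peer) operations.
    buffer_slots: messages the system may hold per (sender, receiver) pair;
    0 models rendezvous, where a send completes only with its receive.
    """
    pc = [0] * len(programs)   # index of the next operation of each rank
    in_flight = {}             # (src, dst) -> number of buffered messages
    progress = True
    while progress:
        progress = False
        for rank, prog in enumerate(programs):
            if pc[rank] == len(prog):
                continue                       # this rank has finished
            kind, peer = prog[pc[rank]]
            if kind == "s":
                key = (rank, peer)
                if in_flight.get(key, 0) < buffer_slots:
                    # The system buffers the message; the send returns at once.
                    in_flight[key] = in_flight.get(key, 0) + 1
                    pc[rank] += 1
                    progress = True
                elif (in_flight.get(key, 0) == 0
                      and pc[peer] < len(programs[peer])
                      and programs[peer][pc[peer]] == ("r", rank)):
                    # Rendezvous: sender and receiver complete together.
                    pc[rank] += 1
                    pc[peer] += 1
                    progress = True
            else:
                key = (peer, rank)
                if in_flight.get(key, 0) > 0:
                    in_flight[key] -= 1        # consume a buffered message
                    pc[rank] += 1
                    progress = True
    return all(pc[r] == len(programs[r]) for r in range(len(programs)))


head_to_head = [[("s", 1), ("r", 1)], [("s", 0), ("r", 0)]]
print(runs_to_completion(head_to_head, buffer_slots=0))  # rendezvous
print(runs_to_completion(head_to_head, buffer_slots=1))  # one buffered message
```

In this toy model the head-to-head program stalls with no buffering and completes with one slot per channel, which is why such a bug can stay hidden until the code is ported to a system with less buffering.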
Deadlock Pattern: Communication Race • P0: r(*); r(P1); • P1: s(P0); • P2: s(P0); • OK: r(*) matches the send from P2, leaving the send from P1 for r(P1) • NOK: r(*) matches the send from P1, so r(P1) waits forever
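The two outcomes can be enumerated by trying each sender as the match for the wildcard receive. This is a small Python sketch under the assumption that each of P1 and P2 sends exactly one message to P0; the function name `race_outcomes` is made up for this example.

```python
def race_outcomes(senders=("P1", "P2"), second_recv_from="P1"):
    """Map each possible match of P0's wildcard receive to the run's outcome."""
    outcomes = {}
    for matched in senders:
        # Sends still unmatched after r(*) has consumed one of them.
        remaining = [s for s in senders if s != matched]
        # P0's second receive names a specific source and needs its send.
        outcomes[matched] = "OK" if second_recv_from in remaining else "deadlock"
    return outcomes


print(race_outcomes())
```

Testing by plain execution only observes whichever match the runtime happens to pick, so the deadlocking match may never show up on a given cluster.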
Deadlock Pattern: Mismatched Collectives • P0: Bcast; Barrier; • P1: Barrier; Bcast;
Resource Leak Pattern • P0: some_alloc_op(&handle); other_alloc_op(&handle); • The second call overwrites handle, so the first resource can no longer be released
Runtime Considerations • Does the system provide buffering? • If not, a rendezvous behavior is enforced! • What progress engine does MPI have? How does it schedule? • When does the runtime actually process events? • Whenever an MPI operation is issued • Whenever some operation that “pokes” the progress engine is issued • MPI run-time: there is no separate thread for it…
  if (my_rank == 0) {
      ...
      for (source = 1; source < p; source++) {
          MPI_Recv(..)
          ..
      }
  } else {
      MPI_Send(..)
  }
Conventional debugging of MPI • Inspection • Difficult to carry out on MPI programs (low level notation) • Simulation Based • Run given program with manually selected inputs • Can give poor coverage in practice • Simulation with runtime heuristics to find bugs • Marmot: Timeout based deadlocks, random executions • Intel Trace Collector: Similar checks with data checking • TotalView: Better trace viewing – still no “model checking”(?) • We don’t know if any formal coverage metrics are offered
Our Approach • Will look at C MPI programs • Can’t do C++ or FORTRAN yet • Won’t build Promela / Zing models • Such an approach will not scale • Will simplify code before running • Static analysis methods are under development • Will emphasize interleaving reduction • Many interleavings are equivalent! • Need a formal basis for Partial Order Reduction • Need Formal Semantics for MPI • Need to Formulate “Independence” • Need viable model-checking approach
POR • With 3 processes, the size of an interleaved state space is p^s = 3^3 = 27 • Partial-order reduction explores representative sequences from each equivalence class • Delays the execution of independent transitions • In this example, it is possible to “get away” with 7 states (one interleaving)
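Both figures can be reproduced with a short calculation. The Python sketch below assumes each of the 3 processes has 3 local states, i.e. takes 2 transitions; that reading is an assumption, since the slide's picture is not available, and the function names are hypothetical.

```python
from itertools import product


def global_state_count(local_states, processes):
    """Count all combinations of per-process local states."""
    return sum(1 for _ in product(range(local_states), repeat=processes))


def states_on_one_path(local_states, processes):
    """States visited by one interleaving that runs every process to its end."""
    transitions = processes * (local_states - 1)
    return transitions + 1   # the initial state plus one state per transition


print(global_state_count(3, 3))   # full interleaved state space
print(states_on_one_path(3, 3))   # a single representative interleaving
```

Under that assumption the full product has 27 global states, while one interleaving of the 6 transitions visits only 7 of them. One such path suffices when all the transitions are independent.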
Possible savings in one example • P0 (owner of window) and P1 (non-owner of window) each execute: 0: MPI_Init; 1: MPI_Win_lock; 2: MPI_Accumulate; 3: MPI_Win_unlock; 4: MPI_Barrier; 5: MPI_Finalize • These are the dependent operations • 504 interleavings without POR in this example • 2 interleavings with POR !!
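The slide does not say how the figure of 504 was obtained. One counting model that reproduces it, offered here only as an assumption, lets the first five calls of each process (MPI_Init through the MPI_Barrier call) interleave freely and allows a process to issue MPI_Finalize only after the other process has issued its barrier. The Python sketch below enumerates schedules under that model; `count_interleavings` is a hypothetical name.

```python
from functools import lru_cache

OPS = ["Init", "Win_lock", "Accumulate", "Win_unlock", "Barrier", "Finalize"]
BARRIER = OPS.index("Barrier")


@lru_cache(maxsize=None)
def count_interleavings(i=0, j=0):
    """Count schedules once P0 has issued i operations and P1 has issued j."""
    n = len(OPS)
    if i == n and j == n:
        return 1
    total = 0
    # Assumed rule: a process moves past its barrier only after the other
    # process has issued its own barrier call.
    if i < n and (i <= BARRIER or j > BARRIER):
        total += count_interleavings(i + 1, j)
    if j < n and (j <= BARRIER or i > BARRIER):
        total += count_interleavings(i, j + 1)
    return total


print(count_interleavings())
```

Under this model the count is C(10,5) × 2 = 252 × 2 = 504. If only one pair of operations, one from each process, is dependent, then only the two orders of that pair need to be explored, which is consistent with the 2 interleavings reported with POR.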
Getting a handle on MPI’s formal semantics (needed to define a suitable POR algorithm) • MPI 1.1 API (diagram labels): Requests; Collective Context; Group; Communicator; Point to Point Operations; Collective Operations; Constants
Simplified Semantics of MPI_Wait
Executable Formal Specification can help validate our understanding of MPI … • Tool flow (diagram labels): Visual Studio 2005 Verification Environment; Phoenix Compiler; MPIC IR; TLA+ MPI Library Model; TLA+ Prog. Model; MPIC Program Model; TLC Model Checker; MPIC Model Checker • FMICS 07, PADTAD 07
Even 5-line MPI programs may confound! Hence a litmus-test outcome calculator based on formal semantics is quite handy
  p0: { Irecv(rcvbuf1, from p1); Irecv(rcvbuf2, from p1); … }
  p1: { sendbuf1 = 6; sendbuf2 = 7; Issend(sendbuf1, to p0); Isend(sendbuf2, to p0); … }
• In-order message delivery (rcvbuf1 == 6) • Can access the buffers only after a later wait / test • The second receive may complete before the first • When Issend (synch.) is posted, all that is guaranteed is that Irecv(rcvbuf1,…) has been posted
POR algorithm design basics • Two co-enabled actions must not disable each other • They must commute when enabled • Independent means: • Do not disable each other • They do commute • Example: a[k]-- and a[j]++
MPI’s dependence is not static • Proc P: Send(to Q) • Proc Q: Recv(from *) • Proc R: Some Stmt; Send(to Q)
Dynamic Dependence due to MPI Wildcard Communication… • Proc P: Send(to Q) • Proc Q: Recv(from *) • Proc R: Some Stmt; Send(to Q)
Dependence in MPI (summary) • Sends targeting a wildcard receive can disable each other; hence they are dependent • Only co-enabled actions may be dependent • Certain MPI operations are non-blocking; hence we need to define the notion “co-enabled” carefully • One-sided memory: dependence is the traditional one (lock operations can disable each other)
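The two independence conditions (neither action disables the other, and the two commute) can be checked directly on a small transition system. The Python sketch below is illustrative only: transitions are modeled as functions that return the next state, or `None` when disabled, and the names `independent`, `dec`, `inc`, and `send_to_wildcard` are hypothetical.

```python
def independent(state, t1, t2):
    """Check the two POR conditions for t1 and t2 at one state."""
    s1, s2 = t1(state), t2(state)
    if s1 is None or s2 is None:
        return True            # not co-enabled here, so nothing to check
    s12, s21 = t2(s1), t1(s2)
    if s12 is None or s21 is None:
        return False           # one action disabled the other
    return s12 == s21          # both orders must reach the same state


def dec(k):
    """a[k]-- on an immutable tuple state."""
    return lambda a: a[:k] + (a[k] - 1,) + a[k + 1:]


def inc(j):
    """a[j]++ on an immutable tuple state."""
    return lambda a: a[:j] + (a[j] + 1,) + a[j + 1:]


def send_to_wildcard(pending_recvs):
    """A send that needs one posted wildcard receive; the state is their count."""
    return pending_recvs - 1 if pending_recvs > 0 else None


print(independent((5, 5), dec(0), inc(1)))              # array example
print(independent(1, send_to_wildcard, send_to_wildcard))  # two sends, one Recv(from *)
```

The array updates touch different cells and come out independent. The two sends compete for a single wildcard receive, so whichever goes first disables the other, and the pair is reported as dependent.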
Dynamic dependence situations are well handled using the DPOR algorithm (Flanagan, Godefroid, POPL 05) • Each state carries the sets { BT }, { Done } • Looking back from the current state, find the nearest dependent transition for the next move of the Red process • Add the Red process to the “Backtrack Set” there • This builds the “Ample set” incrementally based on observed dependencies • Blue is in the “Done” set • Ample determined using “local” criteria
Organization of ISP (diagram labels) • MPI Program • Simplifications • Simplified MPI Program • compile • executable • Proc 1 … Proc n • scheduler (request/permit exchanged with each process) • PMPI calls • Actual MPI Library and Runtime
Example of PMPI Instrumentation
  MPI_Win_unlock(arg1, arg2...argN) {
      sendToSocket(pID, Win_unlock, arg1,...,argN);
      while (recvFromSocket(pID) != go-ahead)
          MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD...);  // an innocuous progress-engine “poker”
      return PMPI_Win_unlock(arg1, arg2...argN);
  }
How to make DPOR work for MPI? (I) • How to instrument? • MPI provides the PMPI mechanism • For MPI_Send, we have a PMPI_Send that does the same thing • Override MPI_Send • Do instrumentation within it • Launch PMPI_Send when necessary • How to orchestrate the schedule? • MPI processes communicate with the scheduler through TCP sockets • MPI processes send MPI envelopes to the scheduler • The scheduler gives the go-ahead to whichever process it decides must move next • Execute up to MPI_Finalize • Naturally an acyclic state space !! • Replay by restarting the MPI system • Ouch !! but wait, … the Chinese Postman to the rescue ?
How to make DPOR work for MPI ? (II) • How to not get wedged inside MPI progress engine? • Understand MPI’s progress engine • If in doubt, “poke it” through commands that are known to enter the progress engine • Some of this has been demonstrated wrt. MPI one-sided • How to deal with system resource issues? • If the system provides buffering for ‘send’, how do we schedule? • We schedule Sends as soon as they arrive • If not, then how? • We schedule Sends only as soon as the matching Receives arrive
So how well does ISP work? • Trapezoidal integration deadlock • Found in seconds • Total 33 interleavings in 9 seconds after fix • 8.4 seconds spent restarting MPI system • Monte Carlo computation of Pi • Found three deadlocks we did not know about, in seconds • No modeling effort whatsoever • After fixing, took 3,427 interleavings taking 15.5 mins • About 15 mins restarting MPI system • For Byte-Range Locking using 1-sided • Deadlock was found by us in previous work • Found again by ISP in 62 interleavings • After fix, 11,000 interleavings… no end in sight
How to improve the performance of ISP? • Minimize restart overhead • Maybe we don’t need to reset all data before restarting • Implemented Chinese-Postman-like tour • “Collective goto” to initial state, just before MPI_Finalize • Trapezoidal finishes in 0.3 seconds (was 9 seconds before) • Monte Carlo finishes in 63 seconds (was 15 mins)
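The reported timings show how much of the run was restart overhead and what avoiding restarts bought. The Python snippet below just redoes that arithmetic from the figures quoted in the talk (9 s with 8.4 s restarting, 15.5 min with about 15 min restarting, then 0.3 s and 63 s); the helper names are hypothetical.

```python
def restart_share(total_s, restart_s):
    """Fraction of the run spent restarting the MPI system."""
    return restart_s / total_s


def speedup(before_s, after_s):
    """How many times faster the run became."""
    return before_s / after_s


# Trapezoidal integration: 9 s in total, 8.4 s of it restarting; 0.3 s afterwards.
print(f"trapezoidal restart share: {restart_share(9.0, 8.4):.0%}")
print(f"trapezoidal speedup: {speedup(9.0, 0.3):.0f}x")

# Monte Carlo Pi: 15.5 min in total, about 15 min restarting; 63 s afterwards.
print(f"Monte Carlo restart share: {restart_share(15.5 * 60, 15 * 60):.0%}")
print(f"Monte Carlo speedup: {speedup(15.5 * 60, 63.0):.1f}x")
```

In both examples well over 90% of the original running time went to restarts, which is why the tour-based replay pays off so strongly: roughly 30x for the trapezoidal example and about 15x for Monte Carlo.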
Other ideas to improve ISP (TBD) • Eliminate computations that don’t affect control • Static analysis to remove blocks that won’t deadlock • Insert barriers to confine search • Analysis to infer concurrent cuts (incomparable clock vectors) • We believe that in a year, we will be able to run ISP on unaltered applications such as ParMetis
Related work on FV for MPI programs • Main related work is that by Siegel and Avrunin • Provide synchronous channel theorems for blocking and non-blocking MPI constructs • Deadlocks caught iff caught using synchronous channels • Provide a state-machine model for MPI calls • Have built a tool called MPI_Spin that uses C extensions to Promela to encode MPI state-machine • Provide a symbolic execution approach to check computational results of MPI programs • Define “Urgent Algorithm,” which is a static POR algorithm • Schedules processes in a canonical order • Schedules sends when receives posted – sync channel effect • Wildcard receives handled through over-approximation