Enhance HPC program quality with MPI verification, exploring concurrency bugs and optimizing MPI programming. Collaboration with experts at the University of Utah for effective verification methodologies.
In-Situ Model Checking of MPI Parallel Programs Ganesh Gopalakrishnan Joint work with Salman Pervez, Michael DeLisi, Sarvani Vakkalanka, Subodh Sharma, Yu Yang, Robert Palmer, Mike Kirby, Guodong Li (http://www.cs.utah.edu/formal_verification) School of Computing, University of Utah Supported by: Microsoft HPC Institutes, NSF CNS 0509379
MPI is the de-facto standard for programming cluster machines (Image courtesy of Steve Parker, CSAFE, Utah) (BlueGene/L - Image courtesy of IBM / LLNL) Our focus: Eliminate Concurrency Bugs from HPC Programs !
Reason for our interest in MPI verification • Widely felt need • MPI is used on expensive machines for critical simulations • Potential for wider impact • MPI is a success as a standard • What’s good for MPI may be good for OpenMP, Cuda, Shmem, … • Working in a less crowded but important area • Funding in HW verification decreasing • We are still continuing two efforts: • Verifying hierarchical cache coherence protocols • Refinement of cache coherence protocol models to HW implementations • SW verification in “threading / shared memory” crowded • Whereas HPC offers LIBRARY BASED concurrent software creation as an unexplored challenge!
A highly simplistic view of MPI This view may help compare it against PThread programs, for instance Many MPI programs compute something like f o g o h ( x ) in a distributed manner (think of maps on separate data domains, and later combinations thereof) Compute h(x) on P1 Start g ( ) on P2 Fire-up f on P1 Use sends , receives , barriers , etc., to maximize computational speeds
Some high-level features of MPI • Organized as a large library (API) • Over 300 functions in MPI-2 (was 128 in MPI-1) • Most MPI programs use about a dozen • Usually a different dozen for each program
MPI programming and optimization • MPI includes Message Passing, Shared Memory, and I/O • We consider C++ MPI programs, largely focussing on msg passing • MPI programs are usually written by hand • Automated generation has been proposed and still seems attractive • Source-to-source optimizations of MPI programs attractive • Break up communications and overlap with computations (ASPHALT) • Many important MPI programs do evolve • Re-tuning after porting to a new cluster, etc. • Correctness expectation varies • Some are throw-away programs; others are long-lasting libraries • Code correctness – not Model Fidelity – is our emphasis
Why MPI is Complex: Collision of features • Rendezvous mode • Blocking mode • Non-blocking mode • Reliance on system buffering • User-attached buffering • Restarts/Cancels of MPI Operations • Send • Receive • Send / Receive • Send / Receive / Replace • Broadcast • Barrier • Reduce • Non Wildcard receives • Wildcard receives • Tag matching • Communication spaces An MPI program is an interesting (and legal) combination of elements from these spaces
Shared memory “escape” features of MPI • MPI has shared memory (called “one-sided”) • Nodes open shared region thru a “collective” • One process manages the region (“owner”) • Ensures serial access of the window • Within a lock/unlock, a process does puts/gets • There are more functions such as “accumulate” besides puts / gets • The puts/gets are not program-ordered !
A Simple Example of Msg Passing MPI
Programmer expectation: Integration of a region
1/3/2020
    // Add up the integrals calculated by each process
    if (my_rank == 0) {
        total = integral;
        for (source = 0; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
            total = total + integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
    }
A Simple Example of Msg Passing MPI
Bug! Mismatched send/recv causes deadlock (same code as on the previous slide)
[Diagram: P0 posts receives from 0, from 1, from 2; P1, P2 and P3 each post a send to 0]
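One reading of the mismatch is that the receive loop starts at source = 0, so rank 0 blocks on a message from itself that no process ever sends. The sketch below is a toy model in plain C, not MPI code: the function name `unmatched_receives` and its parameters are hypothetical, and it only counts which of rank 0's receives would have no matching send under that assumption.

```c
#include <assert.h>

/* Toy model, not MPI: rank 0 posts one blocking receive per source in
 * [first_source, p); every rank 1..p-1 posts exactly one send to rank 0.
 * Returns how many of rank 0's receives have no matching send. */
int unmatched_receives(int p, int first_source)
{
    assert(p >= 1 && first_source >= 0);
    int unmatched = 0;
    for (int source = first_source; source < p; source++) {
        /* Only non-zero ranks take the else-branch and call MPI_Send. */
        int sender_exists = (source != 0);
        if (!sender_exists)
            unmatched++;
    }
    return unmatched;
}
```

With four processes, `unmatched_receives(4, 0)` reports one receive that can never complete, while starting the loop at 1 leaves none. Because the receive is blocking, a single unmatched receive is enough to hang rank 0.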
Runtime Considerations
• Does the system provide buffering?
  • If not, a rendezvous behavior is enforced!
• What progress engine does MPI have? How does it schedule?
• When does the runtime actually process events?
  • Whenever an MPI operation is issued
  • Whenever some operation that “pokes” the progress engine is issued
• MPI run-time: there is no separate thread for it…
    if (my_rank == 0) {
        ...
        for (source = 1; source < p; source++) {
            MPI_Recv(..)
            ..
    } else {
        MPI_Send(..)
    }
Differences between MPI and Shared Memory / Thread Parallel programs
MPI programs: • Processes with local state communicate by copying • Not much dynamic process creation • Control / data dependencies are well confined (often to rank variables and such) • Simple aliasing
Shared memory / thread programs: • Processes sharing global state, heap • Synchronization using locks, signals, notifies, waits • PThread programs may spawn children dynamically • Pervasive decoding of “data” (e.g. through heap storage) • Aliasing relations may flow considerably across pointer chains and across procedure calls
Conventional debugging of MPI • Inspection • Difficult to carry out on MPI programs (low level notation) • Simulation Based • Run given program with manually selected inputs • Can give poor coverage in practice • Simulation with runtime heuristics to find bugs • Marmot: Timeout based deadlocks, random executions • Intel Trace Collector: Similar checks with data checking • TotalView: Better trace viewing – still no “model checking”(?) • We don’t know if any formal coverage metrics are offered
What should one verify ? • The overall computation achieves some “f o g o h” • Symbolic execution of MPI programs may work (Siegel et al.) • Symbolic execution has its limits • Finding out the “plumbing” of f, g, and h is non-trivial for optimized MPI programs • So why not look for reactive bugs introduced in the process of erecting the plumbing ? • A common concern: “my code hangs”: • ISends without wait / test • Assuming that the system provides buffering for Sends • Wildcard receive non-determinism is unexpected • Incorrect collective semantics assumed (e.g. for barriers) • ISP currently checks for deadlocks (not all procs reach MPI_Finalize). In future, we may check local assertions.
What approaches are cost-effective ? Some candidate approaches to MPI verification • Static Analysis for violated usages of the API • Model Checking for Concurrency Bugs • Instrumentation and Trace Checking • Static Analysis to Support Model Checking • Loop transformations • Strength reduction of code • … • But, …who gives us the formal models to check !?
Our initial choices .. and consequences • Will look at C++ MPI programs • Gotta do C++ , alas ; C won’t do • Not ask user to hand-build Promela / Zing models • Do In-Situ Model Checking – run the actual code • May need to simplify code before running • OK, so complementary static analysis methods needed • LOTS of interleavings that do not matter! • Process memory is not shared! • When can we commute two actions? • Need a formal basis for Partial Order Reduction • Need Formal Semantics for MPI • Need to Formulate “Independence” • Need viable model-checking approach
POR • With 3 processes, the size of an interleaved state space is p^s = 27 • Partial-order reduction explores representative sequences from each equivalence class • Delays the execution of independent transitions • In this example, it is possible to “get away” with 7 states (one interleaving)
Possible savings in one example
P0 (owner of window) and P1 (non-owner of window) each execute: 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize
• These are the dependent operations • 504 interleavings without POR in this example • 2 interleavings with POR !!
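The slide does not say how the figure of 504 is obtained. One counting model that reproduces it, offered here as an assumption rather than as the authors' derivation, treats the six calls of each process as atomic steps and lets a process step past MPI_Barrier only after both processes have issued it. That gives C(10,5) = 252 orderings of the first five calls, times 2 orderings of the two MPI_Finalize calls. The names `count_from` and `count_interleavings` are hypothetical.

```c
#include <assert.h>

#define NOPS 6      /* Init, Win_lock, Accumulate, Win_unlock, Barrier, Finalize */
#define BARRIER 4   /* index of MPI_Barrier in each process's sequence */

/* Count complete interleavings from the state in which P0 has executed
 * i calls and P1 has executed j calls. */
static long count_from(int i, int j)
{
    if (i == NOPS && j == NOPS)
        return 1;
    long total = 0;
    /* A process may move past the barrier only once both have issued it. */
    int both_in_barrier = (i > BARRIER && j > BARRIER);
    if (i < NOPS && (i <= BARRIER || both_in_barrier))
        total += count_from(i + 1, j);
    if (j < NOPS && (j <= BARRIER || both_in_barrier))
        total += count_from(i, j + 1);
    return total;
}

/* Number of interleavings of the two six-call sequences under this model. */
long count_interleavings(void)
{
    return count_from(0, 0);
}
```

Under these assumptions `count_interleavings()` evaluates to 504. The model deliberately ignores the window lock, which matches a count taken before any reduction is applied.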
We developed a formal semantics of MPI, both to understand MPI and to design a POR algorithm… [Figure: parts of the MPI 1.1 API covered by the semantics: Requests, Collective Context, Group, Communicator, Point to Point Operations, Collective Operations, Constants]
Simplified Semantics of MPI_Wait [Figure: simplified formal semantics of MPI_Wait]
Executable Formal Specification can help validate our understanding of MPI … [Figure: tool flow with the labels Visual Studio 2005, Verification Environment, Phoenix Compiler, MPIC IR, TLA+ MPI Library Model, TLA+ Prog. Model, MPIC Program Model, TLC Model Checker, MPIC Model Checker] (FMICS 07, PADTAD 07)
Even 5-line MPI programs may confound! Hence a litmus-test outcome calculator based on formal semantics is quite handy
    p0: { Irecv(rcvbuf1, from p1); Irecv(rcvbuf2, from p1); … }
    p1: { sendbuf1 = 6; sendbuf2 = 7; Issend(sendbuf1, to p0); Isend(sendbuf2, to p0); … }
• In-order message delivery (rcvbuf1 == 6)
• Can access the buffers only after a later wait / test
• The second receive may complete before the first
• When Issend (synch.) is posted, all that is guaranteed is that Irecv(rcvbuf1,…) has been posted
Alas, MPI’s dependence is not static
Proc P: Send(to Q)   Proc Q: Recv(from *)   Proc R: Some Stmt; Send(to Q)
• Dependencies may not be fully known JUST by looking at enabled actions • Conservative assumptions could be made (as in Siegel’s Urgent Algorithm) • The same problem exists with other “dynamic situations”, e.g. MPI_Cancel
Dynamic Dependence due to MPI Wildcard Communication…
Proc P: Send(to Q)   Proc Q: Recv(from *)   Proc R: Some Stmt; Send(to Q)
• Illustration of a missed dependency that would have been detected, had Proc R been scheduled first…
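The missed dependency can be illustrated with a small plain-C model (not MPI code). The struct `send_op`, the constant `ANY_SOURCE` and the function `matching_sends` are hypothetical names for this sketch: it counts the sends a wildcard receive could match, looking only at sends that have already been issued, which is all an analysis of the enabled actions can see.

```c
#include <assert.h>
#include <stdbool.h>

#define ANY_SOURCE (-1)   /* stand-in for a wildcard source in this model */

/* One send in the model: sender rank, destination rank, and whether the
 * sender has reached (issued) it yet. */
struct send_op { int from; int to; bool issued; };

/* Number of already-issued sends that a receive posted by `receiver`
 * with the given source (or ANY_SOURCE) could match. */
int matching_sends(const struct send_op *sends, int n, int receiver, int source)
{
    int count = 0;
    for (int k = 0; k < n; k++) {
        if (!sends[k].issued || sends[k].to != receiver)
            continue;
        if (source == ANY_SOURCE || sends[k].from == source)
            count++;
    }
    return count;
}
```

Taking P, Q and R as ranks 0, 1 and 2: while R is still executing "Some Stmt", only P's send is visible and the count is 1, so the receive looks deterministic. Once R's send is issued the count becomes 2, which is the dependency that a schedule running R first would have exposed.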
Dependence in MPI (partial results) • Wildcard receives and the sends targeting them are dependent • Each send potentially provides a different value to the receive • For Isend and Irecv, the dependency is induced by the wait / test that helps complete these operations • Barrier entry order does not matter • MPI Win_lock (owner) and Win_unlock (non-owner) • Need to characterize more MPI ops (future)
Situation is similar to that discussed in Flanagan/Godefroid POPL 05 (DPOR) Example actions: a[ j ]++ and a[ k ]-- • Action Dependence Determines COMMUTABILITY (POR theory is really detailed; it is more than commutability, but let’s pretend it is …) • Depends on j == k, in this example • Can be very difficult to determine statically • Can determine dynamically
Hence we turn to their DPOR algorithm • Each state carries two sets: { BT } (backtrack) and { Done } • From the current state, consider the next move of the Red process • Look back for the nearest dependent transition • Add the Red process to the “Backtrack Set” there; Blue is in the “Done” set • This builds the “Ample set” incrementally based on observed dependencies • Ample determined using “local” criteria
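The "looking back" step can be sketched in plain C as follows. This is a simplified illustration, not the Flanagan/Godefroid algorithm in full: it treats two transitions as dependent when different processes touch the same location (as with a[j]++ and a[k]-- when j == k) and omits the happens-before and co-enabledness checks that DPOR also performs. The names `transition`, `dependent` and `nearest_dependent` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* One executed step: which process ran and which location it touched. */
struct transition { int proc; int location; };

/* Simplified dependence test: different processes, same location. */
static bool dependent(struct transition a, struct transition b)
{
    return a.proc != b.proc && a.location == b.location;
}

/* Scan the executed trace backwards for the nearest transition that is
 * dependent with `next`. Returns its index, or -1 if there is none.
 * In DPOR, next.proc would then be added to the backtrack set of the
 * state from which trace[index] was taken. */
int nearest_dependent(const struct transition *trace, int len,
                      struct transition next)
{
    for (int idx = len - 1; idx >= 0; idx--)
        if (dependent(trace[idx], next))
            return idx;
    return -1;
}
```

Because the locations are recorded as the program runs, the j == k question is answered from observed values rather than by static analysis.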
How to make DPOR work for MPI ? (I) • How to instrument? • MPI provides the PMPI mechanism • For MPI_Send, we have a PMPI_Send that does the same thing • Override MPI_Send • Do instrumentation within it • Launch PMPI_Send when necessary • How to orchestrate the schedule? • MPI processes communicate with the scheduler through TCP sockets • MPI processes send MPI envelopes to the scheduler • Scheduler releases whichever process it decides must go next • Execute up to MPI_Finalize • Naturally an acyclic state space !! • Replay by restarting the MPI system • Ouch !! but wait, … the Chinese Postman to the rescue ?
How to make DPOR work for MPI ? (II) • How to not get wedged inside the MPI progress engine? • Understand MPI’s progress engine • If in doubt, “poke it” through commands that are known to enter the progress engine • Some of this has been demonstrated with respect to MPI one-sided • How to deal with system resource issues? • If the system provides buffering for ‘send’, how do we schedule? • We schedule Sends as soon as they arrive • If not, then how? • We schedule Sends only once the matching Receives arrive
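The two send-scheduling rules above amount to a single predicate. The sketch below is a hypothetical formulation in plain C; the function name `may_release_send` and its parameters are not taken from ISP.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scheduler predicate: may a trapped Send be released now?
 * system_buffers_sends   - the MPI system is assumed to buffer sends
 * matching_recv_arrived  - the matching Receive has reached the scheduler */
bool may_release_send(bool system_buffers_sends, bool matching_recv_arrived)
{
    if (system_buffers_sends)
        return true;               /* buffered: schedule as soon as it arrives */
    return matching_recv_arrived;  /* rendezvous: wait for the matching Receive */
}
```

Keeping the buffering assumption as an explicit parameter makes it possible to explore the same program under both regimes, which matters because code that relies on system buffering can hang when none is provided.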
Simple 1-sided Example…will show advancing computation by Blue marching P1 (non-owner of window) P0 (owner of window) 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize
Simple 1-sided Example, step by step (P0 and P1 run the same six calls as listed on the previous slide)
• Record that owner has acquired window access
• Treat non-owner’s win_lock as a no-op
• Perform Accumulate from P1
• Perform Accumulate from P0
• P1 issues Win_unlock; the scheduler traps it and notes that P0 has locked the window
• So the scheduler records P1 to be in a “blocked” state, but we do allow P1 to launch its PMPI_Win_unlock (there is nothing else that could be done!)
• Now, P0 issues win_unlock
• Recall that P1’s PMPI_Win_unlock has been launched, but P1 has not reported back to the scheduler yet…
• To keep things simple, the scheduler works in “phases”: when Pi…Pj have been “let go” in one phase of the scheduler, no other Pk is “let go” till Pi…Pj have reported back
• If we now allow P0’s PMPI_Win_unlock to issue, it may zip through the progress engine and miss P1’s PMPI_Win_unlock
• But we HAVE to allow P0 to launch, or else P1 won’t get access to the window!
• So P1 will likely be stuck in the progress engine
• But P0 next enters the progress engine only at Barrier
• But we don’t schedule P0 till P1 has reported back
• But P1 won’t report back (stuck inside the progress engine)
• Deadlock inside the scheduler!
• Solution: when P0 comes to the scheduler, we do not give it a ‘go-ahead’, so it keeps poking the progress engine; this causes P1 to come back to the scheduler; then we let P0’s PMPI_Win_unlock issue
P0’s code to handle MPI_Win_unlock (in general, this is how every MPI_SomeFunc is structured…)
    MPI_Win_unlock(arg1, arg2...argN) {
        sendToSocket(pID, Win_unlock, arg1,...,argN);
        while (recvFromSocket(pID) != go-ahead)
            MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD...);   // an innocuous progress-engine “poker”
        return PMPI_Win_unlock(arg1, arg2...argN);
    }
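The control flow of this wrapper can be exercised without MPI or sockets. The sketch below is a plain-C model under stated assumptions: `mock_scheduler`, `recv_reply`, `noop_poke` and `wait_for_go_ahead` are hypothetical stand-ins for the scheduler socket, `recvFromSocket`, `MPI_Iprobe` and the wrapper's loop. It only shows that the process keeps poking until the scheduler answers with a go-ahead.

```c
#include <assert.h>

enum reply { KEEP_WAITING, GO_AHEAD };

/* Stand-in for the scheduler socket: a scripted sequence of replies. */
struct mock_scheduler { const enum reply *replies; int next; int n; };

/* Stand-in for recvFromSocket(pID): returns the next scripted reply. */
static enum reply recv_reply(struct mock_scheduler *s)
{
    assert(s->next < s->n);
    return s->replies[s->next++];
}

/* Stand-in for the MPI_Iprobe call that pokes the progress engine. */
void noop_poke(void) { }

/* Mirrors the wrapper's loop: poke the progress engine until GO_AHEAD.
 * Returns the number of pokes issued; the real wrapper would then call
 * the PMPI_ version of the trapped function. */
int wait_for_go_ahead(struct mock_scheduler *s, void (*poke)(void))
{
    int pokes = 0;
    while (recv_reply(s) != GO_AHEAD) {
        poke();
        pokes++;
    }
    return pokes;
}
```

With the scripted replies KEEP_WAITING, KEEP_WAITING, GO_AHEAD the loop pokes twice before returning. In the one-sided example, those pokes by P0 are what allow P1's pending PMPI_Win_unlock to make progress and report back.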
Assessment of Solution to forward-progress • Solutions may be MPI-library specific • This is OK so long as we know exactly how the progress engine of the MPI library works • This needs to be advertised by MPI library designers • Better still: if they can provide more “hooks”, ISP can be made more successful