Explore rigorous guarantees for debugging concurrent systems with formal methods. Formalize high-performance message passing behavior for MPI API. Develop dynamic analysis tools for MPI programs.
Message Passing: Formalization, Dynamic Verification
Ganesh Gopalakrishnan
School of Computing, University of Utah, Salt Lake City, UT 84112, USA
Based on research done by students Sarvani Vakkalanka, Anh Vo, Michael DeLisi, Alan Humphrey, Chris Derrick, Sriram Aananthakrishnan, and faculty colleague Mike Kirby
http://www.cs.utah.edu/formal_verification
Supported by NSF CNS 0509379 and Microsoft
Correctness Concerns Will Loom Everywhere…
Debug Concurrent Systems, providing rigorous guarantees
Need for help / rigor noted by notable practitioners
• “Sequential programming is really hard, and parallel programming is a step beyond that” (Tanenbaum, USENIX 2008 Lifetime Achievement Award talk)
• “Formal methods provide the only truly scalable approach to developing correct code in this complex programming environment.” (Rusty Lusk, in his EC2 2009 invited talk entitled “Slouching Towards Exascale: Programming Models for High Performance Computing”)
Must Cover BOTH Types of Concurrency
• Shared Memory
  • Enjoys the most attention (esp. from the CS FV community)
• Message Passing
  • Formal aspects of message passing are represented by CCS, CSP, …
  • Many practical message passing libraries exist, but without a rigorous semantics that characterizes their stand-alone behavior and/or their semantics in the context of a standard programming language (e.g. how compiler optimizations work in their presence)
  • The time is now ripe to make progress with respect to a few important message passing libraries (e.g., MPI, MCAPI, …)
Importance of Formalizing High-performance Message Passing Behavior
• Fundamental to dealing with the Message Passing Interface (MPI) API
  • MPI is VERY widely used
• Enables reasoning about the reactive behavior of API calls
  • Out of order issue and completion – easily explained through Happens-before (HB)
  • This HB took us a long time to discover, but it is surprisingly easy to explain!
  • Made up of MATCHES-BEFORE and COMPLETES-BEFORE
  • Happens-before depends on available run-time resources
• Can help characterize compiler optimizations formally
• Handle new correctness-critical message-passing libraries
  • Multi-core Communications API (MCAPI) for embedded systems use (e.g. cell phones) – can be understood using a VERY SIMILAR formalism
• Understanding / pedagogy of message-passing program behavior
  • No need to dismiss this area as “too hairy”
• Enables building formal dynamic verification tools
  • Find bugs, reveal lurking “unexpected behaviors”, …
In general, we must get better at verifying concurrent programs written against a growing number of real APIs
• Code written using mature libraries (MPI, OpenMP, PThreads, …)
  • Model building and model maintenance have HUGE costs (I would assert: “impossible in practice”) and do not ensure confidence!
• API calls made from real programming languages (C, Fortran, C++)
• Runtime semantics determined by realistic compilers and runtimes
Importance of MPI Program Analysis / Debugging
(Images: SiCortex 5832 processor system, courtesy SiCortex; IBM Blue Gene, picture courtesy IBM; LANL’s petascale machine “Roadrunner”, AMD Opteron CPUs and IBM PowerXCell)
• Almost the default choice for large-scale parallel simulations
• Huge support base
• Very mature codes exist in MPI – cannot easily be re-implemented
• Performs critical simulations in Science and Engineering
• Weather / Earthquake Prediction, Computational Chemistry, …, Parallel Model Checking, …
Two Classes of MPI Programs
• Mostly Computational
  • These are sequential programs “pulled apart”
  • One can see higher order functions (map, …)
  • While optimizing these programs, reactive behavior creeps in
    • non-blocking sends overlapped with computation
    • probing for computations finishing and initiating new work early
• Highly Reactive
  • User level libraries written in MPI, e.g. Adaptive Dynamic Load Balancing libraries
• Bottom line: must employ suitable dynamic verification methods for MPI
Our Work
• We have a formal model for MPI
• This formal model explains succinctly the space of all standard-compliant executions of MPI
  • What must a standard-compliant MPI library together with the support infrastructure (runtime, compilers, …) finally amount to?
Practical Contribution of Our Work
• We have built the only push-button dynamic analysis tool for MPI / C programs, called ISP
  • Work on MPI / Fortran in progress
  • Runs on Mac OS X, Windows, Linux
  • Tested against five state-of-the-art MPI libraries: MPICH2, OpenMPI, MSMPI, MVAPICH, IBM MPI (in progress)
  • Visual Studio and Eclipse Parallel Tools Platform integration
  • 100s of large case studies
  • Efficiency is decent (getting better): 15K LOC ParMETIS hypergraph partitioner analyzed for deadlocks, resource leaks, assertion violations for a given test harness in < 5 seconds for 2 MPI processes on a laptop
  • Being downloaded by many
  • Contribution to the Eclipse Consortium underway
• ISP can dynamically execute and reveal the space of all standard-compliant executions of MPI even when running on an arbitrary (standard-compliant) platform
  • ISP’s internal scheduling decisions are taken in a fairly general way
One-page Ad on ISP
• Verifies MPI User Applications, generating only the Relevant Process Interleavings
• Detects all Deadlocks, Assert Violations, MPI object leaks, and Default Safety Properties
• Works by Instrumenting MPI Calls, Computing Relevant Interleavings, Replaying
(BlueGene/L - Image courtesy of IBM / LLNL) (Image courtesy of Steve Parker, U of Utah)
This talk
• Explains the core of MPI using four letters: S, R, B, W
  • S starts a DMA send transfer, R starts a DMA receive transfer, W waits for the transfer to finish, B arranges for efficient global synchronization
  • [Hunch] Any attempt to create efficient message passing will result in a similar set of primitives
• We can now explain one-liner MPI programs that can confound even experts!
• This explanation is what ISP’s algorithm also uses
Summary of Some MPI Commands
• MPI_Isend(destination, msg_buf, request_structure, other args)
  • This is a non-blocking call
  • It initiates copying of msg_buf into the MPI runtime so that a matching MPI receive invoked from process destination will receive the contents of msg_buf
• MPI_Wait(… request_structure…) typically follows MPI_Isend
  • When this BLOCKING call returns, the copying is finished
• MPI_Isend(destination, msg_buf, request_structure, others)
  • We will abbreviate this call as Isend(destination, request_structure)
  • Example: Isend(2, req)
  • And finally as S(2) or S(to:2) or S(to:2, req)
• MPI_Irecv(source, msg_buf, request_structure, other args)
  • This is a non-blocking call
  • It initiates receipt into msg_buf from the MPI runtime so that a matching MPI send invoked from process source can provide the contents of msg_buf
• MPI_Wait(… request_structure…) typically follows MPI_Irecv
  • When this BLOCKING call returns, the receipt is finished
  • Wait is abbreviated W(req) or W or …
• MPI_Irecv(source, msg_buf, request_structure, other args)
  • Abbreviated as Irecv(source, req)
  • Example: Irecv(3, req), or even Irecv(*, req) in case any available source would do
  • Finally as R(from:3, req), R(from:3), R(3), …
More MPI Commands
• MPI_Barrier(…) is abbreviated as Barrier() or even Barrier
• All processes must invoke Barrier before any process can return from the Barrier invocation
• Useful high-performance global sync. operation
• Abbreviated as B
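The S / R / W / B shorthand introduced on these slides can be made machine-readable with a few lines of Python. The sketch below is only an illustration: the `Op` tuple and the `parse_op` and `parse_process` helpers are hypothetical names that appear in neither the talk nor ISP, and the parser accepts only the spellings shown on the slides (`S(to:2, req)`, `R(from:*, r1)`, `R(3)`, `W(req)`, `B`).

```python
import re
from typing import NamedTuple, Optional

class Op(NamedTuple):
    kind: str               # "S", "R", "W", or "B"
    peer: Optional[str]     # destination (S) or source (R); "*" is the wildcard
    handle: Optional[str]   # request handle such as "r1", if one was written

# One letter, optionally followed by a parenthesised argument list.
_PATTERN = re.compile(r"^\s*([SRWB])\s*(?:\((.*)\))?\s*$")

def parse_op(text: str) -> Op:
    """Parse one abbreviated instruction, e.g. 'S(to:2, req)' or 'B'."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"not an S/R/W/B instruction: {text!r}")
    kind, raw_args = m.group(1), m.group(2) or ""
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    peer = handle = None
    if kind in "SR" and args:
        # Accept "to:2", "from:*", or a bare "2".
        peer = args[0].split(":", 1)[-1].strip()
        handle = args[1] if len(args) > 1 else None
    elif kind == "W" and args:
        handle = args[0]
    return Op(kind, peer, handle)

def parse_process(line: str) -> list[Op]:
    """Parse a ';'-separated instruction sequence for one process."""
    return [parse_op(part) for part in line.split(";") if part.strip()]

program = parse_process("R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);")
```

Here `program` holds five `Op` values, one per instruction, which is a convenient form for experimenting with the ordering rules discussed later in the talk.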
Simple MPI Program : ‘lucky.c’
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: Sleep(3); S(to:0, r1); All the Ws…
Process P2: //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
Simple MPI Program : ‘lucky.c’
Outcome for the program above with this timing: deadlock
Simple MPI Program : ‘unlucky.c’
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: // Sleep(3); S(to:0, r1); All the Ws…
Process P2: Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
Simple MPI Program : ‘unlucky.c’
Outcome for the program above with this timing: no deadlock
Runs of lucky.c and unlucky.c on mpich using “standard testing” (“lucky” for tester)

unlucky.c:
    mpicc unlucky.c -o unlucky.out
    mpirun -np 3 ./unlucky.out
    (0) is alive on ganesh-desktop
    (2) is alive on ganesh-desktop
    (1) is alive on ganesh-desktop
    Rank 0 did Irecv
    Rank 1 did Send
    Rank 0 got 11
    Sleep over
    Rank 2 did Send
    (2) Finished normally
    (1) Finished normally
    (0) Finished normally
    [.. OK ..]

lucky.c:
    mpicc lucky.c -o lucky.out
    mpirun -np 3 ./lucky.out
    (0) is alive on ganesh-desktop
    (1) is alive on ganesh-desktop
    (2) is alive on ganesh-desktop
    Rank 0 did Irecv
    Rank 2 did Send
    Sleep over
    Rank 1 did Send
    [.. hang ..]
Runs of lucky.c and unlucky.c using ISP
• ISP will find the deadlock in both cases, unaffected by the “sleep”s
• The tailor-made DPOR (dynamic partial order reduction) that ISP uses, the dynamic instruction rewriting based execution control, … are discussed elsewhere
How many interleavings in lucky.c?
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: Sleep(3); S(to:0, r1); All the Ws…
Process P2: //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
More than 500 interleavings without any reductions
How many relevant interleavings?
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: Sleep(3); S(to:0, r1); All the Ws…
Process P2: //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
Just two! One for each Irecv(..) match.
MPI is tricky… till you see how it really works!
Which send must be allowed to finish first?
P0 --- S(to:1, big-message, h1); … S(to:2, small-message, h2); … W(h2); … W(h1);
P1 --- R(from:0, buf1, h3); … W(h3);
P2 --- R(from:0, buf2, h4); … W(h4);
MPI is tricky… till you see how it really works!
Will this single-process example called “Auto-send” deadlock?
P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
The “Crooked Barrier” example
P0 --- S1(to : P2 ); B
P1 --- B; S2(to : P2 )
P2 --- R(from : *); B
Match across barrier possible? Can S2(to : P2 ) match R(from : *) ?
It will be good to explain all these programs without relying upon “bee dances”
MPI HB to the rescue!
These pairs WITHIN A PROCESS are in the MPI HB
• S(to:x); … ; S(to:x)
• R(from:y); … ; R(from:y)
• R(from:*); … ; R(from:any)
• S(to:x, h); … ; W(h)
• R(from:y, h); … ; W(h)
• W(h); … ; any
• B; … ; any
This HB is what makes MPI high-performance !!
• S(to:x); … ; S(to:x) -- order only for non-overtaking
• R(from:y); … ; R(from:y) -- ditto
• R(from:*); … ; R(from:any) -- OK wildcard trumps ordinary-card
• S(to:x, h); … ; W(h) -- Neat! Resource modeling hidden here! (so neat that in our latest work, this HB explains slack inelasticity!!)
• R(from:y, h); … ; W(h) -- Neat too
• W(h); … ; any -- One place to truly block
• B; … ; any -- Another place to block!
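The seven intra-process HB rules listed above can be written down as a small predicate. The following is a minimal Python sketch, not ISP's implementation: `Op`, `hb_before`, and `hb_edges` are hypothetical names, and the sketch works on the high-level instructions only, ignoring the finer issued / returned / matched / completed events.

```python
from typing import NamedTuple, Optional

class Op(NamedTuple):
    kind: str                     # "S", "R", "W", or "B"
    peer: Optional[str] = None    # destination for S, source for R, "*" = wildcard
    handle: Optional[str] = None  # request handle, e.g. "h1"

def hb_before(a: Op, b: Op) -> bool:
    """True if a, issued earlier than b in the same process, is HB-before b."""
    if a.kind in ("W", "B"):             # W(h); ... ; any   and   B; ... ; any
        return True
    if b.kind == "W":                    # S(to:x, h); ... ; W(h)  /  R(from:y, h); ... ; W(h)
        return a.handle is not None and a.handle == b.handle
    if a.kind == "S" and b.kind == "S":  # S(to:x); ... ; S(to:x)
        return a.peer == b.peer
    if a.kind == "R" and b.kind == "R":  # R(from:y); ... ; R(from:y)  /  R(from:*); ... ; R(from:any)
        return a.peer == b.peer or a.peer == "*"
    return False

def hb_edges(program):
    """All (earlier, later) index pairs of one process that are HB-ordered."""
    return [(i, j)
            for i in range(len(program))
            for j in range(i + 1, len(program))
            if hb_before(program[i], program[j])]

# The Auto-send program: R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
AUTO_SEND = [Op("R", "0", "h1"), Op("B"), Op("S", "0", "h2"),
             Op("W", handle="h1"), Op("W", handle="h2")]
EDGES = hb_edges(AUTO_SEND)
```

For Auto-send, `EDGES` contains no pair starting at the receive other than `(0, 3)`, i.e. R(from:0, h1) is ordered only before W(h1). In particular the receive is not HB-before the barrier or the send, which is why the barrier may fire while the receive is still pending.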
Strictly, we must define HB on inner events
• Issued -- >
• Call returned -- <
• Call matched -- <>
• Call completed -- *
• S, R go through all four states
• W has no meaningful <> (take it the same as *)
• B has no meaningful * (take it the same as <>)
For this talk, define HB wrt the higher level instructions themselves (see FM 2009 for details)
HB based state transition semantics
• Fence = an instruction that orders all later program-ordered instructions via HB (for us, B and W)
• “Process at a fence” = process that has just issued a fence instruction
• During dynamic verification, each process that is not at a fence is permitted to issue its next instruction, and then extend the HB graph
• Define HB-ancestor, HB-descendant, matched-HB-ancestor
• Match-enabled instruction = one whose HB-ancestors have all matched
• Allow any match-enabled instruction to form a match-set suitably
  • S goes with matching R, B goes with another B
  • For S(to:1), S(to:2), and R(from:*), dynamically rewrite to the match sets {S(to:1), R(from:1)} and {S(to:2), R(from:2)}
  • This is called an R* match-set (actually a set of match-sets)
• Fire match sets; an R* match-set is fired only when there are no non-R* match sets, and all processes are at a fence
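The wildcard rewriting step can be sketched in Python as follows. This is an illustration under stated assumptions, not ISP's code: `Send`, `Recv`, and `rstar_match_sets` are hypothetical names, sends are identified only by sender and destination rank, and the caller is assumed to pass only sends that are already match-enabled.

```python
from typing import NamedTuple

class Send(NamedTuple):
    sender: int
    dest: int

class Recv(NamedTuple):
    receiver: int
    source: object   # an int rank, or "*" for the wildcard

def rstar_match_sets(wildcard: Recv, enabled_sends: list[Send]):
    """Rewrite R(from:*) into one deterministic match set per eligible sender."""
    if wildcard.source != "*":
        raise ValueError("expected a wildcard receive")
    match_sets = []
    seen = set()
    for s in enabled_sends:
        # Only the earliest enabled send of each sender is eligible,
        # because sends to the same destination are HB-ordered (non-overtaking).
        if s.dest == wildcard.receiver and s.sender not in seen:
            seen.add(s.sender)
            match_sets.append((s, Recv(wildcard.receiver, s.sender)))
    return match_sets

# Assumed situation from lucky.c: P0's first R(from:*) is pending while the
# first sends of P1 and P2 (both to rank 0) are match-enabled.
SETS = rstar_match_sets(Recv(0, "*"), [Send(1, 0), Send(2, 0)])
```

`SETS` has two entries, one pairing the receive with P1's send and one pairing it with P2's send. A dynamic verifier would explore one execution per entry, which is consistent with the two relevant interleavings reported for lucky.c.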
How Example Auto-send works
P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
The HB (slide figure)
• Issue R(from:0, h1), because prior to issuing R, P0 is not at a fence
• Issue B, because after issuing R, P0 is not at a fence
• Form match set; the match-enabled set is {B}
• Fire the match-enabled set {B}
• Issue S(to:0, h2), because now that B is gone, P0 is no longer at a fence
• Issue W(h1), because after S(to:0, h2), P0 is not at a fence
• Can’t form a { W(h1) } match set because it has an unmatched ancestor (namely R(from:0, h1))
• Form the { R(from:0, h1), S(to:0, h2) } match set and fire it
• Now form and fire the match set { W(h1) }
• Now issue W(h2)
• Form match set { W(h2) } and fire it. Done.
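The walkthrough above can be replayed mechanically. The Python sketch below is a simplified model of the issue / match / fire loop and makes several assumptions that are not from the talk: the names `Op`, `hb_before`, and `simulate` are hypothetical, wildcard receives are rejected rather than explored, a process counts as "at a fence" while it has an issued W or B that has not yet matched, and one match set is fired per round.

```python
from typing import NamedTuple, Optional

class Op(NamedTuple):
    kind: str                     # "S", "R", "W", or "B"
    peer: Optional[int] = None    # destination rank for S, source rank for R
    handle: Optional[str] = None  # request handle tying an S or R to its W

def hb_before(a: Op, b: Op) -> bool:
    """Intra-process HB between an earlier instruction a and a later one b."""
    if a.kind in ("W", "B"):
        return True
    if b.kind == "W":
        return a.handle is not None and a.handle == b.handle
    return a.kind == b.kind and a.peer == b.peer  # S;S or R;R with the same peer

def simulate(programs):
    """Run the issue/match/fire loop; return (trace, deadlocked)."""
    if any(op.kind == "R" and op.peer == "*" for prog in programs for op in prog):
        raise ValueError("wildcard receives are outside this sketch")
    n = len(programs)
    issued = [0] * n                     # instructions issued so far, per rank
    matched = [set() for _ in range(n)]  # indices of matched instructions, per rank
    trace = []

    def at_fence(p):
        return any(programs[p][i].kind in ("W", "B") and i not in matched[p]
                   for i in range(issued[p]))

    def enabled(p, i):
        # Match-enabled: unmatched, and every HB-ancestor has matched.
        return i not in matched[p] and all(
            j in matched[p] for j in range(i)
            if hb_before(programs[p][j], programs[p][i]))

    def pending(p, kind):
        return [i for i in range(issued[p])
                if programs[p][i].kind == kind and enabled(p, i)]

    def find_match_set():
        barriers = [pending(p, "B") for p in range(n)]
        if all(barriers):                # every rank has a match-enabled B
            return tuple((p, b[0]) for p, b in enumerate(barriers))
        for p in range(n):               # S goes with a matching R
            for i in pending(p, "S"):
                dest = programs[p][i].peer
                for j in pending(dest, "R"):
                    if programs[dest][j].peer == p:
                        return ((p, i), (dest, j))
        for p in range(n):               # W forms a match set on its own
            for i in pending(p, "W"):
                return ((p, i),)
        return None

    while True:
        progressed = False
        for p in range(n):
            while issued[p] < len(programs[p]) and not at_fence(p):
                trace.append(("issue", p, issued[p]))
                issued[p] += 1
                progressed = True
        match_set = find_match_set()
        if match_set is not None:
            for p, i in match_set:
                matched[p].add(i)
            trace.append(("fire", match_set))
            progressed = True
        if not progressed:
            break
    deadlocked = any(len(matched[p]) < len(programs[p]) for p in range(n))
    return trace, deadlocked

AUTO_SEND = [[Op("R", 0, "h1"), Op("B"), Op("S", 0, "h2"),
              Op("W", handle="h1"), Op("W", handle="h2")]]
trace, deadlocked = simulate(AUTO_SEND)
```

With this input the trace follows the order of the walkthrough: issue R, issue B, fire {B}, issue S, issue W(h1), fire {S, R}, fire {W(h1)}, issue W(h2), fire {W(h2)}, and `deadlocked` is false. Moving W(h1) directly after the receive makes the model report a deadlock, because the send that would satisfy the receive is never issued.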
The “Crooked Barrier” example
P0 --- S1(to : P2 ); B
P1 --- B; S2(to : P2 )
P2 --- R(from : *); B
S2(to : P2 ) can match R(from : *) ! Here is how …