Scalable Formal Dynamic Verification of MPI Programs through Distributed Causality Tracking Dissertation Defense Anh Vo Committee: • Prof. Ganesh Gopalakrishnan (co-advisor), Prof. Robert M. Kirby (co-advisor), • Dr. Bronis R. de Supinski (LLNL), Prof. Mary Hall and Prof. Matthew Might
Our computational ambitions are endless! • Terascale • Petascale (where we are now) • Exascale • Zettascale • Correctness is important [Images: Jaguar (courtesy ORNL), protein folding (courtesy Wikipedia), computational astrophysics (courtesy LBL)]
Yet we are falling behind when it comes to correctness • Concurrent software debugging is hard • It gets harder as the degree of parallelism in applications increases • Node level: Message Passing Interface (MPI) • Core level: Threads, OpenMP, CUDA • Hybrid programming will be the future • MPI + Threads • MPI + OpenMP • MPI + CUDA • Yet tools are lagging behind! • Many tools cannot operate at scale [Diagram: the growing gap between MPI apps and MPI correctness tools]
We focus on dynamic verification for MPI • Lack of systematic verification tools for MPI • We need to build verification tools for MPI first • Realistic MPI programs run at large scale • Downscaling might mask bugs • MPI tools can be expanded to support hybrid programs
We choose MPI because of its ubiquity • Born 1994 when the world had 600 internet sites, 700 nm lithography, 68 MHz CPUs • Still the dominant API for HPC • Most widely supported and understood • High performance, flexible, portable
Thesis statement Scalable, modular and usable dynamic verification of realistic MPI programs is feasible and novel.
Contributions • Need scalable distributed algorithms to discover alternate schedules • Using only local states and observed matches • Matches-before is necessary • DAMPI • Distributed Analyzer for MPI programs • Implements distributed causality tracking using matches-before • ISP (previous work) • Dynamic verifier for MPI • Implements a scheduler which exerts control to enforce matches-before • Publications: PACT 2011 (in submission), SC 2010, FM 2010, EuroPVM 2010, PPoPP 2009, EuroPVM 2009
Agenda • Motivation and Contributions • Background • MPI ordering based on Matches-Before • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
Traditional testing is wasteful… • Example: deterministic operations are permuted • P1, P2, …, Pn each call MPI_Barrier • Exploring all n! permutations of these barrier calls is wasteful
Testing can also be inconclusive; without non-determinism coverage, we can miss bugs
P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);
Unlucky (bug missed)
Testing can also be inconclusive; without non-determinism coverage, we can miss bugs
P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);
Lucky (bug caught!)
Verification: test all possible scenarios • Find all possible matches for the receive • Ability to track causality is a prerequisite • Replay the program and force the other matches
P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);
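For concreteness, here is a minimal compilable C sketch of the example program above (buffer contents, tags, and the ERROR action are illustrative, not taken from the dissertation). Whether the ERROR branch fires depends entirely on which send the wildcard receive matches, which is exactly the non-determinism a verifier must cover.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0, payload = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* run with at least 3 ranks */

    if (rank == 0) {                         /* P0 */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        payload = 22;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                  /* P1 */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* wildcard receive: may match P0's send (22) or P2's send (33) */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (x == 22)
            fprintf(stderr, "ERROR branch reached\n");  /* the hidden bug */
        else
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else if (rank == 2) {                  /* P2 */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        payload = 33;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}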
Dynamic verification of MPI • Dynamic verification combines the strengths of formal methods and testing • Avoids generating false alarms • Finds bugs with respect to actual binaries • Builds on the familiar approach of “testing” • Guarantees coverage over nondeterminism
Overview of Message Passing Interface (MPI) • An API specification for communication protocols between processes • Allows developers to write high performance and portable parallel code • Rich in features • Synchronous: easy to use and understand • Asynchronous: high performance • Nondeterministic constructs: reduce code complexity
MPI operations
send(P,T) / send(P) – send a message (with tag T) to process P: MPI_Send(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
recv(P,T) – receive a message with tag T from process P; recv(*,T) – receive a message with tag T from any process; recv(*,*) – receive a message with any tag from any process: MPI_Recv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status status)
isend(P,T,h) – nonblocking send, communication handle h: MPI_Isend(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request h)
irecv(P,T,h), irecv(*,T,h), irecv(*,*,h) – nonblocking receive, communication handle h: MPI_Irecv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Request h)
wait(h) – wait for the completion of h: MPI_Wait(MPI_Request h, MPI_Status status)
barrier – synchronization: MPI_Barrier(MPI_Comm comm)
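To tie the shorthand to real calls, here is a small hedged C sketch (not from the slides): isend and irecv return a request handle h that a later wait(h) completes, and barrier synchronizes all ranks in the communicator.

#include <mpi.h>

void example(int peer, int tag, MPI_Comm comm) {
    int out = 42, in = 0;
    MPI_Request h_send, h_recv;

    /* isend(peer, tag, h_send) and irecv(*, tag, h_recv) */
    MPI_Isend(&out, 1, MPI_INT, peer, tag, comm, &h_send);
    MPI_Irecv(&in, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &h_recv);

    MPI_Barrier(comm);                        /* barrier */

    MPI_Wait(&h_send, MPI_STATUS_IGNORE);     /* wait(h_send) */
    MPI_Wait(&h_recv, MPI_STATUS_IGNORE);     /* wait(h_recv) */
}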
Nonovertaking rule facilitates message matching • Sender sends two messages: • Both can match a receive operation • The first message will match before the second message • Receiver posts two receives: • Both can match an incoming message • The first receive will match before the second receive
[Diagram: P0 issues send(1); send(1) while P1 posts recv(0); recv(0), or recv(0); recv(*), or recv(0); irecv(*) — in each case the first send matches the first receive]
Happens-before is the basis of causality tracking • e1 happens-before (→) e2 iff: • e1 occurs before e2 in the same process, or • e1 is the sending of a message m and e2 receives it, or • e1 → e3 and e3 → e2
[Diagram: events a–f across three processes with happens-before edges a → c, a → d, a → f, b → d, b → f, c → f]
Tracking causality with Lamport clocks • Each process keeps a clock (an integer) • Increases the clock when it has an event • Attaches the clock to outgoing messages (piggyback) • Upon receiving a piggybacked clock, updates its clock to a value at least its current value and strictly greater than the piggybacked clock
[Diagram: events a–f labeled with Lamport clock values 1, 2, 5, 2, 3, 4]
If e1 → e2 then the clock of e1 is less than the clock of e2. What about e and d? The converse does not hold!
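As a hedged illustration of these rules (the lamport_* names are ours, not the dissertation's code), the update logic is just:

/* Illustrative Lamport clock update rules for one process. */
static int clock_value = 0;

int lamport_on_local_event(void) {
    return ++clock_value;                    /* tick on every event */
}

int lamport_on_send(void) {
    return ++clock_value;                    /* the ticked value is piggybacked */
}

void lamport_on_receive(int piggybacked) {
    /* new clock: at least the current clock, strictly above the piggyback */
    int base = (clock_value > piggybacked) ? clock_value : piggybacked;
    clock_value = base + 1;
}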
Tracking causality with vector clocks • Each process keeps a vector of clocks (VC) • Increases its own clock component when it has an event • Attaches the VC to outgoing messages (piggyback) • Upon receiving a piggybacked VC, updates each component to the maximum of the current VC and the piggybacked VC
[Diagram: events a–f labeled with vector clocks such as 1,0,0; 2,0,0; 1,1,0; 1,2,0; 1,2,1; 2,2,2]
e1 → e2 iff the VC of e1 is less than the VC of e2. What about e and d? They are concurrent!
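Likewise, a minimal sketch of the vector clock rules (fixed N and vc_* names are ours, for brevity):

/* Illustrative vector clock rules for N processes; 'me' is this rank. */
#define N 3
static int vc[N];

void vc_on_local_event(int me) {
    vc[me]++;                                 /* tick own component */
}

void vc_on_send(int me, int piggyback_out[N]) {
    vc[me]++;
    for (int i = 0; i < N; i++)
        piggyback_out[i] = vc[i];             /* a copy of the VC travels with the message */
}

void vc_on_receive(int me, const int piggyback_in[N]) {
    for (int i = 0; i < N; i++)               /* component-wise maximum */
        if (piggyback_in[i] > vc[i])
            vc[i] = piggyback_in[i];
    vc[me]++;                                 /* receiving is itself an event */
}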
Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
The necessity for matches-before • The notion of when a call “happens” does not mean much • Local ordering does not always hold • For example, P0: send(1); send(2); • The notion of completion also does not work
P1 --- irecv(*,h); recv(*); wait(h)
P0 --- send(1); send(1)
The irecv(*) happens before the recv(*) but completes after it
Possible states of an MPI call • All possible states of an MPI call: • Issued • Matched • Completed • Returned • It is always possible to know exactly which state a call is in • Except for the Matched state, which has a matching window
P1 --- isend(0,h1); barrier; send(0); wait(h1)
P2 --- irecv(0,h2); barrier; recv(0); wait(h2)
Definition of matches-before • recv, barrier, and wait match before all the calls that follow them in the same process • Sends and receives have matches-before order according to the non-overtaking rule • Nonblocking calls match before their waits • Matches-before is irreflexive, asymmetric and transitive
[Diagram: matches-before edges over send(0), isend(0,h1), barrier, wait(h1) in one process and irecv(0,h2), barrier, recv(0), wait(h2) in another]
The role of match-sets • When a send and a receive match, they form a match-set • When barriers match, they form a match-set • Members of a match-set are not ordered by matches-before among themselves • e ≺ M iff e ≺ (the send call or some barrier call) of M • M ≺ e iff (the receive call or some barrier call) of M ≺ e
[Diagram: events e1, e2, e3 ordered through the match-sets {send, recv} and {barrier, barrier, barrier}]
Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
Previous work: centralized approach [Diagram: a scheduler controls Proc1 … Procn of the MPI program through an interposition layer above the MPI runtime, and reruns the executable to explore other schedules] • Verifies MPI programs for deadlocks, resource leaks, assertion violations • Guarantees coverage over the space of MPI non-determinism • FM 2010, PPoPP 2009, EuroPVM 2010, EuroPVM 2009
Drawbacks of ISP • Scales only up to 32-64 processes • Large runs (1000s of processes) often exhibit bugs that are not triggered at small scale • Index out of bounds • Buffer overflows • MPI implementation bugs • Need a truly in-situ verification method for codes deployed on large-scale clusters! • Verify an MPI program as deployed on a cluster
Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
DAMPI • Distributed Analyzer of MPI Programs • Dynamic verification focusing on coverage over the space of MPI non-determinism • Verification occurs on the actual deployed code • DAMPI’s features: • Detect and enforce alternative outcomes • Scalable • User-configurable coverage
DAMPI Framework [Diagram: the MPI program runs as Proc1 … Procn over the DAMPI PnMPI modules and the MPI runtime; alternate matches and epoch decisions feed a schedule generator, which reruns the executable] DAMPI – Distributed Analyzer for MPI
Main idea in DAMPI: Distributed Causality Tracking • Perform an initial run of the MPI program • Track causalities (discover which alternative non-deterministic matches could have occurred) • Two alternatives: • Use of Vector Clocks (thorough, but non-scalable) • Use Lamport Clocks (our choice) • Omissions possible – but only in unrealistic situations • Scalable!
DAMPI uses Lamport clocks to maintain Matches-Before • Use a Lamport clock to track matches-before • Each process keeps a logical clock • Attaches the clock to each outgoing message • Increases the clock after a nondeterministic receive has matched • Matches-before allows us to infer when irecvs match • Compare incoming clocks to detect potential matches
P0 --- send(1) [0]; barrier [0]
P1 --- irecv(*) [0]; barrier [0]; recv(*) [1]; wait [2]
P2 --- send(1) [0]; barrier [0]
How we use the Matches-Before relationship to detect alternative matches
P0 --- send(1) [0]; barrier [0]
P1 --- irecv(*) [0]; barrier [0]; recv(*) [1]; wait [2]
P2 --- send(1) [0]; barrier [0]
How we use the Matches-Before relationship to detect alternative matches • Wildcard Ordering Lemma: if r is a wildcard receive and r ≺ e, then C(r) < C(e) • Intuition: • The process increases its clock after r matches • If e is in the same process as r, the lemma holds • If e is not in the same process, there must be some match-sets that act as “bridges” so that r ≺ e; piggybacking ensures the lemma holds
[Diagram: events e1, e2, e3 ordered through the match-sets {send, recv} and {barrier, barrier, barrier}]
How we use the Matches-Before relationship to detect alternative matches • If S ≺ R then S ≺ R′ (by transitivity through R ≺ R′), which violates the match-set property • Thus, we have ¬(S ≺ R) • The wildcard ordering lemma gives ¬(R ≺ S) • Thus, S and R are concurrent, and S is a potential alternative match for R
P0 --- send(1) [0]; barrier [0]
P1 --- R = irecv(*) [0]; barrier [0]; R′ = recv(*) [1]; wait [2]
P2 --- S = send(1) [0]; barrier [0]
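A hedged sketch of the resulting clock test (this is our reading of the argument above, not DAMPI's actual code): a send S targeting the receiver is recorded as a potential alternative match for a wildcard receive R whenever its piggybacked clock does not exceed the clock R held when it matched.

/* If R matches-before S, the wildcard ordering lemma forces
   C(R) < pb(S); so pb(S) <= C(R) rules that out, and the match-set
   argument rules out S matches-before R, leaving S concurrent with R. */
typedef struct {
    int piggyback_clock;   /* Lamport clock carried by the send */
    int dest;              /* destination rank of the send      */
} send_record;

/* recv_clock: the receiver's clock when the wildcard receive matched;
   my_rank: the rank that posted the wildcard receive. */
int is_potential_alternative_match(const send_record *s,
                                   int recv_clock, int my_rank) {
    return s->dest == my_rank && s->piggyback_clock <= recv_clock;
}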
Limitations of Lamport Clocks [Diagram: an execution with wildcard receives R1(*), R2(*), R3(*), R(*) on one process (clock values 0–3) and sends S(P2), S(P3) carrying piggybacked clocks pb(0)–pb(3); one of the sends is flagged as a potential match] • Our protocol guarantees that impossible matches will not be forced (there could be deadlocks otherwise)
Lamport Clocks vs Vector Clocks • DAMPI provides two protocols • Lamport clocks: sound and scalable • Vector clocks: sound and complete • We evaluate the scalability and accuracy • Scalability: bandwidth, latency, overhead • Accuracy: omissions • The Lamport Clocks protocol does not have any omissions in practice • MPI applications have well structured communication patterns
Experiments setup • Atlas cluster at LLNL • 1152 nodes, 8 cores per node • 16GB of memory per node • MVAPICH-2 • All experiments run at 8 tasks per node • Results averaged over five runs
DAMPI’s implementation detail: using PnMPI [Diagram: the executable (Proc1 … Procn) runs on top of the DAMPI PnMPI module stack and the MPI runtime; the DAMPI driver feeds alternate matches and epoch decisions to the schedule generator] • DAMPI PnMPI modules: • Core module • Piggyback module • Communicator module • Type module • Status module • Request module • Deadlock module • Optional error checking module
Piggyback implementation details • MPI does not provide a built-in mechanism to attach piggyback data to messages • Most common piggyback mechanisms: • Attach the piggyback to the message buffer: easy to use but expensive • Send the piggyback as a separate message: low overhead but has issues with wildcard receives • Use a user-defined datatype to transmit the piggyback: low overhead, but difficult to piggyback on collectives
DAMPI uses a mixed piggyback scheme • Datatype piggyback for point-to-point • Separate-message piggyback for collectives
Wrapper / piggyback layer (pb_buf stores the piggyback):
int MPI_Send(buf, count, user_type, …) { create new datatype D from pb_buf and buf; return PMPI_Send(MPI_BOTTOM, 1, D, …); }
int MPI_Recv(buf, count, user_type, …) { create new datatype D from pb_buf and buf; return PMPI_Recv(MPI_BOTTOM, 1, D, …); }
That is, the wrapper sends/receives (MPI_BOTTOM, 1, D) instead of (buffer, count, user_type), where D covers both the piggyback data and the message data.
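A hedged C sketch of the datatype piggyback wrapper outlined above (pb_buf and the error handling are simplified; DAMPI's real PnMPI module covers more calls and cases):

#include <mpi.h>

static int pb_buf;   /* the piggybacked clock value */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    MPI_Datatype parts[2] = { MPI_INT, type };
    int          lens[2]  = { 1, count };
    MPI_Aint     disps[2];
    MPI_Datatype D;
    int rc;

    MPI_Get_address(&pb_buf, &disps[0]);   /* absolute addresses, so the   */
    MPI_Get_address(buf, &disps[1]);       /* buffer is (MPI_BOTTOM, 1, D) */
    MPI_Type_create_struct(2, lens, disps, parts, &D);
    MPI_Type_commit(&D);

    rc = PMPI_Send(MPI_BOTTOM, 1, D, dest, tag, comm);
    MPI_Type_free(&D);
    return rc;
}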
Experiments • Comparison with ISP: • 64-node cluster of Intel Xeon X5550 (8 cores per node, 2.67 GHz), 24GB RAM per node • All experiments were run with 8 tasks per node • Measuring overhead of DAMPI: • 800-node cluster of AMD Opteron (16 cores per node, 2.3 GHz), 32GB RAM per node • All experiments were run with 16 tasks per node
Results on large applications: SPEC MPI2007 and NAS Parallel Benchmarks • Slowdown is for one interleaving • No replay was necessary