Scalable Formal Dynamic Verification of MPI Programs through Distributed Causality Tracking Dissertation Defense Anh Vo Committee: • Prof. Ganesh Gopalakrishnan (co-advisor), Prof. Robert M. Kirby (co-advisor), • Dr. Bronis R. de Supinski (LLNL), Prof. Mary Hall and Prof. Matthew Might
Our computational ambitions are endless! • Terascale • Petascale (where we are now) • Exascale • Zettascale • Correctness is important [Images: Jaguar (courtesy ORNL), protein folding (courtesy Wikipedia), computational astrophysics (courtesy LBL)]
Yet we are falling behind when it comes to correctness • Concurrent software debugging is hard • It gets harder as the degree of parallelism in applications increases • Node level: Message Passing Interface (MPI) • Core level: Threads, OpenMP, CUDA • Hybrid programming will be the future • MPI + Threads • MPI + OpenMP • MPI + CUDA • Yet tools are lagging behind! • Many tools cannot operate at scale [Diagram: the growing gap between MPI apps and MPI correctness tools]
We focus on dynamic verification for MPI • Lack of systematic verification tools for MPI • We need to build verification tools for MPI first • Realistic MPI programs run at large scale • Downscaling might mask bugs • MPI tools can be expanded to support hybrid programs
We choose MPI because of its ubiquity • Born 1994 when the world had 600 internet sites, 700 nm lithography, 68 MHz CPUs • Still the dominant API for HPC • Most widely supported and understood • High performance, flexible, portable
Thesis statement Scalable, modular and usable dynamic verification of realistic MPI programs is feasible and novel.
Contributions • Need scalable distributed algorithms to discover alternate schedules • Using only local states and observed matches • Matches-before is necessary • DAMPI • Distributed Analyzer for MPI programs • Implements distributed causality tracking using matches-before • ISP (previous work) • Dynamic verifier for MPI • Implements a scheduler which exerts control to enforce matches-before • Publications: PACT 2011 (in submission), SC 2010, FM 2010, EuroPVM 2010, PPoPP 2009, EuroPVM 2009
Agenda • Motivation and Contributions • Background • MPI ordering based on Matches-Before • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
Traditional testing is wasteful… • Example: deterministic operations are permuted • P1, P2, …, Pn each call MPI_Barrier • Exploring all n! permutations of these barrier calls is wasteful
Testing can also be inconclusive; without non-determinism coverage, we can miss bugs
P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);
Unlucky (bug missed)
Testing can also be inconclusive; without non-determinism coverage, we can miss bugs
P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);
Lucky (bug caught!)
Verification: test all possible scenarios • Find all possible matches for the receive • Ability to track causality is a prerequisite • Replay the program and force the other matches
P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);
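For concreteness, here is a minimal compilable C sketch of the example program above (buffer contents, tags, and the ERROR action are illustrative, not taken from the dissertation). Whether the ERROR branch fires depends entirely on which send the wildcard receive matches, which is exactly the non-determinism a verifier must cover.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0, payload = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* run with at least 3 ranks */

    if (rank == 0) {                         /* P0 */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        payload = 22;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                  /* P1 */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* wildcard receive: may match P0's send (22) or P2's send (33) */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (x == 22)
            fprintf(stderr, "ERROR branch reached\n");  /* the hidden bug */
        else
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else if (rank == 2) {                  /* P2 */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        payload = 33;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}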
Dynamic verification of MPI • Dynamic verification combines the strengths of formal methods and testing • Avoids generating false alarms • Finds bugs with respect to actual binaries • Builds on the familiar approach of “testing” • Guarantees coverage over nondeterminism
Overview of Message Passing Interface (MPI) • An API specification for communication protocols between processes • Allows developers to write high performance and portable parallel code • Rich in features • Synchronous: easy to use and understand • Asynchronous: high performance • Nondeterministic constructs: reduce code complexity
MPI operations
send(P,T) / send(P) – send a message (with tag T) to process P: MPI_Send(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
recv(P,T) – receive a message with tag T from process P; recv(*,T) – receive a message with tag T from any process; recv(*,*) – receive a message with any tag from any process: MPI_Recv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status status)
isend(P,T,h) – nonblocking send, communication handle h: MPI_Isend(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request h)
irecv(P,T,h), irecv(*,T,h), irecv(*,*,h) – nonblocking receive, communication handle h: MPI_Irecv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Request h)
wait(h) – wait for the completion of h: MPI_Wait(MPI_Request h, MPI_Status status)
barrier – synchronization: MPI_Barrier(MPI_Comm comm)
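To tie the shorthand to real calls, here is a small hedged C sketch (not from the slides): isend and irecv return a request handle h that a later wait(h) completes, and barrier synchronizes all ranks in the communicator.

#include <mpi.h>

void example(int peer, int tag, MPI_Comm comm) {
    int out = 42, in = 0;
    MPI_Request h_send, h_recv;

    /* isend(peer, tag, h_send) and irecv(*, tag, h_recv) */
    MPI_Isend(&out, 1, MPI_INT, peer, tag, comm, &h_send);
    MPI_Irecv(&in, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &h_recv);

    MPI_Barrier(comm);                        /* barrier */

    MPI_Wait(&h_send, MPI_STATUS_IGNORE);     /* wait(h_send) */
    MPI_Wait(&h_recv, MPI_STATUS_IGNORE);     /* wait(h_recv) */
}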
Nonovertaking rule facilitates message matching • Sender sends two messages: • Both can match a receive operation • The first message will match before the second message • Receiver posts two receives: • Both can match an incoming message • The first receive will match before the second receive
[Diagram: P0 issues send(1); send(1) while P1 posts recv(0); recv(0), or recv(0); recv(*), or recv(0); irecv(*) — in each case the first send matches the first receive]
Happens-before is the basis of causality tracking • e1 happens-before (→) e2 iff: • e1 occurs before e2 in the same process, or • e1 is the sending of a message m and e2 receives it, or • e1 → e3 and e3 → e2
[Diagram: events a–f across three processes with happens-before edges a → c, a → d, a → f, b → d, b → f, c → f]
Tracking causality with Lamport clocks • Each process keeps a clock (an integer) • Increases the clock when it has an event • Attaches the clock to outgoing messages (piggyback) • Upon receiving a piggybacked clock, updates its clock to a value at least its current value and strictly greater than the piggybacked clock
[Diagram: events a–f labeled with Lamport clock values 1, 2, 5, 2, 3, 4]
If e1 → e2 then the clock of e1 is less than the clock of e2. What about e and d? The converse does not hold!
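As a hedged illustration of these rules (the lamport_* names are ours, not the dissertation's code), the update logic is just:

/* Illustrative Lamport clock update rules for one process. */
static int clock_value = 0;

int lamport_on_local_event(void) {
    return ++clock_value;                    /* tick on every event */
}

int lamport_on_send(void) {
    return ++clock_value;                    /* the ticked value is piggybacked */
}

void lamport_on_receive(int piggybacked) {
    /* new clock: at least the current clock, strictly above the piggyback */
    int base = (clock_value > piggybacked) ? clock_value : piggybacked;
    clock_value = base + 1;
}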
Tracking causality with vector clocks • Each process keeps a vector of clocks (VC) • Increases its own clock component when it has an event • Attaches the VC to outgoing messages (piggyback) • Upon receiving a piggybacked VC, updates each component to the maximum of the current VC and the piggybacked VC
[Diagram: events a–f labeled with vector clocks such as 1,0,0; 2,0,0; 1,1,0; 1,2,0; 1,2,1; 2,2,2]
e1 → e2 iff the VC of e1 is less than the VC of e2. What about e and d? They are concurrent!
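Likewise, a minimal sketch of the vector clock rules (fixed N and vc_* names are ours, for brevity):

/* Illustrative vector clock rules for N processes; 'me' is this rank. */
#define N 3
static int vc[N];

void vc_on_local_event(int me) {
    vc[me]++;                                 /* tick own component */
}

void vc_on_send(int me, int piggyback_out[N]) {
    vc[me]++;
    for (int i = 0; i < N; i++)
        piggyback_out[i] = vc[i];             /* a copy of the VC travels with the message */
}

void vc_on_receive(int me, const int piggyback_in[N]) {
    for (int i = 0; i < N; i++)               /* component-wise maximum */
        if (piggyback_in[i] > vc[i])
            vc[i] = piggyback_in[i];
    vc[me]++;                                 /* receiving is itself an event */
}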
Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
The necessity for matches-before • The notion of when a call “happens” does not mean much • Local ordering does not always hold • For example, P0: send(1); send(2); • The notion of completion also does not work
P1 --- irecv(*,h); recv(*); wait(h)
P0 --- send(1); send(1)
The irecv(*) happens before the recv(*) but completes after it
Possible states of an MPI call • All possible states of an MPI call: • Issued • Matched • Completed • Returned • It is always possible to know exactly which state a call is in • Except for the Matched state, which has a matching window
P1 --- isend(0,h1); barrier; send(0); wait(h1)
P2 --- irecv(0,h2); barrier; recv(0); wait(h2)
Definition of matches-before • recv, barrier, and wait match before all the calls that follow them in the same process • Sends and receives have matches-before order according to the non-overtaking rule • Nonblocking calls match before their waits • Matches-before is irreflexive, asymmetric and transitive
[Diagram: matches-before edges over send(0), isend(0,h1), barrier, wait(h1) in one process and irecv(0,h2), barrier, recv(0), wait(h2) in another]
The role of match-sets • When a send and a receive match, they form a match-set • When barriers match, they form a match-set • Members of a match-set are not ordered by matches-before among themselves • e ≺ M iff e ≺ (the send call or some barrier call) of M • M ≺ e iff (the receive call or some barrier call) of M ≺ e
[Diagram: events e1, e2, e3 ordered through the match-sets {send, recv} and {barrier, barrier, barrier}]
Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
Previous work: centralized approach [Diagram: a scheduler controls Proc1 … Procn of the MPI program through an interposition layer above the MPI runtime, and reruns the executable to explore other schedules] • Verifies MPI programs for deadlocks, resource leaks, assertion violations • Guarantees coverage over the space of MPI non-determinism • FM 2010, PPoPP 2009, EuroPVM 2010, EuroPVM 2009
Drawbacks of ISP • Scales only up to 32-64 processes • Large runs (1000s of processes) often exhibit bugs that are not triggered at small scale • Index out of bounds • Buffer overflows • MPI implementation bugs • Need a truly in-situ verification method for codes deployed on large-scale clusters! • Verify an MPI program as deployed on a cluster
Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions
DAMPI • Distributed Analyzer of MPI Programs • Dynamic verification focusing on coverage over the space of MPI non-determinism • Verification occurs on the actual deployed code • DAMPI’s features: • Detect and enforce alternative outcomes • Scalable • User-configurable coverage
DAMPI Framework [Diagram: the MPI program runs as Proc1 … Procn over the DAMPI PnMPI modules and the MPI runtime; alternate matches and epoch decisions feed a schedule generator, which reruns the executable] DAMPI – Distributed Analyzer for MPI
Main idea in DAMPI: Distributed Causality Tracking • Perform an initial run of the MPI program • Track causalities (discover which alternative non-deterministic matches could have occurred) • Two alternatives: • Use of Vector Clocks (thorough, but non-scalable) • Use Lamport Clocks (our choice) • Omissions possible – but only in unrealistic situations • Scalable!
DAMPI uses Lamport clocks to maintain Matches-Before • Use a Lamport clock to track matches-before • Each process keeps a logical clock • Attaches the clock to each outgoing message • Increases the clock after a nondeterministic receive has matched • Matches-before allows us to infer when irecvs match • Compare incoming clocks to detect potential matches
P0 --- send(1) [0]; barrier [0]
P1 --- irecv(*) [0]; barrier [0]; recv(*) [1]; wait [2]
P2 --- send(1) [0]; barrier [0]
How we use the Matches-Before relationship to detect alternative matches
P0 --- send(1) [0]; barrier [0]
P1 --- irecv(*) [0]; barrier [0]; recv(*) [1]; wait [2]
P2 --- send(1) [0]; barrier [0]
How we use the Matches-Before relationship to detect alternative matches • Wildcard Ordering Lemma: if r is a wildcard receive and r ≺ e, then C(r) < C(e) • Intuition: • The process increases its clock after r matches • If e is in the same process as r, the lemma holds • If e is not in the same process, there must be some match-sets that act as “bridges” so that r ≺ e; piggybacking ensures the lemma holds
[Diagram: events e1, e2, e3 ordered through the match-sets {send, recv} and {barrier, barrier, barrier}]
How we use the Matches-Before relationship to detect alternative matches • If S ≺ R then S ≺ R′ (by transitivity through R ≺ R′), which violates the match-set property • Thus, we have ¬(S ≺ R) • The wildcard ordering lemma gives ¬(R ≺ S) • Thus, S and R are concurrent, and S is a potential alternative match for R
P0 --- send(1) [0]; barrier [0]
P1 --- R = irecv(*) [0]; barrier [0]; R′ = recv(*) [1]; wait [2]
P2 --- S = send(1) [0]; barrier [0]
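A hedged sketch of the resulting clock test (this is our reading of the argument above, not DAMPI's actual code): a send S targeting the receiver is recorded as a potential alternative match for a wildcard receive R whenever its piggybacked clock does not exceed the clock R held when it matched.

/* If R matches-before S, the wildcard ordering lemma forces
   C(R) < pb(S); so pb(S) <= C(R) rules that out, and the match-set
   argument rules out S matches-before R, leaving S concurrent with R. */
typedef struct {
    int piggyback_clock;   /* Lamport clock carried by the send */
    int dest;              /* destination rank of the send      */
} send_record;

/* recv_clock: the receiver's clock when the wildcard receive matched;
   my_rank: the rank that posted the wildcard receive. */
int is_potential_alternative_match(const send_record *s,
                                   int recv_clock, int my_rank) {
    return s->dest == my_rank && s->piggyback_clock <= recv_clock;
}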
Limitations of Lamport Clocks [Diagram: an execution with wildcard receives R1(*), R2(*), R3(*), R(*) on one process (clock values 0–3) and sends S(P2), S(P3) carrying piggybacked clocks pb(0)–pb(3); one of the sends is flagged as a potential match] • Our protocol guarantees that impossible matches will not be forced (there could be deadlocks otherwise)
Lamport Clocks vs Vector Clocks • DAMPI provides two protocols • Lamport clocks: sound and scalable • Vector clocks: sound and complete • We evaluate the scalability and accuracy • Scalability: bandwidth, latency, overhead • Accuracy: omissions • The Lamport Clocks protocol does not have any omissions in practice • MPI applications have well structured communication patterns
Experiments setup • Atlas cluster at LLNL • 1152 nodes, 8 cores per node • 16GB of memory per node • MVAPICH-2 • All experiments run at 8 tasks per node • Results averaged over five runs
DAMPI’s implementation detail: using PnMPI [Diagram: the executable (Proc1 … Procn) runs on top of the DAMPI PnMPI module stack and the MPI runtime; the DAMPI driver feeds alternate matches and epoch decisions to the schedule generator] • DAMPI PnMPI modules: • Core module • Piggyback module • Communicator module • Type module • Status module • Request module • Deadlock module • Optional error checking module
Piggyback implementation details • MPI does not provide a built-in mechanism to attach piggyback data to messages • Most common piggyback mechanisms: • Attach the piggyback to the message buffer: easy to use but expensive • Send the piggyback as a separate message: low overhead but has issues with wildcard receives • Use a user-defined datatype to transmit the piggyback: low overhead, but difficult to piggyback on collectives
DAMPI uses a mixed piggyback scheme • Datatype piggyback for point-to-point • Separate-message piggyback for collectives
Wrapper / piggyback layer (pb_buf stores the piggyback):
int MPI_Send(buf, count, user_type, …) { create new datatype D from pb_buf and buf; return PMPI_Send(MPI_BOTTOM, 1, D, …); }
int MPI_Recv(buf, count, user_type, …) { create new datatype D from pb_buf and buf; return PMPI_Recv(MPI_BOTTOM, 1, D, …); }
That is, the wrapper sends/receives (MPI_BOTTOM, 1, D) instead of (buffer, count, user_type), where D covers both the piggyback data and the message data.
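A hedged C sketch of the datatype piggyback wrapper outlined above (pb_buf and the error handling are simplified; DAMPI's real PnMPI module covers more calls and cases):

#include <mpi.h>

static int pb_buf;   /* the piggybacked clock value */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    MPI_Datatype parts[2] = { MPI_INT, type };
    int          lens[2]  = { 1, count };
    MPI_Aint     disps[2];
    MPI_Datatype D;
    int rc;

    MPI_Get_address(&pb_buf, &disps[0]);   /* absolute addresses, so the   */
    MPI_Get_address(buf, &disps[1]);       /* buffer is (MPI_BOTTOM, 1, D) */
    MPI_Type_create_struct(2, lens, disps, parts, &D);
    MPI_Type_commit(&D);

    rc = PMPI_Send(MPI_BOTTOM, 1, D, dest, tag, comm);
    MPI_Type_free(&D);
    return rc;
}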
Experiments • Comparison with ISP: • 64-node cluster of Intel Xeon X5550 (8 cores per node, 2.67 GHz), 24GB RAM per node • All experiments were run with 8 tasks per node • Measuring overhead of DAMPI: • 800-node cluster of AMD Opteron (16 cores per node, 2.3 GHz), 32GB RAM per node • All experiments were run with 16 tasks per node
Results on large applications: SPEC MPI2007 and NAS Parallel Benchmarks • Slowdown is for one interleaving • No replay was necessary