
Scalable Formal Dynamic Verification of MPI Programs through Distributed Causality Tracking

Scalable Formal Dynamic Verification of MPI Programs through Distributed Causality Tracking. Dissertation Defense Anh Vo. Committee: Prof. Ganesh Gopalakrishnan (co-advisor), Prof. Robert M. Kirby (co-advisor),


Presentation Transcript


  1. Scalable Formal Dynamic Verification of MPI Programs through Distributed Causality Tracking • Dissertation Defense • Anh Vo • Committee: Prof. Ganesh Gopalakrishnan (co-advisor), Prof. Robert M. Kirby (co-advisor), Dr. Bronis R. de Supinski (LLNL), Prof. Mary Hall and Prof. Matthew Might

  2. Our computational ambitions are endless! • Terascale • Petascale (where we are now) • Exascale • Zettascale • Correctness is important • Image credits: Jaguar, courtesy ORNL; Protein Folding, courtesy Wikipedia; Computational Astrophysics, courtesy LBL

  3. Yet we are falling behind when it comes to correctness • Concurrent software debugging is hard • It gets harder as the degree of parallelism in applications increases • Node level: Message Passing Interface (MPI) • Core level: Threads, OpenMP, CUDA • Hybrid programming will be the future • MPI + Threads • MPI + OpenMP • MPI + CUDA • Yet tools are lagging behind! • Many tools cannot operate at scale MPI Apps MPI Correctness Tools

  4. We focus on dynamic verification for MPI • Lack of systematic verification tools for MPI • We need to build verification tools for MPI first • Realistic MPI programs run at large scale • Downscaling might mask bugs • MPI tools can be expanded to support hybrid programs

  5. We choose MPI because of its ubiquity • Born 1994 when the world had 600 internet sites, 700 nm lithography, 68 MHz CPUs • Still the dominant API for HPC • Most widely supported and understood • High performance, flexible, portable

  6. Thesis statement Scalable, modular and usable dynamic verification of realistic MPI programs is feasible and novel.

  7. Contributions • Need scalable distributed algorithms to discover alternate schedules • Using only local states and observed matches • Matches-before is necessary • DAMPI • Distributed Analyzer for MPI programs • Implements distributed causality tracking using matches-before • ISP (previous work) • Dynamic verifier for MPI • Implements a scheduler which exerts control to enforce matches-before • Publications: PACT 2011 (in submission), SC 2010, FM 2010, EuroPVM 2010, PPoPP 2009, EuroPVM 2009

  8. Agenda • Motivation and Contributions • Background • MPI ordering based on Matches-Before • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions

  9. Traditional testing is wasteful…. • Example: Deterministic operations are permuted P1 P2 … Pn MPI_Barrier MPI_Barrier … MPI_Barrier Exploring all n! permutations is wasteful

  10. Testing can also be inconclusive; without non-determinism coverage, we can miss bugs P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33); Unlucky (bug missed)

  11. Testing can also be inconclusive; without non-determinism coverage, we can miss bugs P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33); Lucky (bug caught!)

  12. Verification: test all possible scenarios • Find all possible matches for the receive • Ability to track causality is a prerequisite • Replay the program and force the other matches P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);
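
The replay idea on slide 12 can be illustrated with a small simulation. This is a minimal sketch in plain Python rather than MPI: `pending` and `run_p1` are hypothetical names, and the model assumes that each sender's second message is still unmatched when P1 posts its first wildcard receive.

```python
# Hypothetical model of slides 10-12 (not real MPI): payload of the one
# send from each process that is still unmatched at P1's wildcard receive.
pending = {"P0": 22, "P2": 33}

def run_p1(first_match):
    """Replay P1 with its first wildcard receive forced to match `first_match`."""
    remaining = dict(pending)
    x = remaining.pop(first_match)      # first MPI_Recv(*, x)
    if x == 22:
        return "ERROR"                  # the bug from the slide
    _, x = remaining.popitem()          # else-branch: second MPI_Recv(*, x)
    return "ok"

# Force every possible match instead of relying on a lucky schedule.
outcomes = {sender: run_p1(sender) for sender in pending}
print(outcomes)  # {'P0': 'ERROR', 'P2': 'ok'}
```

A single test run observes only one of the two entries; covering both is what exposes the error deterministically.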

  13. Dynamic verification of MPI • Dynamic verification combines the strengths of formal methods and testing • Avoids generating false alarms • Finds bugs with respect to actual binaries • Builds on the familiar approach of “testing” • Guarantees coverage over nondeterminism

  14. Overview of Message Passing Interface (MPI) • An API specification for communication protocols between processes • Allows developers to write high performance and portable parallel code • Rich in features • Synchronous: easy to use and understand • Asynchronous: high performance • Nondeterministic constructs: reduce code complexity

  15. MPI operations • send(P,T), send(P) - send a message with tag T to process P: MPI_Send(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm) • recv(P,T) - receive a message with tag T from process P; recv(*,T) - receive a message with tag T from any process; recv(*,*) - receive a message with any tag from any process: MPI_Recv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status* status) • isend(P,T,h) - nonblocking send, communication handle h: MPI_Isend(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request* h) • irecv(P,T,h), irecv(*,T,h), irecv(*,*,h) - nonblocking recv, communication handle h: MPI_Irecv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Request* h) • wait(h) - wait for the completion of h: MPI_Wait(MPI_Request* h, MPI_Status* status) • barrier - synchronization: MPI_Barrier(MPI_Comm comm)

  16. Nonovertaking rule facilitates message matching • Sender sends two messages: • Both can match to a receive operation • First message will match before second message • Receiver posts two receives: • Both can match with an incoming message • First receive will match before second receive send(1) send(1) send(1) send(1) send(1) send(1) P0 P0 P0 P1 P1 P1 recv(0) recv(0) recv(0) recv(0) recv(*) irecv(*)
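
The effect of the non-overtaking rule on matching can be sketched with one FIFO queue per sender at the receiving process. This is an illustrative model only: the `Mailbox` class and its methods are hypothetical names, not part of MPI, and tags and communicators are ignored.

```python
from collections import deque

class Mailbox:
    """Pending messages at one receiver, kept in send order per source."""

    def __init__(self):
        self.queues = {}  # source rank -> deque of messages

    def deliver(self, src, msg):
        self.queues.setdefault(src, deque()).append(msg)

    def candidates(self, src=None):
        """Messages a receive may match: only the oldest one per source.

        src=None models a wildcard receive, recv(*).
        """
        sources = self.queues if src is None else [src]
        return {s: self.queues[s][0] for s in sources if self.queues.get(s)}

    def recv(self, src):
        return self.queues[src].popleft()

box = Mailbox()
box.deliver(0, "m1")
box.deliver(0, "m2")   # cannot overtake m1
box.deliver(2, "n1")
wildcard_choices = box.candidates()   # {0: 'm1', 2: 'n1'}
```

Under this model a wildcard receive has at most one candidate per sender, which is why the set of alternative matches stays small enough to enumerate.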

  17. Happens-before is the basis of causality tracking • e1 happens-before (→) e2 iff: • e1 occurs before e2 in the same process • e1 is the sending of a message m and e2 receives it • e1 → e3 and e3 → e2 • Example with events a, e, b, c, d, f: a → e, b → c, d → f, a → c, a → d, a → f, b → d, b → f, c → f, a → b, c → d, e → f
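
The three rules above amount to taking the transitive closure of program-order and message edges. The sketch below assumes that a and e, b and c, and d and f are consecutive events of three separate processes, and that the messages are a to b, c to d, and e to f; that assignment is an inference from the clock values on the following slides, not something the slide states.

```python
from itertools import product

# Assumed event layout (inferred, see lead-in).
program_order = {("a", "e"), ("b", "c"), ("d", "f")}
messages = {("a", "b"), ("c", "d"), ("e", "f")}

def transitive_closure(edges):
    """Smallest transitive relation containing `edges`."""
    closure = set(edges)
    while True:
        new = {(x, w) for (x, y), (z, w) in product(closure, closure) if y == z}
        if new <= closure:
            return closure
        closure |= new

hb = transitive_closure(program_order | messages)

def concurrent(x, y):
    """Two events are concurrent when neither happens-before the other."""
    return (x, y) not in hb and (y, x) not in hb

print(len(hb))               # 12
print(concurrent("e", "d"))  # True
```

Under these assumptions the closure contains exactly twelve ordered pairs, and e and d remain unordered.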

  18. Tracking causality with Lamport clocks • Each process keeps a clock (an integer) • Increase the clock when it has an event • Attach the clock to outgoing messages (piggyback) • Upon receiving a piggybacked clock, update the local clock to a value that is greater than or equal to its current value and strictly greater than the piggybacked clock • Example clocks: a = 1, e = 2, b = 2, c = 3, d = 4, f = 5 • If e1 → e2 then the clock of e1 is less than the clock of e2 • What about e and d? The converse does not hold!
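
A minimal simulation of these rules follows. It assumes the same event layout as the happens-before example (a, e on P0; b, c on P1; d, f on P2; messages a to b, c to d, e to f), which is inferred from the slide's clock values, and the helper names `send_event` and `recv_event` are made up for illustration.

```python
clock = {"P0": 0, "P1": 0, "P2": 0}
lamport = {}  # event name -> Lamport timestamp

def send_event(proc, name):
    """Tick the local clock; the returned value is piggybacked on the message."""
    clock[proc] += 1
    lamport[name] = clock[proc]
    return clock[proc]

def recv_event(proc, name, piggyback):
    """New clock is >= the old local clock and strictly > the piggybacked one."""
    clock[proc] = max(clock[proc], piggyback) + 1
    lamport[name] = clock[proc]

pb_a = send_event("P0", "a")
recv_event("P1", "b", pb_a)
pb_e = send_event("P0", "e")
pb_c = send_event("P1", "c")
recv_event("P2", "d", pb_c)
recv_event("P2", "f", pb_e)

print(lamport)  # {'a': 1, 'b': 2, 'e': 2, 'c': 3, 'd': 4, 'f': 5}
```

Here e has clock 2 and d has clock 4 even though e does not happen-before d, so a smaller clock alone does not establish causality.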

  19. Tracking causality with vector clocks • Each process keeps a vector of clocks (VC) • Increase its clock component when it has an event • Attach the VC to outgoing messages (piggyback) • Upon receiving a piggybacked VC, update each component to the maximum of the current VC and the piggybacked VC • Example VCs: a = (1,0,0), e = (2,0,0), b = (1,1,0), c = (1,2,0), d = (1,2,1), f = (2,2,2) • e1 → e2 iff the VC of e1 is less than the VC of e2 • What about e and d? They are concurrent!
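
The same scenario can be replayed with vector clocks. As before, the event layout (a, e on process 0; b, c on process 1; d, f on process 2) is an inference from the slide's values, and `vc_event` and `vc_less` are hypothetical helper names.

```python
N = 3
vc = [[0] * N for _ in range(N)]  # one vector clock per process
stamps = {}                       # event name -> vector timestamp

def vc_event(p, name, piggyback=None):
    """Merge a piggybacked VC (if any), then tick process p's own component."""
    if piggyback is not None:
        vc[p] = [max(x, y) for x, y in zip(vc[p], piggyback)]
    vc[p][p] += 1
    stamps[name] = tuple(vc[p])
    return stamps[name]

def vc_less(u, v):
    """u < v: every component is <= and the vectors differ."""
    return u != v and all(x <= y for x, y in zip(u, v))

a = vc_event(0, "a")
b = vc_event(1, "b", a)
e = vc_event(0, "e")
c = vc_event(1, "c")
d = vc_event(2, "d", c)
f = vc_event(2, "f", e)

print(d, e)                          # (1, 2, 1) (2, 0, 0)
print(vc_less(e, d), vc_less(d, e))  # False False
```

Neither vector dominates the other for e and d, so vector clocks identify them as concurrent, at the cost of carrying one integer per process on every message.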

  20. Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions

  21. The necessity for matches-before • The notion of happening does not mean much • Local ordering does not always hold • For example, P0: send(1); send(2); • The notion of completion also does not work P1 --- irecv (*,h) recv (*) wait(h) P0 --- send (1) send (1) The irecv(*) happens before the recv(*) but completes after it

  22. Possible states of an MPI call • All possible states of an MPI call • Issued • Matched • Completed • Returned • It’s always possible to know exactly which state the call is in • Except for the Matched state, which has a matching window P1 isend(0,h1) barrier send(0) wait(h1) P2 irecv(0,h2) barrier recv(0) wait(h2)

  23. Definition of matches-before • recv, barrier, and wait match before all the calls following them • sends and receives have matches-before order according to the non-overtaking rule • Nonblocking calls match before their waits • Matches-before is irreflexive, asymmetric and transitive send(0) isend(0,h1) barrier wait(h1) barrier recv(0) wait(h2) irecv(0,h2)

  24. The role of match-sets • When a send and a receive match, they form a match-set • When barriers match, they form a match-set • Members of a match-set have no matches-before order among themselves • e matches-before M iff e matches-before the send call or some barrier call of M • M matches-before e iff the receive call or some barrier call of M matches-before e send barrier e1 barrier recv e2 e3 barrier

  25. Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions

  26. Previous work: centralized approach Executable Proc1 Proc2 …… Procn Scheduler Run MPI Program Interposition Layer MPI Runtime • Verifies MPI programs for deadlocks, resource leaks, assertion violations • Guarantees coverage over the space of MPI non-determinism • FM 2010, PPoPP 2009, EuroPVM 2010, EuroPVM 2009

  27. Drawbacks of ISP • Scales only up to 32-64 processes • Large programs (of 1000s of processes) often exhibit bugs that are not triggered at small scale • Index out of bounds • Buffer overflows • MPI implementation bugs • Need a truly in-situ verification method for codes deployed on large-scale clusters! • Verify an MPI program as deployed on a cluster

  28. Agenda • Motivation and Contributions • Background • Matches-Before for MPI • The centralized approach: ISP • The distributed approach: DAMPI • Conclusions

  29. DAMPI • Distributed Analyzer of MPI Programs • Dynamic verification focusing on coverage over the space of MPI non-determinism • Verification occurs on the actual deployed code • DAMPI’s features: • Detect and enforce alternative outcomes • Scalable • User-configurable coverage

  30. DAMPI Framework Alternate Matches Executable Proc1 Proc2 …… Procn MPI Program Epoch Decisions Schedule Generator Rerun DAMPI - PnMPI modules DAMPI – Distributed Analyzer for MPI MPI runtime

  31. Main idea in DAMPI: Distributed Causality Tracking • Perform an initial run of the MPI program • Track causalities (discover which alternative non-deterministic matches could have occurred) • Two alternatives: • Use of Vector Clocks (thorough, but non-scalable) • Use Lamport Clocks (our choice) • Omissions possible – but only in unrealistic situations • Scalable!

  32. DAMPI uses Lamport clocks to maintain Matches-Before • Use Lamport clock to track Matches-Before • Each process keeps a logical clock • Attach clock to each outgoing message • Increases it after a nondeterministic receive has matched • Mb allows us to infer when irecv’s match • Compare incoming clock to detect potential matches barrier [0] P0 send(1) [0] barrier [0] wait [2] P1 irecv(*) [0] recv(*) [1] P2 send(1) [0] barrier [0]

  33. How we use Matches-Before relationship to detect alternative matches barrier [0] P0 send(1) [0] barrier [0] wait [2] P1 irecv(*) [0] recv(*) [1] P2 send(1) [0] barrier [0]

  34. How we use Matches-Before relationship to detect alternative matches • Wildcard Ordering Lemma: If r is a wildcard receive and r matches-before e, then C(r) < C(e) • Intuition: • The process increases the clock after r matches • If e is in the same process as r, the lemma holds • If e is not in the same process, there must be some match-sets that act as “bridges” so that r matches-before e. Piggybacking ensures the lemma holds. e1 send barrier barrier e2 recv e3 barrier

  35. How we use Matches-Before relationship to detect alternative matches • If S matches-before R, then S matches-before R’, which violates the match-set property • Thus, S does not match before R • The wildcard ordering lemma gives: R does not match before S • Thus, S and R are concurrent barrier [0] P0 send(1) [0] barrier [0] wait [2] P1 R = irecv(*) [0] R’= recv(*) [1] P2 S = send(1) [0] barrier [0]
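
The clock comparison behind this argument can be sketched as follows. This is only an illustration of the contrapositive of the wildcard ordering lemma, using the clock values shown on the slide; `may_match` is a hypothetical name and the check is not DAMPI's actual implementation. It covers one direction only (R does not match before S); the other direction comes from the match-set property.

```python
def may_match(recv_clock: int, send_piggyback: int) -> bool:
    """True if the lemma cannot rule the send out as a match for the receive.

    Lemma: R matches-before S implies C(R) < C(S).
    Contrapositive: C(S) <= C(R) implies R does not match before S.
    """
    return send_piggyback <= recv_clock

# Clock values from the slide: R = irecv(*) [0], R' = recv(*) [1], S = send(1) [0].
C_R, C_R_PRIME, C_S = 0, 1, 0

alt_for_r = may_match(C_R, C_S)   # True: S is a potential alternative match for R

# Assumption for illustration: a send whose process already learned of R's
# match through piggybacked clocks would carry a clock of at least 1.
late_send = may_match(C_R, 1)     # False: such a send is ordered after R
```

Because the comparison needs only one integer per message, the piggyback payload stays constant in size as the number of processes grows.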

  36. Limitations of Lamport Clock • [Figure: timelines of P0 to P3 with sends S(P2) and S(P3), wildcard receives R1(*), R2(*), R3(*) and R(*), and piggybacked clocks pb(0) and pb(3); one send is annotated “This send is a potential match”] • Our protocol guarantees that impossible matches will not be forced (there could be deadlocks otherwise)

  37. Lamport Clocks vs Vector Clocks • DAMPI provides two protocols • Lamport clocks: sound and scalable • Vector clocks: sound and complete • We evaluate the scalability and accuracy • Scalability: bandwidth, latency, overhead • Accuracy: omissions • The Lamport Clocks protocol does not have any omissions in practice • MPI applications have well structured communication patterns

  38. Experimental setup • Atlas cluster at LLNL • 1152 nodes, 8 cores per node • 16 GB of memory per node • MVAPICH-2 • All experiments run at 8 tasks per node • Results averaged over five runs

  39. Latency Impact

  40. Bandwidth Impact

  41. Application overhead – ParMETIS

  42. Application overhead – AMG 2006

  43. Application overhead – SMG2000

  44. DAMPI’s Implementation Detail: using PnMPI Alternate Matches Executable Proc1 Proc2 …… Procn MPI Program Epoch Decisions Schedule Generator DAMPI driver DAMPI - PnMPI modules DAMPI - PnMPI modules Communicator module MPI runtime Deadlock module Type module Piggyback module Core Module Status module Request module Optional Error Checking Module

  45. Piggyback implementation details • MPI does not provide a built-in mechanism to attach piggyback data to messages • Most common piggyback mechanisms • Attach piggyback to the buffer: • easy to use but expensive • Send piggyback as a separate message: • low overhead but has issues with wildcard receives • Using user-defined datatype to transmit piggyback • low overhead, difficult to piggyback on collectives

  46. DAMPI uses a mixed piggyback scheme • Datatype piggyback for point-to-point • Separate message piggyback for collectives • Wrapper (piggyback layer), where pb_buf stores the piggyback: int MPI_Send(buf,count,user_type,…) { Create new datatype D from pb_buf and buf; return PMPI_Send(MPI_BOTTOM,1,D,…); } int MPI_Recv(buf,count,user_type,…) { Create new datatype D from pb_buf and buf; return PMPI_Recv(MPI_BOTTOM,1,D,…); } • Sending/receiving (MPI_BOTTOM,1,D) instead of (buffer,count,user_type); datatype D covers both the piggyback and the message data

  47. Experiments • Comparison with ISP: • 64-node cluster of Intel Xeon X5550 (8 cores per node, 2.67 GHz), 24 GB RAM per node • All experiments were run with 8 tasks per node • Measuring overhead of DAMPI: • 800-node cluster of AMD Opteron (16 cores per node, 2.3 GHz), 32 GB RAM per node • All experiments were run with 16 tasks per node

  48. DAMPI maintains very good scalability vs ISP

  49. DAMPI is also faster at processing interleavings

  50. Results on large applications: SpecMPI2007 and NAS-PB • Slowdown is for one interleaving • No replay was necessary
