MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging. Joint work with A. Bouteiller, F. Cappello, G. Krawezik, P. Lemarinier, F. Magniette. Parallelism team, Grand Large Project.

Presentation Transcript


  1. Grand Large. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging. Joint work with A. Bouteiller, F. Cappello, G. Krawezik, P. Lemarinier, F. Magniette. Parallelism team, Grand Large Project. Thomas Hérault, herault@lri.fr, http://www.lri.fr/~herault. 11/18, SC 2003

  2. MPICH-V2 • Computing nodes of clusters are subject to failure • Many applications use MPI as their communication library • Goal: design a fault-tolerant MPI library • MPICH-V1 is a fault-tolerant MPI implementation • It requires many stable components to provide high performance • MPICH-V2 addresses this requirement • And provides higher performance

  3. Outline • Introduction • Architecture • Performances • Perspective & Conclusion

  4. Large Scale Parallel and Distributed systems and node Volatility • Industry and academia are building larger and larger computing facilities for technical computing (research and production). • Platforms with 1000s of nodes are becoming common: Tera Scale Machines (US ASCI, French Tera), Large Scale Clusters (Score III, etc.), Grids, PC-Grids (Seti@home, XtremWeb, Entropia, UD, Boinc) • These large scale systems have frequent failures/disconnections: • The ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job with 4096 procs has less than a 50% chance of terminating. • PC Grid nodes are volatile → disconnections / interruptions are expected to be very frequent (several per hour) • When failures/disconnections cannot be avoided, they become one characteristic of the system, called Volatility • We need a Volatility tolerant Message Passing library
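
As a rough consistency check on the ASCI-Q figures above, assume (this model is not stated on the slide) that failures follow an exponential law whose mean is the full-system MTBF, and that the job runs without checkpoint/restart. The probability that a job of duration $t$ finishes without a failure is then:

```latex
P(\text{no failure during } t) = e^{-t/\mathrm{MTBF}}, \qquad
P < \tfrac{1}{2} \iff \mathrm{MTBF} < \frac{t}{\ln 2} = \frac{5\,\mathrm{h}}{0.693} \approx 7.2\,\mathrm{h}
```

Under that assumption, any MTBF below about 7.2 hours gives a 5-hour job less than a 50% chance of completing, which is consistent with an MTBF of a few hours.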

  5. Programmer's view unchanged: PC client MPI_Send(), PC client MPI_Recv(). Goal: execute existing or new MPI apps. Problems: 1) volatile nodes (any number at any time) 2) non-named receptions (these must be replayed in the same order as in the previous, failed execution). Objective summary: 1) Automatic fault tolerance 2) Transparency for the programmer & user 3) Tolerate n faults (n being the number of MPI processes) 4) Scalable infrastructure/protocols 5) Avoid global synchronizations (ckpt/restart) 6) Theoretical verification of protocols

  6. Related works. A classification of fault tolerant message passing environments considering A) the level in the software stack where fault tolerance is managed (Framework, API, Communication Lib.) and B) the fault tolerance technique (automatic or non automatic; checkpoint based, with coordinated checkpoint; log based, with optimistic log (sender based), pessimistic log or causal log). Systems shown: Optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; Manetho, n faults [EZ92]; Cocheck, independent of MPI [Ste96]; Starfish, enrichment of MPI [AF99]; FT-MPI, modification of MPI routines, user fault treatment [FD00]; Egida [RAV99]; Clip, semi-transparent checkpoint [CLP97]; MPI/FT, redundancy of tasks [BNC01]; Pruitt 98, 2 faults, sender based [PRU98]; MPI-FT, N faults, centralized server [LNLE00]; Sender based Mess. Log., 1 fault, sender based [JZ87]; MPICH-V2, N faults, distributed logging.

  7. Checkpoint techniques. Coordinated Checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes → global synchronization, network flush, not scalable. [Figure: nodes over time with Sync, Ckpt, failure, detection / global stop, restart.] Uncoordinated Checkpoint • No global synchronization (scalable) • Nodes may checkpoint at any time (independently of the others) • Need to log nondeterministic events: in-transit messages. [Figure: nodes over time with Ckpt, failure, detection, restart.]

  8. Outline • Introduction • Architecture • Performances • Perspective & Conclusion

  9. MPICH-V1. [Architecture figure: Dispatcher, Channel Memories, Checkpoint servers and nodes connected by the network; nodes use Get and Put on a Channel Memory.] [Plot: time (sec) against message size (KB, 0 to 384), mean over 100 measurements, curves for P4 and ch_cm, labelled 5.6 MB/s and 10.5 MB/s, a factor of about 2 apart.]

  10. MPICH-V2 protocol. A new protocol (never published yet) based on 1) Splitting message logging and event logging 2) Sender based message logging 3) Pessimistic approach (reliable event logger) • Definition 3 (Pessimistic Logging protocol): Let P be a communication protocol, and E an execution of P with at most f concurrent failures. Let $M_C$ denote the set of messages transmitted between the initial configuration and the configuration C of E. P is a pessimistic message logging protocol if and only if $\forall C \in E,\ \forall m \in M_C,\ (|Depend_C(m)| > 1) \Rightarrow \text{Re-Executable}(m)$ • Theorem 2: The protocol of MPICH-V2 is a pessimistic message logging protocol. Key points of the proof: A. Every nondeterministic event has its logical clock logged on reliable media. B. Every message reception logged on reliable media is re-executable: the message payload is saved on the sender, and the sender will produce the message again and associate the same unique logical clock with it.
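
The three ingredients above can be illustrated with a small in-memory simulation. This is only a sketch under assumptions: the class and method names (`EventLogger`, `Process`, `recover`) are hypothetical and are not the MPICH-V2 API, event logging is made synchronous for simplicity, and the crashed process restarts from the beginning rather than from a checkpoint.

```python
from collections import defaultdict


class EventLogger:
    """Stands in for the reliable event logger (hypothetical name)."""

    def __init__(self):
        # receiver id -> ordered list of (sender id, sender logical clock)
        self.events = defaultdict(list)

    def log(self, receiver, sender, clock):
        self.events[receiver].append((sender, clock))
        return True  # acknowledgement


class Process:
    def __init__(self, pid, logger):
        self.pid = pid
        self.logger = logger
        self.clock = 0          # logical clock attached to each send
        self.payload_log = {}   # (dest id, clock) -> payload, kept on the sender
        self.unacked = 0        # reception events not yet acknowledged
        self.delivered = []

    def send(self, dest, payload):
        # Pessimistic rule: no send while a reception event is still unlogged.
        assert self.unacked == 0
        self.clock += 1
        self.payload_log[(dest.pid, self.clock)] = payload  # sender-based log
        dest.receive(self.pid, self.clock, payload)

    def receive(self, sender, clock, payload):
        self.delivered.append(payload)
        self.unacked += 1
        if self.logger.log(self.pid, sender, clock):  # event only, no payload
            self.unacked -= 1

    def resend(self, dest_pid, clock):
        return self.payload_log[(dest_pid, clock)]


def recover(pid, logger, peers):
    """Replay receptions in the logged order, fetching payloads from senders."""
    fresh = Process(pid, logger)
    for sender, clock in logger.events[pid]:
        fresh.delivered.append(peers[sender].resend(pid, clock))
    return fresh


el = EventLogger()
p, q, r = Process("p", el), Process("q", el), Process("r", el)
q.send(p, "a")
r.send(p, "b")
q.send(p, "c")
recovered = recover("p", el, {"q": q, "r": r})  # p crashed and restarts
print(recovered.delivered)
```

The event logger stores only small (sender, clock) pairs, while the payloads stay in the senders' memory, which is the split between event logging and message logging named in point 1.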

  11. Message logger and event logger. [Figure: timelines of processes p, q and r and the event logger for p. Labels: message m, logged event (id, l), crash of p, restart of p, re-execution phase, points A, B, C, D.]

  12. Computing node. [Figure: a Node runs the MPI process (with CSAC) and the V2 daemon. Labels: Send, Receive, keep payload; Reception event and ack (Event Logger); Ckpt Control and Checkpoint Image (Ckpt Server).]

  13. Impact of uncoordinated checkpoint + sender based message logging. [Figure: P0 and P1 with their checkpoint images, P1's ML holding messages 1 and 2, the EL holding events 1, 2, and the CS.] • Obligation to checkpoint Message Loggers on computing nodes • Garbage collector required for reducing ML checkpoint size.

  14. Garbage collection. [Figure: P0 and P1 with their checkpoint images, P1's ML holding messages 1, 2 and 3, the EL and the CS; annotation: 1 and 2 can be deleted → garbage collector.] Receiver checkpoint completion triggers the garbage collector of senders.
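
The garbage-collection rule can be sketched as follows. The names (`SenderLog`, `on_receiver_checkpoint`) and the per-receiver sequence numbers are assumptions made for illustration, not the MPICH-V2 implementation.

```python
class SenderLog:
    """Message log kept on a sender: receiver id -> {sequence number: payload}."""

    def __init__(self):
        self.entries = {}

    def record(self, receiver, seq, payload):
        self.entries.setdefault(receiver, {})[seq] = payload

    def on_receiver_checkpoint(self, receiver, last_seq_in_checkpoint):
        # Messages already reflected in the receiver's checkpoint will never
        # be requested again, so their payloads can be dropped.
        kept = {
            seq: payload
            for seq, payload in self.entries.get(receiver, {}).items()
            if seq > last_seq_in_checkpoint
        }
        self.entries[receiver] = kept

    def pending(self, receiver):
        return dict(self.entries.get(receiver, {}))


log = SenderLog()
for seq, payload in [(1, "m1"), (2, "m2"), (3, "m3")]:
    log.record("P0", seq, payload)

# P0 completes a checkpoint taken after it received messages 1 and 2.
log.on_receiver_checkpoint("P0", last_seq_in_checkpoint=2)
print(log.pending("P0"))
```

Only message 3 remains in the log, which mirrors the "1 and 2 can be deleted" annotation on the slide.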

  15. Scheduling Checkpoint • Uncoordinated checkpoint leads to logging in-transit messages • Scheduling checkpoints simultaneously will lead to bursts in the network traffic • Checkpoint size can be reduced by removing message logs • Coordinated checkpoint (Lamport) • Requires global synchronization • Checkpoint traffic should be flattened • Checkpoint scheduling should evaluate the cost and benefit of each checkpoint. [Figure: P0, P0's ML, P1, P1's ML and the CS, with messages 1, 2 and 3. Annotations: "1, 2 and 3 can be deleted → garbage collector"; "No message checkpoint needed"; "3 needs to be checkpointed"; "1 and 2 can be deleted → garbage collector".]

  16. Node (Volatile): Checkpointing • User-level checkpoint: Condor Stand Alone Checkpointing (CSAC) • Clone checkpointing + non-blocking checkpoint: on a checkpoint order, (1) fork, (2) terminate ongoing communications, (3) close sockets, (4) call ckpt_and_exit(). On restart, resume execution using CSAC just after (4), reopen sockets and return. [Figure labels: code, Ckpt order, CSAC, libmpichv, fork.] • Checkpoint image is sent to the CS on the fly (not stored locally)
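
The idea of clone checkpointing can be sketched on a POSIX system with `os.fork`. This is an illustration under assumptions, not the CSAC mechanism: the clone is assumed to be the one that writes the image, a pickled dictionary stands in for the process image, a pipe stands in for the connection to the checkpoint server, and steps (2) and (3) are omitted.

```python
import os
import pickle


def checkpoint_nonblocking(state):
    """Fork a clone that streams a snapshot of `state`; return (pid, read fd)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Clone: it owns a frozen copy of the memory as of the fork.
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(state, out)  # streamed out, not stored locally
        finally:
            os._exit(0)                  # the clone never returns to the caller
    os.close(write_fd)
    return pid, read_fd


state = {"iteration": 3, "partial_sum": 42}
pid, fd = checkpoint_nonblocking(state)
state["iteration"] = 4                   # the original process keeps computing

with os.fdopen(fd, "rb") as src:         # stands in for the checkpoint server
    image = pickle.load(src)
os.waitpid(pid, 0)
print(image)
```

The snapshot reflects the state at the time of the fork, even though the original process has moved on, which is what makes the checkpoint non-blocking.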

  17. Library: based on MPICH 1.2.5 • A new device: 'ch_v2' device • All ch_v2 device functions are blocking communication functions built over the TCP layer. [Figure: MPI_Send, MPID_SendControl, MPID_SendChannel and _v2bsend across the layers ADI, Channel Interface, Chameleon Interface, Binding and V2 device Interface.] V2 device interface: _v2bsend - blocking send; _v2brecv - blocking receive; _v2probe - check for any message avail.; _v2from - get the src of the last message; _v2Init - initialize the client; _v2Finalize - finalize the client

  18. Outline • Introduction • Architecture • Performances • Perspective & Conclusion

  19. Performance evaluation. Cluster: 32 Athlon 1800+ CPUs, 1 GB, IDE disc + 16 dual Pentium III, 500 MHz, 512 MB, IDE disc + 48-port 100 Mb/s Ethernet switch. Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp). A single reliable node hosts the Checkpoint Server + Event Logger + Checkpoint Scheduler + Dispatcher. [Figure: the reliable node and the computing nodes connected by the network.]

  20. Bandwidth and Latency. Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs). Latency is high due to the event logging → a receiving process can send a new message only when the reception event has been successfully logged (3 TCP messages for a communication). Bandwidth is high because event messages are short.
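
For scale, the three 0-byte latencies quoted above can be compared directly; the ratios below are plain arithmetic on the slide's figures and involve no further measurement:

```latex
\frac{154\,\mu\mathrm{s}}{77\,\mu\mathrm{s}} = 2.0, \qquad
\frac{277\,\mu\mathrm{s}}{77\,\mu\mathrm{s}} \approx 3.6, \qquad
277\,\mu\mathrm{s} - 77\,\mu\mathrm{s} = 200\,\mu\mathrm{s}
```

MPICH-V2 therefore has roughly 3.6 times the 0-byte latency of MPICH-P4, an extra 200 µs per message, which is in line with the slide's remark that one communication costs 3 TCP messages.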

  21. NAS Benchmark Class A and B. [Plots: performance in Megaflops; annotations: Latency, Memory capacity (logging on disc).]

  22. Breakdown of the execution time

  23. Faulty execution performance. 1 fault every 45 sec! +190 s (+80%)

  24. Outline • Introduction • Architecture • Performances • Perspective & Conclusion

  25. Perspectives • Compare to coordinated techniques • Threshold of fault frequency where logging techniques are more valuable • MPICH-V/CL • Cluster 2003 • Hierarchical logging for Grids • Tolerate node failures & cluster failures • MPICH-V3 • SC 2003 Poster session • Address the latency of MPICH-V2 • Use causal logging techniques?

  26. Conclusion • MPICH-V2 is a completely new protocol replacing MPICH-V1, removing the channel memories • The new protocol is pessimistic and sender based • MPICH-V2 reaches a ping-pong bandwidth close to that of MPICH-P4 • MPICH-V2 cannot compete with MPICH-P4 on latency • However, for applications with large messages, performance is close to that of P4 • In addition, MPICH-V2 resists up to one fault every 45 seconds. • Main conclusion: MPICH-V2 requires far fewer stable nodes than MPICH-V1, with better performance. Come see the MPICH-V demos at booth 3315 (INRIA)

  27. Crash re-execution performance (1). Time for the re-execution of a token ring on 8 nodes, according to the token size and the number of restarted nodes

  28. Re-execution performance (2)

  29. Logging techniques. [Figure: initial execution with ckpt and crash; replayed execution starts from the last checkpoint (of this process).] The system must provide the messages to be replayed, and discard the re-emissions. • Main problems: • Discard re-emissions (technical) • Ensure that messages are replayed in a consistent order
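
One common way to discard re-emissions, shown here only as a sketch, is to number messages per sender and drop anything at or below the highest number already delivered. The slide does not say how MPICH-V2 does it; the `Receiver` class, the sequence numbers starting at 1 and the FIFO channel per sender are all assumptions.

```python
class Receiver:
    """Delivers each (sender, sequence number) at most once."""

    def __init__(self):
        self.last_seen = {}   # sender id -> highest sequence number delivered
        self.delivered = []

    def on_message(self, sender, seq, payload):
        if seq <= self.last_seen.get(sender, 0):
            return False      # re-emission from a replayed execution: discard
        self.last_seen[sender] = seq
        self.delivered.append(payload)
        return True


rx = Receiver()
results = [rx.on_message("p", seq, f"m{seq}") for seq in (1, 2, 3)]

# p crashes, restarts from a checkpoint taken after message 1, and its
# replayed execution sends messages 2 and 3 again before producing message 4.
results += [rx.on_message("p", seq, f"m{seq}") for seq in (2, 3, 4)]
print(results, rx.delivered)
```

The receiver sees messages 2 and 3 twice but delivers them once, so the replayed execution of p stays invisible to the processes that did not fail.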

  30. Large Scale Parallel and Distributed Systems and programming • Many HPC applications use the message passing paradigm • Message passing: MPI • We need a Volatility tolerant Message Passing Interface implementation • Based on MPICH-1.2.5, which implements MPI standard 1.1

  31. Checkpoint Server (stable). Checkpoint images are stored on reliable media (disc): 1 file per Node (name given by the Node). Multiprocess server: poll, treat event and dispatch job to other processes. Incoming message: Put ckpt transaction. Outgoing message: Get ckpt transaction + control. Open sockets: one per attached Node, one per home CM of attached Nodes

  32. NAS Benchmark Class A and B. [Plots; annotations: Latency, Memory capacity (logging on disc).]
