Explore how Civilian Worms maintain reliability in unstable systems, including leader election, forward progress, and application correctness in parallel environments. Learn about monitoring, replication, and the control of worm states. Discover our LE algorithm for maintaining system stability.
Civilian Worms: Ensuring Reliability in an Unreliable Environment Sanjeev R. Kulkarni University of Wisconsin-Madison sanjeevk@cs.wisc.edu Joint Work with Sambavi Muthukrishnan
Outline • Motivation and Goals • Civilian Worms • Master-Worker Model • Leader Election • Forward Progress • Correctness • Parallel Applications
What’s happening today • Move towards clusters • Resource Managers • e.g., Condor • Dynamic environment
Motivation • Large Parallel/Standalone Applications • Non-Dedicated Resources • e.g., a Condor environment • Machines can disappear at any time • Unreliable commodity clusters • Hardware failures • Network failures • Security attacks!
What’s available • Parallel Platforms • MPI • MPI-1: machines can’t go away! • MPI-2: any takers? • PVM • Shoot the master! • Condor • Shoot the Central Manager!
Goal • Bottleneck-free infrastructure in an unreliable environment • Ensure “normal termination” of applications • Users submit their jobs • Get e-mail upon completion!
Focus of this talk • Approaches for Reliability • Standalone Applications • Monitor framework (worms!) • Replication • Parallel Applications • Future work!
Worms are here again! • Usual Worms • Self replicating • Hard to detect and kill • Civilian Worms • Controlled replication • Spread legally! • Monitor applications
Desired Monitoring System • Diagram legend: W = worm, C = computation
Issues • Management of worms • Distributed State detection • Very hard • Forward Progress • Checkpointing • Correctness
Management Models • Master-Worker • Simple • Effective • Our Choice! • Symmetric • Difficult to manage the model itself!
Our Implementation Model • Diagram: one master worm coordinating several worker worms • W = worm, C = computation
Worm States • Master • Maintains the state of all the worm segments • Listens on a particular socket • Respawns failed worm segments • Worker • Periodically pings the master • Starts the encapsulated process if instructed • Leader Election • Invokes the LE algorithm to elect a new master • Note: independent of application state
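A minimal sketch of the two roles in this master-worker worm, assuming a simple UDP ping protocol; the message names, port, timeouts, and the print placeholders for respawn and job start are illustrative, not the actual implementation:

```python
import socket
import time

PING_INTERVAL = 1.0   # how often a worker pings (illustrative)
PING_TIMEOUT = 3.0    # silence after which the other side is presumed dead

def worker_loop(my_id, master_addr):
    """Worker segment: ping the master; fall back to leader election on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(PING_TIMEOUT)
    while True:
        sock.sendto(f"PING {my_id}".encode(), master_addr)
        try:
            msg, _ = sock.recvfrom(1024)
        except socket.timeout:
            return "LEADER_ELECTION"                        # master presumed dead
        if msg.startswith(b"START"):
            print("starting encapsulated computation")      # stand-in for launching the job
        time.sleep(PING_INTERVAL)

def master_loop(port):
    """Master segment: track pings from every worm segment, respawn silent ones."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    sock.settimeout(1.0)
    last_seen = {}                                          # segment address -> last ping time
    while True:
        try:
            msg, addr = sock.recvfrom(1024)
            last_seen[addr] = time.time()
            sock.sendto(b"ACK", addr)
        except socket.timeout:
            pass
        for addr, t in list(last_seen.items()):
            if time.time() - t > PING_TIMEOUT:
                print("respawning worm segment that was at", addr)  # stand-in for respawn
                del last_seen[addr]
```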
Leader Election • The woes begin! • Master goes down • Detection • Worker ping times out • Timeout value • Worker gets an LE message • Action • Worker goes into LE state
LE algorithm • Each worm segment is given an id • Only the master assigns ids • Workers broadcast their ids • The worker with the lowest id wins
Brief Skeleton • While in LE • Broadcast an LE message with your id • Set min = your id • On getting an LE message with id i • If i >= min: ignore • Else: min = i • min is the new master
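A compact sketch of the skeleton above. The send/recv transport helpers, the message tuples, and the COORD announcement are assumptions (the real system uses UDP broadcast):

```python
def leader_election(my_id, peers, send, recv):
    """Lowest-id-wins election, following the skeleton on the slide above.

    send(addr, msg) and recv(timeout) -> msg-or-None are assumed transport
    helpers; peers is the list of the other worm segments' addresses.
    """
    lowest = my_id
    for addr in peers:
        send(addr, ("LE", my_id))            # broadcast an LE message with your id
    while True:
        msg = recv(timeout=1.0)
        if msg is None:                      # no more competing LE messages
            break
        kind, other_id = msg
        if kind == "LE" and other_id < lowest:
            lowest = other_id                # else ignore (i >= min)
    if lowest == my_id:                      # min is the new master
        for addr in peers:
            send(addr, ("COORD", my_id))
        return "MASTER"
    return "WORKER"                          # wait for the COORD from the new master
```

The full protocol also involves the COORD_ACK and KILL handshakes shown on the next slides; this sketch stops at picking the lowest id.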
LE in action (1) • Diagram: master M0 with workers W1 and W2 • Master goes down!
LE in action (2) • Diagram: L1 and L2 exchanging (LE, 1) and (LE, 2) • L1 and L2 send out LE messages
LE in action (3) • Diagram: L2 replying to L1 with COORD_ACK • L1 gets (LE, 2) and ignores it • L2 gets (LE, 1) and sends COORD_ACK
LE in action (4) • Diagram: new master M1, worker W2, and a freshly spawned segment • M1 sends COORD to W2 and spawns a replacement worker
Implementation Problems • Too many cases • Many unclear cases • Time to Converge • Timeout values • Network Partition
What happens if? • Master still up? • Incoming id < self id => goes to LE mode • Else => sends back COORD message • Next master in line goes down? • Timeout on COORD message receipt • Late COORD_ACK? • Sends KILL message
More Bizarre Cases • Multiple masters? • The master broadcasts its id periodically • Conflict is resolved using the lowest-id method • No master? • Workers will time out soon!
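A sketch of how the master-side edge cases from the last two slides could be handled. The message kinds, the MASTER_ID broadcast name, and the enter_le_mode hook are illustrative assumptions:

```python
def handle_message_as_master(my_id, msg, send, enter_le_mode):
    """Master-side handling of the edge cases described above.

    msg is assumed to be (kind, other_id, reply_addr); enter_le_mode() drops
    this segment back into the leader-election state.
    """
    kind, other_id, reply_addr = msg
    if kind == "LE":
        if other_id < my_id:
            enter_le_mode()                       # incoming id < self id: go to LE mode
        else:
            send(reply_addr, ("COORD", my_id))    # else: tell the worker who the master is
    elif kind == "MASTER_ID" and other_id < my_id:
        enter_le_mode()                           # two masters: lowest id wins, this one yields
```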
Test-Bed • 64 dual-processor 550 MHz P-III nodes • Linux 2.2.12 • 2 GB RAM • Fast interconnect (100 Mbps) • Master-worker communication via UDP
A Stress Test for LE • Test • Workers ping every second • Kill n/4 workers • After 1 sec, kill the master • After 0.5 sec, kill the next master in line • Kill n/4 workers again
Forward Progress • Why? • MTTF < application time • Solutions • Checkpointing • Application Level • Process level • Start from checkpoint image!
Checkpoint • Address space • Condor checkpoint library • Rewrites object files • Writes the checkpoint to a file on SIGUSR2 • Files • Assumption: common file system
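The real system checkpoints the whole address space via the Condor checkpoint library; purely as an illustration of the SIGUSR2 trigger described above, here is a sketch that only pickles explicit application state (the path and state layout are made up):

```python
import os
import pickle
import signal

CKPT_PATH = "/shared/ckpt/%d.ckpt" % os.getpid()     # assumes the common file system

app_state = {"iteration": 0, "partial_result": []}   # stand-in for real application state

def write_checkpoint(signum, frame):
    """On SIGUSR2, write the current state to a file the worm can restart from."""
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(app_state, f)
    os.rename(tmp, CKPT_PATH)    # atomic rename so a crash never leaves a torn image

signal.signal(signal.SIGUSR2, write_checkpoint)
```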
Correctness • File Access • Read Only, no problems • Writes • Possible inconsistency if multiple processes access • Inconsistency across checkpoints? • Need a new File Access Algorithm
Solution: Individual Versions • File Access Algorithm • On open • If first open • Read: nothing • Write: create a local copy and set a mapping • Else • If mapped: access the mapped file • If write: create a local copy and set a mapping • On close • Preserve the mapping
File Access cont. • Commit Point • On completion of the computation • Checkpoint • Includes mapped files
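A sketch of the individual-versions idea from the two slides above, with the interposed open() replaced by an explicit helper class; the paths, mapping table, and commit step are illustrative, not the actual implementation:

```python
import os
import shutil

class VersionedFiles:
    """Per-segment file versions: writes go to a private copy, reads of unmapped
    files go to the shared file; the mapping survives close and would be carried
    along in the checkpoint."""

    def __init__(self, private_dir):
        self.private_dir = private_dir
        self.mapping = {}                             # shared path -> private copy

    def open_file(self, path, mode="r"):
        if path in self.mapping:                      # already mapped: use the private copy
            return open(self.mapping[path], mode)
        if any(flag in mode for flag in ("w", "a", "+")):
            private = os.path.join(self.private_dir, os.path.basename(path))
            if os.path.exists(path):
                shutil.copy(path, private)            # first write: snapshot the shared file
            self.mapping[path] = private
            return open(private, mode)
        return open(path, mode)                       # plain read, no mapping needed

    def commit(self):
        """Commit point: publish the private copies back to the shared files."""
        for shared, private in self.mapping.items():
            shutil.copy(private, shared)
```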
Being more Fancy • Security attacks • Civilian-to-military transition • Hide yourself from ps • Re-fork periodically to avoid detection
Conclusion • LE is VERY HARD • Don’t take it for a course project! • Does our system work? • 16 nodes: YES • 32 nodes: NO • Quite Reliable
Future Direction • Robustness • Extension to parallel programs • Re-write send/recv calls • Routing issues • Scalability issues? • A hierarchical design?
References • Cohen, F. B., “A Case for Benevolent Viruses”, http://www.all.net/books/integ/goodvcase.html • M. Litzkow and M. Solomon, “Supporting Checkpointing and Process Migration Outside the UNIX Kernel”, Usenix Conference Proceedings, San Francisco, CA, January 1992 • Gurdip Singh, “Leader Election in Complete Networks”, PODC ’92
Implementation Arch. • Diagram: worm components (Communicator, Dispatcher, Dequeuer, Checkpointer) with Remove Checkpoint, Prepend, and Append operations around the Computation
Parallel Programs • Communication • Connectivity across failures • Re-write send/recv socket calls • Limitations of Master-Worker Model? • Not really!
Communication • Checkpoint markers • Buffer all data between checkpoint markers • Help of master in rerouting
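One way to picture the buffering between checkpoint markers; the channel class, the marker call, and the master-assisted replay are illustrative assumptions rather than the actual rewritten send/recv calls:

```python
class BufferedChannel:
    """Keep every message sent since the peer's last checkpoint marker so it can
    be replayed if that peer rolls back to its checkpoint and is respawned."""

    def __init__(self, raw_send):
        self.raw_send = raw_send        # the underlying (rewritten) socket send
        self.unacked = []               # (peer, data) sent since the last marker

    def send(self, peer, data):
        self.unacked.append((peer, data))
        self.raw_send(peer, data)

    def checkpoint_marker(self, peer):
        """Peer has taken a checkpoint: data sent before the marker can be dropped."""
        self.unacked = [(p, d) for p, d in self.unacked if p != peer]

    def replay(self, peer, new_addr):
        """After the master reroutes us to a restarted peer, resend its buffered data."""
        for p, data in self.unacked:
            if p == peer:
                self.raw_send(new_addr, data)
```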