
Fault Tolerant Distributed Systems


Presentation Transcript


  1. Fault Tolerant Distributed Systems A Survey by Nirmit Desai

  2. The track • Review the approaches to fault tolerance (coupled with mutex algorithms in some cases, #1 - #3) and get an idea of the terminology and the field
  • #1: Revannaswamy, Bhatt - ‘97
  • #2: Chang, Singhal, Liu - ‘90
  • #3: Helary, Mostefaoui - ‘94
  • #4: Stumm, Zhou - ‘90
  • #5: Nett, Mock, Theisohn - ‘97
  • #6: Ballarini, Bernardi, Donatelli - ‘02
  • #7: Hardekopf, Kwiat, Upadhyaya - ‘01
  • #8: Injong Rhee - ‘95
  • #9: Goyer, Momtahan, Selic - ‘90
  • #10: Frank Mueller - ‘01
  • Summary
  • Questions

  3. #1: Revannaswamy, Bhatt - ‘97
  • O(log N), N = number of nodes in the spanning tree
  • Communicate with neighbours only (Raymond’s algorithm); a sketch of the underlying token passing follows below
  • Tolerant to single link / single node failures
  • Eliminate the failed component and obtain a different tree structure (the network is biconnected)
  • A mechanism to detect the failures is assumed to exist
  • Only “branches” are used for message exchanges
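  The following is a minimal, single-process simulation of the tree-based token passing that #1 extends (Raymond’s algorithm): requests travel along holder pointers toward the token, and the token travels back, re-orienting the edges it crosses. The class and method names are illustrative rather than taken from the paper, and the failure handling that #1 adds is omitted.

    from collections import deque

    class Node:
        """One node of Raymond's tree-based mutual exclusion algorithm (simplified)."""
        def __init__(self, name, holder=None):
            self.name = name
            self.holder = holder      # tree edge pointing toward the current token holder (None = we hold it)
            self.queue = deque()      # pending requests from this node and its neighbours
            self.asked = False        # at most one outstanding request toward the holder

        def request_cs(self):
            self.queue.append(self)
            self._serve_or_forward()

        def receive_request(self, requester):
            self.queue.append(requester)
            self._serve_or_forward()

        def receive_token(self):
            self.holder, self.asked = None, False
            self._serve_or_forward()

        def _serve_or_forward(self):
            if self.holder is None and self.queue:
                head = self.queue.popleft()
                if head is self:
                    print(f"{self.name} enters the critical section")
                else:
                    self.holder = head            # the token moves one edge toward the requester
                    head.receive_token()
            if self.holder is not None and self.queue and not self.asked:
                self.asked = True
                self.holder.receive_request(self)

    # Tiny chain a - b - c where a initially holds the token:
    a = Node("a"); b = Node("b", holder=a); c = Node("c", holder=b)
    c.request_cs()    # the request travels c -> b -> a, the token travels back a -> b -> c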

  4. #1 Contd - Reconfiguration
  • Repeat the reconfiguration steps until all nodes are reconnected OR there are insufficient chords
  • At the end of reconfiguration, all data structures are reset and the algorithm is restarted; the old token holder is pre-empted and the token is transferred to a newly elected leader

  5. #2: Chang, Singhal, Liu - ‘90
  • O((2+K) * log N), where each node has K alternate paths to the token node
  • Works with a DAG; no cycles should ever develop
  • FIFO ordering is violated under the greedy strategy

  6. #2 Contd. - Fault Tolerance
  • A mechanism to detect the failures is assumed to exist
  • Tolerates single link / single node failure and recovery
  • Any path constructed by taking outgoing edges always leads to the token holder
  • Studies all the states of the system at which a failure may occur and gives a solution for tolerance in each state
  • At recovery, the node asks its neighbours about its state in order to reconstruct its data structures (a sketch follows below)
  • Fairness is compromised at recovery - the queue order is arbitrary
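  Below is a rough, hedged sketch of that recovery step: the restarted node asks each neighbour what it last knew about their shared edge and any pending request, and rebuilds its outgoing-edge set and request queue from the replies. The message fields and class names are invented for illustration and are not the paper’s interface.

    from dataclasses import dataclass, field

    @dataclass
    class NeighbourReply:
        edge_to_me: bool          # does the neighbour's outgoing edge point at the recovering node?
        pending_request: bool     # was a request routed through the recovering node still outstanding?

    @dataclass
    class RecoveringNode:
        name: str
        out_edges: set = field(default_factory=set)        # edges that lead toward the token holder
        request_queue: list = field(default_factory=list)  # requests to forward again

        def recover(self, replies):
            """replies: neighbour name -> NeighbourReply, gathered after the restart."""
            self.out_edges.clear()
            self.request_queue.clear()
            for neighbour, reply in replies.items():
                if not reply.edge_to_me:
                    # the neighbour is closer to the token holder, so keep an edge toward it
                    self.out_edges.add(neighbour)
                if reply.pending_request:
                    # re-queue lost requests; their original order is unknown,
                    # which is why fairness is compromised at recovery
                    self.request_queue.append(neighbour)

    n = RecoveringNode("p3")
    n.recover({"p1": NeighbourReply(edge_to_me=True, pending_request=True),
               "p2": NeighbourReply(edge_to_me=False, pending_request=False)})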

  7. #3: Helary, Mostefaoui - ‘94
  • Worst case O(lg n) for mutex, average O(lg n) for fault tolerance
  • Fairness is compromised at recovery - the queue order is arbitrary
  • Best of both worlds: Naimi (dynamic) and Raymond (static)
  • The node topology preserves a binomial tree (open cube) structure
  • Assumes finite message delay for detection of node failures

  8. #3 Contd. - Fault Tolerance
  • Handles multiple and simultaneous node failures, but no link failures
  • Due to bounded message delays and estimated critical section lengths, a node can suspect a failure by timeout
  • It then enquires the appropriate nodes about their state and deduces whether the suspicion is correct (a sketch follows below)
  • In case of a failure, some data structures are updated and the tree is reconfigured
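  A minimal sketch of that suspicion-then-confirmation pattern, assuming known bounds on message delay and critical-section length; the constants and the `query` callback are placeholders for illustration, not values or interfaces from the paper.

    MAX_MSG_DELAY = 0.5   # assumed bound on message delay (seconds)
    MAX_CS_LENGTH = 2.0   # assumed bound on critical-section length (seconds)

    def suspect_failure(last_heard: float, now: float) -> bool:
        # Silence longer than one critical section plus a request/grant round trip is suspicious.
        return now - last_heard > MAX_CS_LENGTH + 2 * MAX_MSG_DELAY

    def confirm_failure(suspected, peers, query) -> bool:
        # Enquire the appropriate nodes about their view of the suspected node;
        # query(peer, suspected) stands in for a real state-enquiry message.
        heard_recently = [query(peer, suspected) for peer in peers]
        return not any(heard_recently)   # nobody vouches for it: treat the failure as confirmed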

  9. #4: Stumm, Zhou
  • Extension of coherent DSM algorithms for fault tolerance
  • Tolerance to single host failures, no link failures
  • Availability of broadcast (multicast) communication is assumed
  • Communication cost is assumed to be larger than memory access cost
  • Critical state (DSM data, state info) is replicated to multiple hosts
  • Central server algorithm: a central server manages the data (request/response)
  • Back up the central server state; fault detection by timeouts (a sketch follows below)
  • Full replication algorithm: all data on each host, local reads, sequenced writes (by sequencers), simple recovery by obtaining the latest copy from other hosts
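  The sketch below illustrates the fault-tolerant central-server idea under stated assumptions: the server mirrors every write of the critical state to a backup, and a client that times out on the primary fails over to the backup. All class and method names are illustrative; the real algorithm replicates more state than the page contents shown here.

    class DsmServer:
        """Central server owning the shared pages, mirroring writes to a backup."""
        def __init__(self, backup=None):
            self.pages = {}           # page id -> contents (the critical state)
            self.backup = backup

        def write(self, page, data):
            self.pages[page] = data
            if self.backup is not None:
                self.backup.pages[page] = data   # replicate before acknowledging the write

        def read(self, page):
            return self.pages.get(page)

    class DsmClient:
        """Client that detects a failed primary by timeout and switches to the backup."""
        def __init__(self, primary, backup):
            self.primary, self.backup = primary, backup

        def read(self, page):
            try:
                return self.rpc(self.primary.read, page)
            except TimeoutError:
                self.primary = self.backup       # fault detected by timeout: fail over
                return self.rpc(self.primary.read, page)

        def rpc(self, fn, *args):
            # stand-in for a remote call with a deadline; a real system would go over the network
            return fn(*args)

    backup = DsmServer(); primary = DsmServer(backup=backup)
    primary.write("x", 42)
    print(DsmClient(primary, backup).read("x"))   # 42, even if the primary later fails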

  10. #4 Contd.
  • Migration algorithm: data is always migrated to the accessing host, can be integrated with the VMM of the host OS, maintains sequential consistency
  • Read replication algorithm: similar to migration, but no migration for read accesses
  • Comparisons:
  • The fault-tolerant versions of central server and full replication do not introduce a substantial overhead
  • Central server is better for infrequent shared data accesses
  • Full replication is better for sparse write accesses

  11. #5: Nett, Mock, Theisohn
  • Specification and realization of a generalized distributed dependency management system for fault tolerance
  • Key observation: management of dynamically evolving dependencies is a key problem for fault tolerance
  • Dependency graph: nodes are sites, edges are dependencies
  • Dependency set: the set of nodes reachable from the start nodes
  • Traversal of the dependency set yields the solution; at each node (see the traversal sketch below):
  • Is the node added to the dep. set ?
  • Does the search continue ?
  • Which dep. types to follow now ?
  • In which direction ?
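  A small illustrative traversal of such a dependency set, with the per-node decisions of the list above collapsed into a single `follow` callback; the graph encoding and the callback signature are assumptions made for this sketch, not the system’s actual interface.

    from collections import deque

    def dependency_set(graph, start_nodes, follow):
        """graph: node -> list of (dep_type, direction, neighbour) edges;
        follow(node, dep_type, direction) answers the per-node questions above."""
        dep_set = set(start_nodes)
        frontier = deque(start_nodes)
        while frontier:
            node = frontier.popleft()
            for dep_type, direction, neighbour in graph.get(node, []):
                if neighbour in dep_set:
                    continue
                if follow(node, dep_type, direction):
                    dep_set.add(neighbour)       # the node is added to the dependency set
                    frontier.append(neighbour)   # and the search continues from it
        return dep_set

    # Example: follow only "commit" dependencies in the forward direction.
    graph = {"a": [("commit", "forward", "b"), ("read", "backward", "c")],
             "b": [("commit", "forward", "d")]}
    print(dependency_set(graph, {"a"}, lambda n, t, d: t == "commit" and d == "forward"))
    # -> {'a', 'b', 'd'}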

  12. #5 Contd - Example
  • Stability: completed actions have stable dependencies
  • The proposed system supports only applications that have stable dependency states
  • Does our application conform to this requirement ? See the next slide

  13. #5 Contd - Language
  • Generate a state machine with this language and verify it for stability

  14. #5 Contd - Example
  • Send messages and generate a global dependency graph to look for inconsistencies

  15. #6: Ballarini, Bernardi, Donatelli
  • Formal and theoretical discussion of the synchronization problem in distributed systems

  16. #7: Hardekopf, Kwiat, Upadhyaya
  • Security and fault tolerance for aerospace distributed communications by means of a voting mechanism
  • Basic voting protocol (a sketch of the decision step follows below):
  • The initiator sends a request to one of the voters
  • The voter multicasts the request to the other voters
  • Voters decide locally and send their vote to the client
  • The client waits for f+1 alike votes, where f is the degree of fault tolerance
  • Problems: not scalable, assumes exact voting
  • Proposes a model with a set of voters, an interface module and a user, where a majority of the voters is assumed to be always correct !!
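  A minimal sketch of the basic decision step: the client counts votes until some answer has been reported by f+1 distinct voters, which cannot happen for a wrong answer if at most f voters are faulty. The vote representation is an assumption for illustration.

    from collections import Counter

    def decide(votes, f):
        """votes: iterable of (voter_id, answer) pairs; returns an answer once f+1 voters agree."""
        counts, seen = Counter(), set()
        for voter, answer in votes:
            if voter in seen:         # count each voter at most once
                continue
            seen.add(voter)
            counts[answer] += 1
            if counts[answer] >= f + 1:
                return answer         # f+1 alike votes: at least one correct voter backs this answer
        return None                   # not enough agreement (yet)

    # Tolerating f = 1 faulty voter, "grant" wins after two matching votes:
    print(decide([("v1", "grant"), ("v2", "deny"), ("v3", "grant")], f=1))   # grant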

  17. #7 Contd - The algorithm
  • Voters are authenticated when they commit
  • O(1) messages with any number of voters
  • Hashing and digital signatures for the WAN
  • Each voter follows these steps:
  • If no other voter has committed an answer, commit the self-answer and skip
  • If some voter has committed and the results conflict, then commit the self-answer
  • After everyone has had a chance, determine the majority result
  • If the majority is in conflict, go to step 1
  • The interface module receives the first commit, buffers it and starts a timer
  • If a new commit is received, the old one is overwritten and the timer is restarted
  • When the timer expires with no new commit, the buffered result is output to the client (a sketch follows below)
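  The interface-module behaviour can be sketched as below, under assumptions: buffer the latest committed answer, restart a quiet-period timer on every overwriting commit, and deliver the buffered answer once the timer expires undisturbed. The class, callback and timeout value are illustrative, not from the paper.

    import threading

    class InterfaceModule:
        def __init__(self, deliver, quiet_period=1.0):
            self.deliver = deliver          # callback that outputs the result to the client
            self.quiet_period = quiet_period
            self.buffered = None
            self.timer = None

        def on_commit(self, answer):
            self.buffered = answer          # a newer commit overwrites the buffered one...
            if self.timer is not None:
                self.timer.cancel()         # ...and restarts the timer
            self.timer = threading.Timer(self.quiet_period, self._flush)
            self.timer.start()

        def _flush(self):
            # No further commit arrived during the quiet period: release the result.
            self.deliver(self.buffered)

    im = InterfaceModule(deliver=print)
    im.on_commit("grant")                   # prints "grant" about one second later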

  18. #8: Injong Rhee
  • Fault tolerance in distributed resource allocation by locality
  • Failure locality = max. number of processes dead due to a failure
  • Depends on the degree of distributedness of the algorithm
  • A tight lower bound on failure locality exists, expressed in terms of the maximum number of conflicting processes
  • The dining philosophers problem has an algorithm with the optimal failure locality
  • The proposed algorithm breaks the waiting chain by not allowing a process to wait for a process which is itself blocked, and achieves this bound

  19. #8 Contd - Model
  • Each process is modeled in steps: send, receive, local, fail, and local state
  • Guarded command set: [receive -> Action, ...]
  • A configuration C is the set of current local states
  • The network is a process too
  • Enabling and execution of the guarded command set (a toy sketch follows below)
  • The system is an initial configuration C0 plus execution of the guarded command sets of all the processes
  • No constraints on the relative timing of execution of process steps
  • Priorities by logical clocks (with a waiting factor)
  • Centralized resource managers manage exclusion by responding to requests for resources (with preemption)
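  A toy rendering of that model, with guards and actions as plain Python callables over local states; the encoding is an assumption made for illustration, and a real execution would choose among enabled commands nondeterministically rather than in a fixed order.

    def step(processes, configuration):
        """processes: pid -> list of (guard, action) guarded commands;
        configuration: pid -> current local state. Executes one enabled command."""
        for pid, commands in processes.items():
            state = configuration[pid]
            for guard, action in commands:
                if guard(state, configuration):          # the command is enabled
                    new_conf = dict(configuration)
                    new_conf[pid] = action(state)        # the action yields the new local state
                    return new_conf
        return configuration                             # nothing enabled: the system is quiescent

    # Example: a single process counts up to 3, one guarded command per step.
    processes = {"p": [(lambda s, c: s < 3, lambda s: s + 1)]}
    conf = {"p": 0}
    for _ in range(5):
        conf = step(processes, conf)
    print(conf)   # {'p': 3}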

  20. #9: Goyer, Momtahan, Selic
  • Hierarchical control strategy for fault tolerance (centralized application)
  • Simplicity, efficiency, predictability
  • Transient and hard failures - does the application recover or not ?
  • Soft QoS guarantees in terms of response times

  21. #10: Frank Mueller
  • Extension of Naimi’s protocol - tolerates single node failures
  • Backup communication in the form of a ring structure
  • Fault handler node: finds the faulty node by timeout, collects fault information (from 2 nodes) and takes the appropriate action (a sketch follows below)
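  A hedged sketch of that backup ring: when the fault handler’s timeout fires for a node, it locates the node’s two ring neighbours, collects their view of the failure, closes the ring around it and hands the gathered information to a repair step. The function names and the `ask_state`/`repair` callbacks are invented for illustration; the concrete recovery actions are in Mueller’s paper.

    def ring_neighbours(ring, failed):
        """ring: list of node ids ordered along the backup ring."""
        i = ring.index(failed)
        return ring[(i - 1) % len(ring)], ring[(i + 1) % len(ring)]

    def handle_fault(ring, failed, ask_state, repair):
        pred, succ = ring_neighbours(ring, failed)
        info = [ask_state(pred, failed), ask_state(succ, failed)]   # fault information from 2 nodes
        ring.remove(failed)                                         # close the ring around the failure
        repair(failed, info)                                        # e.g. rebuild the token-path state

    ring = ["n1", "n2", "n3", "n4"]
    handle_fault(ring, "n3",
                 ask_state=lambda peer, dead: f"{peer}'s last view of {dead}",
                 repair=lambda dead, info: print(f"repairing after {dead}: {info}"))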

  22. Some ideas - for our protocol
  • The first node experiencing a timeout may send a message to every other node asking for their states
  • The node experiencing the timeout must be either waiting for a grant/token to arrive OR waiting for a release to arrive
  • In the grant/token case, propagate the message up the parent chain to detect the point of failure
  • In the release case, send the message down the chain of children to see who is supposed to send a release
  • Once the point is detected, recalculate the state of the affected nodes - the ones falling on the chain (a sketch of the detection step follows below)
  • Assuming no link failures for now, multiple failures can be detected
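  A rough sketch of the detection idea, with invented node fields (`parent`, `waiting_for`, `expected_release_from`) and a `probe` callback standing in for the state-query message: walk up the parent chain when a grant/token is overdue, or over the expected children when a release is overdue, and report the first unresponsive node as the point of failure. The affected nodes to recalculate are the ones visited on the way to that point.

    def find_failure_point(node, probe):
        """probe(n) is True if node n answers the state query, False if it appears dead."""
        if node.waiting_for in ("grant", "token"):
            current = node.parent                       # the grant/token should come from above
            while current is not None:
                if not probe(current):
                    return current                      # first unresponsive ancestor
                current = current.parent
            return None                                 # the whole parent chain answered
        if node.waiting_for == "release":
            for child in node.expected_release_from:    # the release should come from below
                if not probe(child):
                    return child
        return None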

  23. Summary
  • The approaches fall into the following categories:
  • Fault tolerance embedded into the base algorithms - inspiring and concrete, but hard to import
  • Generalized approaches - demonstrate the strength of their applicability, but give no concrete algorithmic description
  • Voting approaches - can be used to reach consensus, useful in the case of multiple failures
  • Centralized approaches - simple and predictable, but may not be desirable
  • Replication-based approaches - simple, incur overhead, sometimes the only solution (especially for complex algorithms)
  • No prominent solution for detecting failures - even for very specific problems/algorithms

  24. Questions • ??
