Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Fred B. Schneider
Presenter: Aly Farahat
CS5090
Contents
• Introduction
• State Machines
• Fault-Tolerance
• Agreement & Order
• Logical Clocks
• Synchronized Clocks
• Server Side Ordering
• Faulty Output Devices
• Faulty Clients
• Using Time to Make Requests
• Reconfiguration
• Managing Reconfiguration
• Integrating Repaired Replicas
Client/Server Model
Fault Types
• Fail-Stop Faults: a faulty component enters a predefined state and halts
• Byzantine Faults: arbitrary, possibly malicious faults
Q: Why do we need logic for programs?
Fault Tolerance
• Based on the concept of Replication
• t-tolerant: the system delivers correct service as long as no more than t components fail
• Identical Replicas of Server
  • t+1 for Fail-Stop faults
  • 2t+1 for Byzantine faults
Q: What kind of fault tolerance is this? What types of faults can it tolerate?
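The replica counts on this slide can be written down as a small helper. This is only an illustrative sketch: the function name `replicas_needed` and the string labels for the fault models are assumptions made here, not part of the presentation.

```python
def replicas_needed(t: int, fault_model: str) -> int:
    """Minimum number of replicas for a t-tolerant ensemble (per the slide)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if fault_model == "fail-stop":
        # One surviving replica is enough, since fail-stop replicas never lie.
        return t + 1
    if fault_model == "byzantine":
        # Correct replicas must outnumber the t possibly lying ones.
        return 2 * t + 1
    raise ValueError(f"unknown fault model: {fault_model!r}")


print(replicas_needed(2, "fail-stop"))  # 3
print(replicas_needed(2, "byzantine"))  # 5
```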
Replication Scheme
State Machine Model
• Each Server Replica is an identical state machine
• State Machines are Request-Driven Machines and cannot progress on their own
• A client issues a Request to the State Machine
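A minimal sketch of a request-driven state machine may make the model concrete. The memory-cell machine and the names `MemoryStateMachine` and `apply` are hypothetical choices for illustration; the only property being shown is that identical replicas fed the same requests in the same order produce the same outputs.

```python
class MemoryStateMachine:
    """A deterministic, request-driven state machine (illustrative only)."""

    def __init__(self):
        self.store = {}  # state variables

    def apply(self, request):
        """Process one request; the machine never acts without one."""
        op, *args = request
        if op == "write":
            loc, value = args
            self.store[loc] = value
            return None
        if op == "read":
            (loc,) = args
            return self.store.get(loc)
        raise ValueError(f"unknown operation: {op!r}")


requests = [("write", "x", 1), ("write", "x", 2), ("read", "x")]
replica_a, replica_b = MemoryStateMachine(), MemoryStateMachine()
outputs_a = [replica_a.apply(r) for r in requests]
outputs_b = [replica_b.apply(r) for r in requests]
print(outputs_a == outputs_b)  # True
```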
State Machine Behavior with respect to clients
• O1: Requests issued by a single client should be processed in the same order they were issued
• O2: If a request r2 is causally related to r1, r1 should be processed before r2
Example
Q: Find the analogy between a state machine in this context and the FSMs used in sequential circuit synthesis
Agreement and Order
• Coordination is necessary to assure O1 and O2
• Agreement: All Replicas agree upon the value of the request they should process
• Order: All Replicas should process requests in the same order (agree on the order of requests)
• Stable Request: a request whose value and order are agreed upon among Replicas
Agreement
• IC1: All nonfaulty processors agree on the same value
• IC2: If the transmitter is nonfaulty, all nonfaulty processors use its value as the one on which they agree
Q: How can faulty processors be identified under a Byzantine fault model?
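IC1 and IC2 are properties of an outcome, so they can be checked after the fact. The sketch below is a checker, not an agreement protocol; the function name and argument layout are assumptions made for illustration.

```python
def satisfies_ic(decisions, transmitter_value, transmitter_faulty):
    """Check IC1 and IC2 for one round of agreement.

    decisions: dict mapping each NONFAULTY processor to the value it decided.
    """
    values = list(decisions.values())
    # IC1: all nonfaulty processors agree on the same value.
    ic1 = len(set(values)) <= 1
    # IC2: only constrains the outcome when the transmitter is nonfaulty.
    ic2 = transmitter_faulty or all(v == transmitter_value for v in values)
    return ic1 and ic2


print(satisfies_ic({"p1": "a", "p2": "a"}, "a", transmitter_faulty=False))  # True
print(satisfies_ic({"p1": "a", "p2": "b"}, "a", transmitter_faulty=True))   # False
```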
Order and Stability
• Order: all replicas process the requests in the same order
• Stability: a property of a request, meaning that it is in the correct order
• Protocols:
  • Logical Clocks
  • Synchronized Clocks
  • Server Side Identification
Q: Suggest a scenario for an out-of-order request reception
Logical Clocks
Stability Test
• r is stable at a replica once the replica has received, from every client, a later request r’ with T(r) < T(r’), where T returns the logical clock value appended to a request
• Because message delays may be unbounded, agreement is impossible in the case of Byzantine faults
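The logical-clock test can be sketched as follows. It assumes timestamps are totally ordered, here modelled as hypothetical `(counter, client_id)` tuples, and that the replica keeps the largest timestamp seen from each client in a dictionary called `latest_seen`; both are illustration choices, not details from the slides.

```python
def is_stable(r_ts, latest_seen, clients):
    """r is stable once every client has sent a request with a larger timestamp."""
    for client in clients:
        ts = latest_seen.get(client)
        # No request yet from this client, or nothing later than r: not stable.
        if ts is None or not (r_ts < ts):
            return False
    return True


clients = ["a", "b"]
latest_seen = {"a": (5, "a")}
print(is_stable((3, "a"), latest_seen, clients))  # False: nothing from b yet
latest_seen["b"] = (4, "b")
print(is_stable((3, "a"), latest_seen, clients))  # True
```

Note that a client which stays silent blocks stability forever in this sketch, which is why the test is tied to fail-stop assumptions.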
Synchronized Real-Time Clocks
• Each Processor has a real-time clock synchronized with all other processors' clocks
• Upper bounds on request delays guarantee order in the case of Byzantine failures
Stability Test
• 1- The replica waits long enough to guarantee that no earlier request can still arrive; disadvantage: the replica has to wait
• 2- Check for a request from every client with a larger identifier
• In practice the disjunction of both tests is used
Q: How are Byzantine failures handled in this case?
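A rough sketch of the disjunction of the two tests follows. It assumes that identifiers are real-time clock readings and that `delta` is the upper bound on request delay mentioned on the previous slide; the function names and the exact form `uid < now - delta` are assumptions made for illustration.

```python
def stable_by_waiting(uid, local_clock, delta):
    """Test 1: enough time has passed that no smaller identifier can still arrive."""
    return uid < local_clock - delta


def stable_by_larger_ids(uid, latest_uid_by_client, clients):
    """Test 2: every client has already sent a request with a larger identifier."""
    return all(
        client in latest_uid_by_client and latest_uid_by_client[client] > uid
        for client in clients
    )


def is_stable(uid, local_clock, delta, latest_uid_by_client, clients):
    # In practice either test sufficing is enough (the disjunction).
    return (stable_by_waiting(uid, local_clock, delta)
            or stable_by_larger_ids(uid, latest_uid_by_client, clients))
```

For example, `is_stable(10.0, 12.0, 5.0, {"a": 11.0}, ["a"])` passes through test 2 without waiting out the full delay bound.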
Replica Generated Identifiers
• Advantage: not all processors need to communicate
• Phase 1: each replica proposes a candidate unique ID for the received request; the request is then said to be seen
• Phase 2: all replicas agree upon the request ID; the request is then said to be accepted
Requirements for Stability Agreement
• Stability Test: for every received request r’, from any client, the candidate identifier must be strictly greater than the identifier of the accepted request r
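The test can be sketched in a few lines. One assumption is made loudly here: the comparison is applied only to requests that have been seen but not yet accepted, since requests accepted earlier already carry final identifiers. The function and parameter names are hypothetical.

```python
def is_stable(accepted_uid, pending_candidate_ids):
    """An accepted request is stable when no pending request could precede it.

    pending_candidate_ids: candidate identifiers of requests this replica has
    seen but not yet accepted (assumption: only these can still be ordered
    ahead of the accepted request).
    """
    return all(cuid > accepted_uid for cuid in pending_candidate_ids)


print(is_stable(4, [5, 7]))  # True
print(is_stable(4, [3, 7]))  # False: a pending request might be ordered first
```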
Generating Unique Identifiers
Q: What is the significance of the i/N term?
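The formula from this slide is not reproduced in the text, so the sketch below is an assumption: one scheme consistent with the i/N term in the question, where replica i of N proposes the next whole number plus the fraction i/N, and the agreed identifier is taken as the largest candidate. The class and method names are hypothetical.

```python
from fractions import Fraction
from math import floor


class Replica:
    def __init__(self, index, n):
        self.index, self.n = index, n
        self.seen = Fraction(0)    # largest candidate id proposed so far
        self.accept = Fraction(0)  # largest agreed id accepted so far

    def propose(self):
        """Assumed rule: max(floor(SEEN), floor(ACCEPT)) + 1 + i/N."""
        cuid = (max(floor(self.seen), floor(self.accept)) + 1
                + Fraction(self.index, self.n))
        self.seen = cuid
        return cuid

    def accept_uid(self, uid):
        self.accept = max(self.accept, uid)


replicas = [Replica(i, 3) for i in range(3)]
candidates = [r.propose() for r in replicas]  # distinct fractional parts
uid = max(candidates)                         # assumed agreement rule
for r in replicas:
    r.accept_uid(uid)
print(candidates, uid)
```

Under this assumed rule the i/N term keeps candidates from different replicas distinct, while the whole-number part keeps later proposals above every identifier already accepted.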
Tolerating Faulty Output Devices
• Outputs Used Outside the System
  • Replicate Output Devices
  • Replicate Voters
• Outputs Used Inside the System
  • Outputs go back to Clients
  • Each Client has a voter inside it
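A voter of the kind a client might embed can be sketched as below. It assumes 2t+1 replicas under the Byzantine model, so a value reported by at least t+1 replicas must come from at least one correct replica; the function name `vote` is hypothetical, and outputs are assumed to be hashable values.

```python
from collections import Counter


def vote(outputs, t):
    """Return the value reported by at least t+1 replicas, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= t + 1 else None


print(vote([7, 7, 3], t=1))  # 7
print(vote([7, 3, 5], t=1))  # None
```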
Tolerating Faulty Clients
• Replication
• Server State Machine Modification
• Voter Inside the State Machine
• Requests having the same content but different identifiers
• Requests having different content and identifiers
Q: How is a voter failure inside the server handled?
Defensive Programming
• Replicas are not always possible
  • Lack of hardware
  • Application semantics do not allow replication
• Defensive Programming: additional requirements on state machines to prevent some possibly destructive actions from a faulty client
• Examples:
  • Memory partitioning and prevention of shared access
  • Bounded-time shared resources, using scheduled requests on the server side
Timed Requests
• Pro: No need to transmit requests
• Con: Timed requests cannot carry parameters
• Default Request: executes on time at the server unless the client sends a different request
Reconfiguration
C, O and S
• A configuration is a triple <C,O,S>
• C: the set of operational clients
• O: the set of operational output devices
• S: the set of operational state machine replicas
• C and O are needed by the state machine replicas
• S is needed by the agreement protocol
Configurators
• Each manages a single object in C, O or S
• Detect failures and repairs of this object
• Are themselves clients
• Issue reconfiguration requests to State Machine Replicas
• State machines use application-dependent mechanisms for failure detection
Note
The next slides are adapted from a presentation of the same paper by Leon Traille from Georgia Tech
Integrating a Repaired Object
• e[ri]: the state that a non-faulty system element e should be in after processing requests r0 through ri
• An element joining the configuration immediately after request rjoin must be in state e[rjoin] before it can participate
• Fail-stop failures
  • output device: e[rjoin] is likely to be a small amount of setup information that can be provided by state variables of smi
  • a client: e[rjoin] is frequently based on previous sensor values and can be determined by information from other clients
  • a state machine replica: the information for e[rjoin] is stored in state variables and pending requests at smi
• Byzantine failures
  • require t + 1 replicas instead of just one
Integration with Logical Clocks
• Integrating element e by state machine replica smi at request rjoin
• Fail-stop processors
  If e is a client or an output device, then smi sends any relevant portion of its state variables to e before sending any output produced by requests with unique identifiers larger than the one on rjoin
  If e is a state machine replica smnew, then smi
  1) sends the values of its state variables and copies of any pending requests to smnew
  2) sends to smnew every subsequent request r received from each client c such that uid(r) < uid(rc), where rc is the first request smnew received directly from c after being restarted
• Byzantine failures
  • Because information from smi might be incorrect, t + 1 copies of identical state information and t + 1 copies of relayed messages must be obtained
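The relay rule in step 2 can be sketched as a filter. The data layout is an assumption: requests are `(client, uid)` pairs, and `first_direct_uid` maps a client to uid(rc). The sketch also assumes that smi keeps relaying for a client until smnew has received something directly from it.

```python
def requests_to_relay(subsequent, first_direct_uid):
    """Select the requests smi should forward to the restarting replica smnew.

    subsequent: (client, uid) pairs received by smi after r_join.
    first_direct_uid: client -> uid of the first request smnew got directly.
    """
    relayed = []
    for client, uid in subsequent:
        rc_uid = first_direct_uid.get(client)
        # Relay while smnew has heard nothing directly (assumption), or while
        # the request precedes the first one smnew received directly.
        if rc_uid is None or uid < rc_uid:
            relayed.append((client, uid))
    return relayed


print(requests_to_relay([("a", 5), ("a", 9), ("b", 6)], {"a": 7}))
# [('a', 5), ('b', 6)]
```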
Integration with Real-time Clocks
• Integrating element e by state machine replica smi at request rjoin
• Fail-stop processors
  If e is a client or an output device, then smi sends relevant portions of its state variables to e before sending any output produced by requests with unique identifiers larger than the one on rjoin
  If e is a state machine replica smnew, then smi
  1) sends the values of its state variables and copies of any pending requests to smnew
  2) sends to smnew every request received during the next interval of duration Δ
• Byzantine failures
  • Because information from smi might be incorrect, t + 1 copies of identical state information and t + 1 copies of relayed messages must be obtained
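With real-time clocks the relay rule becomes a time window. In this sketch the half-open interval, the `(receive_time, request)` layout and the function name are all assumptions; the slide only states that requests received during the next interval of duration Δ are forwarded.

```python
def requests_to_relay_realtime(received, t_join, delta):
    """Return the requests smi received within delta of t_join.

    received: (receive_time, request) pairs in smi's local clock.
    The interval is treated as [t_join, t_join + delta) by assumption.
    """
    return [req for (t, req) in received if t_join <= t < t_join + delta]


log = [(1.0, "r1"), (2.5, "r2"), (4.0, "r3")]
print(requests_to_relay_realtime(log, t_join=1.0, delta=3.0))  # ['r1', 'r2']
```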
Stability Test During Restart
• Relaying of messages breaks the stability tests
• A request r may be received directly from client c, but later a request r’, also from c, is relayed by smi with uid(r) > uid(r’)
• Solution: consider requests from c as stable only after no more relayed requests from c can arrive
• Stability Test During Restart: A request r received directly from a client c by restarting state machine replica smnew is stable only after the last request from c relayed by another processor has been received by smnew
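The restart rule adds one extra condition on top of whichever stability test is in use. The sketch assumes smnew can tell when the last relayed request from a client has arrived, modelled here as a set `relays_complete`; how that is detected is not specified on the slide, and the names are hypothetical.

```python
def stable_during_restart(client, passes_normal_test, relays_complete):
    """A directly received request is stable only once relaying for its client is done.

    client: the client that issued the request.
    passes_normal_test: result of the ordinary stability test for the request.
    relays_complete: clients whose last relayed request has reached smnew
                     (assumed to be known to smnew).
    """
    return passes_normal_test and client in relays_complete


print(stable_during_restart("c", True, set()))   # False: relays may still arrive
print(stable_during_restart("c", True, {"c"}))   # True
```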
Thank you!