Distributed Systems CS 15-440

Distributed Systems CS 15-440 Fault Tolerance – Part II Lecture 23, Nov 19, 2014 Mohammad Hammoud

Today… • Last Session: • Quiz 2 • Today’s Session: • Fault Tolerance – Part II • Reliable communication • Announcements: • Project 4 is due on Dec 3rd by midnight • PS5 will be posted by tonight. It is due on Dec 4th by midnight

Objectives • Discussion on Fault Tolerance: • Recovery from failures • Atomicity and distributed commit protocols • Process resilience, failure detection and reliable communication • General background on fault tolerance

Reliable Communication • Fault tolerance in distributed systems typically concentrates on faulty processes • However, we also need to consider communication failures • We will focus on two types of reliable communication: • Reliable request-reply communication (e.g., RPC) • Reliable group communication (e.g., multicasting schemes)

Reliable Communication • Reliable Request-Reply Communication • Reliable Group Communication

Request-Reply Communication • The request-reply (RR) communication is designed to support the roles and message exchanges in typical client-server interactions • This sort of communication is mainly based on a trio of communication primitives: doOperation, getRequest and sendReply [Figure: the client calls doOperation, which sends the request message and waits; the server calls getRequest, selects and executes the operation, and calls sendReply; the reply message lets the client continue.]

Timeout Mechanisms • Request-reply communication may suffer from crash, omission, timing, and Byzantine failures • To allow for occasions where a request or a reply message is not delivered (e.g., lost), doOperation uses a timeout mechanism • There are various options as to what doOperation can do after a timeout: • Return immediately with an indication to the client that the request has failed • Send the request message repeatedly until either a reply is received or the server is assumed to have failed
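The second option above (retransmit until a reply arrives or the server is presumed dead) can be sketched as follows. This is a minimal in-process simulation, not a real network client: `do_operation`, `RequestFailed`, `lossy_send`, and the `timeout`/`max_retries` parameters are hypothetical names chosen for illustration, and a `queue.Queue` stands in for the reply channel.

```python
import queue


class RequestFailed(Exception):
    """Raised when no reply arrives after all attempts (hypothetical name)."""


def do_operation(send, replies, request, timeout=0.05, max_retries=3):
    """Send `request`, wait up to `timeout` seconds for a reply, and
    retransmit up to `max_retries` times before giving up."""
    for _ in range(1 + max_retries):
        send(request)
        try:
            return replies.get(timeout=timeout)
        except queue.Empty:
            continue  # timeout expired: retransmit the same request
    raise RequestFailed(f"no reply after {1 + max_retries} attempts")


# Simulated lossy channel: the first two transmissions are lost.
replies = queue.Queue()
sent = []


def lossy_send(request):
    sent.append(request)
    if len(sent) > 2:  # only the third transmission reaches the server
        replies.put(("reply", request))


result = do_operation(lossy_send, replies, "read x")
```

With `max_retries=0` the same function behaves like the first option: it reports failure after a single timeout.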

Idempotent Operations • In cases when the request message is retransmitted, the server may receive it more than once • This can cause the server to execute an operation more than once for the same request • Not every operation can be executed more than once and yield the same result each time • Operations that can be executed repeatedly with the same effect are called idempotent operations
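A small illustration of the difference, using a hypothetical bank-account dictionary (the function names are made up for this sketch): setting a balance is idempotent, while depositing is not.

```python
def set_balance(account, amount):
    """Idempotent: repeating the call leaves the same state."""
    account["balance"] = amount
    return account["balance"]


def deposit(account, amount):
    """Not idempotent: every repetition changes the state again."""
    account["balance"] += amount
    return account["balance"]


a = {"balance": 100}
set_balance(a, 150)
set_balance(a, 150)  # a duplicate request is harmless

b = {"balance": 100}
deposit(b, 50)
deposit(b, 50)  # a duplicate request deposits twice
```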

Duplicate Filtering • To avoid problems with non-idempotent operations, the server should recognize successive messages from the same client and filter out duplicates • If the server has already sent the reply when it receives a “duplicate” request, it can either: • Re-execute the operation to obtain the result (safe only for idempotent operations) • Or skip re-execution, if it has chosen to retain the outcome of the first and only execution

Keeping History • Servers can maintain the execution outcomes of requests in what is called the history • More precisely, the term ‘history’ is used to refer to a structure that contains records of (reply) messages that have been transmitted • Fields of a history record: Request ID, Message, Client ID

Managing History • The server can interpret each request from a client as an ACK of its previous reply • Thus, the history need contain only the last reply message sent to each client • But, if the number of clients is large, memory cost might become a problem • Messages in a history are normally discarded after a limited period of time
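The sketch below shows one way duplicate filtering and a one-reply-per-client history could fit together. It is an assumption-laden toy: the `Server` class, its `handle` method, and the deposit operation are hypothetical, and the time-based discarding of old records is left out.

```python
class Server:
    def __init__(self):
        self.balance = 0
        # History: client ID -> (request ID, reply message).
        # Only the last reply per client is kept.
        self.history = {}

    def handle(self, client_id, request_id, amount):
        last = self.history.get(client_id)
        if last is not None and last[0] == request_id:
            return last[1]  # duplicate: resend the saved reply, do not re-execute
        self.balance += amount  # non-idempotent operation, executed once
        reply = ("ok", self.balance)
        # A new request implicitly acknowledges the previous reply,
        # so the old record can be overwritten.
        self.history[client_id] = (request_id, reply)
        return reply


server = Server()
first = server.handle("c1", 1, 50)
retry = server.handle("c1", 1, 50)  # retransmitted request
second = server.handle("c1", 2, 25)
```

Here the retransmitted request gets the stored reply back, and the deposit is applied only once.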

In Summary… • RR protocol can be implemented in different ways to provide different delivery guarantees. The main choices are: • Retry request message (client side): Controls whether to retransmit the request message until either a reply is received or the server is assumed to have failed • Duplicate filtering (server side): Controls when retransmissions are used and whether to filter out duplicate requests at the server • Retransmission of results (server side): Controls whether to keep a history of result messages to enable lost results to be retransmitted without re-executing the operations at the server

Request-Reply Call Semantics • Combinations of request-reply protocols lead to a variety of possible semantics for the reliability of remote invocations

Reliable Communication • Reliable Request-Reply Communication • Reliable Group Communication

Reliable Group Communication • Just as we considered reliable request-reply communication, we also need to consider reliable multicasting services • E.g., election algorithms use multicasting schemes

Reliable Group Communication • A Basic Reliable-Multicasting Scheme • Atomic Multicasting

Reliable Multicasting • Reliable multicasting indicates that a message that is sent to a group of processes should be delivered to each member of that group • A distinction should be made between: • Reliable communication in the presence of faulty processes • Reliable communication when processes are assumed to operate correctly • In the presence of faulty processes, multicasting is considered to be reliable when it can be guaranteed that all non-faulty group members receive the message

Basic Reliable Multicasting Questions • What happens if during multicasting a process P joins or leaves a group? • Should the sent message be delivered? • Should P (if joining) also receive the message? • What happens if the (sending) process crashes during multicasting? • What about message ordering?

A Simple Case: Reliable Multicasting with Feedback Messages • Consider the case when a single sender S wants to multicast a message to multiple receivers • A message multicast by S may be lost partway and delivered to some, but not all, of the intended receivers • Assume that messages are received in the same order as they are sent

Reliable Multicasting with Feedback Messages [Figure: a sender with a history buffer multicasts M25 over the network to four receivers. The three receivers with Last = 24 reply with ACK25; the receiver with Last = 23 reports that it missed message 24.] An extensive and detailed survey of total-order broadcasts can be found in Defago et al. (2004)
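The feedback scheme can be sketched as a small simulation. Everything here is hypothetical scaffolding (the `Sender` and `Receiver` classes, the `lost_for` parameter that simulates message loss), and for brevity the sender reads each receiver's `last` counter directly instead of tracking ACK messages.

```python
class Receiver:
    def __init__(self, name):
        self.name = name
        self.last = 0  # sequence number of the last message delivered
        self.delivered = []

    def receive(self, seq, payload):
        """Return ('ACK', seq) for an in-order or duplicate message,
        or ('MISSED', n) naming the next sequence number still needed."""
        if seq == self.last + 1:
            self.last = seq
            self.delivered.append(payload)
            return ("ACK", seq)
        if seq <= self.last:
            return ("ACK", seq)  # duplicate, already delivered
        return ("MISSED", self.last + 1)


class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.history = {}  # history buffer: seq -> payload
        self.next_seq = 1

    def multicast(self, payload, lost_for=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        for r in self.receivers:
            if r.name in lost_for:
                continue  # simulate a message lost on the way to r
            kind, n = r.receive(seq, payload)
            if kind == "MISSED":
                # Retransmit everything from n up to seq out of the history.
                for k in range(n, seq + 1):
                    r.receive(k, self.history[k])
        # Discard messages that every receiver has delivered.
        stable = min(r.last for r in self.receivers)
        for k in [k for k in self.history if k <= stable]:
            del self.history[k]


receivers = [Receiver("A"), Receiver("B"), Receiver("C")]
sender = Sender(receivers)
sender.multicast("m1", lost_for={"C"})  # C misses m1
sender.multicast("m2")  # C reports the gap and gets m1, then m2
```

The sender can empty its history buffer only once the slowest receiver has caught up, which is why the buffer still holds m1 after the first multicast.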

Reliable Group Communication • A Basic Reliable-Multicasting Scheme • Atomic Multicasting

Atomic Multicast • C1: What is often needed in a distributed system is the guarantee that a message is delivered to either all processes or none at all • C2: It is also generally required that all messages are delivered in the same order to all processes • Satisfying C1 and C2 results in what we call atomic multicast • Atomic multicast: • Ensures that non-faulty processes maintain a consistent view • Forces reconciliation when a process recovers and rejoins the group

Virtual Synchrony • A multicast message m is uniquely associated with a list of processes to which it should be delivered • This delivery list corresponds to a group view (G) • In principle, the delivery of m is allowed to fail: • When a group-membership change is the result of the sender of m crashing • Accordingly, m may either be delivered to all remaining processes, or ignored by each of them • Or when a group-membership change is the result of a receiver of m crashing • Accordingly, m may be ignored by every other receiver, which corresponds to the situation that the sender of m crashed before m was sent • A reliable multicast with this property is said to be “virtually synchronous”

The Principle of Virtual Synchrony [Figure: timeline of P1, P2, P3 and P4, with reliable multicast implemented by multiple point-to-point messages. The view starts as G = {P1, P2, P3, P4}; P3 crashes and its partial multicast is discarded; the view becomes G = {P1, P2, P4}; P3 rejoins and the view is again G = {P1, P2, P3, P4}.]

Message Ordering • Four different virtually synchronous multicast orderings are distinguished: • Unordered multicasts • FIFO-ordered multicasts • Causally-ordered multicasts • Totally-ordered multicasts

1. Unordered multicasts • A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes [Figure: three communicating processes in the same group]

2. FIFO-Ordered Multicasts • With FIFO-Ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent [Figure: four processes in the same group with two different senders]
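One common way to obtain FIFO ordering, sketched here under the assumption that each sender numbers its own messages from 1, is a per-sender hold-back queue at the receiver. The `FifoReceiver` class and its fields are hypothetical.

```python
from collections import defaultdict


class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(lambda: 1)  # sender -> next expected seq
        self.held = defaultdict(dict)  # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, msg):
        if seq < self.next_seq[sender]:
            return  # duplicate of an already delivered message
        self.held[sender][seq] = msg
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            n = self.next_seq[sender]
            self.delivered.append(self.held[sender].pop(n))
            self.next_seq[sender] = n + 1


r = FifoReceiver()
r.receive("P1", 2, "m2")  # held back: m1 from P1 has not arrived yet
r.receive("P4", 1, "m3")  # delivered at once: no constraint across senders
r.receive("P1", 1, "m1")  # releases m1 and then m2
```

Messages from different senders may interleave in any order; only the order among messages of the same sender is enforced.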

3-4. Causally-Ordered and Total-Ordered Multicasts • Causally-ordered multicasts preserve potential causality between different messages • If message m1 causally precedes another message m2, regardless of whether they were multicast by the same sender or not, the communication layer at each receiver will always deliver m1 before m2 • Total-ordered multicasts require that when messages are delivered, they are delivered in the same order to all group members (regardless of whether message delivery is unordered, FIFO-ordered, or causally-ordered)
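Causal delivery is often implemented with vector timestamps. The slide does not name a mechanism, so the sketch below is an assumption: the `CausalProcess` class is hypothetical, processes are numbered 0..n-1, and each clock entry counts the messages delivered from that process.

```python
class CausalProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n  # vc[k] = number of messages from k delivered here
        self.pending = []
        self.delivered = []

    def multicast(self, msg):
        self.vc[self.pid] += 1
        self.delivered.append(msg)  # the sender delivers its own message
        return (self.pid, list(self.vc), msg)

    def _deliverable(self, sender, ts):
        # Next message from `sender`, and nothing it depends on is missing.
        return ts[sender] == self.vc[sender] + 1 and all(
            ts[k] <= self.vc[k] for k in range(len(ts)) if k != sender
        )

    def receive(self, packet):
        self.pending.append(packet)
        progress = True
        while progress:  # one delivery may unblock other held messages
            progress = False
            for p in list(self.pending):
                sender, ts, msg = p
                if self._deliverable(sender, ts):
                    self.vc[sender] = ts[sender]
                    self.delivered.append(msg)
                    self.pending.remove(p)
                    progress = True


p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
m1 = p0.multicast("m1")
p1.receive(m1)
m2 = p1.multicast("m2")  # m2 causally depends on m1
p2.receive(m2)  # held back: p2 has not seen m1 yet
p2.receive(m1)  # delivers m1, which then releases m2
```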

Virtually Synchronous Reliable Multicasting • A virtually synchronous reliable multicasting that offers total-ordered delivery of messages is what we refer to as atomic multicasting [Table: six different versions of virtually synchronous reliable multicasting]
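The total-order half of atomic multicasting can be illustrated with a central sequencer, which is one possible approach and not necessarily the one the lecture has in mind. The `Sequencer` and `TotalOrderReceiver` classes are hypothetical, and the sketch ignores view changes and failures, so it does not cover virtual synchrony.

```python
import itertools


class Sequencer:
    """Assigns one global sequence number to every multicast message."""

    def __init__(self):
        self._counter = itertools.count(1)

    def order(self, msg):
        return (next(self._counter), msg)


class TotalOrderReceiver:
    def __init__(self):
        self.next_seq = 1
        self.held = {}
        self.delivered = []

    def receive(self, stamped):
        seq, msg = stamped
        self.held[seq] = msg
        # Deliver strictly in global sequence order.
        while self.next_seq in self.held:
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
a = sequencer.order("a")
b = sequencer.order("b")

r1, r2 = TotalOrderReceiver(), TotalOrderReceiver()
r1.receive(a)
r1.receive(b)
r2.receive(b)  # arrives out of order at r2 and is held back
r2.receive(a)
```

Both receivers end up delivering the messages in the same order, regardless of arrival order.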

Distributed Commit • The atomic multicasting problem is an example of a more general problem, known as distributed commit • The distributed commit problem involves having an operation performed by each member of a process group, or by none at all • With reliable multicasting, the operation is the delivery of a message • With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction • Distributed commit is often established by means of a coordinator and participants

One-Phase Commit Protocol • In a simple scheme, a coordinator can tell all participants whether or not to (locally) perform the operation in question • This scheme is referred to as a one-phase commit protocol • The main drawback of the one-phase commit protocol is that if one of the participants cannot actually perform the operation, there is no way to tell the coordinator • In practice, more sophisticated schemes are needed • The most commonly used one is the two-phase commit protocol

Two-Phase Commit Protocol • Assuming that no failures occur, the two-phase commit protocol (2PC) consists of the following two phases, each consisting of two steps:

Two-Phase Commit Protocol

2PC Finite State Machines [Figure: (a) the finite state machine for the coordinator in 2PC; (b) the finite state machine for a participant in 2PC. States: INIT, WAIT, ABORT, COMMIT. Transition labels: Commit, Vote-request, Vote-commit, Vote-abort, Global-commit, Global-abort, ACK.]

2PC Algorithm
Actions by coordinator:
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}

Two-Phase Commit Protocol
Actions by participants:
write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received; /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT {
        write GLOBAL_COMMIT to local log;
    } else if DECISION == GLOBAL_ABORT {
        write GLOBAL_ABORT to local log;
    }
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}

Two-Phase Commit Protocol
Actions for handling decision requests: /* executed by separate thread */
while true {
    wait until any incoming DECISION_REQUEST is received; /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip; /* participant remains blocked */
}
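The failure-free path of the 2PC pseudocode can be sketched as a single-process simulation. This is only an illustration under simplifying assumptions: messages are modelled as direct method calls, timeouts and crashes are left out, and the names `Participant`, `run_coordinator`, and `answer_decision_request` are hypothetical.

```python
class Participant:
    def __init__(self, will_commit):
        self.will_commit = will_commit
        self.log = ["INIT"]

    def on_vote_request(self):
        # Phase 1: log the vote before sending it to the coordinator.
        self.log.append("VOTE_COMMIT" if self.will_commit else "VOTE_ABORT")
        return self.will_commit

    def on_decision(self, decision):
        # Phase 2: only a participant that voted COMMIT waits for the decision.
        if self.log[-1] == "VOTE_COMMIT":
            self.log.append(decision)


def run_coordinator(participants, coordinator_commits=True):
    log = ["START_2PC"]
    votes = [p.on_vote_request() for p in participants]  # multicast VOTE_REQUEST
    if coordinator_commits and all(votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)  # log the decision before announcing it
    for p in participants:
        p.on_decision(decision)
    return decision, log


def answer_decision_request(state):
    """Reply a participant gives to a DECISION_REQUEST, or None to skip."""
    if state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if state in ("INIT", "GLOBAL_ABORT"):
        return "GLOBAL_ABORT"
    return None  # any other state: skip, the requester stays blocked


ok_group = [Participant(True), Participant(True)]
ok_decision, ok_log = run_coordinator(ok_group)

mixed_group = [Participant(True), Participant(False)]
mixed_decision, mixed_log = run_coordinator(mixed_group)
```

A single VOTE_ABORT is enough to abort the whole group, and a participant whose latest log entry is VOTE_COMMIT cannot answer a decision request on its own.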

Objectives • Discussion on Fault Tolerance: • Recovery from failures • Atomicity and distributed commit protocols • Process resilience, failure detection and reliable communication • General background on fault tolerance

Recovery • So far, we have mainly concentrated on algorithms that allow us to tolerate faults • However, once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state • In what follows we focus on: • What it actually means to recover to a correct state • When and how the state of a distributed system can be recorded and recovered, by means of checkpointing and message logging

Recovery • Error Recovery • Checkpointing • Message Logging

Error Recovery • Once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state • Fundamental to fault tolerance is the recovery from an error • The idea of error recovery is to replace an erroneous state with an error-free state • There are essentially two forms of error recovery: • Backward recovery • Forward recovery

Backward Recovery • In backward recovery, the main issue is to bring the system from its present erroneous state “back” to a previously correct state • It is necessary to record the system’s state from time to time onto stable storage, and to restore such a recorded state when things go wrong • Each time (part of) the system’s present state is recorded, a checkpoint is said to be made • Some problems with backward recovery: • Restoring a system or a process to a previous state is generally expensive (in terms of performance) • Some states can never be rolled back (e.g., typing rm -fr * in UNIX)
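A minimal sketch of checkpoint and rollback for a single process, assuming the state is an ordinary Python object. The `CheckpointedProcess` class is hypothetical, and an in-memory list stands in for stable storage.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self._stable = []  # stands in for stable storage

    def checkpoint(self):
        """Record a copy of the present state."""
        self._stable.append(copy.deepcopy(self.state))

    def rollback(self):
        """Backward recovery: restore the most recent checkpoint."""
        self.state = copy.deepcopy(self._stable[-1])


p = CheckpointedProcess({"x": 1})
p.checkpoint()
p.state["x"] = -999  # the process enters an erroneous state
p.rollback()
```

The deep copies on every checkpoint and rollback hint at why restoring a previous state is expensive for large states.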

Forward Recovery • When the system detects that it has made an error, forward recovery reverts the system state to error time and corrects it, to be able to move forward • Forward recovery is typically faster than backward recovery, but it requires knowing in advance which errors may occur • Some systems make use of both forward and backward recovery for different errors or different parts of one error

Recovery • Error Recovery • Checkpointing • Message Logging

Why Checkpointing? • In fault-tolerant distributed systems, backward recovery requires that systems “regularly” save their states onto stable storage • This process is referred to as checkpointing • Checkpointing consists of storing a “distributed snapshot” of the current application state and later using it to restart the execution in case of a failure

Recovery Line • In capturing a distributed snapshot, if a process P has recorded the receipt of a message, m, then there should be also a process Q that has recorded the sending of m • We are able to identify both senders and receivers • The recorded local states jointly form a distributed snapshot [Figure: processes P and Q take checkpoints between the initial state and a failure; a message m is sent from Q to P. One set of checkpoints is marked as a recovery line, another as not a recovery line.]
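The consistency condition above can be checked mechanically. In this sketch each checkpoint is assumed to record the sets of message IDs the process has sent and received so far; that representation and the `is_recovery_line` function are hypothetical.

```python
def is_recovery_line(checkpoints):
    """checkpoints: dict process -> {"sent": set, "received": set}.
    Consistent if every recorded receipt has a matching recorded send."""
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    return received <= sent


# Q's checkpoint was taken after sending m, P's after receiving it.
consistent = {
    "P": {"sent": set(), "received": {"m"}},
    "Q": {"sent": {"m"}, "received": set()},
}

# P recorded the receipt of m, but Q's checkpoint predates the send.
inconsistent = {
    "P": {"sent": set(), "received": {"m"}},
    "Q": {"sent": set(), "received": set()},
}
```

A message that was recorded as sent but not yet received does not violate this condition; only a receipt without a recorded send does.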

Checkpointing • Checkpointing can be of two types: • Independent Checkpointing: each process simply records its local state from time to time in an uncoordinated fashion • Coordinated Checkpointing: all processes synchronize to jointly write their states to local stable storage • Which algorithm among the ones we’ve studied can be used to implement coordinated checkpointing? • A simple solution is to use 2PC

Domino Effect • Independent checkpointing may make it difficult to find a recovery line, potentially leading to a domino effect resulting from cascaded rollbacks • With coordinated checkpointing, the saved state is automatically globally consistent; hence, the domino effect is inherently avoided [Figure: after a failure, processes P and Q roll back through successive pairs of checkpoints, none of which forms a recovery line.]
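Cascaded rollback can be simulated with a short search for the latest recovery line. The representation is an assumption made for this sketch: every process keeps a list of independent checkpoints, oldest first, each recording the cumulative sets of message IDs sent and received, and index 0 is the empty initial state. The `find_recovery_line` function is hypothetical.

```python
def find_recovery_line(history):
    """history: dict process -> list of checkpoints (oldest first), each
    {"sent": set, "received": set}. Returns process -> checkpoint index."""
    line = {p: len(cps) - 1 for p, cps in history.items()}
    while True:
        sent = set().union(*(history[p][i]["sent"] for p, i in line.items()))
        # A process is an orphan if it recorded a receipt with no recorded send.
        orphans = [
            p for p, i in line.items() if not history[p][i]["received"] <= sent
        ]
        if not orphans:
            return line
        for p in orphans:
            line[p] -= 1  # roll the receiver back one checkpoint


empty = {"sent": set(), "received": set()}
history = {
    # Q sends m1; P receives it and checkpoints; P sends m2; Q receives it
    # and checkpoints; Q sends m3; P receives it and checkpoints; failure.
    "P": [empty,
          {"sent": set(), "received": {"m1"}},
          {"sent": {"m2"}, "received": {"m1", "m3"}}],
    "Q": [empty,
          {"sent": {"m1"}, "received": {"m2"}}],
}
line = find_recovery_line(history)
```

In this example every rollback exposes another orphan message, so both processes fall back to their initial states, which is the domino effect described above.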
