Replicated Distributed Systems By Eric C. Cooper
Overview • Introduction and Background (Queenie) • A Model of Replicated Distributed Programs • Implementing Distributed Modules and Threads • Implementing Replicated Procedure Calls (Alix) • Performance Analysis • Concurrency Control (Joanne) • Binding Agents • Troupe Reconfiguration
Background • Presents a new software architecture for fault-tolerant distributed programs • Designed by Eric C. Cooper • A co-founder of FORE Systems, a leading supplier of networks for enterprises and service providers
Introduction • Goal: address the problem of constructing highly available distributed programs • Tolerate crashes of the underlying hardware automatically • Continue to operate despite the failure of components • First approach: replicate each component of the system • By von Neumann (1955) • Drawback: costly - use reliable hardware everywhere
Introduction (contd) Eric C. Cooper’s new approach: • Replication on a per-module basis • Flexible, and does not burden the programmer • Provides location and replication transparency to the programmer • Fundamental mechanism • Troupe: a replicated module • Troupe members: the replicas • Replicated procedure call (many-to-many communication between troupes)
Introduction (contd) • Important properties that give this mechanism flexibility and power: • individual members of a troupe do not communicate among themselves • they are unaware of one another’s existence • each troupe member behaves as if it had no replicas
A Model of Replicated Distributed Programs (contd) • [Figure: a model of a replicated distributed program; labels: replicated distributed program, troupe, module, procedure, state information]
A Model of Replicated Distributed Programs (contd) • Module • Packages the procedures and state information needed to implement a particular abstraction • Separates the interface to that abstraction from its implementation • Expresses the static structure of a program when it is written
A Model of Replicated Distributed Programs (contd) • Threads • Each thread has a thread ID, a unique identifier • A particular thread runs in exactly one module at a given time • Multiple threads may be running in the same module concurrently
Implementing Distributed Modules and Threads • No machine boundaries • Provide location transparency: the programmer does not need to know the eventual configuration of a program • Module • implemented by a server whose address space contains the module’s procedures and data • Thread • implemented by using remote procedure calls to transfer control from server to server
Adding Replication • Processor and network failures cause partial failures of the distributed program • Solution: replication • Introduce replication transparency at the module level
Adding Replication (contd) • Assumption: troupe members execute on fail-stop processors • If not => complex agreement protocols would be needed • Replication transparency in the troupe model is guaranteed by: • All troupes are deterministic • (same input → same output)
Troupe Consistency • A troupe is consistent when all its members are in the same state • => Its clients don’t need to know that it is replicated • Replication transparency
Troupe Consistency (contd) • [Figure: execution of a remote procedure call (I); labels: Client, Server, Call P (×3), P: proc (×3)]
Troupe Consistency (contd) • [Figure: execution of a remote procedure call (II); labels: Client, Server, Call P (×3), P: proc (×3)]
Execution of Procedure call • As a tree of procedure invocations • The invocation trees rooted at each troupe member are identical • All members of the server troupe make the same procedure calls and returns, with the same arguments and results • If all troupes are initially consistent, all troupes remain consistent
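The consistency argument above can be illustrated with a minimal sketch. The `Tally` module and the `replicated_call` helper are hypothetical names invented for this example, not part of Circus; the sketch only shows that deterministic members that start in the same state and see the same calls end in the same state.

```python
class Tally:
    """Hypothetical deterministic module: state depends only on the calls seen."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount
        return self.total


def replicated_call(troupe, method, *args):
    """Run the same call at every troupe member and collect the results."""
    return [getattr(member, method)(*args) for member in troupe]


# Three members, all starting in the same state (initially consistent).
troupe = [Tally() for _ in range(3)]

history = []
for amount in (5, 7, -2):
    # Every member receives the same call with the same arguments.
    history.append(replicated_call(troupe, "add", amount))

# The set of member states has one element if the troupe is still consistent.
states = {member.total for member in troupe}
```

Each entry of `history` holds identical results from all members, and `states` contains a single value, which is the sense in which a client need not know the module is replicated.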
Replicated Procedure Calls • Goal: allow distributed programs to be written in the same way as conventional programs for centralized computers • A replicated procedure call is a generalization of the remote procedure call • Exactly-once execution at all troupe members
Circus Paired Message Protocol • Characteristics: • Paired messages (e.g. call and return) • Reliably delivered • Variable length • Call sequence numbers • Based on the RPC • Uses UDP, the DARPA User Datagram Protocol • Connectionless, but with retransmission
Implementing Replicated Procedure Calls • Implemented on top of the paired message layer • Two subalgorithms in the many-to-many call • One-to-many • Many-to-one • Implemented as part of the run-time system that is linked with each user’s program
One-to-many calls • The client half of the RPC performs a one-to-many call • Purpose is to guarantee that the procedure is executed at each server troupe member • The same call message, with the same call number, is sent to each server troupe member • The client waits for return messages • In Circus, it waits for all the return messages before proceeding
Synchronization Point • After all the server troupe members have returned • Each client troupe member knows that all server troupe members have performed the procedure • Each server troupe member knows that all client troupe members have received the result
Many-to-one calls • Server will receive call messages from each client troupe member • Server executes the procedure only once • Returns the results to all the client troupe members • Two problems • Distinguishing between unrelated call messages • How many other call messages are expected? • Circus waits for all clients to send a call message before proceeding
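The server side of a many-to-one call can be sketched as follows. This is a minimal illustration, not the Circus implementation: the `ManyToOneServer` class is hypothetical, and it assumes that call messages belonging to the same replicated call are recognized by the pair (thread ID, call sequence number) and that the server knows the size of the client troupe.

```python
class ManyToOneServer:
    """Hypothetical bookkeeping for the server half of a many-to-one call."""

    def __init__(self, procedure, client_troupe_size):
        self.procedure = procedure
        self.client_troupe_size = client_troupe_size
        self.pending = {}      # (thread_id, call_number) -> set of client members
        self.executions = 0    # how many times the procedure actually ran

    def receive_call(self, thread_id, call_number, client_member, args):
        """Record one call message; reply only once the whole set has arrived."""
        call_id = (thread_id, call_number)   # assumed way to group related calls
        callers = self.pending.setdefault(call_id, set())
        callers.add(client_member)
        if len(callers) < self.client_troupe_size:
            return None                      # still waiting for other members
        del self.pending[call_id]
        self.executions += 1                 # executed once per replicated call
        result = self.procedure(*args)
        # The same result goes back to every client troupe member.
        return {member: result for member in sorted(callers)}


server = ManyToOneServer(lambda x: x * x, client_troupe_size=3)
first = server.receive_call("t1", 1, "c0", (4,))
second = server.receive_call("t1", 1, "c1", (4,))
replies = server.receive_call("t1", 1, "c2", (4,))
```

Only the third call message triggers execution, so the procedure runs once and all three client members receive the same return value.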
Many-to-many calls • A replicated procedure call is called a many-to-many call from a client troupe to a server troupe
Many-to-many steps • A call message is sent from each client troupe member to each server troupe member. • A call message is received by each server troupe member from each client troupe member. • The requested procedure is run on each server troupe member. • A return message is sent from each server troupe member to each client troupe member. • A return message is received by each client troupe member from each server troupe member.
Multicast Implementation • Dramatic difference in efficiency • Suppose m client troupe members and n server troupe members • Point-to-point • mn messages sent • Multicast • m+n messages sent
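The difference can be made concrete with a small count derived from the many-to-many steps. The function below is an illustrative sketch; it assumes that one multicast send reaches the whole destination troupe, and it reads the mn figure as the point-to-point count in each direction (call messages, then return messages).

```python
def message_counts(m, n):
    """Messages sent for one replicated call: m client members, n server members."""
    point_to_point = {
        "calls": m * n,     # each client member sends to each server member
        "returns": n * m,   # each server member replies to each client member
    }
    multicast = {
        "calls": m,         # one multicast per client member
        "returns": n,       # one multicast per server member
    }
    return point_to_point, multicast


p2p, mc = message_counts(3, 3)
```

For a client troupe and a server troupe of three members each, this gives 9 point-to-point messages in each direction against 3 + 3 multicast sends in total.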
Waiting for messages to arrive • Troupes are assumed to be deterministic, therefore all messages in a set are assumed to be identical • When should computation proceed? • As soon as the first message arrives, or only after the entire set arrives?
Waiting for all messages • Able to provide error detection and error correction • Inconsistencies are detected • Execution time is determined by the slowest member of each troupe • Default in the Circus system
First-come approach • Forfeit error detection • Computation proceeds as soon as the first message in each set arrives • Execution time is determined by the fastest member of each troupe • Requires a simple change to the one-to-many call protocol • Client can use call sequence number to discard return messages from slow server troupe members
First-come approach • More complicated changes are required in the many-to-one call protocol • When a call message from another member arrives, the server cannot execute the procedure again • Would violate exactly-once execution • Server must retain the return message until all other call messages have been received from the client troupe members • The retained return message is sent when each later call message is received • Execution seems instantaneous to the client
A better first come approach • Buffer messages at the client rather than at the server • Server broadcasts return messages to the entire client troupe after the first call message • A client troupe member may receive a return message before sending the call message • Return message is retained until the client troupe member is ready to send the call message
Advantages of buffering at client • Work of buffering return messages and pairing them with call messages is placed on the client rather than a shared server • The server can use broadcast rather than point-to-point communication • No communication is required by a slow client
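Client-side buffering can be sketched as below. The `FirstComeClient` class is hypothetical and simplifies heavily: return messages are paired with calls by call sequence number alone, and `send` is an assumed callback that delivers a return message before it finishes, where a real client would block waiting on the network.

```python
class FirstComeClient:
    """Hypothetical client troupe member that buffers early return messages."""

    def __init__(self):
        self.next_call_number = 1
        self.buffered = {}   # call number -> first return message received

    def receive_return(self, call_number, result):
        """Handle a return message broadcast by a server troupe member."""
        if call_number < self.next_call_number:
            return           # call already completed: discard the late return
        # Keep only the first return for a call; later duplicates are ignored.
        self.buffered.setdefault(call_number, result)

    def call(self, send):
        """Make the next call, skipping the send if a return is already buffered."""
        number = self.next_call_number
        if number not in self.buffered:
            send(number)     # assumption: a return is delivered during send
        if number not in self.buffered:
            raise RuntimeError("no return message yet; a real client would wait")
        result = self.buffered.pop(number)
        self.next_call_number += 1
        return result


client = FirstComeClient()
sent = []

# A return arrives before this slow member has made call 1.
client.receive_return(1, "r1")
first = client.call(sent.append)

def send_and_reply(number):
    sent.append(number)
    client.receive_return(number, "r2")

second = client.call(send_and_reply)
client.receive_return(2, "late duplicate")   # discarded by sequence number
```

The first call completes without sending anything, which matches the point that a slow client needs no communication, while the second call is sent normally.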
What about error detection? • To provide error detection and still allow computation to proceed, a watchdog scheme can be used • Create another thread of control after the first message is received • This thread will watch for remaining messages and compare • If there is an inconsistency, the main computation is aborted
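A rough sketch of the watchdog idea, using Python's standard `threading` module. The function name `proceed_on_first` is hypothetical, and the list of messages stands in for messages that would really arrive over time; aborting the main computation is reduced here to setting an event that the main computation is assumed to check.

```python
import threading


def proceed_on_first(messages, abort):
    """Return the first message at once; compare the rest in a watchdog thread."""
    first = messages[0]

    def watch():
        # Compare every remaining message with the one already acted on.
        for later in messages[1:]:
            if later != first:
                abort.set()   # inconsistency: signal the main computation
                return

    watchdog = threading.Thread(target=watch)
    watchdog.start()
    return first, watchdog


abort = threading.Event()
result, watchdog = proceed_on_first(["ok", "ok", "corrupt"], abort)
watchdog.join()               # only so the example finishes deterministically
```

Here the computation proceeds with `"ok"`, and the watchdog later sets `abort` because the third message disagrees.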
Crashes and Partitions • Underlying message protocol uses probing and timeouts to detect crashes • Relies on network connectivity and therefore cannot distinguish between crashes and network partitions • To prevent troupe members from diverging • Require that each troupe member receive a majority of the expected set of messages
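The majority requirement can be written as a one-line test. The helper name `has_majority` is an assumption for this sketch; the point it illustrates is that two disjoint partitions can never both hold a strict majority of the same expected set, so at most one side proceeds.

```python
def has_majority(received, expected):
    """True if strictly more than half of the expected messages arrived."""
    return 2 * received > expected


# A troupe of three split 2/1 by a network partition:
larger_side = has_majority(2, 3)    # may proceed
smaller_side = has_majority(1, 3)   # must not proceed

# An even split of four leaves neither side with a majority.
even_split = has_majority(2, 4)
```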
Collators • Can relax the determinism requirement by allowing programmers to reduce a set of messages into a single message • A collator maps a set of messages into a single result • Collator needs enough messages to make a decision • Three kinds • Unanimous • Majority • First come
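The three kinds of collator can be sketched as plain functions over a list of messages in arrival order. The function names and the use of `ValueError` to signal that no decision is possible are choices made for this illustration, not the Circus interface; messages are assumed to be hashable values such as strings or bytes.

```python
from collections import Counter


def unanimous(messages):
    """Accept only if every message in the set is identical."""
    if len(set(messages)) != 1:
        raise ValueError("messages disagree or set is empty")
    return messages[0]


def majority(messages):
    """Accept the value carried by more than half of the messages."""
    if not messages:
        raise ValueError("no messages")
    value, count = Counter(messages).most_common(1)[0]
    if 2 * count <= len(messages):
        raise ValueError("no majority")
    return value


def first_come(messages):
    """Accept whichever message arrived first."""
    if not messages:
        raise ValueError("no messages")
    return messages[0]


replies = ["42", "42", "41"]
by_majority = majority(replies)
by_first = first_come(replies)
```

With these replies the unanimous collator would raise, while the majority and first-come collators both yield `"42"`.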
Performance Analysis • Experiments conducted at Berkeley during an inter-semester break • Measured the cost of replicated procedure calls as a function of the degree of replication • UDP and TCP echo tests used as a comparison
Performance Analysis • Performance of UDP, TCP and Circus • TCP echo test is faster than UDP echo test • Cost of TCP connection establishment is ignored • UDP test makes two alarm calls and therefore two setitimer calls • The read and write interface to TCP is more streamlined
Performance Analysis • An unreplicated Circus remote procedure call requires almost twice the amount of time as a simple UDP exchange • Due to extra system calls required to handle Circus • Elaborate code to handle multi-homed machines • Some Berkeley machines had as many as 4 network addresses • Design oversight by Berkeley, not a fundamental problem
Performance Analysis • Expense of a replicated procedure call increases linearly as the degree of replication increases • Each additional troupe member adds between 10 and 20 milliseconds • Smaller than the time for a UDP datagram exchange
Performance Analysis • Execution profiling tool used to analyze Circus implementation in finer detail • 6 Berkeley 4.2BSD system calls account for more than ½ the total CPU time to perform a replicated call • Most of the time required for a Circus replicated procedure call is spent in the simulation of multicasting
Concurrency Control • Server troupe controls calls from different clients using multiple threads • Conflicts arise when concurrent calls need to access the same resource
Concurrency Control • Serialization at each troupe member • Local concurrency control algorithms • Serialization in the same order among members • Preserve troupe consistency • Need coordination between the replicated procedure call mechanism and the synchronization mechanism • => Replicated Transactions
Replicated Transactions • Requirements • Serializability • Atomicity • Ensure that aborting a transaction does not affect other concurrently executed transactions • Two-phase locking with unanimous update • Drawback: too strict • Troupe Commit Protocol
Troupe Commit Protocol • Before a server troupe member commits (or aborts) a transaction, it invokes the ready_to_commit remote procedure at the client troupe (a call-back) • Client troupe returns whether it agrees to commit (or abort) the transaction • If server troupe members serialize transactions in different orders, a deadlock will occur • Detecting conflicting transactions is converted to deadlock detection
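Why differing serialization orders turn into a deadlock can be sketched with a small model. The model is an assumption made for illustration: each server troupe member calls ready_to_commit for the next transaction in its own order, and a client answers only after every server member has called back for the same transaction. The function name `run_commits` is hypothetical.

```python
def run_commits(orders):
    """Model the call-backs; orders holds one commit order per server member.

    Each inner list names the same transactions in that member's own order.
    Returns the transactions committed so far and the final outcome.
    """
    committed = []
    for step in zip(*orders):
        # step holds the transaction each member is asking to commit next.
        if len(set(step)) > 1:
            # Members wait on different clients; no client hears from all
            # members, so nobody gets an answer: a deadlock.
            return committed, "deadlock"
        committed.append(step[0])
    return committed, "committed"


same_order = run_commits([["T1", "T2"], ["T1", "T2"]])
different_order = run_commits([["T1", "T2"], ["T2", "T1"]])
```

When both members agree on the order, every transaction commits; when they disagree, the model stops with a deadlock, which is what a deadlock detector would then report as a conflict.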