Building Blocks for High-Performance, Fault-Tolerant Distributed Systems Nancy Lynch Theory of Distributed Systems MIT AFOSR project review Cornell University May, 2001 …
Project Participants • Leaders: Nancy Lynch, Idit Keidar, Alex Shvartsman, Steve Garland • PhD students: Victor Luchangco, Roger Khazan, Carl Livadas, Josh Tauber, Ziv Bar-Joseph, Rui Fan • MEng: Rui Fan, Kyle Ingols, Igor Taraschanskiy, Andrej Bogdanov, Michael Tsai, Laura Dean • Collaborators: Roberto De Prisco, Jeremy Sussman, Keith Marzullo, Danny Dolev, Alan Fekete, Gregory Chockler, Roman Vitenberg
Project Scope • Define services to support high-performance distributed computing in dynamic environments: Failures, changing participants. • Design algorithms to implement the services. • Analyze algorithms: Correctness, performance, fault-tolerance. • Develop necessary mathematical foundations: State machine models, analysis methods. • Develop supporting languages and tools: IOA.
Talk Outline I. View-oriented group communication services II. Non-view-oriented group communication III. Mathematical foundations IV. IOA language and tools V. Memory consistency models VI. Plans
View-Oriented Group Communication Services • Cope with changing participants using abstract groups of client processes with changing membership sets. • Processes communicate with group members by sending messages to the group as a whole. • GC services support management of groups: • Maintain membership information, form views. • Manage communication. • Make guarantees about ordering, reliability of message delivery. • Isis, Transis, Totem, Ensemble, … • [Diagram: clients attached to a GC service]
Using View-Oriented GC Services • Advantages: • High-level programming abstraction • Hides complexity of coping with changes • Disadvantages: • Can be costly, especially when forming new views. • May have problems scaling to large networks. • Applications: • Managing replicated data • Distributed interactive games • Multi-media conferencing, collaborative work
Our Approach • Mathematical, using state machines (I/O automata) • Model everything: • Applications • Service specifications • Implementations of the services • Prove correctness • Analyze performance, fault-tolerance • [Diagram: Application layered over Service; Application layered over Algorithm]
Our Earlier Work: VS [Fekete, Lynch, Shvartsman 97, 01] • Defined automaton models for: • VS, partitionable GC service, based on Transis • TO, non-view-oriented totally ordered bcast service • VStoTO, algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser] • Proved correctness • Analyzed performance, fault-tolerance: conditional performance analysis • [Diagram: VStoTO processes between TO (bcast, brcv) and VS (gpsnd, gprcv, newview)]
Conditional Performance Analysis • Assume VS satisfies: • If a network component C stabilizes, then soon thereafter, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within bounded time. • And VStoTO satisfies: • Simple timing and fault-tolerance assumptions. • Then TO satisfies: • If C stabilizes, then soon thereafter, any message sent or delivered anywhere in C is delivered everywhere in C, within bounded time.
Ensemble [Hickey, Lynch, van Renesse 99] • Ensemble system [Birman, Hayden], layered design: • Worked with developers, following VS. • Developed global specs for key layers. • Modeled Ensemble algorithm spanning between layers. • Tried proof; found algorithmic error. • Modeled, analyzed repaired system • Same error found in Horus.
More Recent Progress 1. GC with unique primary views 2. Scalable GC 3. Optimistic Virtual Synchrony 4. GC service specifications
GC With Unique Primaries • Dynamic View Service [De Prisco, Fekete, Lynch, Shvartsman 98] • Produces unique primary views • Copes with long-term changes. • Dynamic Configuration Service [DFLS 99] • Adds quorums. • Copes with long-term and transient changes. • Dynamic Leader Configuration Service [D 99], [DL 01] • Adds leaders.
GC With Unique Primaries • Algorithms to implement the services • Based on dynamic voting algorithm of [Yeger-Lotem, Keidar, Dolev 97]. • Each primary needs majority of all possible previous primaries. • Models, proofs,… • Applications • Consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes. • Models, proofs,…
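The majority condition above can be sketched concretely. This is a hypothetical helper (not the actual [Yeger-Lotem, Keidar, Dolev 97] implementation): a candidate view may become primary only if it contains a majority of a previous primary.

```python
# Illustrative sketch of the dynamic-voting majority test.
# Names and data representation are assumptions, not from the talk.

def can_form_primary(candidate, last_primary):
    """candidate, last_primary: sets of process ids.
    True iff candidate contains a majority of last_primary."""
    overlap = candidate & last_primary
    return len(overlap) * 2 > len(last_primary)

assert can_form_primary({1, 2, 3}, {1, 2, 3, 4})   # 3 of 4: a majority
assert not can_form_primary({3, 4}, {1, 2, 3, 4})  # 2 of 4: not a majority
```

In the full algorithm this test is applied against every possible previous primary, which is what tolerates transient as well as long-term changes.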
Availability of Unique Primary Algorithms [Ingols, Keidar 01] • Simulation study comparing unique primary algorithms: • [Yeger-Lotem, Keidar, Dolev], [DFLS] • 1-pending, like [Jajodia, Mutchler] • Majority-resilient 1-pending, like [Lamport], [Keidar, Dolev] • Simulate repeated view changes, interrupting other view changes. • Availability shown to depend heavily on: • Number of processes from previous view needed to form new view. • Number of message rounds needed to form a view. • [YKD], [DFLS] have highest availability.
Group Communication Service • Manages group membership, current view. • Multicast communication among group members, with ordering, reliability guarantees. • Virtual Synchrony [Birman, Joseph 87] • Integrates group membership and group communication. • Processes that move together from one view to another deliver the same messages in the first view. • Useful for replicated data management. • Before announcing new view, processes must synchronize, exchange messages.
Example: Virtual Synchrony • [Diagram: in view 3 = {i, j, k}, a process multicasts m and some processes receive it; view 4 = {i, j} then forms, and the VS algorithm supplies the missing m so both survivors deliver it before view 4.]
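The synchronization step behind virtual synchrony can be sketched as follows; the function name and data representation are illustrative, not from the talk:

```python
# Toy sketch: before a new view is installed, survivors exchange the
# messages they delivered in the old view; each then delivers any
# message it is missing, so all survivors agree on the old view's set.

def synchronize(delivered_by_process):
    """delivered_by_process: {pid: set of messages delivered in old view}.
    Returns, per process, the messages it must still deliver before
    the new view can be announced."""
    union = set()
    for msgs in delivered_by_process.values():
        union |= msgs
    return {pid: union - msgs for pid, msgs in delivered_by_process.items()}

missing = synchronize({"i": {"m"}, "j": {"m"}, "k": set()})
# k is missing m; the VS algorithm supplies it before the new view.
assert missing["k"] == {"m"} and missing["i"] == set()
```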
Group Communication in WANs • Difficulties: • High message latency, message exchanges are expensive • Frequent connectivity changes • New, scalable GC algorithm: • Uses scalable GM service of [Keidar, Sussman, et al. 00], implemented on a small set of membership servers. • GC (with virtual synchrony) implemented on clients. • [Diagram: VSGC on clients, layered over GM and Net]
Group Communication in WANs • Try to minimize time from when network stabilizes until GC delivers new views to clients. • After stabilization: GM forms view, VSGC algorithm synchronizes. • Existing systems (LANs): • GM, VSGC use several message exchange rounds • Continue in spite of new network events • Inappropriate for WANs • [Diagram: Net event triggers GM Algorithm, then VSGC Algorithm, then view(v)]
New Algorithm • VSGC uses one message exchange round, in parallel with GM’s agreement on views. • GM usually delivers views in one message exchange. • Responds to new network events during reconfiguration: • GM produces new membership sets • VSGC responds to membership changes • Distributed implementation [Tarashchanskiy 00] • [Diagram: Net event triggers GM Algorithm and VSGC Algorithm in parallel, then view(v)]
Correctness Proofs • Models, proofs (safety and liveness) • Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00] • Proof Extension Theorem • [Diagram: specification S extended to S’, automaton A extended to A’] • Used new methods for the safety proofs.
Performance Analysis • Analyze time from when network stabilizes until GC delivers new views to clients. • System is a composition: • Network service, GM services, VSGC processes • Compositional analysis: • Analyze the VSGC algorithm alone, in terms of its inputs and timing assumptions. • State reasonable performance guarantees for GM, Network. • Combine to get conditional performance properties for the system as a whole.
Analysis of VSGC Algorithm • Assume component C stabilizes: • GM delivers same views to VSGC processes • Net provides reliable communication with latency δ. • Let • T[start], T[view] be times of last GM events for C • σ be an upper bound on local step time. • Then VSGC outputs new views by time max(T[start] + δ + x, T[view]) + σ
[Diagram: timeline for the VSGC analysis, marking the Net event, GM start events at T[start], view(v) at T[view], and the VSGC output of view(v) within x]
Assumed Bounds for GM • Bounds for “Fast Path” of [Keidar, et al. 00], observed empirically in almost all cases. • [Diagram: timeline marking GM start events at T[start] and view(v) at T[view]]
Combining VSGC and GM Bounds • Bounds for system, conditional on GM bounds. • [Diagram: GM bounds (T[start], T[view]) combined with the VSGC bound, giving view(v) delivery within x]
3. Optimistic Virtual Synchrony [Sussman, Keidar, Marzullo 00] • Most GC algorithms block sending during reconfiguration. • OVS service provides: • Optimistic view proposal, before reconfiguration. • Optimistic sends after proposal, during reconfiguration. • Deliveries of optimistic messages in next view, subject to application policy. • Useful for applications: • Replicated data management • State transfer • Sending vectors of data
4. GC Service Specifications [Chockler, Keidar, Vitenberg 01] • Comprehensive set of specifications for properties guaranteed by GC services. • Unifying framework. • Safety properties • Membership: View order, partitionable, primary component • Multicast: Sending view delivery, virtual synchrony • Safe notifications • Ordering, reliability: FIFO, causal, totally ordered, atomic • Liveness properties • For eventually stable components: View stability, multicast delivery, safe notification liveness • For eventually stable pairs
II. Non-View-Oriented Group Communication Totally Ordered Multicast with QoS [Bar-Joseph, Keidar, Anker, Lynch 00, 01]
Totally Ordered Multicast with QoS • Multicast to dynamic group, subject to joins, leaves, and failures. • Global total ordering of messages • QoS: Message delivery latency • Built on reliable network with latency guarantees • Add ordering guarantees, preserve latency bounds. • Applications • State machine replication • Distributed games • Shared editing
Two Algorithms • Algorithm 1: Basic Totally Ordered Multicast • Sends, receives consistent with total ordering of messages. • Non-failing processes agree on messages from non-failing processes. • Latency: Constant, even with joins, leaves, failures. • Algorithm 2: Atomic Multicast • Non-failing processes agree on all messages. • Latency: • Joins, leaves only: Constant • With failures: Linear in f • [Diagram: TOM service layered over Net, with fail_i, fail_j inputs]
Local Node Process • [Diagram: per-node components FrontEnd_i, Memb_i, Ord_i, Sniffer_i, layered over Net. The client issues mcast(m), join, leave to FrontEnd_i; FrontEnd_i sends mcast(join), mcast(leave) and passes joiners(s,J), leavers(s,J) to Memb_i, which emits members(s,J); Ord_i delivers rcv(m) using end-slot(s); Sniffer_i reports progress(s,j).]
Local Algorithm Operation • FrontEnd divides time into slots, tags messages with slots. • Ord delivers messages by slot, in order of process indices. • Memb determines slot membership. • Join, leave messages • Failures: • Algorithm 1 uses local failure detector. • Algorithm 2 uses consensus on failures. • Requires new dynamic version of consensus. • Timing-dependent
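The slot-based delivery rule can be sketched in a few lines; the data layout here is an assumption for illustration, not the actual Ord automaton:

```python
# Sketch: FrontEnd tags each message with a slot; once a slot ends,
# Ord delivers that slot's messages in order of process indices.
# Every process applies the same rule, yielding one global total order.

def deliver_order(slot_msgs):
    """slot_msgs: {slot: {pid: [messages]}}. Returns the delivery sequence."""
    out = []
    for slot in sorted(slot_msgs):           # slots in increasing order
        for pid in sorted(slot_msgs[slot]):  # within a slot, by process index
            out.extend(slot_msgs[slot][pid])
    return out

order = deliver_order({1: {2: ["b"], 1: ["a"]}, 0: {3: ["c"]}})
assert order == ["c", "a", "b"]              # slot 0 first, then slot 1 by pid
```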
Architecture for Algorithm 2 • [Diagram: TO-QoS layered over GM and Net]
SRM [Floyd, et al.] • Reliable multicast to dynamic group. • Built over IP multicast • Based on requests (NACKs) and retransmissions • Limits duplicate requests/retransmissions using: • Deterministic suppression: Ancestors suppress descendants, by scheduling requests/replies based on distance to source. • Probabilistic suppression: Siblings suppress each other, by spreading out requests/replies.
SRM Architecture • [Diagram: SRM layered over IP Multicast]
New Protocol • Inspired by SRM • Assume future losses occur on same link (locality). • Uses deterministic suppression for siblings • Elects, caches best requestor and retransmitter • Chooses requestor closest to source. • Chooses retransmitter closest to requestor. • Break ties with processor ids.
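Deterministic suppression and the requestor/retransmitter election can be sketched as follows; the delay constant and the tie-breaking representation are illustrative assumptions, not the protocol's actual parameters:

```python
# Sketch: each receiver schedules its retransmission request at a delay
# proportional to its distance from the source, so a closer requestor's
# multicast request arrives at farther receivers in time to suppress
# theirs. The elected best requestor is the one closest to the source,
# with processor ids breaking ties.

def request_delay(dist_to_source, c=2.0):
    """Deterministic suppression timer (c is an assumed constant)."""
    return c * dist_to_source

def elect_requestor(distances):
    """distances: {pid: distance to source}. Closest wins; ties by pid."""
    return min(distances, key=lambda p: (distances[p], p))

assert elect_requestor({1: 3.0, 2: 1.0, 3: 1.0}) == 2  # closest, lowest pid
assert request_delay(1.0) < request_delay(3.0)          # nearer fires first
```

The best retransmitter would be elected the same way, but by distance to the elected requestor rather than to the source.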
Best Requestor and Retransmitter • [Diagram: multicast tree rooted at source S, with the elected requestor and retransmitter marked]
Performance Analysis • Metrics: • Loss recovery latency: Time from detection of packet loss to receipt of first retransmission • Loss recovery overhead: Number of messages multicast to recover from a message loss • Protocol performance benefits: • Removes delays caused by probabilistic suppression • Following election of requestor and retransmitter: • Reduces latency by using best requestor and retransmitter. • Reduces overhead by using single requestor and retransmitter.
III. Mathematical Foundations • Incremental modeling and proof methods [Keidar, Khazan, Lynch, Shvartsman 00] • Proof Extension Theorem • Arose in Scalable GC work [Keidar, Khazan 00] • Hybrid Input/Output Automata [Lynch, Segala, Vaandrager 01] • Model for continuous and discrete system behavior • Useful for mobile computing? • Conditional performance analysis methods • For analyzing communication protocols • AFOSR MURI project (Berkeley)
IV. IOA Language and Tools • [Diagram: automaton A with input actions I and output actions O]
IOA Language and Tools • Language for describing I/O automata: Garland, Lynch • Use to describe services and algorithms. • Front end: Garland • Translates to Java objects • Completely rewritten this year. • Still needs support for composition. • Theorem-prover connection: Garland, Bogdanov • Connection with LP • Seeking connections: SAL, Isabelle, STeP, NuPRL
IOA Language and Tools • Simulator: Chefter, Ramirez, Dean • Has support for paired simulation. • Needs additions. • Being instrumented for invariant discovery using Ernst’s Daikon tool • Code generator: Tauber, Tsai • Local code-gen (translation to Java) running. • Needs composition, communication service calls, correctness proof. • Challenge examples
Memory Models • [Diagram: processors P1, P2, …, Pn issuing read/write operations to a shared Memory] • Establishes a general mathematical framework for specifying and reasoning about multiprocessor memories and the programs that use them. • Also applies to distributed shared memory.
Memory Models • Sequentially consistent memory: • Operations appear to happen in some sequential order. • Read operation returns latest value written to the location. • Processor consistent memory: • Reads overtake writes to other locations. • SPARC TSO, IBM 370 • Coherent memory • Memory with synchronization commands: • Fences, barriers, acquire/release,… • Release consistency, weak ordering, locking • Transactional memory
Programming restrictions: • Data-race-free (for use with weak ordering) • Properly labelled (for use with release consistency) • Two-phase locking
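For tiny executions, sequential consistency as defined above can be checked by brute force over interleavings. This sketch is purely illustrative (it assumes every location starts at 0): it searches for one interleaving, respecting each processor's program order, in which every read returns the latest write.

```python
# Brute-force sequential-consistency check for very small executions.
from itertools import permutations

def sequentially_consistent(progs):
    """progs: list of per-processor op lists;
    op = ("w", loc, val) for a write, ("r", loc, val) for a read."""
    ops = [(p, i) for p, prog in enumerate(progs) for i in range(len(prog))]
    for perm in permutations(ops):
        seen = [0] * len(progs)   # next expected index per processor
        mem = {}                  # memory state along this interleaving
        ok = True
        for p, i in perm:
            if i != seen[p]:      # violates program order: skip interleaving
                ok = False
                break
            seen[p] += 1
            kind, loc, val = progs[p][i]
            if kind == "w":
                mem[loc] = val
            elif mem.get(loc, 0) != val:   # read must see latest write
                ok = False
                break
        if ok:
            return True
    return False

# P1 writes x=1; P2 reads 1 then 0: no legal interleaving exists.
assert sequentially_consistent([[("w", "x", 1)], [("r", "x", 1)]])
assert not sequentially_consistent([[("w", "x", 1)],
                                    [("r", "x", 1), ("r", "x", 0)]])
```

Weaker models like processor consistency or TSO admit executions this check rejects, which is exactly what the framework above is designed to specify precisely.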