Building Blocks for High-Performance, Fault-Tolerant Distributed Systems Nancy Lynch Theory of Distributed Systems MIT AFOSR project review Cornell University May, 2001 …
Project Participants • Leaders: Nancy Lynch, Idit Keidar, Alex Shvartsman, Steve Garland • PhD students: Victor Luchangco, Roger Khazan, Carl Livadas, Josh Tauber, Ziv Bar-Joseph, Rui Fan • MEng: Rui Fan, Kyle Ingols, Igor Taraschanskiy, Andrej Bogdanov, Michael Tsai, Laura Dean • Collaborators: Roberto De Prisco, Jeremy Sussman, Keith Marzullo, Danny Dolev, Alan Fekete, Gregory Chockler, Roman Vitenberg
Project Scope • Define services to support high-performance distributed computing in dynamic environments: Failures, changing participants. • Design algorithms to implement the services. • Analyze algorithms: Correctness, performance, fault-tolerance. • Develop necessary mathematical foundations: State machine models, analysis methods. • Develop supporting languages and tools: IOA.
Talk Outline I. View-oriented group communication services II. Non-view-oriented group communication III. Mathematical foundations IV. IOA language and tools V. Memory consistency models VI. Plans
View-Oriented Group Communication Services • Cope with changing participants using abstract groups of client processes with changing membership sets. • Processes communicate with group members by sending messages to the group as a whole. • GC services support management of groups: • Maintain membership information, form views. • Manage communication. • Make guarantees about ordering, reliability of message delivery. • Isis, Transis, Totem, Ensemble, … • [Diagram: clients attached to a GC service]
Using View-Oriented GC Services • Advantages: • High-level programming abstraction • Hides complexity of coping with changes • Disadvantages: • Can be costly, especially when forming new views. • May have problems scaling to large networks. • Applications: • Managing replicated data • Distributed interactive games • Multi-media conferencing, collaborative work
Our Approach • Mathematical, using state machines (I/O automata) • Model everything: • Applications • Service specifications • Implementations of the services • Prove correctness • Analyze performance, fault-tolerance • [Diagram: Application layered over Service; Application layered over Algorithm]
Our Earlier Work: VS [Fekete, Lynch, Shvartsman 97, 01] • Defined automaton models for: • VS, partitionable GC service, based on Transis • TO, non-view-oriented totally ordered bcast service • VStoTO, algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser] • Proved correctness • Analyzed performance, fault-tolerance: conditional performance analysis • [Diagram: VStoTO processes between TO (bcast, brcv) and VS (gpsnd, gprcv, newview)]
Conditional Performance Analysis • Assume VS satisfies: • If a network component C stabilizes, then soon thereafter, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within bounded time. • And VStoTO satisfies: • Simple timing and fault-tolerance assumptions. • Then TO satisfies: • If C stabilizes, then soon thereafter, any message sent or delivered anywhere in C is delivered everywhere in C, within bounded time.
Ensemble [Hickey, Lynch, van Renesse 99] • Ensemble system [Birman, Hayden], layered design: • Worked with developers, following VS. • Developed global specs for key layers. • Modeled Ensemble algorithm spanning between layers. • Tried proof; found algorithmic error. • Modeled, analyzed repaired system • Same error found in Horus.
More Recent Progress 1. GC with unique primary views 2. Scalable GC 3. Optimistic Virtual Synchrony 4. GC service specifications
GC With Unique Primaries • Dynamic View Service [De Prisco, Fekete, Lynch, Shvartsman 98] • Produces unique primary views • Copes with long-term changes. • Dynamic Configuration Service [DFLS 99] • Adds quorums. • Copes with long-term and transient changes. • Dynamic Leader Configuration Service [D 99], [DL 01] • Adds leaders.
GC With Unique Primaries • Algorithms to implement the services • Based on dynamic voting algorithm of [Yeger-Lotem, Keidar, Dolev 97]. • Each primary needs majority of all possible previous primaries. • Models, proofs,… • Applications • Consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes. • Models, proofs,…
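The majority condition above can be sketched concretely. This is a hypothetical helper (not the actual [Yeger-Lotem, Keidar, Dolev 97] implementation): a candidate view may become primary only if it contains a majority of a previous primary.

```python
# Illustrative sketch of the dynamic-voting majority test.
# Names and data representation are assumptions, not from the talk.

def can_form_primary(candidate, last_primary):
    """candidate, last_primary: sets of process ids.
    True iff candidate contains a majority of last_primary."""
    overlap = candidate & last_primary
    return len(overlap) * 2 > len(last_primary)

assert can_form_primary({1, 2, 3}, {1, 2, 3, 4})   # 3 of 4: a majority
assert not can_form_primary({3, 4}, {1, 2, 3, 4})  # 2 of 4: not a majority
```

In the full algorithm this test is applied against every possible previous primary, which is what tolerates transient as well as long-term changes.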
Availability of Unique Primary Algorithms [Ingols, Keidar 01] • Simulation study comparing unique primary algorithms: • [Yeger-Lotem, Keidar, Dolev], [DFLS] • 1-pending, like [Jajodia, Mutchler] • Majority-resilient 1-pending, like [Lamport], [Keidar, Dolev] • Simulate repeated view changes, interrupting other view changes. • Availability shown to depend heavily on: • Number of processes from previous view needed to form new view. • Number of message rounds needed to form a view. • [YKD], [DFLS] have highest availability.
Group Communication Service • Manages group membership, current view. • Multicast communication among group members, with ordering, reliability guarantees. • Virtual Synchrony [Birman, Joseph 87] • Integrates group membership and group communication. • Processes that move together from one view to another deliver the same messages in the first view. • Useful for replicated data management. • Before announcing new view, processes must synchronize, exchange messages.
Example: Virtual Synchrony • [Diagram: in view 3 = {i, j, k}, a process multicasts m and some processes receive it; view 4 = {i, j} then forms, and the VS algorithm supplies the missing m so both survivors deliver it before view 4.]
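The synchronization step behind virtual synchrony can be sketched as follows; the function name and data representation are illustrative, not from the talk:

```python
# Toy sketch: before a new view is installed, survivors exchange the
# messages they delivered in the old view; each then delivers any
# message it is missing, so all survivors agree on the old view's set.

def synchronize(delivered_by_process):
    """delivered_by_process: {pid: set of messages delivered in old view}.
    Returns, per process, the messages it must still deliver before
    the new view can be announced."""
    union = set()
    for msgs in delivered_by_process.values():
        union |= msgs
    return {pid: union - msgs for pid, msgs in delivered_by_process.items()}

missing = synchronize({"i": {"m"}, "j": {"m"}, "k": set()})
# k is missing m; the VS algorithm supplies it before the new view.
assert missing["k"] == {"m"} and missing["i"] == set()
```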
Group Communication in WANs • Difficulties: • High message latency, message exchanges are expensive • Frequent connectivity changes • New, scalable GC algorithm: • Uses scalable GM service of [Keidar, Sussman, et al. 00], implemented on a small set of membership servers. • GC (with virtual synchrony) implemented on clients. • [Diagram: VSGC on clients, layered over GM and Net]
Group Communication in WANs • Try to minimize time from when network stabilizes until GC delivers new views to clients. • After stabilization: GM forms view, VSGC algorithm synchronizes. • Existing systems (LANs): • GM, VSGC use several message exchange rounds • Continue in spite of new network events • Inappropriate for WANs • [Diagram: Net event triggers GM Algorithm, then VSGC Algorithm, then view(v)]
New Algorithm • VSGC uses one message exchange round, in parallel with GM’s agreement on views. • GM usually delivers views in one message exchange. • Responds to new network events during reconfiguration: • GM produces new membership sets • VSGC responds to membership changes • Distributed implementation [Tarashchanskiy 00] • [Diagram: Net event triggers GM Algorithm and VSGC Algorithm in parallel, then view(v)]
Correctness Proofs • Models, proofs (safety and liveness) • Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00] • Proof Extension Theorem • [Diagram: specification S extended to S’, automaton A extended to A’] • Used new methods for the safety proofs.
Performance Analysis • Analyze time from when network stabilizes until GC delivers new views to clients. • System is a composition: • Network service, GM services, VSGC processes • Compositional analysis: • Analyze the VSGC algorithm alone, in terms of its inputs and timing assumptions. • State reasonable performance guarantees for GM, Network. • Combine to get conditional performance properties for the system as a whole.
Analysis of VSGC Algorithm • Assume component C stabilizes: • GM delivers same views to VSGC processes • Net provides reliable communication with latency δ. • Let • T[start], T[view] be times of last GM events for C • σ be an upper bound on local step time. • Then VSGC outputs new views by time max(T[start] + δ + x, T[view]) + σ
[Diagram: timeline for the VSGC analysis, marking the Net event, GM start events at T[start], view(v) at T[view], and the VSGC output of view(v) within x]
Assumed Bounds for GM • Bounds for “Fast Path” of [Keidar, et al. 00], observed empirically in almost all cases. • [Diagram: timeline marking GM start events at T[start] and view(v) at T[view]]
Combining VSGC and GM Bounds • Bounds for system, conditional on GM bounds. • [Diagram: GM bounds (T[start], T[view]) combined with the VSGC bound, giving view(v) delivery within x]
3. Optimistic Virtual Synchrony [Sussman, Keidar, Marzullo 00] • Most GC algorithms block sending during reconfiguration. • OVS service provides: • Optimistic view proposal, before reconfiguration. • Optimistic sends after proposal, during reconfiguration. • Deliveries of optimistic messages in next view, subject to application policy. • Useful for applications: • Replicated data management • State transfer • Sending vectors of data
4. GC Service Specifications [Chockler, Keidar, Vitenberg 01] • Comprehensive set of specifications for properties guaranteed by GC services. • Unifying framework. • Safety properties • Membership: View order, partitionable, primary component • Multicast: Sending view delivery, virtual synchrony • Safe notifications • Ordering, reliability: FIFO, causal, totally ordered, atomic • Liveness properties • For eventually stable components: View stability, multicast delivery, safe notification liveness • For eventually stable pairs
II. Non-View-Oriented Group Communication Totally Ordered Multicast with QoS [Bar-Joseph, Keidar, Anker, Lynch 00, 01]
Totally Ordered Multicast with QoS • Multicast to dynamic group, subject to joins, leaves, and failures. • Global total ordering of messages • QoS: Message delivery latency • Built on reliable network with latency guarantees • Add ordering guarantees, preserve latency bounds. • Applications • State machine replication • Distributed games • Shared editing
Two Algorithms • Algorithm 1: Basic Totally Ordered Multicast • Sends, receives consistent with total ordering of messages. • Non-failing processes agree on messages from non-failing processes. • Latency: Constant, even with joins, leaves, failures. • Algorithm 2: Atomic Multicast • Non-failing processes agree on all messages. • Latency: • Joins, leaves only: Constant • With failures: Linear in f • [Diagram: TOM service layered over Net, with fail_i, fail_j inputs]
Local Node Process • [Diagram: per-node components FrontEnd_i, Memb_i, Ord_i, Sniffer_i, layered over Net. The client issues mcast(m), join, leave to FrontEnd_i; FrontEnd_i sends mcast(join), mcast(leave) and passes joiners(s,J), leavers(s,J) to Memb_i, which emits members(s,J); Ord_i delivers rcv(m) using end-slot(s); Sniffer_i reports progress(s,j).]
Local Algorithm Operation • FrontEnd divides time into slots, tags messages with slots. • Ord delivers messages by slot, in order of process indices. • Memb determines slot membership. • Join, leave messages • Failures: • Algorithm 1 uses local failure detector. • Algorithm 2 uses consensus on failures. • Requires new dynamic version of consensus. • Timing-dependent
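The slot-based delivery rule can be sketched in a few lines; the data layout here is an assumption for illustration, not the actual Ord automaton:

```python
# Sketch: FrontEnd tags each message with a slot; once a slot ends,
# Ord delivers that slot's messages in order of process indices.
# Every process applies the same rule, yielding one global total order.

def deliver_order(slot_msgs):
    """slot_msgs: {slot: {pid: [messages]}}. Returns the delivery sequence."""
    out = []
    for slot in sorted(slot_msgs):           # slots in increasing order
        for pid in sorted(slot_msgs[slot]):  # within a slot, by process index
            out.extend(slot_msgs[slot][pid])
    return out

order = deliver_order({1: {2: ["b"], 1: ["a"]}, 0: {3: ["c"]}})
assert order == ["c", "a", "b"]              # slot 0 first, then slot 1 by pid
```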
Architecture for Algorithm 2 • [Diagram: TO-QoS layered over GM and Net]
SRM [Floyd, et al.] • Reliable multicast to dynamic group. • Built over IP multicast • Based on requests (NACKs) and retransmissions • Limits duplicate requests/retransmissions using: • Deterministic suppression: Ancestors suppress descendants, by scheduling requests/replies based on distance to source. • Probabilistic suppression: Siblings suppress each other, by spreading out requests/replies.
SRM Architecture • [Diagram: SRM layered over IP Multicast]
New Protocol • Inspired by SRM • Assume future losses occur on same link (locality). • Uses deterministic suppression for siblings • Elects, caches best requestor and retransmitter • Chooses requestor closest to source. • Chooses retransmitter closest to requestor. • Break ties with processor ids.
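Deterministic suppression and the requestor/retransmitter election can be sketched as follows; the delay constant and the tie-breaking representation are illustrative assumptions, not the protocol's actual parameters:

```python
# Sketch: each receiver schedules its retransmission request at a delay
# proportional to its distance from the source, so a closer requestor's
# multicast request arrives at farther receivers in time to suppress
# theirs. The elected best requestor is the one closest to the source,
# with processor ids breaking ties.

def request_delay(dist_to_source, c=2.0):
    """Deterministic suppression timer (c is an assumed constant)."""
    return c * dist_to_source

def elect_requestor(distances):
    """distances: {pid: distance to source}. Closest wins; ties by pid."""
    return min(distances, key=lambda p: (distances[p], p))

assert elect_requestor({1: 3.0, 2: 1.0, 3: 1.0}) == 2  # closest, lowest pid
assert request_delay(1.0) < request_delay(3.0)          # nearer fires first
```

The best retransmitter would be elected the same way, but by distance to the elected requestor rather than to the source.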
Best Requestor and Retransmitter • [Diagram: multicast tree rooted at source S, with the elected requestor and retransmitter marked]
Performance Analysis • Metrics: • Loss recovery latency: Time from detection of packet loss to receipt of first retransmission • Loss recovery overhead: Number of messages multicast to recover from a message loss • Protocol performance benefits: • Removes delays caused by probabilistic suppression • Following election of requestor and retransmitter: • Reduces latency by using best requestor and retransmitter. • Reduces overhead by using single requestor and retransmitter.
III. Mathematical Foundations • Incremental modeling and proof methods [Keidar, Khazan, Lynch, Shvartsman 00] • Proof Extension Theorem • Arose in Scalable GC work [Keidar, Khazan 00] • Hybrid Input/Output Automata [Lynch, Segala, Vaandrager 01] • Model for continuous and discrete system behavior • Useful for mobile computing? • Conditional performance analysis methods • For analyzing communication protocols • AFOSR MURI project (Berkeley)
IV. IOA Language and Tools • [Diagram: automaton A with input actions I and output actions O]
IOA Language and Tools • Language for describing I/O automata: Garland, Lynch • Use to describe services and algorithms. • Front end: Garland • Translates to Java objects • Completely rewritten this year. • Still needs support for composition. • Theorem-prover connection: Garland, Bogdanov • Connection with LP • Seeking connections: SAL, Isabelle, STeP, NuPRL
IOA Language and Tools • Simulator: Chefter, Ramirez, Dean • Has support for paired simulation. • Needs additions. • Being instrumented for invariant discovery using Ernst’s Daikon tool • Code generator: Tauber, Tsai • Local code-gen (translation to Java) running. • Needs composition, communication service calls, correctness proof. • Challenge examples
Memory Models • [Diagram: processors P1, P2, …, Pn issuing read/write operations to a shared Memory] • Establishes a general mathematical framework for specifying and reasoning about multiprocessor memories and the programs that use them. • Also applies to distributed shared memory.
Memory Models • Sequentially consistent memory: • Operations appear to happen in some sequential order. • Read operation returns latest value written to the location. • Processor consistent memory: • Reads overtake writes to other locations. • SPARC TSO, IBM 370 • Coherent memory • Memory with synchronization commands: • Fences, barriers, acquire/release,… • Release consistency, weak ordering, locking • Transactional memory
Programming restrictions: • Data-race-free (for use with weak ordering) • Properly labelled (for use with release consistency) • Two-phase locking
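For tiny executions, sequential consistency as defined above can be checked by brute force over interleavings. This sketch is purely illustrative (it assumes every location starts at 0): it searches for one interleaving, respecting each processor's program order, in which every read returns the latest write.

```python
# Brute-force sequential-consistency check for very small executions.
from itertools import permutations

def sequentially_consistent(progs):
    """progs: list of per-processor op lists;
    op = ("w", loc, val) for a write, ("r", loc, val) for a read."""
    ops = [(p, i) for p, prog in enumerate(progs) for i in range(len(prog))]
    for perm in permutations(ops):
        seen = [0] * len(progs)   # next expected index per processor
        mem = {}                  # memory state along this interleaving
        ok = True
        for p, i in perm:
            if i != seen[p]:      # violates program order: skip interleaving
                ok = False
                break
            seen[p] += 1
            kind, loc, val = progs[p][i]
            if kind == "w":
                mem[loc] = val
            elif mem.get(loc, 0) != val:   # read must see latest write
                ok = False
                break
        if ok:
            return True
    return False

# P1 writes x=1; P2 reads 1 then 0: no legal interleaving exists.
assert sequentially_consistent([[("w", "x", 1)], [("r", "x", 1)]])
assert not sequentially_consistent([[("w", "x", 1)],
                                    [("r", "x", 1), ("r", "x", 0)]])
```

Weaker models like processor consistency or TSO admit executions this check rejects, which is exactly what the framework above is designed to specify precisely.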