
Reliable Group Communication: a Mathematical Approach


Presentation Transcript


  1. GC Reliable Group Communication: a Mathematical Approach Nancy Lynch Theory of Distributed Systems MIT LCS Kansai chapter, IEEE July 7, 2000

  2. Dynamic Distributed Systems • Modern distributed systems are dynamic. • Set of clients participating in an application changes, because of: • Network, processor failure, recovery • Changing client requirements • To cope with changes: • Use abstract groups of client processes with changing membership sets. • Processes communicate with group members by sending messages to the group as a whole.

  3. GC Group Communication Services • Support management of groups • Maintain membership info • Manage communication • Make guarantees about ordering, reliability of message delivery, e.g.: • Best-effort: IP Multicast • Strong consistency guarantees: Isis, Transis, Ensemble • Hide complexity of coping with changes

  4. This Talk • Describe • Group communication systems • A mathematical approach to designing, modeling, analyzing GC systems. • Our accomplishments and ideas for future work. • Collaborators: Idit Keidar, Alan Fekete, Alex Shvartsman, Roger Khazan, Roberto De Prisco, Jason Hickey, Robert van Renesse, Carl Livadas, Ziv Bar-Joseph, Kyle Ingols, Igor Tarashchanskiy

  5. Talk Outline I. Background: Group Communication II. Our Approach III. Projects and Results 1. View Synchrony 2. Ensemble 3. Dynamic Views 4. Scalable Group Communication IV. Future Work V. Conclusions

  6. I. Background: Group Communication

  7. The Setting • Dynamic distributed system, changing set of participating clients. • Applications: • Replicated databases, file systems • Distributed interactive games • Multi-media conferencing, collaborative work • …

  8. Groups • Abstract, named groups of client processes, changing membership. • Client processes send messages to the group (multicast). • Early 80s: Group idea used in replicated data management system designs • Late 80s: Separate group communication services.

  9. GC Group Communication Service • Communication middleware • Manages group membership, current views: View = membership set + identifier • Manages multicast communication among group members • Multicasts respect views • Guarantees within each view: • Reliability constraints • Ordering constraints, e.g., FIFO from each sender, causal, common total order • Global service

  10. Group Communication Service [Figure: clients A and B interact with the GCS through mcast, receive, and new-view actions]

  11. Isis [Birman, Joseph 87] • Primary component group membership • Several reliable multicast services, different ordering guarantees, e.g.: • Atomic Broadcast: common total order, no gaps • Causal Broadcast: causally ordered delivery • When a partition is repaired, primary processes send state information to rejoining processes. • Virtually Synchronous message delivery

  12. Example: Interactive Game • Alice, Bob, Carol, Dan in view {A,B,C,D} • Primary component membership: {A} {B,C,D} split; only {B,C,D} may continue. • Atomic Broadcast: A fires, B moves away; need consistent order

  13. Interactive Game • Causal Broadcast: C sees A enter a room; locks door. • Virtual Synchrony: {A} {B,C,D} split; B sees A shoot; so do C, D.

  14. Applications • Replicated data management • State machine replication [Lamport 78], [Schneider 90] • Atomic Broadcast provides support • Same sequence of actions performed everywhere. • Examples: interactive game state machine • Stock market • Air-traffic control
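State machine replication is concrete enough to sketch. A minimal, hedged example (the `Replica` class and the command format are illustrative assumptions, not from the talk): deterministic replicas that apply the same totally ordered command sequence end in the same state, which is exactly why Atomic Broadcast supports this technique.

```python
# Hedged sketch of state machine replication over an atomic broadcast.
# Replica and its command format are illustrative, not from the talk.

class Replica:
    """A deterministic state machine; here the state is a simple counter."""
    def __init__(self):
        self.state = 0

    def apply(self, cmd):
        # Commands are (op, amount) pairs; determinism is what matters.
        op, amount = cmd
        if op == "add":
            self.state += amount
        elif op == "mul":
            self.state *= amount

# Atomic broadcast delivers the SAME sequence of commands to every replica.
ordered_cmds = [("add", 3), ("mul", 4), ("add", 1)]

replicas = [Replica() for _ in range(3)]
for cmd in ordered_cmds:              # same order everywhere
    for r in replicas:
        r.apply(cmd)

# All replicas end in the same state: (0 + 3) * 4 + 1 = 13.
assert len({r.state for r in replicas}) == 1
print(replicas[0].state)
```

If the broadcast delivered commands in different orders at different replicas (e.g., swapping the add and the mul), the states would diverge, which is the game-consistency problem from the previous slide.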

  15. Transis [Amir, Dolev, Kramer, Malkhi 92] • Partitionable group membership • When components merge, processes exchange state information. • Virtual synchrony reduces amount of data exchanged. • Applications • Highly available servers • Collaborative computing, e.g. shared whiteboard • Video, audio conferences • Distributed jam sessions • Replicated data management [Keidar, Dolev 96]

  16. Other Systems • Totem [Amir, Melliar-Smith, Moser, et al., 95] • Transitional views, useful with virtual synchrony • Horus [Birman, van Renesse, Maffeis 96] • Ensemble [Birman, Hayden 97] • Layered architecture • Composable building blocks • Phoenix, Consul, RMP, Newtop, RELACS, … • Partitionable

  17. Service Specifications • Precise specifications needed for GC services • Help application programmers write programs that use the services correctly, effectively • Help system maintainers make changes correctly • Safety, performance, fault-tolerance • But difficult: • Many different services; different guarantees about membership, reliability, ordering • Complicated • Specs based on implementations might not be optimal for application programmers.

  18. Early Work on GC Service Specs • [Ricciardi 92] • [Jahanian, Fakhouri, Rajkumar 93] • [Moser, Amir, Melliar-Smith, Agrawal 94] • [Babaoglu et al. 95, 98] • [Friedman, van Renesse 95] • [Hiltunen, Schlichting 95] • [Dolev, Malkhi, Strong 96] • [Cristian 96] • [Neiger 96] • Impossibility results [Chandra, Hadzilacos, et al. 96] • But still difficult…

  19. II. Our Approach

  20. Approach • Model everything: • Applications: requirements, algorithms • Service specs: work backwards, see what the applications need • Implementations of the services • State, prove correctness theorems: • For applications, implementations. • Methods: composition, invariants, simulation relations • Analyze performance, fault-tolerance. • Layered proofs, analyses [Figure: layered stack of Application, Service, Algorithm models]

  21. Math Foundation: I/O Automata • Nondeterministic state machines • Not necessarily finite-state • Input/output/internal actions (signature) • Transitions, executions, traces • System modularity: • Composition, respecting traces • Levels of abstraction, respecting traces • Language-independent, math model
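The I/O automaton idea can be made concrete with a tiny executable sketch. This is my own illustration, not from the talk: a FIFO channel automaton with input action send(m) and output action receive(m), where inputs are always enabled and outputs have preconditions; the trace is the sequence of external actions.

```python
# Hedged, minimal sketch of an I/O automaton: a FIFO channel with
# input action send(m) and output action receive(m). Class and method
# names are illustrative; the formal model is language-independent.

class ChannelAutomaton:
    def __init__(self):
        self.queue = []                     # the automaton's state

    def send(self, m):
        # Input action: always enabled (the environment controls inputs).
        self.queue.append(m)

    def receive_enabled(self):
        # Precondition of the output action receive(m).
        return bool(self.queue)

    def receive(self):
        assert self.receive_enabled()       # fire only when enabled
        return self.queue.pop(0)

# An execution alternates states and actions; the trace records the
# externally visible (input/output) actions only.
chan = ChannelAutomaton()
trace = []
for m in ["a", "b", "c"]:
    chan.send(m)
    trace.append(("send", m))
while chan.receive_enabled():
    trace.append(("receive", chan.receive()))
print(trace)
```

Composition would connect this automaton's receive output to another automaton's input of the same name; trace inclusion is then checked over these external actions.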

  22. Typical Examples Modeled • Distributed algorithms • Communication protocols • Distributed data management systems

  23. Modeling Style • Describe interfaces, behavior • Program-like behavior descriptions: • Precondition/effect style • Pseudocode or IOA language • Abstract models for algorithms, services • Model several levels of abstraction: • High-level, global service specs … • Detailed distributed algorithms

  24. Modeling Style • Very nondeterministic: • Constrain only what must be constrained. • Simpler • Allows alternative implementations

  25. Describing Timing Features • TIOAs [Lynch, Vaandrager 93] • For describing: • Timeout-based algorithms. • Clocks, clock synchronization • Performance properties

  26. Describing Failures • Basic or timed I/O automata, with fail, recover input actions. • Included in traces, can use them in specs. [Figure: automaton with fail and recover input actions]

  27. Describing Other Features • Probabilistic behavior: PIOAs [Segala 95] • For describing: • Systems with combination of probabilistic + nondeterministic behavior • Randomized distributed algorithms • Probabilistic assumptions on environment • Dynamic systems: DIOAs [Attie, Lynch 99] • For describing: • Run-time process creation and destruction • Mobility • Agent systems [NTT collaboration]

  28. Using I/O Automata (General) • Specify systems precisely • Validate designs: • Simulation • State, prove correctness theorems • Analyze performance • Generate validated code • Study theoretical upper and lower bounds

  29. Using I/O Automata for Group Communication Systems • Use for global services + distributed algorithms • Define safety properties separately from performance/fault-tolerance properties. • Safety: • Basic I/O automata; trace properties • Performance/fault-tolerance: • Timed I/O automata with failure actions; timed trace properties

  30. III. Projects and Results

  31. Projects 1. View Synchrony 2. Ensemble 3. Dynamic Views 4. Scalable Group Communication

  32. 1. View Synchrony (VS) [Fekete, Lynch, Shvartsman 97, 00] Goals: • Develop prototypes: • Specifications for typical GC services • Descriptions for typical GC algorithms • Correctness proofs • Performance analyses • Design simple math foundation for the area. • Try out, evaluate our approach.

  33. View Synchrony What we did: • Talked with system developers (Isis, Transis) • Defined I/O automaton models for: • VS, prototype partitionable GC service • TO, non-view-oriented totally ordered bcast service • VStoTO, application algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser] • Proved correctness • Analyzed performance/fault-tolerance.

  34. VStoTO Architecture [Figure: VStoTO components between the TO interface (bcast, brcv) above and the VS service (gpsnd, gprcv, newview) below]

  35. TO Broadcast Specification Delivers messages to everyone, in the same order. Safety: TO-Machine Signature: input: bcast(a,p); output: brcv(a,p,q); internal: to-order(a,p) State: queue, sequence of (a,p), initially empty; for each p: pending[p], sequence of a, initially empty; next[p], positive integer, initially 1

  36. TO-Machine Transitions: bcast(a,p) Effect: append a to pending[p] • to-order(a,p) Precondition: a is head of pending[p] Effect: remove head of pending[p]; append (a,p) to queue • brcv(a,p,q) Precondition: queue[next[q]] = (a,p) Effect: next[q] := next[q] + 1
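The TO-Machine above translates almost line-for-line into runnable code. A hedged Python rendering (the class name and the decision to return the received message from brcv are mine; the state variables and preconditions follow the slides):

```python
# Hedged Python rendering of the TO-Machine (precondition/effect style).
# State variables queue, pending, next follow the slides; brcv returns
# (a, p) here rather than carrying them as action parameters.

class TOMachine:
    def __init__(self, processes):
        self.queue = []                          # global total order of (a, p)
        self.pending = {p: [] for p in processes}
        self.next = {p: 1 for p in processes}    # 1-indexed, as on the slide

    def bcast(self, a, p):                       # input action
        self.pending[p].append(a)

    def to_order(self, a, p):                    # internal action
        assert self.pending[p] and self.pending[p][0] == a   # precondition
        self.pending[p].pop(0)                   # remove head of pending[p]
        self.queue.append((a, p))                # append (a, p) to queue

    def brcv(self, q):                           # output action brcv(a,p,q)
        assert self.next[q] <= len(self.queue)   # precondition: next message exists
        a, p = self.queue[self.next[q] - 1]
        self.next[q] += 1
        return (a, p)

to = TOMachine(["P", "Q"])
to.bcast("m1", "P"); to.bcast("m2", "Q")
to.to_order("m1", "P"); to.to_order("m2", "Q")
# Every process receives the messages in the same total order:
assert [to.brcv("P"), to.brcv("P")] == [to.brcv("Q"), to.brcv("Q")]
```

Note how nondeterminism survives the translation: nothing forces a particular interleaving of to_order steps, so any total order consistent with the preconditions is an allowed behavior, exactly as the spec intends.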

  37. Performance/Fault-Tolerance TO-Property(b,d,C): If C stabilizes, then soon thereafter (time b), any message sent or received anywhere in C is received everywhere in C, within bounded time (time d). [Figure: timeline marking stabilize, send, receive, with intervals b and d]

  38. VS Specification • Partitionable view-oriented service • Safety: VS-Machine • Views presented in consistent order, possible gaps • Messages respect views • Messages in consistent order • Causality • Prefix property • Safe indication • Doesn’t guarantee Virtual Synchrony • Like TO-Machine, but per view

  39. Performance/Fault-Tolerance VS-Property(b,d,C): If C stabilizes, then soon thereafter (time b), views known within C become consistent, and messages sent in the final view v are delivered everywhere in C, within bounded time (time d). [Figure: timeline marking stabilize, newview(v), mcast(v), receive(v), with intervals b and d]

  40. VStoTO Algorithm • TO must deliver messages in order, no gaps. • VS delivers messages in order per view. • Problems arise from view changes: • Processes moving between views could have different prefixes. • Processes could skip views. • Algorithm: • Real work done in majority views only • Processes in majority views totally order messages, and deliver to clients messages that VS has said are safe. • At start of new view, processes exchange state, to reconcile progress made in different majority views.

  41. Correctness (Safety) Proof • Show composition of VS-Machine and VStoTO machines implements TO-Machine. • Trace inclusion • Use simulation relation proof: • Relate start states, steps of composition to those of TO-Machine • Invariants, e.g.: Once a message is ordered everywhere in some majority view, its order is determined forever. • Checked using PVS theorem-prover, TAME [Archer] [Figure: composition of VS-Machine and VStoTO machines implements TO-Machine]
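The simulation-relation technique can be illustrated on a much smaller pair of automata than VS/TO. A hedged sketch, entirely my own example and not from the proof: the spec is a FIFO-queue automaton, the implementation is the classic two-stack queue, and the relation R says the spec queue equals the implementation's front stack reversed plus its back stack. A randomized run checks that R is preserved by every step and that external outputs match.

```python
import random

# Hedged illustration of a simulation-relation check. SpecQueue is the
# abstract automaton, TwoStackQueue the implementation; both are my own
# toy example, standing in for TO-Machine and the VS/VStoTO composition.

class SpecQueue:
    def __init__(self): self.queue = []
    def enq(self, x): self.queue.append(x)
    def deq(self): return self.queue.pop(0)

class TwoStackQueue:
    def __init__(self): self.back, self.front = [], []
    def enq(self, x): self.back.append(x)
    def deq(self):
        if not self.front:              # internal action: shift the stacks
            while self.back:
                self.front.append(self.back.pop())
        return self.front.pop()

def related(spec, impl):
    # The simulation relation R: spec state is recoverable from impl state.
    return spec.queue == impl.front[::-1] + impl.back

random.seed(0)
spec, impl = SpecQueue(), TwoStackQueue()
for step in range(1000):
    if spec.queue and random.random() < 0.5:
        assert spec.deq() == impl.deq()  # matching external outputs
    else:
        spec.enq(step); impl.enq(step)   # same input action on both
    assert related(spec, impl)           # R holds after every step
print("simulation relation held for 1000 steps")
```

A real simulation proof quantifies over all steps rather than sampling them; this is where invariants and a theorem prover such as PVS earn their keep.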

  42. Conditional Performance Analysis • Assume VS satisfies VS-Property(b,d,C): • If C stabilizes, then within time b, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within time d. • And VStoTO satisfies: • Simple timing and fault-tolerance assumptions. • Then TO satisfies TO-Property(b+d,d,C): • If C stabilizes, then within time b+d, any message sent or delivered anywhere in C is delivered everywhere in C, within time d.

  43. Conclusions: VS • Models for VS, TO, VStoTO • Proofs, performance/fault-tolerance analyses • Tractable, understandable, modular • [PODC 97], [TOCS 00] • Follow-on work: • Algorithm for VS [Fekete, Lesley] • Load balancing using VS [Khazan] • Models for other Transis algorithms [Chockler] • But: VS is only a prototype; lacks some key features, like Virtual Synchrony • Next: Try a real system!

  44. 2. Ensemble [Hickey, Lynch, van Renesse 99] Goals: • Try, evaluate our approach on a real system • Develop techniques for modeling, verifying, analyzing more features of GC systems, including Virtual Synchrony • Improve on prior methods for system validation

  45. Ensemble • Ensemble system [Birman, Hayden 97] • Virtual Synchrony • Layered design, building blocks • Coded in ML [Hayden] • Prior verification work for Ensemble and predecessors: • Proving local properties using Nuprl [Hickey] • [Ricciardi], [Friedman]

  46. Ensemble • What we did: • Worked with developers • Followed VS example • Developed global specs for key layers: • Virtual Synchrony • Total Order with Virtual Synchrony • Modeled Ensemble algorithm spanning between layers • Attempted proof; found logical error in state exchange algorithm (repaired) • Developed models, proofs for repaired system

  47. Conclusions: Ensemble • Models for two layers, algorithm • Tractable, easily understandable by developers • Error, proofs • Low-level models similar to actual ML code (4 to 1) • [TACAS 99] • Follow-on: • Same error found in Horus. • Incremental models, proofs [Hickey] • Next: Use our approach to design new services.

  48. 3. Dynamic Views [De Prisco, Fekete, Lynch, Shvartsman 98] Goals: • Define GC services that cope with both: • Long-term changes: • Permanent failures, new joins • Changes in the “universe” of processes • Transient changes • Use these to design consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.

  49. Dynamic Views • Many applications with strong consistency requirements make progress only in primary views: • Consistent replicated data management • Totally ordered broadcast • Can use static notion of allowable primaries, e.g., majorities of universe, quorums: • All intersect. • Only one exists at a time. • Information can flow from each to the next. • But: Static notion not good for long-term changes
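The intersection property of the static notion is easy to check exhaustively on a small universe. A hedged sketch, my own illustration: enumerate every majority subset of a five-process universe and verify that all pairs intersect, which is why at most one primary can exist at a time.

```python
from itertools import combinations

# Hedged illustration of the static-primary property: any two majorities
# of a fixed universe intersect. The universe is an arbitrary example.

universe = {"A", "B", "C", "D", "E"}

def is_majority(s):
    return len(s) > len(universe) / 2

# Enumerate every majority subset of the universe.
majorities = [set(c)
              for r in range(1, len(universe) + 1)
              for c in combinations(sorted(universe), r)
              if is_majority(set(c))]

# Every pair of majorities shares at least one process.
assert all(m1 & m2 for m1 in majorities for m2 in majorities)
print(len(majorities), "majority sets, all pairwise intersecting")
```

The same check works for any quorum system: replace is_majority with the quorum predicate and the pairwise-intersection assertion is exactly the quorum property.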

  50. Dynamic Views • For long-term changes, want dynamic notion of allowable primaries. • E.g., each primary might contain a majority of the previous one. • But: Some might not intersect. Makes it hard to maintain consistency.
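The failure of intersection under the dynamic notion can be shown with a short counterexample. A hedged sketch, my own construction: a chain of views in which every view contains a strict majority of its predecessor, yet the last view is disjoint from the first.

```python
# Hedged counterexample: each primary view contains a majority of the
# previous one, yet the first and last primaries are disjoint. The chain
# itself is my own construction, not from the talk.

chain = [
    {"A", "B", "C", "D", "E"},
    {"C", "D", "E", "F", "G"},   # shares {C, D, E}: 3 of 5
    {"E", "F", "G", "H", "I"},   # shares {E, F, G}: 3 of 5
    {"G", "H", "I", "J", "K"},   # shares {G, H, I}: 3 of 5
]

# Each view contains a strict majority of its predecessor...
for prev, cur in zip(chain, chain[1:]):
    assert len(prev & cur) > len(prev) / 2

# ...yet the first and last primaries do not intersect at all.
assert chain[0] & chain[-1] == set()
print("disjoint primaries:", sorted(chain[0]), sorted(chain[-1]))
```

With no common member, information cannot flow directly from the first primary to the last, which is the consistency hazard the slide points out.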
