Explore new building blocks for dynamic distributed systems, including global services and algorithms. Discuss a Middleware Service Catalog to document middleware services' specifications, promoting discussion and formal analysis. Learn about high-performance, fault-tolerant distributed systems through rigorous mathematical models. Discover current subprojects and collaborators working on scalable group communication, dynamic atomic broadcast, and more.
New Directions for NEST Research Nancy Lynch MIT NEST Annual P.I. Meeting Bar Harbor, Maine July 12, 2002 …
My Group’s Work and NEST • New building blocks (global services and distributed algorithms) for dynamic, fault-prone, distributed systems. • Interacting state machine semantic models, including timing, hybrid continuous/discrete, probabilistic behavior. Composition, abstraction. • Formal methods/tools to support reasoning about distributed systems: Conditional performance analysis methods, IOA language/tools. • System modeling.
A Suggestion for NEST Research: a Middleware Service Catalog • Virtually everyone here is developing middleware services: • Clock synchronization, location services, routing, reliable communication, consensus, group membership, group communication, object management services, publish-subscribe, network surveillance, reconfiguration, authentication, key distribution,… • But it’s not always obvious exactly what these services guarantee: • API, functionality, conditional performance guarantees, fault-tolerance guarantees
Middleware Service Catalog • Idea: Create and maintain a catalog of specifications for NEST middleware services. • High-level descriptions of requirements • Assumptions and guarantees • API, functionality, conditional performance, fault-tolerance • Formal, informal • Models of the distributed algorithms used in the various implementations. • Claims about the properties satisfied by the algorithms. • Models for the underlying platforms. • Why this would be useful: • Another kind of output, complementary to demos. • Basis for discussion/clarification/comparison. • Will help bring implementations together. • Basis for formal analysis. • Can help in developing algorithmic theory for NEST-like systems.
Building Blocks for High-Performance, Fault-Tolerant Distributed Systems Nancy Lynch MIT NEST Annual P.I. Meeting Bar Harbor, Maine July 12, 2002 …
Our Current Project (NSF-ITR and AFOSR) • Design and analyze building blocks for computing in highly dynamic distributed settings: • Global service specifications • Distributed algorithms that implement them • Dynamic systems: • Internet, mobile computing • Joins, leaves, failures • Contrast: Traditional theory of distributed systems deals mostly with static systems, with fixed sets of processes. (Diagram labels: Service, Net)
Our Project • We present everything rigorously, using mathematical interacting state machine models (I/O automata). • Formal service specifications • Formal algorithm descriptions • Formal models for applications • Prove correctness, using invariants and simulation relations • Analyze performance, fault-tolerance • Develop supporting theory • Apply the theory to software systems (Diagram label: Net)
Current Subprojects • Scalable group communication [Khazan, Keidar, Lynch, Shvartsman] • Dynamic Atomic Broadcast [Bar-Joseph, Keidar, Lynch] • Reconfigurable Atomic Memory [Lynch, Shvartsman] • Communication protocols [Livadas, Lynch, Keidar, Bakr] • Peer-to-peer computing [Lynch, Malkhi, Ratajczak, Stoica] • Fault-tolerant consensus [Keidar, Rajsbaum] • Foundations: [Lynch, Segala, Vaandrager, Kirli] • Applications: • Toy helicopter [Mitra, Wang, Feron] • Video streaming [Livadas, Nguyen, Zakhor] • Unmanned flight control [Ha, Kochocki, Tanzman] • Agent programming [Kawabe]
People • Project leader: Nancy Lynch • Postdocs: Idit Keidar, Dilsun Kirli • PhD students: Roger Khazan, Carl Livadas, Ziv Bar-Joseph, Rui Fan, Sayan Mitra, Seth Gilbert • MEng students: Omar Bakr, Matt Bachmann, Vida Ha • Other collaborators: Alex Shvartsman, Dahlia Malkhi, David Ratajczak, Ion Stoica, Sergio Rajsbaum, Roberto Segala, Frits Vaandrager, Yong Wang, Eric Feron, Thinh Nguyen, Avideh Zakhor, Joe Kochocki, Alan Tanzman, Yoshinobu Kawabe…
This talk: • Scalable Group Communication • Dynamic Atomic Broadcast • Reconfigurable Atomic Memory
1. Scalable Group Communication [Keidar, Khazan 00, 02] [Khazan 02] [K, K, Lynch, Shvartsman 02] (Diagram label: GCS)
Group Communication Services • Cope with changing participants using abstract groups of client processes with changing membership sets. • Processes communicate with group members indirectly, by sending messages to the group as a whole. • GC services support management of groups: • Maintain membership information. • Form new views in response to changes. • Manage communication. • Communication respects views. • Provide guarantees about ordering, reliability of message delivery. • Virtual synchrony • Systems: Isis, Transis, Totem, Ensemble,… (Diagram label: GCS)
Group Communication Services • Advantages: • High-level programming abstraction • Hides complexity of coping with changes • Disadvantages: • Can be costly, especially when forming new views. • May have problems scaling to large networks. • Applications: • Managing replicated data • Distributed multiplayer interactive games • Multi-media conferencing, collaborative work
New GC Service for WANs [Khazan] • New specification, including virtual synchrony. • New algorithm: • Uses separate scalable membership service, implemented on a small set of membership servers [Keidar, Sussman, Marzullo, Dolev]. • Multicast implemented on all the nodes. • View change uses only one round for state exchange, in parallel with membership service’s agreement on views. • Participants can join during view formation. (Diagram labels: GCS, Memb, Net)
New GC Service for WANs • Distributed implementation [Tarashchanskiy] • Safety proofs, using new incremental proof methods [Keidar, Khazan, Lynch, Shvartsman 00]. • Liveness proofs • Performance analysis • Analyze time from when network stabilizes until GCS announces new views. • Analyze message latency. • Conditional analysis, based on input, failure, and timing assumptions. • Compositional analysis, based on performance of Membership Service and Net. • Also modeled and analyzed data-management application running on top of the new GCS. (Diagram labels: S, S’, A, A’)
2. Early-Delivery Dynamic Atomic Broadcast [Bar-Joseph, Keidar, Lynch, DISC 02] (Diagram label: DAB)
Dynamic Atomic Broadcast • Atomic broadcast with latency guarantees, in a dynamic setting where processes may join, leave, or fail. • We define the DAB problem, and present and analyze a new distributed algorithm to solve it. • In the absence of failures: Constant latency, even when participants join and leave. • With failures: Latency linear in the number of failures. • Uses a new distributed consensus service, in which participants do not know who the other participants are. • We define the CUP problem, and present and analyze a new algorithm to solve it. • Algorithm improves upon previously-suggested algorithms using group communication.
The DAB Problem • Problem: Guarantee participants receive consistent sequences of messages. Fast delivery, even with joins, leaves. • Safety: Sending, receiving orders are consistent with a single global message ordering S. No gaps. • Liveness: Eventual join-ack, leave-ack. Eventual delivery, including the first message the process itself sends. • Application: Distributed multiplayer interactive games. (Diagram: DAB service with actions join, join-ack, leave, leave-ack, mcast(m), rcv(m))
Implementing DAB • Processes: • Timing-dependent, have approximately-synchronized clocks. • Net: • Dynamic network, pairwise FIFO delivery • Low latency • Does not guarantee a single total order, nor that all processes see the same messages from a failing process. (Diagram labels: DAB, join, net-join, Net)
Implementing DAB • Key difficulties: • Network doesn’t guarantee a single total order. • Different processes may receive different final messages from a failed process. • So, processes coordinate message delivery: • Divide time into slots using local clock, assign each message to a slot. • Deliver messages in order of (slot, sender id). • Determine members of each slot, deliver only from members. • Processes must agree on slot membership • Joining (leaving) process selects join-slot (leave-slot), informs other processes. • Failed process triggers consensus.
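The slot-based delivery rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm from the paper: the names `Msg`, `slot_of`, and `delivery_order` are hypothetical, slot membership is passed in as an already-agreed dictionary, and joins, leaves, and failure handling are left out.

```python
from typing import Dict, List, NamedTuple, Set


class Msg(NamedTuple):
    slot: int      # slot assigned by the sender from its local clock
    sender: int    # sender id, used to break ties within a slot
    payload: str


def slot_of(local_time: float, slot_length: float) -> int:
    """Map a local clock reading to a slot number (assumed fixed-length slots)."""
    return int(local_time // slot_length)


def delivery_order(msgs: List[Msg], members: Dict[int, Set[int]]) -> List[Msg]:
    """Keep only messages whose sender is a member of the message's slot,
    then order the survivors by (slot, sender id)."""
    eligible = [m for m in msgs if m.sender in members.get(m.slot, set())]
    return sorted(eligible, key=lambda m: (m.slot, m.sender))
```

Because every process applies the same deterministic sort to the same agreed membership, all processes would deliver the same sequence. That is why agreement on slot membership is the hard part.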
Using Consensus for DAB • When process j fails, a consensus service is used to agree on j’s failure slot. • Requires a new kind of consensus service, which: • Does not assume participants are known a priori; lets each participant say who it thinks the other participants are. • Allows processes to abstain. • Example: i joins around when consensus starts. j1 thinks i is participating, j2 thinks not. i cannot participate as usual, because j2 ignores it, but cannot be silent, because j1 waits for it. So i abstains. • We define new Consensus with Unknown Participants (CUP) service. • Use separate CUP(j) service to decide on failure slot for j.
The DAB Algorithm Using CUP (Diagram: DAB composed of DABi1, DABi2, CUP(j), and Net, with fail inputs)
The CUP Problem • Guarantees agreement, validity, termination. • Assumes submitted worlds are “close”: • Process that initiates is in other processes’ worlds • Process in anyone’s world initiates, abstains, leaves, or fails. (Diagram: CUP service with actions init(v,W), decide(v), abstain, leave, leave-detect(j), fail-detect(j))
The CUP Algorithm • We give a new early-stopping consensus algorithm. • Similar to previous algorithms, e.g., [Dolev, Reischuk, Strong 90]. • But tolerates: • Uncertainty about participants • Processes leaving • Terminates in two rounds when failures stop (even if leaves continue). • Latency linear in number of actual failures. (Diagram labels: CUP, Net)
The DAB Algorithm Using CUP (Diagram: DAB composed of DABi1, DABi2, CUP(j1), and Net)
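One possible reading of the “close worlds” assumption can be written as a small check. This is a sketch of an interpretation, not the formal definition from the paper: `worlds_are_close` is a hypothetical name, `worlds` maps each initiating process to the world W it submitted with init(v,W), and `inactive` is the set of processes that abstain, leave, or fail.

```python
from typing import Dict, Set


def worlds_are_close(worlds: Dict[str, Set[str]], inactive: Set[str]) -> bool:
    """Check the two closeness conditions for a set of submitted worlds."""
    initiators = set(worlds)

    # Condition 1: a process that initiates is in every other initiator's world.
    for p in initiators:
        for q, world in worlds.items():
            if q != p and p not in world:
                return False

    # Condition 2: a process in anyone's world initiates, abstains, leaves, or fails.
    mentioned = set().union(*worlds.values())
    return mentioned <= (initiators | inactive)
```

In the earlier example, j1 includes i in its world and j2 does not. The check passes only if i is accounted for in `inactive`, which matches the reason i must abstain rather than stay silent.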
Discussion: DAB • Modular: DAB algorithm, CUP, Network • Modularity needed for keeping the complexity under control. • Initial presentation was intertwined, not modular. • Correctness of CUP (agreement, validity, termination) used to prove correctness of DAB (atomic broadcast safety and liveness guarantees). • Latency bounds for CUP used to prove latency bounds for DAB.
3. Reconfigurable Atomic Memory for Basic Objects [Lynch, Shvartsman, DISC 02] (Diagram label: RAMBO)
RAMBO • Defined new service: Reconfigurable Atomic Memory for Basic Objects (dynamic atomic read/write shared memory). • Developed new, efficient, modular distributed algorithm to implement RAMBO. • Highly survivable; tolerates joins, leaves, failures. • Tolerates short-term changes by using quorums. • Tolerates long-term changes by reconfiguring. • Reconfigures on-the-fly; no heavyweight view change. • Maintains atomicity across configuration changes. • Can be used in mobile or peer-to-peer settings. • Applications: Battle data for teams of soldiers, game data for players in multiplayer game.
Static Quorum-Based Atomic Read/Write Memory Implementation [Attiya, Bar-Noy, Dolev] • Read, Write use two phases: • Phase 1: Read (value, tag) from a read-quorum • Phase 2: Write (value, tag) to a write-quorum • Write determines largest tag in phase 1, picks a larger one, writes new (value, tag) in phase 2. • Read determines latest (value, tag) in phase 1, propagates it in phase 2, then returns the value. • Could return unconfirmed value after phase 1. • Highly concurrent. • Quorum intersection property implies atomicity.
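The two-phase pattern can be sketched as a single-process simulation. The assumptions are strong: replicas are plain in-memory objects, quorums are passed in explicitly as lists, calls are sequential, and there are no messages or failures. The names `Replica`, `query`, `propagate`, `write`, and `read` are hypothetical, and a tag is modeled as a (sequence number, writer id) pair.

```python
class Replica:
    """One quorum member holding the latest (tag, value) it has seen."""

    def __init__(self):
        self.tag = (0, 0)   # (sequence number, writer id), compared lexicographically
        self.value = None


def query(read_quorum):
    """Phase 1: return the largest tag and its value found in a read-quorum."""
    best = max(read_quorum, key=lambda r: r.tag)
    return best.tag, best.value


def propagate(write_quorum, tag, value):
    """Phase 2: install (tag, value) on every write-quorum member that is behind."""
    for r in write_quorum:
        if tag > r.tag:
            r.tag, r.value = tag, value


def write(read_quorum, write_quorum, writer_id, value):
    """Find the largest tag, pick a larger one, then write the new pair."""
    (seq, _), _ = query(read_quorum)
    propagate(write_quorum, (seq + 1, writer_id), value)


def read(read_quorum, write_quorum):
    """Find the latest pair, propagate it, then return the value."""
    tag, value = query(read_quorum)
    propagate(write_quorum, tag, value)
    return value
```

With three replicas, a write that uses replicas 1 and 2 as its write-quorum and a later read that uses replicas 0 and 1 as its read-quorum meet at replica 1, so the read sees the written value. This is the quorum intersection property at work.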
How to make this dynamic? • Quorum members may join, leave, fail; need to reconfigure. • Idea: Any member of current quorum configuration can propose a new configuration. • Questions: • How to agree on new configuration? • How to install it? • How to preserve atomicity of data during reconfiguration? • How to avoid stopping Reads/Writes in progress?
Our RAMBO Algorithm • Uses a separate reconfiguration service, Recon. (Diagram labels: read, write, recon, new-config, Recon, Net)
Recon Using Consensus • Recon service uses (static) consensus services to determine new configurations 1, 2, 3,… • Consensus is a fairly heavyweight mechanism, but: • Only used for reconfigurations, which are presumably infrequent. • Does not delay Read/Write operations (unlike GCS approaches). (Diagram labels: recon, recon-ack, Recon, Consensus, Net)
Consensus Implementation • Use a variant of the Paxos algorithm [Lamport] • Agreement, validity guaranteed absolutely. • Termination guaranteed when underlying system stabilizes. • Leader chosen using failure detectors; conducts two-phase algorithm with retries. (Diagram: Consensus service with actions init(v), decide(v))
Read/Write Algorithm using Recon • Read/write processes run two-phase static quorum-based algorithm, using current configuration. • Use gossiping and fixed point tests rather than highly structured communication. • When Recon provides a new configuration, R/W uses both the old and the new configuration. • Do not abort R/W in progress, but do extra work to access additional processes needed for new quorums. (Diagram labels: read, write, new-config, Recon, Net)
Removing Old Configurations • Read/Write algorithm removes old configurations by garbage-collecting them in the background. • Two-phase garbage-collection procedure: • Phase 1: Inform write-quorum of old configuration about the new configuration. Collect latest value from read-quorum of old configuration. • Phase 2: Inform write-quorum of new configuration about latest value. • Garbage-collection concurrent with Reads/Writes. • Implemented using gossiping and fixed points.
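The two garbage-collection phases can be sketched sequentially. This simplifies heavily and is not the RAMBO implementation, which uses gossiping and fixed-point tests and runs concurrently with Reads and Writes. The names `Member` and `garbage_collect` are hypothetical, and quorums are given explicitly as lists of in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    tag: tuple = (0, 0)                               # (sequence number, writer id)
    value: object = None
    known_configs: set = field(default_factory=set)   # configuration ids this member knows


def garbage_collect(old_read_q, old_write_q, new_write_q, new_config_id):
    """Retire an old configuration in two phases; returns the (tag, value) carried over."""
    # Phase 1: inform a write-quorum of the old configuration about the new one,
    # and collect the latest value from a read-quorum of the old configuration.
    for m in old_write_q:
        m.known_configs.add(new_config_id)
    latest = max(old_read_q, key=lambda m: m.tag)
    tag, value = latest.tag, latest.value

    # Phase 2: inform a write-quorum of the new configuration about the latest value.
    for m in new_write_q:
        if tag > m.tag:
            m.tag, m.value = tag, value
    return tag, value
```

After both phases, any operation that still contacts the old configuration learns about the new one from some old write-quorum member, and the new configuration already holds the latest value. On that basis the old configuration can be dropped.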
Discussion: RAMBO • Highly modular: R/W algorithm, Recon service, Consensus, Leader election, Network • Modularity needed for keeping the complexity under control. • Correctness proofs: • Atomicity of Reads and Writes • Latency bounds: • For reading, writing, garbage-collection. • Under various assumptions about timing, joins, failures, and rate of reconfiguration. • LAN implementations begun.
Hybrid I/O Automata (HIOA) [Lynch, Segala, Vaandrager 01, 02] • Mathematical model for hybrid (continuous/discrete) system components. • Discrete actions, continuous trajectories • Supports composition, levels of abstraction. • Case studies: • Automated transportation systems • Quanser helicopter system [Mitra, Wang, Feron, Lynch] (Diagram labels: P, A, S, C)
Timed I/O Automata, Probabilistic,… • Timed I/O Automata [Lynch, Segala, Vaandrager, Kirli]: • For modeling and analyzing timing-based systems, e.g., most of the building blocks of our AFOSR project. • Support composition, abstraction. • Collecting ideas from many research papers. • Probabilistic I/O automata [Lynch, Segala, Vaandrager]: • For modeling systems with random behavior. • Composition, abstraction aspects still need development. • Need to be combined with timed/hybrid models.
Conclusions • Three main building blocks (services and algorithms) for dynamic systems: • Scalable Group Communication • Dynamic Atomic Broadcast • Reconfigurable Atomic Memory • Auxiliary building blocks: group membership, Consensus with Unknown Participants, reconfiguration • Much remains to be done, to produce a “complete” set of useful building blocks for dynamic systems, and a good algorithmic theory for this area. • Connections with NEST?