The Process Group Approach to Reliable Distributed Computing

Kenneth P. Birman, Cornell University, CS Dept.
Presented by Constantin Serban, R.U.

Assumptions
Problem
• The traditional communication primitives are generally reliable, but exhibit problematic semantics during transient failures and system configuration changes
Approach
• Use a “glue” software layer with predictable, fault-tolerant, flexible, and reliable behavior
Solution
• Use process groups along with group programming tools to achieve this goal

Paper Structure
• Desirable system example
• Process group models and requirements
• Conventional communication primitives: common pitfalls
• Failure assumptions
• Group support requirements
• Close synchrony versus virtual synchrony
• The ISIS toolkit and ISIS-based utilities
• Potential applications of ISIS
• Conclusion

System example
Brokerage and trading systems must integrate large numbers of demanding applications and react in a timely manner to high volumes of information.
Goal: reliability, security, flexibility, availability, uniformity, etc.
• Information backplane requirements
• Publish/subscribe model
• Naming structure
• Communication interface

• Access restrictions
• Selective history mechanism
• Customization
• Systematic organization
• Flexible stream connection at runtime
• Hierarchical structure
System must checkpoint, replicate on independent machines, and activate a backup on failure.
Solution: a distributed group of cooperating programs that adapts transparently to failures and recoveries

Process Group Types: Anonymous
Publish/subscribe model. Properties:
• Interaction exclusively through the group address (no membership knowledge)
• Exactly-once delivery to all or none of the subscribers
• Messages delivered in an order consistent with causal dependencies
• Logging of key events for history
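The causal-ordering property above can be illustrated with vector timestamps. What follows is only a minimal sketch under stated assumptions, not the mechanism ISIS itself uses: the class name `CausalReceiver`, the fixed group size `n`, and the convention that each message carries the sender's vector timestamp are all hypothetical choices made for illustration. A message is held back until every message it causally depends on has been delivered.

```python
from dataclasses import dataclass, field


@dataclass
class CausalReceiver:
    """Hypothetical receiver that delivers messages in causal order."""
    n: int                                        # assumed fixed group size
    clock: list = field(init=False)               # messages delivered per sender
    pending: list = field(default_factory=list)   # messages held back
    delivered: list = field(default_factory=list)

    def __post_init__(self):
        self.clock = [0] * self.n

    def _deliverable(self, sender, ts):
        # The message must be the next one from its sender, and the sender
        # must not have seen anything that we have not yet delivered.
        return ts[sender] == self.clock[sender] + 1 and all(
            ts[k] <= self.clock[k] for k in range(self.n) if k != sender
        )

    def receive(self, sender, ts, payload):
        self.pending.append((sender, ts, payload))
        progress = True
        while progress:                           # retry held-back messages
            progress = False
            for msg in list(self.pending):
                s, t, p = msg
                if self._deliverable(s, t):
                    self.pending.remove(msg)
                    self.clock[s] = t[s]
                    self.delivered.append(p)
                    progress = True


# Process 0 sends m1; process 1 sees m1 and then sends m2.
# Even if m2 arrives first, it is held back until m1 has been delivered.
r = CausalReceiver(n=3)
r.receive(1, [1, 1, 0], "m2")
r.receive(0, [1, 0, 0], "m1")
print(r.delivered)
```

The sketch covers ordering only; exactly-once, all-or-none delivery would need additional machinery that is not shown here.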

Process Group Types: Explicit group
Members cooperate directly, with explicit membership knowledge. Additional needs:
• Support for group communication, addressing, failure atomicity, message delivery ordering
• Use of membership as input
• Synchronization of shared information

Common communication primitives
• Datagrams: unreliable; subject to message loss, duplicates, and out-of-order messages
• RPC: reliable, sequenced message delivery. Fuzzy during failure: unable to distinguish between failure and delay, or to determine the precise moment of failure
• Reliable data streams: outperform RPC because of pipelining. Still no guarantees regarding consistent channel breaks and their subsequent handling

Failure Model Assumptions
Fail-stop model:
• Processes or processors fail by halting, with no erroneous actions
• Integration of the transport layer (TCP-like) with the failure detection layer
• Timeouts may trigger false failure detections
• A system membership list of processes is maintained; a non-responsive process is dropped from the list and forced to shut down or reconnect
• Rejoining processes are treated as completely new entities
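The membership rules above can be sketched as a small timeout-based membership list. This is an illustrative sketch only, assuming heartbeats and an explicit clock value passed in by the caller; the class `Membership` and its methods `join`, `heartbeat`, and `sweep` are hypothetical names, not part of ISIS.

```python
import itertools


class Membership:
    """Hypothetical membership list with timeout-based failure detection."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}               # member id -> time of last heartbeat
        self._ids = itertools.count(1)    # fresh incarnation number per join

    def join(self, name, now):
        # Every join, including a rejoin, creates a completely new entity.
        member = (name, next(self._ids))
        self.last_seen[member] = now
        return member

    def heartbeat(self, member, now):
        # A dropped member is not silently readmitted; it must rejoin.
        if member not in self.last_seen:
            return False
        self.last_seen[member] = now
        return True

    def sweep(self, now):
        # Drop every member that has been silent for longer than the timeout.
        # A slow but live process is dropped too (a false failure detection).
        dropped = [m for m, t in self.last_seen.items()
                   if now - t > self.timeout]
        for m in dropped:
            del self.last_seen[m]
        return dropped


group = Membership(timeout=5)
a = group.join("a", now=0)
b = group.join("b", now=0)
group.heartbeat(a, now=4)
print(group.sweep(now=6))         # b has been silent for too long
print(group.heartbeat(b, now=7))  # the old identity of b is rejected
print(group.join("b", now=7))     # b returns as a new entity
```

Passing `now` explicitly keeps the sketch deterministic; a real detector would read a clock and would also have to notify the remaining members of each change.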

Group Support: Addressing
