The Process Group Approach to Reliable Distributed Computing
Kenneth P. Birman, Cornell University, CS Dept.
Presented by Constantin Serban, R.U.
Assumptions
Problem
• Traditional communication primitives are generally reliable, but exhibit problematic semantics during transient failures and system configuration changes
Approach
• Use a “glue” software layer with predictable, fault-tolerant, flexible, and reliable behavior
Solution
• Use process groups along with group programming tools to achieve this goal
Paper Structure
• Desirable system example
• Process group models and requirements
• Conventional communication primitives: common pitfalls
• Failure assumptions
• Group support requirements
• Close synchrony versus virtual synchrony
• The ISIS toolkit and ISIS-based utilities
• Potential applications of ISIS
• Conclusion
System example
Brokerage and trading systems must integrate large numbers of demanding applications and react in a timely manner to high volumes of information.
Goal: reliability, security, flexibility, availability, uniformity, etc.
Information backplane requirements:
• Publish/subscribe model
• Naming structure
• Communication interface
• Access restrictions
• Selective history mechanism
• Customization
• Systematic organization
• Flexible stream connection at runtime
• Hierarchical structure
The system must checkpoint its state, replicate it on independent machines, and activate a backup on failure.
Solution: a distributed group of cooperating programs that adapts transparently to failures and recoveries
Process Group Types: Anonymous
Publish/subscribe model. Properties:
• Interaction exclusively through the group address (no membership knowledge)
• Exactly-once delivery to all or none of the subscribers
• Messages delivered in an order consistent with causal dependencies
• Logging of key events for history
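The causal-ordering property can be illustrated with a small sketch. It uses the common vector-timestamp technique for causal delivery; it is not claimed to be how the toolkit discussed in these slides implements it, and the names `Message` and `CausalReceiver` are hypothetical. It assumes a fixed group of `n` processes and that every message carries its sender's vector clock.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int      # index of the sending process (assumed fixed group of n)
    clock: tuple     # sender's vector clock at send time
    payload: str


@dataclass
class CausalReceiver:
    """Delivers messages in an order consistent with causal dependencies."""
    n: int                                        # number of processes in the group
    delivered: list = field(default_factory=list)
    pending: list = field(default_factory=list)

    def __post_init__(self):
        self.seen = [0] * self.n                  # messages delivered so far, per sender

    def _deliverable(self, m):
        # Must be the next message from its sender...
        if m.clock[m.sender] != self.seen[m.sender] + 1:
            return False
        # ...and everything it causally depends on must already be delivered.
        return all(m.clock[k] <= self.seen[k]
                   for k in range(self.n) if k != m.sender)

    def receive(self, m):
        self.pending.append(m)
        progress = True
        while progress:                           # one delivery may unblock others
            progress = False
            for q in list(self.pending):
                if self._deliverable(q):
                    self.pending.remove(q)
                    self.seen[q.sender] += 1
                    self.delivered.append(q.payload)
                    progress = True
```

For example, if process 1 sends "trade" after having delivered process 0's "quote", a receiver that gets "trade" first holds it back until "quote" arrives, then delivers both in causal order.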
Process Group Types: Explicit group
Members cooperate directly, with explicit membership knowledge. Additional needs:
• Support for group communication, addressing, failure atomicity, message delivery ordering
• Use of membership as input
• Synchronization of shared information
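"Use of membership as input" can be pictured as follows: if every member sees the same membership view, each can compute its own share of the work locally, with no extra coordination messages. This is a minimal sketch under that assumption; `my_share` and `reassign_after_failure` are hypothetical helper names, not part of any toolkit.

```python
def my_share(view, me, tasks):
    """Return the tasks this member handles, given the agreed membership view."""
    members = sorted(view)             # every member sorts the same view identically
    rank = members.index(me)           # this member's position in the view
    return [t for i, t in enumerate(tasks) if i % len(members) == rank]


def reassign_after_failure(view, failed, me, tasks):
    """Recompute this member's share once a view without the failed member is installed."""
    new_view = [m for m in view if m != failed]
    return my_share(new_view, me, tasks)
```

With view `["a", "b", "c"]` and six tasks, each member takes two; once `"c"` is removed from the view, `"a"` and `"b"` each recompute a share of three. The scheme only works if all survivors agree on the view, which is exactly the consistency the group layer is expected to provide.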
Common communication primitives
• Datagrams: unreliable; messages may be lost, duplicated, or delivered out of order
• RPC: reliable, sequenced message delivery. Ambiguous during failures: the caller cannot distinguish a failure from a delay, or determine the precise moment of failure
• Reliable data streams: outperform RPC because of pipelining, but still give no guarantees about consistent channel breaks and their subsequent handling
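The datagram problems listed above (loss, duplicates, reordering) are the ones sequence numbers are typically used to mask. The sketch below is a hypothetical `Resequencer`, assuming the sender numbers datagrams from 0. It does not address the harder issue raised for RPC and streams, namely telling a crashed peer from a slow one.

```python
class Resequencer:
    """Turns a lossy, duplicating, reordering datagram feed into an in-order stream."""

    def __init__(self):
        self.next_seq = 0     # next sequence number to hand to the application
        self.buffer = {}      # out-of-order datagrams, keyed by sequence number

    def on_datagram(self, seq, data):
        """Accept one datagram; return the payloads that are now deliverable in order."""
        if seq < self.next_seq or seq in self.buffer:
            return []         # duplicate: already delivered or already buffered
        self.buffer[seq] = data
        out = []
        while self.next_seq in self.buffer:
            out.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return out

    def missing(self):
        """Sequence numbers to request again (gaps below the highest one seen)."""
        if not self.buffer:
            return []
        return [s for s in range(self.next_seq, max(self.buffer))
                if s not in self.buffer]
```

A receiver would call `on_datagram` for every arrival and periodically ask the sender to retransmit whatever `missing()` reports.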
Failure Model Assumptions
Fail-stop model:
• Processes or processors fail by halting, without taking erroneous actions
• The transport layer (TCP-like) is integrated with the failure detection layer
• Timeouts may trigger false failure detections
• The system maintains a membership list of processes; a non-responsive process is dropped from the list and forced to shut down or reconnect
• Rejoining processes are treated as completely new entities
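The membership rules above can be sketched as a timeout-based detector. This is an illustration only: the class `Membership`, its method names, and the use of an incarnation counter to make a rejoining process a "new entity" are assumptions, and time is passed in explicitly rather than read from a clock.

```python
import itertools


class Membership:
    """Timeout-based failure detection over a system membership list."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}                    # member id -> time of last heartbeat
        self._incarnation = itertools.count(1)  # fresh number for every join

    def join(self, name, now):
        # A rejoining process gets a fresh id, so it is treated as a new entity.
        member = (name, next(self._incarnation))
        self.last_heard[member] = now
        return member

    def heartbeat(self, member, now):
        if member not in self.last_heard:
            return False                        # dropped earlier: shut down or rejoin
        self.last_heard[member] = now
        return True

    def sweep(self, now):
        """Drop members silent for longer than the timeout; return those dropped."""
        dropped = [m for m, t in self.last_heard.items()
                   if now - t > self.timeout]
        for m in dropped:
            del self.last_heard[m]
        return dropped

    def view(self):
        return sorted(self.last_heard)
```

Note that `sweep` cannot tell a crashed process from a merely slow one, so a live member may be dropped by mistake. Once dropped, its old id is never accepted again, which keeps the rest of the system consistent with the fail-stop assumption.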