Fundamentals • What is a fault? • A fault is a blemish, weakness, or shortcoming of a particular hardware or software component. • Fault, error, and failure • Why fault tolerance? • Availability, reliability, dependability, … • How to provide fault tolerance? • Replication • Checkpointing and message logging • Hybrid
Message Logging • Tolerates crash failures • Each process periodically records its local state and logs the messages it receives thereafter • Once a crashed process recovers, its state must be consistent with the states of the other processes • Orphan processes • surviving processes whose states are inconsistent with the recovered state of a crashed process • Message logging protocols guarantee that upon recovery no process is an orphan
Message logging protocols • Pessimistic Message Logging • avoids the creation of orphans during execution • no process p sends a message m until it knows that all messages delivered before sending m are logged; quick recovery • can block a process for each message it receives, which slows down throughput • allows processes to communicate only from recoverable states; synchronously logs to stable storage any information that may be needed for recovery before allowing the process to communicate
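The pessimistic rule above ("log before you send") can be sketched in a few lines of Java. This is a minimal illustration that assumes a simple append-only file as stable storage; the class and method names are hypothetical and are not part of any of the systems described later.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class PessimisticLogger {
    private final Path logFile;
    private final List<String> unlogged = new ArrayList<>();   // delivered but not yet on stable storage

    public PessimisticLogger(Path logFile) { this.logFile = logFile; }

    // Called when a message is delivered to the application.
    public void onDeliver(String message) {
        unlogged.add(message);
    }

    // Called immediately before any send: force every delivered-but-unlogged message to disk,
    // so nothing the outgoing message depends on can be lost in a crash (no orphans).
    public void beforeSend() throws IOException {
        try (FileChannel ch = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String m : unlogged) {
                ch.write(ByteBuffer.wrap((m + System.lineSeparator()).getBytes()));
            }
            ch.force(true);   // the synchronous write that makes pessimistic logging slow but simple
        }
        unlogged.clear();
    }
}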
Message Logging • Optimistic Message Logging • takes appropriate actions during recovery to eliminate all orphans • better performance during failure-free runs • allows processes to communicate from non-recoverable states; failures may cause these states to be permanently unrecoverable, forcing rollback of any process that depends on such states
Causal Message Logging • Creates no orphans when failures happen and does not block processes when failures do not occur • Weakens the condition imposed by pessimistic protocols • Allows the possibility that the state from which a process communicates is unrecoverable because of a failure, but only if this does not affect consistency • Appends to all communication the information needed to recover the state from which the communication originates; this information is replicated in the memory of processes that causally depend on the originating state
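A rough sketch of the piggybacking idea follows. It assumes a determinant of the usual (source, send sequence number, destination, receive sequence number) form and a failure bound f; the class and its bookkeeping are illustrative only, not the actual data structures of any system discussed here.

import java.util.*;

public class CausalLoggingProcess {
    // Determinant of a receive event: enough information to replay the delivery deterministically.
    record Determinant(int source, int sendSeq, int dest, int recvSeq) {}

    private final int id;
    private final int maxFailures;                                        // f: crashes we must tolerate
    private final Map<Determinant, Set<Integer>> known = new HashMap<>(); // which processes hold each determinant

    public CausalLoggingProcess(int id, int maxFailures) {
        this.id = id;
        this.maxFailures = maxFailures;
    }

    // Record the determinant of a delivery, plus any determinants piggybacked on the message.
    public void onDeliver(Determinant d, List<Determinant> piggybacked) {
        known.computeIfAbsent(d, k -> new HashSet<>()).add(id);
        for (Determinant p : piggybacked) {
            known.computeIfAbsent(p, k -> new HashSet<>()).add(id);
        }
    }

    // When sending, attach every determinant not yet held by more than f processes, so that
    // any process that comes to depend on those deliveries also holds what is needed to replay them.
    public List<Determinant> piggybackForSend(int destination) {
        List<Determinant> out = new ArrayList<>();
        for (var e : known.entrySet()) {
            if (e.getValue().size() <= maxFailures) {
                out.add(e.getKey());
                e.getValue().add(destination);   // optimistically count the destination as a holder
            }
        }
        return out;
    }
}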
KAN – A Reliable Distributed Object System • Developed at UC Santa Barbara • Project Goal: • Language support for parallelism and distribution • Transparent location/migration/replication • Optimized method invocation • Fault-tolerance • Composition and proof reuse
System Description (figure): Kan source → Kan Compiler → Java bytecode + Kan run-time libraries, executed on multiple JVMs that communicate over UNIX sockets.
Fault Tolerance in Kan • Log-based forward recovery scheme: • Log of recovery information for a node is maintained externally on other nodes. • The failed nodes are recovered to their pre-failure states, and the correct nodes keep their states at the time of the failures. • Only consider node crash failures. • Processor stops taking steps and failures are eventually detected.
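A small sketch of what log-based forward recovery amounts to, under the assumption that the external log is an ordered list of invocation records that can be replayed against a restarted logical node; ReplayableNode and its methods are hypothetical, not KAN's actual interfaces.

import java.util.List;

public class ForwardRecovery {
    interface ReplayableNode {
        void restoreCheckpoint(byte[] checkpoint);   // reload the last saved state
        void replay(String loggedInvocation);        // re-execute one logged method invocation
    }

    // Bring a failed logical node back to its pre-failure state; correct nodes are left untouched.
    static void recover(ReplayableNode node, byte[] checkpoint, List<String> externalLog) {
        node.restoreCheckpoint(checkpoint);
        for (String invocation : externalLog) {
            node.replay(invocation);                 // deterministic replay of the logged invocations
        }
    }
}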
Basic Architecture of the Fault Tolerance Scheme (figure): each physical node i hosts logical nodes (x, y) together with a failure handler, a fault detector, a request handler, and a communication layer; an external log for the node is kept on other nodes, reached over the network by IP address.
Logical Ring • Use a logical ring to minimize the need for global synchronization and recovery. • The ring is only used for logging (remote method invocations). • Two parts: • Static part containing the active correct nodes. It has a leader and a sense of direction: upstream and downstream. • Dynamic part containing nodes that are trying to join the ring • A logical node is logged at the next T physical nodes in the ring, where T is the maximum number of node failures to tolerate.
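The "next T physical nodes" rule can be illustrated with a few lines of code. Here the ring is represented as an ordered list of physical node ids, which is an assumption about representation, not KAN's actual data structure.

import java.util.*;

public class RingLogging {
    // Return the T physical nodes downstream of `host` that should hold the external log.
    static List<Integer> logSites(List<Integer> ring, int host, int T) {
        int pos = ring.indexOf(host);
        List<Integer> sites = new ArrayList<>();
        for (int k = 1; k <= T && k < ring.size(); k++) {
            sites.add(ring.get((pos + k) % ring.size()));   // wrap around the ring
        }
        return sites;
    }

    public static void main(String[] args) {
        List<Integer> ring = List.of(1, 2, 3, 4, 5);
        // With T = 2 failures to tolerate, logical nodes hosted on node 4 are logged on nodes 5 and 1.
        System.out.println(logSites(ring, 4, 2));   // prints [5, 1]
    }
}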
Logical Ring Maintenance • Each node i participating in the protocol maintains the following variables: • Failed_i(j): true if i has detected the failure of j • Map_i(x): the physical node on which logical node x resides • Leader_i: i's view of the leader of the ring • View_i: i's view of the logical ring (membership and order) • Pending_i: the set of physical nodes that i suspects of failing • Recovery_count_i: the number of logical nodes that need to be recovered • Ready_i: records whether i is active. • There is an initial set of ready nodes; new nodes become ready when they are linked into the ring.
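Transliterated into a class, the per-node state looks roughly like the following; the field names mirror the variables above, while the concrete Java types chosen for them are assumptions.

import java.util.*;

public class NodeState {
    final int i;                                              // this physical node's id
    final Map<Integer, Boolean> failed = new HashMap<>();     // failed(j): i has detected the failure of j
    final Map<Integer, Integer> map = new HashMap<>();        // map(x): physical node hosting logical node x
    int leader;                                               // i's view of the ring leader
    final List<Integer> view = new ArrayList<>();             // i's view of ring membership and order
    final Set<Integer> pending = new HashSet<>();             // physical nodes i suspects of failing
    int recoveryCount;                                        // logical nodes still to be recovered
    boolean ready;                                            // whether i is active (linked into the ring)

    NodeState(int i) { this.i = i; }
}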
Failure Handling • When node i is informed of the failure of node j: • If every node upstream of i has failed, then i must become the new leader. It remaps all logical nodes from the upstream physical nodes, informs the other correct nodes by sending a remap message, and then recovers the logical nodes. • If the leader has failed but there is some upstream node k that will become the new leader, then just update the map and leader variables to reflect the new situation. • If the failed node j is upstream of i, then just update the map. If i is the next downstream node from j, also recover the logical nodes from j. • If j is downstream of i and there is some node k downstream of j, then just update the map. • If j is downstream of i and there is no node downstream of j, then wait for the leader to update the map. • If i is the leader and must recover j, then change the map, send a remap message to update the correct nodes' maps, and recover all logical nodes that are mapped locally.
Physical Node and Leader Recovery • When a physical node comes back up: • It sends a join message to the leader. • The leader tries to link this node in the ring: • Acquire <-> Grant • Add, Ack_add • Release • When the leader fails, the next downstream node in the ring becomes the new leader.
AQuA • Adaptive Quality of Service Availability • Developed at UIUC and BBN. • Goal: • Allow distributed applications to request and obtain a desired level of availability. • Fault tolerance • replication • reliable messaging
Features of AQuA • Uses the QuO runtime to process and make availability requests. • Uses the Proteus dependability manager to configure the system in response to faults and availability requests. • Uses Ensemble to provide group communication services. • Provides a CORBA interface to application objects through the AQuA gateway.
Proteus functionality • How to provide fault tolerance for applications: • style of replication (active, passive) • voting algorithm to use • degree of replication • type of faults to tolerate (crash, value, or time) • location of replicas • How to implement the chosen fault-tolerance scheme: • dynamic configuration modification • start/kill replicas, activate/deactivate monitors and voters
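The choices Proteus has to make can be pictured as a plain configuration record. The enum values come from the bullets above; the record itself (its name and fields) is only illustrative and is not an AQuA API.

import java.util.List;

public class FtConfig {
    enum ReplicationStyle { ACTIVE, PASSIVE }
    enum FaultType { CRASH, VALUE, TIME }

    // One fault-tolerance policy for an application object.
    record Policy(ReplicationStyle style,
                  String votingAlgorithm,          // e.g. "majority"; the exact algorithms are not named above
                  int degreeOfReplication,
                  List<FaultType> faultsToTolerate,
                  List<String> replicaLocations) {}

    public static void main(String[] args) {
        Policy p = new Policy(ReplicationStyle.ACTIVE, "majority", 3,
                List.of(FaultType.CRASH, FaultType.VALUE),
                List.of("hostA", "hostB", "hostC"));   // hypothetical replica hosts
        System.out.println(p);
    }
}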
Group structure • For reliable multicast and point-to-point communication • Replication groups • Connection groups • Proteus Communication Service Group for the replicated Proteus manager • contains the manager replicas and the objects that communicate with the manager • e.g. notification of view change, new QuO request • ensures that all manager replicas receive the same information • Point-to-point groups • Proteus manager to object factory
Fault Model, Detection, and Handling • Object fault model: • Object crash failure: occurs when an object stops sending out messages; its internal state is lost • a crash failure of an object is due to the crash of at least one element composing the object • Value faults: a message arrives in time but with the wrong content (caused by the application or the QuO runtime) • detected by a voter • Time faults • detected by a monitor • Leaders report faults to Proteus; Proteus will kill faulty objects if necessary and generate new objects
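As a concrete picture of value-fault detection by voting, here is a minimal majority voter over replica replies. It illustrates the idea only and is not AQuA's actual voter.

import java.util.*;

public class MajorityVoter {
    // Returns the majority reply, or empty if no value reaches a strict majority.
    static Optional<String> vote(List<String> replies) {
        Map<String, Integer> counts = new HashMap<>();
        for (String r : replies) counts.merge(r, 1, Integer::sum);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > replies.size() / 2)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // One replica returns a wrong value; the voter masks it, and the fault can be reported to Proteus.
        System.out.println(vote(List.of("42", "41", "42")));   // prints Optional[42]
    }
}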
Egida • Developed at UT Austin • An object-oriented, extensible toolkit for low-overhead fault tolerance • Provides a library of objects that can be used to compose log-based rollback-recovery protocols • Specification language to express arbitrary rollback-recovery protocols
Log-based Rollback Recovery • Checkpointing • independent, coordinated, induced by specific patterns of communication • Message Logging • Pessimistic, optimistic, causal
Core Building Blocks • Almost all log-based rollback-recovery protocols share an event-driven structure • The common events are: • Non-deterministic events • Orphans, determinants • Dependency-generating events • Output-commit events • Checkpointing events • Failure-detection events
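The shared event-driven structure can be sketched as a small dispatcher that lets protocol-specific handlers subscribe to the event classes above; the interface and enum names are assumptions for illustration, not Egida's actual classes.

import java.util.*;

public class EventDriven {
    enum Event { NON_DETERMINISTIC, DEPENDENCY_GENERATING, OUTPUT_COMMIT, CHECKPOINT, FAILURE_DETECTION }

    interface EventHandler { void handle(Event e, Object info); }

    private final Map<Event, List<EventHandler>> handlers = new EnumMap<>(Event.class);

    // A protocol is composed by attaching handlers to the events it cares about.
    void register(Event e, EventHandler h) {
        handlers.computeIfAbsent(e, k -> new ArrayList<>()).add(h);
    }

    void dispatch(Event e, Object info) {
        for (EventHandler h : handlers.getOrDefault(e, List.of())) h.handle(e, info);
    }

    public static void main(String[] args) {
        EventDriven rt = new EventDriven();
        // A pessimistic-logging flavour would synchronously log the determinant at this point.
        rt.register(Event.NON_DETERMINISTIC, (e, info) -> System.out.println("log determinant: " + info));
        rt.dispatch(Event.NON_DETERMINISTIC, "recv(src=1, ssn=7)");
    }
}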
A grammar for specifying rollback-recovery protocols

Protocol := <non-det-event-stmt>* <output-commit-event-stmt>* <dep-gen-event-stmt> <ckpt-stmt>opt <recovery-stmt>opt
<non-det-event-stmt> := <event> : determinant : <determinant-structure> <Log <event-info-list> <how-to-log> on <stable-storage>>opt
<output-commit-event-stmt> := <output-commit-proto> output commit on <event-list>
<event> := send | receive | read | write
<determinant-structure> := {source, sesn, dest, dest}
<output-commit-proto> := independent | co-ordinated
<how-to-log> := synchronously | asynchronously
<stable-storage> := local disk | volatile memory of self
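As an illustration only (my own instance of the grammar, not taken from Egida's documentation), a pessimistic receiver-based logging protocol might be written roughly as:

receive : determinant : {source, sesn, dest, dest} Log determinant synchronously on local disk
independent output commit on write

Here the determinant of the non-deterministic receive event is logged synchronously to local disk before execution continues, and output commit is handled independently on write events.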
Egida Modules • EventHandler • Determinant • HowToOutputCommit • LogEventDeterminant • LogEventInfo • HowToLog • WhereToLog • StableStorage • VolatileStorage • Checkpointing • …