Fault Tolerance in an Event Rule Framework for Distributed Systems Hillary Caituiro Monge
Contents • Introduction • Related Works • Overview of the Event Rule Framework (ERF) • Overview of Fault-Tolerant CORBA • Design of the Fault-Tolerant ERF (FT-ERF) • Performance Analysis • Conclusions
Introduction • Justification • Distributed Systems (DS) • Fault Tolerance (FT) • Reactive Components (RC) • The Event Rule Framework (ERF) • Motivation • Objectives
Distributed Systems (DS) • A DS is a • Collection of software components distributed among the processors of heterogeneous platforms • The purposes of a DS are: • Sharing resources and workload, and • Maximizing availability. • The design goals of DSs are: • Transparency, • Scalability, • Reliability, and • Performance.
Fault Tolerance (FT) • FT is the ability of a system to continue operating as expected despite internal or external failures. • DSs are prone to failures. • Some faults can be detected. • Some others cannot be detected. • The FT of a DS can be improved through redundancy, i.e., replication of its hardware or software components.
Reactive Components (RC) • Reactive components • React to external stimuli (i.e., events) • Initiate actions • An RC can be • Asynchronous or synchronous • Non-deterministic or deterministic • A reactive component that is both asynchronous and non-deterministic is called an ANDRC.
The Event Rule Framework (ERF) • An example of a DS framework having ANDRCs is: • ERF (Event/Rule Framework). • ERF • Developed at the Center for Computing Research and Development of the University of Puerto Rico – Mayagüez Campus. • It is an event-rule framework for developing distributed systems. • In ERF, events and rules are used as abstractions for specifying system behavior.
Motivation (1/2) • It is challenging to achieve fault tolerance in ANDRCs. • In non-deterministic components: • The output could be different, • even if the same sequence of stimuli is input with the same initial state. • Since the component is asynchronous: • Timing assumptions are not valid. • Moreover, the behavior of ANDRCs is subject to an analogue of Heisenberg's uncertainty principle: observing the component can itself alter its behavior.
Motivation (2/2) • Existing fault-tolerance techniques rely on: • Failure detectors • Timing assumptions • Synchronous or semi-synchronous systems, • State-transfer protocols • Deterministic systems • Very intrusive mechanisms • Duplicate detection and suppression mechanisms • Sequencers
Objectives (1/2) • This research is about the use of active and semi-active replication techniques for achieving fault tolerance in ERF, which is a framework that uses ANDRCs. • Active replication technique • All replicated components accept third-party incoming events. • A middle-tier component is in charge of • Event multicasting • Detecting and suppressing duplicated events.
Objectives (2/2) • Semi-active replication technique • All replicated components accept third-party incoming events, • but only one ("the leader") is able to post events. • Backup replicas listen to the leader to keep their production of events consistent. • Each replicated component is in charge of the detection and suppression of duplicated events.
Related Works • Generic support of FT in DSs • FT Event-Based DSs
Generic support of FT in DSs • OMG Fault-Tolerant CORBA Standard
Overview of the Event Rule Framework (ERF) • Model • Event Model • Rule Model • Behavioral Model • Components • Event Channel • RUBIES • Architecture of ERF-CORBA
Event Model • ERF provides the event abstraction to represent significant occurrences in a distributed system, e.g., a flood alert system. • The base class Event defines the structure and behavior applicable to all types of events.

package erf;
import erf.lang.*;
import java.io.Serializable;

public class Event implements Serializable {
  /* Attributes */
  public String id = "";
  public TimeValue ttl;
  public TimeValue daytime;
  public DistributedObject producer;
  /* Methods */
  public TimeValue t() {...}
  public TimeValue ts() {...}
  public TimeValue ttl() {...}
  public void setttl(long tv) {...}
  public DistributedObject getProducer() {...}
  public void setProducer(DistributedObject producer) {...}
  public void sett(long tv) {...}
  public String pName() {...}
  public boolean isDead() {...}
  public String getTypeName() {...}
}

Figure 3.2 Java definition of the class Event
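As a minimal sketch of what a concrete event type might look like (a hypothetical subclass, reusing the GageLevelReport type name that appears later in the performance tests; the gageId and level fields are assumptions, not part of ERF):

package erf;

// Hypothetical event subclass for a flood alert system. Only the base
// class Event and the type name GageLevelReport come from this document;
// the fields below are illustrative assumptions.
public class GageLevelReport extends Event {
  public String gageId; // identifier of the reporting gage
  public double level;  // measured water level

  public GageLevelReport(String gageId, double level) {
    this.gageId = gageId;
    this.level = level;
  }

  public String getTypeName() { return "GageLevelReport"; }
}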
Rule Model • In ERF, the behavior of a DS is defined in terms of rules. • A rule is an algorithm that is triggered when events in the event set match the rule's event pattern.

[package <package_specification>]
rule <rule_id> [priority <priority_number>]
on <trigger_events>
[use <usage_specification>]
[if <condition> then <actions> [else <alternative_actions>]]
[do <unconditional_actions>]

Figure 3.5 Syntax of the rule definition language (RDL)
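A hedged sketch of a rule in RDL, following only the syntax skeleton above; the rule name, trigger event, and the condition and action forms are all assumptions, not taken from ERF:

rule FloodAlert priority 1
on GageLevelReport
if <the reported gage level exceeds a threshold>
then <post a flood-warning event>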
Model • Behavioral Model • Defines how rules are triggered and evaluated upon the occurrence of events. • Rules need to be evaluated periodically because RUBIES receives events constantly. • The evaluation of rules is performed based on rule priority.
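A minimal sketch of this behavioral model (periodic, priority-ordered rule evaluation); the Rule interface and all names are hypothetical, not RUBIES's actual API:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RuleEvaluator {
  // Hypothetical rule abstraction: a priority, a pattern test, and actions.
  interface Rule {
    int priority();
    boolean matches(List<Object> eventSet); // does the event set match the rule's pattern?
    void fire(List<Object> eventSet);       // execute the rule's actions (produce target events)
  }

  private final List<Rule> rules = new ArrayList<>();
  private final List<Object> eventSet = new ArrayList<>();

  public void addRule(Rule r) {
    rules.add(r);
    // Keep rules ordered by descending priority, as the behavioral model requires.
    rules.sort(Comparator.comparingInt(Rule::priority).reversed());
  }

  public void post(Object event) { eventSet.add(event); }

  // One evaluation pass; called periodically because events arrive constantly.
  public void evaluate() {
    for (Rule r : rules) {
      if (r.matches(eventSet)) {
        r.fire(eventSet);
      }
    }
  }
}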
Components (1/2) • Event Channel • A distributed middleware component. • It delivers events to consumers. • It receives events from producers. • Events are treated as objects.
Components (2/2) • Rule Based Intelligent Event Service (RUBIES) • The main component of ERF. • It is an engine that handles events through the evaluation of rules. • RUBIES is a distributed component. • It is registered with the event channel both as a consumer and as a producer.
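A minimal sketch of the relationship between the event channel and RUBIES described above; all interfaces and names are assumptions, not ERF's actual API:

import java.util.ArrayList;
import java.util.List;

// Hypothetical consumer interface: anything that receives events.
interface Consumer {
  void onEvent(Object event);
}

// Hypothetical event channel: receives events from producers and
// delivers them to every registered consumer.
class EventChannel {
  private final List<Consumer> consumers = new ArrayList<>();

  void register(Consumer c) { consumers.add(c); }

  void post(Object event) {   // called by producers
    for (Consumer c : consumers) {
      c.onEvent(event);       // delivery to consumers
    }
  }
}

// RUBIES registers with the channel as a consumer, and posts the events
// its rules produce back to the same channel as a producer.
class Rubies implements Consumer {
  private final EventChannel channel;

  Rubies(EventChannel channel) {
    this.channel = channel;
    channel.register(this);   // consumer role
  }

  public void onEvent(Object event) {
    // ... evaluate rules; if a rule fires, produce target events:
    // channel.post(targetEvent);  // producer role
  }
}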
Architecture of ERF-CORBA Figure 3.8 Architecture of ERF-CORBA
Overview of Fault-Tolerant CORBA (FT-CORBA) • Fault-Tolerant CORBA (FT-CORBA) • Replication Management • Fault Management • Logging and Recovery Management
Fault Tolerant CORBA • Adopted by the OMG in 2000. • It defines commitments rather than a single solution, • aiming at full interoperability among different products. • It provides support for applications that require • high levels of reliability • with minimal modifications. • This research was designed to be compliant with this standard.
Replication Management • Replication management covers a fault-tolerance domain. • It is done through the Replication Manager component, which inherits from the Property Manager, Object Group Manager, and Generic Factory interfaces. Figure 4.3 Hierarchy of the Replication Management
Fault Management • It includes the Fault Notification, Fault Detection, and Fault Analysis services. • The Fault Notifier sends fault notifications to its consumers. • The Fault Detectors are attached to replicas or hosts and report faults to the Fault Notifier. • The Fault Analyzer analyzes faults and produces reports for the Fault Notifier. Figure 4.8 Architecture of Fault Management
Logging and Recovery Management • Logging mechanism: • Logs the state of the primary member. • Recovery mechanism: • Acts on failures or when new members join. • Recovers state from the log into the new primary. • Consistency must be controlled by the infrastructure.
Design of the Fault-Tolerant ERF (FT-ERF) • Scalability and Fault-Tolerance Problems in ERF-CORBA • Architecture of Scalable and Fault-Tolerant ERF • Architecture of Fault-Tolerant ERF-CORBA • EID Uniqueness • Event and pattern equality rules • Pattern Management • Active Replication • Semi-Active Replication
Scalability and Fault Tolerance Problems in ERF-CORBA Figure 5.1 Two possible points of scalability and fault-tolerance problems in ERF: (a) the size of the rules database; (b) a crash of RUBIES.
Architecture of Scalable and Fault-Tolerant ERF • RUBIES instances RUBIES(γij, δi) are organized in a grid along two dimensions: a distribution dimension (rule partitions δ1, …, δN) and a replication dimension (replicas j = 1, …, M of each partition). Figure 5.3 Architecture of scalable and fault-tolerant ERF
EID Uniqueness (1/2) • Each event in the system needs to be uniquely identified by an event identifier • (EID). • EID uniqueness must be guaranteed in different contexts: • local, replication group, and system. • The use of sequencers is one option to achieve EID uniqueness • Each replica starts a sequencer. • However, this is only valid for deterministic components.
EID Uniqueness (2/2) • Events can be identified by their history. • Each event is produced due to an event pattern. • Such a history includes • the list of previous events that triggered the event, and • the function or rule that caused its production. Figure 5.5 Conceptual View of the Event Unique Identification
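A minimal sketch of a history-based EID, assuming (per the description above) that an event's identity is derived from the rule that produced it, the EIDs of the triggering events, and the order of production; the encoding itself is an assumption:

import java.util.Arrays;
import java.util.List;

public final class Eid {
  // Builds a replica-independent identifier from an event's history: every
  // correct replica that produces the "same" event derives the same EID.
  public static String of(String ruleId, List<String> sourceEids, int order) {
    return ruleId + ":" + String.join(",", sourceEids) + ":" + order;
  }

  public static void main(String[] args) {
    // e.g., the first event produced by rule R1 when triggered by e1 and e2
    System.out.println(of("R1", Arrays.asList("e1", "e2"), 0)); // R1:e1,e2:0
  }
}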
EVENT EQUALITY RULE • Two events are equal if: • Both are of the same Type. • Both were produced due to the same Rule. • Both have the same Order of production at the time when the Rule was triggered. • Both have the same Pattern.
PATTERN EQUALITY RULE • Two event patterns are equal if: • Both have the same number of events. • Both have events in the same order. • The two events at each position satisfy the Event Equality Rule as previously defined.
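A combined Java sketch of the two equality rules; the representation (field names and the recursive structure) is an assumption standing in for ERF's actual one:

import java.util.List;

// Hypothetical identity-relevant view of an event's history.
class EventHistory {
  String typeName;            // event type
  String ruleId;              // rule that produced the event
  int order;                  // order of production when the rule was triggered
  List<EventHistory> pattern; // source events that triggered the rule

  // Event Equality Rule: same type, same rule, same order, same pattern.
  boolean sameAs(EventHistory o) {
    return typeName.equals(o.typeName)
        && ruleId.equals(o.ruleId)
        && order == o.order
        && samePattern(pattern, o.pattern);
  }

  // Pattern Equality Rule: same number of events, same order,
  // and pairwise-equal events under the Event Equality Rule.
  static boolean samePattern(List<EventHistory> a, List<EventHistory> b) {
    if (a.size() != b.size()) return false;
    for (int i = 0; i < a.size(); i++) {
      if (!a.get(i).sameAs(b.get(i))) return false;
    }
    return true;
  }
}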
Pattern Management (1/2) • Rules use a pattern management framework • to prevent events from being produced more than once for a given event pattern. • In this framework, patterns are defined in terms of: • Source events (i.e., events that cause rules to trigger), and • Target events (i.e., events that are produced by rules).
Pattern Management (2/2) • The framework has three main components for pattern management: • Pattern Manager to manage patterns of events. • Pattern to store patterns of events. • Indexer to organize patterns of events. Figure 5.6 Architecture of Pattern Management
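A minimal sketch of the three components named above; the method names are assumptions:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Pattern: stores the source events that triggered a rule.
class Pattern {
  final List<String> sourceEids;
  Pattern(List<String> sourceEids) { this.sourceEids = sourceEids; }
  String key() { return String.join(",", sourceEids); } // canonical form for indexing
}

// Indexer: organizes patterns so they can be looked up quickly.
class Indexer {
  private final Set<String> seen = new HashSet<>();
  boolean addIfAbsent(Pattern p) { return seen.add(p.key()); } // false if already indexed
}

// PatternManager: records each pattern and reports whether a rule has
// already fired for it, so target events are not produced twice.
class PatternManager {
  private final Indexer indexer = new Indexer();
  boolean firstOccurrence(Pattern p) { return indexer.addIfAbsent(p); }
}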
Active Replication (AR) • For systems with tight time constraints. • All replicas run at the same time: • all accept events, and • all send events. • As a result, duplicated events circulate in the system. • Therefore, it is crucial • to detect and suppress duplicated events, • to deliver a unique reply, • to keep consistency, and • to keep fault tolerance transparent.
AR: Pattern Naming • For duplicated-event detection and suppression. • A centralized mid-tier component that, • through an analysis of an event's history, • detects whether the event has already been delivered. • It relies on two primitives (sketched after the figure below): • Event binding: • registers an event. • Pattern solving: • resolves whether an equivalent event was already delivered.
AR: Pattern Naming Figure 5.9 Architecture of the Pattern Naming
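A minimal sketch of the two Pattern Naming primitives; the map-based implementation and all names are assumptions:

import java.util.HashMap;
import java.util.Map;

class PatternNaming {
  // Maps an event's history-derived key to the EID that was delivered for it.
  private final Map<String, String> delivered = new HashMap<>();

  // Event binding: register an event under its history key.
  synchronized void bind(String historyKey, String eid) {
    delivered.put(historyKey, eid);
  }

  // Pattern solving: resolve whether an equivalent event was already
  // delivered; returns its EID, or null if this is the first delivery.
  synchronized String solve(String historyKey) {
    return delivered.get(historyKey);
  }
}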
Semi-Active Replication (SAR) • For systems with relatively loose time constraints. • All replicas run at the same time, • but only the primary is able to reply to clients. • When the primary fails, a new primary is selected. • When a backup member fails, it is released from the group. • Failure detectors are used to detect failures in group members. • A time delay (in seconds) is applied before the selection of a new primary.
SAR: Production Controller • For duplicated-event detection and suppression. • It is distributed within each replica. • The following algorithm is executed on backup members, where PQ holds unmatched events from the primary and BQ holds unmatched events produced by the backup.

On incoming event P from the primary:
• If queue BQ contains an event B equivalent to P, then
  • update B.id with P.id across the entire system, and
  • remove the matched B from BQ (P is discarded);
• else
  • enqueue P in PQ.

On event B produced by the backup:
• If queue PQ contains an event P equivalent to B, then
  • update B.id with P.id across the entire system, and
  • remove P from PQ;
• else
  • enqueue B in BQ.

On failure, if the backup replica is elected as the new primary:
• Post all events of queue BQ.
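A Java sketch of this algorithm as a backup-side component; the queue representation and all names are assumptions, and the history-key test stands in for the Event Equality Rule defined earlier:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class ProductionController {
  // Hypothetical minimal event: an id plus a history key used for equivalence.
  static class Event {
    String id;
    final String historyKey;
    Event(String id, String historyKey) { this.id = id; this.historyKey = historyKey; }
  }

  private final Deque<Event> pq = new ArrayDeque<>(); // unmatched events from the primary
  private final Deque<Event> bq = new ArrayDeque<>(); // unmatched locally produced events

  // An event P arrives from the primary.
  synchronized void onPrimaryEvent(Event p) {
    Event b = removeEquivalent(bq, p);
    if (b != null) {
      b.id = p.id;   // adopt the primary's EID across the entire system
    } else {
      pq.add(p);     // wait for the backup to produce the equivalent event
    }
  }

  // The backup produces an event B locally.
  synchronized void onBackupEvent(Event b) {
    Event p = removeEquivalent(pq, b);
    if (p != null) {
      b.id = p.id;   // adopt the primary's EID; the duplicate is suppressed
    } else {
      bq.add(b);     // wait for the primary's copy
    }
  }

  // On primary failure, a backup elected as new primary posts everything in BQ.
  synchronized List<Event> takeOverAsPrimary() {
    List<Event> toPost = new ArrayList<>(bq);
    bq.clear();
    return toPost;
  }

  // Equivalence per the Event Equality Rule, approximated here by the history key.
  private static Event removeEquivalent(Deque<Event> q, Event e) {
    for (Iterator<Event> it = q.iterator(); it.hasNext(); ) {
      Event x = it.next();
      if (x.historyKey.equals(e.historyKey)) { it.remove(); return x; }
    }
    return null;
  }
}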
SAR: Production Controller Figure 5.14 Architecture of the Production Controller
Performance Analysis • Objectives • Methodology • Test Scenarios • Test Procedure • Test Results
Objectives • Measure the execution time of fault-tolerant ERF using active and semi-active replication techniques for: • An increasing number of replicas. • An increasing number of failures. • An increasing workload. • Compare the execution time of: • Active versus semi-active replication techniques. • Failure-free versus failure execution scenarios. • Fault-tolerant versus non-fault-tolerant execution.
Test Scenarios: Service distribution Figure 7.1 UML deployment diagram of the test environment. (The domain for all computers is ece.uprm.edu)
Test Scenarios: Failure schedule: First scenario • Six workstations, • 3 to 8 replicas, • 193 rules. • Failure schedule defined by the function f over the power set F of the replicas, where • n is the number of replicas, • f(p=n) = ∞, • f(p=1…n−1) = p·T/n determines the time of the failure, • p is the position of the replica in the subset, and • T is the arithmetic mean of the execution times of ten failure-free runs with n replicas.
Test Scenarios: Failure schedule: Second scenario • Ten workstations, • ten replicas, • 193 rules. • Failure schedule defined by the function g over the set G, where • n is the number of replicas, • g(p=n) = ∞, • g(p=1…n−1) = p·T/n determines the time of the failure, • p is the position of the replica in the subset, and • T is the arithmetic mean of the execution times of ten failure-free runs with n replicas.
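A small sketch of the failure-schedule function shared by these scenarios; the class and method names are assumptions:

public class FailureSchedule {
  // Failure time of the replica at position p in a subset of n replicas,
  // where T is the mean execution time of ten failure-free runs with n replicas.
  // The replica at position n never fails (infinite failure time).
  static double failureTime(int p, int n, double T) {
    return (p == n) ? Double.POSITIVE_INFINITY : (double) p * T / n;
  }

  public static void main(String[] args) {
    // e.g., n = 4 replicas and T = 100 s: failures at 25 s, 50 s, and 75 s
    for (int p = 1; p <= 4; p++) {
      System.out.println("replica " + p + ": " + failureTime(p, 4, 100.0));
    }
  }
}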
Test Scenarios: Failure schedule: Third scenario • Ten workstations, • ten replicas. • Six rule sets of 6, 12, 24, 48, 96, and 193 rules, respectively. • The failure schedule was given by the function g defined for the second scenario.
Test Scenarios: Test application • A client that is both a consumer and a producer of the event channel. • It starts the test by sending two events of type GageLevelReport, • and ends its execution when an event of type TestEventEnd arrives. • It measures the execution time, • starting just after the second event is posted, and • ending just after an event of type TestEventEnd arrives.
Methodology: Test Procedure • The procedure consisted of three major steps: • Clear the environment; • Launch the infrastructure; and • Run the test application. • The reported results are • the arithmetic mean of 10 runs of each test case. • The arithmetic mean of the standard deviations was 1.46%.