Fault Tolerance in an Event Rule Framework for Distributed Systems Hillary Caituiro Monge
Contents • Introduction • Related Works • Overview of the Event Rule Framework (ERF) • Overview of Fault-Tolerant CORBA • Design of the Fault-Tolerant ERF (FT-ERF) • Performance Analysis • Conclusions
Introduction • Justification • Distributed Systems (DS) • Fault Tolerance (FT) • Reactive Components (RC) • The Event Rule Framework (ERF) • Motivation • Objectives
Distributed Systems (DS) • A DS is a • Collection of software components distributed among the processors of heterogeneous platforms • The purposes of a DS are: • Sharing resources and workload, and • Maximizing availability. • The design goals of DSs are: • Transparency, • Scalability, • Reliability, and • Performance.
Fault Tolerance (FT) • FT is the ability of a system to continue operating as expected despite internal or external failures. • DSs are prone to failures. • Some faults can be detected. • Some others cannot be detected. • The FT of a DS can be improved through redundancy, i.e., replication of its hardware or software components.
Reactive Components (RC) • Reactive components • React to external stimuli (i.e., events) • Initiate actions • An RC can be • Asynchronous or synchronous • Non-deterministic or deterministic • A reactive component that is both asynchronous and non-deterministic is called an ANDRC.
The Event Rule Framework (ERF) • An example of a DS framework having ANDRCs is: • ERF (Event/Rule Framework). • ERF • Developed at the Center for Computing Research and Development of the University of Puerto Rico – Mayagüez Campus. • It is an event-rule framework for developing distributed systems. • In ERF, events and rules are used as abstractions for specifying system behavior.
Motivation (1/2) • It is challenging to achieve fault tolerance in ANDRCs. • In non-deterministic components: • The output could be different, • even if the same sequence of stimuli is input with the same initial state. • Since the component is asynchronous: • Timing assumptions are not valid. • Moreover, the behavior of ANDRCs is subject to an analogue of Heisenberg's uncertainty principle: observing the component can itself alter its behavior.
Motivation (2/2) • Existing fault-tolerance techniques rely on: • Failure detectors • Timing assumptions • Synchronous or semi-synchronous systems, • State-transfer protocols • Deterministic systems • Very intrusive mechanisms • Duplicate detection and suppression mechanisms • Sequencers
Objectives (1/2) • This research is about the use of active and semi-active replication techniques for achieving fault tolerance in ERF, which is a framework that uses ANDRCs. • Active replication technique • All replicated components accept third-party incoming events. • A middle-tier component is in charge of • Event multicasting • Detecting and suppressing duplicated events.
Objectives (2/2) • Semi-active replication technique • All replicated components accept third-party incoming events, • but only one ("the leader") is able to post events. • Backup replicas listen to the leader to keep their production of events consistent. • Each replicated component is in charge of the detection and suppression of duplicated events.
Related Works • Generic support of FT in DSs • FT Event-Based DSs
Generic support of FT in DSs • OMG Fault-Tolerant CORBA Standard
Overview of the Event Rule Framework (ERF) • Model • Event Model • Rule Model • Behavioral Model • Components • Event Channel • RUBIES • Architecture of ERF-CORBA
Event Model • ERF provides the event abstraction to represent significant occurrences in a distributed system, e.g., a flood alert system. • The base class Event defines the structure and behavior applicable to all types of events.

package erf;
import erf.lang.*;
import java.io.Serializable;

public class Event implements Serializable {
  /* Attributes */
  public String id = "";
  public TimeValue ttl;
  public TimeValue daytime;
  public DistributedObject producer;
  /* Methods */
  public TimeValue t() {...}
  public TimeValue ts() {...}
  public TimeValue ttl() {...}
  public void setttl(long tv) {...}
  public DistributedObject getProducer() {...}
  public void setProducer(DistributedObject producer) {...}
  public void sett(long tv) {...}
  public String pName() {...}
  public boolean isDead() {...}
  public String getTypeName() {...}
}

Figure 3.2 Java definition of the class Event
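As a minimal sketch of what a concrete event type might look like (a hypothetical subclass, reusing the GageLevelReport type name that appears later in the performance tests; the gageId and level fields are assumptions, not part of ERF):

package erf;

// Hypothetical event subclass for a flood alert system. Only the base
// class Event and the type name GageLevelReport come from this document;
// the fields below are illustrative assumptions.
public class GageLevelReport extends Event {
  public String gageId; // identifier of the reporting gage
  public double level;  // measured water level

  public GageLevelReport(String gageId, double level) {
    this.gageId = gageId;
    this.level = level;
  }

  public String getTypeName() { return "GageLevelReport"; }
}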
Rule Model • In ERF, the behavior of a DS is defined in terms of rules. • A rule is an algorithm that is triggered when events in the event set match the rule's event pattern.

[package <package_specification>]
rule <rule_id> [priority <priority_number>]
on <trigger_events>
[use <usage_specification>]
[if <condition> then <actions> [else <alternative_actions>]]
[do <unconditional_actions>]

Figure 3.5 Syntax of the rule definition language (RDL)
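A hedged sketch of a rule in RDL, following only the syntax skeleton above; the rule name, trigger event, and the condition and action forms are all assumptions, not taken from ERF:

rule FloodAlert priority 1
on GageLevelReport
if <the reported gage level exceeds a threshold>
then <post a flood-warning event>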
Model • Behavioral Model • Defines how rules are triggered and evaluated upon the occurrence of events. • Rules need to be evaluated periodically because RUBIES receives events constantly. • The evaluation of rules is performed based on rule priority.
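A minimal sketch of this behavioral model (periodic, priority-ordered rule evaluation); the Rule interface and all names are hypothetical, not RUBIES's actual API:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RuleEvaluator {
  // Hypothetical rule abstraction: a priority, a pattern test, and actions.
  interface Rule {
    int priority();
    boolean matches(List<Object> eventSet); // does the event set match the rule's pattern?
    void fire(List<Object> eventSet);       // execute the rule's actions (produce target events)
  }

  private final List<Rule> rules = new ArrayList<>();
  private final List<Object> eventSet = new ArrayList<>();

  public void addRule(Rule r) {
    rules.add(r);
    // Keep rules ordered by descending priority, as the behavioral model requires.
    rules.sort(Comparator.comparingInt(Rule::priority).reversed());
  }

  public void post(Object event) { eventSet.add(event); }

  // One evaluation pass; called periodically because events arrive constantly.
  public void evaluate() {
    for (Rule r : rules) {
      if (r.matches(eventSet)) {
        r.fire(eventSet);
      }
    }
  }
}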
Components (1/2) • Event Channel • A distributed middleware component. • It delivers events to consumers. • It receives events from producers. • Events are treated as objects.
Components (2/2) • Rule Based Intelligent Event Service (RUBIES) • The main component of ERF. • It is an engine that handles events through the evaluation of rules. • RUBIES is a distributed component. • It is registered with the event channel both as a consumer and as a producer.
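A minimal sketch of the relationship between the event channel and RUBIES described above; all interfaces and names are assumptions, not ERF's actual API:

import java.util.ArrayList;
import java.util.List;

// Hypothetical consumer interface: anything that receives events.
interface Consumer {
  void onEvent(Object event);
}

// Hypothetical event channel: receives events from producers and
// delivers them to every registered consumer.
class EventChannel {
  private final List<Consumer> consumers = new ArrayList<>();

  void register(Consumer c) { consumers.add(c); }

  void post(Object event) {   // called by producers
    for (Consumer c : consumers) {
      c.onEvent(event);       // delivery to consumers
    }
  }
}

// RUBIES registers with the channel as a consumer, and posts the events
// its rules produce back to the same channel as a producer.
class Rubies implements Consumer {
  private final EventChannel channel;

  Rubies(EventChannel channel) {
    this.channel = channel;
    channel.register(this);   // consumer role
  }

  public void onEvent(Object event) {
    // ... evaluate rules; if a rule fires, produce target events:
    // channel.post(targetEvent);  // producer role
  }
}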
Architecture of ERF-CORBA Figure 3.8 Architecture of ERF-CORBA
Overview of Fault-Tolerant CORBA (FT-CORBA) • Fault-Tolerant CORBA (FT-CORBA) • Replication Management • Fault Management • Logging and Recovery Management
Fault Tolerant CORBA • Adopted by the OMG in 2000. • It defines commitments rather than a single solution, • aiming at full interoperability among different products. • It provides support for applications that require • high levels of reliability • with minimal modifications. • This research was designed to be compliant with this standard.
Replication Management • Replication management covers a fault-tolerance domain. • It is done through the Replication Manager component, which inherits from the Property Manager, Object Group Manager, and Generic Factory interfaces. Figure 4.3 Hierarchy of the Replication Management
Fault Management • It includes the Fault Notification, Fault Detection, and Fault Analysis services. • The Fault Notifier sends fault notifications to its consumers. • The Fault Detectors are attached to replicas or hosts and report faults to the Fault Notifier. • The Fault Analyzer analyzes faults and produces reports for the Fault Notifier. Figure 4.8 Architecture of Fault Management
Logging and Recovery Management • Logging mechanism: • Logs the state of the primary member. • Recovery mechanism: • Acts on failures or when new members join. • Recovers state from the log into the new primary. • Consistency must be controlled by the infrastructure.
Design of the Fault-Tolerant ERF (FT-ERF) • Scalability and Fault-Tolerance Problems in ERF-CORBA • Architecture of Scalable and Fault-Tolerant ERF • Architecture of Fault-Tolerant ERF-CORBA • EID Uniqueness • Event and pattern equality rules • Pattern Management • Active Replication • Semi-Active Replication
Scalability and Fault Tolerance Problems in ERF-CORBA Figure 5.1 Two possible points of scalability and fault-tolerance problems in ERF: (a) the size of the rules database; (b) a crash of RUBIES.
Architecture of Scalable and Fault-Tolerant ERF • RUBIES instances RUBIES(γij, δi) are organized in a grid along two dimensions: a distribution dimension (rule partitions δ1, …, δN) and a replication dimension (replicas j = 1, …, M of each partition). Figure 5.3 Architecture of scalable and fault-tolerant ERF
EID Uniqueness (1/2) • Each event in the system needs to be uniquely identified by an event identifier • (EID). • EID uniqueness must be guaranteed in different contexts: • local, replication group, and system. • The use of sequencers is one option to achieve EID uniqueness • Each replica starts a sequencer. • However, this is only valid for deterministic components.
EID Uniqueness (2/2) • Events can be identified by their history. • Each event is produced due to an event pattern. • Such a history includes • the list of previous events that triggered the event, and • the function or rule that caused its production. Figure 5.5 Conceptual View of the Event Unique Identification
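A minimal sketch of a history-based EID, assuming (per the description above) that an event's identity is derived from the rule that produced it, the EIDs of the triggering events, and the order of production; the encoding itself is an assumption:

import java.util.Arrays;
import java.util.List;

public final class Eid {
  // Builds a replica-independent identifier from an event's history: every
  // correct replica that produces the "same" event derives the same EID.
  public static String of(String ruleId, List<String> sourceEids, int order) {
    return ruleId + ":" + String.join(",", sourceEids) + ":" + order;
  }

  public static void main(String[] args) {
    // e.g., the first event produced by rule R1 when triggered by e1 and e2
    System.out.println(of("R1", Arrays.asList("e1", "e2"), 0)); // R1:e1,e2:0
  }
}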
EVENT EQUALITY RULE • Two events are equal if: • Both are of the same Type. • Both were produced due to the same Rule. • Both have the same Order of production at the time when the Rule was triggered. • Both have the same Pattern.
PATTERN EQUALITY RULE • Two event patterns are equal if: • Both have the same number of events. • Both have events in the same order. • The two events at each position satisfy the Event Equality Rule as previously defined.
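A combined Java sketch of the two equality rules; the representation (field names and the recursive structure) is an assumption standing in for ERF's actual one:

import java.util.List;

// Hypothetical identity-relevant view of an event's history.
class EventHistory {
  String typeName;            // event type
  String ruleId;              // rule that produced the event
  int order;                  // order of production when the rule was triggered
  List<EventHistory> pattern; // source events that triggered the rule

  // Event Equality Rule: same type, same rule, same order, same pattern.
  boolean sameAs(EventHistory o) {
    return typeName.equals(o.typeName)
        && ruleId.equals(o.ruleId)
        && order == o.order
        && samePattern(pattern, o.pattern);
  }

  // Pattern Equality Rule: same number of events, same order,
  // and pairwise-equal events under the Event Equality Rule.
  static boolean samePattern(List<EventHistory> a, List<EventHistory> b) {
    if (a.size() != b.size()) return false;
    for (int i = 0; i < a.size(); i++) {
      if (!a.get(i).sameAs(b.get(i))) return false;
    }
    return true;
  }
}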
Pattern Management (1/2) • Rules use a pattern management framework • to prevent events from being produced more than once for a given event pattern. • In this framework, patterns are defined in terms of: • Source events (i.e., events that cause rules to trigger), and • Target events (i.e., events that are produced by rules).
Pattern Management (2/2) • The framework has three main components for pattern management: • Pattern Manager to manage patterns of events. • Pattern to store patterns of events. • Indexer to organize patterns of events. Figure 5.6 Architecture of Pattern Management
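A minimal sketch of the three components named above; the method names are assumptions:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Pattern: stores the source events that triggered a rule.
class Pattern {
  final List<String> sourceEids;
  Pattern(List<String> sourceEids) { this.sourceEids = sourceEids; }
  String key() { return String.join(",", sourceEids); } // canonical form for indexing
}

// Indexer: organizes patterns so they can be looked up quickly.
class Indexer {
  private final Set<String> seen = new HashSet<>();
  boolean addIfAbsent(Pattern p) { return seen.add(p.key()); } // false if already indexed
}

// PatternManager: records each pattern and reports whether a rule has
// already fired for it, so target events are not produced twice.
class PatternManager {
  private final Indexer indexer = new Indexer();
  boolean firstOccurrence(Pattern p) { return indexer.addIfAbsent(p); }
}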
Active Replication (AR) • For systems with tight time constraints. • All replicas run at the same time: • all accept events, and • all send events. • As a result, duplicated events circulate in the system. • Therefore, it is crucial • to detect and suppress duplicated events, • to deliver a unique reply, • to keep consistency, and • to keep fault tolerance transparent.
AR: Pattern Naming • For duplicated-event detection and suppression. • A centralized mid-tier component that, • through an analysis of an event's history, • detects whether the event has already been delivered. • It relies on two primitives (sketched after the figure below): • Event binding: • registers an event. • Pattern solving: • resolves whether an equivalent event was already delivered.
AR: Pattern Naming Figure 5.9 Architecture of the Pattern Naming
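A minimal sketch of the two Pattern Naming primitives; the map-based implementation and all names are assumptions:

import java.util.HashMap;
import java.util.Map;

class PatternNaming {
  // Maps an event's history-derived key to the EID that was delivered for it.
  private final Map<String, String> delivered = new HashMap<>();

  // Event binding: register an event under its history key.
  synchronized void bind(String historyKey, String eid) {
    delivered.put(historyKey, eid);
  }

  // Pattern solving: resolve whether an equivalent event was already
  // delivered; returns its EID, or null if this is the first delivery.
  synchronized String solve(String historyKey) {
    return delivered.get(historyKey);
  }
}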
Semi-Active Replication (SAR) • For systems with relatively loose time constraints. • All replicas run at the same time, • but only the primary is able to reply to clients. • When the primary fails, a new primary is selected. • When a backup member fails, it is released from the group. • Failure detectors are used to detect failures in group members. • A time delay (in seconds) is applied before the selection of a new primary.
SAR: Production Controller • For duplicated-event detection and suppression. • It is distributed within each replica. • The following algorithm is executed on backup members, where PQ holds unmatched events from the primary and BQ holds unmatched events produced by the backup.

On incoming event P from the primary:
• If queue BQ contains an event B equivalent to P, then
  • update B.id with P.id across the entire system, and
  • remove the matched B from BQ (P is discarded);
• else
  • enqueue P in PQ.

On event B produced by the backup:
• If queue PQ contains an event P equivalent to B, then
  • update B.id with P.id across the entire system, and
  • remove P from PQ;
• else
  • enqueue B in BQ.

On failure, if the backup replica is elected as the new primary:
• Post all events of queue BQ.
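A Java sketch of this algorithm as a backup-side component; the queue representation and all names are assumptions, and the history-key test stands in for the Event Equality Rule defined earlier:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class ProductionController {
  // Hypothetical minimal event: an id plus a history key used for equivalence.
  static class Event {
    String id;
    final String historyKey;
    Event(String id, String historyKey) { this.id = id; this.historyKey = historyKey; }
  }

  private final Deque<Event> pq = new ArrayDeque<>(); // unmatched events from the primary
  private final Deque<Event> bq = new ArrayDeque<>(); // unmatched locally produced events

  // An event P arrives from the primary.
  synchronized void onPrimaryEvent(Event p) {
    Event b = removeEquivalent(bq, p);
    if (b != null) {
      b.id = p.id;   // adopt the primary's EID across the entire system
    } else {
      pq.add(p);     // wait for the backup to produce the equivalent event
    }
  }

  // The backup produces an event B locally.
  synchronized void onBackupEvent(Event b) {
    Event p = removeEquivalent(pq, b);
    if (p != null) {
      b.id = p.id;   // adopt the primary's EID; the duplicate is suppressed
    } else {
      bq.add(b);     // wait for the primary's copy
    }
  }

  // On primary failure, a backup elected as new primary posts everything in BQ.
  synchronized List<Event> takeOverAsPrimary() {
    List<Event> toPost = new ArrayList<>(bq);
    bq.clear();
    return toPost;
  }

  // Equivalence per the Event Equality Rule, approximated here by the history key.
  private static Event removeEquivalent(Deque<Event> q, Event e) {
    for (Iterator<Event> it = q.iterator(); it.hasNext(); ) {
      Event x = it.next();
      if (x.historyKey.equals(e.historyKey)) { it.remove(); return x; }
    }
    return null;
  }
}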
SAR: Production Controller Figure 5.14 Architecture of the Production Controller
Performance Analysis • Objectives • Methodology • Test Scenarios • Test Procedure • Test Results
Objectives • Measure the execution time of fault-tolerant ERF using active and semi-active replication techniques for: • An increasing number of replicas. • An increasing number of failures. • An increasing workload. • Compare the execution time of: • Active versus semi-active replication techniques. • Failure-free versus failure execution scenarios. • Fault-tolerant versus non-fault-tolerant execution.
Test Scenarios: Service distribution Figure 7.1 UML deployment diagram of the test environment. (The domain for all computers is ece.uprm.edu)
Test Scenarios: Failure schedule: First scenario • Six workstations, • 3 to 8 replicas, • 193 rules. • Failure schedule defined by the function f over the power set F of the replicas, where • n is the number of replicas, • f(p=n) = ∞, • f(p=1…n−1) = p·T/n determines the time of the failure, • p is the position of the replica in the subset, and • T is the arithmetic mean of the execution times of ten failure-free runs with n replicas.
Test Scenarios: Failure schedule: Second scenario • Ten workstations, • ten replicas, • 193 rules. • Failure schedule defined by the function g over the set G, where • n is the number of replicas, • g(p=n) = ∞, • g(p=1…n−1) = p·T/n determines the time of the failure, • p is the position of the replica in the subset, and • T is the arithmetic mean of the execution times of ten failure-free runs with n replicas.
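A small sketch of the failure-schedule function shared by these scenarios; the class and method names are assumptions:

public class FailureSchedule {
  // Failure time of the replica at position p in a subset of n replicas,
  // where T is the mean execution time of ten failure-free runs with n replicas.
  // The replica at position n never fails (infinite failure time).
  static double failureTime(int p, int n, double T) {
    return (p == n) ? Double.POSITIVE_INFINITY : (double) p * T / n;
  }

  public static void main(String[] args) {
    // e.g., n = 4 replicas and T = 100 s: failures at 25 s, 50 s, and 75 s
    for (int p = 1; p <= 4; p++) {
      System.out.println("replica " + p + ": " + failureTime(p, 4, 100.0));
    }
  }
}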
Test Scenarios: Failure schedule: Third scenario • Ten workstations, • ten replicas. • Six rule sets of 6, 12, 24, 48, 96, and 193 rules, respectively. • The failure schedule was given by the function g defined for the second scenario.
Test Scenarios: Test application • A client that is both a consumer and a producer of the event channel. • It starts the test by sending two events of type GageLevelReport, • and ends its execution when an event of type TestEventEnd arrives. • It measures the execution time, • starting just after the second event is posted, and • ending just after an event of type TestEventEnd arrives.
Methodology: Test Procedure • The procedure consisted of three major steps: • Clear the environment; • Launch the infrastructure; and • Run the test application. • The reported results are • the arithmetic mean of 10 runs of each test case. • The arithmetic mean of the standard deviations was 1.46%.