Giray Kömürcü

ACTIVE FAULT TOLERANT SYSTEM for OPEN DISTRIBUTED COMPUTING (Autonomic and Trusted Computing 2006) Giray Kömürcü

OPEN DISTRIBUTED SYSTEMS • One of the most successful structures designed in the computing community • Have side effects such as: • Unanticipated runtime events • Reconfiguration burdens due to environmental changes • Increasing complexity limits development

OPEN DISTRIBUTED SYSTEMS • Reliability depends on both failures and performance • Required Reliability has to be maintained • A set of complex requirements needed due to fluctuations in the environment and its unpredictability

ACTIVE FAULT-TOLERANT MODEL • Exploits the knowledge of pre-fault behaviour to predict environmental faults and failures • Reduces the unpredictable nature of failures up to a certain limit • Provides a proactive approach to achieve the required reliability

ACTIVE FAULT-TOLERANT MODEL • Tolerates current failures that could not be predicted • Maintains user-specified reliability through proper replication strategies • Uses the information extracted from the system

ACTIVE FAULT-TOLERANT MODEL

PROACTIVE APPROACH of AFT MODEL • Design a mechanism to forecast faults and failures • If AFT predicts a high chance of system failure it takes necessary steps to avoid failure • Aim is to employ available information about suspected failures to provide required reliability

REAL-TIME APPROACH of AFT MODEL • Some failures cannot be predicted before they actually occur • Based on real-time decision making and reconfiguring according to current failures • First identifies failures, then tolerates them through adaptation strategies

AFT STRATEGIES • Replication is a complex function • Replication degree, Replica placement, Replication protocol, Communication between replicas • A single replication strategy is not enough to achieve the required reliability

ADJUSTING the DEGREE of REPLICATION • Optimal degree of replication can be achieved by AFT model • AFT policy may increase the degree of replication if a failure is more probable • AFT policy may decrease the degree of replication if a member leaves the system or to reduce communication costs
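The relation between failure probability and replication degree can be illustrated with a minimal sketch. It assumes that replicas fail independently with the same probability and that the service survives while at least one replica is alive; neither assumption comes from the slides, and the function name `replicas_needed` is hypothetical.

```python
def replicas_needed(node_failure_prob: float, target_reliability: float) -> int:
    """Smallest replica count n with 1 - p**n >= target.

    Assumption (not from the slides): replicas fail independently with the
    same probability p, and one surviving replica is enough.
    """
    if not 0.0 < node_failure_prob < 1.0:
        raise ValueError("node_failure_prob must be in (0, 1)")
    if not 0.0 < target_reliability < 1.0:
        raise ValueError("target_reliability must be in (0, 1)")
    n = 1
    # Add replicas until the chance that all of them fail is small enough.
    while 1.0 - node_failure_prob ** n < target_reliability:
        n += 1
    return n


# A more failure-prone environment calls for a higher degree of replication.
low_risk = replicas_needed(0.05, 0.99)
high_risk = replicas_needed(0.30, 0.99)
```

Under these assumptions the degree grows as failures become more probable and shrinks again when the environment stabilises, which is the behaviour the policy above describes.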

MIGRATION of CURRENT REPLICAS • Reliability depends not just on the number of replicas, but also on their placement • Prime concern: which nodes should host replicas • Workload, storage capacity, bandwidth, and server reliability are considered
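One way to picture the placement decision is a weighted score over the four criteria named on this slide. The sketch below is only an illustration: the `Node` fields, the normalisation to the range 0..1, and the equal weights are assumptions, not part of the AFT model as presented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    workload: float      # fraction of capacity in use, 0..1 (lower is better)
    free_storage: float  # normalised free storage, 0..1
    bandwidth: float     # normalised available bandwidth, 0..1
    reliability: float   # estimated probability the node stays up, 0..1


def score(node: Node, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Hypothetical weighted score; higher means a better replica host."""
    w_load, w_store, w_bw, w_rel = weights
    return (w_load * (1.0 - node.workload)
            + w_store * node.free_storage
            + w_bw * node.bandwidth
            + w_rel * node.reliability)


def choose_hosts(nodes, k):
    """Return the names of the k best-scoring nodes."""
    ranked = sorted(nodes, key=score, reverse=True)
    return [n.name for n in ranked[:k]]


nodes = [
    Node("a", workload=0.9, free_storage=0.2, bandwidth=0.5, reliability=0.9),
    Node("b", workload=0.2, free_storage=0.8, bandwidth=0.7, reliability=0.95),
    Node("c", workload=0.5, free_storage=0.5, bandwidth=0.5, reliability=0.6),
]
hosts = choose_hosts(nodes, 2)
```

Migration would then mean moving a replica from its current host to a better-scoring node whenever the ranking changes.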

SHIFTING into a SUITABLE REPLICATION PROTOCOL ADAPTIVELY

PRIMARY COPY REPLICATION • Any update of data is sent to the primary copy first • Updates are propagated to back-up nodes asynchronously • Efficient in terms of communication when many write messages occur • Single point of failure problems

READ-ONE WRITE-ALL REPLICATION • Updates are performed anywhere in the system • Important when information has to be replicated immediately • Efficient when dealing with failures • Slow when a significant number of write operations is needed

MAJORITY REPLICATION • It is an intermediate solution between Primary Copy and ROWA replication • May be done in a pairwise manner • Principle selection is based on the trade-off between reliability and communication cost
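The communication side of this trade-off can be made concrete by counting how many replicas a single write has to reach before it completes. This is a simplified sketch based on the textbook forms of the three protocols; the function name and the protocol labels are hypothetical, and asynchronous propagation in primary copy is deliberately not counted.

```python
def sync_write_contacts(protocol: str, n_replicas: int) -> int:
    """Replicas that must apply a write before it is acknowledged."""
    if n_replicas < 1:
        raise ValueError("n_replicas must be at least 1")
    if protocol == "primary_copy":
        return 1                    # only the primary; back-ups follow asynchronously
    if protocol == "rowa":
        return n_replicas           # write-all: every replica is updated
    if protocol == "majority":
        return n_replicas // 2 + 1  # strict majority quorum
    raise ValueError(f"unknown protocol: {protocol}")


contacts = {p: sync_write_contacts(p, 5)
            for p in ("primary_copy", "majority", "rowa")}
```

With five replicas, majority replication sits between the other two in write cost, which matches its description as an intermediate solution.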

SHIFTING into a SUITABLE REPLICATION PROTOCOL ADAPTIVELY

RELAXED vs STRICT • Message synchronization depends on the network traffic caused by replication and on communication overheads • Relaxed: • A set of updates is sent in a single message within a time period • Less traffic • Guarantees consistency only at certain points • Loss of work is higher during a failure • Not always consistent, but efficient • Strict: • Each update is sent in its own message • More traffic • Consistent at each point • Consistent but expensive
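The difference in traffic can be sketched by counting messages for the same stream of updates. The fixed-length batching window and both function names are assumptions made for illustration; the slides do not say how the time period is chosen.

```python
def strict_messages(update_times) -> int:
    """Strict synchronization: one message per update."""
    return len(update_times)


def relaxed_messages(update_times, window: float) -> int:
    """Relaxed synchronization: updates in the same window share one message.

    Assumption: windows are fixed-length and aligned at time 0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return len({int(t // window) for t in update_times})


times = [0.1, 0.2, 0.9, 1.5, 3.2]   # update timestamps in seconds
strict = strict_messages(times)
relaxed = relaxed_messages(times, window=1.0)
```

The saving comes at the cost listed above: updates waiting in an unsent batch are lost if the replica fails before the window closes.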

DESIGN of AFT MODEL ON JUICE OBJECT • Juice Model: Model for each replica • Based on adaptable object model • Reconfigures its internal object at run time • Consists of five internal elements

DESIGN of AFT MODEL ON JUICE OBJECT • AFT provides adaptation facilities as designed on the Juice Object model • Adaptation Handler (AH), Replication Handler (RH), Underlying System Information Evaluator (USIE), Client Member Information Evaluator (CMIE)

AFT FRAMEWORK

Collection of Information • USIE runs on each replica to collect the local resource information: usage patterns of resources, information on underlying system failures • Each machine holds a monitor object

Collection of Information • CMIE handles both the current replica’s information and the most recently connected client’s information (message failure rate, response time, network latency) • Gathered from the communicator of the Juice Model

Collection of Information

Information Analysis • Adaptation Handler(AH) analyses the suspected or known system faults and failures using the available information • Predicts future faults and estimates current reliability of the system • Carries out a cost-benefit analysis considering user requirements • If needed AH selects the best strategy • Number of replicas, placement, replication protocol
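As a rough illustration of how collected measurements could feed a prediction, the sketch below smooths the observed message failure rate with an exponentially weighted moving average and raises a suspicion flag above a threshold. The slides do not specify the prediction method; the moving average, the class name `FailureRateMonitor`, and the values of `alpha` and `threshold` are all assumptions.

```python
class FailureRateMonitor:
    """Hypothetical predictor over the message failure rate."""

    def __init__(self, alpha: float = 0.5, threshold: float = 0.3):
        self.alpha = alpha          # weight given to the newest sample
        self.threshold = threshold  # estimate above this means "failure suspected"
        self.estimate = 0.0

    def observe(self, failed: bool) -> float:
        """Fold one message outcome into the smoothed failure rate."""
        sample = 1.0 if failed else 0.0
        self.estimate = self.alpha * sample + (1.0 - self.alpha) * self.estimate
        return self.estimate

    def failure_suspected(self) -> bool:
        return self.estimate > self.threshold


monitor = FailureRateMonitor()
for outcome in (False, True, True):
    monitor.observe(outcome)
suspected_after_errors = monitor.failure_suspected()

for outcome in (False, False):
    monitor.observe(outcome)
suspected_after_recovery = monitor.failure_suspected()
```

A suspicion flag of this kind would be one input to the cost-benefit analysis, alongside the user's reliability requirement.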

Information Analysis • Selection of a suitable protocol should follow the agreement of all AHs of the replica group • One random member collects the votes of the replicas • Replicas switch to the new protocol simultaneously according to the decision
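The vote collection step might look like the sketch below, run by whichever member is chosen as collector. The decision rule is an assumption: the slides only say that the group must agree, so this example switches when a strict majority votes for the same protocol and otherwise keeps the current one. The function name and protocol labels are hypothetical.

```python
from collections import Counter


def decide_protocol(votes: dict, current: str) -> str:
    """Tally one vote per replica and return the protocol the group should use.

    Assumed rule: switch only if a strict majority of voters agree.
    """
    if not votes:
        return current
    winner, count = Counter(votes.values()).most_common(1)[0]
    return winner if count > len(votes) / 2 else current


agreed = decide_protocol(
    {"r1": "rowa", "r2": "rowa", "r3": "majority"}, current="primary_copy")
split = decide_protocol(
    {"r1": "rowa", "r2": "majority"}, current="primary_copy")
```

The collector would then broadcast the result so that all replicas switch at the same point.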

Execution of New Strategy • AH notifies the Replication Handler (RH) to replace itself with the new object • Since the model is based on two configuration levels, switching between strategies does not lead to inconsistencies

CONCLUSION • Describes the design of the AFT model, which allows the user to specify reliability and performance • AFT employs a combination of proactive and real-time fault-tolerant approaches in open distributed systems • The proactive approach exploits the knowledge from USIE & CMIE to warn against probable faults, reduce failures and increase performance significantly

CONCLUSION • The real-time approach deals with current faults • A single replication protocol cannot cope with environmental fluctuations • AFT uses three main strategies to fulfill the needs of the system • AFT allows the system to reconfigure and execute under different situations and is therefore tightly integrated with environmental changes

REFERENCE • Lanka R., Oda K., Yoshida T.: ACTIVE FAULT TOLERANT SYSTEM for OPEN DISTRIBUTED COMPUTING. Autonomic and Trusted Computing, (2006)

QUESTIONS? THANK YOU FOR LISTENING
