Making Services Fault Tolerant

Pat Chan, Michael R. Lyu
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Miroslaw Malek
Department of Computer Science and Engineering, Humboldt University Berlin

Outline
• Introduction
• Problem Statement
• Methodologies for Web Service Reliability
• New Reliable Web Service Paradigm
• Road Map for Experiment
• Experimental Results and Discussion
• Conclusion

Introduction
• Service-oriented computing is becoming a reality.
• Service-oriented Architectures (SOA) are based on a simple model of roles.
• The problems of service dependability, security and timeliness are becoming critical.
• We propose experimental settings and offer a roadmap to dependable Web services.

Problem Statement
• Fault-tolerant techniques
  • Replication
  • Diversity
• Replication is an efficient way to build reliable systems through time or space redundancy.
  • It increases the availability of distributed systems.
  • Key components are re-executed or replicated.
  • It protects against hardware malfunctions and transient system faults.
• Another efficient technique is design diversity.
  • Software systems or services are designed independently by different programming teams.
  • It serves as a last resort in defending against permanent software design faults.
• We focus on the analysis of replication techniques applied to Web services.
• A generic Web service system with spatial as well as temporal replication is proposed and investigated.

Methodologies for reliable Web services: Redundancy
• Spatial redundancy
  • Static redundancy: all replicas are active at the same time, and voting takes place to obtain a correct result.
  • Dynamic redundancy: only one replica is engaged at a time, while the others are kept in an active or standby state.
• Temporal redundancy
  • Redundancy in time
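The static-redundancy bullet above can be illustrated with a minimal sketch. It assumes each replica is a plain Python callable standing in for a Web service invocation, and that a result counts only if a strict majority of all replicas agree on it. The names `majority_vote` and `invoke_all_and_vote` are hypothetical and do not come from the slides.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable], n_replicas: int) -> Optional[Hashable]:
    """Return the value reported by a strict majority of all replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > n_replicas / 2 else None


def invoke_all_and_vote(
    replicas: Sequence[Callable[[int], Hashable]], request: int
) -> Optional[Hashable]:
    """Static redundancy: every replica handles the request, then the results are voted on.

    Replicas are called one after another here to keep the sketch short;
    a real deployment would more likely call them in parallel.
    """
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            # A crashed replica contributes no vote (assumption of this sketch).
            pass
    return majority_vote(results, len(replicas))
```

With three replicas, one wrong answer or one crash is masked, while a wrong answer combined with a crash leaves no majority and the call returns `None`.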

Methodologies for reliable Web services: Diversity
• Protect redundant systems against common-mode failures
• With different designs and implementations, common failure modes will probably cause different error effects.
• N-version programming, recovery blocks…

Failure Response Stages of Web Services
• Fault confinement
• Fault detection
• Diagnosis
• Fail-over
• Reconfiguration
• Recovery
• Restart
• Repair
• Reintegration

[Figure: failure response stages with offline and online paths, covering fault confinement, fault detection, failover, diagnosis, repair, recovery, reconfiguration, restart, and reintegration]

Proposed Paradigm
[Figure: architecture of the proposed system. Labels recoverable from the figure are listed below.]
• Replication Manager (RM), with a Web service selection algorithm and a WatchDog
  • Creates the Web services
  • Selects the primary Web service (PWS)
  • Keeps checking the availability of the PWS
  • If the PWS fails, reselects the PWS
• Replicated Web services, each with IIS, an application, and a database
• Client and UDDI registry
• Numbered steps visible in the figure: 3. Register; 4. Look up WSDL; 5. Get WSDL; 6. Invoke Web service; 9. Update the WSDL

Work Flow of the Replication Manager
• The RM sends a message to the Web service.
• Get reply: the RM keeps checking.
• Do not get reply: the RM reselects a primary Web service and maps the new address to the WSDL.
• All services failed: system fail.
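The work flow above can be sketched as a single check step of the replication manager. This is a minimal illustration under several assumptions: the slides do not describe the selection algorithm, so the sketch simply takes the first replica that answers; `is_alive` is a hypothetical health probe; and a plain dictionary stands in for the WSDL address mapping held in the registry.

```python
from typing import Callable, Dict, List, Optional


class AllReplicasFailed(Exception):
    """No replica answered; corresponds to the 'system fail' outcome."""


class ReplicationManager:
    def __init__(
        self,
        replicas: List[str],
        is_alive: Callable[[str], bool],
        registry: Dict[str, str],
        service_name: str = "service",
    ) -> None:
        self.replicas = replicas          # addresses of the replicated Web services
        self.is_alive = is_alive          # hypothetical probe: True if the replica replies
        self.registry = registry          # stand-in for the WSDL address mapping
        self.service_name = service_name
        self.primary: Optional[str] = None

    def select_primary(self) -> str:
        """Pick the first replica that replies (assumed policy) and publish its address."""
        for address in self.replicas:
            if self.is_alive(address):
                self.primary = address
                self.registry[self.service_name] = address  # map the new address
                return address
        self.primary = None
        raise AllReplicasFailed("all Web services failed")

    def check_once(self) -> str:
        """One polling step: keep the primary if it replies, otherwise reselect."""
        if self.primary is not None and self.is_alive(self.primary):
            return self.primary
        return self.select_primary()
```

In a running system `check_once` would be called at the polling interval; the loop and its timing are left out so that the sketch stays deterministic.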

Road Map for Experiment Research
• Redundancy in time
• Redundancy in space
  • Sequentially
  • Parallel
• Majority voting using N modular redundancy
• Diversified versions of different services

Experiments
• A series of experiments was designed and performed to evaluate the reliability of the Web service:
  • single service without replication,
  • single service with retry or reboot, and
  • service with spatial replication.
• We will also perform retry or failover when the Web service is down.

Summary of the experiments (cell values are experiment numbers)

Configuration                None   Retry/Reboot   Failover   Both (hybrid)
Single service, no retry     0      --             --         --
Single service with retry    --     1              --         --
Single service with reboot   --     2              --         --
Spatial replication          --     --             3          4
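The difference between experiments 0, 1, 3, and 4 can be sketched from the client's point of view. This is an illustration only: `invoke` and its parameters are hypothetical names, replicas are plain callables standing in for Web service calls, and the reboot case (experiment 2) is not modelled.

```python
from typing import Callable, Sequence


def invoke(
    replicas: Sequence[Callable[[int], int]],
    request: int,
    retries: int = 0,
    failover: bool = False,
) -> int:
    """Send a request with optional retry (time) and failover (space).

    retries=0, failover=False  -> experiment 0 (no fault tolerance)
    retries>0, failover=False  -> experiment 1 (retry only)
    retries=0, failover=True   -> experiment 3 (failover only)
    retries>0, failover=True   -> experiment 4 (hybrid)
    """
    # Without failover only the first replica may be used.
    candidates = replicas if failover else replicas[:1]
    last_error = None
    for replica in candidates:
        for _ in range(1 + retries):
            try:
                return replica(request)
            except Exception as error:
                last_error = error  # remember the failure and try again
    raise RuntimeError("request failed on every permitted attempt") from last_error
```

The hybrid setting exhausts the retries on one replica before moving to the next. Other orderings are possible; the slides do not say which one the experiments used.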

Parameters of the Experiments

Parameter                         Current setting/metric
Request frequency                 1 req/min
Polling frequency                 5 ms
Number of replicas                5
Client timeout period for retry   10 s
Failure rate λ                    # failures/hour
Load (profile of the program)     % or load function
Reboot time                       10 min
Failover time                     1 s

Experimental Results
Experiments over 360-hour periods (43200 reqs)

Number of failures
Experiment   Normal   Server busy   Server reboots periodically
Exp 0        4928     6130          6492
Exp 1        2210     2327          2658
Exp 2        2561     3160          3323
Exp 3        1324     1711          1658
Exp 4        1089     1148          1325

Reported change with each technique:
• Retry: 11.97% to 4.93%
• Reboot: 11.97% to 6.44%
• Failover: 11.97% to 3.56%
• Retry and failover: 11.97% to 2.59%

Number of failures when the server is in a normal situation

Number of failures when the server is busy

Number of failures when the server reboots periodically

Reliability of the system over time

Reliability Model

Reliability Model Parameters

ID   Description                                        Value
λn   Network failure rate                               0.02
λ*   Web service failure rate                           0.228
λ1   Resource problem rate                              0.142
λ2   Entry point failure rate                           0.150
μ*   Web service repair rate                            0.286
μ1   Resource problem repair rate                       0.979
μ2   Entry point failure repair rate                    0.979
C1   Probability that the RM responds on time           0.9
C2   Probability that the server reboots successfully   0.9

Outcome (SHARPE)
[Figure: reliability of the proposed system, plotted for failure rates 0.228, 0.114, and 0.057]
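The SHARPE model itself is not reproduced in this transcript. For orientation only, the sketch below evaluates the textbook reliability function R(t) = exp(-λt) of a single unreplicated component with a constant failure rate. It is not the authors' model of the proposed system, and the time unit is an assumption: it is taken to match the unit of the failure rate, which the slide does not state.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Probability that a component with constant failure rate survives until time t.

    Baseline for a single unreplicated component, R(t) = exp(-rate * t);
    it is not the SHARPE model behind the figure.
    """
    return math.exp(-failure_rate * t)


def reliability_table(rates, times):
    """Return {rate: [R(t) for each t]} for a quick side-by-side comparison."""
    return {rate: [reliability(rate, t) for t in times] for rate in rates}
```

Calling `reliability_table([0.228, 0.114, 0.057], [1, 5, 10])` gives one row per failure rate. Halving the rate has the same effect on this baseline as halving the elapsed time.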

Conclusion
• Surveyed replication and design diversity techniques for reliable services.
• Proposed a hybrid approach to improving the availability of Web services.
• Carried out a series of experiments to evaluate the availability and reliability of the proposed Web service system.
• N-Version Programming may finally become commercially viable in a service environment.
