Characterization and Mitigation of Failures in Complex Systems
Dr. K. S. Trivedi
Dr. Y. Hong
Center for Advanced Computing and Communications
Dept. of Electrical & Computer Eng
Duke University
Durham, NC 27708-0294
System Dependability
• Impairments: faults, errors, failures
• Means:
  • Procurement: fault prevention (FP), fault tolerance (FT)
  • Validation: fault removal (FR), fault forecasting (FF)
• Attributes: reliability + performance (performability), availability + service (survivability), security, safety
Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University
System Dependability (Applications)
• FP: minimal maintenance
• FT: redundancy (hardware, software, information, time); diversity (data, design, environment)
• FR: verification (testing); maintenance, both corrective (repair) and preventive
• FF: probabilistic modeling; non-probabilistic modeling
Modeling and Analysis
[Diagram: probabilistic modeling methods and tools, linked to fault forecasting, fault removal, fault tolerance, and fault prevention]
A Research Triangle
• Theory: modeling methods and numerical solutions (CTMC, SRN, MRSPN, FSPN, NHCTMC, FTREE)
• Tools: SHARPE, SPNP, SREPT, SRA
• Applications: hardware and software (fault-tolerant systems, real-time systems, computer networks, wireless networks)
Modeling Methods
• Probabilistic modeling: fault trees; SMP; MRGP; DTMC, CTMC; NHCTMC; SRN; SPN; MRSPN; FSPN; SDE; FBM, FGN; HS
• Non-probabilistic modeling: nonlinear dynamics; HS
CTMC, SMP, MRGP
[Figure: sample paths of state versus time t. Panel labels: CTMC, T ~ Exp; SMP, T ~ Gen; MRGP, T ~ Exp and T1 ~ Gen]
Relation: Dependability Models
[Diagram relating the model types GSPN, CTMC, SRN, MRM, FTRE, RG, RBD, FT]
Modeling of HS and FSPN
• Statistically multiplexed ATM switch and HS model
[Figure labels: traffic (high priority, low priority), buffer B, ATM switch; plot of B against t]
Modeling of HS and FSPN
[Figure: FSPN model for the ATM switch, labeled K]
Solution Techniques
• Algorithms
  • Fast algorithms for network reliability and fault tree computation
  • Fast algorithms for SRN
• Numerical and simulation tools
  • SHARPE (symbolic hierarchical automated reliability and performance evaluator)
  • SREPT (software reliability estimation prediction tool)
  • SPNP (stochastic Petri net package)
  • SRA (software rejuvenation agents)
Architecture of SHARPE
• Model types: ftree, relgraph, block, Markov, SMP/MRGP, GSPN/SRN, graph, pfqn, mfqn
• Hierarchical and hybrid compositions
• Analyses: reliability/availability, performability, performance
Architecture of SPNP
• Stochastic Reward Net (SRN) models: Markovian stochastic Petri net, non-Markovian stochastic Petri net, fluid stochastic Petri net (FSPN)
• Reachability graph, solved by the analytic-numeric method:
  • Steady-state: SOR, Gauss-Seidel, power method
  • Transient: standard uniformization, Fox-Glynn method, stiff uniformization
• Discrete-event simulation (DES):
  • Steady-state: batch means, regenerative simulation
  • Transient: independent replication
  • Fast methods: restart, importance sampling, importance splitting
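Standard uniformization, one of the transient methods named on this slide, can be illustrated with a minimal sketch. This is not SPNP's implementation: the function name, the truncation rule, and the two-state example are assumptions chosen for illustration, and the plain Poisson recursion used here is only suitable for moderate values of q*t (the Fox-Glynn method addresses the large q*t case).

```python
import math

def uniformization(Q, p0, t, eps=1e-10):
    """Transient state probabilities p(t) of a CTMC with generator Q.

    Q is a list of rows, p0 the initial distribution, t the time point.
    The Poisson series is truncated once the neglected mass is below eps.
    """
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))      # uniformization rate
    if q == 0.0:
        return list(p0)                      # no transitions at all
    if q * t > 700.0:
        raise ValueError("q*t too large for this simple recursion")
    # Transition matrix of the uniformized DTMC: P = I + Q/q
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)                # Poisson pmf at k = 0
    acc = weight                             # accumulated Poisson mass
    v = list(p0)                             # p0 * P^k
    result = [weight * x for x in v]
    k = 0
    while 1.0 - acc > eps:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k                  # Poisson pmf at k
        acc += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

# Hypothetical two-state availability model: state 0 = up, state 1 = down,
# failure rate 1 and repair rate 9 (per unit time).
Q = [[-1.0, 1.0],
     [9.0, -9.0]]
print(uniformization(Q, [1.0, 0.0], 0.5))
```

For this two-state model the result can be checked against the closed form A(t) = mu/(lambda+mu) + lambda/(lambda+mu) * exp(-(lambda+mu) t).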
Architecture of SREPT
• Architecture-based approach
• Black-box approach
Black-box Approach
Inputs -> model -> outputs:
• Software product/process metrics + past experience -> fault density, regression tree -> estimate of the number of faults
• Test coverage -> ENHPP model -> predicted (1) failure intensity, (2) number of faults remaining, (3) reliability after release
• Sequence of interfailure times -> ENHPP model -> predicted (1) failure intensity, (2) number of faults remaining, (3) reliability after release, (4) coverage
• Sequence of interfailure times and release criteria -> optimization and ENHPP model -> predicted release time and predicted coverage
• Failure intensity and debugging policies -> NHCTMC, discrete-event simulation -> predicted (1) failure intensity, (2) number of faults remaining, (3) reliability after release
Architecture-based Approach
[Diagram: architecture and failure behavior of components feed analytical modeling or discrete-event simulation, giving reliability and performance prediction]
Challenges
[Diagram centered on "Markov model"; challenges and the approaches listed with them:]
• Largeness: fault trees using BDD; automated generation/solution based on SPN, SRN; hierarchical methods, fixed-point iteration
• Stiffness
• Non-exponential distributions: NHCTMC; SMP, MRGP; DES (discrete-event simulation)
• Combined performance and reliability/availability: MRM, SRN
• Joint continuous and discrete state spaces: FSPN, HS, SDE
Applications
• For high-performance and high-dependability
• Software
  • Software reliability (FR, FF)
  • Software rejuvenation (FF, FR)
  • Software fault-tolerance (FT, FF)
• Hardware (with software)
  • Preventive maintenance (FR, FF)
  • Availability model (FF)
  • Upgrade design (FT, FF)
  • Performability of wireless communications (FF)
  • Transient behavior of ATM networks with failure (FF)
  • Error recovery in communication networks (FF)
Software Reliability
• Early prediction of quality based on software product/process metrics
• Reliability growth modeling based on failure and coverage data collected during testing
• Architecture-based software reliability
• Developed a tool (SREPT)
Software Rejuvenation
Definition: a proactive fault management technique that counteracts the effects of software aging.
Approaches:
• Periodic: rejuvenation at regular (deterministic) time intervals, irrespective of system load
• Periodic and instantaneous load: rejuvenation at regular intervals, deferred until the system load is zero
• Prediction-based: rejuvenation at times based on estimated times to failure (implemented in IBM Netfinity Director)
  • Purely time-based estimation
  • Time and workload-based estimation
Analytical Modeling
[State diagram with states A, B, C; state labels: available, recovering, undergoing rejuvenation]
Subordinated non-homogeneous CTMC for t = d
Numerical Example
Service rate and failure rate are functions of time, μ(t) and ρ(t)
Measurement-based Estimation
• Objective: detection and validation of software aging
• Basic idea: periodically monitor and collect data on the attributes responsible for determining the health of the executing software
• Quantifying the effect of aging: proposed metric is the estimated time to exhaustion
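The estimated time to exhaustion can be illustrated with a minimal sketch: fit a trend to the monitored resource and project when it reaches zero. The slide does not give the estimator, so the choice here (the median of pairwise slopes, a robust trend estimate often called Sen's slope), the function names, and the sample data are all assumptions for illustration.

```python
from itertools import combinations
from statistics import median

def sen_slope(times, values):
    """Median of all pairwise slopes, a trend estimate robust to outliers."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    return median(slopes)

def estimated_time_to_exhaustion(times, free_resource):
    """Time from the last sample until the resource is projected to hit zero.

    Returns None when the trend is flat or increasing, in which case
    no exhaustion is predicted.
    """
    slope = sen_slope(times, free_resource)
    if slope >= 0:
        return None
    return free_resource[-1] / -slope

# Hypothetical samples: hours since start, free real memory in MB.
hours = [0, 1, 2, 3, 4, 5]
free_mb = [1000, 990, 985, 970, 961, 950]
print(estimated_time_to_exhaustion(hours, free_mb))
```

The result is expressed in the same time unit as the samples, so the example above prints a projected number of hours.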
Non-parametric Regression
[Plot: smoothing of Rossby data, Real Memory Free]
Software Fault Tolerance
• Recovery block:
  • common-mode software failures
  • common-mode acceptance test failures
  • failures clustered in the input stream
• A generalized model for the recovery block strategy to capture the dependencies:
  • operate input within a recovery block execution
  • evaluate module output in the recovery block
  • difficult inputs clustered in input space
SRN Model for Recovery Block
[Figure: a recovery block with clustered failures]
Hardware Maintenance
• Condition-based maintenance:
  • CTMC: closed-form solution
  • SHARPE: numerical solution
  • Find the optimal inspection interval
Hardware Maintenance
[Plot: comparison of closed-form result and numerical result]
Hardware Maintenance
• Time-based maintenance:
  • SMP-based solution (PM: preventive maintenance)
  • Find the optimal maintenance interval
Availability Models
Availability models commonly used in practice assume that times to outage and recovery are exponentially distributed.
• How accurate are all-exponential models for systems with limited information about outage and recovery?
• Can we give availability bounds for such systems?
Result of a General Model
• For a system with multiple types of outage-recovery (non-exponentially distributed outages), the underlying stochastic process is a semi-Markov process (SMP).
• We give a closed-form formula for system availability.
• Findings:
  • Only the mean time-to-recovery (E[TTR]) affects the availability of the system; the distribution of TTR does not matter.
  • However, the distribution of time-to-outage (TTO) does affect availability.
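The finding that only E[TTR] matters can be illustrated with a minimal sketch based on the standard steady-state result for semi-Markov processes (state probabilities are proportional to embedded-chain visit frequency times mean sojourn time). This is not the authors' closed-form formula: the single up state, the outage-type probabilities chosen independently of the up time, and all names and numbers below are assumptions. In this simplified setting the TTO distribution enters only through the mean up time and the outage-type probabilities.

```python
import random

def smp_availability(mean_up, outage_probs, mean_ttr):
    """Steady-state availability with one up state and one recovery state
    per outage type; only the mean recovery times are needed."""
    if abs(sum(outage_probs) - 1.0) > 1e-9:
        raise ValueError("outage_probs must sum to 1")
    mean_down = sum(p * r for p, r in zip(outage_probs, mean_ttr))
    return mean_up / (mean_up + mean_down)

def simulate(mean_up, outage_probs, ttr_samplers, cycles, seed=0):
    """Monte Carlo estimate of availability over a number of up/down cycles.

    Up times are exponential here (an assumption); each recovery time is
    drawn from the sampler of the outage type that occurred.
    """
    rng = random.Random(seed)
    up = down = 0.0
    kinds = range(len(outage_probs))
    for _ in range(cycles):
        up += rng.expovariate(1.0 / mean_up)
        k = rng.choices(kinds, weights=outage_probs)[0]
        down += ttr_samplers[k](rng)
    return up / (up + down)

# Hypothetical system: mean up time 100 h; 70% of outages take 1 h on
# average to recover, 30% take 10 h on average.
formula = smp_availability(100.0, [0.7, 0.3], [1.0, 10.0])
# Two sets of recovery-time distributions with the same means.
deterministic = [lambda r: 1.0, lambda r: 10.0]
mixed = [lambda r: r.uniform(0.0, 2.0), lambda r: r.expovariate(1.0 / 10.0)]
print(formula)
print(simulate(100.0, [0.7, 0.3], deterministic, 50_000))
print(simulate(100.0, [0.7, 0.3], mixed, 50_000))
```

Both simulated values should land close to the formula, since the two sampler sets differ in distribution shape but not in mean.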
Upgrade with Redundancy
• Load-sharing clustering is common in networks (wired, wireless, optical) and server systems.
• How to take advantage of the cluster structure and redundancy to upgrade hardware and software components?
• How to quantify system performance for different upgrade schemes?
[Figure: active node and standby node]
Upgrade Schemes
• Direct transfer: simple, but with long downtime; suited to non-real-time-critical systems.
• Phased upgrade: almost zero downtime, but subject to the upgrade paradox, version incompatibility, etc.
[Figure: baseline model]
Performability of Cellular Control Channel Protection
• Each cell has Nb base repeaters (BR)
• Each BR provides M TDM channels
• One control channel resides in one of the BRs
• Control channel down: system down (!)
APS: Automatic Protection Switch
• Upon control_down, the failed control channel is automatically switched to a channel on a working base repeater.
• control_down then leaves the system only partially down; it is no longer a full outage.
• Measures: Asys, Pblock, Pdrop
Model of System w/o APS
• CTMC state (b,k): b = number of BRs up, k = number of talking channels
Model of System w/ APS
• CTMC state (b,k): b = number of BRs up, k = number of talking channels
[Figure: a segment of the composite Markov chain model]
Numerical Results
[Plots: handoff call blocking probability improvement by APS; unavailability in handoff call dropping probability]
ATM Networks under Overloads
• To quantify the effects of transients in ATM networks, with a Markov modulated Poisson process (MMPP) used to allow for correlated arrivals
• Transient effects: relaxation time, maximum overshoot, and expected excess number of losses in overload
[Figure: 2-state MMPP traffic model with states ON (high) and OFF (low), arrival rates λ1 and λ0, and switching rates σ0 and σ1]
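A 2-state MMPP can be illustrated with a minimal sketch that computes its stationary state probabilities and mean arrival rate, and generates correlated arrivals by simulation. The state numbering is an assumption (state 0 is the low-rate state, left at rate sigma0; state 1 is the high-rate state, left at rate sigma1), as are the function names and the parameter values.

```python
import random

def mmpp2_stationary(sigma0, sigma1):
    """Stationary probabilities (p0, p1) of the 2-state modulating chain.

    sigma0 is the rate of leaving state 0, sigma1 the rate of leaving state 1.
    """
    total = sigma0 + sigma1
    return sigma1 / total, sigma0 / total

def mmpp2_mean_rate(lam0, lam1, sigma0, sigma1):
    """Long-run mean arrival rate of the MMPP."""
    p0, p1 = mmpp2_stationary(sigma0, sigma1)
    return p0 * lam0 + p1 * lam1

def mmpp2_arrivals(lam0, lam1, sigma0, sigma1, horizon, seed=0):
    """Arrival times in [0, horizon), starting in state 0.

    In each state, an arrival (rate lam) and a state switch (rate sigma)
    compete as exponential events.
    """
    rng = random.Random(seed)
    lam = (lam0, lam1)
    sigma = (sigma0, sigma1)
    t, state, arrivals = 0.0, 0, []
    while True:
        total = lam[state] + sigma[state]
        t += rng.expovariate(total)
        if t >= horizon:
            return arrivals
        if rng.random() < lam[state] / total:
            arrivals.append(t)          # next event is an arrival
        else:
            state = 1 - state           # next event is a state switch

# Hypothetical parameters: low rate 2, high rate 10, switching rates 1 and 3.
print(mmpp2_stationary(1.0, 3.0))
print(mmpp2_mean_rate(2.0, 10.0, 1.0, 3.0))
print(len(mmpp2_arrivals(2.0, 10.0, 1.0, 3.0, horizon=5000.0)) / 5000.0)
```

The empirical arrival rate from the simulation should be close to the analytic mean rate; burstiness grows as the two arrival rates move apart or the switching rates shrink.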
Queuing Model for ATM
A failure occurs in the connection between N1 and N3, so the traffic destined for N3 is re-routed through N2.
SRN Model for ATM
Numerical Results
[Plots: loss probability vs. burstiness; queue length vs. burstiness]
Error Recovery in Communication
• Study error recovery in communication networks using MRGP (non-exponential detection/restoration times)
[Figure: 1:N protection switching]
Two Modeling Methods
• CTMC: used when the detection and restoration times are ignored
• MRGP (a non-Markovian model): used when the detection and restoration times are included
Numerical Result
[Plot: comparison of CTMC and non-Markovian model]
Open Problems
• Theoretical studies in the modeling and analysis of complex systems and failures
  • Capabilities, limitations, and relationships among formalisms such as FSPN and HS, FSPN and SDE
• Fast algorithms
  • Fast algorithms for FSPN
  • Fast algorithms for discrete-event simulation
  • Fast algorithms for non-Markovian models
• Applications
  • Software: further exploration of software rejuvenation
  • Performability analysis of (wired and wireless) networks
The End
Thank you!