Explore the impact of software aging on system reliability and the concept of software rejuvenation to mitigate failures. Learn about analytic models such as CTMC, MRSPN, SMP, and SRN, as well as measurement-based models. Discover how software faults can be classified as Bohrbugs, Mandelbugs, and aging-related bugs, and the importance of proactive fault management in software systems.
Software Aging & Rejuvenation DA-IICT Gandhinagar January 9, 2006 Kishor S. Trivedi Dept. of Electrical & Computer Engineering Duke University Durham, NC 27708 kst@ee.duke.edu www.ee.duke.edu/~kst www.software-rejuvenation.com
Outline • Introduction & Motivation • Software reliability and fault tolerance • Software Aging and Rejuvenation • Analytic Models • CTMC model • MRSPN model • SMP model • Cluster systems – SRN model • Degradation model • Measurement-based Models • Time-based • Time and workload-based • Software Rejuvenation in a Commercial Server • Summary and Conclusions
Motivation: Dependence on Computer Systems Communication Health & Medicine Avionics Entertainment Banking
Downtime Costs per Hour • Brokerage operations $6,450,000 • Credit card authorization $2,600,000 • eBay (1 outage 22 hours) $225,000 • Amazon.com $180,000 • Package shipping services $150,000 • Home shopping channel $113,000 • Catalog sales center $90,000 • Airline reservation center $89,000 • Cellular service activation $41,000 • On-line network fees $25,000 • ATM service fees $14,000 Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. "...based on a survey done by Contingency Planning Research."
High Availability: Software is the problem (1) • Hardware fault tolerance, fault management, and reliability/availability modeling are relatively well developed • System outages are more often due to software faults • Software reliability is one of the weakest links in system reliability
High Availability: Software is the problem (2) • Fault avoidance through good software engineering practices is difficult for large/complex software systems • It is impossible to fully test and verify that software is fault-free • Yet there are stringent requirements for failure-free operation • Software fault tolerance is a potential way to improve software reliability, given that fault-free software is virtually impossible
Software Fault Tolerance Techniques • Design diversity • N-version programming • Recovery block • N self-checking programming • Data diversity • N-copy programming
Design-diversity-based software fault tolerance is expensive • Data diversity may have limited applicability • Stringent requirements for failure-free operation • Software remains the problem
Software Fault Tolerance: New thinking • Environment diversity • Checkpointing and rollback, retry • Helps in dealing with hardware transients • But also helps in dealing with software bugs • Replication of software modules (applications) • Does it help? If yes, why? • Proactive fault management (software rejuvenation) • Software rejuvenation is a cost-effective solution for improving software reliability by avoiding/postponing unanticipated software failures/crashes. It allows proactive repairs to be carried out at the discretion of the user/administrator, e.g., in the middle of the night
Software Aging • “Software Aging” phenomenon • Long-running software tends to show an increasing failure rate. • Not related to application program becoming obsolete due to changing requirements/maintenance. • What constitutes aging? • Deterioration in the size of free OS resources • Accumulation of internal errors • Common examples • Memory leaks • Data corruption • Fragmentation • Round-off errors
Software Aging – Examples • Netscape, xrn, Windows 9x • LAX airport shutdown (Sep 12, 2004) • File system aging [Smith & Seltzer] • Crash/hang failures in general purpose applications • Gradual service degradation in the AT&T transaction processing system [Avritzer et al.] • Error accumulation in Patriot missile system’s software [Marshall] • Resource exhaustion in Apache [Li et al.]
Software Fault Classification • Bohrbugs: Software bugs that are reproducible, easily found and (often) fixed during the testing and debugging phase • Mandelbugs: Software bugs that are hard to find and fix; (often) remain in the software during the operational phase • These bugs may never be fixed, but if the operation is retried or the system is rebooted, the bugs may not manifest themselves as failures • Manifestation is non-deterministic and dependent on the software (or its environment) reaching very rare states • Yet another cause of software failures is resource exhaustion, e.g., memory leakage, swap space fragmentation • Software appears to “age” due to resource exhaustion
Software Fault Classification (diagram) • Bohrbugs • Mandelbugs • Aging-related bugs
Software Fault Classification (diagram) • Software (OS, recovery s/w, applications) • Fault classes: Bohrbugs, Mandelbugs, “aging”-related bugs • Remedies shown: test/debug, design/data diversity, retry operation, restart application, reboot node • Phases shown: design/development, operational
Environment Diversity: New Approach to S/W FT • Transient nature of software failures • [Gray] Bohrbugs and Mandelbugs (or Heisenbugs) • [Lee & Iyer] Tandem GUARDIAN: 70% transient faults • [Sullivan & Chillarege] IBM’s system software: most failures caused by peak conditions in workload, timing and exception errors • Environmental diversity (reactive in approach) • Allows the use of time redundancy over expensive design diversity • [Adams] [Gray] [Siewiorek] Restart • [Jalote et al.] Rollback, rollforward • [Wang et al.] Progressive retry • [folklore] Occasional reboot, “switch off and on” • Proactive approach • Software rejuvenation
Software Rejuvenation: Definition • Proactive fault management technique aimed at postponing/preventing crash failures and/or performance degradation • Involves occasionally stopping the running software, “cleaning” its internal state and/or its environment and restarting it • Rejuvenation of the environment, not of software • Counteracts the aging phenomenon • Frees up OS resources • Removes error accumulation • Common techniques for cleaning • Garbage collection, defragmentation, flushing kernel and file server tables, etc.
Software Rejuvenation – Examples • AT&T billing applications [Huang et al.] • JPL REE System • Patriot missile system software - switch off and on every 8 hours [Marshall] • On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.] • IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers] • Microsoft IIS 5.0 process recycling tool • Process restart in Apache [Li et al.]
Software Rejuvenation – Trade-off • Advantages • Reduces costs of sudden aging-related failures. • Can be applied at the discretion of the user/ administrator, e.g., in the middle of the night. • Disadvantages • Direct costs of carrying out rejuvenation • Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions, etc.) Important research issue: Find optimal times to perform rejuvenation!
Software Rejuvenation – Trade-off • Two approaches • Use analytical model to optimize rejuvenation schedule • Lucent Bell Labs [Huang et al., ’95] • Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEETR05] • Others [IPDS’98, PNPM’99] • Use measurements of resource degradation to determine optimal rejuvenation schedule • Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEETPDS05]
Analytic Models • Single node models • Condition based • CTMC model • SMP model • Time-based • MRSPN model • Cluster systems • IBM Cluster model (Time-based, condition-based) • Motorola CMTS Model • Degradation model to analytically show that TTF is IFR
Failure rate • Preventive maintenance is useful only if the failure rate is increasing • If the time-to-failure (TTF) distribution is exponential, then the failure rate is constant • Need to assume (and establish) that TTF is IFR (increasing failure rate)
Analytic Models: Software Aging and Rejuvenation • A simple and useful model of increasing failure rate (state diagram): Robust state, Failure probable state, Failed state • Time to failure: hypo-exponential distribution • Increasing failure rate implies aging
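The claim that a two-stage hypo-exponential time to failure has an increasing failure rate, while an exponential one does not, can be checked numerically. The sketch below is only an illustration: the function names and the stage rates `lam1` and `lam2` are assumptions, not values from the slides.

```python
import math


def hypoexp_hazard(t, lam1, lam2):
    """Hazard rate h(t) = f(t) / R(t) of a two-stage hypo-exponential TTF.

    lam1: assumed rate of leaving the robust state
    lam2: assumed rate of failing from the failure-probable state
    """
    if lam1 == lam2:
        raise ValueError("this closed form needs distinct stage rates")
    e1, e2 = math.exp(-lam1 * t), math.exp(-lam2 * t)
    density = lam1 * lam2 * (e1 - e2) / (lam2 - lam1)
    survival = (lam2 * e1 - lam1 * e2) / (lam2 - lam1)
    return density / survival


def exp_hazard(t, lam):
    """Hazard rate of an exponential TTF: constant, independent of t."""
    return lam


if __name__ == "__main__":
    for t in (0, 1, 5, 10, 20):
        print(t, hypoexp_hazard(t, 0.5, 0.1), exp_hazard(t, 0.1))
```

The hypo-exponential hazard starts at zero and rises toward the smaller of the two stage rates, which is the aging behavior that makes preventive maintenance worthwhile. The exponential hazard stays flat, so restarting gains nothing.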
Analytic Models: CTMC model [Huang95] • Model w/o rejuvenation (state diagram): Robust state, Failure probable state, Failed state • Model with rejuvenation (state diagram): Robust state, Failure probable state, Failed state, Rejuvenation state • From this continuous-time Markov chain model, a closed-form expression can be found for the optimal rejuvenation trigger rate (r4)
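A minimal sketch of how the four-state model with rejuvenation can be solved for its steady state. The slide names only the trigger rate r4. The other rate names (`r1`, `r2`, `r3`, `lam`), the transition structure, and the cost function are assumptions made for illustration.

```python
def rejuvenation_ctmc_steady_state(r2, lam, r1, r3, r4):
    """Steady-state probabilities of an assumed four-state rejuvenation CTMC.

    Assumed transitions (all rates per unit time):
      robust           -> failure-probable : r2
      failure-probable -> failed           : lam
      failure-probable -> rejuvenating     : r4  (rejuvenation trigger rate)
      failed           -> robust           : r1
      rejuvenating     -> robust           : r3
    """
    # Solve the balance equations relative to the robust-state probability.
    fp_ratio = r2 / (lam + r4)
    failed_ratio = fp_ratio * lam / r1
    rejuv_ratio = fp_ratio * r4 / r3
    p_robust = 1.0 / (1.0 + fp_ratio + failed_ratio + rejuv_ratio)
    return {
        "robust": p_robust,
        "failure_probable": p_robust * fp_ratio,
        "failed": p_robust * failed_ratio,
        "rejuvenating": p_robust * rejuv_ratio,
    }


def expected_cost_rate(probs, cost_failure, cost_rejuvenation):
    """Hypothetical downtime cost per unit time, weighted by state occupancy."""
    return (cost_failure * probs["failed"]
            + cost_rejuvenation * probs["rejuvenating"])


if __name__ == "__main__":
    for r4 in (0.0, 0.01, 0.1, 1.0):
        p = rejuvenation_ctmc_steady_state(1 / 240, 1 / 2160, 2.0, 6.0, r4)
        print(r4, p["failed"], p["rejuvenating"], expected_cost_rate(p, 100.0, 10.0))
```

Setting `r4 = 0` recovers the model without rejuvenation. Scanning `r4` shows the trade-off the deck describes: less time in the failed state, more time in the rejuvenation state.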
Analytic Models: Semi-Markov model [Dohi00] • Relax the assumption of exponentially distributed sojourn times (time-independent transition rates) • Hence have a semi-Markov model • Can find closed-form expression for the optimal (deterministic) time to rejuvenation trigger
Analytic Models: MRSPN model [Garg95] • Used if the degraded state can’t be determined • If we allow the rejuvenation trigger clock to start in the robust state, we obtain a Markov regenerative process
Analytic Models: MRSPN model [Garg95] • Optimal (deterministic) time to rejuvenation trigger is determined numerically
Cluster Systems • [Pfister] Collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself • Easier scaling to larger systems, high levels of availability/performance and low management costs • No single point of failure • Node failures transparent to users • Graceful repairs, shutdowns, upgrades
Rejuvenation for Cluster Systems: Motivation • Rejuvenation using the fail-over mechanisms • Long-term benefits in terms of availability/performance • Continuous operation (possibly at a degraded level) • Practically zero downtime • Less disruptive and lower overhead than unplanned outages • Transparent to user/application • Most current industry initiatives are reactive • Two approaches • Simple time-based (periodic) • Condition-based (only from the “failure-impending” state)
Rejuvenation for Cluster Systems: SRN Models • Rejuvenation using the fail-over mechanisms in a rolling fashion • Modeling using SRNs (Stochastic Reward Nets) • Analysis for 2 rejuvenation policies • Simple time-based policy • All nodes rejuvenated successively at the end of each rejuvenation interval • Condition-based policy • Nodes rejuvenated only from the “failure-probable” state • Various configurations • a/b: cluster with a nodes that can tolerate at most b individual node failures, i.e., an (a-b)-out-of-a system • Model solution • SPNP (Stochastic Petri Net Package)
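To make the a/b notation concrete, the sketch below computes the availability of an (a-b)-out-of-a system. It assumes nodes fail independently with the same steady-state availability, which is a simplification: the SRN models in the deck also capture rejuvenation, coverage and repair behavior that this ignores. The function name and parameter values are hypothetical.

```python
from math import comb


def cluster_availability(a, b, node_availability):
    """Probability that at least a - b of a independent nodes are up.

    a: number of nodes in the cluster
    b: maximum number of individual node failures tolerated
    node_availability: assumed steady-state availability of one node
    """
    p = node_availability
    return sum(
        comb(a, up) * p**up * (1 - p) ** (a - up)
        for up in range(a - b, a + 1)
    )


if __name__ == "__main__":
    for config in ((8, 1), (8, 2), (16, 1)):
        print(config, cluster_availability(*config, node_availability=0.999))
```

Under this independence assumption, adding a second spare (8/2 versus 8/1) sharply reduces the probability of a cluster outage.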
Model Parameters and Measures Computed (tables not reproduced)
Results: Simple Time-Based Rejuvenation • Effect of cost_nodefail/cost_rejuv for the 8/1 configuration • Cost of node failure is fixed • A decrease in the cost ratio implies an increase in the cost of rejuvenation • Hence, a decrease in the cost ratio increases the total expected cost • As the rejuvenation interval increases, rejuvenation is performed less frequently • As the rejuvenation interval tends to infinity, almost no rejuvenation is performed and all the plots tend to the same value
Results: Simple Time-Based Rejuvenation • Expected cost for various configurations (plot not reproduced)
Recap: Analysis of Rejuvenation for Cluster Systems • Huge benefit in terms of unavailability (UA) and cost improvement for systems with more than one spare • Simple time-based policy is better than prediction-based in some cases • Condition-based policy is much better for large node repair times and low node-failure coverage • Future work • Consider other performability measures • Explore non-ideal effects of common-mode failure and node-failure coverage
Application to CMTS: Example [issre02] • Cable modem system and broadband access • Most popular & promising high speed Internet access • Success lies in the widespread HFC cable networks and the industry standard DOCSIS • High availability requirement of CMTS • Cable modem termination system (CMTS) is the most complex and crucial component of the system • Existing approaches only provide hardware redundancy • Current systems cannot achieve 5 nines availability • Our proposed approach and contributions • Propose software rejuvenation in CMTS cluster system • Construct analytic models, obtain numerical results, optimize rejuvenation parameters, and show the benefits
Basic model without rejuvenation (diagram) • PCMTS and SCMTS, each subject to HW & SW failures • HW failure detection, switchover, repair, and giveback • HW failure detection and repair • SW failure detection, switchover, reboot, and giveback • SW failure detection and reboot
Model with rejuvenation (diagram) • PCMTS and SCMTS submodels, each otherwise the same as the basic system • Rejuvenation for the robust and “aged” nodes • Rejuvenation, switchover, and giveback • Timer: approximate the deterministic timer interval by an r-stage Erlang distribution
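The r-stage Erlang approximation of a deterministic timer can be illustrated with a few lines of arithmetic. In this sketch (function and parameter names are hypothetical), each of the r exponential stages gets rate r/T so that the mean stays at the timer interval T, while the coefficient of variation shrinks as 1/sqrt(r).

```python
import math


def erlang_timer(interval, stages):
    """Parameters of an r-stage Erlang approximation to a fixed timer.

    interval: deterministic timer interval T to approximate
    stages:   number of exponential stages r
    Returns (per-stage rate, mean, coefficient of variation).
    """
    rate = stages / interval          # each stage has mean T / r
    mean = stages / rate              # sum of r stage means, equals T
    variance = stages / rate**2       # sum of r stage variances, equals T^2 / r
    return rate, mean, math.sqrt(variance) / mean


if __name__ == "__main__":
    for r in (1, 4, 16, 64):
        print(r, erlang_timer(interval=100.0, stages=r))
```

More stages give a closer approximation to the fixed interval, at the price of a larger Markov state space.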
Degradation model [DSN 03,IEEETR05] • Explicitly connecting resource leaks with failure rate and hence aging
Problem Parameters • Total amount of resource: • Resource request arrival rate: • Resource release rate: • Accumulated resource leak at time t: • Number of processes in the system: • Conditional probability that system fails to honor the resource request at state k upon the arrival of new request:
Degradation Model (cont’d) • Failure rate: • Conditional probability: • Homogeneous CTMC (leakless) • Non-homogeneous (leak-present)
Degradation Analysis • Asymptotically constant failure rate (leakless)
Degradation Analysis • Monotonic degradation (leak-present)
Measurement-Based Approach • Objective • Detection and validation of aging • Periodically monitor and collect data on the attributes responsible for the “health” of the system • Quantify the effect of aging on system resources • Proposed metric – Estimated time to exhaustion • Three approaches • Time-based (workload-independent) estimation [Garg98] • Workload-based estimation [Vaidyanathan99,TDSC05] • ARMA/ARX models [Li02]
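A minimal sketch of a time-based (workload-independent) estimate of time to exhaustion, assuming periodic samples of a free resource such as free memory. It fits a robust linear trend with the median of pairwise slopes (Sen's slope) and extrapolates to zero. The function names and sample data are hypothetical, and this is one simple choice of estimator, not necessarily the exact procedure of the cited work.

```python
from itertools import combinations
from statistics import median


def sen_slope(times, values):
    """Median of the slopes between all pairs of samples (robust trend)."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    return median(slopes)


def estimated_time_to_exhaustion(times, free_amounts):
    """Time from the last sample until the free resource reaches zero.

    Returns None when the fitted trend is not decreasing, i.e. when the
    samples show no sign of aging for this resource.
    """
    slope = sen_slope(times, free_amounts)
    if slope >= 0:
        return None
    return -free_amounts[-1] / slope


if __name__ == "__main__":
    hours = [0, 1, 2, 3, 4, 5]
    free_mb = [512, 505, 499, 490, 486, 478]   # hypothetical monitoring data
    print(estimated_time_to_exhaustion(hours, free_mb))
```

A rejuvenation schedule could then be chosen so that a restart happens comfortably before the estimated exhaustion time.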