Software Aging & Rejuvenation: Ensuring System Reliability

Software Aging & Rejuvenation DA-IICT Gandhinagar January 9, 2006 Kishor S. Trivedi Dept. of Electrical & Computer Engineering Duke University Durham, NC 27708 kst@ee.duke.edu www.ee.duke.edu/~kst www.software-rejuvenation.com

Outline • Introduction & Motivation • Software reliability and fault tolerance • Software Aging and Rejuvenation • Analytic Models • CTMC model • MRSPN model • SMP model • Cluster systems – SRN model • Degradation model • Measurement-based Models • Time-based • Time and workload-based • Software Rejuvenation in a Commercial Server • Summary and Conclusions

Motivation: Dependence on Computer Systems • Communication • Health & Medicine • Avionics • Entertainment • Banking

Downtime Costs per Hour • Brokerage operations $6,450,000 • Credit card authorization $2,600,000 • eBay (1 outage 22 hours) $225,000 • Amazon.com $180,000 • Package shipping services $150,000 • Home shopping channel $113,000 • Catalog sales center $90,000 • Airline reservation center $89,000 • Cellular service activation $41,000 • On-line network fees $25,000 • ATM service fees $14,000 Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. "...based on a survey done by Contingency Planning Research."

Software reliability is one of the weakest links in system reliability High Availability: Software is the problem (1) • Hardware fault tolerance, fault management, reliability/availability modeling are relatively well developed • System outages are more often due to software faults

Software fault tolerance is a potential solution to improve software reliability, since fault-free software is virtually impossible High Availability: Software is the problem (2) • Fault avoidance through good software engineering practices is difficult for large/complex software systems • Impossible to fully test and verify that software is fault-free • Yet there are stringent requirements for failure-free operation

Software Fault Tolerance – Techniques • Design diversity • N-version programming • Recovery block • N self-checking programming • Data diversity • N-copy programming

• Design diversity based software fault tolerance is expensive • Data diversity may have limited applicability • Stringent requirements for failure-free operation • Software remains the problem

Software Fault Tolerance – New thinking • Environment diversity • Checkpointing and rollback, retry • Helps in dealing with hardware transients • But also helps in dealing with software bugs • Replication of software modules (applications) • Does it help? If yes, why? • Proactive fault management (software rejuvenation) • Software rejuvenation is a cost-effective solution for improving software reliability by avoiding/postponing unanticipated software failures/crashes. It allows proactive repairs to be carried out at the discretion of the user/administrator, e.g., in the middle of the night

Software Aging • “Software Aging” phenomenon • Long-running software tends to show an increasing failure rate. • Not related to application program becoming obsolete due to changing requirements/maintenance. • What constitutes aging? • Deterioration in the size of free OS resources • Accumulation of internal errors • Common examples • Memory leaks • Data corruption • Fragmentation • Round-off errors

Software Aging – Examples • Netscape, xrn, Windows 9x • LAX airport shutdown (Sep 12, 2004) • File system aging [Smith & Seltzer] • Crash/hang failures in general purpose applications • Gradual service degradation in the AT&T transaction processing system [Avritzer et al.] • Error accumulation in Patriot missile system’s software [Marshall] • Resource exhaustion in Apache [Li et al.]

Software Fault Classification • Bohrbugs: Software bugs that are reproducible, easily found and (often) fixed during the testing and debugging phase • Mandelbugs: Software bugs that are hard to find and fix; (often) remain in the software during the operational phase • These bugs may never be fixed, but if the operation is retried or the system is rebooted, the bugs may not manifest themselves as failures • Manifestation is non-deterministic and dependent on the software (or its environment) reaching very rare states • Yet another cause of software failures is resource exhaustion, e.g., memory leakage, swap space fragmentation • Software appears to “age” due to resource exhaustion

Software Fault Classification [Figure: Bohrbugs, Mandelbugs, and aging-related bugs]

Software Fault Classification [Figure: faults in software (OS, recovery s/w, applications) classified as Bohrbugs, Mandelbugs, and “aging” related bugs; remedies labeled test/debug, des./data diversity, retry opn., restart app., reboot node; phases labeled design/development and operational]

Environment Diversity – New Approach to S/W FT • Transient nature of software failures • [Gray] Bohrbugs and Mandelbugs (or Heisenbugs) • [Lee & Iyer] Tandem GUARDIAN – 70% transient faults • [Sullivan & Chillarege] IBM’s system software – most failures caused by peak conditions in workload, timing and exception errors • Environmental Diversity (reactive in approach) • Allows the use of time redundancy over expensive design diversity • [Adams] [Gray] [Siewiorek] Restart • [Jalote et al.] Rollback, rollforward • [Wang et al.] Progressive retry • [folklore] Occasional reboot, “switch off and on” • Proactive approach • Software rejuvenation

Software Rejuvenation – Definition • Proactive fault management technique aimed at postponing/preventing crash failures and/or performance degradation • Involves occasionally stopping the running software, “cleaning” its internal state and/or its environment and restarting it • Rejuvenation of the environment, not of software • Counteracts the aging phenomenon • Frees up OS resources • Removes error accumulation • Common techniques for cleaning • Garbage collection, defragmentation, flushing kernel and file server tables, etc.

Software Rejuvenation – Examples • AT&T billing applications [Huang et al.] • JPL REE System • Patriot missile system software - switch off and on every 8 hours [Marshall] • On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.] • IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers] • Microsoft IIS 5.0 process recycling tool • Process restart in Apache [Li et al.]

Software Rejuvenation – Trade-off • Advantages • Reduces costs of sudden aging-related failures • Can be applied at the discretion of the user/administrator, e.g., in the middle of the night • Disadvantages • Direct costs of carrying out rejuvenation • Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions, etc.) • Important research issue: find optimal times to perform rejuvenation

Software Rejuvenation – Trade-off • Two approaches • Use analytical model to optimize rejuvenation schedule • Lucent Bell Labs [Huang et al., ’95] • Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEETR05] • Others [IPDS’98, PNPM’99] • Use measurements of resource degradation to determine optimal rejuvenation schedule • Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEETPDS05]

Analytic Models • Single node models • Condition based • CTMC model • SMP model • Time-based • MRSPN model • Cluster systems • IBM Cluster model (Time-based, condition-based) • Motorola CMTS Model • Degradation model to analytically show that TTF is IFR

Failure rate • Preventive maintenance is useful only if the failure rate is increasing • If the time to failure (TTF) distribution is exponential, then the failure rate is constant • Need to assume (and establish) that TTF has an increasing failure rate (IFR)

Analytic Models – Software Aging and Rejuvenation • A simple and useful model of increasing failure rate: robust state, failure probable state, failed state • Time to failure: hypo-exponential distribution • Increasing failure rate, i.e., aging
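The contrast between a constant and an increasing failure rate can be checked numerically. The following is a minimal sketch, assuming the time to failure is the sum of two exponential stages (robust to failure probable, then failure probable to failed) with different rates; the function and parameter names are hypothetical and the rate values are illustrative only.

```python
import math

def hypoexp_hazard(t, rate_to_probable, rate_to_failed):
    """Failure (hazard) rate at time t of a 2-stage hypo-exponential TTF."""
    a, b = rate_to_probable, rate_to_failed
    if a == b:
        raise ValueError("rates must differ (equal rates give an Erlang distribution)")
    # Density and survival function of the sum of two exponential stages.
    density = a * b / (b - a) * (math.exp(-a * t) - math.exp(-b * t))
    survival = (b * math.exp(-a * t) - a * math.exp(-b * t)) / (b - a)
    return density / survival

def exp_hazard(t, rate):
    """Failure rate of an exponential TTF: the same at every t."""
    return rate

if __name__ == "__main__":
    # Illustrative rates (per hour); not taken from the slides.
    for t in (0, 10, 50, 100, 200):
        print(t, hypoexp_hazard(t, 0.01, 0.05), exp_hazard(t, 0.01))
```

The hypo-exponential hazard starts at zero and rises toward the smaller of the two stage rates, which is the IFR behavior that makes preventive maintenance worthwhile.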

Analytic Models – CTMC model [Huang95] [Figure: two state diagrams. Model w/o rejuvenation: robust state, failure probable state, failed state. Model with rejuvenation: the same states plus a rejuvenation state.] • From this continuous-time Markov chain model, can find a closed-form expression for the optimal rejuvenation trigger rate (r4)
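A minimal sketch of how the steady-state measures of such a four-state chain can be computed. The transition structure is assumed from the state labels on this slide (robust to failure probable, failure probable to failed or to rejuvenation, and both failed and rejuvenation back to robust); only r4, the rejuvenation trigger rate, is named in the slide, and it corresponds to the hypothetical parameter `rejuv_trigger` here. All other names and any rate values are assumptions for illustration.

```python
def steady_state(to_probable, to_failed, repair, rejuv_trigger, rejuv_done):
    """Steady-state probabilities of the assumed 4-state rejuvenation CTMC.

    All arguments are transition rates. rejuv_trigger plays the role of r4.
    """
    # Balance equations, written relative to the failure-probable state:
    #   robust * to_probable   = probable * (to_failed + rejuv_trigger)
    #   failed * repair        = probable * to_failed
    #   rejuvenating * rejuv_done = probable * rejuv_trigger
    robust = (to_failed + rejuv_trigger) / to_probable
    failed = to_failed / repair
    rejuvenating = rejuv_trigger / rejuv_done
    total = robust + 1.0 + failed + rejuvenating
    return {
        "robust": robust / total,
        "probable": 1.0 / total,
        "failed": failed / total,
        "rejuvenating": rejuvenating / total,
    }

def downtime_fraction(**rates):
    """Long-run fraction of time spent failed or rejuvenating."""
    p = steady_state(**rates)
    return p["failed"] + p["rejuvenating"]

def expected_cost_rate(cost_failed, cost_rejuv, **rates):
    """Long-run cost per unit time, given per-unit-time costs of each down state."""
    p = steady_state(**rates)
    return cost_failed * p["failed"] + cost_rejuv * p["rejuvenating"]
```

Evaluating `downtime_fraction` or `expected_cost_rate` over a range of `rejuv_trigger` values shows how the trigger rate trades rejuvenation downtime against failure downtime for a given set of rates.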

Analytic Models – CTMC model [Huang95]

Analytic Models – Semi-Markov model [Dohi00] • Relax the assumption of exponentially distributed sojourn times (time-independent transition rates) • Hence have a semi-Markov model • Can find closed-form expression for the optimal (deterministic) time to rejuvenation trigger
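To illustrate why a deterministic trigger time can have an interior optimum, here is a sketch based on a renewal-reward argument rather than the closed-form expression the slide refers to. It assumes one cycle consists of a robust period, then a failure-probable period that ends either in failure (followed by repair) or, after a fixed delay t0, in rejuvenation. The Weibull time to failure, the grid search, and all names and values are assumptions of this sketch.

```python
import math

def availability(t0, mean_robust, mean_repair, mean_rejuv, scale, shape, steps=2000):
    """Steady-state availability when rejuvenation triggers t0 time units
    after the failure-probable state is entered (Weibull TTF from that state)."""
    def survival(t):
        return math.exp(-((t / scale) ** shape))

    # E[min(X, t0)] is the integral of the survival function over [0, t0];
    # approximate it with the trapezoid rule.
    h = t0 / steps
    inner = sum(survival(i * h) for i in range(1, steps))
    up_in_probable = h * (0.5 * survival(0.0) + inner + 0.5 * survival(t0))

    p_fail = 1.0 - survival(t0)  # failure happens before the trigger fires
    uptime = mean_robust + up_in_probable
    downtime = mean_repair * p_fail + mean_rejuv * (1.0 - p_fail)
    return uptime / (uptime + downtime)

def best_trigger_time(candidates, **params):
    """Grid search for the trigger delay with the highest availability."""
    return max(candidates, key=lambda t0: availability(t0, **params))

if __name__ == "__main__":
    # Hypothetical parameters (hours); shape > 1 gives an increasing failure rate.
    params = dict(mean_robust=240.0, mean_repair=12.0, mean_rejuv=0.5,
                  scale=1000.0, shape=2.0)
    t_star = best_trigger_time(range(10, 3001, 10), **params)
    print(t_star, availability(t_star, **params))
```

With a shape parameter above 1 and rejuvenation cheaper than repair, the best delay lies strictly inside the search range; with an exponential time to failure (shape 1) the search instead settles on an end of the range, consistent with the point that rejuvenation only pays off under IFR.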

Analytic Models – Semi-Markov model [Dohi00]

Analytic Models – MRSPN model [Garg95] • If the degraded state cannot be determined, allow the rejuvenation trigger clock to start in the robust state; we then obtain a Markov regenerative process

Analytic Models – MRSPN model [Garg95] • Optimal (deterministic) time to rejuvenation trigger is determined numerically

Cluster Systems • [Pfister] Collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself • Easier scaling to larger systems, high levels of availability/performance and low management costs • No single point of failure • Node failures transparent to users • Graceful repairs, shutdowns, upgrades [Figure: cluster system]

Rejuvenation for Cluster Systems – Motivation • Rejuvenation using the fail-over mechanisms • Long-term benefits in terms of availability/performance • Continuous operation (possibly at a degraded level) • Practically zero downtime • Less disruptive and lower overhead than unplanned outages • Transparent to user/application • Most current industry initiatives are reactive • Two approaches • Simple time-based (periodic) • Condition-based (only from the “failure-impending” state)

Rejuvenation for Cluster Systems – SRN Models • Rejuvenation using the fail-over mechanisms in a rolling fashion • Modeling using SRNs (Stochastic Reward Nets) • Analysis for 2 rejuvenation policies • Simple time-based policy • All nodes rejuvenated successively at the end of each rejuvenation interval • Condition-based policy • Nodes rejuvenated only from the “failure-probable” state • Various configurations • a/b: cluster with a nodes that can tolerate at most b individual node failures, i.e., an (a-b)-out-of-a system • Model solution • SPNP (Stochastic Petri Net Package)
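The a/b notation can be made concrete with a back-of-the-envelope calculation. This sketch assumes independent nodes with identical steady-state availability, which is a simplification: the SRN models account for fail-over, repair, and rejuvenation behavior that this ignores. The function name and the availability value are hypothetical.

```python
from math import comb

def cluster_availability(a, b, node_availability):
    """Probability that at most b of a independent nodes are down,
    i.e., that an (a-b)-out-of-a system is up."""
    p_down = 1.0 - node_availability
    return sum(
        comb(a, k) * p_down**k * node_availability**(a - k)
        for k in range(b + 1)
    )

if __name__ == "__main__":
    # Compare 8/0, 8/1, and 8/2 configurations for an assumed node availability.
    for b in (0, 1, 2):
        print(f"8/{b}", cluster_availability(8, b, 0.99))
```

Each additional tolerated failure adds one more term to the sum, so availability can only improve as b grows.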

SRN Model – Basic Cluster Model

SRN Model – Simple Time-Based Rejuvenation

Model Parameters Measures Computed

Results – Simple Time-Based Rejuvenation Effect of cost_nodefail/cost_rejuv for the 8/1 configuration • Cost of node failure is fixed • Decrease in cost ratio implies increase in cost of rejuvenation • Hence, decrease in cost ratio increases total expected cost • As rejuvenation interval increases, rejuvenation is performed less frequently • As the rejuvenation interval tends to infinity, almost no rejuvenation is performed and all the plots tend to the same value

Results – Simple Time-Based Rejuvenation Expected cost for various configurations

Recap – Analysis of Rejuvenation for Cluster Systems • Huge benefit in terms of UA and cost improvement for systems with more than one spare • Simple time-based policy better than prediction-based for some cases • Condition-based policy much better for large node repair times and low node-failure coverage • Future work • Consider other performability measures • Explore non-ideal effects of common-mode failure and node-failure coverage

Application to CMTS – Example [issre02] • Cable modem system and broadband access • Most popular & promising high speed Internet access • Success lies in the widespread HFC cable networks and the industry standard DOCSIS • High availability requirement of CMTS • Cable modem termination system (CMTS) is the most complex and crucial component of the system • Existing approaches only provide hardware redundancy • Current systems cannot achieve 5 nines availability • Our proposed approach and contributions • Propose software rejuvenation in CMTS cluster system • Construct analytic models, obtain numerical results, optimize rejuvenation parameters, and show the benefits

Basic model without rejuvenation [Figure: state diagram for PCMTS and SCMTS. Labels: HW & SW failures (both units); HW failure detection, switchover, repair, and giveback; HW failure detection and repair; SW failure detection, switchover, reboot, and giveback; SW failure detection and reboot]

Model with rejuvenation [Figure: state diagram for PCMTS and SCMTS with a timer. Labels: rejuvenation for the robust and “aged” nodes; rejuvenation, switchover, and giveback; rejuvenation; same as the basic system] • Approximate deterministic timer interval by r-stage Erlang distribution
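The Erlang approximation of a deterministic timer works because an r-stage Erlang distribution with a fixed mean becomes more concentrated around that mean as r grows. A minimal sketch of the stage parameters follows; the function name and the example interval are hypothetical.

```python
import math

def erlang_timer(interval, stages):
    """Per-stage rate, mean, and coefficient of variation of an r-stage
    Erlang distribution whose mean equals the deterministic interval."""
    stage_rate = stages / interval      # rate of each exponential stage
    mean = stages / stage_rate          # sum of r stage means
    variance = stages / stage_rate**2   # sum of r stage variances
    return stage_rate, mean, math.sqrt(variance) / mean

if __name__ == "__main__":
    for r in (1, 4, 25, 100):
        rate, mean, cv = erlang_timer(100.0, r)
        print(r, rate, mean, cv)
```

The coefficient of variation is 1/sqrt(r), so more stages give a closer approximation to a fixed interval at the price of a larger Markov state space.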

Degradation model [DSN 03,IEEETR05] • Explicitly connecting resource leaks with failure rate and hence aging

Problem Parameters • Total amount of resource: • Resource request arrival rate: • Resource release rate: • Accumulated resource leak at time t: • Number of processes in the system: • Conditional probability that system fails to honor the resource request at state k upon the arrival of new request:

Degradation Model

Degradation Model (cont.’d) • Failure rate: • Conditional probability: • Homogeneous CTMC (leakless) • Non-homogeneous (leak-present)

Degradation Analysis • Asymptotically constant failure rate (leakless)

Degradation Analysis • Monotonic degradation (leak-present)

Measurement-Based Approach • Objective • Detection and validation of aging • Periodically monitor and collect data on the attributes responsible for the “health” of the system • Quantify the effect of aging on system resources • Proposed metric – Estimated time to exhaustion • Three approaches • Time-based (workload-independent) estimation [Garg98] • Workload-based estimation [Vaidyanathan99,TDSC05] • ARMA/ARX models [Li02]
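The "estimated time to exhaustion" metric can be sketched for the time-based case as a trend estimate on a monitored resource, followed by an extrapolation to zero. The sketch below uses the median of pairwise slopes (Sen's slope) as the trend estimator; that choice, the function names, and the sample data are assumptions of this example, since the slide does not state which estimator is used.

```python
from itertools import combinations
from statistics import median

def sen_slope(times, values):
    """Median of pairwise slopes: a trend estimate that is robust to outliers."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    return median(slopes)

def time_to_exhaustion(times, free_resource):
    """Estimated time from the last sample until the free resource reaches
    zero, or None when the data shows no downward trend."""
    slope = sen_slope(times, free_resource)
    if slope >= 0:
        return None
    return free_resource[-1] / -slope

if __name__ == "__main__":
    # Hypothetical samples: hours since start, free memory in MB.
    hours = [0, 1, 2, 3, 4]
    free_mb = [100, 98, 96, 94, 92]
    print(time_to_exhaustion(hours, free_mb))
```

A workload-based estimate would additionally condition the trend on the observed workload state, which this time-only sketch does not attempt.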
