Software Rejuvenation

Vittorio Castelli, Rick Harper, Phil Heidelberger, Steve Hunter, Tom Pahel, Kalyan Vaidyanathan


Presentation Transcript


  1. Software Rejuvenation. Vittorio Castelli, Rick Harper, Phil Heidelberger, Steve Hunter, Tom Pahel, Kalyan Vaidyanathan

  2. Objectives
     • Improve system availability
       • Software-induced outages dominate hardware-induced outages
       • A top concern of most customers
       • Proactive fault management is greatly preferred, replacing unplanned outages with planned outages
       • There are many problems... we chose to attack software aging
     • Predict and avoid unplanned outages due to software aging
       • Monitor consumption of resources such as free memory, swap space, handle count, thread count, inode count, ...
       • Extrapolate resource consumption within a user-specified horizon
       • When exhaustion is predicted, produce "event" into IBM Director, which can cause alert, selective rejuvenation, cluster failover, or reboot
       • In some cases can identify which process/subsystem is the culprit
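The monitor-and-extrapolate objective above can be illustrated with a minimal sketch. It assumes a straight-line trend in the free amount of one resource and a limit of zero; the function names (`fit_line`, `hours_until_exhaustion`) and the hour-based units are hypothetical and are not taken from the agent itself.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in hours, free amount of the resource)


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares line through the samples: (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one time point")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def hours_until_exhaustion(samples: Sequence[Sample],
                           horizon_hours: float) -> Optional[float]:
    """Hours from the last sample until the fitted line reaches zero.

    Returns None when the trend is flat or rising, or when the predicted
    exhaustion lies beyond the user-specified horizon (no event needed).
    """
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None
    time_of_zero = -intercept / slope
    remaining = max(0.0, time_of_zero - samples[-1][0])
    return remaining if remaining <= horizon_hours else None


if __name__ == "__main__":
    # Free memory (arbitrary units) shrinking steadily over ten hourly samples.
    history = [(float(t), 1000.0 - 10.0 * t) for t in range(10)]
    print(hours_until_exhaustion(history, horizon_hours=168.0))
```

A non-None result is the point at which an agent of this kind would raise its event; what the event triggers (alert, failover, reboot) is left to the management layer.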

  3. Software Aging
     • Software state (OS, middleware, applications) decays with time...
       • memory leaks
       • handle leaks
       • nonterminated threads
       • unreleased file-locks
       • data corruption
     • ...resulting in Bad Things (outages, hangs, performance degradation)
     • We feel this behavior is an inevitable by-product of software industry dynamics and practice
     • Software failure prediction and state rejuvenation is a proactive technology designed to mitigate the effects of software aging
       • Predict when resource exhaustion is about to occur
       • Reset the state of the system to an initial low-resource-consumption condition

  4. Project History
     • Supported in 2000 and 2001 by PSI funding
     • xSeries, Research, University collaboration
       • xSeries architecture (Steve Hunter), RAS, Development (Tom Pahel), Marketing
       • Research: Rick Harper, Vittorio Castelli and Phil Heidelberger
       • Duke: Kalyan Vaidyanathan, Kishor Trivedi
     • Incorporated into IBM Director
       • Timed rejuvenation on NT GA Q4'99
       • Predictive rejuvenation on NT/W2K GA Q4'00 (NT includes per-process diagnosis)
       • Predictive rejuvenation and per-process diagnosis on Linux GA Q4'01
     • The market liked it

  5. Software Rejuvenation Agent: Prediction Algorithm

  6. Prediction Algorithm
     • Sampled Parameters
       • Windows agent can predict exhaustion of committed bytes, pool nonpaged bytes, pool paged bytes, logical disk bytes
       • Linux agent: swap space, disk space, inodes, file descriptors, processes
     • Sampling Technique
       • User selects exhaustion notification horizon
         • Typically should be at least several days
       • Agent sets up sliding sampling window that is 1/10 the size of the horizon
       • Agent sets up sampling rate such that 300 points lie within sampling window
         • Can perform linear prediction using 200 points, more complex predictions require 300 points
       • Historical data is saved, subject to user-selected file size limitation
     • Predictive Algorithm
       • Constructs 6 candidate fitted curves to smoothed sliding window data
         • Linear, Log, Linear/Log with 2 or 3 breakpoints
       • Selects best-fitting curve
       • Extrapolates selected curve out to exhaustion horizon
       • Generates event if extrapolated data impacts limits within horizon, and indicates how long until impact
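As a rough illustration of the sampling arithmetic and curve selection on this slide, the sketch below derives the window and sampling interval from the horizon, then picks between two candidate curves by residual error. It is a simplification under stated assumptions: only the plain linear and log candidates are fitted (the breakpoint variants and the smoothing step are omitted), "best-fitting" is taken to mean smallest sum of squared residuals, and every name (`sampling_plan`, `best_candidate`, and so on) is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class SamplingPlan:
    window_seconds: float    # sliding window: 1/10 of the horizon
    interval_seconds: float  # spacing that puts the requested points in the window


def sampling_plan(horizon_seconds: float, points_in_window: int = 300) -> SamplingPlan:
    window = horizon_seconds / 10
    return SamplingPlan(window, window / points_in_window)


def least_squares(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = intercept + slope * x; return (slope, intercept, squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_candidate(times: Sequence[float],
                   values: Sequence[float]) -> Tuple[str, Callable[[float], float]]:
    """Fit each candidate curve and return the name and predictor of the best one.

    Assumption: times are measured from the start of the window and are > 0,
    so that the log candidate is defined.
    """
    candidates = {"linear": lambda t: t, "log": math.log}
    best = None
    for name, transform in candidates.items():
        xs: List[float] = [transform(t) for t in times]
        slope, intercept, sse = least_squares(xs, values)
        if best is None or sse < best[0]:
            predictor = lambda t, s=slope, b=intercept, f=transform: b + s * f(t)
            best = (sse, name, predictor)
    return best[1], best[2]


def exhaustion_predicted(times: Sequence[float], values: Sequence[float],
                         limit: float, horizon: float) -> bool:
    """True if the selected curve reaches the limit within the horizon."""
    _, predict = best_candidate(times, values)
    return predict(times[-1] + horizon) >= limit
```

With a 10-day horizon, `sampling_plan(10 * 24 * 3600)` gives a one-day window sampled every 288 seconds. Checking only the end of the horizon in `exhaustion_predicted` is sufficient here because both candidate curves are monotonic.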

  7. Example of Algorithm Execution

  8. Diagnosis
     • Process Consumption of Nonpaged Pool Bytes:
       • SERVICES 447936 2.51%
       • WINLOGON 64992 0.36%
       • WinMgmt 57068 0.32%
       • svchost 47448 0.27%
       • explorer 45896 0.26%
       • svchost 44704 0.25%
       • CSRSS 42416 0.24%
       • LSASS 40708 0.23%
       • msdtc 35608 0.20%
       • rtvscan 34448 0.19%
     • System Module Consumption of NonPaged Pool Bytes:
       • Tag LSwi 2293760 0.124
       • Tag File 2027424 0.110
       • Tag Wdm 1705888 0.092
       • Tag MmCa 1371744 0.074
       • Tag Ntfr 1350112 0.073
       • Tag Nmdd 1048576 0.057
       • Tag NtFs 753888 0.041
       • Tag Ntfn 750080 0.041
       • Tag NDam 612608 0.033
       • Tag FSfm 541888 0.029
       • Tag Dmio 532448 0.029
     • Of a total of 35667968 Pool Nonpaged Bytes, 1318904 (3.70%) can be diagnosed to processes and 34349064 (96.30%) are consumed by system modules.
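The closing summary line of this report is simple arithmetic on the pool total and the portion attributed to processes. The sketch below reproduces it and ranks consumers by size; the function names are hypothetical, and it assumes the process total is supplied by whatever per-process accounting the agent performs.

```python
from typing import Dict, List, Tuple


def attribution_summary(total_bytes: int, process_bytes: int) -> str:
    """Split a pool total into process-attributed and system-module shares."""
    system_bytes = total_bytes - process_bytes

    def pct(part: int) -> float:
        return 100.0 * part / total_bytes

    return (
        f"Of a total of {total_bytes} Pool Nonpaged Bytes, "
        f"{process_bytes} ({pct(process_bytes):.2f}%) can be diagnosed to processes "
        f"and {system_bytes} ({pct(system_bytes):.2f}%) are consumed by system modules."
    )


def top_consumers(usage: Dict[str, int], count: int = 10) -> List[Tuple[str, int]]:
    """Largest consumers first, as in the per-process and per-tag lists."""
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    # Totals taken from the report above.
    print(attribution_summary(35667968, 1318904))
    print(top_consumers({"SERVICES": 447936, "WINLOGON": 64992, "WinMgmt": 57068}, 2))
```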

  9. Problem: False Alarms due to Temporary Surges

  10. Transition from Un-Notified to Notified: Ready to Notify

  11. Transition from Un-Notified to Notified: Notify
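The three slides above name a false-alarm problem and the states Un-Notified, Ready to Notify, and Notified, but the transcript does not preserve the transition criteria. One plausible way to suppress alarms from temporary surges is to require the exhaustion prediction to persist across several consecutive evaluations before notifying. The sketch below shows that idea only; the confirmation count and the class name `Notifier` are assumptions, not the agent's documented behavior.

```python
from enum import Enum


class State(Enum):
    UN_NOTIFIED = "un-notified"
    READY_TO_NOTIFY = "ready to notify"
    NOTIFIED = "notified"


class Notifier:
    """Debounces exhaustion predictions (assumed policy, for illustration)."""

    def __init__(self, confirmations_needed: int = 3) -> None:
        if confirmations_needed < 1:
            raise ValueError("confirmations_needed must be at least 1")
        self.confirmations_needed = confirmations_needed
        self.state = State.UN_NOTIFIED
        self._streak = 0  # consecutive evaluations that predicted exhaustion

    def update(self, exhaustion_predicted: bool) -> bool:
        """Feed one evaluation; return True only when an event should be sent."""
        if not exhaustion_predicted:
            # A surge that subsides resets everything without raising an alarm.
            self._streak = 0
            self.state = State.UN_NOTIFIED
            return False
        if self.state is State.NOTIFIED:
            return False  # already notified; do not repeat the event
        self._streak += 1
        if self._streak >= self.confirmations_needed:
            self.state = State.NOTIFIED
            return True
        self.state = State.READY_TO_NOTIFY
        return False


if __name__ == "__main__":
    notifier = Notifier(confirmations_needed=3)
    for predicted in [True, False, True, True, True, True]:
        print(predicted, notifier.update(predicted), notifier.state.value)
```

A larger confirmation count trades slower notification for fewer false alarms, which is tolerable when the horizon is several days long.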

  12. Software Rejuvenation Agent: Director Integration

  13. High Level Design
      • Console Task is used to configure SW Rejuv Options & Criteria
      • Server Task saves persistent configuration data and communicates with agent machine
      • The agent monitors OS usage of resources, projects future exhaustion, & notifies server if exhaustion is imminent.
      [Diagram labels: IBM Director Console; Director Tasks: Inventory, Events, …; Software Rejuvenation Task; IPC; Director Management Server; Topology Engine; SNMP Device; Director Clients; Cluster Servers; Rack Device; Other Add-on Topology Extensions; Director Server Tasks: Inventory, Event, Monitors, FileTransfer, Scheduler, CIM...; Software Rejuvenation Server Task (Persistent Data); DataBase; eServer Box; Director Agent; Software Rejuvenation Sub-Agent; Sub-Agents: Events, Inventory, Monitors,...; Configuration (Input/Output) Files; Plug-ins: Inventory, Monitors; Operating System, Device Drivers; Service Processor, ServeRAID]

  14. Console Task: Verify Console Installation

  15. Task Interface
      • Trend Viewer for systems with prediction
      • Schedule Filter to prevent rejuvenation on specified days
      • Drag-and-drop services for time-based rejuvenation
      • Rejuvenation Options apply to clusters only

  16. Conclusions
      • The xSeries Software Rejuvenation project only attacked a small fraction of system outage causes, yet was well received
      • Much remains...
        • Current technology based on lab testing and a priori understanding of exhaustible resources
          • A very limited class of outage causes
        • Adaptive identification of pre-outage signatures
        • Improved diagnostic resolution
          • Selective rejuvenation of offending subsystem
        • Expand to more general classes of software failures and syndromes
          • Multiparameter signatures
          • Non-extremal conditions
          • Misconfigurations
          • Event log analysis
        • Applications
          • Workload balancing
          • HW/SW fault discrimination
          • SW testing and hardening
