Software Rejuvenation
Vittorio Castelli, Rick Harper, Phil Heidelberger, Steve Hunter, Tom Pahel, Kalyan Vaidyanathan
Objectives
• Improve system availability
  • Software-induced outages dominate hardware-induced outages
  • A top concern of most customers
  • Proactive fault management is greatly preferred, replacing unplanned outages with planned outages
• There are many problems... we chose to attack software aging
• Predict and avoid unplanned outages due to software aging
  • Monitor consumption of resources such as free memory, swap space, handle count, thread count, inode count, ...
  • Extrapolate resource consumption within a user-specified horizon
  • When exhaustion is predicted, send an event to IBM Director, which can trigger an alert, selective rejuvenation, cluster failover, or reboot
  • In some cases, the agent can identify which process or subsystem is the culprit
Software Aging
• Software state (OS, middleware, applications) decays with time...
  • memory leaks
  • handle leaks
  • nonterminated threads
  • unreleased file locks
  • data corruption
• ...resulting in Bad Things (outages, hangs, performance degradation)
• We feel this behavior is an inevitable by-product of software industry dynamics and practice
• Software failure prediction and state rejuvenation is a proactive technology designed to mitigate the effects of software aging
  • Predict when resource exhaustion is about to occur
  • Reset the state of the system to an initial low-resource-consumption condition
Project History
• Supported in 2000 and 2001 by PSI funding
• xSeries, Research, University collaboration
  • xSeries architecture (Steve Hunter), RAS, Development (Tom Pahel), Marketing
  • Research: Rick Harper, Vittorio Castelli, and Phil Heidelberger
  • Duke: Kalyan Vaidyanathan, Kishor Trivedi
• Incorporated into IBM Director
  • Timed rejuvenation on NT GA Q4'99
  • Predictive rejuvenation on NT/W2K GA Q4'00 (NT includes per-process diagnosis)
  • Predictive rejuvenation and per-process diagnosis on Linux GA Q4'01
• The market liked it
Software Rejuvenation Agent Prediction Algorithm
Prediction Algorithm
• Sampled Parameters
  • Windows agent can predict exhaustion of committed bytes, pool nonpaged bytes, pool paged bytes, logical disk bytes
  • Linux agent: swap space, disk space, inodes, file descriptors, processes
• Sampling Technique
  • User selects exhaustion notification horizon
    • Typically should be at least several days
  • Agent sets up a sliding sampling window that is 1/10 the size of the horizon
  • Agent sets the sampling rate so that 300 points lie within the sampling window
    • Linear prediction can be performed with 200 points; more complex predictions require 300 points
  • Historical data is saved, subject to a user-selected file size limit
• Predictive Algorithm
  • Constructs 6 candidate fitted curves to the smoothed sliding-window data
    • Linear, Log, Linear/Log with 2 or 3 breakpoints
  • Selects the best-fitting curve
  • Extrapolates the selected curve out to the exhaustion horizon
  • Generates an event if the extrapolated data reaches the resource limit within the horizon, and indicates how long until that happens
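The sampling and extrapolation steps on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the agent's actual code: the function names are hypothetical, "Log" is assumed to mean a fit of the form a + b·ln(t), only two of the six candidate curves are built (the breakpoint variants and the smoothing step are left out), and the best fit is assumed to be the one with the smallest sum of squared errors.

```python
import math
from typing import Callable, List, Optional, Tuple

def sampling_plan(horizon_s: float, points: int = 300) -> Tuple[float, float]:
    """Window is 1/10 of the horizon; interval places `points` samples in it."""
    window_s = horizon_s / 10.0
    return window_s, window_s / points

def least_squares(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_candidates(ts: List[float], ys: List[float]):
    """Two illustrative candidates (assumed forms); all ts must be > 0."""
    a1, b1 = least_squares(ts, ys)
    a2, b2 = least_squares([math.log(t) for t in ts], ys)
    return [
        ("linear", lambda t: a1 + b1 * t),
        ("log", lambda t: a2 + b2 * math.log(t)),
    ]

def best_fit(ts: List[float], ys: List[float]) -> Tuple[str, Callable[[float], float]]:
    """Pick the candidate with the smallest sum of squared errors (assumed criterion)."""
    def sse(f: Callable[[float], float]) -> float:
        return sum((f(t) - y) ** 2 for t, y in zip(ts, ys))
    return min(fit_candidates(ts, ys), key=lambda c: sse(c[1]))

def time_to_exhaustion(ts: List[float], ys: List[float], limit: float,
                       horizon_s: float, step_s: float) -> Optional[float]:
    """Seconds from the last sample until the fitted curve reaches `limit`,
    or None if that does not happen within the horizon."""
    _, curve = best_fit(ts, ys)
    t0 = ts[-1]
    for k in range(int(horizon_s // step_s) + 1):
        if curve(t0 + k * step_s) >= limit:
            return k * step_s
    return None

# Usage: a 10-day horizon and a synthetic, steadily growing resource.
horizon = 10 * 24 * 3600.0
window, interval = sampling_plan(horizon)
ts = [interval * k for k in range(1, 301)]
ys = [1000.0 + 2.0 * t for t in ts]
remaining = time_to_exhaustion(ts, ys, limit=1_000_000.0,
                               horizon_s=horizon, step_s=interval)
if remaining is not None:
    print(f"exhaustion predicted in about {remaining / 3600:.1f} hours")
```

Scanning the curve in fixed steps, rather than solving for the crossing point analytically, keeps the same code path usable for any candidate curve, at the cost of a resolution limited to one sampling interval.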
Diagnosis
• Process Consumption of Nonpaged Pool Bytes:
  • SERVICES   447936   2.51%
  • WINLOGON    64992   0.36%
  • WinMgmt     57068   0.32%
  • svchost     47448   0.27%
  • explorer    45896   0.26%
  • svchost     44704   0.25%
  • CSRSS       42416   0.24%
  • LSASS       40708   0.23%
  • msdtc       35608   0.20%
  • rtvscan     34448   0.19%
• System Module Consumption of NonPaged Pool Bytes:
  • Tag LSwi   2293760   0.124
  • Tag File   2027424   0.110
  • Tag Wdm    1705888   0.092
  • Tag MmCa   1371744   0.074
  • Tag Ntfr   1350112   0.073
  • Tag Nmdd   1048576   0.057
  • Tag NtFs    753888   0.041
  • Tag Ntfn    750080   0.041
  • Tag NDam    612608   0.033
  • Tag FSfm    541888   0.029
  • Tag Dmio    532448   0.029
• Of a total of 35667968 Pool Nonpaged Bytes, 1318904 (3.70%) can be diagnosed to processes and 34349064 (96.30%) are consumed by system modules.
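The closing summary line of the diagnosis is a simple split of the pool total into a process-attributed part and a system-module remainder. A minimal sketch of that arithmetic, using the totals from the slide (the function name is hypothetical, and this is not the agent's code):

```python
from typing import Tuple

def split_pool(total_bytes: int, process_bytes: int) -> Tuple[int, float, float]:
    """Return (system_bytes, process_pct, system_pct) for a pool total."""
    system_bytes = total_bytes - process_bytes
    process_pct = 100.0 * process_bytes / total_bytes
    system_pct = 100.0 * system_bytes / total_bytes
    return system_bytes, process_pct, system_pct

# Totals taken from the slide's summary line.
system_bytes, process_pct, system_pct = split_pool(35667968, 1318904)
print(f"{process_pct:.2f}% diagnosed to processes, "
      f"{system_pct:.2f}% ({system_bytes} bytes) consumed by system modules")
```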
Software Rejuvenation Agent Director Integration
High Level Design
• Console Task is used to configure software rejuvenation options and criteria
• Server Task saves persistent configuration data and communicates with the agent machine
• The agent monitors OS usage of resources, projects future exhaustion, and notifies the server if exhaustion is imminent
[Diagram labels: IBM Director Console (Director Tasks: Inventory, Events, …; Software Rejuvenation Task); IPC; Director Management Server (Topology Engine; SNMP Device, Director Clients, Cluster Servers, Rack Device, Other Add-on Topology Extensions; Director Server Tasks: Inventory, Event, Monitors, FileTransfer, Scheduler, CIM...; Software Rejuvenation Server Task (Persistent Data); DataBase); IPC; eServer Box (Director Agent; Software Rejuvenation Sub-Agent; Sub-Agents: Events, Inventory, Monitors,...; Configuration (Input/Output) Files; Plug-ins: Inventory, Monitors; Operating System, Device Drivers; Service Processor, ServeRAID)]
Console Task Verify Console Installation:
Task Interface
• Trend Viewer for systems with prediction
• Schedule Filter to prevent rejuvenation on specified days
• Drag-and-drop services for time-based rejuvenation
• Rejuvenation Options apply to clusters only
Conclusions
• The xSeries Software Rejuvenation project only attacked a small fraction of system outage causes, yet was well received
• Much remains...
  • Current technology based on lab testing and a priori understanding of exhaustible resources
    • A very limited class of outage causes
  • Adaptive identification of pre-outage signatures
  • Improved diagnostic resolution
  • Selective rejuvenation of offending subsystem
  • Expand to more general classes of software failures and syndromes
    • Multiparameter signatures
    • Non-extremal conditions
    • Misconfigurations
    • Event log analysis
  • Applications
    • Workload balancing
    • HW/SW fault discrimination
    • SW testing and hardening