Mainframe Global and Workload Levels Statistical Exception Detection System, Based on MASF
Igor Trubin, Ph.D. and Linwood Merritt
Capital One Services, Inc.
igor.trubin@capitalone.com
Page 1
Introduction: Environment
• Capital One
• 6th largest card issuer in the United States
• Added to the S&P 500 in 1998
• Fortune 500 company since 2000
• Managed loans: $71.8 billion
• Accounts: 46.7 million
• CIO 100 Award “Master of the Customer Connection”
• Information Week “Innovation 100” Award winner
• ComputerWorld “Top 100 places to work in IT”
Page 2
Statistical Analysis of Mainframe Performance Data
• SEDS (Statistical Exception Detection System) is based on the Multivariate Adaptive Statistical Filtering (MASF) technique.
• SEDS automatically scans large volumes of performance data and identifies measurements that differ significantly from their expected values.
• MASF is an extension of Statistical Process Control (or Quality Control), which was developed by Walter Shewhart of Bell Telephone Laboratories in the 1920s.
• The MASF procedure was designed by BGS Systems, Inc. and presented at CMG in 1995.
• SEDS was developed by this author and presented as the best paper at CMG 2002.
Page 3
Review of the Existing Tools
• SAS/QC (Quality Control)
• JMP from SAS
• BEZsystems for Oracle and Teradata
• Concord eHealth: DFN (Deviation From Normal)
• The Patrol Perform and Predict tool from BMC Software
The common output is control charts for monitoring variations in a process under statistical control.
Page 4
SEDS Structure
• Exception detectors for the most important metrics
• SEDS database with the history of exceptions
• Statistical process control daily profile chart generator
• Exception server name list generator
• Leader/Outsider servers/workload detector and detector of defective (runaway) processes
• Leaders/Outsiders bar chart generator
Page 5
CPU Utilization Control Chart for Web Report
The full “7 days x 24 hours” adaptive filtering policy is applied to calculate the average, upper, and lower statistical limits of a particular metric for each hour of each weekday, using data from the past six months.
Page 6
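A minimal sketch of how a 7 x 24 policy of this kind could be computed, assuming hourly samples tagged with weekday and hour, and assuming limits of the average plus or minus k standard deviations. The function names, the sample layout and the value k = 3 are illustrative assumptions and are not taken from SEDS itself:

```python
from collections import defaultdict
from statistics import mean, pstdev


def masf_limits(history, k=3.0):
    """Group historical samples by (weekday, hour) and compute control limits.

    history: iterable of (weekday, hour, value), weekday 0..6, hour 0..23.
    Returns {(weekday, hour): (lower, average, upper)}.
    k, the number of standard deviations, is an assumed parameter.
    """
    groups = defaultdict(list)
    for weekday, hour, value in history:
        groups[(weekday, hour)].append(value)

    limits = {}
    for slot, values in groups.items():
        avg = mean(values)
        sd = pstdev(values)  # population standard deviation within the slot
        limits[slot] = (avg - k * sd, avg, avg + k * sd)
    return limits


def find_exceptions(day, limits):
    """Return (weekday, hour, value, kind) for samples outside their slot's limits."""
    found = []
    for weekday, hour, value in day:
        if (weekday, hour) not in limits:
            continue  # no history for this slot, so nothing to compare against
        lower, _, upper = limits[(weekday, hour)]
        if value > upper:
            found.append((weekday, hour, value, "upper"))
        elif value < lower:
            found.append((weekday, hour, value, "lower"))
    return found


# Hypothetical CPU utilization history (%) for Monday (weekday 0) at 10:00.
history = [(0, 10, 40.0), (0, 10, 60.0)]
limits = masf_limits(history)
print(find_exceptions([(0, 10, 85.0), (0, 10, 50.0)], limits))
```

Each of the 168 weekday/hour slots gets its own limits, so a value that is normal on Monday morning can still be flagged at the same hour on Sunday.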
SEDS against Unisys and Tandem Platforms Performance Data
SEDS works with hourly or daily performance data. The schemas of the “day” tables in ITRM for the Unisys and Tandem platforms are shown here. Good candidates for use in SEDS are marked in red.
Page 7
Examples of Captured Exceptions for Unisys and Tandem
The Unisys server had unusually low utilization, which might indicate disk or database performance problems. The Tandem server, in contrast, had two unusual spikes of CPU utilization that crossed the upper limit.
Page 8
Global Performance Data for MVS Platform
The schemas of the “day” tables in ITRM for MVS platforms are shown in the table. Good candidates for use in SEDS are marked in red.
A set of nightly batch jobs:
• dumps the remaining active accounting data,
• consolidates the data,
• processes the data in SAS, and
• updates the ITRM PDB.
Page 9
Examples of Captured Exceptions for One of the Logical Partitions (LPAR)
Because this chart shows the utilization of one LPAR within a shared system rather than of the entire system, 100% is not a true threshold. SEDS instead provides a more accurate and dynamic threshold, which is a statistical one.
Page 10
BMC Visualizer MASF vs. SEDS
BMC Visualizer can be used to find other exceptions based on other filtering policies. For that, the BMC collector needs to be installed on the server, and BMC Visualizer must be used manually to capture any MASF exceptions. SEDS is preferable as an automated MASF chart generator. In addition, SEDS can automatically notify a performance analyst when a statistical exception occurs.
BMC Visualizer example: the System Hierarchy (spectrum) and control charts
Page 11
Application Level SEDS for MVS Platform
One problem is that LPAR-level data alone cannot show which particular workloads are responsible for an exception. Looking at a stacked workload data chart, it is difficult to find the application that is responsible for spikes in overall CPU usage. However, the data collection process provides application-level data across all LPARs. SEDS shows that Appl1 was responsible for the global maxima in the overall MIPS chart.
Page 12
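The drill-down from a global exception to the responsible workload can be sketched as follows, assuming that per-workload actual values and per-workload upper limits are already available for the hour in question. The function name, the workload names other than Appl1, and the MIPS figures are hypothetical:

```python
def workloads_over_limit(actual_by_workload, upper_by_workload):
    """Return (workload, excess) pairs for workloads above their own upper limit.

    Both arguments map a workload name to a value for the same hour.
    Results are sorted with the largest excess first.
    """
    excess = {}
    for name, actual in actual_by_workload.items():
        upper = upper_by_workload.get(name)
        if upper is not None and actual > upper:
            excess[name] = actual - upper
    return sorted(excess.items(), key=lambda item: item[1], reverse=True)


# Hypothetical MIPS per application during the hour of a global exception.
actual = {"Appl1": 900, "Appl2": 300, "Appl3": 120}
upper = {"Appl1": 700, "Appl2": 350, "Appl3": 150}
print(workloads_over_limit(actual, upper))
```

Comparing each workload with its own limit, rather than with its share of the stacked total, is what lets a single application stand out even when the stacked chart is hard to read.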
Other Reasons to Generate a Workload Control Chart
1. To capture unusual behavior of a relatively small application that was not big enough to create a global exception.
2. To prove the stable behavior of any essential or critical application.
Page 13
Service Class/Period Type of Metrics under SEDS
- Hourly sum of ended transaction count: TRANS
- Hourly sum of the average response per transaction: RESP (it shows values consistently larger than the average)
- Hourly sum of elapsed tasks duration: CPUsec (not always reported correctly for long-running servers)
ElapsedSec = (number of tasks) * 3600 seconds
Page 14
Performance Status Automatic Recognition, Web Report and E-mail Notification
The number of applications or Service Classes with exceptions
• Green in the Web table indicates no exceptions.
• Magenta indicates that the exceptions only crossed the lower limit.
• Yellow means an exception occurred on a particular server or LPAR.
• (NUP - NLOW) is the severity or type of the exceptions, shown under the link to a MASF chart, where NUP is the number of upper limit exceptions and NLOW is the number of lower limit exceptions during the previous day.
Page 15
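A minimal sketch of how NUP, NLOW and the table color could be derived from one day of hourly data. The function names are hypothetical, and the rule that any upper limit exception yields yellow is an assumption drawn from the color descriptions above:

```python
def count_exceptions(values, lower, upper):
    """Count upper and lower limit exceptions over parallel hourly sequences.

    Returns (nup, nlow): hours above the upper limit, hours below the lower limit.
    """
    nup = sum(1 for v, u in zip(values, upper) if v > u)
    nlow = sum(1 for v, lo in zip(values, lower) if v < lo)
    return nup, nlow


def status_color(nup, nlow):
    """Map exception counts to a table color (assumed mapping, see lead-in)."""
    if nup == 0 and nlow == 0:
        return "green"    # no exceptions at all
    if nup == 0:
        return "magenta"  # only the lower limit was crossed
    return "yellow"       # at least one upper limit exception (assumption)


# Hypothetical three-hour example with constant limits.
nup, nlow = count_exceptions([10, 50, 90], [20, 20, 20], [80, 80, 80])
print(nup, nlow, status_color(nup, nlow))
```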
Exception Database and “Extra Volume” Metric
ExtraVolume is a numeric estimate of the exception magnitude. For CPU utilization it is an ExtraTime.
• It is calculated as the area between the limit curve and the actual data curve (for periods when exceptions occurred).
• For CPU metrics, the physical meaning is the CPU time (or MIPS) that the server has taken beyond the statistical limit.
The SEDS database keeps the history of exceptions and has the following structure:
Page 17
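The area calculation can be sketched as a sum of rectangles over the sampling intervals. This is an illustration under stated assumptions: the function name and the `interval_hours` parameter are hypothetical, and the signed convention (positive above the upper limit, negative below the lower limit) is inferred from the slide's later mention of positive and negative ExtraVolume values:

```python
def extra_volume(actual, lower, upper, interval_hours=1.0):
    """Approximate the area between the data curve and the crossed limit curve.

    actual, lower, upper: parallel sequences, one value per sampling interval.
    Intervals inside the limits contribute nothing.
    """
    total = 0.0
    for value, lo, up in zip(actual, lower, upper):
        if value > up:
            total += (value - up) * interval_hours  # area above the upper limit
        elif value < lo:
            total += (value - lo) * interval_hours  # negative area below the lower limit
    return total


# Hypothetical hourly CPU utilization (%) against constant limits of 20% and 80%.
print(extra_volume([50, 90, 95, 10], [20] * 4, [80] * 4))
```

With utilization in percent and hourly samples, the result is in percent-hours; dividing by 100 converts it to hours of one fully busy CPU.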
TOP LPAR Leaders/Outsiders Charts
• The system automatically calculates ExtraTime for the last day and records it in the SEDS database.
• This data is used for publishing Leaders/Outsiders bar charts for the last day, last week and last month.
If a server showed a positive ExtraVolume for the previous day, more capacity was used on the server than in the past. If the server showed a negative ExtraVolume metric, less capacity was used than usual (not necessarily a good thing).
Page 18
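A minimal sketch of how the data behind such a bar chart could be selected from recorded ExtraVolume totals. Treating servers with positive totals as leaders and servers with negative totals as outsiders is an assumption, and the function name, the `top_n` parameter and the LPAR names are hypothetical:

```python
def leaders_and_outsiders(extra_volume_by_server, top_n=5):
    """Split servers into top leaders (positive) and top outsiders (negative).

    extra_volume_by_server: {server name: summed ExtraVolume for the period}.
    Returns (leaders, outsiders), each a list of (server, value) pairs
    ordered from the most extreme value toward zero.
    """
    ranked = sorted(extra_volume_by_server.items(), key=lambda item: item[1], reverse=True)
    leaders = [(name, value) for name, value in ranked if value > 0][:top_n]
    outsiders = [(name, value) for name, value in reversed(ranked) if value < 0][:top_n]
    return leaders, outsiders


# Hypothetical ExtraVolume totals for the last day.
totals = {"LPAR1": 12.5, "LPAR2": -4.0, "LPAR3": 0.0, "LPAR4": 30.0, "LPAR5": -9.0}
leaders, outsiders = leaders_and_outsiders(totals, top_n=2)
print(leaders)
print(outsiders)
```

The same function can serve the weekly and monthly charts if the input totals are summed over the longer period first.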
SUMMARY
• Statistical techniques can be used to automatically detect and report exceptions in resource utilization and service levels.
• The authors’ site previously used MASF techniques to track global and application level CPU, disk and memory exceptions for a large number of UNIX and WINTEL servers.
• The workload level analysis enabled the authors’ site to expand the scope of this process to encompass large mainframe class servers.
• Although the analysis of global exceptions at an LPAR level has limited value for a system that shares workloads across logical systems, a workload-oriented system allows for quick detection of exceptions and immediate drill-down capabilities for the Capacity Planner and Performance Analyst.
• The authors recommend that the reader evaluate and understand any built-in statistical processes within his/her product set and consider developing ways to notify appropriate analysts when exceptions occur.
Page 19
References
• Trubin, Igor, Ph.D. and Mclaughlin, Kevin, “Exception Detection System, Based on the Statistical Process Control Concept,” Proceedings of the Computer Measurement Group, 2001
• Trubin, Igor, Ph.D., “Global and Application level Exception Detection System, Based on the MASF Technique,” Proceedings of the Computer Measurement Group, 2002
Thanks!
Igor Trubin
IT Capacity Planning, Capital One Services, Inc.
igor.trubin@capitalone.com
Page 20