The Mathematics of Performance Management and Capacity Planning - Overview
Descriptive and Predictive Analytics in the Age of Virtual Systems
Tim Browning
Presented at the Greater Atlanta Computer Measurement Group Fall Conference, October 22, 2008
On Mathematics & Statistics There are two kinds of statistics, the kind you look up and the kind you make up. ~Rex Stout, Death of a Doxy How many times can you subtract 7 from 83, and what is left afterwards? You can subtract it as many times as you want, and it leaves 76 every time. ~Author Unknown In ancient times, they had no statistics, so they had to fall back on lies. ~Stephen B. Leacock
Goals: Performance Engineering and Capacity Management
• Goals of Performance Engineering
– Monitor/Manage/Predict System Performance
– Reflect and Understand Customer Experience
– Foundation of evidence-based Capacity Management
• Goals of Capacity Management
– Assure Computing Supply is available to Meet Business Demand
– Determine Best use of existing resources (optimization)
Probability, Probity and Authority • Before the seventeenth century, legal evidence in Europe was considered of greater weight if a person testifying had “probity”. “Empirical evidence” was barely a concept. Probity was a measure of authority, so evidence came from authority. A noble person had probity. Yet today, probability is the very measure of the weight of empirical evidence in science, arrived at from inductive or statistical inference. • The term 'probable' (Latin probabilis) meant approvable, and was applied in that sense, to opinion and to action. A probable action or opinion was one such as sensible people would undertake or hold, in the circumstances. • Even so, the jury of executive opinion, in the business-government Enterprise, is most often swayed by the consensus of expert opinion, usually at considerable cost.
Probability and Statistics are not the same - They are related, but circuitously related: • Probability can be viewed either as the long-run frequency of occurrence or as a measure of the plausibility of an event given incomplete knowledge - but not both. • Statistics are functions of the observations (data) that often have useful and even surprising properties. • So we see the relationship(s) between probability and statistics: • From the observations we compute statistics that we use to estimate population parameters, which index the probability density, from which we can compute the probability of a future observation from that density. • In general, probability asks what is likely to happen and statistics describes what has already happened (and forms the basis for what is likely) • In statistics, you don’t know how a process works but are able to observe the outcomes; in probability you already know how a process works but want to know how to predict what will happen. The combination is the foundation of statistical inference.
Descriptive Statistics are used to describe the basic features of the data gathered from an experimental study in various ways. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
Two objectives for formulating a summary statistic:
• To choose a statistic that shows how different units seem similar. Statistical textbooks call one solution to this objective a measure of central tendency.
• To choose another statistic that shows how they differ. This kind of statistic is often called a measure of statistical variability.
“Central Tendency”
Central – middle value, center. Tendency – expected value, most frequent, representative.
Arithmetic Mean
The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the count of numbers. The symbol μ is used for the mean of a population and the symbol M for the mean of a sample. The formula for μ is:
μ = ΣX / N
where ΣX is the sum of all the numbers in the sample and N is the number of numbers in the sample. As an example, the mean of the numbers 1, 2, 3, 6, 8 is (1+2+3+6+8)/5 = 20/5 = 4, regardless of whether the numbers constitute the entire population or just a sample from the population.
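A minimal sketch of this calculation in Python, using the sample values from the example above:

values = [1, 2, 3, 6, 8]              # the example numbers from the text
mean = sum(values) / len(values)      # arithmetic mean: sum divided by the count
print(mean)                           # 4.0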
Other, less common measures of central tendency:
• Median is the middle value – the point where half the values lie on each side of the number, i.e. half are larger and half are smaller; the ‘middle’ of the distribution of values. The number separating the higher half of a sample, a population, or a probability distribution from the lower half. If you divide a distribution into 4ths (quartiles), then the median is the 2nd quartile.
– Useful in performance management in the presence of outliers, where we are more concerned about frequency of occurrence relative to a ‘central’ value than a theoretical ‘average’ that may not even occur in the data. For example, response time.
• Percentiles group data by putting equal numbers of data points into each group. The nth percentile is the point below which n% of the data are found.
– Useful in performance as it provides a very good view of the user’s experience.
– Useful in capacity planning for ‘sizing’ a system based on accommodation of its historical high points. For example, the 90th percentile of CPU busy.
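As an illustration (the hourly CPU-busy samples below are hypothetical), the median and 90th percentile can be computed like this:

import statistics

cpu_busy = [35, 42, 38, 55, 61, 47, 90, 52, 44, 58]   # hypothetical hourly CPU-busy samples (%)
median = statistics.median(cpu_busy)                   # half the hours are above this value, half below
p90 = statistics.quantiles(cpu_busy, n=10)[-1]         # 90th percentile: a 'sustained busy' sizing point
print(median, p90)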
When to use the arithmetic mean:
• When your data contains no outliers (extreme values that are not typical or normative).
• When the variability between values is low, for example in utilization metrics when the variability is less than 20%.
What can you do about outliers (dirty data)?
• Eliminate them (i.e. they are few and unlikely to recur).
• Use a weighted mean that discounts the outliers. The weighted mean is similar to an arithmetic mean (the most common type of average), but instead of each of the data points contributing equally to the final average, some data points contribute more than others.
• Use the Geometric Mean, which is remarkably insensitive to outliers.
The Dirty Data Experiment with the Weighted Mean
With 19 data points, equal weighting would give each point a weight of 1/19. The experiment discounts the outlier’s weight by 20% and redistributes that weight evenly across the other 18 points:
• weight of the outlier: (1/19) − (1/19)×0.2
• weight of each of the other 18 points: (1/19) + ((1/19)×0.2)/18
A convex combination is a linear combination of points (which can be vectors, scalars, etc.) where all coefficients are non-negative and sum up to 1; the weights above form one.
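A minimal sketch of that weighted mean in Python, assuming 19 hypothetical observations of which the last is the outlier:

data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 12, 10, 11, 9, 10, 11, 10, 12, 95]  # hypothetical; 95 is the outlier
n = len(data)                                   # 19 observations
w_outlier = (1/n) - (1/n) * 0.2                 # outlier weight, discounted by 20%
w_other = (1/n) + ((1/n) * 0.2) / (n - 1)       # the removed weight is spread over the other 18 points
weights = [w_other] * (n - 1) + [w_outlier]
print(sum(weights))                             # 1.0, a convex combination
weighted_mean = sum(w * x for w, x in zip(weights, data))
print(weighted_mean, sum(data) / n)             # weighted mean vs. ordinary arithmetic mean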
“There are liars, outliers, and out-and-out liars.”
• What are ‘outliers’?
– Extreme values not typical of the group
– “Rare events” that do not fit within the range of other data values
– Non-normative data: anomalous, exceptional, etc.
• How are they detected?
– Visually, using statistical graphics
– Statistical filtering
– Interquartile fencing – values falling below the lower fence or above the upper fence (commonly the quartiles extended by 1.5×IQR)
– More advanced methods: Grubbs’ Test, etc.
There is no such thing as a simple test!
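A minimal sketch of interquartile fencing, assuming the common 1.5×IQR fences (the slide does not specify a fence multiplier) and hypothetical data:

import statistics

data = [35, 42, 38, 55, 61, 47, 90, 52, 44, 58, 250]    # hypothetical samples; 250 looks suspicious
q1, _, q3 = statistics.quantiles(data, n=4)              # lower quartile, median, upper quartile
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)                                           # values flagged for review, not automatic deletion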
The Geometric Mean
• Instead of adding the set of numbers and then dividing the sum by the count of numbers in the set, n, the numbers are multiplied and then the nth root of the resulting product is taken.
• For instance, the geometric mean of two numbers, say 2 and 8, is just the square root (i.e., the second root) of their product, 16, which is 4. As another example, the geometric mean of 1, ½, and ¼ is the cube root (i.e., the third root) of their product (0.125), which is ½.
In SQL-eese: SELECT EXP(AVG(LN(Response_Time))) as GEOMEAN FROM the_data WHERE Response_Time > 0
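The same calculation as a quick Python check (exponentiating the mean of the logs, which is equivalent to taking the nth root of the product):

import math

values = [2, 8]                                                       # example values from the text
geomean = math.exp(sum(math.log(x) for x in values) / len(values))   # exp(mean(ln(x)))
print(geomean)                                                        # 4.0, the square root of 2*8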
The ‘geometry’ part of the Geometric Mean: consider a ‘line’ where the beginning is at point A and the end is at point B; where is the ‘middle’ point C? [Diagram of points A, C, B omitted.]
Measures of variability
• Variance – the amount of ‘spread’ in the data around the mean.
• Standard Deviation – the square root of the variance. In a normal distribution, approximately 2/3 of the data are within one standard deviation of the mean on either side.
In performance work, a large response-time standard deviation is usually bad; you want it to be low and repeatable. Wide variations upset users more than long but consistent response times.
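A minimal sketch of both measures on hypothetical response times (in seconds):

import statistics

response_times = [1.2, 1.4, 1.1, 1.3, 4.8, 1.2, 1.5]     # hypothetical response times (seconds)
variance = statistics.pvariance(response_times)           # population variance: spread around the mean
std_dev = statistics.pstdev(response_times)               # standard deviation: square root of the variance
print(round(variance, 3), round(std_dev, 3))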
The Geometric Standard Deviation
• The antilog (exponential) of the standard deviation of the natural-log-transformed values of x.
In SQL-eese: SELECT EXP(STDDEV(LN(Response_Time))) as GEOSTDEV FROM the_data WHERE Response_Time > 0
Correlation and Regression
• Correlation – how things vary together (or not); the strength and direction of a linear relationship between two random variables, or the departure of two variables from independence.
• There are several correlation coefficients, Pearson’s being the most common in performance analysis (though arguably mis-named, since the idea predates Pearson).
• Probably the most misused statistical tool.
• Obtained by dividing the covariance of the two variables by the product of their standard deviations.
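A minimal sketch of that definition (hypothetical workload vs. CPU data), computing Pearson’s r as the covariance divided by the product of the standard deviations:

import statistics

x = [100, 200, 300, 400, 500]                 # hypothetical transactions per hour
y = [12, 22, 35, 41, 55]                      # hypothetical CPU-busy (%)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)   # sample covariance
r = cov / (statistics.stdev(x) * statistics.stdev(y))                          # Pearson correlation
print(round(r, 4))                            # close to 1: a strong positive linear relationship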
Linear Regression and its cousins (non-linear, multiple, logistic, etc.) are all methods for fitting curves or lines to data in a statistically optimal manner. “The best way of drawing a line since the invention of the straight edge” – Pat Artis.
• Often used by managers to observe ‘trends’ and predict the future (or explain the past). Often misused for the same purpose.
• In statistics, linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modeled by a least-squares function, called the linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line. The results are subject to statistical analysis.
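A minimal sketch of a one-variable least-squares fit, using the closed-form normal equations and hypothetical workload vs. CPU data:

x = [100, 200, 300, 400, 500, 600]       # hypothetical workload (transactions per hour)
y = [15, 24, 31, 42, 50, 58]             # hypothetical CPU-busy (%)
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x
print(intercept, slope)                   # the fitted line y = intercept + slope * x
print(intercept + slope * 700)            # predicted CPU-busy at a workload of 700 (an extrapolation)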
Linear regression in Excel: Using Graphical techniques
Examples of Capacity/Performance Reporting in use now: traditional time-series line charts. [Chart examples omitted.]
Advanced Statistical Graphics: 3-D performance surface, multi-temporal density plot, expected high/low vs. actual. [Chart examples omitted.]
Application Response Time Modeling. [Chart: application response time (y-axis) vs. increasing application workload (x-axis); at low load, large workload changes have small impact on response time, while near saturation small changes have large impact, until the system becomes unresponsive.]
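The shape of that curve can be illustrated with a simple queueing approximation. This is an assumption for illustration only; the slide does not name a specific model. For an M/M/1-style server, response time R = S / (1 − U), where S is service time and U is utilization:

# Illustrative only: an M/M/1-style approximation, not the author's model.
service_time = 0.1                                  # hypothetical service time in seconds
for utilization in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
    response = service_time / (1 - utilization)     # R = S / (1 - U)
    print(f"U={utilization:.2f}  R={response:.2f}s")
# R stays near the service time at low utilization but grows without bound as U approaches 1:
# going from 95% to 99% busy multiplies the response time by five.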
How does Modeling differ from Trending in prediction? Application Modeling vs. Linear Regression via Trending. [Chart: application response time vs. application workload, with an SLA threshold line and monthly system load measurements (Jan–Dec); the date at which the SLA threshold is predicted to be crossed via trending differs from the date predicted via modeling.]
Response to a Capacity/Performance Crisis:
1. System/Application tuning, re-engineering, and optimization:
• Benefits: Considerable merit is sometimes obtained, with improvements in the hundreds of percent. Achieved via system administrative action (usually parametric changes to the OS) and by algorithmic and parametric re-specification (for the application). No capital expense. Efficient use of resources.
• Detriments: The effects may not be enduring for dynamic systems, as version/release changes and application functionality changes can, and do, degrade performance tuning effects quickly. Often system reinitialization (reboot, IPL) is required, which creates an availability/service delivery issue. Application re-engineering for performance may be, and often is, cost prohibitive and/or unsupported by executive management.
2. Capacity increase via upgrade/replacement or technology refresh:
• Benefits: Reduces risk of unsupported/unrecoverable infrastructure conditions. The effect is usually long term. Accommodates increased application functionality for business utility.
• Detriments: Capital expense may be incurred. Inefficiencies remain. Risk management to avoid undersizing or oversizing requires expensive predictive modeling tools. Predictive analytics requires advanced skills in technical staffing. Risks are associated with new technologies, which may increase complexity (e.g. virtualization). Costs may be unsupported by executive management.
Modeling? Why? Reactive Problem Solving vs Modeling • damage grows rapidly with time; • the longer the error goes undiscovered, the more useless and damaging work based on the error will be done; • when the error is discovered, it and all the associated damage has to be removed; • the system will then need therapy to recover • the death rate increases dramatically with late discovery • alternatively, the survival rate increases dramatically with early discovery "Crude measures of the right things are better than precise measures of the wrong things." - from Jim Clemmer's article, "Strategic Measurements Guide Change and Improvement"
Predictive Analytics: Benefits
• Predictive analytics provide a practical way to detect problems and allow early correction, as well as to avoid resource saturation conditions.
• Simulation provides a practical way to detect such problems and allow early correction. Avoiding the use of simulation substantially increases the risk of failure.
• Analytical modeling provides fast and accurate answers based on existing performance data. It allows a variety of what-if scenarios to be easily crafted to determine the best course of action when systems are experiencing change.
• Statistical forecasting and analysis provides descriptive and predictive views of the IT performance data topology through the use of measures of central tendency, variability, correlation, linear regression, and statistical pattern recognition.
SAP-specific Capacity Planning Methodology for CCE
• We want to acquire capacity to provide required service levels for sustained busy periods. Typical examples:
– Month-end closing
– Busy daily window (e.g., 09:00 to 11:00)
– Mondays
– Completing the batch window on time to deliver operational reports or to schedule deliveries/shipments/print picking papers/etc.
• The best approach is to choose the percentile you want to satisfy:
– The 90th percentile of hourly MIPS across the month is reflective of busy daily periods.
– Likewise, the 95th percentile reflects the sustained busy period where there is a pronounced financial-systems month-end closing effect.
– In legacy OLTP we often see peak-to-average ratios between 1.5:1 and 2:1, depending on the definition of peak (e.g., 90th vs. 95th percentile).
– This really is a view of sustained busy.
– No one can afford to buy for absolute peaks (the 99th or 100th percentile).
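A minimal sketch of this percentile-based sizing, using a month of hypothetical hourly MIPS samples:

import random
import statistics

random.seed(1)
hourly_mips = [random.gauss(400, 80) for _ in range(30 * 24)]   # hypothetical hourly MIPS for one month
cuts = statistics.quantiles(hourly_mips, n=100)                  # 99 percentile cut points
p90, p95 = cuts[89], cuts[94]                                    # 90th and 95th percentiles
avg = statistics.mean(hourly_mips)
print(round(p90), round(p95))                                    # candidate 'sustained busy' sizing points
print(round(p95 / avg, 2))                                       # peak-to-average ratio at the 95th percentile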
Capacity Planning for the Newly Virtual Three Essential Elements • measurement to ascertain critical data like IT resource availability, utilization and usage patterns • second-level analysis to focus on the long-term needs of the enterprise rather than the immediate concern to bump up resources • business realignment to ensure that IT is keeping pace with business needs, not the other way around
Capacity Planning for the Newly Virtual
• Over half (54%) of virtual-server adopters have experienced a net growth in capacity, while only 7% reported a net decrease (ESG Research).
• Focus on understanding our “virtualization” factors:
– Effect of non-concurrent peaks of multiple workloads
– Follow-the-sun behavior in a global operation
– Better understanding of these effects can be gained by looking at the 90th/95th percentiles
• Landscape dimensions: a workload level, a platform (processor complex) level, a Sysplex/Cluster level, a server/LPAR level, etc.
• The ‘virtualization’ analysis will tell us how much we can over-commit resources:
– Compare the 95th percentile of the sums vs. the sum of the 95th percentiles.
– It is often the case that we can load to 115% of capacity as measured by the sum of the 95th percentiles.
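A minimal sketch of that over-commit comparison, using four hypothetical workloads with non-concurrent peaks:

import random
import statistics

def p95(samples):
    return statistics.quantiles(samples, n=100)[94]    # 95th percentile

random.seed(2)
# Four hypothetical workloads, each measured as hourly CPU% over a month.
workloads = [[random.gauss(30, 10) for _ in range(30 * 24)] for _ in range(4)]
sum_of_p95s = sum(p95(w) for w in workloads)                  # size each workload for its own peak
p95_of_sums = p95([sum(hour) for hour in zip(*workloads)])    # size the combined, shared platform
print(round(sum_of_p95s), round(p95_of_sums))                 # the shared figure is the smaller one
print(round(sum_of_p95s / p95_of_sums, 2))                    # over-commit headroom ratio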
Organizational Support Institutionalize the process • The resource reporting and modeling is actually the easy part of this • The more difficult and more important part of institutionalizing the process is connecting the application blueprinting/design process to the capacity planning process: • This creates the understanding of the business drivers which is key to scaling factors and calibration • This is also a potential trigger for alerting the organization to the need for a risk mitigation plan. For example, step function workload increases with new workloads which should lead to a performance testing activity
Organizational Support for Capacity Planning Market the lesser-known benefits of capacity planning • Strengthened relationships with developers and end users. Communication, negotiation, and a sense of joint ownership can all combine to nurture a healthy, professional relationship between IT and its customers • Improved communications with suppliers. Involving key suppliers and support staffs with your capacity plans can promote effective communications among these groups • Increased collaboration with other infrastructure groups. Network services, technical support, database administration, operations, desktop support, and even facilities may all play a role in capacity planning. In order for the plan to be thorough and effective, all these various groups must support and collaborate with each other. • Promotion of a culture of strategic planning as opposed to tactical firefighting. One of the most significant benefits of developing an overall and ongoing capacity-planning program is the institutionalizing of a strategic-planning culture