Extreme Metrics Analysis for Fun and Profit Paul Below
Agenda • Statistical Thinking • Metrics Use: Reporting and Analysis • Measuring Process Improvement • Surveys and Sampling • Organizational Measures
Agenda Statistical Thinking “Experiments should be reproducible. They should all fail in the same way.”
Statistical Thinking • You already use it, at home and at work • We generalize in everyday thinking • Often, our generalizations or predictions are wrong
Uses for Statistics • Summarize our experiences so others can understand • Use information to make predictions or estimates • Goal is to do this more precisely than we would in everyday conversation
Listen for Questions • We are not used to using numbers in our professional lives • “What does this mean?” • “What should we do with this?” • We need to take advantage of our past experience
Statistical Thinking is more important than methods or technology. Analysis is iterative, not one shot. [Diagram: an iterative cycle in which Data leads by Induction to a Model, Deduction from the Model guides new Data, and Learning accumulates with each pass.] (Modification of the Shewhart/Deming cycle by George Box, 2000 Deming lecture, Statistics for Discovery)
Agenda Metrics Use: Reporting and Analysis "It ain't so much the things we don't know that get us in trouble. It's the things we know that ain't so." Artemus Ward, 19th Century American Humorist
Purpose of Metrics • The purpose of metrics is to take action. All types of analysis and reporting have the same high-level goal: to provide information to people who will act upon that information and thereby benefit. • Metrics offer a means to describe an activity in a quantitative form that would allow a knowledgeable person to make rational decisions. However, • Good statistical inference on bad data is no help. • Bad statistical analysis, even on the right variable, is still bad statistics.
Therefore… • Metrics use requires implemented processes for: • metrics collection, • reporting requirements determination, • metrics analysis, and • metrics reporting.
Types of Metrics Use “You go to your tailor for a suit of clothes and the first thing that he does is make some measurements; you go to your physician because you are ill and the first thing he does is make some measurements. The objects of making measurements in these two cases are different. They typify the two general objects of making measurements. They are: (a) To obtain quantitative information (b) To obtain a causal explanation of observed phenomena.” Walter Shewhart
The Four Types of Analysis • Ad hoc: Answer specific questions, usually in a short time frame. Example: Sales support • Reporting: Generate predefined output (graphs, tables) and publish or disseminate to defined audience, either on demand or on regular schedule. • Analysis: Use statistics and statistical thinking to investigate questions and reach conclusions. The questions are usually analytical (e.g., “Why?” or “How many will there be?”) in nature. • Data Mining: Data mining starts with data definition and cleansing, followed by automated knowledge extraction from historical data. Finally, analysis and expert review of the results is required.
Body of Knowledge (suggestions) • Reporting • Database query languages, distributed databases, query tools, graphical techniques, OLAP, Six Sigma Green Belt (or Black Belt), Goal-Question-Metric • Analysis • Statistics and statistical thinking, graphical techniques, database query languages, Six Sigma Black Belt, CSQE, CSQA • Data Mining • Data mining, OLAP, data warehousing, statistics
Analysis Decision Tree • Type of question? Enumerative or Analytical • Enumerative → One time? Yes: Ad hoc; No: Reporting • Analytical → Factors analyzed? Few: Analysis; Many: Data Mining and Analysis
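As a minimal sketch (not from the original deck), the decision tree can be written as a small Python function; the function and parameter names are illustrative, and the mapping simply follows the branches above.

```python
def choose_analysis_type(question_type: str, one_time: bool = True,
                         many_factors: bool = False) -> str:
    """Walk the decision tree above to pick one of the four types of analysis."""
    if question_type == "enumerative":      # counting questions: "how many?", "how much?"
        return "Ad hoc" if one_time else "Reporting"
    if question_type == "analytical":       # "why?", "how many will there be?"
        return "Data Mining and Analysis" if many_factors else "Analysis"
    raise ValueError("question_type must be 'enumerative' or 'analytical'")

print(choose_analysis_type("enumerative", one_time=False))    # Reporting
print(choose_analysis_type("analytical", many_factors=True))  # Data Mining and Analysis
```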
Extreme Analysis • Short deadlines, small releases • Overall high level purposes defined up front, prior to analysis start • Specific questions prioritized prior to analysis start • Iterative approach with frequent stakeholder reviews to obtain interim feedback and new direction • Peer synergy, metrics analysts work in pairs. • Advanced query and analysis tools, saved work can be reused in future engagements • Data warehousing techniques, combining data from multiple sources where possible • Data cleansing done prior to analysis start (as much as possible) • Collective ownership of the results
Extreme Analysis Tips Produce clean graphs and tables that display important information; these can be used by various people for multiple purposes. Explanations should be clear, and the organization should make it easy to find information of interest. However, it takes too long to analyze everything -- we cannot expect to produce interpretations for every graph we produce, and even when we do, the results are superficial because we don't have time to dig into everything. "Special analysis", where we focus on one topic at a time and study it in depth, is a good idea: we can complete it in a reasonable time, and the result should be something of use to the audience. Therefore, ongoing feedback from the audience is crucial to obtaining useful results.
Agenda Measuring Process Improvement “Is there any way that the data can show improvement when things aren’t improving?” -- Robert Grady
Measuring Process Improvement • Analysis can determine whether a perceived difference could be attributed to random variation • Inferential techniques are commonly used in other fields; we have used them in software engineering for years • This is an overview, not a training class
Expand our Set of Techniques Metrics are used for: • Benchmarking • Process improvement • Prediction and trend analysis • Business decisions • …all of which require confidence analysis!
Is This a Meaningful Difference? [Chart: Relative Performance (0 to 2.0) by CMM Maturity Level (1 to 3)]
Pressure to Produce Results • Why doesn’t the data show improvement? • “Take another sample!” • Good inference on bad data is no help “If you torture the data long enough, it will confess.” -- Ronald Coase
Types of Studies: Anecdote → Case Study → Quasi-Experiment → Experiment • Anecdote: “I heard it worked once”, cargo cult mentality • Case Study: some internal validity • Quasi-Experiment: can demonstrate external validity • Experiment: can be repeated, needs to be carefully designed and controlled
Attributes of Experiments (subject → treatment → reaction) • Random Assignment • Blocked and Unblocked • Single Factor and Multi-Factor • Census or Sample • Double Blind • When you really have to prove causation (can be expensive)
Limitations of Retrospective Studies • No pretest; we use previous data from similar past projects • No random assignment possible • No control group • Cannot custom design metrics (have to use what you have)
Quasi-Experimental Designs • There are many variations • Common theme is to increase internal validity through reasonable comparisons between groups • Useful when formal experiment is not possible • Can address some limitations of retrospective studies
Causation in Absence of Experiment • Strength and consistency of the association • Temporal relationship • Non-spuriousness • Theoretical adequacy
What Should We Look For? Are the Conclusions Warranted? • Some information to accompany claims: • measure of variation • sample size • confidence intervals • data collection methods used • sources • analysis methods
Decision Without Analysis • Conclusions may be wrong or misleading • Observed effects tend to be unexplainable • Statistics allows us to make honest, verifiable conclusions from data
Two Techniques We Use Frequently • Inference for difference between two means • Works for quantitative variables • Compute confidence interval for the difference between the means • Inference for two-way tables • Works for categorical variables • Compare actual and expected counts
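The following sketch illustrates both techniques with SciPy. The productivity samples, project types, and counts are invented for illustration and are not taken from the deck; it shows a Welch-style 95% confidence interval for a difference of means and a chi-square test on a two-way table.

```python
import numpy as np
from scipy import stats

# --- Inference for the difference between two means (quantitative variable) ---
# Hypothetical productivity samples in FP/PM for two project types.
type_a = np.array([18.2, 22.5, 19.8, 25.1, 21.0, 23.4, 20.2, 24.7])
type_b = np.array([ 9.6, 12.1, 10.4, 13.8, 11.2, 12.9, 10.9, 13.3])

diff = type_a.mean() - type_b.mean()
va, vb = type_a.var(ddof=1), type_b.var(ddof=1)
na, nb = len(type_a), len(type_b)
se = np.sqrt(va / na + vb / nb)                    # standard error of the difference
dof = (va / na + vb / nb) ** 2 / (
    (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))   # Welch-Satterthwaite
t_crit = stats.t.ppf(0.975, dof)
print(f"95% CI for difference in means: "
      f"{diff - t_crit * se:.1f} to {diff + t_crit * se:.1f} FP/PM")

# --- Inference for a two-way table (categorical variables) ---
# Hypothetical counts: rows are project types, columns are completed vs. cancelled.
table = np.array([[40, 10],
                  [55,  5],
                  [30, 20]])
chi2, p, table_dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, p-value = {p:.4f}  (actual vs. expected counts)")
```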
Quantitative Variables Comparison of the means of quartiles 2 and 4 yields a p-value of 88.2% (not a significant difference at the 95% level)
Categorical Variables P value is approximately 50%
Categorical Variables P value is greater than 99.9%
Expressing the Results “in English” • “We are 95% certain that the difference in average productivity for these two project types is between 11 and 21 FP/PM.” • “Some project types have a greater likelihood of cancellation than other types; we would be unlikely to see these results by chance.”
What if... • Current data is insufficient • An experiment cannot be done • Direct observation or 100% collection cannot be done • ...or lower level information is needed?
Agenda Surveys and Samples In a scientific survey every person in the population has some known positive probability of being selected.
What is a Survey? • A way to gather information about a population from a sample of that population • Varying purposes • Different ways: • telephone • mail • internet • in person
What is a Sample? • Representative fraction of the population • Random selection • Can reliably project to the larger population
What is a Margin of Error? • An estimate from a survey is unlikely to exactly equal the quantity of interest • Sampling error means results differ from the target population due to the “luck of the draw” • Margin of error depends on sample size and sample design
What Makes a Sample Unrepresentative? • Subjective or arbitrary selection • Respondents are volunteers • Questionable intent
How Large Should the Sample Be? • What do you want to learn? • How reliable must the result be? • Size of the population is not important • 1500 people is reliable enough for the entire U.S. • How large CAN it be?
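As a rough illustration (assuming a simple random sample, the worst-case proportion p = 0.5, and a 95% z of 1.96), a short calculation shows why roughly 1,500 respondents is enough no matter how large the population is:

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a simple random sample proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)    # worst case at p = 0.5

for n in (100, 400, 1500, 10000):
    print(f"n = {n:5d}:  +/- {margin_of_error(n) * 100:.1f} percentage points")
# n = 1500 gives roughly +/- 2.5 points -- whether the population is one state
# or the entire U.S., because population size barely affects the margin of error.
```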
“Dewey Defeats Truman” • Prominent example of a poorly conceived survey • 1948 pre-election poll • Main flaw: non-representative sample • 2000 election: methods not modified to new situation
Is a Flawed Sample the Only Type of Problem? • Non-response • Measurement difficulties • Design problems, leading questions • Analysis problems
Some Remedies • Stratify sample • Adjust for incomplete coverage • Maximize response rate • Test questions for • clarity • objectivity • Train interviewers
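The first remedy, stratifying the sample, can be sketched as follows; this is a minimal illustration, and the DataFrame and column names are hypothetical rather than part of the original deck.

```python
import pandas as pd

def stratified_sample(frame: pd.DataFrame, stratum_col: str,
                      frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction from every stratum, so no group is missed by chance."""
    return (frame.groupby(stratum_col, group_keys=False)
                 .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# Usage (hypothetical): 'projects' has a 'project_type' column identifying the strata.
# sample = stratified_sample(projects, "project_type", frac=0.10)
```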
Agenda Organizational Measures “Whether measurement is intended to motivate or to provide information, or both, turns out to be very important.” -- Robert Austin
Dysfunctional Measures • Disconnect between measure and goal • Can one get worse while the other gets better? • Is one measure used for two incompatible goals? • The two general types of measurement are...
Measurement in Organizations • Motivational Measurements • intended to affect the people being measured, to provoke greater expenditure of effort in pursuit of the organization’s goals • Informational Measurements • logistical, status, or research information that provides insight for short-term management and long-term improvement
Informational Measurements • Process Refinement Measurements • reveal the detailed structure of processes • Coordination Measurements • serve a logistical purpose
Mixed Measurements The desire to be viewed favorably provides an incentive for people being measured to tailor, supplement, repackage, or censor information that flows upward. • “Dashboard” concept is incomplete • We have Gremlins