From myths and fashions to evidence-based software engineering Magne Jørgensen
Most of the methods below have once been (some still are) fashionable ... • The Waterfall model, the sashimi model, agile development, rapid application development (RAD), unified process (UP), lean development, modified waterfall model, spiral model development, iterative and incremental development, evolutionary development (EVO), feature driven development (FDD), design to cost, 4 cycle of control (4CC) framework, design to tools, reuse-based development, rapid prototyping, timebox development, joint application development (JAD), adaptive software development, dynamic systems development method (DSDM), extreme programming (XP), pragmatic programming, scrum, test driven development (TDD), model-driven development, agile unified process, behavior driven development, code and fix, design driven development, V-model-based development, solution delivery, cleanroom development, ...
Short men are more aggressive (The Napoleon complex)
There was (is?) a software crisis (page 13 of the Standish Group's 1994 report): “We then called and mailed a number of confidential surveys to a random sample of top IT executives, asking them to share failure stories.”
45% of the features of “traditional projects” are never used (source: The Standish Group, XP 2002). No one seems to know (and the Standish Group does not tell) anything about this study! Why do so many believe (and use) this non-interpretable, non-validated claim? They benefit from it (the agile community) + confirmation bias (we all know at least one instance that fits the claim).
14% of Waterfall and 42% of Agile projects are successful (source: The Standish Group, Chaos Manifesto 2012). Successful = “on cost, on schedule, and with specified functionality”. Can you spot a serious error in this comparison?
The ease of creating myths: Are risk-willing or risk-averse developers better? • Study design: research evidence + a self-generated argument. • Question: Based on your experience, do you think that risk-willing programmers are better than risk-averse programmers? 1 (totally agree) – 5 (no difference) – 10 (totally disagree). • Group A: average 3.3 initially, 3.5 at debriefing, 3.5 two weeks later. • Group B: average 5.4 initially, 5.0 at debriefing, 4.9 two weeks later. • Neutral group: average 5.0.
“I see it when I believe it” vs. “I believe it when I see it” • 26 experienced software managers • Different preferences on contract type: fixed price or per hour • Clients tended to prefer fixed price, while providers were more in favor of per hour • Presentation of a data set of 16 projects with information about contract type and project outcome (client benefits and cost-efficiency of the development work) • Result: a chi-square test of independence gives p = 0.01, i.e., how the managers interpreted the same data was associated with their prior contract-type preference.
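As a minimal sketch of what such a test looks like, the snippet below runs a chi-square test of independence between prior contract-type preference and the conclusion drawn from the data. The counts are hypothetical; the slide does not give the study's contingency table.

```python
# Hypothetical illustration of the reported test: chi-square test of independence
# between a manager's prior contract-type preference and which contract type the
# manager concluded performed best in the presented 16-project data set.
# The counts below are made up for this sketch; the slide does not provide them.
from scipy.stats import chi2_contingency

observed = [
    [10, 3],  # prefers fixed price: concluded "fixed price better", "per hour better"
    [4, 9],   # prefers per hour:    concluded "fixed price better", "per hour better"
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# A small p-value indicates that interpretation of the same data is associated
# with prior preference: "I see it when I believe it".
```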
Bias among researchers … [Figure: distribution of effect sizes, where effect size = MMRE_analogy − MMRE_regression; the axis ends are labelled “regression-based models better” and “analogy-based models better”]
Development of one's own analogy-based model (vested interests) [Figure: the same effect-size scale (effect size = MMRE_analogy − MMRE_regression), with the axis ends labelled “regression-based models better” and “analogy-based models better”]
How many results are incorrect? The effect of low power, researcher bias and publication bias
• 1000 statistical tests: 500 true relationships and 500 false relationships • Statistical power is 30% → 150 true positives (green) • Significance level is 5% → 25 false positives (red) • Proportion of expected statistically significant results: (150 + 25)/1000 = 17.5% • Correct test results: (150 + 475)/1000 = 62.5% • Correct positive tests: 150/(150 + 25) = 85.7% (i.e., the probability of the null hypothesis being true when p < 0.05 is about 14.3%, not 5%)
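The arithmetic behind these numbers is easy to reproduce; a minimal sketch in plain Python, using only the assumptions stated above (1000 tests, half of the investigated relationships true, 30% power, 5% significance level):

```python
# A minimal sketch of the arithmetic on this slide.
n_tests, share_true = 1000, 0.5
power, alpha = 0.30, 0.05

n_true = n_tests * share_true            # 500 true relationships
n_false = n_tests - n_true               # 500 false relationships
true_pos = power * n_true                # 150 (green)
false_pos = alpha * n_false              # 25 (red)
true_neg = n_false - false_pos           # 475
significant = true_pos + false_pos       # 175

print("stat. significant results:", significant / n_tests)        # 0.175
print("correct test results:", (true_pos + true_neg) / n_tests)   # 0.625
print("correct positive tests:", true_pos / significant)          # ~0.857
print("P(null true | p < 0.05):", false_pos / significant)        # ~0.143
```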
We observe about 50% of tests with p < 0.05 in published SE experiments • We should expect 17.5% • At most 30%, even if we only tested true relationships • Likely explanation: researcher and publication bias
The effect of adding 20% researcher bias and 30% publication bias
• 1000 statistical tests: 500 true relationships and 500 false relationships • Statistical power is 30% → 150 true positives (green) • Significance level is 5% → 25 false positives (red) • Researcher bias of 20% → 70 more true positives and 95 more false positives (blue) • Publication bias of 30% → removes 78 negative tests on the true-relationship side and 114 negative tests on the false-relationship side • Result: 42% positive tests • Correct test results: 61% (just above half of the tests) • Correct positive tests: 65%, i.e., one third of the reported positive tests are incorrect!
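A sketch of the extended calculation, assuming researcher bias turns 20% of the negative results into positives and publication bias then removes 30% of the remaining negative results; small differences from the slide's exact counts may stem from the order in which the two biases are applied:

```python
# Same arithmetic as above, extended with researcher bias and publication bias.
# A sketch of the slide's model, not an exact reproduction of its figure.
power, alpha = 0.30, 0.05
researcher_bias, publication_bias = 0.20, 0.30
n_true = n_false = 500

# 500 true relationships
tp = power * n_true                      # 150 true positives
fn = n_true - tp                         # 350 negatives
tp += researcher_bias * fn               # +70 true positives from researcher bias
fn *= 1 - researcher_bias                # 280 negatives left
fn *= 1 - publication_bias               # published false negatives

# 500 false relationships
fp = alpha * n_false                     # 25 false positives
tn = n_false - fp                        # 475 negatives
fp += researcher_bias * tn               # +95 false positives from researcher bias
tn *= 1 - researcher_bias                # 380 negatives left
tn *= 1 - publication_bias               # published true negatives

published = tp + fp + fn + tn
print("positive tests:", (tp + fp) / published)        # ~42%
print("correct test results:", (tp + tn) / published)  # ~61%
print("correct positive tests:", tp / (tp + fp))       # ~65%
print("incorrect positive tests:", fp / (tp + fp))     # ~35%, about one third
```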
Low proportion of correct results! We need to improve statistical research practices in software engineering! In particular, we need to increase statistical power (larger sample sizes).
Have you heard about the assumption of Fixed variables?
Illustration: Salary discrimination? • Assume an IT company that: has 100 different tasks it wants completed and for each task hires one male and one female employee (200 workers); the “base salary” of a task varies (randomly) from 50,000 to 60,000 USD and is the same for the male and the female employee; the actual salary is the “base salary” plus a random, gender-independent bonus, drawn with a “lucky wheel” with numbers (bonuses) between 0 and 10,000. • This should lead to (on average): salary of female = salary of male. • Let's do a regression analysis with the model: salary of female = a + b·(salary of male), where b < 1 would mean that women are discriminated against. • The regression analysis gives b = 0.56. Strong discrimination of women!? • Let's repeat the analysis on the same data with the model: salary of male = a′ + b′·(salary of female). • The regression analysis gives b′ = 0.56. Strong discrimination of men???? • A small simulation sketch after the figure below illustrates why both slopes fall below 1.
[Two scatter plots from the example: salary of women plotted against salary of men, and salary of men plotted against salary of women]
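Below is a minimal simulation sketch of the setup described above, assuming the salaries are generated exactly as on the slide (variable names are illustrative). It shows that the ordinary-least-squares slope comes out around 0.5 in both regression directions, even though neither gender is paid less on average; the predictor is a random variable, not a fixed one.

```python
# Simulation of the salary example: the salary process is identical for men and
# women, yet regressing one salary on the other gives a slope well below 1 in
# BOTH directions, because the predictor is itself a random variable (the
# "fixed variables" assumption of regression is violated).
import numpy as np

rng = np.random.default_rng(1)
n = 100
base = rng.uniform(50_000, 60_000, n)         # task "base salary", same for both
salary_m = base + rng.uniform(0, 10_000, n)   # base + random, gender-independent bonus
salary_f = base + rng.uniform(0, 10_000, n)

def ols_slope(x, y):
    """Ordinary-least-squares slope b in y = a + b*x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print("salary_f on salary_m:", ols_slope(salary_m, salary_f))  # roughly 0.5, not 1
print("salary_m on salary_f:", ols_slope(salary_f, salary_m))  # also roughly 0.5
# Theoretical slope: Var(base) / (Var(base) + Var(bonus)) = 0.5 here, since the
# base salary and the bonus both have a range of 10,000.
```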
How would you interpret these data? (from a published study) • CR duration = actual duration (effort) to complete a change request • Interpretation by the author of the paper: larger tasks are more under-estimated.
What about these data? They are from exactly the same data set! The only difference is the use of the estimated instead of the actual duration as the task-size variable.
Economy of scale? Probably not ... (M. Jørgensen and B. Kitchenham. Interpretation problems related to the use of regression models to decide on economy of scale in software development, Journal of Systems and Software, 85(11):2494-2503, 2012.)
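To see how such a flip can arise from random estimation error alone, here is a small simulation sketch (hypothetical data, not the data from the cited study): there is no real relationship between task size and estimation accuracy, yet relative estimation error appears positively related to actual duration and negatively related to estimated duration.

```python
# Hypothetical simulation: estimation error is purely random and unrelated to task
# size, yet the apparent relationship between relative estimation error and "size"
# flips with the choice of size measure (actual vs. estimated duration).
import numpy as np

rng = np.random.default_rng(7)
n = 500
true_size = rng.lognormal(mean=3.0, sigma=0.6, size=n)    # hypothetical CR sizes
estimated = true_size * rng.lognormal(0.0, 0.4, size=n)   # noisy estimates
actual = true_size * rng.lognormal(0.0, 0.4, size=n)      # noisy actual durations

rel_error = (actual - estimated) / actual                 # > 0 means under-estimated

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print("error vs actual duration:   ", corr(actual, rel_error))     # clearly positive
print("error vs estimated duration:", corr(estimated, rel_error))  # clearly negative
# Plotted against actual duration, large tasks look under-estimated; plotted against
# estimated duration, large tasks look over-estimated -- same data, same errors.
```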
Evidence-based software engineering (EBSE) The main steps of EBSE are as follows: • Convert a relevant problem or need for information into an answerable question. • Search the literature and practice-based experience for the best available evidence to answer the question. (+ create own local evidence, if needed) • Critically appraise the evidence for its validity, impact, and applicability. • Integrate the appraised evidence with practical experience and the client's values and circumstances to make decisions about practice. • Evaluate performance in comparison with previous performance and seek ways to improve it.
The software industry should learn to formulate questions that are meaningful for its context/challenge/problem. The question “Is Agile better than Traditional methods?” is NOT answerable: • What is agile? • What is traditional? • What is better? • What is the context?
Learn to be more critical (myth busting) when claims are made • Find out what is meant by the claim. • Is it possible to falsify the claim? If not, what is the function of the claim? • Put yourself in a “critical mode”: be aware of the tendency to accept claims, even without valid evidence, when we agree with them or they seem intuitively correct. • Reflect on what you would consider valid evidence for the claim. • Are there vested interests? • Do you agree because of the source? • Collect and evaluate evidence: research-based, practice-based, and “own” evidence. • Synthesize the evidence and conclude (if possible).
Learn how to evaluate argumentation • The elements of an argument: claim, data, warrant, backing, qualifier, reservation
Learn how to use Google Scholar (or similar sources of research-based evidence)
Learn how to collect and evaluate practice-based experience • Use methods similar to those used for evaluating research-based evidence and claims • Be aware of “organizational over-learning”
Learn how to create local evidence • Experimentation is simpler than you think • Pilot studies • Trial-sourcing • Controlled experiments
Is it realistic to achieve an evidence-based software engineering profession? • Yes, but there are challenges. • Main challenges: not much research; a high number of different contexts; much research has low reliability, which is sometimes hard to identify. • Opportunities: more and better use of practice-based evidence; more experimentation in local contexts.