From myths and fashions to evidence-based software engineering Magne Jørgensen
Most of the methods below have once been (some still are) fashionable ... • The Waterfall model, the sashimi model, agile development, rapid application development (RAD), unified process (UP), lean development, modified waterfall model, spiral model development, iterative and incremental development, evolutionary development (EVO), feature driven development (FDD), design to cost, 4 cycle of control (4CC) framework, design to tools, reuse-based development, rapid prototyping, timebox development, joint application development (JAD), adaptive software development, dynamic systems development method (DSDM), extreme programming (XP), pragmatic programming, scrum, test driven development (TDD), model-driven development, agile unified process, behavior driven development, code and fix, design driven development, V-model-based development, solution delivery, cleanroom development, ...
Short men are more aggressive (The Napoleon complex)
There was (is?) a software crisis (page 13 of the Standish Group's 1994 report): “We then called and mailed a number of confidential surveys to a random sample of top IT executives, asking them to share failure stories.”
45% of the features of “traditional projects” are never used (source: The Standish Group, XP 2002). No one seems to know (and the Standish Group does not tell) anything about this study! Why do so many believe (and use) this non-interpretable, non-validated claim? They benefit from it (the agile community) + confirmation bias (we all know at least one instance that fits the claim).
14% of Waterfall and 42% of Agile projects are successful (source: The Standish Group, Chaos Manifesto 2012). Successful = “on cost, on schedule, and with specified functionality”. Can you spot a serious error in this comparison?
The ease of creating myths: Are risk-willing or risk-averse developers better? • Study design: research evidence + a self-generated argument. • Question: Based on your experience, do you think that risk-willing programmers are better than risk-averse programmers? 1 (totally agree) – 5 (no difference) – 10 (totally disagree). • Group A: average 3.3 initially, 3.5 at debriefing, 3.5 two weeks later. • Group B: average 5.4 initially, 5.0 at debriefing, 4.9 two weeks later. • Neutral group: average 5.0.
“I see it when I believe it” vs. “I believe it when I see it” • 26 experienced software managers • Different preferences on contract type: fixed price or per hour • Clients tended to prefer fixed price, while providers were more in favor of per hour • Presentation of a data set of 16 projects with information about contract type and project outcome (client benefits and cost-efficiency of the development work) • Result: a chi-square test of independence gives p = 0.01, i.e., how the managers interpreted the same data was associated with their prior contract-type preference.
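As a minimal sketch of what such a test looks like, the snippet below runs a chi-square test of independence between prior contract-type preference and the conclusion drawn from the data. The counts are hypothetical; the slide does not give the study's contingency table.

```python
# Hypothetical illustration of the reported test: chi-square test of independence
# between a manager's prior contract-type preference and which contract type the
# manager concluded performed best in the presented 16-project data set.
# The counts below are made up for this sketch; the slide does not provide them.
from scipy.stats import chi2_contingency

observed = [
    [10, 3],  # prefers fixed price: concluded "fixed price better", "per hour better"
    [4, 9],   # prefers per hour:    concluded "fixed price better", "per hour better"
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# A small p-value indicates that interpretation of the same data is associated
# with prior preference: "I see it when I believe it".
```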
Bias among researchers … [Figure: distribution of effect sizes, where effect size = MMRE_analogy − MMRE_regression; the axis ends are labelled “regression-based models better” and “analogy-based models better”]
Development of one's own analogy-based model (vested interests) [Figure: the same effect-size scale (effect size = MMRE_analogy − MMRE_regression), with the axis ends labelled “regression-based models better” and “analogy-based models better”]
How many results are incorrect? The effect of low power, researcher bias and publication bias
• 1000 statistical tests: 500 true relationships and 500 false relationships • Statistical power is 30% → 150 true positives (green) • Significance level is 5% → 25 false positives (red) • Proportion of expected statistically significant results: (150 + 25)/1000 = 17.5% • Correct test results: (150 + 475)/1000 = 62.5% • Correct positive tests: 150/(150 + 25) = 85.7% (i.e., the probability of the null hypothesis being true when p < 0.05 is about 14.3%, not 5%)
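The arithmetic behind these numbers is easy to reproduce; a minimal sketch in plain Python, using only the assumptions stated above (1000 tests, half of the investigated relationships true, 30% power, 5% significance level):

```python
# A minimal sketch of the arithmetic on this slide.
n_tests, share_true = 1000, 0.5
power, alpha = 0.30, 0.05

n_true = n_tests * share_true            # 500 true relationships
n_false = n_tests - n_true               # 500 false relationships
true_pos = power * n_true                # 150 (green)
false_pos = alpha * n_false              # 25 (red)
true_neg = n_false - false_pos           # 475
significant = true_pos + false_pos       # 175

print("stat. significant results:", significant / n_tests)        # 0.175
print("correct test results:", (true_pos + true_neg) / n_tests)   # 0.625
print("correct positive tests:", true_pos / significant)          # ~0.857
print("P(null true | p < 0.05):", false_pos / significant)        # ~0.143
```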
We observe about 50% of tests with p < 0.05 in published SE experiments • We should expect 17.5% • At most 30%, even if we only tested true relationships • Likely explanation: researcher and publication bias
The effect of adding 20% researcher bias and 30% publication bias
• 1000 statistical tests: 500 true relationships and 500 false relationships • Statistical power is 30% → 150 true positives (green) • Significance level is 5% → 25 false positives (red) • Researcher bias of 20% → 70 more true positives and 95 more false positives (blue) • Publication bias of 30% → removes 78 negative tests on the true-relationship side and 114 negative tests on the false-relationship side • Result: 42% positive tests • Correct test results: 61% (just above half of the tests) • Correct positive tests: 65%, i.e., one third of the reported positive tests are incorrect!
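A sketch of the extended calculation, assuming researcher bias turns 20% of the negative results into positives and publication bias then removes 30% of the remaining negative results; small differences from the slide's exact counts may stem from the order in which the two biases are applied:

```python
# Same arithmetic as above, extended with researcher bias and publication bias.
# A sketch of the slide's model, not an exact reproduction of its figure.
power, alpha = 0.30, 0.05
researcher_bias, publication_bias = 0.20, 0.30
n_true = n_false = 500

# 500 true relationships
tp = power * n_true                      # 150 true positives
fn = n_true - tp                         # 350 negatives
tp += researcher_bias * fn               # +70 true positives from researcher bias
fn *= 1 - researcher_bias                # 280 negatives left
fn *= 1 - publication_bias               # published false negatives

# 500 false relationships
fp = alpha * n_false                     # 25 false positives
tn = n_false - fp                        # 475 negatives
fp += researcher_bias * tn               # +95 false positives from researcher bias
tn *= 1 - researcher_bias                # 380 negatives left
tn *= 1 - publication_bias               # published true negatives

published = tp + fp + fn + tn
print("positive tests:", (tp + fp) / published)        # ~42%
print("correct test results:", (tp + tn) / published)  # ~61%
print("correct positive tests:", tp / (tp + fp))       # ~65%
print("incorrect positive tests:", fp / (tp + fp))     # ~35%, about one third
```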
Low proportion of correct results! We need to improve statistical research practices in software engineering! In particular, we need to increase statistical power (larger sample sizes).
Have you heard about the assumption of Fixed variables?
Illustration: Salary discrimination? • Assume an IT company that: has 100 different tasks it wants completed and for each task hires one male and one female employee (200 workers); the “base salary” of a task varies (randomly) from 50,000 to 60,000 USD and is the same for the male and the female employee; the actual salary is the “base salary” plus a random, gender-independent bonus, drawn with a “lucky wheel” with numbers (bonuses) between 0 and 10,000. • This should lead to (on average): salary of female = salary of male. • Let's do a regression analysis with the model: salary of female = a + b·(salary of male), where b < 1 would mean that women are discriminated against. • The regression analysis gives b = 0.56. Strong discrimination of women!? • Let's repeat the analysis on the same data with the model: salary of male = a′ + b′·(salary of female). • The regression analysis gives b′ = 0.56. Strong discrimination of men???? • A small simulation sketch after the figure below illustrates why both slopes fall below 1.
[Two scatter plots from the example: salary of women plotted against salary of men, and salary of men plotted against salary of women]
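Below is a minimal simulation sketch of the setup described above, assuming the salaries are generated exactly as on the slide (variable names are illustrative). It shows that the ordinary-least-squares slope comes out around 0.5 in both regression directions, even though neither gender is paid less on average; the predictor is a random variable, not a fixed one.

```python
# Simulation of the salary example: the salary process is identical for men and
# women, yet regressing one salary on the other gives a slope well below 1 in
# BOTH directions, because the predictor is itself a random variable (the
# "fixed variables" assumption of regression is violated).
import numpy as np

rng = np.random.default_rng(1)
n = 100
base = rng.uniform(50_000, 60_000, n)         # task "base salary", same for both
salary_m = base + rng.uniform(0, 10_000, n)   # base + random, gender-independent bonus
salary_f = base + rng.uniform(0, 10_000, n)

def ols_slope(x, y):
    """Ordinary-least-squares slope b in y = a + b*x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print("salary_f on salary_m:", ols_slope(salary_m, salary_f))  # roughly 0.5, not 1
print("salary_m on salary_f:", ols_slope(salary_f, salary_m))  # also roughly 0.5
# Theoretical slope: Var(base) / (Var(base) + Var(bonus)) = 0.5 here, since the
# base salary and the bonus both have a range of 10,000.
```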
How would you interpret these data? (from a published study) • CR duration = actual duration (effort) to complete a change request • Interpretation by the author of the paper: larger tasks are more under-estimated.
What about these data? They are from exactly the same data set! The only difference is the use of the estimated instead of the actual duration as the task-size variable.
Economy of scale? Probably not ... (M. Jørgensen and B. Kitchenham. Interpretation problems related to the use of regression models to decide on economy of scale in software development, Journal of Systems and Software, 85(11):2494-2503, 2012.)
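To see how such a flip can arise from random estimation error alone, here is a small simulation sketch (hypothetical data, not the data from the cited study): there is no real relationship between task size and estimation accuracy, yet relative estimation error appears positively related to actual duration and negatively related to estimated duration.

```python
# Hypothetical simulation: estimation error is purely random and unrelated to task
# size, yet the apparent relationship between relative estimation error and "size"
# flips with the choice of size measure (actual vs. estimated duration).
import numpy as np

rng = np.random.default_rng(7)
n = 500
true_size = rng.lognormal(mean=3.0, sigma=0.6, size=n)    # hypothetical CR sizes
estimated = true_size * rng.lognormal(0.0, 0.4, size=n)   # noisy estimates
actual = true_size * rng.lognormal(0.0, 0.4, size=n)      # noisy actual durations

rel_error = (actual - estimated) / actual                 # > 0 means under-estimated

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print("error vs actual duration:   ", corr(actual, rel_error))     # clearly positive
print("error vs estimated duration:", corr(estimated, rel_error))  # clearly negative
# Plotted against actual duration, large tasks look under-estimated; plotted against
# estimated duration, large tasks look over-estimated -- same data, same errors.
```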
Evidence-based software engineering (EBSE) The main steps of EBSE are as follows: • Convert a relevant problem or need for information into an answerable question. • Search the literature and practice-based experience for the best available evidence to answer the question. (+ create own local evidence, if needed) • Critically appraise the evidence for its validity, impact, and applicability. • Integrate the appraised evidence with practical experience and the client's values and circumstances to make decisions about practice. • Evaluate performance in comparison with previous performance and seek ways to improve it.
The software industry should learn to formulate questions that are meaningful for its context/challenge/problem. The question “Is Agile better than Traditional methods?” is NOT answerable: • What is agile? • What is traditional? • What is better? • What is the context?
Learn to be more critical (myth busting) when claims are made • Find out what is meant by the claim. • Is it possible to falsify the claim? If not, what is the function of the claim? • Put yourself in a “critical mode”: be aware of the tendency to accept claims, even without valid evidence, when we agree with them or they seem intuitively correct. • Reflect on what you would consider valid evidence for the claim. • Are there vested interests? • Do you agree because of the source? • Collect and evaluate evidence: research-based, practice-based, and “own” evidence. • Synthesize the evidence and conclude (if possible).
Learn how to evaluate argumentation • The elements of an argument: claim, data, warrant, backing, qualifier, reservation
Learn how to use Google Scholar (or similar sources of research-based evidence)
Learn how to collect and evaluate practice-based experience • Use methods similar to those used for evaluating research-based evidence and claims • Be aware of “organizational over-learning”
Learn how to create local evidence • Experimentation is simpler than you think • Pilot studies • Trial-sourcing • Controlled experiments
Is it realistic to achieve an evidence-based software engineering profession? • Yes, but there are challenges. • Main challenges: not much research; a high number of different contexts; much research has low reliability, which is sometimes hard to identify. • Opportunities: more and better use of practice-based evidence; more experimentation in local contexts.