Guidelines for Conducting and Evaluating Empirical Studies Barbara Kitchenham Keele University Guidelines for empirical studies
Agenda • Introduction • Guideline topics • Experimental context • Study Design • Data Collection • Analysis • Result presentation • Interpretation • Conclusions
Introduction • Empirical research in applied disciplines is weak • Medical research • Yancey reviewed A.J. Surgery and found • “methodologic errors so as to render invalid the conclusions of the authors” • Welch & Gabbe reviewed A.J. Obstetrics and found • Half the papers could not be assessed because of poor reporting • A third of the papers used statistics inappropriately
Empirical Software Engineering • In my experience (as a reviewer and researcher) • As bad as (if not worse than) medicine • David Hoaglin, former vice president of the American Statistical Association • Reviewed 8 papers from IEEE Trans SE • Poor study design • Inappropriate use of statistical techniques • Conclusions that don’t follow from results
Possible Solution • Medical researchers have produced statistical guidelines for researchers • CONSORT • Adopted by 70 journals • International guidelines on statistical principles for clinical trials ICH E9 • Guidelines might help SE research • Improve empirical studies • Assist reviewers • Enable meta-analysis
Interest group • Dr Shari Lawrence Pfleeger • Assistant editor of IEEE Trans-SE • Dr Lesley Pickard • Statistician & SE Researcher • Prof Peter Jones • Professor of Statistics • Jarret Rosenberg • Statistician working for SUN • Dr David Hoaglin
Context Guidelines • CG1: Be sure to specify as much of the software engineering context as possible • CG2: State the hypothesis so that the study’s objectives are clear • CG3: Discuss the theory from which the hypothesis is derived
Empirical context • Two types of context • “Extraneous” factors that affect software development • Product & Process variety • Corporate culture • Client expectations • Staff Motivation • Theoretical background to empirical study • Both have implications for empirical research
Product & process diversity • Immense variety in products, methods, procedures, culture • How can we tell whether results obtained in one environment will apply in another? • Need richer descriptions of context than are normally provided • “Company X, a large multinational telecommunications company,..”
Maintenance Ontology • Aimed to identify concepts that affect empirical maintenance studies • 12 “concepts” including • product • procedure • human resources • service level agreement • 23 concept properties that might affect results including • product size, maturity, quality, age • user type and population size
Theoretical context • Scientific method • Observe behaviour • Develop a theory • Test the theory • Study hypothesis should derive from the theory • Early empirical studies didn’t state hypotheses at all
Shallow hypotheses • Most researchers now state their hypotheses but they are usually not derived from a theory • Collect complexity metrics and fault counts and do a correlation analysis • Null hypothesis “There is no correlation between complexity and number of faults” • Hypothesis has no explanatory power
Deep Hypothesis • Vinter, Loomes & Kornbrot • Interested in validity of claims made about formal methods • Investigated cognitive psychology research concerned with logic errors • Studied logic errors made by people with a formal methods background • Null hypothesis that the errors would be the same as those observed for naïve subjects
Design Guidelines - 1 • DG1: Identify the population from which subjects/objects are drawn • DG2: Define the process by which subjects/objects were selected • DG3: Define the process by which subjects/objects are assigned to treatments
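A minimal sketch of how the assignment process in DG3 could be made explicit and repeatable: subjects are shuffled with a recorded seed and then dealt out to treatments in turn, which gives balanced groups. The function name `assign_to_treatments`, the subject labels and the seed value are illustrative assumptions, not part of the guidelines.

```python
import random


def assign_to_treatments(subjects, treatments, seed=None):
    """Randomly assign subjects to treatments in (near-)equal group sizes.

    Recording the seed lets the allocation be reproduced and reported.
    """
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    allocation = {treatment: [] for treatment in treatments}
    # Deal the shuffled subjects out to the treatments in rotation.
    for index, subject in enumerate(shuffled):
        treatment = treatments[index % len(treatments)]
        allocation[treatment].append(subject)
    return allocation


# Hypothetical example: ten subjects, two treatments.
subjects = [f"S{i}" for i in range(1, 11)]
allocation = assign_to_treatments(subjects, ["method A", "method B"], seed=2024)
for treatment, group in allocation.items():
    print(treatment, sorted(group))
```

Reporting the procedure and the seed alongside the results is one way to let a reviewer check that assignment was not influenced by the experimenter.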
Design Guidelines - 2 • DG4: Perform a pre-experiment or pre-calculation to identify the required sample size and experimental power • DG5: Justify the choice of outcome measures in terms of their relevance to the objectives of the empirical study • DG6: Restrict yourself to simple experiments or, at least, to designs that are fully defined in the literature
Design Guidelines - 3 • DG7: Explain how you handle possible subject bias • DG8: Avoid evaluating your own inventions, or make explicit any vested interests
Subject Selection • Sample (subject) selection • How was the sample obtained? • Was there any bias? • All responses to a Web posting • Self-selecting samples are not random • From what population was it derived? • Statistical results can only apply to the defined population • 2nd year undergraduates are only representative of 2nd year undergraduates
Randomisation Pitfalls • Was the randomisation process suitable? • Assign students in one class to one treatment, students in another class to another • The experimental “unit” is class not student • 0 degrees of freedom to test the treatment
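The zero-degrees-of-freedom point follows from the standard degrees-of-freedom count for a one-way design. A short sketch, with $u$ experimental units and $t$ treatments (symbols introduced here for illustration):

```latex
\begin{align*}
\mathrm{df}_{\text{total}}     &= u - 1, \\
\mathrm{df}_{\text{treatment}} &= t - 1, \\
\mathrm{df}_{\text{error}}     &= \mathrm{df}_{\text{total}} - \mathrm{df}_{\text{treatment}} = u - t .
\end{align*}
```

With two classes as the units and two treatments, $u = t = 2$, so $\mathrm{df}_{\text{error}} = 0$: there is no replication of units within a treatment from which to estimate error, however many students each class contains.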
Sample size • Was the sample size appropriate? • Should perform a pre-study • Assess the likely size of effect • Identify appropriate sample size for full experiment • Identify the power of the experiment
Importance of Power • Need to ensure that power is acceptable • power>0.9 • depends on • specific alternative hypothesis • sample size • Non-parametric tests less powerful than parametric tests • If parametric assumptions are true
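As a rough illustration of a pre-calculation, the sketch below uses the textbook normal approximation for a two-sided comparison of two group means. The standardised effect size (difference in means divided by the common standard deviation), the function name and the example values are assumptions for illustration; in practice the effect size would come from a pre-study, and a t-based calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.9):
    """Approximate subjects needed per group for a two-sided, two-sample test.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical effect sizes: smaller effects need far larger samples.
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: {sample_size_per_group(d)} subjects per group")
```

Under these assumptions an effect size of 0.5 at power 0.9 needs 85 subjects per group, which is more than many software engineering experiments recruit.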
Surrogate measures • Are the outcome measures appropriate? • Use of surrogate measures can be misleading • Defect counts for quality • Pre-release defect rates instead of post-release rates
Complex Designs • Experimental design determines the appropriate analysis • Some designs are too complex to analyse • Software tasks are affected by subject experience & capability • Cross-over designs use the different treatments on the same subject • Allow for experience but add design complexity • Order of treatment may affect subject experience • So order needs to be randomised • Adds to complexity of the cross-over experiment
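One way the treatment order in a cross-over design might be randomised is sketched below: every possible order is enumerated, and randomly shuffled subjects are spread evenly across the orders. The function name `assign_orders` and the subject labels are hypothetical; the analysis would still need to model the order effect.

```python
import random
from itertools import permutations


def assign_orders(subjects, treatments, seed=None):
    """Give each subject a treatment order, balancing subjects across orders."""
    rng = random.Random(seed)
    orders = list(permutations(treatments))
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    # Rotate through the possible orders so each is used about equally often.
    return {subject: orders[i % len(orders)] for i, subject in enumerate(shuffled)}


# Hypothetical example: eight subjects, each receiving both treatments.
plan = assign_orders([f"S{i}" for i in range(1, 9)], ["A", "B"], seed=7)
for subject in sorted(plan):
    print(subject, "->", " then ".join(plan[subject]))
```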
Experimenter Influence • Cannot perform blind or double-blind experiments in SE • A subject must by definition know what treatment he/she is assigned to • May be affected by expectation • How do we stop experimenter influence? • No way of addressing this problem • Beware vested interests
Data Collection Guidelines • DC1: For surveys, • specify the response rate • discuss the representativeness of the responses • discuss the impact of non-response • DC2: Define all software metrics fully
Data Collection Guidelines - 2 • DC3: For experiments, record data on subjects who drop out of experiments • DC4: For experiments, record data about other performance measures that you do not want to be adversely affected by the treatment even if they are not the main focus of the study
Metrics Definitions • Software metrics are not well-defined • Need to specify entity, attribute and counting rules • Counting rules • Define the where, when, & how of collecting measures • Defect counts are not comparable unless you know • When in the development process counting started
Extra data • Avoid sub-optimization • Many software engineering factors are related • If we test a hypothesis about productivity, we should consider the impact on quality
Analysis Guidelines • AG1: Analyze data in accordance with the experimental design • AG2: Justify any non-standard analysis • AG3: Identify whether your statistics are inferential or descriptive
Analysis Guidelines - 2 • AG4: Adjust significance levels if performing many significance tests on the same dataset • AG5: If possible, use blind analysis • AG6: Perform sensitivity analyses
Multiple tests • Researchers often perform many tests on the same data set • With many tests some “significant” results will occur by chance • 20 tests at p=0.05: 1 spurious significant result expected on average (chance of at least one is about 64% if the tests are independent) • Need to adjust significance levels when performing multiple tests • Bonferroni adjustment • 10 tests, for 0.05 overall, 0.005 for individual tests
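The arithmetic behind the multiple-testing problem can be sketched in a few lines. The calculation of the chance of at least one spurious result assumes the tests are independent and every null hypothesis is true; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(overall_alpha, n_tests):
    """Per-test significance level that keeps the overall level at overall_alpha."""
    return overall_alpha / n_tests


n_tests = 20
alpha = 0.05
print("expected false positives:", n_tests * alpha)
print("chance of at least one:", round(familywise_error_rate(alpha, n_tests), 3))
print("Bonferroni level for 10 tests:", bonferroni_alpha(0.05, 10))
```

For 20 tests at 0.05 the expected number of spurious results is 1 and the chance of at least one is about 0.64. The Bonferroni adjustment is conservative; it is shown here because it is the adjustment the slide names.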
Analysis Pitfalls • Experimenters often fish for results • Consider various different subsets of the data until they get the result they want • This problem is reduced by blind analysis • Analyst doesn’t know which treatment is which • Software datasets often have outliers • Sensitivity analysis ensures results are not due to outliers
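A simple form of sensitivity analysis is to re-estimate the effect with each observation left out in turn and see how far the estimate moves. The sketch below does this for a difference in group means; the data values are invented for illustration, and leave-one-out is only one of several possible sensitivity checks.

```python
from statistics import mean


def mean_difference(treatment, control):
    """Effect estimate: difference between the two group means."""
    return mean(treatment) - mean(control)


def leave_one_out(treatment, control):
    """Effect estimates obtained by dropping each observation in turn."""
    estimates = []
    for i in range(len(treatment)):
        reduced = treatment[:i] + treatment[i + 1:]
        estimates.append(mean_difference(reduced, control))
    for i in range(len(control)):
        reduced = control[:i] + control[i + 1:]
        estimates.append(mean_difference(treatment, reduced))
    return estimates


# Invented data: one treatment value (95) is far from the rest.
treatment = [10, 12, 11, 13, 12, 95]
control = [9, 10, 11, 10, 10]

full = mean_difference(treatment, control)
estimates = leave_one_out(treatment, control)
print(f"all data: {full:.2f}")
print(f"leave-one-out range: {min(estimates):.2f} to {max(estimates):.2f}")
```

Here the estimate falls from 15.5 to about 1.6 when the single extreme value is dropped, which signals that the apparent effect rests on one observation.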
Presentation Guidelines • PG1: Describe or reference all statistical procedures used • PG2: Present analyses that are relevant to the hypothesis • PG3: Present quantitative results as well as significance levels. Quantitative results should show the magnitude of effects and confidence limits
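To illustrate PG3, the sketch below reports the size of an effect with approximate confidence limits rather than a significance level alone. It uses a large-sample normal approximation, which is an assumption made to keep the example within the Python standard library; with small samples a t-based interval would be wider. The data values are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_with_limits(treatment, control, confidence=0.95):
    """Difference in means with approximate (normal-theory) confidence limits."""
    difference = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal group variances.
    standard_error = sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return difference, difference - z * standard_error, difference + z * standard_error


# Invented data, e.g. some outcome measure per subject in each group.
treatment = [12, 14, 11, 15, 13, 16, 14, 13]
control = [10, 11, 9, 12, 10, 11, 10, 9]

difference, lower, upper = mean_difference_with_limits(treatment, control)
print(f"difference = {difference:.2f}, 95% limits = ({lower:.2f}, {upper:.2f})")
```

Presenting the interval lets a reader judge practical importance as well as statistical significance.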
Presentation Guidelines - 2 • PG4: Present raw data, otherwise confirm that it is available for confidential review by reviewers and independent auditors
Raw data • It is important to allow readers to draw their own conclusions • Need access to raw data • Yancey “When science stops being public it stops being science” • Most software engineering results are “Company confidential” • Reviewers or auditors should be able to view the data
Interpretation of Results • IG1: Define the population to which inferential statistics apply • IG2: Differentiate between statistical significance and practical importance • IG3: Define the type of study • IG4: Specify any limitations of the study • IG5: Ensure conclusions arise from the presented results
Study types • There are differences between the types of inference you can make from different types of study • Yancey: “Only truly randomized tightly controlled prospective studies provide an opportunity for cause and effect statements” • Regression and correlation studies can only lead to weak conclusions.
Conclusions • Empirical software engineering needs to improve • Guidelines offer a means of propagating good practice • Need to be accepted by researchers • Need to be adopted by journal editors • These guidelines are a starting point • Need input from wider group
References • B.A. Kitchenham, R.T. Hughes and S.G. Linkman. Modeling software measurement data. IEEE Trans on SE (In press). • B.A. Kitchenham, S.G. Linkman and D.T. Law. Critical review of quantitative assessment. Software Engineering Journal 9(2), 1994, pp43-53. • B.A. Kitchenham, G. Travassos, A. von Mayrhauser, F. Niessink, N.F. Schneidewind, J. Singer, S. Takada, R. Vehvilainen, and H. Yang. Towards an Ontology of Software Maintenance. JSM (In press). • L.M. Pickard, B.A. Kitchenham and P. Jones. Combining empirical results in software engineering. Information and Software Technology 40(14), 1998, pp811-821. • G.A. Milliken and D.A. Johnson. Analysis of Messy Data Volume 1: Designed Experiments. Chapman & Hall, 1992, Chapters 5 & 32.
References • W.F. Rosenberger. Dealing with multiplicities in pharmacoepidemiologic studies. Pharmacoepidemiology and Drug Safety, 5, 1996, pp95-100. • R. Vinter, M. Loomes and D. Kornbrot. Applying software metrics to formal specifications: a cognitive approach. Proceedings of 5th International Software Metrics Symposium. IEEE Computer Society Press, 1998, pp216-223. • G.E. Welch and S.G. Gabbe. Review of statistics usage in the American Journal of Obstetrics and Gynecology, 175 (5), 1996, pp1138-1141. • J.M. Yancey. Ten rules for reading clinical research reports. Guest editorial. American Journal of Orthodontics and Dentofacial Orthopedics, 109 (5), May 1996, pp558-564.