Nearest Neighbor Sampling for Better Defect Prediction
Gary D. Boetticher
Department of Software Engineering, University of Houston - Clear Lake, Houston, Texas, USA
The Problem: Why is there not more ML in Software Engineering?
• Machine Learning / Algorithmic approaches: 7 to 16%
• Human-Based approaches: 62 to 86% [Jørgensen 2004]
Key Idea: More ML in SE through a better-defined experimental process.
Agenda • A better-defined process for better prediction (of quality) • Experiments: Nearest Neighbor Sampling on PROMISE defect data sets • Extending the approach • Discussion • Conclusions
A Better Defined Process
• Emphasis on ML approaches
• Emphasis on measuring success: PRED(X), Accuracy, MARE (see the sketch below)
• Prediction success depends upon the relationship between training and test data.
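A minimal sketch of these measures, assuming their standard definitions in the estimation literature (MARE as Mean Absolute Relative Error, PRED(X) as the fraction of predictions within X of the actual value); the function names are illustrative, not from the slides:

```python
import numpy as np

def mare(actual, predicted):
    """Mean Absolute Relative Error (actual values must be nonzero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) / actual))

def pred_x(actual, predicted, x=0.25):
    """PRED(X): fraction of predictions within x (e.g. 25%) of the actual value."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) / actual <= x))

def accuracy(y_true, y_pred):
    """Classification accuracy: fraction of instances labeled correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```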
PROMISE Defect Data (from NASA) • 21 Inputs • Size (SLOC, Comments) • Complexity (McCabe Cyclomatic Complexity) • Vocabulary (Halstead Operators, Operands) • 1 Output: Number of Defects
Data Preprocessing: defect counts reduced to 2 classes (0 defects vs. 1+ defects).
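A hedged sketch of this step in pandas; the file name jm1.csv and the defects column name are assumptions about a local copy of the PROMISE data, not details given in the slides:

```python
import pandas as pd

# Load one PROMISE data set (file and column names are assumed).
data = pd.read_csv("jm1.csv")

# Reduce the numeric defect count to 2 classes: 0 defects vs. 1+ defects.
data["defect_class"] = (data["defects"] >= 1).astype(int)

# 40% training split, as used in Experiment 1 below (random sampling assumed).
train = data.sample(frac=0.40, random_state=0)
remaining = data.drop(train.index)
```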
Experiment 1
Training set: 40% of the original JM1 data
• 6904 vectors with 0 defects
• 2007 vectors with 1+ defects (22%)
The resulting model is evaluated against a Nice Test set and a Nasty Test set.
Experiment 1 Continued
Both the Nice Test set and the Nasty Test set are drawn from the vectors remaining in the data set after the training split; one plausible construction is sketched below.
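The slides do not spell out how the Nice and Nasty sets are built; a plausible reading, consistent with the match definition used later for experiment difficulty, is that a held-out vector is nice when its nearest training neighbor shares its class and nasty otherwise. A sketch under that assumption (inputs are NumPy arrays):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_nice_nasty(X_train, y_train, X_rest, y_rest):
    """Partition the remaining vectors into a nice set (nearest training
    neighbor has the same class) and a nasty set (it does not)."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    _, idx = nn.kneighbors(X_rest)
    same = y_train[idx[:, 0]] == y_rest
    return (X_rest[same], y_rest[same]), (X_rest[~same], y_rest[~same])
```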
Experiment 1 Continued
• J48 and Naïve Bayes classifiers from WEKA
• 200 trials in total (100 with Nice Test data + 100 with Nasty Test data)
• Five data sets (CM1, JM1, KC1, KC2, PC1), each with 20 Nice trials + 20 Nasty trials
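For readers without WEKA, one trial can be approximated with scikit-learn stand-ins: DecisionTreeClassifier (CART) in place of J48 (C4.5) and GaussianNB in place of WEKA's Naïve Bayes. These are substitutions, not the tools used in the experiments:

```python
from sklearn.tree import DecisionTreeClassifier  # stand-in for WEKA's J48 (C4.5)
from sklearn.naive_bayes import GaussianNB       # stand-in for WEKA's Naive Bayes

def run_trial(X_train, y_train, X_test, y_test):
    """Train both classifiers and report test accuracy for one trial."""
    for name, clf in [("tree", DecisionTreeClassifier(random_state=0)),
                      ("naive bayes", GaussianNB())]:
        clf.fit(X_train, y_train)
        print(f"{name}: accuracy = {clf.score(X_test, y_test):.3f}")
```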
Results: Average Confusion Matrix
[Slide figure: average confusion matrices for the Nice and Nasty test results over the classes 0 Defects and 1+ Defects; note the class distribution.]
Assessing Experiment Difficulty
Exp_Difficulty = 1 - Matches / Total_Test_Instances
Match: a test vector whose nearest neighbor in the training set belongs to the same class.
Experimental Difficulty = 0: easy experiment
Experimental Difficulty = 1: hard experiment
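A direct translation of this measure, assuming Euclidean nearest neighbors over NumPy feature arrays:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def experiment_difficulty(X_train, y_train, X_test, y_test):
    """Exp_Difficulty = 1 - Matches / Total_Test_Instances.
    A match is a test vector whose nearest training neighbor shares its class."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    matches = np.sum(y_train[idx[:, 0]] == y_test)
    return 1.0 - matches / len(y_test)
```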
Assessing Overall Data Difficulty
Overall_Data_Difficulty = 1 - Matches / Total_Data_Instances
Match: a data vector whose nearest neighbor among the other vectors in the data set belongs to the same class.
Overall Data Difficulty = 0: easy data
Overall Data Difficulty = 1: difficult data
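The same idea applied to the whole data set, using each vector's nearest neighbor among the remaining vectors (leave-one-out):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overall_data_difficulty(X, y):
    """Overall_Data_Difficulty = 1 - Matches / Total_Data_Instances.
    A match is a vector whose nearest neighbor among the *other* vectors
    shares its class; the first neighbor of each point is (typically)
    the point itself, so the second neighbor is used."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    matches = np.sum(y[idx[:, 1]] == y)
    return 1.0 - matches / len(y)
```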
Discussion: Anticipated Benefits • Method for characterizing difficulty of experiment • More realistic models • Easy to implement • Can be integrated into N-Way Cross Validation • Can apply to various types of SE data sets: • Defect Prediction • Effort Estimation • Can be extended beyond SE to other domains
Discussion: Potential Problems
• More work needs to be done
• Agreement needed on how to measure Experimental Difficulty
• Extra overhead
• Implicitly or explicitly data-starved domains
Conclusions
How to get more ML in SE? Assess experiments and data for their difficulty.
Benefits:
• More credibility for the modeling process
• More reliable predictors
• More realistic models
Acknowledgements Thanks to the reviewers for their comments!
References
1) Jørgensen, M., "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.