Introduction to Defect Prediction Cmpe 589 Spring 2008
Problem 1 • How to tell if the project is on schedule and within budget? • Earned-value charts.
Problem 2 • How hard will it be for another organization to maintain this software? • McCabe Complexity.
Problem 3 • How to tell when the subsystems are ready to be integrated? • Defect Density Metrics.
Problem Definition • Software development lifecycle: • Requirements • Design • Development • Test (takes ~50% of overall time) • Detect and correct defects before delivering software. • Test strategies: • Expert judgment • Manual code reviews • Oracles/predictors as secondary tools
Defect Prediction • 2-Class Classification Problem. • Non-defective • If error = 0 • Defective • If error > 0 • 2 things needed: • Raw data: Source code • Software Metrics -> Static Code Attributes
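A minimal sketch of the labeling rule above, in Python (the module names and error counts are hypothetical; any module with at least one recorded error is marked defective):

# Turn per-module error counts into the 2-class labels:
# non-defective if error = 0, defective if error > 0.
modules = [("parser.c", 0), ("scanner.c", 3), ("util.c", 1)]  # hypothetical data
labels = [(name, "defective" if errors > 0 else "non-defective")
          for name, errors in modules]
print(labels)  # [('parser.c', 'non-defective'), ('scanner.c', 'defective'), ...]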
Static Code Attributes

#include <stdio.h>

/* Returns the sum of two numbers */
int sum(int a, int b)
{
    return a + b;
}

int main(void)
{
    /* Declare and initialize variables */
    int a = 2, b = 5, c;

    /* Find the sum and display c if greater than zero */
    c = sum(a, b);
    if (c > 0)
        printf("%d\n", c);
    return 0;
}

LOC: Lines of Code
LOCC: Lines of Commented Code
V: Number of unique operands & operators
CC: Cyclomatic Complexity
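A rough sketch of how two of the listed attributes could be extracted by line scanning (a real metrics extractor parses the code properly; the filename is a placeholder):

# Count LOC (non-blank lines) and LOCC (lines starting with a // comment)
# in a C source file. Halstead V and cyclomatic complexity CC require
# proper tokenizing/parsing and are omitted here.
def count_loc_locc(path):
    loc = locc = 0
    with open(path) as src:
        for line in src:
            stripped = line.strip()
            if not stripped:
                continue  # skip blank lines
            loc += 1
            if stripped.startswith("//"):
                locc += 1
    return loc, locc

print(count_loc_locc("sample.c"))  # -> (LOC, LOCC) for the file above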
Defect Prediction • Machine learning based models. • Defect density estimation • Regression models: error proneness • First classification, then regression • Defect prediction between versions • Defect prediction for embedded systems
Constructing Predictors • Baseline: Naive Bayes. • Why? Best reported results so far (Menzies et al., 2007). • Remove assumptions and construct different models: • Independent attributes -> multivariate distribution • Attributes of equal importance
Weighted Naive Bayes • Naive Bayes: score(C) = P(C) · Π_i P(x_i | C); predict the class with the highest score • Weighted Naive Bayes: score(C) = P(C) · Π_i P(x_i | C)^(w_i), where w_i is the weight of attribute i; setting every w_i = 1 recovers standard Naive Bayes
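A compact sketch of the weighted variant with Gaussian likelihoods (the weights below are illustrative, not the ones derived in the lecture; standard Naive Bayes is the special case where all weights equal 1):

import numpy as np

def fit(X, y):
    # Per class: prior P(C), per-attribute mean and variance.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),
                     Xc.mean(axis=0),
                     Xc.var(axis=0) + 1e-9)  # small floor avoids div-by-zero
    return params

def predict(x, params, w):
    # Score each class by log P(C) + sum_i w_i * log P(x_i | C).
    best, best_score = None, -np.inf
    for c, (prior, mu, var) in params.items():
        loglik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + np.sum(w * loglik)
        if score > best_score:
            best, best_score = c, score
    return best

X = np.array([[10., 2.], [12., 1.], [300., 40.], [280., 35.]])  # toy attributes
y = np.array([0, 0, 1, 1])      # 0 = non-defective, 1 = defective
w = np.array([1.0, 0.5])        # illustrative attribute weights
print(predict(np.array([290., 38.]), fit(X, y), w))  # -> 1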
Performance Measures • Confusion matrix: A = true negatives (non-defective, predicted non-defective), B = false negatives (defective, predicted non-defective), C = false positives (non-defective, predicted defective), D = true positives (defective, predicted defective) • Accuracy: (A+D)/(A+B+C+D) • Pd (Hit Rate): D/(B+D) • Pf (False Alarm Rate): C/(A+C)
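The three measures as code, using the confusion-matrix cells defined above (the example counts are made up):

# A = true negatives, B = false negatives,
# C = false positives, D = true positives.
def measures(A, B, C, D):
    accuracy = (A + D) / (A + B + C + D)
    pd = D / (B + D)   # hit rate: fraction of defective modules caught
    pf = C / (A + C)   # false alarm rate on non-defective modules
    return accuracy, pd, pf

print(measures(A=850, B=20, C=50, D=80))  # -> (0.93, 0.8, 0.0555...)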
Benefiting from defect data in practice • Within-company vs cross-company data • Investigated in the cost estimation literature • No studies in defect prediction! • No conclusions in cost estimation… • Straightforward interpretation of results in defect prediction. • Possible reason: well-defined features.
How much data do we need? • Consider: • Dataset size: 1000 • Defect rate: 8% • Training instances: 90% of the data • Training set: 1000 × 90% = 900 instances, of which 900 × 8% = 72 are defective • 900 − 72 = 828 non-defective training instances
Intelligent data sampling • With a random sample of 100 instances we can learn as well as with thousands. • Can we increase the performance with wiser sampling strategies? • Which data? • Practical aspects: industrial case study.
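One way to test the claim is a sampling experiment: train on random subsets of increasing size and check where performance stops improving. Here `train_and_eval` is a hypothetical stand-in for fitting the predictor on a subset and scoring it on held-out data:

import random

def learning_curve(data, sizes, train_and_eval, trials=20):
    # For each sample size: repeatedly draw a random subset,
    # train on it, and record the evaluation score.
    results = {}
    for n in sizes:
        results[n] = [train_and_eval(random.sample(data, n))
                      for _ in range(trials)]
    return results

# e.g. learning_curve(training_data, [25, 50, 100, 200, 400], train_and_eval)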
WC vs CC Data? • When to use WC or CC? • How much data do we need to construct a model? (ICSOFT’07)
Module Structure vs Defect Rate • Fan-in, fan-out • PageRank algorithm • Call graph information extracted from the code • “small is beautiful”
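A small sketch of the idea: run PageRank over the call graph (an adjacency map of caller -> callees; the graph below is hypothetical) so that heavily referenced modules receive high rank:

def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, callees in graph.items():
            if callees:
                # Each callee inherits an equal share of the caller's rank.
                share = damping * rank[v] / len(callees)
                for u in callees:
                    new[u] += share
            else:
                # Dangling node: spread its rank uniformly.
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

calls = {"main": ["sum", "printf"], "sum": [], "printf": []}  # hypothetical call graph
print(pagerank(calls))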