On the application of GP for software engineering predictive modeling: A systematic review
Expert Systems with Applications, Vol. 38, No. 9, 2011
Wasif Afzal, Richard Torkar
Blekinge Institute of Technology, Karlskrona, Sweden
{waf,rto}@bth.se
Agenda
• Research question
• Symbolic regression
• Prediction and estimation in sw engineering
• GP for prediction and estimation in sw engineering
• Application of GP for sw quality classification
• Application of GP for sw cost/effort/size estimation
• Application of GP for sw fault prediction and sw reliability growth modeling
• Future work
• Conclusions
• Recommendations
Our research question
• Is there evidence that symbolic regression using GP is an effective method for prediction and estimation, in comparison with regression, machine learning and other models (including expert opinion and different improvements over the standard GP algorithm)?
It is about symbolic regression!
• Symbolic regression – one of the many application areas of GP
• Searches for a function whose outputs match the desired outcomes
• Makes no assumptions about:
  • The structure of the function
  • The data distribution
  • The relationship between independent and dependent variables
• Helps in identifying the significant variables for subsequent modeling attempts
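To make the idea concrete, a minimal symbolic-regression sketch in Python using the open-source DEAP library is given below. The library choice, the toy target function x**2 + x, the primitive set and all parameter values are illustrative assumptions rather than the setups of the reviewed studies; in the primary studies the inputs would be software metrics and the fitness would encode a quality, effort or reliability measure.

```python
# Minimal symbolic regression with GP, sketched on top of the DEAP library.
# All choices below (target function, primitives, parameters) are illustrative
# assumptions, not the configurations used in the reviewed primary studies.
import math
import operator
import random

from deap import algorithms, base, creator, gp, tools

# Expression trees over one input variable and basic arithmetic primitives.
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.sub, 2)
pset.addPrimitive(operator.mul, 2)
pset.addEphemeralConstant("rand101", lambda: random.randint(-1, 1))
pset.renameArguments(ARG0="x")

# Single-objective fitness: minimize the mean squared error of the evolved expression.
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=2)
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("compile", gp.compile, pset=pset)

def eval_symb_reg(individual, points):
    """Mean squared error against a toy target, f(x) = x**2 + x."""
    func = toolbox.compile(expr=individual)
    sq_errors = ((func(x) - (x ** 2 + x)) ** 2 for x in points)
    return (math.fsum(sq_errors) / len(points),)

toolbox.register("evaluate", eval_symb_reg, points=[x / 10.0 for x in range(-10, 10)])
toolbox.register("select", tools.selTournament, tournsize=3)
toolbox.register("mate", gp.cxOnePoint)
toolbox.register("expr_mut", gp.genFull, min_=0, max_=2)
toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)

# Limit tree height to keep bloat (and hence solution size) under control.
toolbox.decorate("mate", gp.staticLimit(key=operator.attrgetter("height"), max_value=17))
toolbox.decorate("mutate", gp.staticLimit(key=operator.attrgetter("height"), max_value=17))

if __name__ == "__main__":
    random.seed(42)
    pop = toolbox.population(n=300)
    hof = tools.HallOfFame(1)
    algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.1, ngen=40,
                        halloffame=hof, verbose=False)
    print(hof[0])  # best evolved expression, e.g. something equivalent to mul(x, add(x, 1))
```

Running the script prints the best evolved expression tree; note that no assumption about the functional form or the data distribution is supplied beyond the primitive set, which is exactly the appeal of symbolic regression for predictive modeling.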
Prediction and estimation in sw engineering
• Software quality
  • Software quality classification
  • Software fault prediction
  • Software reliability growth modeling
• Software size
• Software development cost/effort
• Maintenance task effort
• Software release timing
GP for prediction and estimation in sw engineering
• 23 identified primary studies
  • Software quality classification (8)
  • Software cost/effort/size estimation (7)
  • Software fault prediction and software reliability growth modeling (8)
Application of GP for sw quality classification (8 studies)
• Variations of the dependent variable:
  • Fault proneness
  • Quality ranking of program modules (high risk to low risk)
• Variations in sampling of training and testing sets:
  • Simple hold-out and 10-fold CV
Application of GP for sw quality classification (cont'd)
• Variations in fitness function:
  • Single objective:
    • Minimization of root mean square error
    • Minimization of the average cost of misclassification
  • Multi-objective:
    • Minimization of the average cost of misclassification + minimization of tree size
    • Maximization of the best percentage of actual faults, averaged over the percentile levels of interest, while controlling tree size
    • Balancing over-sampling and under-sampling in each class for a decision tree
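As an illustration of the cost-based fitness functions just listed, one common textbook-style formulation of the average cost of misclassification is sketched below; it is not necessarily the exact formula used in the reviewed studies.

```latex
% One common formulation of the average cost of misclassification (illustrative;
% not necessarily the exact formula used in the reviewed studies).
% N          : total number of modules classified
% N_I, N_II  : number of Type I (fault-free flagged as fault-prone) and
%              Type II (fault-prone flagged as fault-free) misclassifications
% C_I, C_II  : the corresponding misclassification costs, typically C_II > C_I
\[
  \mathrm{ACM} \;=\; \frac{C_{I}\,N_{I} \;+\; C_{II}\,N_{II}}{N}
\]
% The multi-objective variant additionally penalizes solution size, i.e. it
% minimizes ACM together with the GP tree size as a second objective.
```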
Application of GP for sw quality classification (cont'd)
• Variations in comparison groups:
  • Neural networks
  • k-nearest neighbour
  • Regression (linear, logistic)
  • Humans
Application of GP for sw quality classification (cont'd)
• Results:
  • The majority of the studies (6 out of 8) reported results in favor of using GP for the classification task
• Limitations:
  • Increase the comparisons with a more representative set of techniques
  • Increase the use of publicly available data sets for easier replication
Application of GP for sw quality classification (cont'd)
• Encouraging aspects:
  • The data sets used represent real-world projects
  • Fitness functions encoding problem-dependent objectives perform better than standard GP
Application of GP for sw cost/effort/size (CES) estimation (7 studies)
• Variations of the dependent variable:
  • Software effort
  • Software cost
  • Software size
• Variations in fitness function:
  • Single objective:
    • Minimization of mean squared error or of MMRE (mean magnitude of relative error), as illustrated below
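For reference, the standard definitions of these two error measures are given below (actual value y_i, estimate \hat{y}_i, n observations); the exact evaluation setups vary across the primary studies.

```latex
% Standard definitions (y_i = actual value, \hat{y}_i = GP estimate, n = observations);
% the evaluation setups vary across the primary studies.
\[
  \mathrm{MSE}  \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2},
  \qquad
  \mathrm{MMRE} \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{\bigl|\,y_i - \hat{y}_i\,\bigr|}{y_i}
\]
```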
Application of GP for sw cost/effort/size (CES) estimation (cont'd)
• Variations in comparison groups:
  • ANN, nearest neighbour and different forms of regression
• Variations in sampling of training and testing sets:
  • Simple hold-out
Application of GP for sw cost/effort/size (CES) estimation (cont'd)
• Results:
  • No strong evidence of GP performing consistently well on all the evaluation measures used
• Limitations:
  • Evaluation measures used are not standardized
  • Different hold-out splits used for training and testing sets
  • Lack of statistical hypothesis testing
  • Lack of comparison groups
Application of GP for sw fault prediction and sw reliability growth modeling (8 studies)
• Variations of the dependent variable:
  • Software fault prediction
  • Software reliability growth modeling
• Variations in fitness function:
  • Single objective:
    • Minimization of standard error
Application of GP for sw fault prediction and sw reliability growth modeling (cont'd)
• Variations in comparison groups:
  • Standard GP, Naive Bayes, traditional software reliability growth models
• Variations in sampling of training and testing sets:
  • Hold-out and 10-fold CV
Application of GP for sw fault prediction and sw reliability growth modeling (cont'd)
• Results:
  • 7 out of 8 studies favor the use of GP
• Limitations:
  • Poor representation of comparison groups
  • Absence of a baseline to compare against
Promising future work to undertake
• Multi-objective fitness evaluation (e.g. minimization of the standard error and maximization of the correlation coefficient), sketched below
• Simplification of GP solutions to help interpret the relationships between variables
• Evaluation of techniques to minimize overfitting of GP solutions
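One way such a bi-objective fitness could be written down is sketched below. The standard error and Pearson correlation are standard definitions, but the weighted combination is only an illustrative assumption; the paper leaves the aggregation open, and a Pareto-based multi-objective formulation would be an alternative.

```latex
% Standard error (root-mean-square form; variants divide by n-2) and Pearson
% correlation between actual values y_i and model outputs \hat{y}_i:
\[
  \mathrm{SE} \;=\; \sqrt{\frac{\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^{2}}{n}},
  \qquad
  r \;=\; \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
               {\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}\,
                \sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^{2}}}
\]
% Illustrative weighted scalarization (an assumption, not prescribed by the paper);
% a Pareto-based multi-objective GP would avoid fixing the weight w:
\[
  \text{fitness} \;=\; w\,\mathrm{SE} \;+\; (1 - w)\,(1 - r),
  \qquad 0 \le w \le 1 \quad (\text{to be minimized})
\]
```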
Conclusions
• A total of 23 studies apply GP for predictive studies in sw engineering:
  • sw quality classification (8)
  • sw cost/effort/size estimation (7)
  • sw fault prediction and sw reliability growth modeling (8)
• There is evidence in support of using GP for:
  • sw quality classification
  • sw fault prediction and sw reliability growth modeling
• but not for:
  • sw cost/effort/size estimation
Recommendations
• Use public data sets wherever possible.
• Apply commonly used sampling strategies.
• Use techniques to avoid overfitting in GP solutions.
• Report the settings of GP parameters.
• Compare the performances against a commonly used baseline.
• Use statistical experimental designs.