
Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Erika Camargo and Ochimizu Koichiro, Japan Institute of Science and Technology. ESEM 2009.


Presentation Transcript


  1. Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects. Erika Camargo and Ochimizu Koichiro, Japan Institute of Science and Technology. ESEM 2009

  2. Contents • Abstract • Background • Problem Analysis • Case study • Results • Conclusion and Future Work

  3. Abstract
     Challenge: to make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects.
     First attempted solution: simple log data transformations.
     [Figure: P(y=1), the probability that a class is fault-prone, plotted against x, a design-complexity metric.]
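For reference, the curve in the figure has the standard univariate logistic form. The coefficient names \beta_0 and \beta_1 are notation introduced here, not taken from the slides:

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```

Here x is the value of one design-complexity metric for a class and y = 1 marks the class as fault-prone.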

  4. Background
     • Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models
     • Among these metrics are the Chidamber & Kemerer (CK) metrics
     • The 80th and 20th percentiles of the metric distributions can be used to determine high and low values
     • Their thresholds cannot be determined before their use and should be derived and used locally
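A minimal sketch of deriving local thresholds from the 20th and 80th percentiles of one project's metric values. The function name and the WMC numbers are hypothetical, and the interpolation method is an assumption, since the slides do not say how the percentiles were computed:

```python
import statistics

def percentile_thresholds(values, low_pct=20, high_pct=80):
    """Return (low, high) cut points at the given percentiles of `values`."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]

# Hypothetical WMC values for the classes of one project (illustrative only).
wmc = [2, 3, 5, 7, 8, 10, 12, 15, 21, 40]
low, high = percentile_thresholds(wmc)
```

Because the cut points come from the project's own distribution, running the same function on another project will generally give different thresholds, which is the point of the last bullet.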

  5. Problem Analysis
     Can an LR model built with this kind of metric work efficiently with different software projects?
     [Figure: P(y=1) plotted against x = number of methods for a large and a small software project, with "least faulty" and "most faulty" regions and the x values 10 and 20 marked.]

  6. Case Study
     • Data analysis of 7 different projects and application of simple log data transformations.
     • Construction of 3 univariate LR models using a large open source project (1st release of the Mylyn system, with 638 Java classes).
     • Independent variables: CK-CBO, CK-RFC, CK-WMC
     • Dependent variable: defects (from Bugzilla & CVS)
     • Testing of these models with 2 other, smaller projects (with 11 and 13 Java classes)
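A univariate LR model of the kind listed above can be sketched as follows. This is an illustration only: the slides do not say which tool or fitting procedure was used, so plain gradient ascent stands in here, and the CBO values and fault labels are made up:

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_univariate_lr(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by batch gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual for this class
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data: CBO per class and whether the class was fault-prone (1) or not (0).
cbo = [1, 2, 2, 3, 4, 5, 6, 8, 9, 12]
faulty = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_univariate_lr(cbo, faulty)
p_low = sigmoid(b0 + b1 * 2)    # predicted probability for a class with CBO = 2
p_high = sigmoid(b0 + b1 * 10)  # predicted probability for a class with CBO = 10
```

One such model would be fitted per metric (CBO, RFC, WMC) on the large project and then applied unchanged to the metric values of the smaller projects.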

  7. Challenge
     … produce biased regression estimates and reduce the predictive power of regression models
     BNS: Banking system (2006) *
     CRS: Cruise control system (2005) *
     ECS: E-commerce system (2006) *
     ELCS: Elevator control system (2003) *
     FACS: Factory automation system (2005) *
     GMF: Graphic Modeling Framework **
     MYL: Mylyn system **
     (*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
     (**) Eclipse project

  8. [Figure] RFC data of BNS is more spread out than the RFC data of MYL.

  9. [Figure] RFC data of BNS is more spread out than the RFC data of MYL.

  10. Case Study
      Solution: simple data transformation using log10.
      • The number of outliers is lower
      • The data spread is more uniform
      Example:
      LCBO = log10(CBO + 1)
      LTCBO = log10(CBO + 1) + dm
      where dm is the difference between the CBO medias of the Mylyn system and of the system whose data is being transformed.
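The two transformations can be sketched as below. Several details are assumptions, because the slide leaves them open: "medias" is read here as the arithmetic mean (pass `statistics.median` as `center` for the other reading), dm is taken as reference minus target, and it is computed on the log-transformed values. The function names and sample data are hypothetical:

```python
import math
import statistics

def log_transform(values):
    """L = log10(x + 1) for each metric value x."""
    return [math.log10(v + 1) for v in values]

def shifted_log_transform(values, reference_values, center=statistics.mean):
    """LT = log10(x + 1) + dm, where dm aligns the target project with the reference."""
    # Assumption: dm is the reference project's central value minus the
    # target project's, both measured after the log transformation.
    logged = log_transform(values)
    dm = center(log_transform(reference_values)) - center(logged)
    return [v + dm for v in logged]

# Hypothetical CBO values: a large reference project and a small target project.
mylyn_cbo = [0, 1, 2, 3, 3, 4, 6, 9, 14, 30]
small_cbo = [0, 1, 1, 2, 3]
shifted = shifted_log_transform(small_cbo, mylyn_cbo)
```

Under these assumptions the shifted data of the small project has the same central value as the log-transformed reference data, so a model fitted on the reference project sees inputs on a comparable scale.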

  11. Results
      Effects of the log data transformations:
      • Elimination of a great number of outliers
      • Better overall goodness of fit for the 3 models
      • Discrimination (most faulty / least faulty): all models discriminate well between the most faulty and least faulty classes of the Mylyn system
      What about using different projects?
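The outlier effect can be illustrated with a small sketch. The slides do not state how outliers were identified, so Tukey's 1.5 × IQR rule is an assumption here, and the RFC values are invented for the example:

```python
import math
import statistics

def count_outliers(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < lo or v > hi)

# Hypothetical RFC values with a long right tail (illustrative only).
rfc = [3, 5, 6, 8, 9, 11, 12, 14, 15, 18, 20, 40, 55]
raw_outliers = count_outliers(rfc)
log_outliers = count_outliers([math.log10(v + 1) for v in rfc])
```

On this sample the two largest values are flagged in the raw data and none are flagged after the log10(x + 1) transformation, because the logarithm compresses the right tail.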

  12. Results: Banking system [Figure; MF: most faulty, LF: least faulty]

  13. Results: E-commerce system [Figure; MF: most faulty, LF: least faulty]

  14. Conclusions and Future Work
      • CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects
      • Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used in the construction of the model
      • Future work: further data exploration and study of data transformations

  15. Thank you! questions, comments … contact: erika.camargo@jaist.ac.jp
