Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects. Erika Camargo and Ochimizu Koichiro Japan Institute of Science and Technology. ESEM 2009. Contents. Abstract Background Problem Analysis Case study Results Conclusion and Future Work.
Contents • Abstract • Background • Problem Analysis • Case study • Results • Conclusion and Future Work
Abstract Challenge: To make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects. First attempt at a solution: simple log data transformations. [Figure: P(y=1), the probability that a class is fault-prone, plotted against x, a design-complexity metric]
Background • Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models • Among these metrics are the Chidamber & Kemerer (CK) metrics • The 80th and 20th percentiles of their distributions can be used to determine high and low values • Their thresholds cannot be determined before use and should be derived and used locally
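The percentile rule above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the CBO values are made up, the function names are hypothetical, and the percentile uses linear interpolation, which is only one of several common definitions.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def local_thresholds(values):
    """Derive (low, high) thresholds from one project's own distribution."""
    return percentile(values, 20), percentile(values, 80)


def classify(value, low, high):
    """Label a metric value relative to the project's local thresholds."""
    if value >= high:
        return "high"
    if value <= low:
        return "low"
    return "medium"


# Hypothetical CBO values for one project (assumption, not data from the study).
project_cbo = [1, 2, 2, 3, 4, 5, 6, 8, 10, 15, 21]
low, high = local_thresholds(project_cbo)
labels = [classify(v, low, high) for v in project_cbo]
```

Because the thresholds come from the project's own data, the same CBO value can be "high" in one project and "medium" in another, which is the reason the slide says thresholds should be derived and used locally.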
Problem Analysis Can an LR model built with this kind of metric work efficiently with different software projects? [Figure: P(y=1) against x = Number of Methods, with Least Faulty and Most Faulty regions for a large-size SW project and a small-size SW project; axis values 10 and 20]
Case Study • Data analysis of 7 different projects and application of simple log data transformations. • Construction of 3 univariate LR models using a large open source project (1st release of the Mylyn system, with 638 Java classes). • Independent variables: CK-CBO, CK-RFC, CK-WMC • Dependent variable: Defects (from Bugzilla & CVS) • Test these models with 2 other smaller projects (with 11 and 13 Java classes)
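A univariate LR model of the kind built in the case study can be sketched as follows. This is an illustrative sketch only: the slides do not say which statistical package or fitting procedure was used, so plain gradient ascent stands in here, and the metric values and 0/1 fault labels are invented. The metric is the predictor and the fault label is the outcome.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_univariate_lr(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent.

    The predictor is standardized internally so a fixed learning rate
    behaves well; coefficients are converted back to the original scale.
    """
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n) or 1.0
    zs = [(x - mean) / std for x in xs]
    a0 = a1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0
        for z, y in zip(zs, ys):
            err = y - sigmoid(a0 + a1 * z)  # gradient of the log-likelihood
            g0 += err
            g1 += err * z
        a0 += learning_rate * g0 / n
        a1 += learning_rate * g1 / n
    b1 = a1 / std
    b0 = a0 - a1 * mean / std
    return b0, b1


def predict(b0, b1, x):
    """Estimated probability that a class with metric value x is fault-prone."""
    return sigmoid(b0 + b1 * x)


# Hypothetical training data: one metric value per class, 1 = fault-prone.
metric = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
faulty = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_univariate_lr(metric, faulty)
```

Testing on another project then amounts to calling `predict(b0, b1, x)` on that project's metric values, which only works well if those values are on a scale comparable to the training project's.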
Challenge: Outliers produce biased regression estimates and reduce the predictive power of regression models. Projects: BNS: Banking system (2006) *; CRS: Cruise control system (2005) *; ECS: E-commerce system (2006) *; ELCS: Elevator control system (2003) *; FACS: Factory automation system (2005) *; GMF: Graphic Modeling Framework **; MYL: Mylyn system **. (*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000. (**) Eclipse project.
The RFC data of BNS is more spread out than the data of MYL.
Case Study Solution: simple data transformation using log10. • There are fewer outliers • The data spread is more uniform. Example: LCBO = log10(CBO + 1); LTCBO = log10(CBO + 1) + dm, where dm is the difference between the CBO medians of the Mylyn system and of the system whose data is being transformed
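The two transformations can be sketched as below. Two assumptions are made that the slide does not settle: that the difference dm is a difference of medians, and that the medians are taken on the log10 scale, so that the transformed data ends up with the same median as the reference (Mylyn) data. The function names and CBO values are hypothetical, chosen so the logs are round numbers.

```python
import math
from statistics import median


def log_transform(values):
    """LCBO = log10(CBO + 1); the +1 keeps zero-valued metrics defined."""
    return [math.log10(v + 1) for v in values]


def shifted_log_transform(values, reference_values):
    """LTCBO = log10(CBO + 1) + dm.

    Assumption: dm is the median of the log-transformed reference data
    minus the median of the log-transformed data being shifted.
    """
    logged = log_transform(values)
    dm = median(log_transform(reference_values)) - median(logged)
    return [v + dm for v in logged], dm


# Hypothetical CBO values (not data from the study).
small_project_cbo = [0, 9, 99]
mylyn_cbo = [9, 99, 999]

lcbo = log_transform(small_project_cbo)
ltcbo, dm = shifted_log_transform(small_project_cbo, mylyn_cbo)
```

After the shift, the smaller project's values sit on the same part of the scale as the data the model was built on, which is what lets a model fitted on one project be applied to another.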
Results Effects of the log data transformations: • Elimination of a great number of outliers • Better overall goodness of fit for the 3 models • Discrimination (Most Faulty/Least Faulty): all models discriminate well between the Most Faulty and Least Faulty classes of the Mylyn system • What about using different projects?
Results: Banking system [Figure; MF: Most Faulty, LF: Least Faulty]
Results: E-commerce system [Figure; MF: Most Faulty, LF: Least Faulty]
Conclusions and Future Work • CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects • Simple log transformations seem to improve the predictive ability of LR models, especially when the project measures are not as spread out as those used to construct the model • Future work: further data exploration and study of data transformations
Thank you! questions, comments … contact: erika.camargo@jaist.ac.jp