Software Quality Analysis with Limited Prior Knowledge of Faults

Naeem (Jim) Seliya
Assistant Professor, CIS Department
University of Michigan – Dearborn
313 583 6669
nseliya@umich.edu

Overview
• Introduction
• Knowledge-Based Software Quality Analysis
• Software Quality Analysis with Limited Prior Knowledge of Faults
• Empirical Case Study
• Software Measurement Data
• Empirical Results
• Conclusion
Wayne State University CS Seminar

Introduction
• Software quality assurance is vital during the software development process
• Knowledge-based software quality models are useful for allocating limited resources to faulty programs
• Working hypothesis: software measurements are often treated as predictors of software quality
• System operations and software maintenance benefit from targeting program modules that are likely to have defects

Introduction …
• Software quality modeling has been addressed in related literature
    • software quality classification models
    • software fault prediction models
    • software module order modeling
• A supervised learning approach is typically taken for software quality modeling
    • requires prior experience of developing systems relatively similar to the target system
    • requires complete knowledge of defect data of previously developed program modules

Software Quality Analysis
[Diagram labels: Previous Experience (Software Metrics, Known Defect Data), Model Training, Learnt Hypothesis, Model Application, Target Project (Software Metrics, Unknown Defect Data)]

Software Quality Analysis …
• Practical software engineering problems
    • Organization has limited software defect data from previous experiences with similar software projects
    • Organization does not have software defect data from previous experiences with similar software projects
    • Organization does not have experience with developing similar software projects
• Two very likely problem scenarios
    • Software quality modeling with limited software defect data
    • Software quality modeling without software defect data

Limited Defect Data Problem
[Diagram labels: Previous Experience (Software Metrics, Known Defect Data, Unknown Defect Data), Model Training, Learnt Hypothesis, Model Application, Target Project (Software Metrics, Unknown Defect Data)]

No Software Defect Data Problem
[Diagram labels: Previous Experience (Software Metrics, Known Defect Data), Model Training, Learnt Hypothesis, Model Application, Target Project (Software Metrics, Unknown Defect Data)]

Limited Defect Data Problem …
• Some contributing issues:
    • cost of metrics data collection may limit the subsystems for which software fault data is collected
    • software defect data collected for some modules may be error-prone due to data collection problems
    • defect data may be reliable for only some components
    • only some project components of a distributed software system may collect software fault data
    • in a multiple-release system, fault data may not be collected for all releases

Objectives
• Developing solutions to software quality analysis when there is only limited a priori knowledge of defect data
• Learning software quality trends from both the labeled (small) and unlabeled (large) components of software measurement data
• Providing empirical software engineering evidence of the effectiveness and practical appeal of the proposed solutions

Proposed Solutions
• Constraint-Based Clustering with Expert Input
• Semi-Supervised Classification with the Expectation-Maximization Algorithm

Constraint-Based Clustering with Expert Input
• Clustering is an appropriate choice for software quality analysis based on program attributes alone
• Clustering algorithms group program modules according to their software attributes
• Program modules with similar attributes will likely have similar software quality characteristics
• Low-quality modules will likely group together into fp (fault-prone) clusters
• High-quality modules will likely group together into nfp (not fault-prone) clusters

Constraint-Based Clustering with Expert Input …
• Labeled data instances are used to modify and enhance clustering results on the unlabeled data instances
• Investigated unsupervised clustering with expert input for software quality classification
• Constraint-based clustering can aid the expert in better labeling the clusters as fp or nfp
• Identify difficult-to-classify modules or noisy instances in the software measurement data

Proposed Algorithm
• A constraint-based clustering approach is implemented with the k-means algorithm
• Labeled program modules are used to initialize centroids of a certain number of clusters
• Grouping of the labeled modules remains unchanged as fixed constraints
• Expert has the flexibility to inspect and label additional clusters as nfp or fp during the semi-supervised clustering process

Proposed Algorithm …
Let D contain the L_nfp, L_fp, and U sets of program modules
• Step 1. Obtain initial numbers of nfp and fp clusters:
    • Execute the Cg algorithm to obtain the optimal number (p) of nfp clusters among {1, 2, …, Cin_nfp}
    • Execute the Cg algorithm to obtain the optimal number (q) of fp clusters among {1, 2, …, Cin_fp}
    • The Cg algorithm is the work of Krzanowski and Lai: A criterion for determining the number of groups in a data set using sums-of-squares clustering. Biometrics, 44(1):23-34, March 1988.
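For reference, the Krzanowski and Lai criterion is commonly stated as below. This is a sketch from the cited paper's usual formulation, not from the slides: W_g denotes the pooled within-cluster sum of squares for g clusters, and m (a symbol introduced here to avoid clashing with the slide's p) is the number of software metrics per module.

```latex
\[
\mathrm{DIFF}(g) = (g-1)^{2/m}\, W_{g-1} - g^{2/m}\, W_{g},
\qquad
C(g) = \left| \frac{\mathrm{DIFF}(g)}{\mathrm{DIFF}(g+1)} \right|
\]
```

The number of clusters is taken as the g that maximizes C(g). Because DIFF(g) needs W_{g-1}, the criterion in this form is evaluated for g of 2 or more; how the study handles g = 1 is not stated on the slide.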

Proposed Algorithm …
• Step 2. Initialize centroids of clusters:
    • Centroids of p out of the C_max clusters are initialized to the centroids of the nfp clusters
    • Centroids of q out of the remaining (C_max - p) clusters are initialized to the centroids of the fp clusters
    • Centroids of the remaining r (i.e. C_max - p - q) clusters are initialized to randomly selected modules from U
    • Randomly select 5 unique sets of modules for initializing the r unlabeled clusters

Proposed Algorithm …
• Step 3. Execute constraint-based clustering:
    • k-means with Euclidean distance is run on D with the initialized centroids of the C_max clusters
    • Clustering is run under the constraint that an existing membership of a module to a labeled cluster remains unchanged
    • Clustering is repeated for all 5 centroid initialization settings
    • The clustering associated with the median SSE value is selected for subsequent computation
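The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `-1` marker for unlabeled modules, and the toy data are all assumptions made for the example.

```python
import numpy as np

def constrained_kmeans(X, init_centroids, fixed_assign, max_iter=100):
    """k-means in which modules with fixed_assign[i] >= 0 never change cluster.

    X              : (n, d) array of software metrics
    init_centroids : (k, d) array of starting centroids
    fixed_assign   : length-n int array; cluster index for labeled modules,
                     -1 for unlabeled modules (hypothetical convention)
    """
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    fixed_assign = np.asarray(fixed_assign)
    fixed = fixed_assign >= 0
    assign = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every module to every centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dist.argmin(axis=1)
        # Constraint: labeled modules keep their existing membership.
        new_assign[fixed] = fixed_assign[fixed]
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for c in range(len(centroids)):
            members = X[assign == c]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[c] = members.mean(axis=0)
    sse = float(((X - centroids[assign]) ** 2).sum())
    return assign, centroids, sse

def median_sse_run(runs):
    """Pick the run with the median SSE from an odd-length list of
    (assign, centroids, sse) tuples, e.g. the 5 initialization settings."""
    return sorted(runs, key=lambda r: r[2])[len(runs) // 2]

# Toy data: four labeled modules (two per labeled cluster), two unlabeled.
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [0.5, 0.5], [9, 9]])
fixed = np.array([0, 0, 1, 1, -1, -1])
assign, centroids, sse = constrained_kmeans(X, [[0, 0.5], [10, 10.5]], fixed)
print(assign, round(sse, 3))
```

In a full run, `constrained_kmeans` would be called once per random initialization of the r unlabeled centroids and the results passed to `median_sse_run`.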

Proposed Algorithm …
• Step 4. Expert-based labeling of clusters:
    • Expert is presented with descriptive statistics of the r unlabeled clusters and asked to label them as nfp or fp
    • Expert labels only those clusters for which the expert is very confident in the label estimation
    • If at least 1 of the r clusters is labeled, go to Step 2 and continue

Proposed Algorithm …
• Step 5. Stop semi-supervised clustering:
    • The iterative semi-supervised clustering process is stopped when the sets C_nfp, C_fp, and C_ul are unchanged
    • Program modules in the p (or q) clusters are labeled and recorded as nfp (or fp)
    • Program modules in the remaining r unlabeled clusters are not assigned any fault-proneness labels

Software Measurement Data
• Software metrics datasets obtained from seven NASA software projects (MDP Initiative)
    • JM1, KC1, KC2, KC3, CM1, MW1, and PC1
• Projects characterized by the same set of software product metrics and built in similar software development environments
• Defect data reflect changes made to source code for correcting errors recorded in problem reporting systems

Software Measurement Data …
• The JM1 project dataset is used as training data in our empirical case studies
    • it is the largest dataset among the seven software projects
• The remaining six datasets are used as test data for model evaluation and generalization performance
• Among the 21 product metrics, only 13 basic metrics are used in our study

Software Metrics
• Cyclomatic complexity
• Essential complexity
• Design complexity
• Number of unique operators
• Number of unique operands
• Total number of operators
• Total number of operands
• Total number of lines of source code
• Executable lines of code
• Lines with code and comments
• Lines with only comments
• Blank lines of code
• Branch count

Case Study Datasets

Constraint-Based Clustering Case Study
• JM1 dataset pre-processed to yield a reduced dataset of 8850 modules, i.e. JM1-8850
    • program modules with identical software attributes but different fault-proneness labels were eliminated
• JM1-8850 used as training instances and to form the respective labeled and unlabeled datasets
• KC1, KC2, KC3, CM1, MW1, and PC1 used as test datasets to evaluate the knowledge learnt from constraint-based clustering analysis of the software data
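The pre-processing step above can be illustrated with a short sketch. It assumes that every copy of a conflicting metric vector is dropped (the slide does not say whether one copy is retained), and the function name and tuple layout are hypothetical.

```python
from collections import defaultdict

def drop_conflicting_modules(modules):
    """Remove modules whose metric vector also occurs with a different label.

    modules: list of (metrics, label) pairs, where metrics is a sequence of
    numbers and label is "nfp" or "fp". Assumption: all copies of a
    conflicting vector are eliminated.
    """
    labels_seen = defaultdict(set)
    for metrics, label in modules:
        labels_seen[tuple(metrics)].add(label)
    return [(m, y) for m, y in modules if len(labels_seen[tuple(m)]) == 1]

modules = [
    ((1, 2), "nfp"),
    ((1, 2), "fp"),   # same attributes, different label: both copies dropped
    ((3, 4), "fp"),
    ((3, 4), "fp"),   # exact duplicates with the same label are kept
    ((5, 6), "nfp"),
]
cleaned = drop_conflicting_modules(modules)
print(len(cleaned))
```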

Constraint-Based Clustering Case Study …
• Labeled datasets formed by random sampling
    • LP = {100, 250, 500, 1000, 1500, 2000, 2500, 3000} labeled modules
    • Each LP dataset randomly selected to maintain an 80:20 proportion of nfp:fp program modules
    • 3 samples were obtained for each LP value, and average results are reported in the paper
    • 5 samples for LP = {100, 250, 500}
• Parameter settings
    • C_max = {30, 40} clusters
    • Cin_fp = {10, 20} for Cg algorithm
    • Cin_nfp = {10, 20} for Cg algorithm
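One way to draw a labeled sample with the 80:20 nfp:fp proportion described above is sketched here. The function name, the seed handling, and the use of module ID lists are assumptions for illustration; the slides do not describe the sampling code.

```python
import random

def sample_labeled(nfp_ids, fp_ids, lp, nfp_share=0.8, seed=0):
    """Draw lp module IDs without replacement, keeping nfp_share of them nfp."""
    rng = random.Random(seed)
    n_nfp = round(lp * nfp_share)
    n_fp = lp - n_nfp
    return rng.sample(nfp_ids, n_nfp), rng.sample(fp_ids, n_fp)

# Hypothetical pools of module IDs; the real pools come from JM1-8850.
nfp_ids = list(range(1000))
fp_ids = list(range(1000, 1300))
labeled_nfp, labeled_fp = sample_labeled(nfp_ids, fp_ids, lp=100)
print(len(labeled_nfp), len(labeled_fp))
```

Repeating the call with different `seed` values would give the multiple samples per LP value over which results are averaged.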

Initial Clusters of Labeled Modules

Expert-Based Labeling Results

Classification of Labeled Modules

Classification of Labeled …

Classification of Labeled … C_max = 30 Clusters

Classification of Labeled … C_max = 40 Clusters

Classification of Test Datasets: Unsupervised Clustering

Classification of Test Datasets … Constraint-Based Clustering (LP = 250)

Classification of Test Datasets … Average Classification of Test Data Modules

Comparison with C4.5 Models
• C4.5 decision tree implemented in Weka, an open-source data mining tool
• Supervised decision tree models built using 10-fold cross-validation
• Decision tree parameters tuned for appropriate comparison with constraint-based clustering
    • Tuning for similar Type I (false positive) error rates
• C4.5 models yielded very low false positives in conjunction with very high false negatives
• Performance of C4.5 models generally remains unchanged as LP grows, whereas constraint-based clustering improves
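The two error rates used in this comparison can be computed as in the sketch below. It assumes the usual convention in this line of work: a Type I error is an nfp module classified as fp (false positive), and a Type II error is an fp module classified as nfp (false negative). The function name and label strings are illustrative.

```python
def error_rates(actual, predicted):
    """Return (Type I rate, Type II rate) for "nfp"/"fp" labels.

    Type I  = nfp modules predicted fp, divided by the number of nfp modules.
    Type II = fp modules predicted nfp, divided by the number of fp modules.
    """
    n_nfp = sum(1 for a in actual if a == "nfp")
    n_fp = sum(1 for a in actual if a == "fp")
    if n_nfp == 0 or n_fp == 0:
        raise ValueError("both classes must be present")
    type1 = sum(1 for a, p in zip(actual, predicted) if a == "nfp" and p == "fp")
    type2 = sum(1 for a, p in zip(actual, predicted) if a == "fp" and p == "nfp")
    return type1 / n_nfp, type2 / n_fp

actual = ["nfp"] * 8 + ["fp"] * 2
predicted = ["fp"] + ["nfp"] * 7 + ["fp", "nfp"]
type1_rate, type2_rate = error_rates(actual, predicted)
print(type1_rate, type2_rate)  # 0.125 0.5
```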

Remaining Unlabeled Modules
• Do they constitute noisy data? Are they hard-to-model modules?
• Do they form new groups of program modules for the given system?
• Are their software measurements uniquely different from the other program modules?
• Did something go wrong in the software metrics data collection process?
• Did the project not collect other software metrics that may better represent the software quality?

Remaining Unlabeled Modules …
• Ensemble Filter (EF) strategy
    • Comparison with majority EF
    • Consists of 25 classifiers from different learning theories and methodologies
• Investigate commonality of modules detected by EF and those that remain unlabeled after the constraint-based clustering process
    • About 40% to 50% were common with those considered noisy by the ensemble filter
    • A relatively large number of the same modules were consistently included in the pool of remaining unlabeled program modules
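A majority ensemble filter and the overlap check described above can be sketched as follows. The assumptions here are that a module counts as noisy when more than half of the classifiers misclassify it, and that the overlap is measured as a fraction of the remaining unlabeled modules; the slides do not spell out either detail, and the function names are hypothetical.

```python
def majority_filter(actual, predictions):
    """Return indices of modules misclassified by more than half the classifiers.

    actual      : list of true labels, one per module
    predictions : list of per-classifier prediction lists (same length as actual)
    """
    n_clf = len(predictions)
    noisy = []
    for i, label in enumerate(actual):
        wrong = sum(1 for preds in predictions if preds[i] != label)
        if wrong > n_clf / 2:
            noisy.append(i)
    return noisy

def overlap_fraction(noisy, unlabeled):
    """Fraction of the unlabeled modules that the filter also flags as noisy."""
    return len(set(noisy) & set(unlabeled)) / len(unlabeled)

actual = ["nfp", "fp", "nfp"]
predictions = [
    ["nfp", "nfp", "fp"],
    ["nfp", "nfp", "nfp"],
    ["fp", "nfp", "fp"],
]
noisy = majority_filter(actual, predictions)
print(noisy, overlap_fraction(noisy, unlabeled=[2, 5]))
```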

Constraint-Based Clustering Case Study … Summary
• Improved estimation performance compared to unsupervised clustering with expert input
• Better test data performance compared to a supervised learner trained on the labeled dataset
• For larger labeled datasets, generally improved performance compared to EM-based semi-supervised classification
• Several of the remaining unlabeled modules are likely to constitute noisy data, providing insight into their attributes

Semi-Supervised Classification with the EM Algorithm
• Learning from a small labeled and a large unlabeled software measurement dataset
• Expectation Maximization (EM) algorithm for building semi-supervised software quality classification models
• Improve the supervised learner with knowledge stored in software attributes of the unlabeled program modules
• The labeled dataset is iteratively augmented with program modules in the unlabeled dataset

Semi-Supervised Classification with the EM Algorithm …
[Diagram labels: JM1-8850, Labeled Program Modules {100, 250, 500, 1000, 1500, 2000, 2500, 3000}, Unlabeled Program Modules, EM Algorithm for Estimating Class Labels, Confidence Based Selection, Selected Unlabeled Modules]
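The general idea of EM-based label estimation followed by confidence-based selection can be sketched as below. This is an illustration only: it assumes a two-class Gaussian model with independent features, labeled modules whose class memberships stay fixed, and a 0.9 confidence threshold. None of these choices come from the slides, and the function name is hypothetical.

```python
import numpy as np

def em_semi_supervised(X_l, y_l, X_u, n_iter=20, threshold=0.9):
    """Estimate P(fp) for unlabeled modules with EM, then select confident ones.

    X_l, y_l : labeled metrics and labels (0 = nfp, 1 = fp), both classes present
    X_u      : unlabeled metrics
    Returns (P(fp) per unlabeled module, indices whose top posterior >= threshold).
    """
    X_l, X_u = np.asarray(X_l, float), np.asarray(X_u, float)
    R_l = np.eye(2)[np.asarray(y_l)]      # labeled memberships stay one-hot
    R_u = np.zeros((len(X_u), 2))         # first M-step uses labeled data only
    X = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        # M-step: weighted class priors, means, and per-feature variances.
        R = np.vstack([R_l, R_u])
        w = R.sum(axis=0)
        prior = w / w.sum()
        mean = (R.T @ X) / w[:, None]
        var = (R.T @ X ** 2) / w[:, None] - mean ** 2 + 1e-6
        # E-step: class posteriors for the unlabeled modules, in log space.
        sq = (X_u[:, None, :] - mean[None]) ** 2 / var[None]
        log_p = np.log(prior)[None, :] - 0.5 * (np.log(2 * np.pi * var)[None] + sq).sum(axis=2)
        log_p -= log_p.max(axis=1, keepdims=True)
        R_u = np.exp(log_p)
        R_u /= R_u.sum(axis=1, keepdims=True)
    selected = np.where(R_u.max(axis=1) >= threshold)[0]
    return R_u[:, 1], selected

X_l = [[0.0], [1.0], [10.0], [11.0]]
y_l = [0, 0, 1, 1]
X_u = [[0.5], [10.5], [0.2], [10.8]]
p_fp, selected = em_semi_supervised(X_l, y_l, X_u)
print(np.round(p_fp, 3), selected)
```

The selected modules, with their estimated labels, would then be added to the labeled dataset before the next round.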

Semi-Supervised Classification with the EM Algorithm …
• Proposed semi-supervised classification process improved generalization performance
• Semi-supervised software quality classification models generally yielded better performance than C4.5 decision trees
• About 40% to 50% of the remaining modules are likely to constitute noisy data
• Number of unlabeled modules selected for augmentation was largest when LP = 1000

Conclusion
• Practical solutions to the problem of software quality analysis with limited a priori knowledge of defect data
• Empirical investigation with software measurement data from real-world projects
• Constraint-based clustering vs. semi-supervised classification
    • Constraint-based clustering generally yielded better performance than EM-based semi-supervised classification
    • EM-based semi-supervised classification has lower complexity
    • Semi-supervised classification with EM allows for control of the relative balance between the Type I and Type II error rates

Some Future Work
• Extending the limited defect data problem to quantitative software fault prediction models
• A software engineering study on characteristics of program modules that remain unlabeled
• Investigating the software development process
• Exploring semi-supervised learning schemes for detecting noisy instances in a dataset
• Investigating self-labeling heuristics for minimizing expert involvement in the unsupervised and constraint-based clustering approaches

Other SE Research Focus
• Knowledge-based software security modeling and analysis
• Studying the influence of diversity in pair programming teams
• Software engineering measurements for agile development teams and projects
• Software forensics with cyber security and education applications

Questions?
Naeem (Jim) Seliya
Assistant Professor, CIS Department
University of Michigan – Dearborn
313 583 6669
nseliya@umich.edu
