Evaluating Coding Standards: Relating Violations and Observed Faults
Cathal Boogerd, Software Evolution Research Lab (SWERL)
Coding Standards: A Notorious Subject
• Put the name of a well-known code inspection tool on a poster in the middle of a software development department…
• Lots of discussion!
  • Developer: “I have to do this, but I don’t have time”
  • Architect: “Quality assessed by stupid rules”
  • QA Manager: “Difficult to get people to use the tool”
Coding Standards: Pros; why bother?
• Rules are often based on expert consensus
  • Gained after long years of experience with ‘faulty’ constructs
  • Using intricate knowledge of a language
• Rules are usually rather straightforward
  • Making automatic detection feasible
• Many tools exist with pre-defined rulesets and support for customization
  • QA-C, CodeSonar, FindBugs, and many more
• Clearly, this is a simple and sensible preventive approach
  • Or is it?
Coding Standards: Cons; please get those tools away from me!
• Automatic code inspection tools often produce many false positives
  • Situations where it is difficult to see the potential link to a fault
  • Cases where developers know the construct is harmless
• Solutions to reported violations can take the form of ‘tool satisfaction’
  • Developers find a workaround to silence the tool, rather than think about what is actually going on
  • Any modification has a non-zero probability of introducing a fault
• No empirical evidence supporting the intuition that rules prevent faults!
Concepts: Implicit basic idea and its consequences
• Violations of coding standard rules point out potential faults
  • All statements are potentially faulty, but…
  • Lines with violations are more likely to be faulty than lines without
• Releases with more violations contain more (latent) faults
  • Intuitive for two releases of the same software
  • But: we have to account for size, so we use densities instead (see the sketch after this slide)
• Modules within one release with a higher violation density (vd) have a higher fault density (fd)
  • This would point out potential problem areas in the software
• How to gather empirical evidence for these ideas?
  • Just put a question mark behind them…
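To make the density idea concrete, here is a minimal sketch in Python of how violation density and fault density per module could be computed and compared; the module names and numbers are invented for illustration, not taken from the case studies.

```python
# Minimal sketch: violation density (vd) and fault density (fd) per module,
# normalised by size so that modules of different sizes can be compared.
# Module names and numbers are invented for illustration.
modules = {
    # module: (violations, faults, lines_of_code)
    "tuner":   (120, 4,  8000),
    "decoder": (310, 9, 22000),
    "ui":      ( 45, 1,  5500),
}

for name, (violations, faults, loc) in modules.items():
    vd = violations / loc * 1000   # violations per KLoC
    fd = faults / loc * 1000       # faults per KLoC
    print(f"{name:8s} vd = {vd:5.1f}/KLoC  fd = {fd:4.2f}/KLoC")
```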
Research Questions
• Temporal aspect: Do rule violations explain occurrences of faults across releases?
  • On a project level: rank correlations of releases
• Spatial aspect: Do rule violations explain locations of faults within releases?
  • Different levels of granularity: rank correlations of files and modules
• Combined: Do rule violations explain locations of faults across releases?
  • On a line level: true positive rates for violations
• We investigate this for the body of violations as a whole, as well as for individual rules
Measurement Approach: Overview
[Diagram: measurement pipeline. Repository mining, config extraction, and code inspection work on the source repository (file versions, releases) and the issue DB (issues); together with a line tagger they produce cross-release correlations, TP rates and in-release correlations, and other metrics.]
Case Study: Projects TVoM and Vproc
• TVoM: platform for watching TV on a mobile phone
  • DRiver Abstraction Layer (DRAL): approx. 90 KLoC of C
  • 214 daily build releases, ~460 PRs
• Vproc: video processing part of the TV software
  • Developed in Eindhoven: approx. 650 KLoC of C
  • 41 internal releases, ~310 PRs
• SCM: Telelogic Synergy
  • Features a link between issue reports and modified files
• Both are embedded software projects within NXP, but:
  • Vproc is larger and more mature (product line vs. new project)
Case Study: Coding Standard MISRA-C:2004
• A coding standard based on the notion of a safer language subset
  • Banning potentially unsafe constructs
  • Placing special emphasis on automatic checking
• Initially meant for the automotive industry
  • MISRA-C:1998 by MIRA, a UK-based consortium of automotive industries
  • Widely adopted in industry, also outside automotive
• In 2004 the current, revised version was released
Measurement Approach: Temporal Aspect
• Obtain the following measures per release (a correlation sketch follows this slide):
  • Number of violations, number of faults, and size (LoC)
• Violations (also per rule):
  • Measured by running the QA-C code inspection tool with the MISRA ruleset
• Faults:
  • Estimated by taking the number of open issues at the release date
  • This is a conservative approximation!
• Size:
  • Measured as the number of physical lines of code
  • We opt for physical lines, since rules need not be limited to statements
• Note that this does not require the issue database
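A minimal sketch of the cross-release analysis, assuming per-release violation counts (e.g. from the QA-C output) and open-issue counts have already been extracted; the series below are illustrative, not the case-study data. It uses SciPy's Spearman rank correlation.

```python
# Sketch of the cross-release (temporal) analysis: per release, count MISRA
# violations and open issues, then rank-correlate the two series.
# The numbers below are illustrative placeholders.
from scipy.stats import spearmanr

violations_per_release = [812, 798, 805, 760, 742, 731, 690, 655]
open_issues_per_release = [ 35,  33,  36,  30,  28,  27,  22,  19]

rho, p_value = spearmanr(violations_per_release, open_issues_per_release)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```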
Measurement Approach: Spatial Aspect
• Similarly, we need to obtain faults and violations per file/module
• Violation density is measured as before
• The number of faults is estimated by tracking faulty lines (sketched below):
  • Extract all used files from the selected releases
  • Retrieve all versions of those files from the repository
  • Create a file-version graph with a diff for each edge
  • Use the file-version graph and diffs to track faulty lines to their origin
• A fault is assumed to be present from the first occurrence of one of its constituting lines until the conclusion of the issue
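A simplified sketch of the line-tracking idea, assuming file versions are available as lists of lines. It only walks a linear chain of versions, whereas the real approach uses a full file-version graph with branches and merges, and the foo.c contents are invented for illustration.

```python
# Simplified sketch: the lines a fault fix touched are taken as the faulty
# lines, and are traced back through earlier versions of the file with diffs,
# so the fault can be tagged from the first appearance of one of those lines.
import difflib

def changed_lines(before, after):
    """0-based line numbers in `before` that the change removed or replaced."""
    changed = set()
    for tag, a1, a2, b1, b2 in difflib.SequenceMatcher(a=before, b=after).get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(a1, a2))
    return changed

def map_lines_back(newer, older, lines):
    """Map 0-based line numbers in `newer` to their counterparts in `older`."""
    mapped = set()
    for tag, a1, a2, b1, b2 in difflib.SequenceMatcher(a=older, b=newer).get_opcodes():
        if tag == "equal":
            mapped.update(a1 + (b - b1) for b in range(b1, b2) if b in lines)
    return mapped

# foo.c versions 1, 2, 3; the edit from version 2 to 3 is the fault fix.
v1 = ["int f(int x)", "{", "  return x + 1;", "}"]
v2 = ["int f(int x)", "{", "  int y = x + 1;", "  return y + 1;", "}"]
v3 = ["int f(int x)", "{", "  int y = x + 1;", "  return y;", "}"]

faulty_in_v2 = changed_lines(v2, v3)                # lines the fix modified
faulty_in_v1 = map_lines_back(v2, v1, faulty_in_v2)
# An empty set for v1 means the faulty line first appeared in version 2, so the
# fault is assumed present from version 2 until the fixing issue was concluded.
print(faulty_in_v2, faulty_in_v1)
```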
Measurement Approach: File-version and annotation graphs
[Diagram: example file-version graph for foo.c, showing branching versions 1, 2, 2.1.1, 2.1.2, 2.1.3, 2.2.1, 2.2.2.1.1, 2.2.2.2.1, and 2.2.3.]
Measurement Approach: Temporal-Spatial Aspect
• Matches violations and faults on a line basis, by tracking violations similarly to faults in the spatial approach
• The true positive rate is #true positives / #violations
  • A ‘true positive’ is a violation that correctly predicted the line containing it to be faulty, i.e. part of a bug fix
  • The number of violations is the unique number over the whole history
    • Defined by the violation id and the specific line containing it
• How to assess the significance of the true positive rate?
Measurement Approach: Significance of line-based prediction
• Suppose a certain rule marks every line as a violation…
  • In this case the true positive rate will be equal to the faulty line ratio
  • In general: a random line predictor will end up around that ratio, given a sufficient number of attempts
• We need to determine whether violations outperform a uniform random line predictor
• The random predictor can be modeled as a Bernoulli process (sketched below)
  • p = faulty line ratio, #attempts = #violations, #successes = #TPs
  • The distribution is binomial; use its CDF to determine the significance of #TPs
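A minimal sketch of this significance test using the binomial survival function from SciPy; the faulty line ratio of 0.17 is the TVoM value quoted later, but the violation counts are placeholders rather than measured data.

```python
# Sketch of the significance test: model a uniform random line predictor as a
# Bernoulli process with success probability p = faulty line ratio, and ask
# how likely it would be to reach at least the observed number of true
# positives. Violation counts below are placeholders, not case-study values.
from scipy.stats import binom

faulty_line_ratio = 0.17   # p: chance that a randomly picked line is faulty
n_violations = 100         # attempts: unique violations of one rule
n_true_positives = 25      # successes: violations on lines later fixed

tp_rate = n_true_positives / n_violations
# P(X >= observed) for X ~ Binomial(n_violations, faulty_line_ratio);
# a small value means the rule outperforms the random line predictor.
p_value = binom.sf(n_true_positives - 1, n_violations, faulty_line_ratio)
print(f"TP rate = {tp_rate:.2f}, p-value = {p_value:.4f}")
```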
Results for TVoM: Evolution of measures over time
[Plot: violations, size (LoC), and faults per release.]
Results for TVoM: Cross-release correlation
• No relation in the first part of the project, but there is one in the second part
  • Rank correlation: 0.76, R² = 0.57, significant
• Individual rules: 73 unique rules found
  • 9 negative, 23 none, 41 positive
Results for TVoM: True positive rates
• Out of 72 rules, 13 had a TP rate > faulty line rate (0.17)
  • Of which 11 were significant at α = 0.05
• Although better than random, this does not say anything about applicability
  • For instance, rule 14.2 has 260 violations, of which 70% are false positives
  • On average, this requires about 10 tries before one is successful
  • To be relatively sure (α = 0.05) requires selection of 26 violations
• However, workload issues can be addressed by process design
  • Automatic run of code inspection upon check-in
  • A developer would only inspect violations in his delta (more context)
  • In that case, true positive rates can be useful for prioritization (see the arithmetic sketch below)
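A back-of-the-envelope sketch of the workload arithmetic, using a generic, assumed per-violation true positive rate p rather than the measured rate of any particular rule; the exact figures on the slide depend on the rule's own rate.

```python
# Back-of-the-envelope arithmetic behind the workload argument: with an
# assumed true positive rate p (placeholder, not a measured value), how many
# violations must be inspected on average before hitting a faulty line, and
# how many to be reasonably sure of finding at least one?
import math

p = 0.10      # assumed true positive rate of a rule
alpha = 0.05  # acceptable chance of inspecting only false positives

expected_tries = 1 / p
# Smallest n with P(no true positive in n inspections) = (1 - p)**n <= alpha
n_for_confidence = math.ceil(math.log(alpha) / math.log(1 - p))
print(f"expected tries: {expected_tries:.1f}, "
      f"inspections needed for confidence 1 - alpha: {n_for_confidence}")
```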
Results for Vproc: Evolution of faults and violations over time
[Plot: violations, size (LoC), and faults per release.]
Results for Vproc: Cross-release correlation
• No overall relation, only for some rules
• Individual rules: 89 distinct rules with violations
  • 15 negative, 59 none, 15 positive
Results for Vproc: True positive rates
• Out of 78 rules, 55 had a TP rate > faulty line rate (0.0005)
  • Of which 29 were significant at α = 0.05
• The faulty line rate is very different from TVoM!
  • Mature code, many files never modified
  • Does the assumption of a uniform distribution still hold?
• Analyzed additions in isolation (i.e., modified files only)
  • The faulty line rate becomes 0.06
  • Now only 40 rules have a TP rate > faulty line rate, 14 significant
• NB: some rules have very few violations
  • These easily outperform the random predictor (but not significantly)
Conclusions: Lessons learned
• We found some evidence for a relation between violations and faults
  • In both cases, but especially in TVoM
• At this point, no pattern of rules stands out
  • However, no consistent behavior of rules across the two cases
  • More cases are needed to increase confidence in the results
  • A priori rule selection is currently not possible
• The temporal method is easier to apply, but has some problems:
  • Inaccurate estimate of the number of faults
  • Too sensitive to changes other than fault fixes
Conclusions: Lessons learned (continued)
• Note that (negative) correlations do not imply causation!
  • Write a C++ style comment for every fault fix…
• Other (non-fix) modifications might obscure the correlation
• Spatial methods may be too restrictive
  • Not all modified/deleted lines in a fault fix are faulty
  • Sometimes fault fixes only introduce new code; then we are unable to locate the fault
• Must take care in the selection of the codebase to analyze
• Preliminary in-release results indicate no correlation
Conclusions: Final remarks
• Note that there may be more reasons than fault prevention to adhere to a coding standard
  • Maintainability: readability, common style
  • Portability: minimize issues due to compiler changes
• Nevertheless, quantification of fault prevention can be an important asset in the cost-benefit analysis of adherence
• You may have noticed that not all results were in the slides: this is work in progress!