A Metric for Software Readability by Raymond P.L. Buse and Westley R. Weimer Presenters: John and Suman
Readability • The human judgment of how easy a text is to understand • A local, line-by-line feature • Not related to the size of a program • Not related to essential complexity of software
Readability and Maintenance • Reading code is the most time-consuming of all maintenance activities [J. Lionel E. Deimel 1985, D. R. Raymond 1991, S. Rugaber 2000] • 70% of software cost for maintenance [B. Boehm and V. R. Basili 2001] • Readability also correlates with software quality, code change and defect reporting
Problem Statement • How do we create a software readability metric that: • Correlates strongly with human annotators • Correlates strongly with software quality • Currently no automated readability measure for software
Contributions • A software readability metric that: • correlates strongly with human annotators • correlates strongly with software quality • A survey of 120 human readability annotators • A discussion of the features related to software readability
Readability Metrics for Natural Language • Empirical and objective models of readability • Flesch-Kincaid Grade Level has been used for over 50 years • Based on simple features: • Word length • Sentence length • Used by government agencies, MS Word, Google Docs
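As a rough illustration of how simple such natural-language metrics are, the standard Flesch-Kincaid Grade Level formula can be sketched from raw counts. The function name and the choice to take pre-computed counts (rather than counting syllables from text) are assumptions made for this example, not something the slides specify.

```python
def flesch_kincaid_grade(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch-Kincaid Grade Level computed from raw counts.

    Sentence length is measured in words per sentence and word length in
    syllables per word; the constants are those of the published formula.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical text: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
```

Longer sentences and longer words both push the grade level up, which is the same intuition the deck later applies to long lines of code.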
Experimental Approach • 120 human annotators were shown 100 code snippets • Resulting 12,000 readability judgments available online (ok, not really)
Snippet Selection - Goals • Length • Short enough to aid feature discrimination • Long enough to capture important readability considerations • Logical Coherence • Shouldn't span methods • Include adjacent comments • Avoid trivial snippets (e.g., a group of import statements)
Snippet Selection - Algorithm • Snippet = 3 consecutive simple statements • Based on authors' experience • Simple statements are: declarations, assignments, function calls, breaks, continues, throws and returns • Other nearby statements are included: • Comments, function headers, blank lines, if, else, try-catch, switch, etc. • Snippets cannot cross scope boundaries
Readability Scoring • Readability was rated from 1 to 5 • 1 - “less readable” • 3 - “neutral” • 5 - “more readable”
Inter-annotator agreement • Good correlation needed for a coherent model • Pearson product-moment correlation coefficient • Correlation of 1 indicates perfect correlation • Correlation of 0 indicates only random correlation • Calculated pairwise for all annotators • Average correlation of 0.56 • Typically considered “moderate to strong”
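The pairwise agreement computation described on this slide could look roughly like the sketch below. The rating lists are made-up toy data and the function names are illustrative; the study itself used 120 annotators and 100 snippets.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_pairwise_correlation(ratings):
    """Mean Pearson correlation over every pair of annotators.

    `ratings` holds one list of scores per annotator, all in the same
    snippet order.
    """
    pairs = list(combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): three annotators scoring four snippets from 1 to 5.
toy_ratings = [
    [1, 2, 4, 5],
    [2, 2, 5, 4],
    [1, 3, 3, 5],
]
agreement = average_pairwise_correlation(toy_ratings)
```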
Readability Model Objective: • Mechanically predict human readability judgments • Determine code features that are predictive of readability Usage: • Use this model to analyze code (automate software readability metric)
Model Generation • Classifier: a machine learning algorithm • Instances: feature vectors, one per snippet • Experiment procedure - training phase - set of instances, each labeled with the “correct answer” - classify based on the score from the bimodal distribution
Model Generation (contd …) • Decide on a set of features that can be detected statically • These factors relate to structure, density, logical complexity, documentation of the analyzed code • Each feature is independent of the size/block of code
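To make "statically detectable, size-independent features" concrete, here is a minimal sketch that computes per-line averages and maxima for two features the deck discusses later (line length and identifiers per line). The regular expression, the abbreviated keyword list, and the feature names are assumptions for illustration; this is not the authors' feature extractor, and it naively counts identifiers inside comments and strings as well.

```python
import re

# Abbreviated, illustrative keyword list; a real tool would use the full
# keyword set of the language being analyzed.
JAVA_KEYWORDS = {
    "int", "long", "double", "boolean", "char", "void", "return", "if",
    "else", "for", "while", "break", "continue", "throw", "new", "class",
    "public", "private", "static", "final", "try", "catch", "switch", "case",
}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def snippet_features(snippet: str) -> dict:
    """Per-line averages and maxima, so values do not grow with snippet size."""
    lines = snippet.splitlines()
    if not lines:
        raise ValueError("empty snippet")
    lengths = [len(line) for line in lines]
    identifier_counts = [
        sum(1 for token in IDENTIFIER.findall(line) if token not in JAVA_KEYWORDS)
        for line in lines
    ]
    return {
        "avg_line_length": sum(lengths) / len(lines),
        "max_line_length": max(lengths),
        "avg_identifiers_per_line": sum(identifier_counts) / len(lines),
        "max_identifiers_per_line": max(identifier_counts),
    }


features = snippet_features("int x = 1;\nreturn x + y;")
```

Using averages and maxima per line, rather than totals, is what keeps each feature independent of how long the snippet is.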
Model Generation (contd …) • Build a classifier on a set of features • Use 10-fold cross validation - random partitioning of data set into 10 subsets - train on 9 and test on 1 - repeat this process 10 times • Mitigate any bias from partitioning by repeating the 10-fold validation 20 times • Average the results across all of the runs
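The validation procedure on this slide can be sketched in plain Python as follows. The `train_and_score` callback is a hypothetical stand-in for "train a classifier on the training folds and return its score on the held-out fold"; the slides do not name a specific learner or library.

```python
import random
from statistics import mean


def repeated_k_fold(instances, train_and_score, k=10, repeats=20, seed=0):
    """Average score over `repeats` independent runs of k-fold cross-validation.

    Each run reshuffles the data, splits it into k folds, trains on k-1
    folds and tests on the remaining one, once per fold.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        indices = list(range(len(instances)))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [instances[i] for i in indices if i not in held]
            test = [instances[i] for i in held_out]
            scores.append(train_and_score(train, test))
    return mean(scores)
```

With 100 snippets, `k=10`, and `repeats=20`, the callback runs 200 times, each time with 90 training and 10 test instances.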
Results • Two relevant success metrics – precision & recall • Recall – percentage of the snippets judged “more readable” by annotators that the model also classifies as “more readable” • Precision – fraction of the snippets classified as “more readable” by the model that annotators also judged “more readable” • Performance is measured by weighing the two together using the f-measure statistic, the harmonic mean of the two metrics
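A small sketch of how precision, recall, and the f-measure relate, treating “more readable” as the positive class. The boolean label lists are toy data and the function name is an assumption for this example.

```python
def precision_recall_f(human, predicted):
    """Precision, recall and f-measure for the "more readable" class.

    `human` and `predicted` are parallel lists of booleans, where True
    means "more readable".
    """
    tp = sum(1 for h, p in zip(human, predicted) if h and p)
    fp = sum(1 for h, p in zip(human, predicted) if not h and p)
    fn = sum(1 for h, p in zip(human, predicted) if h and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The f-measure is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): five snippets.
human = [True, True, True, False, False]
predicted = [True, True, False, True, False]
precision, recall, f_measure = precision_recall_f(human, predicted)
```

Either metric alone can be made perfect trivially (for example, labeling every snippet “more readable” gives a recall of 1), which is why the two are combined.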
Results (contd …) • “0.61” – f-measure of the classifier trained on randomly generated score labels • “0.8” – f-measure of the classifier trained on average human data
Results (contd …) • Repeated the experiment separately for each annotator experience group (100-, 200-, and 400-level and graduate CS students)
Interesting Facts from performance measure … • Average line length and average number of identifiers per line are important to readability • Average identifier length, loops, if constructs and comparison operators are not very predictive features
Readability Correlations (Experiment 1) • Correlate defects detected by FindBugs* with the readability metric • Run FindBugs on benchmarks • Partition the functions into two sets (those containing at least one reported defect and those containing none) • Run the trained classifier • Record the f-measure for “contains a bug” with respect to classifier judgment of “less readable” *FindBugs – a popular static bug finding tool
Readability Correlations (Experiment 2) • Correlates future code churn with readability • Uses readability to predict which functions will be modified between 2 successive releases of a program • A function is considered changed whenever its text is not exactly the same, including changes in whitespace
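The change criterion in this experiment amounts to an exact text comparison. A minimal sketch follows, assuming each release is available as a mapping from function name to source text; that representation, and ignoring functions that exist in only one release, are assumptions of this example.

```python
def changed_functions(old_release: dict, new_release: dict) -> set:
    """Names of functions whose text differs at all between two releases.

    Comparison is exact string equality, so whitespace-only edits count
    as changes. Functions missing from either release are ignored here.
    """
    return {
        name
        for name, old_text in old_release.items()
        if name in new_release and new_release[name] != old_text
    }


# Hypothetical releases: `area` is untouched, `scale` only gains spaces.
release_1 = {"area": "return w * h;", "scale": "return x*k;"}
release_2 = {"area": "return w * h;", "scale": "return x * k;"}
churned = changed_functions(release_1, release_2)
```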
Readability Correlations - Results • Average f-measure: • For Experiment 1 -> 0.61 and for Experiment 2 -> 0.63
Relating Metric to Software Life Cycle • Readability tends to change over a long period of time
Relating Metric to Software Life Cycle (contd …) • Correlate project readability against project maturity (as reported by developers) “Projects that reach maturity tend to be more readable”
Discussion • Identifier Length • No influence! • Long names can improve readability, but can also reduce it • Comments might be more appropriate • Authors' suggestions: Improved IDEs and code inspections • Code Comments • Only moderately correlated • Being used to “make up for” ugly code? • Characters/identifiers per line • Strongly correlated • Just as long sentences are more difficult to understand, so are long lines of code • Authors' suggestion: keep lines short, even if it means breaking them up over several lines
Related Work • Natural Language Metrics [R. F. Flesch 1948, R. Gunning 1952, J. P. Kincaid and E. A. Smith 1970, G. H. McLaughlin 1969] • Coding Standards [S. Ambler 1997, B. B. Bederson et al. 2002, H. Sutter and A. Alexandrescu 2004] • Style Checkers [T. Copeland 2005] • Defect Prediction [T. J. Cheatham et al. 1995, T. L. Graves et al. 2000, T. M. Khoshgoftaar et al. 1996, J. P. Kincaid and E. A. Smith 1970]
Future Work • Examine personal preferences • Create personal models • Models based on application domain • Broader features • e.g. number of statements in an if block • IDE integration • Explore minimum set of predictive features
Conclusion • Created a readability metric based on a specific set of human annotators • This metric: • agrees with the annotators as much as they agree with each other • has significant correlation with conventional metrics of software quality • Examining readability could improve language design and engineering practice