A Comparative Evaluation of Static Analysis Actionable Alert Identification Techniques
Sarah Heckman and Laurie Williams
Department of Computer Science
North Carolina State University
Motivation
• Automated static analysis can find a large number of alerts
  • Empirically observed alert density of 40 alerts/KLOC [HW08]
• Alert inspection required to determine if developer should (and could) fix
  • Developer may only fix 9% [HW08] to 65% [KAY04] of alerts
• Suppose 1000 alerts at a 5-minute inspection per alert: 10.4 work days to inspect all alerts
  • Potential savings of 3.6-9.5 days by only inspecting alerts the developer will fix
• Fixing 3-4 alerts that could lead to field failures justifies the cost of static analysis [WDA08]
PROMISE 2013 (c) Sarah Heckman
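The inspection-cost arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the 8-hour work day is an assumption (the slide does not state it), and the function name `inspection_days` is hypothetical. The savings range assumes a perfect filter that lets the developer skip every alert they would not fix.

```python
def inspection_days(alerts, minutes_per_alert=5.0, hours_per_day=8.0):
    """Work days needed to inspect every alert (the 8-hour day is an assumption)."""
    return alerts * minutes_per_alert / 60.0 / hours_per_day

total_days = inspection_days(1000)

# If 65% of alerts get fixed, 35% of the inspection time could be skipped;
# if only 9% get fixed, 91% of it could be skipped.
savings_low = total_days * (1 - 0.65)
savings_high = total_days * (1 - 0.09)

print(f"{total_days:.1f} days total, "
      f"{savings_low:.1f} to {savings_high:.1f} days saved")
```

Under these assumptions the numbers round to the 10.4 days and 3.6-9.5 days quoted above.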
Coding Problem?
• Actionable: alerts the developer wants to fix
  • Faults in the code
  • Conformance to coding standards
  • Developer action: fix the alert in the source code
• Unactionable: alerts the developer does not want to fix
  • Static analysis false positive
  • Developer knowledge that alert is not a problem
  • Inconsequential coding problems (style)
  • Fixing the alert may not be worth effort
  • Developer action: suppress the alert
Actionable Alert Identification Techniques (AAIT)
• Supplement automated static analysis
  • Classification: predict actionability
  • Prioritization: order by predicted actionability
• AAIT utilize additional information about the alert, code, and other artifacts
  • Artifact Characteristics (ACs)
• Can we determine a “best” AAIT?
Research Objective
• To inform the selection of an actionable alert identification technique for ranking the output of automated static analysis through a comparative evaluation of six actionable alert identification techniques.
Related Work
• Comparative evaluation of AAIT [AAH12]
  • Languages: Java and Smalltalk
  • ASA: PMD, FindBugs, SmallLint
  • Benchmark: FAULTBENCH
  • Evaluation Metrics
    • Effort: “average number of alerts one must inspect to find an actionable one”
    • Fault Detection Rate Curve: number of faults detected against number of alerts inspected
  • Selected AAIT: APM, FeedbackRank, LRM, ZRanking, ATL-D, EFindBugs
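As a rough illustration of the two evaluation metrics named above, the sketch below takes a ranked list of alerts (True = actionable) and computes both. The function names are hypothetical, and the effort calculation is one plausible reading of the quoted definition (alerts inspected up to the last actionable one, divided by the number of actionable alerts found), not necessarily the exact formula used in [AAH12].

```python
from itertools import accumulate

def detection_curve(ranked):
    """Cumulative number of actionable alerts found after each inspection."""
    return list(accumulate(int(is_actionable) for is_actionable in ranked))

def effort(ranked):
    """Average number of alerts inspected per actionable alert found.

    Assumed reading: inspect in rank order until the last actionable alert,
    then divide the number inspected by the number of actionable alerts.
    """
    found = sum(1 for is_actionable in ranked if is_actionable)
    if found == 0:
        return float("inf")  # nothing actionable to find
    last_position = max(i for i, a in enumerate(ranked, start=1) if a)
    return last_position / found

ranking = [True, False, True, False, False]  # hypothetical AAIT output
curve = detection_curve(ranking)
avg_effort = effort(ranking)
```

A ranking that places actionable alerts earlier yields a curve that rises sooner and an effort value closer to 1.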
Comparative Evaluation
• Considered AAIT in literature [HW11][SFZ11]
• Selection Criteria
  • AAIT classify or prioritize alerts generated by automated static analysis for the Java programming language
  • An implementation of the AAIT is described, allowing for replication
  • The AAIT is fully automated and does not require manual intervention or inspection of alerts as part of the process
Selected AAIT (1)
• Actionable Prioritization Models (APM) [HW08]
  • ACs: code location, alert type
• Alert Type Lifetime (ATL) [KE07a]
  • AC: alert type lifetime
  • ATL-D: measures the lifetime in days
  • ATL-R: measures the lifetime in revisions
• Check ‘n’ Crash (CnC) [CSX08]
  • AC: test failures
  • Generates tests that try to cause RuntimeExceptions
Selected AAIT (2)
• History-Based Warning Prioritization (HWP) [KE07b]
  • ACs: commit messages that identify fault/non-fault fixes
• Logistic Regression Models (LRM) [RPM08]
  • ACs: 33, including two proprietary/internal ACs
• Systematic Actionable Alert Identification (SAAI) [HW09]
  • ACs: 42
  • Machine learning
FAULTBENCH v0.3
• 3 Subject Programs: jdom, runtime, logging
• Procedure
  • Gather Alert and Artifact Characteristic Data Sources
  • Artifact Characteristic and Alert Oracle Generation
  • Training and Test Sets
  • Model Building
  • Model Evaluation
Gather Data
• Download from repo
• Compile
• ASA: FindBugs & Check ‘n’ Crash (ESC/Java)
• Source Metrics: JavaNCSS
• Repository History: CVS & SVN
• Difficulties
  • Libraries changed over time
  • Not every revision would build (especially early ones)
Artifact Characteristics
Independent Variables
• Alert Identifier and History
  • Alert information (type, location)
  • Number of alert modifications
• Source Code Metrics
  • Size and complexity metrics
• Source Code History
  • Developers
  • File creation, deletion, and modification revisions
• Source Code Churn
  • Added and deleted lines of code
• Aggregate Characteristics
  • Alert lifetime, alert counts, staleness
Dependent Variable
• Alert Classification
[Figure labels: Alert Info, Surrounding Code, Alert, Actionable Alert, Unactionable Alert]
Alert Oracle Generation
• Iterate through all revisions, starting with the earliest, and compare alerts between revisions
  • Closed: actionable
  • Filtered: unactionable
  • Deleted
  • Open
    • Inspection
    • All unactionable
[Figure labels: Filtered, Open, Closed, Deleted]
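The revision-by-revision comparison described on this slide might be sketched as follows. This is a simplified illustration, not the FAULTBENCH implementation: the snapshot layout, the `build_oracle` name, and the rule that a vanished alert counts as "deleted" when its file no longer exists (and "closed" otherwise) are all assumptions made for the example.

```python
def build_oracle(snapshots, filtered_ids=frozenset()):
    """Assign each alert a final status by walking revisions in order.

    snapshots: list of (alerts, files) tuples, earliest revision first, where
    alerts maps a hypothetical alert id to its file path and files is the set
    of paths present in that revision.
    """
    status = {}
    prev_alerts = {}
    for alerts, files in snapshots:
        # Every alert reported in this revision is open unless it is filtered.
        for alert_id in alerts:
            status[alert_id] = "filtered" if alert_id in filtered_ids else "open"
        # Alerts seen in the previous revision but missing now have ended.
        for alert_id, path in prev_alerts.items():
            if alert_id not in alerts and status[alert_id] == "open":
                status[alert_id] = "closed" if path in files else "deleted"
        prev_alerts = alerts
    return status

# Mapping from status to oracle label; deleted alerts get no label.
LABELS = {"closed": "actionable", "filtered": "unactionable", "open": "unactionable"}

snapshots = [
    ({"a1": "A.java", "a2": "B.java", "a3": "C.java", "a4": "A.java"},
     {"A.java", "B.java", "C.java"}),
    ({"a3": "C.java", "a4": "A.java"}, {"A.java", "C.java"}),
]
oracle = build_oracle(snapshots, filtered_ids={"a4"})
```

Here `a1` disappears while its file remains (closed), `a2` disappears with its file (deleted), `a3` stays open, and `a4` is suppressed by a filter.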
Training and Test Sets
• Simulate how AAIT would be used in practice
• Training set: first X% of revisions to train the models
  • 70%, 80%, and 90%
• Test set: use remaining (100-X)% of revisions to test the models
• Overlapping alerts
  • Alerts open at the cutoff revision
• Deleted alerts
  • If an alert is deleted, the alert is not considered, unless the alert is not yet deleted within the training set. In that case, the alert is used in model building.
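One way to picture the revision-based split is the sketch below. It is an interpretation under stated assumptions: the `Alert` record, its field names, and the `split_by_revision` function are hypothetical, revisions are numbered from 0, and an alert deleted after the cutoff is treated as usable for training but excluded from testing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Alert:
    alert_id: str
    opened: int           # revision index where the alert first appears
    ended: Optional[int]  # revision index where it was closed/deleted, None if open
    status: str           # "closed", "filtered", "deleted", or "open"

def split_by_revision(alerts: List[Alert], n_revisions: int,
                      train_fraction: float) -> Tuple[List[Alert], List[Alert]]:
    """Split alerts at a cutoff revision; revisions [0, cutoff) form the training set."""
    cutoff = int(n_revisions * train_fraction)
    train, test = [], []
    for alert in alerts:
        ended_before_cutoff = alert.ended is not None and alert.ended < cutoff
        deleted_in_training = alert.status == "deleted" and ended_before_cutoff
        # Training: alerts seen before the cutoff, minus those deleted before it.
        if alert.opened < cutoff and not deleted_in_training:
            train.append(alert)
        # Test: alerts still alive at or after the cutoff, minus deleted ones.
        if not ended_before_cutoff and alert.status != "deleted":
            test.append(alert)
    return train, test

alerts = [
    Alert("a", 1, 3, "closed"),    # resolved inside the training window
    Alert("b", 2, None, "open"),   # open at the cutoff: overlaps both sets
    Alert("c", 8, 9, "closed"),    # appears only after the cutoff
    Alert("d", 1, 4, "deleted"),   # deleted inside the training window
    Alert("e", 2, 8, "deleted"),   # deleted only after the cutoff
]
train, test = split_by_revision(alerts, n_revisions=10, train_fraction=0.7)
train_ids = [a.alert_id for a in train]
test_ids = [a.alert_id for a in test]
```

In this toy data, alert `b` lands in both sets (the overlapping case), while `e` is kept for model building because its deletion happens after the cutoff.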
Model Building & Model Evaluation
• All AAIT are built using the training data and evaluated by predicting the actionability of the test data
• Classification Statistics:
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
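The three classification statistics can be computed directly from predicted and actual labels. The sketch below treats "actionable" as the positive class; the function name `classification_stats` and the choice to return 0.0 when a denominator is zero are assumptions made for the example.

```python
def classification_stats(predicted, actual):
    """Return (precision, recall, accuracy); True means 'actionable'."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # predicted and truly actionable
    fp = sum(1 for p, a in pairs if p and not a)      # predicted actionable, is not
    fn = sum(1 for p, a in pairs if not p and a)      # missed actionable alert
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly left alone

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy

# Hypothetical predictions for five test-set alerts.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
precision, recall, accuracy = classification_stats(predicted, actual)
```

With these inputs TP = 2, FP = 1, FN = 1, and TN = 1, so precision and recall are both 2/3 and accuracy is 3/5.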
Results - jdom
Results - runtime
Results - logging
Threats to Validity
• Internal Validity
  • Automation of data generation, collection, and artifact characteristic generation
  • Alert oracle: uninspected alerts are considered unactionable
  • Alert closure is not an explicit action by the developer
  • Alert continuity not perfect
    • Close and open a new alert if both the line number and source hash of the alert change
  • Number of revisions
• External Validity
  • Generalizability of results
  • Limitations of the AAIT in comparative evaluation
• Construct Validity
  • Calculations for artifact characteristics
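The alert continuity rule mentioned under internal validity can be illustrated with a small matching function. This is a sketch under assumptions: the `AlertSighting` record and its fields are hypothetical, and requiring the alert type and file path to match is an added assumption that the slide does not state.

```python
from typing import NamedTuple

class AlertSighting(NamedTuple):
    alert_type: str
    path: str
    line: int
    source_hash: str  # hash of the source text the alert points at

def is_same_alert(prev: AlertSighting, curr: AlertSighting) -> bool:
    """Treat two sightings as one alert unless both line and hash changed."""
    if (prev.alert_type, prev.path) != (curr.alert_type, curr.path):
        return False  # assumed: a different type or file is a different alert
    return prev.line == curr.line or prev.source_hash == curr.source_hash

before = AlertSighting("NP_NULL_ON_SOME_PATH", "Foo.java", 42, "abc123")
moved = before._replace(line=57)                             # code shifted, same text
edited = before._replace(source_hash="def456")               # same line, text edited
replaced = before._replace(line=57, source_hash="def456")    # both changed
```

Only the last case is treated as closing the old alert and opening a new one, which is why continuity is imperfect: a line that moves and is edited in the same revision looks like a fix followed by a fresh alert.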
Future Work
• Incorporate additional projects into FAULTBENCH
  • Emphasis on adding projects that actively use ASA and include filter files
• Allow for evaluation of AAIT with different goals
• Identification of most predictive artifact characteristics
• Evaluate different windows for generating test data
  • A full project history may not be as predictive as the most recent history
Conclusions
• SAAI found to be the best overall model when considering accuracy
  • Highest accuracy, or tie, for 6 of 9 treatments
• ATL-D, ATL-R, and LRM were also predictive when considering accuracy
• CnC also performed well, but only considered alerts from one ASA
• LRM and HWP had the highest recall
References
[AAH12] S. Allier, N. Anquetil, A. Hora, S. Ducasse, "A Framework to Compare Alert Ranking Algorithms," 2012 19th Working Conference on Reverse Engineering, Kingston, Ontario, Canada, October 15-18, 2012, pp. 277-285.
[CSX08] C. Csallner, Y. Smaragdakis, and T. Xie, "DSD-Crasher: A Hybrid Analysis Tool for Bug Finding," ACM Transactions on Software Engineering and Methodology, vol. 17, no. 2, pp. 1-36, April 2008.
[HW08] S. Heckman and L. Williams, "On Establishing a Benchmark for Evaluating Static Analysis Alert Prioritization and Classification Techniques," Proceedings of the 2nd International Symposium on Empirical Software Engineering and Measurement, Kaiserslautern, Germany, October 9-10, 2008, pp. 41-50.
[HW09] S. Heckman and L. Williams, "A Model Building Process for Identifying Actionable Static Analysis Alerts," Proceedings of the 2nd IEEE International Conference on Software Testing, Verification and Validation, Denver, CO, USA, 2009, pp. 161-170.
[HW11] S. Heckman and L. Williams, "A Systematic Literature Review of Actionable Alert Identification Techniques for Automated Static Code Analysis," Information and Software Technology, vol. 53, no. 4, April 2011, pp. 363-387.
[KE07a] S. Kim and M. D. Ernst, "Prioritizing Warning Categories by Analyzing Software History," Proceedings of the International Workshop on Mining Software Repositories, Minneapolis, MN, USA, May 19-20, 2007, p. 27.
[KE07b] S. Kim and M. D. Ernst, "Which Warnings Should I Fix First?," Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia, September 3-7, 2007, pp. 45-54.
[KAY04] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, "Correlation Exploitation in Error Ranking," Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Newport Beach, CA, USA, 2004, pp. 83-93.
[RPM08] J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, G. Rothermel, "Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach," Proceedings of the 30th International Conference on Software Engineering, Leipzig, Germany, May 10-18, 2008, pp. 341-350.
[SFZ11] H. Shen, J. Fang, J. Zhao, "EFindBugs: Effective Error Ranking for FindBugs," 2011 IEEE 4th International Conference on Software Testing, Verification and Validation, Berlin, Germany, March 21-25, 2011, pp. 299-308.
[WDA08] S. Wagner, F. Deissenboeck, M. Aichner, J. Wimmer, M. Schwalb, "An Evaluation of Two Bug Pattern Tools for Java," Proceedings of the 1st International Conference on Software Testing, Verification, and Validation, …