PDA 3.0: Prune Dependency Analyzer Jason Scherer CS 6901, Spring 2008 Columbia University Professor Al Aho, Advisor Marc Eaddy
Background • Marc Eaddy, my advisor and mentor for this independent study, built a Prune Dependency Analyzer as part of his PhD research in concern analysis with Professor Al Aho. • The PDA tool is a static analysis tool that reads the source code for a program and combines it with a list of “concern assignments” for that program. • After performing the analysis, the PDA tool should output an expanded, more complete, and more accurate list of concern assignments. • A concern assignment is a method, field, or type name taken from the code, combined with a list of identifiers indicating which concerns it is associated with (see the sketch after this list). • A “concern” is a program requirement, usually taken from a requirements document or formal specification. • The PDA tool expands the list of concern assignments (or “mappings”) by, in effect, postulating what would happen if a given concern were pruned (i.e., removed).
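As a hypothetical illustration, a concern mapping pairs each program element with the concern identifiers it implements. The element and concern names below are invented, and the real tool reads mappings from a file rather than building them in code:

```java
import java.util.*;

// Invented example of concern assignments: each program element
// (method, field, or type name) maps to the set of concern
// identifiers it is associated with.
public class ConcernMapExample {
    public static void main(String[] args) {
        Map<String, Set<String>> mapping = new HashMap<>();
        mapping.put("org.mozilla.javascript.Parser.statement()",
                    new HashSet<>(Arrays.asList("PARSING", "ERROR_REPORTING")));
        mapping.put("org.mozilla.javascript.NativeDate.js_getHours()",
                    Collections.singleton("DATE_AND_TIME"));
        System.out.println(mapping);
    }
}
```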
Goal • Marc asked me to come aboard and help build a better-architected, more extensible, industrial-strength version of the existing Prune Dependency Analyzer tool, using state-of-the-art compiler framework technology. • Our secondary goal was to improve upon the accuracy of the existing PDA tool's results. We measured accuracy by running the tool on the source code for a real program (in this case, the Rhino JavaScript engine) and comparing the tool's output with a complete concern mapping manually created by Marc.
Why rewrite the PDA? • Marc's idea was to rewrite the PDA using a compiler framework. • Today's compiler frameworks provide sophisticated analysis tools for working with compiled programs. • In a compiler framework, the program's intermediate representation (IR) is held in memory as a rich data structure containing an object for each element in the program. This allows for greater flexibility than the existing tool offered. • Compiler frameworks provide useful APIs for performing tasks such as data flow analysis, call graph construction, etc.
The Soot compiler framework • Soot, from McGill University, is written in Java and operates on Java programs. • Soot compiles Java programs to an intermediate representation called Jimple. Within Jimple, methods are represented by the SootMethod class, fields by the SootField class, etc. • Soot provides many features, not all of which we used for this project. We did, however, rely heavily on the call graph construction package (see the sketch below), as well as Jimple itself. • As part of its call graph suite, Soot includes a points-to analysis package. In this context, the main use of points-to analysis is to devirtualize virtual method calls when building the call graph.
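A minimal sketch of driving Soot's whole-program call graph construction, assuming the class files to analyze live under build/classes and that the entry class is named Main; the directory and class names are placeholders, not PDA 3.0's actual configuration.

```java
import java.util.Collections;
import java.util.Iterator;
import soot.PackManager;
import soot.Scene;
import soot.SootMethod;
import soot.jimple.toolkits.callgraph.CallGraph;
import soot.jimple.toolkits.callgraph.Edge;
import soot.options.Options;

public class CallGraphDemo {
    public static void main(String[] args) {
        Options.v().set_whole_program(true);   // call graphs require whole-program mode
        Options.v().set_process_dir(Collections.singletonList("build/classes"));
        Options.v().set_main_class("Main");    // placeholder entry point
        Scene.v().loadNecessaryClasses();
        PackManager.v().runPacks();            // runs the cg pack, building the call graph

        // Print the out-edges of the entry method.
        CallGraph cg = Scene.v().getCallGraph();
        SootMethod entry = Scene.v().getMainMethod();
        for (Iterator<Edge> it = cg.edgesOutOf(entry); it.hasNext(); ) {
            System.out.println(entry + " -> " + it.next().tgt());
        }
    }
}
```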
Constant elimination • One problem with Soot that we recognized early on was that the compiler converts references to final fields into constants in the code – but in order to correctly perform concern dependency analysis, we had to know about the field, not its constant value (illustrated below). • Troubleshooting this issue involved extensive analysis of the Soot source code. • We traced this constant substitution back to the Polyglot front end used by Soot.
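A small illustration of the problem; the classes are invented for this example:

```java
// Limits.MAX_ARGS is a compile-time constant, so the compiler folds
// its value into call sites rather than emitting a field reference.
class Limits {
    static final int MAX_ARGS = 10;
}

class Caller {
    boolean tooMany(int n) {
        // Compiles as if written "return n > 10;" – the dependency on
        // Limits.MAX_ARGS is erased, which is exactly the information
        // the PDA must not lose.
        return n > Limits.MAX_ARGS;
    }
}
```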
Polyglot • When compiling directly from Java source code (as opposed to reading compiled class files), Soot uses the Polyglot framework (from Cornell) to parse the source. • Polyglot is a front-end compiler framework that provides a rich AST model. • Soot uses the AST output by Polyglot to construct its Jimple representation of the compiled program. • Polyglot is an interesting tool in its own right, and could have future applications to this project, such as creating Java language extensions related to concern analysis.
Mini-Prototyping • All work done with the Soot API was first tested in its own separate mini-prototype executable before being merged into the main codebase. • Writing all code twice was especially helpful because the Soot documentation was sparse and much of our code was adapted from example code shipped with the distribution; mini-prototyping helped establish what the code actually does.
Build and Test Environment • Having a robust build and testing environment is arguably the most important component in creating a maintainable and extensible codebase. • We used classic Makefiles and Makefile includes. • Every class is required to contain a main method, which should perform unit testing on that class and print the results to stdout (see the sketch below). • Unit tests are part of the build toolchain – if a unit test fails, the build stops. • Currently most of the unit tests are stubbed or commented out; they should be filled in when large development changes settle into small maintenance changes.
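A minimal sketch of the per-class self-test convention; the class and its test are invented, not taken from the PDA codebase. A Makefile rule such as `java IntervalUtil || exit 1` can then halt the build when the test fails.

```java
public class IntervalUtil {
    static boolean overlaps(int a1, int a2, int b1, int b2) {
        return a1 <= b2 && b1 <= a2;
    }

    // Project convention: every class's main method unit-tests the
    // class, prints the result to stdout, and exits nonzero on
    // failure so the Make-driven build stops.
    public static void main(String[] args) {
        boolean ok = overlaps(0, 5, 3, 8) && !overlaps(0, 2, 3, 4);
        System.out.println("IntervalUtil: unit test " + (ok ? "passed" : "FAILED"));
        if (!ok) System.exit(1);
    }
}
```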
AST Transformation • Polyglot uses the visitor pattern to perform transformations on the AST. We created two visitors to modify the AST (a sketch follows this list). • The FinalFieldVisitor makes final fields non-final and makes constants derived from them into non-constants. • The SwitchVisitor converts switch statements into if-then chains. We had to do this because references to final fields in “case” labels are compiled down to lookupswitch/tableswitch instructions, not field references or even constants – so the FinalFieldVisitor did not help there.
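A hedged sketch in the spirit of the FinalFieldVisitor: it clears the final modifier on field declarations so the front end cannot fold the fields into constants. The actual PDA visitor also has to handle the constants derived from those fields, which this sketch omits.

```java
import polyglot.ast.FieldDecl;
import polyglot.ast.Node;
import polyglot.visit.NodeVisitor;

public class StripFinalVisitor extends NodeVisitor {
    // leave() is called on the way back up the tree; returning a new
    // node replaces the old one in the rebuilt AST.
    public Node leave(Node old, Node n, NodeVisitor v) {
        if (n instanceof FieldDecl) {
            FieldDecl fd = (FieldDecl) n;
            if (fd.flags().isFinal()) {
                // Rebuild the declaration without the final modifier.
                return fd.flags(fd.flags().clearFinal());
            }
        }
        return n;
    }
}
```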
SootCallGraphBuilder • Most of the ugly details of building a call graph with Soot are confined to this class. • We had to create special subclasses of Soot classes in order to “intercept” the Polyglot AST before it was passed to the Soot parser. The interception happens here. Once the AST is intercepted, it is transformed using the visitors described above. • Points-to analysis may also be done here, depending on whether the corresponding option was passed to PDA 3.0 on the command line (see the sketch below).
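A sketch of toggling Soot's Spark points-to engine from a command-line option; the usePointsTo flag is hypothetical, but the phase options are standard Soot ones.

```java
import soot.options.Options;

public class CallGraphOptions {
    // Configure which call graph algorithm Soot's cg pack will run.
    static void configure(boolean usePointsTo) {
        if (usePointsTo) {
            // Spark: points-to analysis, devirtualizing call sites.
            Options.v().setPhaseOption("cg.spark", "enabled:true");
            Options.v().setPhaseOption("cg.spark", "on-fly-cg:true");
        } else {
            // Fall back to plain Class Hierarchy Analysis.
            Options.v().setPhaseOption("cg.cha", "enabled:true");
        }
    }
}
```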
PDGraphBuilder • The source file for this class is the biggest in the project and should probably be refactored into its own package. • PDGraphBuilder simultaneously walks the call graph and reads the Jimple IR in order to create PDGraphNodes. At any given time there is a pointer to a call graph node and a pointer to a Jimple method. • As it reads through the statements in a Jimple method body, every call site is cross-referenced with the out-edges of the corresponding call graph node and matched with an outgoing edge (see the sketch below). • Only edges to relevant methods are traversed (e.g. Rhino methods, but not Java API methods). • Newly created PD graph nodes are added to a PDGraph object and returned.
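A hedged sketch of the Soot side of this traversal: walking a method's Jimple body and using the call graph's out-edges to resolve each call site to its (possibly devirtualized) targets. Creating PDGraph nodes and filtering to relevant methods are the project's own logic and are only indicated in comments.

```java
import java.util.Iterator;
import soot.Body;
import soot.Scene;
import soot.SootMethod;
import soot.Unit;
import soot.jimple.Stmt;
import soot.jimple.toolkits.callgraph.CallGraph;
import soot.jimple.toolkits.callgraph.Edge;

public class JimpleWalk {
    static void visitCallSites(SootMethod m) {
        Body body = m.retrieveActiveBody();        // the method's Jimple body
        CallGraph cg = Scene.v().getCallGraph();
        for (Unit u : body.getUnits()) {
            Stmt s = (Stmt) u;                     // in Jimple, every Unit is a Stmt
            if (!s.containsInvokeExpr()) continue; // only look at call sites
            // Match this call site against the call graph's out-edges.
            for (Iterator<Edge> it = cg.edgesOutOf(u); it.hasNext(); ) {
                SootMethod target = it.next().tgt();
                // PDA 3.0 would create a PDGraph node/edge here, but
                // only for relevant methods (e.g. Rhino, not java.*).
                System.out.println(m + " calls " + target);
            }
        }
    }
}
```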
PDGraph • A PDGraph object contains references to the PDGraphEdge and PDGraphNode objects that make up the graph. • There are nodes for methods, fields, and types. • There are edges representing calls, field-gets, field-sets, type references, ownership by a type, subclassing, and implementing an interface (sketched below).
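As a hypothetical illustration, the edge kinds listed above could be modeled with an enum like this; the actual PDGraphEdge representation may differ:

```java
// One constant per relationship the PD graph tracks between
// methods, fields, and types.
enum EdgeKind {
    CALL, FIELD_GET, FIELD_SET, TYPE_REF,
    OWNED_BY_TYPE, SUBCLASSES, IMPLEMENTS
}
```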
Predicates • One of the improvements of PDA 3.0 over earlier versions is that we model concern assignments as Boolean predicates rather than as simple lists. For example, a program element could be associated with concern A, or with both B and C (written A|BC). • There are several classes that model predicates (Conjunction, Disjunction, etc.), all subclasses of Predicate (a sketch follows this list). • Program elements (and, by extension, PDGraphNodes) have two separate predicates associated with them: the remove predicate and the alter predicate. • The remove predicate is the concern-pruning condition under which that program element is removed. • The alter predicate is the concern-pruning condition under which that program element needs to be altered.
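A hedged sketch of the predicate model: Conjunction and Disjunction are the project's class names, but the Atom class and the evaluate method are assumptions invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

abstract class Predicate {
    // True if this pruning condition holds for the given set of
    // pruned concern identifiers. (Hypothetical interface.)
    abstract boolean evaluate(Set<String> pruned);
}

class Atom extends Predicate {            // a single concern, e.g. "A"
    final String concern;
    Atom(String concern) { this.concern = concern; }
    boolean evaluate(Set<String> pruned) { return pruned.contains(concern); }
}

class Conjunction extends Predicate {     // e.g. BC
    final List<Predicate> parts;
    Conjunction(Predicate... ps) { parts = Arrays.asList(ps); }
    boolean evaluate(Set<String> pruned) {
        return parts.stream().allMatch(p -> p.evaluate(pruned));
    }
}

class Disjunction extends Predicate {     // e.g. A|BC
    final List<Predicate> parts;
    Disjunction(Predicate... ps) { parts = Arrays.asList(ps); }
    boolean evaluate(Set<String> pruned) {
        return parts.stream().anyMatch(p -> p.evaluate(pruned));
    }
}
```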
Analysis Phase Architecture • To make the tool extensible, we created an architecture in which the tool has a library of “phases” representing the different analyses that can be performed on the input. • Command line options will control which phases are run*. • A phase can operate on nodes of the graph, edges of the graph, or both. *not yet implemented
Analysis Algorithm • The analysis algorithm is similar to the standard algorithm for data flow analysis as detailed in the Dragon book, but it is not identical. • The tool loops over all nodes in the PDGraph until no more work is done (i.e. no predicate modifications are made) by the current AnalysisPhase object. • Subclasses of AnalysisPhase must implement a predicate transformation that is guaranteed to ultimately reach a fixpoint. • An outer loop iterates over all AnalysisPhase objects until no more work is done by any phase. This makes the order of the analysis phases irrelevant (see the sketch below).
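A hedged sketch of the two-level fixpoint loop described above; AnalysisPhase is the project's class, but the runOnce method and the PDGraph stub are assumptions made for illustration.

```java
import java.util.List;

class PDGraph { /* nodes and edges elided */ }

abstract class AnalysisPhase {
    // Visit the graph once; return true if any predicate was modified.
    abstract boolean runOnce(PDGraph graph);
}

class Analyzer {
    static void analyze(PDGraph graph, List<AnalysisPhase> phases) {
        boolean anyChanged;
        do {                                    // outer loop over all phases
            anyChanged = false;
            for (AnalysisPhase phase : phases) {
                while (phase.runOnce(graph)) {  // inner loop: run this phase to a fixpoint
                    anyChanged = true;
                }
            }
        } while (anyChanged);                   // phase ordering becomes irrelevant
    }
}
```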
Remove Conversion Phase • This subclass of AnalysisPhase propagates remove predicates at a given node up to the alter predicates of any nodes that reference it. • For example: • method foo calls bar • method foo calls baz • bar has remove predicate A • baz has remove predicate B • RESULT: foo gets alter predicate A|B • This can be understood intuitively by realizing that if either bar or baz is removed, then foo must be altered (because it contains calls to those removed methods)
Dominator Heuristic Phase • This subclass of AnalysisPhase captures the intuitive notion that if removing all methods associated with concern A results in dead code, then any dead code left by that removal is also associated with concern A. • The current implementation merely looks at the predecessors of any given node, and if they are all associated with concern A, it propagates concern A down to the remove predicate of the current node (sketched below). • Another way to implement this is to construct a dominator tree from any node with a given remove predicate, and propagate predicates from any node to all its children. However, this only captures the notion of what happens when one method is removed, not all methods associated with a given concern. • This phase is currently under development.
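A hedged sketch of the current predecessor check, simplified to treat remove predicates as plain sets of concern identifiers; the Node type and its fields are stand-ins for PDGraphNode, not the project's actual API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DominatorHeuristicSketch {
    static class Node {
        List<Node> preds;                       // nodes that reference this one
        Set<String> removeConcerns = new HashSet<>();
    }

    // Returns true if the node's remove set changed (i.e. work was done).
    static boolean propagate(Node n) {
        if (n.preds == null || n.preds.isEmpty()) return false;
        // Intersect the remove concerns of all predecessors...
        Set<String> common = new HashSet<>(n.preds.get(0).removeConcerns);
        for (Node p : n.preds) common.retainAll(p.removeConcerns);
        // ...and propagate anything shared by all of them to this node.
        return n.removeConcerns.addAll(common);
    }
}
```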
Objective 1: To create a more robust codebase for the Concern Analysis Tool
Did we meet it? • This is open to interpretation, but there are several arguments that we did: • We created a flexible and extensible architecture • We used a state-of-the-art compiler framework • We created a comprehensive build and test environment • We used a language that makes maintenance of large-scale applications possible (Java)
Objective 2: To improve on the results of the original Concern Analysis Tool
Test Results: f-measure • [Chart: f-measure across successive builds – the Perl implementation (using only the “refs” clause), the baseline implementation, three rounds of bug fixes, and Dominator Heuristic version 1.]
Meaning of the Test Results • Recall is the percentage of the real concern mappings that the analysis finds (e.g. “the tool detected 30% of the real mappings but missed the rest”). This number includes whatever mappings we already knew about. • Precision is the percentage of the concern mappings we output that are actually real (e.g. “of the mappings we output, 50% were real and the rest were bogus”). • The f-measure is the harmonic mean of recall and precision, i.e. 2 * (recall * precision) / (recall + precision) (worked example below). • We use the harmonic mean to average the two scores because recall and precision are ratios (values between 0 and 1).
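A worked example using the illustrative numbers above (30% recall, 50% precision; the values are hypothetical, not our measured results):

```java
public class FMeasureExample {
    public static void main(String[] args) {
        double recall = 0.30, precision = 0.50;
        // Harmonic mean of recall and precision.
        double f = 2 * (recall * precision) / (recall + precision);
        System.out.println(f);   // prints 0.375
    }
}
```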
Upshot of the Test Results • We achieved near parity with the previous implementation, but did not exceed the previous results in terms of f-measure. • Because we achieved near parity using the same algorithm as the previous implementation, this is a good indication that our implementation has few bugs as-is (provided we can say that the previous implementation had few bugs). • Adding the dominator heuristic (which was not part of the previous implementation) did not have a significant impact on the results. • This is not ideal, but it at least gives us confidence that we can continue to add analysis phases to PDA 3.0 and potentially see an increase in the accuracy of the results.
The Takeaway • Robust compiler frameworks are useful, but they require a significant time investment up front before they start to “pay off”. • No framework can be all things to all people. If you are doing a specialized task, you may still have to work around the framework’s limitations.
References • A. Aho, M. Lam, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools, Addison Wesley, 2007, pp. 656-659. • Soot: http://www.sable.mcgill.ca/soot/ • Polyglot: http://www.cs.cornell.edu/projects/polyglot/ • M. Eaddy, A. Aho, G. Antoniol, and Y. Guéhéneuc, “Cerberus: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis,” International Conference on Program Comprehension (ICPC) (to appear), Amsterdam, The Netherlands, June 10-13, 2008.