Context-Sensitive, Interprocedural Dataflow Analysis as CFL Reachability

Seth Hallem and Eric Watkins

Exhaustive Analysis Papers
• “Precise Interprocedural Dataflow Analysis via Graph Reachability”
  • Reps, Horwitz, Sagiv -- POPL 1995
  • applies CFL reachability to context-sensitive, interprocedural dataflow analysis
• “Program Analysis via Graph Reachability”
  • Reps -- ILPS 1997
  • describes two additional applications: interprocedural program slicing and shape analysis

The Reduction to CFL Reachability
• Question 1: What problems can we solve?
• Question 2: How do we set up the problem?
• Question 3: How do we solve the problem?
• Question 4: What is the complexity of this approach?
• Running example: possibly uninitialized variables

What problems can we solve?
• IFDS problems
  • A finite set of dataflow facts, $D$
  • A mapping from edges in the CFG to functions $f : 2^D \to 2^D$
  • Each $f$ is distributive with respect to the meet operator $\sqcap$:
    • $f(a \sqcap b) = f(a) \sqcap f(b)$
• Possibly uninitialized variables:
  • Each program variable corresponds to a dataflow fact. When that fact holds, the variable may be uninitialized.
  • Transfer functions: a variable is possibly uninitialized if it was just declared without an initializer, or if it is assigned an expression containing possibly uninitialized variables.
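
To make the distributivity requirement concrete, here is a minimal Python sketch that models transfer functions for possibly uninitialized variables as functions on frozensets and checks $f(a \sqcap b) = f(a) \sqcap f(b)$ by brute force. It assumes the meet is set union. The names `declare`, `assign`, `both_uninit`, `powerset`, and `is_distributive` are made up for illustration and do not come from the papers.

```python
from itertools import combinations

D = frozenset({"x", "y", "z"})  # dataflow facts: one per program variable

def powerset(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def declare(var):
    """Transfer function for `int var;`: var becomes possibly uninitialized."""
    return lambda facts: facts | {var}

def assign(target, used):
    """Transfer function for `target = expr`, where expr reads the variables in `used`.
    Afterwards target is possibly uninitialized iff some variable in `used` was."""
    used = frozenset(used)
    def f(facts):
        out = facts - {target}
        return out | {target} if facts & used else out
    return f

def both_uninit(facts):
    """A counterexample (hypothetical): z is flagged only if x AND y are both flagged."""
    return facts | {"z"} if {"x", "y"} <= facts else facts

def is_distributive(f, domain=D):
    """Brute-force check of f(a | b) == f(a) | f(b), assuming meet = union."""
    subsets = powerset(domain)
    return all(f(a | b) == f(a) | f(b) for a in subsets for b in subsets)
```

`declare` and `assign` pass the check because each output fact depends on at most one input fact. `both_uninit` fails it, since its output depends on two input facts at once, so it falls outside the IFDS framework.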

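Simple Example

    int z;
    int main (void) {
        int x, y = 0;   /* {x, z} */
        y = y + x;      /* {x, y, z} */
        z = 0;          /* {x, y} */
    }

• $D = \{x, y, z\}$; the domain and range of each transfer function is the power set of $D$ ($2^D$)
• Each comment lists the variables that may be uninitialized after the statement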
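
The annotations in the example can be reproduced by applying one transfer function per statement. The following is a rough Python sketch. The statement encoding (`("decl", ...)`, `("assign", ...)`) and the functions `transfer` and `run` are assumptions made for illustration. As on the slide, the global z is treated as possibly uninitialized on entry to main.

```python
def transfer(stmt, facts):
    """Apply one statement to the set of possibly uninitialized variables."""
    kind = stmt[0]
    if kind == "decl":                  # ("decl", var): declared without an initializer
        return facts | {stmt[1]}
    if kind == "assign":                # ("assign", target, variables_read)
        _, target, used = stmt
        out = facts - {target}
        return out | {target} if facts & set(used) else out
    raise ValueError(f"unknown statement kind: {kind}")

def run(stmts, entry_facts):
    """Return the fact set after each statement of a straight-line program."""
    facts, trace = frozenset(entry_facts), []
    for stmt in stmts:
        facts = transfer(stmt, facts)
        trace.append(facts)
    return trace

main_body = [
    ("decl", "x"),                  # int x
    ("assign", "y", []),            # y = 0
    ("assign", "y", ["y", "x"]),    # y = y + x
    ("assign", "z", []),            # z = 0
]
trace = run(main_body, {"z"})       # z assumed uninitialized at entry
```

Here `trace[1]`, `trace[2]`, and `trace[3]` correspond to the three commented program points in the example.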

How do we set up and solve IFDS problems?
• Inputs to the algorithm:
  • Exploded supergraph (next couple of slides)
• Outputs from the algorithm:
  • The meet-over-all-realizable-paths solution:
  • $MRP_n = \sqcap_{q \in RPaths(start_{main}, n)} pf_q(\top)$, where $pf_q$ is the composition of the transfer functions along path $q$

The Supergraph

Representation Relations
• Each dataflow function $f$ is converted to a representation relation, which is drawn as a graph with $2|D| + 2$ nodes
  • $|D|$ input nodes, one for each dataflow fact, plus the node $\Lambda$ (or 0), which corresponds to the empty set
  • $|D|$ output nodes, plus the node $\Lambda$
  • There is an edge from input node $d_1$ to output node $d_2$ if $d_2 \in f(\{d_1\})$ and $d_2 \notin f(\emptyset)$
  • There is an edge from $\Lambda$ to $d_2$ if $d_2 \in f(\emptyset)$, and always an edge from $\Lambda$ to $\Lambda$
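
The construction can be sketched in a few lines of Python. This is an illustration under assumptions, not code from the papers: `ZERO` stands for the distinguished node, and `representation_relation`, `interpret`, and the example function `y_gets_y_plus_x` are hypothetical names. `interpret` goes the other way, recovering $f(S)$ from the edges, which only works because $f$ is distributive.

```python
ZERO = "0"  # stands for the distinguished node (the one that represents the empty set)

def representation_relation(f, D):
    """Edges (d1, d2) of the representation relation of a distributive f : 2^D -> 2^D."""
    base = f(frozenset())                       # facts generated from nothing
    edges = {(ZERO, ZERO)}
    edges |= {(ZERO, d2) for d2 in base}
    edges |= {(d1, d2) for d1 in D
              for d2 in f(frozenset({d1})) if d2 not in base}
    return edges

def interpret(edges, facts):
    """Recover f(facts): follow edges out of ZERO and out of every fact in `facts`."""
    sources = set(facts) | {ZERO}
    return frozenset(d2 for d1, d2 in edges if d1 in sources and d2 != ZERO)

D = frozenset({"x", "y", "z"})

def y_gets_y_plus_x(facts):
    """Transfer function for `y = y + x` (possibly uninitialized variables)."""
    facts = frozenset(facts)
    out = facts - {"y"}
    return out | {"y"} if facts & {"x", "y"} else out

edges = representation_relation(y_gets_y_plus_x, D)
```

For this function the relation has five edges: the mandatory edge between the two distinguished nodes, identity edges for x, y, and z, and one extra edge from x to y.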

More Representation Relations
• (a) and (b) show representation relations for two functions (nodes $s_{main}$ and $n_1$)
• (c) and (d) show two ways to compose these relations
• (d) illustrates the need for the $\Lambda$ node in each relation
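
Since the figures are not reproduced here, the following Python sketch illustrates the same point with a different, made-up pair of functions: `gen_x` (for `int x;`) followed by `copy_xy` (for `y = x`). Composing the relations with the distinguished node present matches composing the functions. Dropping that node loses every fact that is generated unconditionally. All names are hypothetical.

```python
ZERO = "0"  # stands for the distinguished node (the one that represents the empty set)

def rep(f, D):
    """Representation relation of a distributive function f : 2^D -> 2^D."""
    base = f(frozenset())
    edges = {(ZERO, ZERO)} | {(ZERO, d) for d in base}
    return edges | {(a, b) for a in D for b in f(frozenset({a})) if b not in base}

def compose(r1, r2):
    """Relational composition: follow an edge of r1, then an edge of r2."""
    return {(a, c) for a, b in r1 for b2, c in r2 if b == b2}

def interpret(edges, facts):
    """Facts reachable from ZERO or from any fact in `facts`."""
    sources = set(facts) | {ZERO}
    return frozenset(b for a, b in edges if a in sources and b != ZERO)

def strip_zero(edges):
    """Drop every edge that touches the distinguished node."""
    return {(a, b) for a, b in edges if ZERO not in (a, b)}

D = frozenset({"x", "y", "z"})

def gen_x(s):        # int x;  x becomes possibly uninitialized
    return frozenset(s) | {"x"}

def copy_xy(s):      # y = x;  y inherits the status of x
    s = frozenset(s)
    return (s - {"y"}) | ({"y"} if "x" in s else frozenset())

with_zero = compose(rep(gen_x, D), rep(copy_xy, D))
without_zero = compose(strip_zero(rep(gen_x, D)), strip_zero(rep(copy_xy, D)))
```

Interpreting `with_zero` on the empty set yields both x and y, which agrees with `copy_xy(gen_x(...))`. Interpreting `without_zero` yields nothing, because the edge that generated x had nowhere to start.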

Exploding the Supergraph

CFL Reachability
• We want to solve the dataflow problem with a reachability query on the exploded supergraph, $G^\#$.
• Not all paths in $G^\#$ are valid, though: calls must be matched with returns.
• Insight: context-sensitivity = matching parentheses, and the language of matching parentheses is a CFL

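Context-Sensitivity = CFL
• Assign a unique index to each call site, and define a CFL of matching calls and returns.
• Suppose we have two call sites to function P(), which we label i and k
  • $(_i \; (_k \; )_k \; )_i$ is a valid path
  • $(_i \; (_k \; )_k$ is a valid path (a call may still be open when the path ends)
  • $(_i \; (_k \; )_i$ is not (the return $)_i$ does not match the pending call $(_k$)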
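
Checking a single path against this language needs only a stack. The Python sketch below assumes a path is given as a list of `(kind, site)` labels, where kind is `"call"`, `"ret"`, or anything else for an intraprocedural edge. This encoding and the name `is_valid_path` are illustrative assumptions. It also assumes the path starts in main, so a return with no pending call is rejected.

```python
def is_valid_path(labels):
    """True if every return matches the most recent unmatched call.
    Calls may remain open at the end of the path."""
    pending = []                          # stack of call-site indices
    for kind, site in labels:
        if kind == "call":
            pending.append(site)
        elif kind == "ret":
            if not pending or pending.pop() != site:
                return False              # mismatched or unmatched return
        # any other kind is an intraprocedural edge and carries no parenthesis
    return True

# The three paths from the slide, with call sites labeled "i" and "k".
matched   = [("call", "i"), ("call", "k"), ("ret", "k"), ("ret", "i")]
open_call = [("call", "i"), ("call", "k"), ("ret", "k")]
mismatch  = [("call", "i"), ("call", "k"), ("ret", "i")]
```

The dataflow algorithm never enumerates paths this way. It decides reachability over all valid paths at once, which is what makes the problem CFL reachability and not ordinary graph reachability.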

Reachability Algorithm
• Dynamic programming is the key
• Start at the entry point to the program. Follow the edges in $G^\#$, recording which dataflow facts we can reach.
• At a procedure call, follow the call. To avoid redoing work, maintain a cache of edges that summarize pieces of the computation.
  • Summary edges record the effect of an entire procedure: they start at a call site and end at the corresponding return site.
  • Path edges record the suffix of a valid path.
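
The worklist scheme described above can be sketched in Python as follows. This is a simplified illustration under assumptions, not the published pseudocode. The graph encoding (`succ`, `flow`, `calls`, `call_flow`, `ret_flow`, `start`, `exit_`) and every name in it are invented here. Call, start, and exit nodes are assumed to be distinct, and each flow entry is a set of `(d1, d2)` edges in which `ZERO` plays the role of the distinguished node.

```python
from collections import defaultdict

ZERO = "0"

def tabulate(succ, flow, calls, call_flow, ret_flow, start, exit_, main):
    """Return {node: facts that may hold there} using path edges and summary edges."""
    callers = defaultdict(list)
    for c, (callee, _ret) in calls.items():
        callers[callee].append(c)
    exit_of = {n: p for p, n in exit_.items()}
    path_edges = set()               # ((start_p, d1), (n, d2))
    incoming = defaultdict(set)      # (n, d2) -> sources of path edges ending there
    summary = defaultdict(set)       # (call, d1) -> facts d2 at the return site
    worklist = []

    def propagate(src, dst):
        if (src, dst) not in path_edges:
            path_edges.add((src, dst))
            incoming[dst].add(src)
            worklist.append((src, dst))

    propagate((start[main], ZERO), (start[main], ZERO))
    while worklist:
        src, (n, d2) = worklist.pop()
        if n in calls:                                    # call node
            callee, ret = calls[n]
            for a, d3 in call_flow[n]:
                if a == d2:                               # enter the callee
                    propagate((start[callee], d3), (start[callee], d3))
            for a, d3 in flow.get((n, ret), ()):
                if a == d2:                               # call-to-return edge
                    propagate(src, (ret, d3))
            for d3 in summary[(n, d2)]:                   # reuse cached results
                propagate(src, (ret, d3))
        elif n in exit_of:                                # procedure exit
            d1 = src[1]
            for c in callers[exit_of[n]]:
                ret = calls[c][1]
                for d4, a in call_flow[c]:
                    if a != d1:
                        continue
                    for b, d5 in ret_flow[c]:
                        if b == d2 and d5 not in summary[(c, d4)]:
                            summary[(c, d4)].add(d5)      # new summary edge
                            for caller_src in list(incoming[(c, d4)]):
                                propagate(caller_src, (ret, d5))
        else:                                             # ordinary node
            for m in succ.get(n, ()):
                for a, d3 in flow[(n, m)]:
                    if a == d2:
                        propagate(src, (m, d3))

    result = defaultdict(set)
    for _src, (n, d) in path_edges:
        if d != ZERO:
            result[n].add(d)
    return result

# Hypothetical program: main calls P twice; P leaves the global g untouched.
ident = {(ZERO, ZERO), ("g", "g")}
succ = {"s_main": ["c1"], "r1": ["n2"], "n2": ["c2"], "r2": ["e_main"], "s_P": ["e_P"]}
flow = {("s_main", "c1"): {(ZERO, ZERO), (ZERO, "g")},    # g starts uninitialized
        ("c1", "r1"): {(ZERO, ZERO)}, ("r1", "n2"): ident,
        ("n2", "c2"): {(ZERO, ZERO)},                      # g = 0 kills the fact
        ("c2", "r2"): {(ZERO, ZERO)}, ("r2", "e_main"): ident,
        ("s_P", "e_P"): ident}
calls = {"c1": ("P", "r1"), "c2": ("P", "r2")}
facts = tabulate(succ, flow, calls, {"c1": ident, "c2": ident},
                 {"c1": ident, "c2": ident},
                 {"main": "s_main", "P": "s_P"}, {"main": "e_main", "P": "e_P"}, "main")
```

In this example g may be uninitialized at the exit of P, but only along the path through the first call. The result at `r2` is therefore empty, whereas a context-insensitive analysis would let the fact leak back through the second return edge.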

Dynamic Programming Details

Complexity
• The worst case for general CFL reachability is cubic in the number of nodes in the graph
• We can do better for dataflow analysis: $O(E D^3)$ for any distributive problem, $O(Call \cdot D^3 + h E D^2)$ for h-sparse problems
  • $E$ is the number of supergraph edges, $D$ the number of dataflow facts, and $Call$ the number of call sites
• Possibly uninitialized variables is 2-sparse when aliasing is ignored: a variable’s status as initialized or uninitialized can only affect itself and one other variable (if it is assigned to that variable)

Other Applications
• Interprocedural slicing
  • Identify all pieces of a program relevant to a particular statement
• Shape analysis
  • For any DAG data structure, determines a superset of the possible shapes for that data structure.
  • Each dataflow fact corresponds to a single possible shape.
  • Problem: there is an infinite number of shapes. The solution is to define the shape at program point q in terms of the shapes at previous program points.
  • The Reps paper (“Program Analysis via Graph Reachability”) has an example of shape analysis of a linked list.

The other papers
• “Demand Interprocedural Dataflow Analysis”
  • Horwitz, Reps, Sagiv -- FSE 1995
• “Demand-driven Computation of Interprocedural Data Flow”
  • Duesterwald, Gupta, Soffa -- POPL 1995
• Both provide a framework for transforming any IFDS analysis into a demand-driven analysis

Steps to Demand-driven analysis
• Define the problem in the IFDS framework
• Reverse the flow functions, or reverse the flow edges
• Start with an initial query < d, n >
• Propagate the query backwards until solved

Reversing dataflow
• In Duesterwald et al., the dataflow problem is specified with flow functions
  • Reverse the functions
• For CFL problems, the problem is represented as a set of edges
  • Just reverse the edges
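
For the edge-based formulation, a query can be answered by walking the reversed edges from the queried node back toward the program entry. The Python sketch below is intraprocedural only: it ignores the call/return matching that a full demand algorithm needs. The names `reverse_edges` and `query`, the program points `s` and `p0` to `p3`, and the edge list (which encodes the simple example from earlier in the deck) are illustrative assumptions.

```python
from collections import defaultdict, deque

ZERO = "0"

def reverse_edges(edges):
    """Map each exploded node to the set of its predecessors."""
    rev = defaultdict(set)
    for src, dst in edges:
        rev[dst].add(src)
    return rev

def query(edges, entry, target):
    """Answer the query <d, n>, given as target = (n, d): may fact d hold at n?
    Searches backwards from target and stops as soon as entry is found."""
    rev = reverse_edges(edges)
    seen, work = {target}, deque([target])
    while work:
        node = work.popleft()
        if node == entry:
            return True                   # early termination
        for pred in rev[node]:
            if pred not in seen:
                seen.add(pred)
                work.append(pred)
    return False

def step(m, n, pairs):
    """Exploded edges for flow edge m -> n, given (d1, d2) fact pairs."""
    return [((m, d1), (n, d2)) for d1, d2 in pairs]

edges = (
    step("s", "p0", [(ZERO, ZERO), (ZERO, "z")])                    # z uninitialized at entry
    + step("p0", "p1", [(ZERO, ZERO), (ZERO, "x"), ("z", "z")])     # int x, y = 0;
    + step("p1", "p2", [(ZERO, ZERO), ("x", "x"), ("x", "y"),
                        ("y", "y"), ("z", "z")])                    # y = y + x;
    + step("p2", "p3", [(ZERO, ZERO), ("x", "x"), ("y", "y")])      # z = 0;
)
entry = ("s", ZERO)
```

A query such as `query(edges, entry, ("p3", "z"))` touches only the nodes that can reach the target, which is where the savings over an exhaustive analysis come from.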

Example: CCP (Copy Constant Propagation) Notation
• $x$: set of dataflow facts
• $x_w$: dataflow fact for variable $w$
• $f_n(x)_w$: transfer function for variable $w$ at node $n$
• $[w = c]$: set of dataflow facts in which the fact for variable $w$ equals $c$

Query Algorithm
• The worklist holds the set of outstanding queries
• While the worklist is not empty, remove a query
• Propagate it backwards one node in the flowgraph
• For a function call, create a backwards summary for that function and apply it

Query Propagation: More notation
• $r_p$: entry node for procedure $p$
• $m$, $n$: normal nodes
• $f_m$: reverse dataflow function for node $m$
• $N_{call}$: all nodes that are call sites
• $call(m)$: the procedure called at node $m$
• $f_{(r_p, e_p)}$: summary function for procedure $p$

Backwards edge propagation

Query Algorithm Efficiency
• Optimizations: function summaries, early termination, query result cache
• In the worst case, the cost is the same as for exhaustive analysis
• Some problems are better suited to demand-driven analysis than others.
  • It depends on how much information is needed to answer a query, and on how many queries need to be made.

Conclusions
• Demand-driven analysis is a powerful idea
• It saves time and space, but in the worst case it is no better than exhaustive analysis
• It only works for distributive problems
• The two approaches to demand-driven analysis are equivalent

Discussion
• Are these algorithms generally applicable?
• Are they fast?
  • No evidence in the papers, but the answer is yes (see ESP in a couple of weeks)
  • Why are they efficient (beyond the complexity guarantee)?
• Is it always cheap to compute the exploded supergraph?
  • How can an imprecise alias analysis influence this step and the overall performance of the algorithm?
