Relevance Heuristics for Program Analysis • Ken McMillan • Cadence Research Labs
Introduction • Program Analysis • Based on abstract interpretation • Useful tool for optimization and verification • Strong tension between precision and cost • Relevance heuristics • Tailor abstract domain to property • Key to scaling while maintaining enough information to prove useful properties • This talk • General principles underlying relevance heuristics • Applying these ideas to program analysis using Craig interpolation • Some recent research on analysis of heap manipulating programs
Static Analysis • Compute the least fixed-point of an abstract transformer • This is the strongest inductive invariant the analysis can provide • Inexpensive analyses: • interval analysis • affine equalities, etc. • These analyses lose information at a merge: joining x = y (one branch) with x = z (other branch) yields ⊤, i.e. no information • Such an analysis is inexpensive, but insufficient if the disjunction is needed to prove the desired property
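The information loss at a merge can be illustrated with a minimal Python sketch. It assumes a deliberately simplified equality domain in which an abstract state is just a set of known equalities read as a conjunction; the names `eq`, `join` and `TOP` are hypothetical, and a real affine-equality analysis would be considerably richer.

```python
# Abstract state: a frozenset of equalities, read as a conjunction.
# Each equality is an unordered pair of variable names.
def eq(a, b):
    return frozenset((a, b))

TOP = frozenset()  # the empty conjunction: no information

def join(s1, s2):
    """Join at a control-flow merge: keep only facts that hold on both branches."""
    return s1 & s2

then_branch = frozenset({eq("x", "y")})   # x = y
else_branch = frozenset({eq("x", "z")})   # x = z
merged = join(then_branch, else_branch)   # neither equality survives the merge
```

The join is cheap (a set intersection), but the disjunction x = y ∨ x = z is not representable, so it is lost.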
Predicate abstraction • Abstract transformer: • strongest Boolean postcondition over given predicates • Advantage: does not lose information at a merge • join is disjunction: merging x = y with x = z yields x = y ∨ x = z • Disadvantage: • Abstract state is exponential in size in the number of predicates • Abstract domain has exponential height • Result: • Must use only predicates relevant to proving property
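A rough Python sketch of why the join is precise but the state space is exponential. It assumes an explicit representation in which an abstract state is the set of allowed truth valuations of the predicates; the predicate list and the helper names are illustrative only, and real tools use symbolic representations instead.

```python
from itertools import product

PREDICATES = ["x == y", "x == z"]  # hypothetical predicate set

def all_valuations(preds):
    """Every truth assignment to the predicates: 2**len(preds) of them."""
    return set(product([False, True], repeat=len(preds)))

def join(s1, s2):
    """Join is disjunction: a valuation is allowed if either branch allows it."""
    return s1 | s2

then_branch = {v for v in all_valuations(PREDICATES) if v[0]}  # x == y holds
else_branch = {v for v in all_valuations(PREDICATES) if v[1]}  # x == z holds
merged = join(then_branch, else_branch)                        # x == y or x == z
```

Here `merged` still excludes the valuation where both predicates are false, so the disjunction is kept. The price is that the number of valuations doubles with every added predicate.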
Relevance Heuristics • Iterative refinement approach • Analyze failure of abstraction to prove property • Typically use failed program traces (CEGAR) • Add relevant information to abstraction • Must be sufficient to rule out failure • Key questions • How do we decide what program state information is “relevant”? • Is relevance even a well defined notion? These questions have been well studied in the context of the Boolean satisfiability problem, and we can actually give some fairly concrete answers.
Principles • Relevance: • A relevant predicate is one that is used in a parsimonious proof of the desired property • Generalization principle: • Facts used in the proof of special cases tend to be relevant to the overall proof.
Relevance principles and SAT • The Boolean Satisfiability Problem (SAT) • Input: A Boolean formula in CNF • Output: A satisfying assignment or UNSAT • The DPLL approach: • Branch. (assign values to variables) • Propagate. (make deductions by unit resolution, or BCP) • Learn. (deduce new clauses in response to conflicts) • Resolution rule: from p ∨ Γ and ¬p ∨ Δ, infer Γ ∨ Δ
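DPLL approach • Formula: (¬a ∨ b) ∧ (¬b ∨ c ∨ d) ∧ (¬b ∨ ¬d) • Decisions: ¬c, then a • BCP: a gives b; b and ¬c give d; b and d violate (¬b ∨ ¬d): Conflict! • Learned clause: resolve (¬b ∨ c ∨ d) with (¬b ∨ ¬d) on d to get (¬b ∨ c) • BCP guides clause learning by resolution • Learning generalizes failures • Learning guides decisions (VSIDS)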
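The slide's example can be replayed with a small Python sketch of unit propagation (BCP) and resolution. The integer encoding of literals (a=1, b=2, c=3, d=4, negation as a minus sign) and the function names are assumptions made for illustration; this is not a full DPLL solver and has no decision heuristic.

```python
def resolve(c1, c2, lit):
    """Resolve c1 (containing lit) with c2 (containing -lit)."""
    assert lit in c1 and -lit in c2
    return (c1 - {lit}) | (c2 - {-lit})

def unit_propagate(clauses, assignment):
    """Extend a set of true literals by unit resolution (BCP).
    Returns (assignment, conflicting clause or None)."""
    assignment = set(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assignment for lit in clause):
                continue  # clause already satisfied
            open_lits = [lit for lit in clause if -lit not in assignment]
            if not open_lits:
                return assignment, clause  # every literal is false: conflict
            if len(open_lits) == 1:
                assignment.add(open_lits[0])  # unit clause forces its literal
                changed = True
    return assignment, None

# (not a or b) and (not b or c or d) and (not b or not d)
clauses = [frozenset({-1, 2}), frozenset({-2, 3, 4}), frozenset({-2, -4})]
assignment, conflict = unit_propagate(clauses, {-3, 1})  # decisions: not c, a
learned = resolve(clauses[1], clauses[2], 4)             # resolve on d
```

The learned clause (¬b ∨ c) no longer mentions a or d, which is the sense in which learning generalizes the failure beyond the particular decisions that exposed it.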
Two kinds of deduction • Propagation: case-based, lightweight, exhaustive • Generalization: general, guided • Case splits drive propagation, and generalization in turn guides the case splits • Closing this loop focuses solver on relevant deductions • Allows SAT solvers to handle millions of clauses • Generates parsimonious proofs in case of unsatisfiability • What lessons can we learn from this architecture for program analysis?
Invariants from unwindings • Consider this very simple approach: • Partially unwind a program into a loop-free, in-line program • Construct a Floyd/Hoare proof for the in-line program • See if this proof contains an inductive invariant proving the property • Example program: x = y = 0; while(*) { x++; y++; } while(x != 0) { x--; y--; } assert(y == 0); • invariant: {x == y}
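As a sanity check on the example, the candidate invariant x == y can be tested for inductiveness by brute force over a small range of states. This Python sketch is only a bounded check under an assumed state range, not a proof, and the variable names are illustrative.

```python
R = range(-5, 6)
STATES = [(x, y) for x in R for y in R]  # bounded sample of program states

def inv(x, y):
    return x == y

# Initiation: x = y = 0 establishes the invariant.
init_ok = inv(0, 0)

# First loop body {x++; y++;} preserves it.
loop1_ok = all(inv(x + 1, y + 1) for x, y in STATES if inv(x, y))

# Second loop body {x--; y--;}, taken only when x != 0, preserves it.
loop2_ok = all(inv(x - 1, y - 1) for x, y in STATES if inv(x, y) and x != 0)

# On exit from the second loop (x == 0) the assertion y == 0 follows.
exit_ok = all(y == 0 for x, y in STATES if inv(x, y) and x == 0)
```

A weaker candidate such as y >= 0 would pass initiation but fail the exit check, which is why the relational fact x == y is the relevant one here.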
Unwind the loops • Both loops unwound twice; two Floyd/Hoare proofs of the same in-line program:
    statement                  proof 1        proof 2
                               {True}         {True}
    x = y = 0;                 {y = 0}        {x = 0 ∧ y = 0}
    x++; y++;                  {y = 1}        {x = y}
    x++; y++;                  {y = 2}        {x = y}
    [x != 0]; x--; y--;        {y = 1}        {x = y}
    [x != 0]; x--; y--;        {y = 0}        {x = 0 ⇒ y = 0}
    [x == 0]; [y != 0]         {False}        {False}
• Proof 2 of the in-line program contains invariants for both loops • Assertions may diverge as we unwind (proof 1: y = 0, y = 1, y = 2, ...) • A practical method must somehow prevent this kind of divergence!
Interpolation Lemma [Craig, 57] • Notation: L(φ) is the set of FO formulas using • the uninterpreted symbols of φ (predicates and functions) • the logical symbols ∧, ∨, ¬, ∃, ∀, (), ... • If A ∧ B = false, there exists an interpolant A' for (A,B) such that: • A ⇒ A' • A' ∧ B = false • A' ∈ L(A) ∩ L(B) • Example: • A = p ∧ q, B = ¬q ∧ r, A' = q
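For propositional formulas the three interpolant conditions can be checked by truth table. The Python sketch below does this for the slide's example; the function name `is_interpolant` and the representation of formulas as functions over an environment dictionary are assumptions made for illustration.

```python
from itertools import product

def is_interpolant(a, b, itp, vars_a, vars_b, vars_itp):
    """Check A => A', A' and B unsatisfiable, and A' over shared symbols only."""
    if not set(vars_itp) <= set(vars_a) & set(vars_b):
        return False  # A' uses a symbol that A and B do not share
    all_vars = sorted(set(vars_a) | set(vars_b))
    for values in product([False, True], repeat=len(all_vars)):
        env = dict(zip(all_vars, values))
        if a(env) and not itp(env):
            return False  # A does not imply A'
        if itp(env) and b(env):
            return False  # A' is consistent with B
    return True

A = lambda e: e["p"] and e["q"]            # A  = p and q
B = lambda e: (not e["q"]) and e["r"]      # B  = not q and r
ITP = lambda e: e["q"]                     # A' = q

ok = is_interpolant(A, B, ITP, {"p", "q"}, {"q", "r"}, {"q"})
```

Note that p would fail as a candidate: it is implied by A, but it is not in the shared vocabulary and it is consistent with B.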
Interpolants for sequences • Let A_1 ... A_n be a sequence of formulas • A sequence A'_0 ... A'_n is an interpolant for A_1 ... A_n when • A'_0 = True • A'_{i-1} ∧ A_i ⇒ A'_i, for i = 1..n • A'_n = False • and finally, A'_i ∈ L(A_1 ... A_i) ∩ L(A_{i+1} ... A_n) • Picture: True, A'_1, A'_2, ..., A'_{n-1}, False, where each interpolant together with the next A_i implies the next interpolant • In other words, the interpolant is a structured refutation of A_1 ... A_n
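Interpolants as Floyd-Hoare proofs • The in-line program, its SSA sequence, and the interpolants read back as a Hoare proof:
    statement     SSA constraint     interpolant     Hoare assertion
                                     True            {True}
    x = y;        x1 = y0            x1 = y0         {x = y}
    y++;          y1 = y0 + 1        y1 > x1         {y > x}
    [x == y]      x1 = y1            False           {False}
• 1. Each formula implies the next • 2. Each is over common symbols of prefix and suffix • 3. Begins with true, ends with false • Proving in-line programs: in-line program → SSA sequence → Prover → proof → Interpolation → Hoare proof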
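The implication chain for this example can be checked mechanically. The Python sketch below tests A'_{i-1} ∧ A_i ⇒ A'_i over a small assumed range of integers; it is a bounded check only, it does not test the shared-symbol condition, and the names `A`, `ITP` and `check_sequence` are illustrative.

```python
from itertools import product

NAMES = ["x1", "y0", "y1"]

A = [                                        # SSA constraints
    lambda s: s["x1"] == s["y0"],            # x = y
    lambda s: s["y1"] == s["y0"] + 1,        # y++
    lambda s: s["x1"] == s["y1"],            # [x == y]
]
ITP = [                                      # candidate interpolant sequence
    lambda s: True,
    lambda s: s["x1"] == s["y0"],
    lambda s: s["y1"] > s["x1"],
    lambda s: False,
]

def check_sequence(constraints, itp, domain):
    """Check itp[i-1] and constraints[i-1] imply itp[i] on every sampled state."""
    for values in product(domain, repeat=len(NAMES)):
        s = dict(zip(NAMES, values))
        for i in range(1, len(constraints) + 1):
            if itp[i - 1](s) and constraints[i - 1](s) and not itp[i](s):
                return False
    return True

chain_ok = check_sequence(A, ITP, range(-3, 4))
```

Dropping the SSA subscripts from the middle formulas gives exactly the Hoare assertions {x = y} and {y > x}.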
FOCI: An Interpolating Prover • Proof-generating decision procedure for quantifier-free FOL • Equality with uninterpreted function symbols • Theory of arrays • Linear rational arithmetic, integer difference bounds • SAT Modulo Theories approach • Boolean reasoning performed by SAT solver • Exploits SAT relevance heuristics • Quantifier-free interpolants from proofs • Linear-time construction [TACAS 04] • From Q-F interpolants, we can derive atomic predicates for Predicate Abstraction [Henzinger, et al, POPL 04] • Allows counterexample-based refinement • Integrated with software verification tools • Berkeley BLAST, Cadence IMPACT
But won’t we diverge? • Programs are infinite state, so convergence to a fixed point is not guaranteed. • What would prevent us from computing an infinite sequence of interpolants, say, x=0, x=1, x=2,... as we unwind the loops further? • Limited completeness result • Stratify the logical language L into a hierarchy of finite languages • Compute minimal interpolants in this hierarchy • If an inductive invariant proving the property exists in L, you must eventually converge to one Interpolation provides a means of static analysis in abstract domains of infinite height. Though we cannot compute a least fixed point, we can compute a fixed point implying a given property if one exists.
Experiments • [Results table not recoverable from the slide text. Surviving labels: Windows DDK, CAV 06, POPL 04, * = pre-processed]
Relevance heuristics • Relevance heuristics are key to managing the precision/cost tradeoff • In general, less information is better • Effective relevance heuristics improve scaling behavior • Based on principle of generalization from special cases • Interpolation approach • Yields Floyd-Hoare proofs for loop-free program fragments • Provides an effective relevance heuristic • if we can solve the divergence problem • Exploits prover’s ability to focus on a small set of relevant facts
Expressiveness hierarchy (interpolant language : parameterized abstract domain, in order of increasing expressiveness) • QF : Predicate Abstraction • ∀FO : Indexed Predicate Abstraction • ∀FO(TC) : Canonical Heap Abstractions
Need for quantified interpolants • Example: for(i = 0; i < N; i++) a[i] = i; for(j = 0; j < N; j++) assert(a[j] == j); • invariant: {∀x. 0 ≤ x ∧ x < i ⇒ a[x] = x} • Existing interpolating provers cannot produce quantified interpolants • Problem: how to prevent the number of quantifiers from diverging in the same way that constants diverge when we unwind the loops?
Need for Reachability • Example: ... node *a = create_list(); while(a){ assert(alloc(a)); a = a->next; } ... • invariant: ∀x (rea(next,a,x) ∧ x ≠ nil → alloc(x)) • This condition is needed to prove memory safety (no use after free). • Cannot be expressed in FO • We need some predicate identifying a closed set of nodes that is allocated • We require a theory of reachability (in effect, transitive closure) • Can we build an interpolating prover for full FOL that handles reachability, and avoids divergence?
Clausal provers • A clausal refutation prover takes a set of clauses and returns a proof of unsatisfiability (i.e., a refutation) if possible. • A prover is based on inference rules of this form: from P1 ... Pn, infer C • where P1 ... Pn are the premises and C the conclusion. • A typical inference rule is resolution, of which this is an instance: from p(a) and p(U) → q(U), infer q(a) • This was accomplished by unifying p(a) and p(U), then dropping the complementary literals.
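The resolution instance above can be reproduced with a short Python sketch of unification and binary resolution. The term encoding is an assumption made for illustration: variables are capitalized strings, constants are lowercase strings, compound terms are tuples, and a literal is a (sign, atom) pair. The occurs check is omitted and the two clauses are assumed to share no variables.

```python
def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def walk(t, subst):
    """Follow variable bindings until an unbound variable or non-variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(s, t, subst):
    """Return a substitution extending subst that unifies s and t, or None."""
    s, t = walk(s, subst), walk(t, subst)
    if s == t:
        return subst
    if is_var(s):
        return {**subst, s: t}
    if is_var(t):
        return {**subst, t: s}
    if isinstance(s, tuple) and isinstance(t, tuple) and s[0] == t[0] and len(s) == len(t):
        for a, b in zip(s[1:], t[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None

def apply(t, subst):
    t = walk(t, subst)
    if isinstance(t, tuple):
        return (t[0],) + tuple(apply(a, subst) for a in t[1:])
    return t

def resolve(clause1, clause2):
    """All binary resolvents of two clauses (lists of (sign, atom) literals)."""
    out = []
    for lit1 in clause1:
        for lit2 in clause2:
            if lit1[0] != lit2[0]:
                subst = unify(lit1[1], lit2[1], {})
                if subst is not None:
                    rest = [l for l in clause1 if l != lit1] + [l for l in clause2 if l != lit2]
                    out.append([(sign, apply(atom, subst)) for sign, atom in rest])
    return out

fact = [(True, ("p", "a"))]                           # p(a)
rule = [(False, ("p", "U")), (True, ("q", "U"))]      # p(U) -> q(U)
resolvents = resolve(fact, rule)                      # [[q(a)]]
```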
Superposition calculus • Modern FOL provers are based on the superposition calculus • Example superposition inference: from Q(a) and P → (a = c), infer P → Q(c) • this is just substitution of equals for equals • in practice this approach generates a lot of substitutions! • use a reduction order to reduce the number of inferences
Reduction orders • A reduction order ≻ is: • a total, well founded order on ground terms • subterm property: f(a) ≻ a • monotonicity: a ≻ b implies f(a) ≻ f(b) • Example: Recursive Path Ordering (with Status) (RPOS) • start with a precedence on symbols: a ≻ b ≻ c ≻ f • induces a reduction ordering on ground terms: f(f(a)) ≻ f(a) ≻ a ≻ f(b) ≻ b ≻ c ≻ f
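To see how a precedence on symbols induces an order on ground terms, here is a Python sketch of the lexicographic path ordering (LPO). This is a simpler relative of the RPOS named on the slide (it corresponds to giving every symbol lexicographic status), not RPOS itself. Terms are encoded as tuples `(symbol, arg1, ...)`, and the names `PRECEDENCE`, `prec_gt` and `lpo_gt` are assumptions for illustration.

```python
PRECEDENCE = ["a", "b", "c", "f"]  # decreasing: a > b > c > f, as on the slide

def prec_gt(f, g):
    return PRECEDENCE.index(f) < PRECEDENCE.index(g)

def lpo_gt(s, t):
    """True if ground term s is greater than ground term t in the LPO."""
    f, s_args = s[0], s[1:]
    g, t_args = t[0], t[1:]
    # Case 1: some argument of s is equal to or greater than t.
    if any(si == t or lpo_gt(si, t) for si in s_args):
        return True
    # Cases 2 and 3 both require s to dominate every argument of t.
    if not all(lpo_gt(s, tj) for tj in t_args):
        return False
    # Case 2: the head of s has higher precedence.
    if prec_gt(f, g):
        return True
    # Case 3: same head symbol, compare arguments lexicographically.
    if f == g:
        for si, ti in zip(s_args, t_args):
            if si != ti:
                return lpo_gt(si, ti)
    return False

a, b, c = ("a",), ("b",), ("c",)
chain = [("f", ("f", a)), ("f", a), a, ("f", b), b, c]
decreasing = all(lpo_gt(chain[i], chain[i + 1]) for i in range(len(chain) - 1))
```

In particular a ≻ f(b) holds because a outranks f in the precedence and a ≻ b, even though f(b) is the larger term syntactically.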
Ordering Constraint • Constrains rewrites to be “downward” in the reduction order • Example: from Q(a) and P → (a = c), infer P → Q(c); this inference is only possible if a ≻ c • These terms must be maximal in their clauses • Thm: Superposition with OC is complete for refutation in FOL with equality. • So how do we get interpolants from these proofs?
Local Proofs • A proof is local for a pair of clause sets (A,B) when every inference step uses only symbols from A or only symbols from B. • From a local refutation of (A,B), we can derive an interpolant for (A,B) in linear time. • This interpolant is a Boolean combination of formulas in the proof
Reduction orders and locality • A reduction order is oriented for (A,B) when: • s ≻ t for every s ∉ L(B) and t ∈ L(B) • Intuition: rewriting eliminates first A variables, then B variables. • Example: A = {x = y, f(x) = c}, B = {f(y) = d, c ≠ d}, oriented order: x ≻ y ≻ c ≻ d ≻ f • x = y, f(x) = c ⊢ f(y) = c • f(y) = c, f(y) = d ⊢ c = d • c = d, c ≠ d ⊢ ⊥ • Local!!
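The locality condition is easy to state as a check over the symbols used by each inference step. The Python sketch below applies it to the example refutation, assuming the partition A = {x = y, f(x) = c} and B = {f(y) = d, c ≠ d} read off the slide's figure; the function name `is_local` and the hand-written symbol sets are illustrative.

```python
def is_local(proof_steps, symbols_a, symbols_b):
    """A proof is local if each step uses only A's symbols or only B's symbols."""
    return all(step <= symbols_a or step <= symbols_b for step in proof_steps)

SYMS_A = {"x", "y", "f", "c"}   # symbols of A = {x = y, f(x) = c}
SYMS_B = {"y", "f", "c", "d"}   # symbols of B = {f(y) = d, c != d}

# Symbols occurring in each inference step of the refutation.
local_proof = [
    {"x", "y", "f", "c"},   # x = y, f(x) = c  |-  f(y) = c
    {"y", "f", "c", "d"},   # f(y) = c, f(y) = d  |-  c = d
    {"c", "d"},             # c = d, c != d  |-  false
]

# Hypothetical step mixing x (A only) with d (B only), e.g. deriving f(x) = d.
nonlocal_proof = [{"x", "f", "d"}]

local_ok = is_local(local_proof, SYMS_A, SYMS_B)
nonlocal_ok = is_local(nonlocal_proof, SYMS_A, SYMS_B)
```

Rewriting x to y first, as the oriented order dictates, is what keeps the A-only symbol x out of every step that involves d.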
Orientation is not enough • Example (figure): clauses Q(a), a = c, b = c and ¬Q(b), split between A and B, with order Q ≻ a ≻ b ≻ c • Local superposition gives only c = c. • Solution: replace non-local superposition with two inferences: • instead of: from Q(a) and a = c, infer Q(c) • use: from Q(a), infer a = U → Q(U); then from a = c and a = U → Q(U), infer Q(c) • The second inference can be postponed until after resolving with ¬Q(b) • This “procrastination” step is an example of a reduction rule, and preserves completeness.
Completeness of local inference • Thm: Local superposition with procrastination is complete for refutation of pairs (A,B) such that: • (A,B) has a universally quantified interpolant • The reduction order is oriented for (A,B) • This gives us a complete method for generation of universally quantified interpolants for arbitrary first-order formulas! • This is easily extensible to interpolants for sequences of formulas, hence we can use the method to generate Floyd/Hoare proofs for inline programs.
Avoiding Divergence • As argued earlier, we still need to prevent interpolants from diverging as we unwind the program further. • Idea: stratify the clause language Example: Let Lk be the set of clauses with at most k variables and nesting depth at most k. Note that each Lk is a finite language. • Stratified saturation prover: • Initially let k = 1 • Restrict prover to generate only clauses in Lk • When prover saturates, increase k by one and continue The stratified prover is complete, since every proof is contained in some Lk.
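The stratified loop can be sketched in Python on a propositional toy. As a stand-in for the slide's L_k (clauses with at most k variables and nesting depth at most k), this sketch assumes a much cruder stratification by clause width: level k admits only resolvents with at most k literals. The function names and the integer literal encoding are hypothetical, and the `max_k` cutoff exists only so the toy terminates on satisfiable input.

```python
from itertools import combinations

def saturate(clauses, k):
    """Close the clause set under resolution, keeping resolvents with <= k literals."""
    known = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(known, 2):
            for lit in c1:
                if -lit in c2:
                    r = (c1 - {lit}) | (c2 - {-lit})
                    if any(-l in r for l in r):
                        continue  # skip tautologies
                    if len(r) <= k and r not in known:
                        new.add(r)
        if not new:
            return known  # saturated at this level
        known |= new

def stratified_refute(clauses, max_k):
    """Return the first level k at which the empty clause appears, else None."""
    known = {frozenset(c) for c in clauses}
    for k in range(1, max_k + 1):
        known = saturate(known, k)   # continue from what earlier levels derived
        if frozenset() in known:
            return k
    return None

level = stratified_refute([[1, 2, 3], [-1], [-2], [-3]], max_k=5)
```

In this example level 1 saturates without a refutation, because the first useful resolvent has two literals; the refutation appears once k is raised to 2.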
Completeness for universal invariants • Lemma: For every safety program M with a ∀ (universally quantified) safety invariant, and every stratified saturation prover P, there exists an integer k such that P refutes every unwinding of M in Lk, provided: • The reduction ordering is oriented properly • This means that as we unwind further, eventually all the interpolants are contained in Lk, for some k. • Theorem: Under the above conditions, there is some unwinding of M for which the interpolants generated by P contain a safety invariant for M. • This means we have a complete procedure for finding universally quantified safety invariants whenever these exist!
In practice • We have proved theoretical convergence. But does the procedure converge in practice in a reasonable time? • Modify SPASS, an efficient superposition-based saturation prover: • Generate oriented precedence orders • Add procrastination rule to SPASS’s reduction rules • Drop all non-local inferences • Add stratification (SPASS already has something similar) • Add axiomatizations of the necessary theories • An advantage of a full FOL prover is we can add axioms! • As argued earlier, we need a theory of arrays and reachability (TC)
Partially Axiomatizing FO(TC) • Axioms of the theory of arrays (with select and update) • ∀ A,I,V (select(update(A,I,V), I) = V) • ∀ A,I,J,V (I ≠ J → select(update(A,I,V), J) = select(A,J)) • Axioms for reachability (rea) • ∀ L,E rea(L,E,E) • ∀ L,E,X (rea(L,select(L,E),X) → rea(L,E,X)) [if e->link reaches x then e reaches x] • ∀ L,E,X (rea(L,E,X) → E = X ∨ rea(L,select(L,E),X)) [if e reaches x then e = x or e->link reaches x] • etc... • Since FO(TC) has no complete axiomatization, these axioms are necessarily incomplete
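The three reachability axioms can be checked against a concrete model. In this Python sketch a link field is a dictionary from node to successor (so `link[e]` plays the role of select(L,E)), `rea` is computed as reflexive-transitive closure, and nil is assumed to link to itself. The heap shapes and names are hypothetical; passing the check on a few heaps says nothing about completeness.

```python
def rea(link, e, x):
    """True if x is reachable from e by following link zero or more times."""
    seen = set()
    while e not in seen:
        if e == x:
            return True
        seen.add(e)
        e = link[e]
    return False

def check_axioms(link):
    """Check the three rea axioms for every pair of nodes in the heap."""
    nodes = list(link)
    for e in nodes:
        if not rea(link, e, e):                       # rea(L,E,E)
            return False
        for x in nodes:
            nxt = link[e]                             # select(L,E)
            if rea(link, nxt, x) and not rea(link, e, x):
                return False                          # second axiom violated
            if rea(link, e, x) and not (e == x or rea(link, nxt, x)):
                return False                          # third axiom violated
    return True

acyclic = {"n1": "n2", "n2": "n3", "n3": "nil", "nil": "nil"}
cyclic = {"c1": "c2", "c2": "c1", "nil": "nil"}
axioms_hold = check_axioms(acyclic) and check_axioms(cyclic)
```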
Simple example • for(i = 0; i < N; i++) a[i] = i; for(j = 0; j < N; j++) assert(a[j] == j); • invariant: {∀x. 0 ≤ x ∧ x < i ⇒ a[x] = x}
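A direct way to get a feel for the quantified invariant is to run the program and test it at every loop head. This Python sketch does that for small N; it is a dynamic check on concrete runs, not a proof, and the function name `run_and_check` is illustrative.

```python
def run_and_check(n):
    """Run both loops for array size n, checking the invariant at each loop head."""
    a = [None] * n
    i = 0
    while i < n:
        # Invariant: for all x, 0 <= x < i implies a[x] == x.
        assert all(a[x] == x for x in range(i))
        a[i] = i
        i += 1
    # The invariant also holds on exit from the first loop, where i == n.
    assert all(a[x] == x for x in range(i))
    # Second loop: the program's own assertion a[j] == j.
    return all(a[j] == j for j in range(n))

all_ok = all(run_and_check(n) for n in range(6))
```

The invariant needs the quantifier because the number of array cells it constrains grows with i; no fixed set of quantifier-free facts about individual cells covers every N.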
Unwinding simple example • Unwind the loops twice:
    statement                    SSA constraint                                        interpolant after the step
    i = 0;                       i0 = 0                                                {i0 = 0}
    [i < N]; a[i] = i; i++;      i0 < N ∧ a1 = update(a0,i0,i0) ∧ i1 = i0 + 1          {0 ≤ U ∧ U < i1 ⇒ select(a1,U) = U}   (invariant)
    [i < N]; a[i] = i; i++;      i1 < N ∧ a2 = update(a1,i1,i1) ∧ i2 = i1 + 1          {0 ≤ U ∧ U < i2 ⇒ select(a2,U) = U}
    [i >= N]; j = 0;             i2 ≥ N ∧ j0 = 0                                       {j ≤ U ∧ U < N ⇒ select(a2,U) = U}   (invariant)
    [j < N]; j++;                j0 < N ∧ j1 = j0 + 1                                  {j ≤ U ∧ U < N ⇒ select(a2,U) = U}
    [j < N]; a[j] != j;          j1 < N ∧ select(a2,j1) ≠ j1
• note: stratification prevents constants diverging as 0, succ(0), succ(succ(0)), ...
List deletion example • Invariant synthesized with 3 unwindings (after some simplification): • a = create_list(); while(a){ tmp = a->next; free(a); a = tmp; } • {rea(next,a,nil) ∧ ∀x (rea(next,a,x) → x = nil ∨ alloc(x))} • That is, a is acyclic, and every cell is allocated • Note that interpolation can synthesize Boolean structure.
More small examples • [Results table not recoverable from the slide text] • This shows that divergence can be controlled. But can we scale to large programs?...
Conclusion • Relevance heuristics are essential for scaling richer program analysis domains to large programs • Relevance heuristics are based on a generalization principle: • Relevant facts are those used in parsimonious proofs • Facts relevant to special cases are likely to be useful in the general case • Relevance heuristics for program analysis • Special cases can be program paths or loop-free unwindings • Interpolation can extract relevant facts from proofs of special cases • Must avoid divergence • Quantified invariants • Needed for programs that manipulate arrays or heaps • FO equality prover modified to produce local proofs (hence interpolants) • Complete for universal invariants • May be used as a relevance heuristic for shape analysis, IPA