Mining Specifications
(lots of) code specifications
Glenn Ammons (Univ. of Wisconsin), Ras Bodík (Univ. of Wisconsin), Jim Larus (Microsoft Research)
Drivers wanted.
Verification: beyond engine-less cars
Recent successes:
• specification languages
• checkers
• abstractors
What’s still missing?
• specifications
So who formulates specifications?
Programmers? Probably not.
Why they won’t:
• too busy; yet another language to learn?
• specifications aren’t cool.
Why they shouldn’t:
• may misunderstand usage rules.
• may not know all usage rules.
Mining specifications:
• Convenience.
• As in data mining, discover surprising rules.
Advantages of mining
Exploits the massive programmer effort reflected in the code.
• Programmers have already resolved many problems:
  • incomplete system requirements.
  • incomplete API documentation.
  • implementation-dependent rules.
• Want redundancy (without redundant programming)?
  • ask multiple programmers (and vote).
Our output: a specification
[Figure: state machine over the calls x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)]
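To make the shape of such a specification concrete, here is a minimal sketch of the socket rule as a finite-state machine. The state names, the loops on read/write, and the return to the listening state after close(y) are assumptions for illustration; the slide shows only the calls.

```python
# Hypothetical encoding of the socket specification as a small FSM.
# State names and the exact transition structure are assumptions.
SPEC = {
    ("start", "x = socket()"): "created",
    ("created", "bind(x)"): "bound",
    ("bound", "listen(x)"): "listening",
    ("listening", "y = accept(x)"): "connected",
    ("connected", "read(y)"): "connected",
    ("connected", "write(y)"): "connected",
    ("connected", "close(y)"): "listening",
    ("listening", "close(x)"): "end",
}


def accepts(events, spec=SPEC, start="start", final="end"):
    """Return True if the event sequence follows the FSM from start to final."""
    state = start
    for event in events:
        state = spec.get((state, event))
        if state is None:  # no transition for this event: rule violated
            return False
    return state == final
```

A scenario that skips bind(x), for example, is rejected at the listen(x) call because no transition exists from the "created" state.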
How do we mine?
• Underlying premise:
  • Even bad software is debugged enough to show hints of correct behavior.
• Maxim: Common usage is the correct usage.
Mining = machine learning
Reduce the problem to the well-known problem of learning regular languages.
Obstacles:
• bugs in the source code may be learned into the specification
• what is “common” behavior?
Solutions:
• learn from dynamic behavior
• learn probabilistically
Result: learn from traces into probabilistic FSMs
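As a rough illustration of learning probabilistically, the sketch below estimates transition probabilities by counting which event follows which. This is a deliberately simplified stand-in (one state per last-seen event), not the off-the-shelf learner used in the talk; the function name and the `<start>`/`<end>` markers are assumptions.

```python
from collections import Counter, defaultdict


def learn_pfsm(scenarios):
    """Estimate a probabilistic FSM whose states are the last event seen.

    Simplification for illustration only: a real regular-language learner
    would merge and split states rather than use one state per event.
    """
    counts = defaultdict(Counter)
    for scenario in scenarios:
        prev = "<start>"
        for event in list(scenario) + ["<end>"]:
            counts[prev][event] += 1
            prev = event
    # Normalize the counts of each state into probabilities.
    return {
        state: {ev: n / sum(out.values()) for ev, n in out.items()}
        for state, out in counts.items()
    }


# Nine correct scenarios and one buggy one that never closes the connection.
good = ["accept", "read", "write", "close"]
buggy = ["accept", "read", "write"]
model = learn_pfsm([good] * 9 + [buggy])
```

In this toy model the buggy ending (write followed directly by the end of the scenario) receives a low probability, which is what lets common usage outvote the occasional bug.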
Input: trace(s)
[Figure: the specification over x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)]
7 = socket(2, 1, 0);
bind(7, 0x400120, 16);
listen(7, 5);
8 = accept(7, 0x400200, 0x400240);
read(8, 0x400320, 255);
write(8, 0x400320, 12);
read(8, 0x400320, 255);
write(8, 0x400320, 7);
close(8);
10 = accept(7, 0x400200, 0x400240);
read(10, 0x400320, 255);
write(10, 0x400320, 13);
close(10);
close(7);
…
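A trace like this can be turned into per-connection scenarios by following the descriptor that accept returns. The sketch below is one possible way to do that, assuming the textual trace format shown on the slide; the regular expression and both function names are hypothetical.

```python
import re

# Matches "8 = accept(7, 0x400200, 0x400240);" as well as "close(8);".
CALL = re.compile(r"(?:(\w+)\s*=\s*)?(\w+)\((.*?)\);")


def parse_trace(text):
    """Parse a textual trace into (name, return_value_or_None, args) tuples."""
    events = []
    for ret, name, args in CALL.findall(text):
        arg_list = [a.strip() for a in args.split(",") if a.strip()]
        events.append((name, ret or None, arg_list))
    return events


def scenarios_by_connection(events):
    """Group calls by the descriptor returned from accept, until it is closed."""
    open_conns = {}
    finished = []
    for name, ret, args in events:
        if name == "accept":
            open_conns[ret] = ["accept"]
        elif args and args[0] in open_conns:
            open_conns[args[0]].append(name)
            if name == "close":
                finished.append(open_conns.pop(args[0]))
    return finished


TRACE = """
7 = socket(2, 1, 0); bind(7, 0x400120, 16); listen(7, 5);
8 = accept(7, 0x400200, 0x400240); read(8, 0x400320, 255);
write(8, 0x400320, 12); read(8, 0x400320, 255); write(8, 0x400320, 7);
close(8); 10 = accept(7, 0x400200, 0x400240); read(10, 0x400320, 255);
write(10, 0x400320, 13); close(10); close(7);
"""
scenarios = scenarios_by_connection(parse_trace(TRACE))
```

Each resulting scenario is a short string of call names, which is the kind of input a regular-language learner expects.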
The mining algorithm
[Figure: pipeline]
dynamic execution (traces) → trace abstraction → usage scenarios (strings) → (off-the-shelf) RegExp learner → generalized scenarios (probabilistic NFA) → extract heavy core (and approve) → specification (NFA)
specification (NFA) + dynamic execution to be checked (trace) → dynamic checker → OK/bug
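The last two stages of the pipeline can be sketched as follows. This assumes that "extract heavy core" means dropping low-probability transitions, which is only one plausible reading of the slide; the threshold, the toy model, and the function names are all hypothetical, and the automaton here is a simple deterministic one rather than a general NFA.

```python
# Toy probabilistic FSM: state (last event seen) -> {next event: probability}.
# The numbers are made up for illustration.
PFSM = {
    "<start>": {"accept": 1.0},
    "accept": {"read": 1.0},
    "read": {"write": 1.0},
    "write": {"read": 0.3, "close": 0.6, "<end>": 0.1},
    "close": {"<end>": 1.0},
}


def heavy_core(pfsm, threshold=0.2):
    """Keep only transitions whose probability reaches the threshold."""
    return {
        state: {ev for ev, p in out.items() if p >= threshold}
        for state, out in pfsm.items()
    }


def check(trace, core, start="<start>", end="<end>"):
    """Run a trace through the core; report the first disallowed event."""
    prev = start
    for i, event in enumerate(list(trace) + [end]):
        if event not in core.get(prev, set()):
            return ("bug", i, event)
        prev = event
    return ("OK", None, None)


CORE = heavy_core(PFSM)
```

With this core, a trace that ends right after a write is flagged because the rare "write then end" transition was pruned. A person would still review the core before using it as a specification, matching the "(and approve)" step.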
Trace abstraction: 4 challenges
• Traces interleave useful and useless events.
  • RegExp learner cannot separate them.
• Specifications must include both temporal and value-flow constraints.
  • RegExp learner is only good with temporal constraints.
• Only some arguments of API calls impose “true” dependences.
  • Infeasible to learn value-flow constraints on all arguments.
• Specifications may impose only a partial order.
  • Encoding all legal partial orders would produce a huge FSM.
Trace abstraction
[Figure: successive steps of abstracting a trace]
h(3, 5) c(10) a(4, 5) d(4, 7) b(0, 5) f(10) h(8, 11) e(7) f(50) d(15, 1) c(7) a(9, 11) b(6, 7) d(9, 14) f(20) e(7) …
h(_, 5) c(10) a(4, 5) d(4, 7) b(_, 5) f(10) h(_, 11) e(7) f(_) d(_, _) c(7) a(9, 11) b(_, 11) d(9, _) e(_) f(_) …
h(_, ) a( , ) d( , ) b(_, ) e( )
h(_, X) a(Y, X) d(Y, Z) b(_, X) e(Z)
h(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z)
h(_, X) a(Y, X) b(_, X) d(Y, Z)
h(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z)
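One step of this abstraction, replacing concrete values with names, can be sketched as below. The rule used here is an assumption inferred from the example: a value that occurs only once becomes "_", and each value that recurs gets the next variable name in order of first appearance. Scenario extraction and reordering are not covered.

```python
from collections import Counter


def abstract(scenario):
    """Replace concrete argument values with variable names.

    scenario: list of (call_name, [argument values]).
    Values seen once become "_"; recurring values become X, Y, Z, ...
    Only six names are provided, which is enough for a small sketch.
    """
    counts = Counter(v for _, args in scenario for v in args)
    names = iter("XYZUVW")
    assigned = {}
    result = []
    for name, args in scenario:
        new_args = []
        for value in args:
            if counts[value] < 2:
                new_args.append("_")  # value carries no dependence
            else:
                if value not in assigned:
                    assigned[value] = next(names)
                new_args.append(assigned[value])
        result.append(f"{name}({', '.join(new_args)})")
    return result


concrete = [("h", [3, 5]), ("a", [4, 5]), ("d", [4, 7]), ("b", [0, 5]), ("e", [7])]
abstracted = abstract(concrete)
```

Applied to the events h(3, 5) a(4, 5) d(4, 7) b(0, 5) e(7) taken from the trace on the slide, this yields the same named form the slide shows.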
Preliminary experiments
Attempted to learn and verify two published X Windows rules. As of Friday:
• A timestamp-passing rule
  • learned the rule! (compact: 6 states)
  • bugs in 2 out of 17 programs (ups, e93)
• SetOwner(x) must be followed by GetSelection(x)
  • failed to learn the rule (small learning set), but
  • bugs in 2 out of 5 programs (xemacs, ups)
Related work
Arithmetic pre/post conditions
• Daikon, Houdini
• properties orthogonal to ours
• eventually, we may need to include and learn some arithmetic relationships
Temporal relationships over calls
• intrusion detection: [Ghosh et al], [Wagner and Dean]
• software processes: [Cook and Wolf]
• error checking: [Engler et al SOSP 2001]
  • lexical and syntactic pattern matching
  • user must write templates (e.g., <a> always follows <b>)
Ongoing work Mechanize tool. Find more gold.
Future work
[Figure labels: code, inputs, Mining, specifications, bugs; tools: ESP, Vault, SPIN, Verisoft, SLAM, ?, …]
Give gold to jewelers.
Summary
• Semi-automatically creating well-formed, non-trivial specifications is an important part of the verification tool chain.
• Contributions:
  • introduced specification mining
  • phrased it as probabilistic learning from dynamic traces
  • decomposed it into a sequence of subproblems (using an off-the-shelf learner)
  • developed a dynamic checker
  • found bugs
Discussion
Expressibility
• what classes of properties can/should we learn?
• can we learn more than we can check?
• can a single-threaded specification avoid race conditions?