
Mining Specifications: (lots of) code → specifications

Glenn Ammons (Univ. of Wisconsin), Ras Bodík (Univ. of Wisconsin), Jim Larus (Microsoft Research)





Presentation Transcript


  1. Mining Specifications: (lots of) code → specifications
     Glenn Ammons (Univ. of Wisconsin), Ras Bodík (Univ. of Wisconsin), Jim Larus (Microsoft Research)

  2. Drivers wanted. Verification: beyond engine-less cars
     Recent successes:
     • specification languages
     • checkers
     • abstractors
     What’s still missing?
     • specifications

  3. So who formulates specifications? Programmers? Probably not.
     Why they won’t:
     • too busy; yet another language to learn?
     • specifications aren’t cool.
     Why they shouldn’t:
     • may misunderstand usage rules.
     • may not know all usage rules.
     Mining specifications:
     • convenience.
     • as in data mining, discovers surprising rules.

  4. Advantages of mining
     Exploits the massive programmer effort reflected in the code.
     • Programmers resolved many problems:
       • incomplete system requirements.
       • incomplete API documentation.
       • implementation-dependent rules.
     • Want redundancy (without redundant programming)?
       • ask multiple programmers (and vote).

  5. Our output: a specification
     [State-machine diagram; edge labels:] x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)

  6. How do we mine?
     • Underlying premise: even bad software is debugged enough to show hints of correct behavior.
     • Maxim: common usage is the correct usage.

  7. Mining = machine learning
     Reduce the problem to the well-known problem of learning regular languages.
     Obstacles:
     • bugs in the source code may be learned into the specification
     • what is “common” behavior?
     Solutions:
     • learn from dynamic behavior
     • learn probabilistically
     So: learn from traces into probabilistic FSMs.
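
To make "learn from traces into probabilistic FSMs" concrete, here is a minimal sketch. It is not the off-the-shelf learner the slides refer to; as a stand-in it fits a first-order (bigram) model in which the state is simply the last event seen and each edge carries its observed frequency. The names `learn_pfsm`, `START`, and `END`, and the three example scenarios, are hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # hypothetical marker states

def learn_pfsm(scenarios):
    """Estimate edge probabilities from event sequences (bigram stand-in)."""
    counts = defaultdict(Counter)
    for scenario in scenarios:
        prev = START
        for event in list(scenario) + [END]:
            counts[prev][event] += 1
            prev = event
    # Normalize the counts leaving each state into probabilities.
    return {
        state: {nxt: n / sum(edges.values()) for nxt, n in edges.items()}
        for state, edges in counts.items()
    }

# Hypothetical scenarios; the last one skips write() and is the rare case.
scenarios = [
    ["accept", "read", "write", "close"],
    ["accept", "read", "write", "read", "write", "close"],
    ["accept", "read", "close"],
]
pfsm = learn_pfsm(scenarios)
print(pfsm["read"])  # {'write': 0.75, 'close': 0.25}
```

The rare `read` → `close` edge receives a low weight, which is what lets a later step separate common usage from likely bugs.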

  8. Input: trace(s)
     [Specification diagram; edge labels:] x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)
     Trace:
       7 = socket(2, 1, 0);
       bind(7, 0x400120, 16);
       listen(7, 5);
       8 = accept(7, 0x400200, 0x400240);
       read(8, 0x400320, 255);
       write(8, 0x400320, 12);
       read(8, 0x400320, 255);
       write(8, 0x400320, 7);
       close(8);
       10 = accept(7, 0x400200, 0x400240);
       read(10, 0x400320, 255);
       write(10, 0x400320, 13);
       close(10);
       close(7);
       … …
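
A minimal sketch of turning such a trace into structured events, assuming every event is written as `[ret =] name(args);` exactly as on the slide. The `Event` type and `parse_trace` function are hypothetical names, and arguments are kept as strings rather than interpreted.

```python
import re
from typing import NamedTuple, Optional

class Event(NamedTuple):
    name: str            # call name, e.g. "socket"
    args: tuple          # argument strings, left uninterpreted
    ret: Optional[str]   # returned value, or None if the trace shows none

# Optional "ret =" prefix, then name(args);
_EVENT = re.compile(r"(?:(\w+)\s*=\s*)?(\w+)\(([^)]*)\);")

def parse_trace(text):
    """Parse trace text into a list of Event records."""
    events = []
    for match in _EVENT.finditer(text):
        ret, name, arg_text = match.groups()
        args = tuple(a.strip() for a in arg_text.split(",") if a.strip())
        events.append(Event(name, args, ret))
    return events

trace = "7 = socket(2, 1, 0); bind(7, 0x400120, 16); listen(7, 5); close(7);"
events = parse_trace(trace)
print(events[0])  # Event(name='socket', args=('2', '1', '0'), ret='7')
```

Keeping the return value alongside the arguments matters here because the value returned by one call (7 from `socket`) reappears as an argument of later calls.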

  9. The mining algorithm
     [Flow diagram; recoverable stages:]
     Mining: dynamic execution (traces) → trace abstraction → usage scenarios (strings) → (off-the-shelf) RegExp learner → generalized scenarios (probabilistic NFA) → extract heavy core (and approve) → specification (NFA)
     Checking: specification (NFA) + dynamic execution to be checked (trace) → dynamic checker → OK/bug
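
The last two stages can be sketched as follows. The slide only says "extract heavy core (and approve)"; dropping edges below a fixed probability threshold is an assumption made here for illustration, as is representing the automaton by "state = last event", which is deterministic and simpler than a general NFA. `heavy_core`, `check`, and the example automaton are hypothetical.

```python
def heavy_core(pfsm, threshold):
    """Keep only edges whose probability is at least `threshold` (assumed rule)."""
    return {
        state: {nxt for nxt, p in edges.items() if p >= threshold}
        for state, edges in pfsm.items()
    }

def check(spec, trace, start="<start>", end="<end>"):
    """Return (True, None) if spec accepts trace, else (False, offending index)."""
    state = start
    for i, event in enumerate(list(trace) + [end]):
        if event not in spec.get(state, set()):
            return False, i
        state = event
    return True, None

# Hypothetical probabilistic automaton over accept/read/write/close.
pfsm = {
    "<start>": {"accept": 1.0},
    "accept": {"read": 1.0},
    "read": {"write": 0.75, "close": 0.25},
    "write": {"close": 2 / 3, "read": 1 / 3},
    "close": {"<end>": 1.0},
}
spec = heavy_core(pfsm, threshold=0.3)

print(check(spec, ["accept", "read", "write", "close"]))  # (True, None)
print(check(spec, ["accept", "read", "close"]))           # (False, 2)
```

In the second call the rare `read` → `close` edge was pruned from the core, so the checker flags the event at index 2. Whether that is a real bug or a legal but uncommon usage is exactly what the "(and approve)" step leaves to a human.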

  10. Trace abstraction: 4 challenges
      • Traces interleave useful and useless events.
        • The RegExp learner cannot separate them.
      • Specifications must include both temporal and value-flow constraints.
        • The RegExp learner is only good with temporal constraints.
      • Only some of the API calls’ arguments impose “true” dependences.
        • It is infeasible to learn value-flow constraints on all arguments.
      • Specifications may impose only a partial order.
        • Encoding all legal partial orders would produce a huge FSM.
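
One way to picture the first and third challenges is a filter that pulls a single scenario out of an interleaved trace by following shared values. This is a simplified sketch under stated assumptions, not the authors' procedure: the set of relevant call names is supplied by hand, arguments already judged dependence-free are written as `"_"`, and dependences are followed forward only, in a single pass. The event names follow the masked trace on the trace abstraction slide; `extract_scenario` is a hypothetical name.

```python
def extract_scenario(trace, seed_index, relevant):
    """Collect later events that share a value with the scenario built so far."""
    _, seed_args = trace[seed_index]
    live = {a for a in seed_args if a != "_"}   # values the scenario depends on
    scenario = [trace[seed_index]]
    for name, args in trace[seed_index + 1:]:
        if name not in relevant:                # hand-picked "useless" events
            continue
        values = {a for a in args if a != "_"}
        if values & live:                       # shares a value: same scenario
            scenario.append((name, args))
            live |= values
    return scenario

trace = [
    ("h", ("_", 5)), ("c", (10,)), ("a", (4, 5)), ("d", (4, 7)),
    ("b", ("_", 5)), ("f", (10,)), ("h", ("_", 11)), ("e", (7,)),
    ("f", ("_",)), ("d", ("_", "_")), ("c", (7,)), ("a", (9, 11)),
    ("b", ("_", 11)), ("d", (9, "_")), ("e", ("_",)), ("f", ("_",)),
]
scenario = extract_scenario(trace, 0, relevant={"h", "a", "b", "d", "e"})
print([name for name, _ in scenario])  # ['h', 'a', 'd', 'b', 'e']
```

Note that `c(7)` shares the value 7 with the scenario and is excluded only because `c` is not in the relevant set, which shows why value flow alone does not separate useful from useless events.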

  11. Trace abstraction
      [Animated figure; recoverable stages, in order:]
      Raw trace: h(3, 5) c(10) a(4, 5) d(4, 7) b(0, 5) f(10) h(8, 11) e(7) f(50) d(15, 1) c(7) a(9, 11) b(6, 7) d(9, 14) f(20) e(7) …
      Dependence-free arguments masked: h(_, 5) c(10) a(4, 5) d(4, 7) b(_, 5) f(10) h(_, 11) e(7) f(_) d(_, _) c(7) a(9, 11) b(_, 11) d(9, _) e(_) f(_) …
      Extracted scenario (values blanked): h(_, ) a( , ) d( , ) b(_, ) e( )
      Values replaced by names: h(_, X) a(Y, X) d(Y, Z) b(_, X) e(Z)
      Reordered: h(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z)
      Also shown: h(_, X) a(Y, X) b(_, X) d(Y, Z)
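
The step from concrete values to names such as X, Y, Z can be sketched as a renaming by order of first appearance. This is an illustrative assumption about how the names on the slide are assigned; `standardize` and the fallback name format are hypothetical.

```python
def standardize(scenario, names="XYZ"):
    """Rename concrete values to variables in order of first appearance."""
    mapping = {}
    result = []
    for name, args in scenario:
        new_args = []
        for arg in args:
            if arg == "_":                 # masked argument stays masked
                new_args.append("_")
                continue
            if arg not in mapping:
                k = len(mapping)
                # Fall back to V3, V4, ... once the given names run out.
                mapping[arg] = names[k] if k < len(names) else f"V{k}"
            new_args.append(mapping[arg])
        result.append((name, tuple(new_args)))
    return result

scenario = [
    ("h", ("_", 5)), ("a", (4, 5)), ("d", (4, 7)), ("b", ("_", 5)), ("e", (7,)),
]
std = standardize(scenario)
print(" ".join(f"{n}({', '.join(a)})" for n, a in std))
# h(_, X) a(Y, X) d(Y, Z) b(_, X) e(Z)
```

After renaming, two scenarios that differ only in their concrete values (5, 4, 7 versus 11, 9, ...) become the same string, which is what lets a learner that only sees strings pick up value-flow constraints.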

  12. Preliminary experiments
      Attempted to learn and verify two published X Windows rules. As of Friday:
      • A timestamp-passing rule
        • learned the rule! (compact: 6 states)
        • bugs in 2 out of 17 programs (ups, e93)
      • SetOwner(x) must be followed by GetSelection(x)
        • failed to learn the rule (small learning set), but
        • bugs in 2 out of 5 programs (xemacs, ups)

  13. Related work
      Arithmetic pre/post conditions
      • Daikon, Houdini
      • properties orthogonal to ours
      • eventually, we may need to include and learn some arithmetic relationships
      Temporal relationships over calls
      • intrusion detection: [Ghosh et al], [Wagner and Dean]
      • software processes: [Cook and Wolf]
      • error checking: [Engler et al SOSP 2001]
        • lexical and syntactic pattern matching
        • user must write templates (e.g., <a> always follows <b>)

  14. Ongoing work
      • Mechanize the tool.
      • Find more gold.

  15. Future work
      Give gold to jewelers.
      [Diagram; labels: code, Mining, specifications, bugs, inputs, ?; tools: ESP, Vault, SPIN, Verisoft, SLAM, …]

  16. Summary
      • Semi-automatically creating well-formed, non-trivial specifications is an important part of the verification tool chain.
      • Contributions:
        • introduced specification mining
        • phrased it as probabilistic learning from dynamic traces
        • decomposed it into a sequence of subproblems (using an off-the-shelf learner)
        • developed a dynamic checker
        • found bugs

  17. Discussion
      Expressibility
      • what classes of properties can/should we learn?
      • can we learn more than we can check?
      • can a single-threaded specification avoid race conditions?

  18. Backup Slides
