Mining Specifications: (lots of) code → specifications

Glenn Ammons (Univ. of Wisconsin), Ras Bodík (Univ. of Wisconsin), Jim Larus (Microsoft Research)

Drivers wanted. Verification: beyond engine-less cars
Recent successes:
• specification languages
• checkers
• abstractors
What’s still missing?
• specifications

So who formulates specifications?
Programmers? Probably not.
Why they won’t:
• too busy; yet another language to learn?
• specifications aren’t cool.
Why they shouldn’t:
• may misunderstand usage rules.
• may not know all usage rules.
Mining specifications:
• Convenience.
• As in data mining, discover surprising rules.

Advantages of mining
Exploits the massive programmer effort reflected in the code.
• Programmers have already resolved many problems:
  • incomplete system requirements.
  • incomplete API documentation.
  • implementation-dependent rules.
• Want redundancy (without redundant programming)?
  • ask multiple programmers (and vote).

Our output: a specification
[Figure: finite-state machine whose transitions are labeled x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)]
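A specification of this kind can be written down as a transition table. The sketch below is only an illustration: the slide gives the edge labels but not the states, so the state names, the exact edges, and the helper `accepts` are assumptions.

```python
# Hypothetical encoding of the socket specification as a finite-state machine.
# Only the edge labels come from the slide; states and edges are assumed.
SOCKET_SPEC = {
    ("start", "x = socket()"): "created",
    ("created", "bind(x)"): "bound",
    ("bound", "listen(x)"): "listening",
    ("listening", "y = accept(x)"): "connected",
    ("connected", "read(y)"): "connected",
    ("connected", "write(y)"): "connected",
    ("connected", "close(y)"): "listening",
    ("listening", "close(x)"): "end",
}
ACCEPTING = {"end"}


def accepts(spec, accepting, events, start="start"):
    """Return True if the event sequence drives the FSM to an accepting state."""
    state = start
    for event in events:
        key = (state, event)
        if key not in spec:
            return False  # no transition for this event: usage violation
        state = spec[key]
    return state in accepting
```

Under these assumptions, a scenario that ends without close(x) is rejected because it stops in a non-accepting state.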

How do we mine?
• Underlying premise:
  • Even bad software is debugged enough to show hints of correct behavior.
• Maxim: Common usage is the correct usage.

Mining = machine learning
Reduce the problem to the well-known problem of learning regular languages.
Obstacles:
• bugs in the source code may be learned into the specification
• what is “common” behavior?
Solutions:
• learn from dynamic behavior
• learn probabilistically
Learn from traces into probabilistic FSMs.
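To make "learn probabilistically" concrete, here is a minimal sketch that counts how often each event follows each prefix and turns the counts into edge probabilities. It builds a frequency prefix tree only; it is not the learner used in the talk, and the function names and the `<end>` marker are assumptions.

```python
from collections import defaultdict


def build_prefix_tree(scenarios):
    """Build a frequency prefix tree over event strings.

    Returns (counts, children): counts[state][symbol] is how many scenarios
    took that edge; children[(state, symbol)] is the target state.
    State 0 is the root; "<end>" marks a scenario ending in that state.
    """
    counts = defaultdict(lambda: defaultdict(int))
    children = {}
    next_state = 1
    for scenario in scenarios:
        state = 0
        for symbol in scenario:
            counts[state][symbol] += 1
            if (state, symbol) not in children:
                children[(state, symbol)] = next_state
                next_state += 1
            state = children[(state, symbol)]
        counts[state]["<end>"] += 1
    return counts, children


def edge_probabilities(counts):
    """Normalize the outgoing counts of every state into probabilities."""
    probs = {}
    for state, outgoing in counts.items():
        total = sum(outgoing.values())
        probs[state] = {sym: n / total for sym, n in outgoing.items()}
    return probs
```

A rare, possibly buggy scenario then shows up as a low-probability edge instead of being learned as an equal alternative.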

Input: trace(s)
[Figure: the specification FSM, with transitions labeled x = socket(), bind(x), listen(x), y = accept(x), read(y), write(y), close(y), close(x)]
7 = socket(2, 1, 0);
bind(7, 0x400120, 16);
listen(7, 5);
8 = accept(7, 0x400200, 0x400240);
read(8, 0x400320, 255);
write(8, 0x400320, 12);
read(8, 0x400320, 255);
write(8, 0x400320, 7);
close(8);
10 = accept(7, 0x400200, 0x400240);
read(10, 0x400320, 255);
write(10, 0x400320, 13);
close(10);
close(7);
… …
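A trace in this textual form has to be turned into structured events before anything can be learned from it. The following is a rough sketch, assuming the `result = name(args);` layout shown on the slide; the regular expression and the name `parse_trace` are illustrative, not part of the original tool.

```python
import re

# Matches an optional "result = " prefix, a call name, and a flat argument list.
EVENT = re.compile(r"(?:(\w+)\s*=\s*)?(\w+)\((.*?)\)")


def parse_trace(text):
    """Parse trace text into (function, result, arguments) tuples.

    result is None for calls whose return value is not recorded.
    """
    events = []
    for ret, func, args in EVENT.findall(text):
        arg_list = [a.strip() for a in args.split(",")] if args.strip() else []
        events.append((func, ret or None, arg_list))
    return events
```

For example, `parse_trace("7 = socket(2, 1, 0); bind(7, 0x400120, 16);")` yields one event for `socket` with result `"7"` and one for `bind` with no result.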

The mining algorithm
[Figure: pipeline]
Mining: dynamic execution (traces) → trace abstraction → usage scenarios (strings) → (off-the-shelf) RegExp learner → generalized scenarios (probabilistic NFA) → extract heavy core (and approve) → specification (NFA)
Checking: dynamic execution to be checked (trace) + specification (NFA) → dynamic checker → OK/bug
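The slides do not define how the heavy core is extracted. One plausible reading, sketched here purely as an assumption, is to drop edges whose probability falls below a threshold and then discard anything no longer reachable from the start state. The edge representation, the name `heavy_core`, and the threshold value are all hypothetical.

```python
def heavy_core(edges, start, threshold=0.05):
    """Keep only frequently used edges reachable from the start state.

    edges maps (source, label) to (target, probability).
    Returns a plain FSM mapping (source, label) to target.
    """
    # Step 1: drop rare edges (assumed to reflect bugs or noise).
    kept = {key: dst for key, (dst, p) in edges.items() if p >= threshold}

    # Step 2: find the states still reachable from the start state.
    reachable = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for (src, _label), dst in kept.items():
            if src == state and dst not in reachable:
                reachable.add(dst)
                frontier.append(dst)

    # Step 3: discard edges that leave unreachable states.
    return {key: dst for key, dst in kept.items() if key[0] in reachable}
```

The "(and approve)" step on the slide suggests that a person still reviews the result, so a sketch like this would only propose a candidate specification.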

Trace abstraction: 4 challenges
• Traces interleave useful and useless events.
  • RegExp learner cannot separate them.
• Specifications must include both temporal and value-flow constraints.
  • RegExp learner is only good with temporal constraints.
• Only some of the API calls’ arguments impose “true” dependences.
  • Infeasible to learn value-flow constraints on all arguments.
• Specifications may impose only a partial order.
  • Encoding all legal partial orders would produce a huge FSM.

Trace abstraction
[Figure: a trace is abstracted step by step into usage scenarios]
Input trace: h(3, 5) c(10) a(4, 5) d(4, 7) b(0, 5) f(10) h(8, 11) e(7) f(50) d(15, 1) c(7) a(9, 11) b(6, 7) d(9, 14) f(20) e(7) …
Trace with some arguments masked (_): h(_, 5) c(10) a(4, 5) d(4, 7) b(_, 5) f(10) h(_, 11) e(7) f(_) d(_, _) c(7) a(9, 11) b(_, 11) d(9, _) e(_) f(_) …
Scenario skeleton: h(_, ) a( , ) d( , ) b(_, ) e( )
Scenarios with values replaced by names:
h(_, X) a(Y, X) d(Y, Z) b(_, X) e(Z)
h(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z)
h(_, X) a(Y, X) b(_, X) d(Y, Z)
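The last step in the figure, replacing concrete values by names, can be sketched as a renaming in order of first appearance. This is an illustrative guess at that step, not the talk's implementation; the event representation (a call name plus a list of arguments, with `None` for a masked argument) and the function `standardize` are assumptions.

```python
LETTERS = "XYZUVW"


def standardize(scenario):
    """Rename concrete argument values to X, Y, Z, ... by first appearance.

    scenario is a list of (function, args); None marks a masked argument.
    Equal values get the same name, which preserves value-flow constraints.
    """
    names = {}
    result = []
    for func, args in scenario:
        renamed = []
        for value in args:
            if value is None:
                renamed.append("_")
                continue
            if value not in names:
                index = len(names)
                # Fall back to V6, V7, ... once the letters run out.
                names[value] = LETTERS[index] if index < len(LETTERS) else f"V{index}"
            renamed.append(names[value])
        result.append(f"{func}({', '.join(renamed)})")
    return result
```

Applied to the scenario h(_, 5) a(4, 5) d(4, 7) b(_, 5) e(7), this produces h(_, X) a(Y, X) d(Y, Z) b(_, X) e(Z), so scenarios that differ only in concrete values map to the same string for the learner.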

Preliminary experiments
Attempted to learn and verify two published X Windows rules.
As of Friday:
• A timestamp-passing rule
  • learned the rule! (compact: 6 states)
  • bugs in 2 out of 17 programs (ups, e93)
• SetOwner(x) must be followed by GetSelection(x)
  • failed to learn the rule (small learning set), but
  • bugs in 2 out of 5 programs (xemacs, ups)

Related work
Arithmetic pre/post conditions
• Daikon, Houdini
• properties orthogonal to ours
• eventually, we may need to include and learn some arithmetic relationships
Temporal relationships over calls
• intrusion detection: [Ghosh et al], [Wagner and Dean]
• software processes: [Cook and Wolf]
• error checking: [Engler et al SOSP 2001]
  • lexical and syntactic pattern matching
  • user must write templates (e.g., <a> always follows <b>)

Ongoing work
Mechanize tool.
Find more gold.

Future work
[Figure: mining turns code into specifications; tools such as ESP, Vault, SPIN, Verisoft, SLAM, … appear alongside the labels "inputs", "bugs", and "?"]
Give gold to jewelers.

Summary
• Semi-automatically creating well-formed, non-trivial specifications is an important part of the verification tool chain.
• Contributions:
  • introduced specification mining
  • phrased it as probabilistic learning from dynamic traces
  • decomposed it into a sequence of subproblems (using an off-the-shelf learner)
  • developed a dynamic checker
  • found bugs

Discussion
Expressibility
• what classes of properties can/should we learn?
• can we learn more than we can check?
• can a single-threaded specification avoid race conditions?

Backup Slides
