Using Coq to generate and reason about x86 systems code
Andrew Kennedy & Nick Benton (MSR Cambridge), Jonas Jensen (ITU Copenhagen)

The big picture
• Compositional specification and verification of high-level behavioural properties of low-level systems code
• Previous work of Benton et al. employed idealized machine code
  • Simple design
  • Infinite memory; pointers are natural numbers
• It’s time to get real(ish): hence, x86

Overview of talk
• Modelling x86: bits, bytes, instructions, execution
• Generating x86: assembling & compiling
• Reasoning about x86: logic & proofs
• Discussion

Our approach
• Clean slate: trusted base is just hardware and its model in Coq †
  • No dependencies on legacy code, languages, compilers, or software architectures
  • Verify everything – including (at some point) loader-verifier
• Do everything in Coq, making effective use of computation, notation, type classes, tactics, etc.
  • No dependencies on external tools
  • Coq as “world’s best macro assembler”
† And a small boot loader

Bits, bytes and words
• We want to compute correctly and efficiently inside Coq
  • Proper modelling of n-bit words, arithmetic with carry, sign, overflow, rotates, shifts, padding, the lot, all O(n)
  • Generic over word-length, so index type by n : nat
• We also want to reason soundly inside Coq
  • Associativity, commutativity, order properties, etc.
• Compute here: n-tuples of bools
• Reason here: 'Z_(2^n) from the ssreflect library, reusing its lemmas

Example: definition of addition
• Definition is very algorithmic, so we can compute!
• Effective use of dependent types
• Performance inside Coq? On this machine, about 2000 additions a second

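To make the "n-tuples of bools" representation concrete, here is a minimal Python sketch of an algorithmic, O(n) addition with carry. It is not the authors' Coq definition: the names `from_int`, `to_int`, `adc` and `add`, and the least-significant-bit-first ordering, are assumptions of this sketch, and the width check happens at run time rather than in the type as it would with a word type indexed by n.

```python
from typing import Tuple

# A word is an n-tuple of booleans, least-significant bit first
# (the bit order is an assumption of this sketch).
Bits = Tuple[bool, ...]


def from_int(n: int, value: int) -> Bits:
    """Encode value modulo 2**n as an n-tuple of booleans."""
    return tuple(bool((value >> i) & 1) for i in range(n))


def to_int(bits: Bits) -> int:
    """Decode a bit tuple to a natural number below 2**len(bits)."""
    return sum(1 << i for i, b in enumerate(bits) if b)


def full_adder(a: bool, b: bool, carry: bool) -> Tuple[bool, bool]:
    """Return (carry_out, sum_bit) for one bit position."""
    total = int(a) + int(b) + int(carry)
    return total >= 2, bool(total & 1)


def adc(carry: bool, x: Bits, y: Bits) -> Tuple[bool, Bits]:
    """Ripple-carry addition in O(n), returning (carry_out, n-bit sum)."""
    assert len(x) == len(y), "operands must have the same width"
    out = []
    for a, b in zip(x, y):
        carry, s = full_adder(a, b, carry)
        out.append(s)
    return carry, tuple(out)


def add(x: Bits, y: Bits) -> Bits:
    """n-bit addition that discards the final carry (wraps modulo 2**n)."""
    return adc(False, x, y)[1]
```

Here `to_int` plays the role of a map into the integers modulo 2^n: `to_int(add(x, y))` agrees with `(to_int(x) + to_int(y)) % 2**n`, which is the kind of homomorphism property that lets ring lemmas be reused.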
Example: proofs about addition
1. Deal with the n=0 case
2. Apply injectivity of toZp to work in 'Z_(2^n): forall x y, toZp x = toZp y -> x = y
3. Rewrite using homomorphism lemmas, e.g. toZp (addB p1 p2) = (toZp p1 + toZp p2)%R
4. Apply the ssreflect “ring” lemma for 'Z_(2^n)

Machine state
• Register state is just a total function
• Flags can take on an undefined value (see later)
• Abstractly, memory is a partial function from DWORD to BYTE
  • Partiality represents whether memory is mapped and accessible
  • Concretely, for efficiency, a trie-like structure

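The three components of the state can be sketched in Python as follows. This is an illustration only: the register and flag names listed, the `State` class and its methods are assumptions of the sketch, and a plain dictionary stands in for the trie-like memory structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Register and flag names are assumptions of this sketch.
REGISTERS = ("EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP", "EIP")
FLAGS = ("CF", "PF", "ZF", "SF", "OF")


@dataclass
class State:
    # Total function: every register always holds a 32-bit value.
    regs: Dict[str, int] = field(
        default_factory=lambda: {r: 0 for r in REGISTERS})
    # None models the "undefined" flag value.
    flags: Dict[str, Optional[bool]] = field(
        default_factory=lambda: {f: None for f in FLAGS})
    # Partial function: an absent address is unmapped or inaccessible.
    mem: Dict[int, int] = field(default_factory=dict)

    def read_byte(self, addr: int) -> Optional[int]:
        """Return the byte at addr, or None if addr is not mapped."""
        return self.mem.get(addr & 0xFFFFFFFF)

    def write_byte(self, addr: int, value: int) -> bool:
        """Store a byte; report False instead of mapping new memory."""
        addr &= 0xFFFFFFFF
        if addr not in self.mem:
            return False
        self.mem[addr] = value & 0xFF
        return True

    def read_dword(self, addr: int) -> Optional[int]:
        """Read four little-endian bytes, failing if any is unmapped."""
        raw = [self.read_byte(addr + i) for i in range(4)]
        if any(b is None for b in raw):
            return None
        return int.from_bytes(bytes(raw), "little")
```

Reads and writes to unmapped addresses fail rather than extending the map, which keeps "mapped and accessible" a property of the state rather than of the access.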
x86 instructions
• x86 is notoriously large and baroque (the instruction set manual alone is 1640 pages long)
• Subset only: no legacy 16-bit mode, flat memory model (no segment nonsense), no floating point, no SIMD instructions, no protected-mode instructions, no 64-bit mode (yet)
• Actually not too bad: it is possible to factor the instruction set so that the Coq datatype is “total” (no junk)

Addressing modes
• e.g. ADD EBX, EDI + [EDX*4] + 12

Instruction format
• Manuals don’t reveal much “structure” (such as it is) in the instruction format
• But it can be discerned, and utilised for concise decoding functions

Instruction decoding
• Uses monadic syntax; the reader reads from memory and advances a pointer
• Note: there may be many instruction formats for the same instruction

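A reader in this style can be sketched in Python as a function from memory and a position to a result and a new position, combined with `unit` and `bind`. Everything named here (`unit`, `bind`, `read_signed`, `decode_instr`, the tuple form of instructions) is an assumption of the sketch rather than the authors' Coq code. Only four opcodes are handled: `0x90` (NOP), `0xC3` (RET), `0xEB` (JMP rel8) and `0xE9` (JMP rel32).

```python
from typing import Callable, Dict, Optional, Tuple

Memory = Dict[int, int]
# A reader takes (memory, position) and returns (value, new position),
# or None when it runs into unmapped memory or an unknown format.
Reader = Callable[[Memory, int], Optional[Tuple[object, int]]]


def unit(value) -> Reader:
    """Reader that consumes nothing and returns value."""
    return lambda mem, pos: (value, pos)


def bind(reader: Reader, k: Callable[[object], Reader]) -> Reader:
    """Run reader, then feed its result to k at the advanced position."""
    def run(mem, pos):
        result = reader(mem, pos)
        if result is None:
            return None
        value, next_pos = result
        return k(value)(mem, next_pos)
    return run


def read_byte(mem: Memory, pos: int):
    """Primitive reader: one byte, advancing the pointer by one."""
    return (mem[pos], pos + 1) if pos in mem else None


def read_signed(n_bytes: int) -> Reader:
    """Reader for a little-endian signed integer of n_bytes bytes."""
    def run(mem, pos):
        if any(pos + i not in mem for i in range(n_bytes)):
            return None
        raw = bytes(mem[pos + i] for i in range(n_bytes))
        return int.from_bytes(raw, "little", signed=True), pos + n_bytes
    return run


def jmp_to(displacement: int) -> Reader:
    """Turn a relative displacement into a JMP to an absolute address."""
    # pos is already the address just past the instruction.
    return lambda mem, pos: (("JMP", (pos + displacement) & 0xFFFFFFFF), pos)


def decode_by_opcode(opcode: int) -> Reader:
    if opcode == 0x90:
        return unit(("NOP",))
    if opcode == 0xC3:
        return unit(("RET",))
    if opcode == 0xEB:                       # JMP rel8
        return bind(read_signed(1), jmp_to)
    if opcode == 0xE9:                       # JMP rel32
        return bind(read_signed(4), jmp_to)
    return lambda mem, pos: None             # outside this tiny subset


decode_instr: Reader = bind(read_byte, decode_by_opcode)
```

The bytes `EB FE` at address 0x100 and `E9 FB FF FF FF` at address 0x200 both decode to a jump to their own address, which illustrates two formats for the same instruction.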
Instruction execution
• Currently, a partial function from State to State
• Implemented in monadic style, using “primitive” operations of r/w register, r/w flag, r/w memory, etc.
• Factored to re-use common patterns, e.g. evalMemSpec, evalSrc
• Example fragment: call and return

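As an illustration of execution as a partial function, the following Python sketch gives call and return in terms of push and pop primitives, returning `None` when a memory access faults. The state layout (a register dictionary paired with a memory dictionary), the helper names, and the assumption that EIP has already been advanced past the CALL instruction are all choices made for this sketch.

```python
from typing import Dict, Optional, Tuple

MASK32 = 0xFFFFFFFF
Regs = Dict[str, int]
Memory = Dict[int, int]
State = Tuple[Regs, Memory]


def read_dword(mem: Memory, addr: int) -> Optional[int]:
    """Read a little-endian DWORD; None if any byte is unmapped."""
    addrs = [(addr + i) & MASK32 for i in range(4)]
    if any(a not in mem for a in addrs):
        return None
    return int.from_bytes(bytes(mem[a] for a in addrs), "little")


def write_dword(mem: Memory, addr: int, value: int) -> Optional[Memory]:
    """Return an updated copy of mem; None if any byte is unmapped."""
    addrs = [(addr + i) & MASK32 for i in range(4)]
    if any(a not in mem for a in addrs):
        return None
    updated = dict(mem)
    for a, b in zip(addrs, (value & MASK32).to_bytes(4, "little")):
        updated[a] = b
    return updated


def push(state: State, value: int) -> Optional[State]:
    """Decrement ESP by 4 and store value at the new top of stack."""
    regs, mem = state
    esp = (regs["ESP"] - 4) & MASK32
    new_mem = write_dword(mem, esp, value)
    if new_mem is None:
        return None
    return {**regs, "ESP": esp}, new_mem


def pop(state: State) -> Optional[Tuple[int, State]]:
    """Load the DWORD at ESP and increment ESP by 4."""
    regs, mem = state
    value = read_dword(mem, regs["ESP"])
    if value is None:
        return None
    return value, ({**regs, "ESP": (regs["ESP"] + 4) & MASK32}, mem)


def exec_call(state: State, target: int) -> Optional[State]:
    """CALL: push the return address (current EIP), then jump to target."""
    pushed = push(state, state[0]["EIP"])
    if pushed is None:
        return None
    regs, mem = pushed
    return {**regs, "EIP": target & MASK32}, mem


def exec_ret(state: State) -> Optional[State]:
    """RET: pop the return address into EIP."""
    popped = pop(state)
    if popped is None:
        return None
    return_addr, (regs, mem) = popped
    return {**regs, "EIP": return_addr}, mem
```

Each failure is threaded through by hand here; a monadic style packages exactly this plumbing so that the instruction semantics read as straight-line code.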
Representing non-determinism and under-specification
• For sequential x86, for the subset we care about, almost completely deterministic
• Flags are the main issue
  • Introduce “undefined” state for flags
  • Instructions that depend on a flag whose value is undefined (e.g. branch-on-carry) then have unspecified behaviour
  • An alternative would be to set flags non-deterministically (cf. RockSalt)

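The "undefined" flag state can be sketched in Python with a three-valued flag, where `None` stands for undefined. The function names and the choice to return `None` for unspecified behaviour are assumptions of this sketch, not the authors' definitions.

```python
from typing import Dict, Optional

# Each flag is True, False, or None (the "undefined" state).
Flags = Dict[str, Optional[bool]]


def forget_flag(flags: Flags, name: str) -> Flags:
    """Model an instruction that leaves the named flag undefined."""
    return {**flags, name: None}


def branch_on_carry(flags: Flags, eip: int, target: int) -> Optional[int]:
    """Next EIP for a branch-on-carry, or None if CF is undefined."""
    carry = flags["CF"]
    if carry is None:
        return None          # behaviour is unspecified
    return target if carry else eip
```

With this reading, a program proved safe never branches on a flag it has not set, so the proof holds whatever value the hardware actually leaves in that flag.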
Instruction encoding
• Directly represent encoding by list of bytes
• Note: encoding is position-dependent
• In future we might mirror decoding using a monadic style

Jumps and labels
• Targets of jumps and branches are just absolute addresses in the Instr type
• To write assembler code we want labels; for this we use a kind of HOAS type

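The idea of a higher-order abstract syntax (HOAS) for labels is that a label binder is represented by a function of the host language, so scoping and renaming come for free. A minimal Python sketch follows; the constructor names (`Instr`, `Jmp`, `Label`, `Seq`, `LocalLabel`), the `forever` macro and the `show` printer are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass(frozen=True)
class Instr:
    text: str            # stand-in for a real instruction value


@dataclass(frozen=True)
class Jmp:
    target: object       # whatever the enclosing binder supplied


@dataclass(frozen=True)
class Label:
    at: object           # marks the current position as this label


@dataclass(frozen=True)
class Seq:
    first: "Program"
    second: "Program"


@dataclass(frozen=True)
class LocalLabel:
    # The binder: a host-language function from a label to a program.
    body: Callable[[object], "Program"]


Program = Union[Instr, Jmp, Label, Seq, LocalLabel]


def forever(body: Program) -> Program:
    """Macro: loop over body using a label nobody else can see."""
    return LocalLabel(lambda top: Seq(Label(top), Seq(body, Jmp(top))))


def show(prog: Program, counter: Optional[List[int]] = None) -> List[str]:
    """Print a program, inventing a fresh name for each bound label."""
    counter = counter if counter is not None else [0]
    if isinstance(prog, Instr):
        return ["    " + prog.text]
    if isinstance(prog, Jmp):
        return [f"    JMP {prog.target}"]
    if isinstance(prog, Label):
        return [f"{prog.at}:"]
    if isinstance(prog, Seq):
        return show(prog.first, counter) + show(prog.second, counter)
    if isinstance(prog, LocalLabel):
        name = f"L{counter[0]}"
        counter[0] += 1
        return show(prog.body(name), counter)
    raise TypeError(f"not a program: {prog!r}")
```

Nesting `forever(forever(Instr("NOP")))` yields two distinct labels without any manual renaming, because each use of the macro introduces its own binder.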
Syntax matters
• Cute use of notation in Coq: can write assembler code more-or-less using syntax of real assemblers!
• But also make use of Coq definitions, and “macros”
• Callouts on the example: label binding, while macro, label

Assembling
• Given an assembler program and an address to locate it, we can produce a sequence of bytes in the usual “two-pass” way

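A two-pass assembler over programs with function-bound labels can be sketched in Python as below. The first pass instantiates every label with a placeholder and records where each label actually lands; the second pass instantiates labels with those addresses and emits the bytes. The tuple-based program format, `Ref`, `one_pass` and `assemble` are assumptions of this sketch, and the only jump encoding used is `E9` (JMP rel32), so instruction sizes never depend on label values.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Ref:
    index: int   # which binder introduced this label
    addr: int    # the address currently assumed for it


def one_pass(prog, start: int, guesses: Dict[int, int]) -> Tuple[bytes, Dict[int, int]]:
    """Emit bytes for prog at start; also report where labels landed."""
    found: Dict[int, int] = {}
    counter = [0]

    def go(p, addr: int) -> bytes:
        kind = p[0]
        if kind == "bytes":                       # ("bytes", raw)
            return p[1]
        if kind == "jmp":                         # ("jmp", Ref or address)
            target = p[1].addr if isinstance(p[1], Ref) else p[1]
            # rel32 is relative to the end of the 5-byte instruction.
            rel = (target - (addr + 5)) & 0xFFFFFFFF
            return b"\xE9" + rel.to_bytes(4, "little")
        if kind == "label":                       # ("label", Ref)
            found[p[1].index] = addr
            return b""
        if kind == "seq":                         # ("seq", p1, p2)
            first = go(p[1], addr)
            return first + go(p[2], addr + len(first))
        if kind == "local":                       # ("local", function)
            index = counter[0]
            counter[0] += 1
            return go(p[1](Ref(index, guesses.get(index, 0))), addr)
        raise ValueError(f"unknown program node: {kind}")

    return go(prog, start), found


def assemble(prog, start: int) -> bytes:
    """Two passes: discover label addresses, then emit with them."""
    _, labels = one_pass(prog, start, {})
    code, _ = one_pass(prog, start, labels)
    return code
```

A jump to a label inside the program assembles to the same bytes wherever the program is located, whereas `("jmp", 0x2000)` produces different bytes at different start addresses, which is the position dependence of encoding.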
Round-trip theorem
• Memory between offset and endpos contains bytes
• Memory between offset and endpos decodes to prog
• Statement of correctness uses overloaded “points-to” predicate, to be described later

Little languages
• Instead of trusting – or modelling – existing languages such as C, we plan to develop little languages inside Coq
• We have experimented with a tiny imperative language and its “compiler”, proved correct in Coq

Big picture
• Assertion logic: predicate on partial states, usual connectives + separating conjunction
• Specification logic over this, incorporates step-indexing and framing, with corresponding later and frame connectives
• Safety specification used to give rules for instructions, in CPS style, packaged as Hoare-style triples for non-jumpy instructions
• Treatment of labels makes for elegant definition and rules for macros (e.g. while, if)

Partial states
• Partiality denotes partial description, as usual for separation logic
• Not to be confused with use of partiality for flags (undefined state) and memory (un-mapped or inaccessible)

Assertion logic
• Assertions (= SPred) are predicates on partial states
• We define a separation logic of assertions, with usual connectives and rules
• Points-to predicate for memory is overloaded for different “decoders” of memory
  • Core definition: memory from p to q “decodes” to value x
  • x could be a BYTE, a DWORD, a seq BYTE or even an Instr

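For intuition, assertions over partial states can be sketched in Python as predicates on dictionaries that describe only part of memory, with separating conjunction splitting the dictionary into disjoint pieces. This is a simplification made for illustration: only memory is modelled, the points-to predicates take a single start address instead of a range from p to q, address wrap-around is ignored, and all names are assumptions of the sketch.

```python
from itertools import combinations
from typing import Callable, Dict

# A partial state describes some bytes of memory and nothing else.
Heap = Dict[int, int]
SPred = Callable[[Heap], bool]


def emp(h: Heap) -> bool:
    """Holds only of the empty partial state."""
    return not h


def points_to_byte(p: int, x: int) -> SPred:
    """Exactly one byte is described: address p holds x."""
    return lambda h: h == {p: x}


def points_to_dword(p: int, x: int) -> SPred:
    """Exactly four bytes from p hold the little-endian encoding of x."""
    expected = {p + i: b for i, b in enumerate(x.to_bytes(4, "little"))}
    return lambda h: h == expected


def sep(P: SPred, Q: SPred) -> SPred:
    """Separating conjunction: h splits into disjoint parts for P and Q."""
    def holds(h: Heap) -> bool:
        keys = list(h)
        for size in range(len(keys) + 1):
            for left in combinations(keys, size):
                h1 = {k: h[k] for k in left}
                h2 = {k: h[k] for k in keys if k not in left}
                if P(h1) and Q(h2):
                    return True
        return False
    return holds
```

Because the two parts must be disjoint, `sep(points_to_byte(p, x), points_to_byte(p, x))` holds of no state, while a DWORD points-to is equivalent to the separating conjunction of its four bytes.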
Safety
• Machine code does not “finish”, so the standard Hoare triple does not suit; also, code is mixed up with store
• So we define safe k P to mean “runs without faulting for k steps from any state satisfying P”
• Example: tight loop
• Example: jmp

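The meaning of safe k P can be illustrated with a toy machine in Python. The machine below understands only `0x90` (NOP) and `0xEB` (JMP rel8) and faults on anything else, including unmapped memory; P is approximated by an explicit list of states. The names `step`, `runs_for` and `safe` are assumptions of this sketch, not the authors' definitions.

```python
from typing import Dict, Iterable, Optional, Tuple

MASK32 = 0xFFFFFFFF
State = Tuple[int, Dict[int, int]]   # (EIP, memory)


def step(state: State) -> Optional[State]:
    """One step of a toy machine; None means the machine faults."""
    eip, mem = state
    opcode = mem.get(eip)
    if opcode == 0x90:                            # NOP
        return (eip + 1) & MASK32, mem
    if opcode == 0xEB:                            # JMP rel8
        disp = mem.get((eip + 1) & MASK32)
        if disp is None:
            return None
        if disp >= 0x80:                          # sign-extend the byte
            disp -= 0x100
        return (eip + 2 + disp) & MASK32, mem
    return None                                   # unmapped or unsupported


def runs_for(k: int, state: State) -> bool:
    """True if the machine takes k steps from state without faulting."""
    current: Optional[State] = state
    for _ in range(k):
        current = step(current)
        if current is None:
            return False
    return True


def safe(k: int, states: Iterable[State]) -> bool:
    """safe k P, with P given as an explicit collection of states."""
    return all(runs_for(k, s) for s in states)
```

The tight loop `EB FE` never finishes, yet it is safe for every k, which is why safety rather than a terminating Hoare triple is the basic notion. Safety is also downward closed: safe for k steps implies safe for fewer.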
Specification logic
• It’s painful working directly with safe: we must work explicitly with “step-index” k and “frame” R
• Instead, we define a specification logic in which a spec is a set S of pairs that builds in steps and frames

Connectives for spec logic
• We define a frame connective
  • It gives us a “frame rule” for specs, and distributes over other connectives
• To hide explicit step indices, we use a later connective and the Löb rule

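To see why a later connective and the Löb rule hide step indices, here is a Python sketch under strong simplifying assumptions: a spec is reduced to a downward-closed predicate on the step index alone (frames and assertions are dropped), later is given its standard step-indexed reading, and entailment is checked only up to a finite bound. None of these definitions is taken from the slides.

```python
from typing import Callable

# Simplified spec: a predicate on the step index k, assumed downward
# closed (if it holds at k, it holds at every smaller index).
Spec = Callable[[int], bool]


def later(S: Spec) -> Spec:
    """later S holds at 0, and at k + 1 whenever S holds at k."""
    return lambda k: k == 0 or S(k - 1)


def entails(S: Spec, T: Spec, bound: int) -> bool:
    """S entails T at every index below bound."""
    return all((not S(k)) or T(k) for k in range(bound))


def valid(S: Spec, bound: int) -> bool:
    """S holds at every index below bound."""
    return all(S(k) for k in range(bound))


def prefix(m: int) -> Spec:
    """The downward-closed spec that holds exactly at indices below m."""
    return lambda k: k < m


def loeb_holds(S: Spec, bound: int) -> bool:
    """Löb rule instance: if later S entails S, then S is valid."""
    return (not entails(later(S), S, bound)) or valid(S, bound)
```

Every downward-closed predicate on indices below a bound is some `prefix(m)`, so checking `loeb_holds(prefix(m), bound)` for all m up to the bound exercises the rule exhaustively in this simplified model. The rule is induction on k in disguise: to prove a spec about a loop, assume it holds one step later.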
Basic blocks
• Given our definitions of safety and points-to for instructions, we can mimic Hoare-style triples for basic blocks
• We can then derive familiar rules such as framing
• This is useful when proving straight-line machine code

Rules for instructions (I): no control flow
• Use Hoare-like triple

Rules for instructions (II): control flow
• Two possible continuations
• Explicit CPS-like use of safe

Reasoning with labels
• We overload “points-to” on assembler programs

Macros
• Our representation of scoped labels makes it easy to define macros that make use of labels internally – and derive rules for them

Proof support
• Very painful to work with assertions and specs using only primitive rules
• We have built Coq tactic support for
  • Basic simplification of formulae (AC of *, etc.)
  • Pulling out existential quantifiers automatically
• Greatly simplifies proving!

Status
• We can generate and prove correct tiny programs written in “Coq” assembler and a small while-language
• Binary generated by Coq can be run on “raw metal” (booted off a CD!)
Next steps
• Model of I/O, e.g. screen/keyboard; currently our “observable” is just “faulting”
• High-level model of processes
• Build and verify OS components such as scheduler, allocator, loader
• Eventual aim: process isolation theorem