1) DSM Protocol Synthesis 2) Verification of Memory Orderings the Utah Verifier group

The Utah Verifier group: Ganesh Gopalakrishnan, Ratan Nalumasu, Ravi Hosabettu, Rajnish Ghughal, Michael Jones, Abdel Mokkedem, Paliath Narendran (Visitor)
Department of Computer Science, University of Utah

Our long-term goals
• Application of FM during design, aided by
  • abstractions:
    • State-space reduction techniques (PO, …)
    • Automatic discovery of finite instantiations
  • specializations:
    • Special handling of queues
    • Specialized model-checkers (e.g. for memory models)
• Support for requirements capture & verification
  • structural constraints via dependent typing
  • behavioral aspects through high-level FSMs
• Support for synthesis (protocol, and cycle-level)
• Realistic case-studies

Our Project: Utah Verifier

Protocol Refinement Ratan Nalumasu Ganesh Gopalakrishnan (Submitted)

Motivations
• Distributed directory-based coherence protocols written “in the usual way” are difficult to understand and debug
  • low-level requests / acks / nacks don’t reveal *what* is being implemented
  • transient states are introduced and handled in an ad-hoc way
  • buffer allocation is not tied to desired high-level properties (e.g. progress)
  • verification is tedious because protocols are written at a low level, and is invalidated by any change of system parameters (e.g. queue sizes)

Example of problems due to “unexpected msgs”
[Figure: Cache Ctrlr and Directory Ctrlr exchange Req / Ack; when another Req arrives unexpectedly, one usually doesn’t know what to say, and saying nothing causes deadlock!]

Our approach
• Based on synthesis
  • transient states introduced automatically, as needed
  • buffer allocation is tied to desired high-level properties (e.g. progress), and can be handled systematically
  • verification becomes much easier
  • synthesized protocols seem efficient (performance simulation in progress; # of protocol msgs compared with hand-written protocols)

Overview of Synthesis Method
[Figure: Cache Ctrlr (states I, E) and Dir Ctrlr (states F, E) communicating via Req and (N)ack]

Model-checking Efficiency

An Illustration: Migratory Protocol (i)
[Figure: state machines for the two processes. Process ‘h’: states F, E, I1, I2, I3; actions r(i)?req, r(i)!gr(data), r(j)?req, r(o)!inv, r(o)?LR(data), r(o)?ID(data), r(j)!gr(data). Process ‘r(i)’: states I, V, V1, V2; actions rw, h!req, h?gr(data), evict, h!LR(data), h?inv, h!ID(data)]

An Illustration: Migratory Protocol (ii)
[Figure: state machines for the two processes. Process ‘h’: states F, E, I1, I2, I3; actions r(i)?req, r(i)!gr(data), r(j)?req, r(o)!inv, r(o)?LR(data), r(o)?ID(data), r(j)!gr(data). Process ‘r(i)’: states I, V, V1, V2; actions rw, h!req, h?gr(data), evict, h!LR(data), h?inv, h!ID(data)]

A Generic Example
[Figure: processes P, Q, R with rendezvous actions P?x, R?y, R!b, Q!c, Q!a]

Async Implementation of Example (i)
[Figure: processes P, Q, R with rendezvous actions P?x, R?y, Q!c, R!b, Q!a; asynchronous sends Q!!a, R!!b; 1 msg buffer location for Ack/Nack]

Async Implementation of Example (ii)
[Figure: processes P, Q, R with rendezvous actions P?x, R?y, Q!c, R!b, Q!a; Progress Buffer; asynchronous sends Q!!c, Q!!a, R!!b, P!!ack]

Organization of Protocol - per Cache Line
[Figure: Remote Nodes connected to the Home Node]
- Remote nodes (cache ctrlrs) communicate with the home directory controller only
- If Remote and Home requests cross in the medium,
  . Remote request is treated as Nack by Home
  . Home request is dropped by Remote
- Pt-to-pt order-preserving error-free communication

General Nature of Communication States
[Figure: (Remote) transient state T with actions h!msg, h?m1, h?m2; (Home) transient state T with actions r(i)?m1, r(j)!m2]

Summary: Remote node rules

Summary: Home node (i)

Summary: Home node (ii)

Status of Work
• Correctness of Protocol Synthesis Proved in PVS
  • Qa ->a Qa’ => abs(Qa) = abs(Qa’) \/ abs(Qa) ->r abs(Qa’)
• Write-invalidate protocol also synthesized
• Work in progress
  • Implementation of refinement technique
  • Making it more general (priority will be defined per message type pair; home can also nack remote)
  • Performance evaluation in progress
• Offers a general synthesis method for protocols (not necessarily for DSM)
• Related work: Buckley and Silberschatz, Chandra et al., Park and Dill, Gribomont, ...
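The correctness condition above says that every step of the asynchronous protocol either leaves the abstract state unchanged or maps to a step of the rendezvous protocol. A minimal sketch of checking that condition on an explicit, finite transition relation follows. The function name `check_refinement`, the toy states, and the abstraction map are hypothetical illustrations, not part of the PVS proof.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple

State = Hashable
Step = Tuple[State, State]


def check_refinement(async_steps: Iterable[Step],
                     rendezvous_steps: Set[Step],
                     abs_fn: Callable[[State], State]) -> List[Step]:
    """Return the async steps that break the condition (empty list = OK).

    Condition per step Qa ->a Qa':
        abs(Qa) = abs(Qa')  \\/  abs(Qa) ->r abs(Qa')
    """
    bad = []
    for qa, qa2 in async_steps:
        a, a2 = abs_fn(qa), abs_fn(qa2)
        stutter = (a == a2)                    # abstract state unchanged
        simulated = (a, a2) in rendezvous_steps  # matches a rendezvous step
        if not (stutter or simulated):
            bad.append((qa, qa2))
    return bad


# Toy example (hypothetical): one rendezvous step I -> V is implemented
# asynchronously via a transient state "I_req_sent".
rendezvous = {("I", "V")}
async_steps = [("I", "I_req_sent"), ("I_req_sent", "V")]
abs_map = {"I": "I", "I_req_sent": "I", "V": "V"}

print(check_refinement(async_steps, rendezvous, abs_map.get))  # []
```

The transient state is abstracted to the state it came from, so sending the request is a stuttering step and receiving the grant is the step that matches the rendezvous transition.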

Verification of Memory Orderings (Work in progress)

A Motivating Example
Uniprocessors: ok
  P1: write(a1,d1); read(a2)    can be reordered to    P1: read(a2); write(a1,d1)
Biprocessors: Not under S.C.
  P1: write(a1,d1); read(a2)    P2: write(a2,d2); read(a1)
  cannot be reordered to
  P1: read(a2); write(a1,d1)    P2: read(a1); write(a2,d2)
Uniprocessor optimizations don’t apply to multiprocessors. This is formalized by ordering rules (aka formal memory models).
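The biprocessor case can be checked by brute force: enumerate every interleaving of the two programs over one shared memory (which is what sequential consistency allows) and collect the values the two reads can return. The sketch below assumes both locations start at 0 and that d1 = d2 = 1; the helper names are illustrative only.

```python
def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    if not p1:
        yield list(p2)
        return
    if not p2:
        yield list(p1)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest


def run(schedule):
    """Execute a schedule on a single shared memory; return (P1 read, P2 read)."""
    mem = {"a1": 0, "a2": 0}          # assumed initial values
    regs = {}
    for proc, kind, addr in schedule:
        if kind == "w":
            mem[addr] = 1             # assumed d1 = d2 = 1
        else:
            regs[proc] = mem[addr]
    return regs["P1"], regs["P2"]


def outcomes(p1, p2):
    return {run(s) for s in interleavings(p1, p2)}


W1, R1 = ("P1", "w", "a1"), ("P1", "r", "a2")
W2, R2 = ("P2", "w", "a2"), ("P2", "r", "a1")

program_order = outcomes([W1, R1], [W2, R2])
reordered = outcomes([R1, W1], [R2, W2])

print((0, 0) in program_order)  # False
print((0, 0) in reordered)      # True
```

Both reads returning the initial value is impossible when each processor issues its write first, but becomes possible once the read is hoisted above the write, which is why the uniprocessor-safe reordering is visible on a multiprocessor.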

Importance
• HW designers desire early feedback on what memory orderings are implied by various design decisions:
  • Coherence protocols used in busses
  • CPU ordering rules
• Can also arise in “non multiprocessor situations”
• Present approaches:
  • Theorem-proving based
  • By applying model-checking techniques
    • require formulation of the desired mem model as temporal logic formulae that can get really ugly - even for toy problems

Collier’s work
• Formal definitions for memory ordering rules
• Architectural tests such that, if a test is “run sufficiently long” and fails, then one of the ordering rules being tested is violated
• Tests are strong enough to seem to imply that when the checks pass the arch. rules are indeed obeyed (need to prove this)
• Program ARCHTEST embodies these tests - available under free license to schools...

Our work
• Adapt Collier’s programs for testing machines for memory ordering rules to model-checking
  • Tests are known for many memory ordering rules
  • Tests are independent of the system being verified
• Natural to apply in a design situation
  • Can test HDL models of the architecture under design
  • Can handle very detailed designs involving CPU ordering rules, split transaction SMP busses, pipelined arbitration, pre-fetching ownership of cache lines, multiple outstanding transactions, … Our model is such a design, written in VIS Verilog
• Memory ordering rule(s) to be verified are not case-specific temporal formulae - they are fixed FSMs!
• Offers model-checkable sufficient conditions for a machine to obey desired orderings

An Example of Collier’s Test for (CMP,RO,WO):
CMP: Computational ordering (captures how individual machine instructions execute in, and across, machines), e.g. P1 P2 a := 1 --- b := a+1; result: a=0, b=2 is a CMP violation.
RO (WO): Reads (writes) in a single process are completed in program order.

An Example of Collier’s Test for (CMP,RO,WO):
P1:          P2:
A := 1       X[1] := A
A := 2       X[2] := A
A := 3       X[3] := A
…            …
A := k       X[k] := A
Check that every X[i] is in 1..k and monotonically increasing. Run for all possible “k” to get all randomization effects taken care of.
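The check on the recorded array X can be sketched as a small function. This is an illustration, not ARCHTEST code: the name `find_violation` is hypothetical, the list is 0-indexed (so `x[0]` stands for X[1]), and "monotonically increasing" is read as non-decreasing, on the assumption that P2 may observe the same value of A more than once.

```python
def find_violation(x, k):
    """Return None if x passes the test, else a tuple describing the failure.

    x : values recorded by P2, in program order (x[0] is X[1])
    k : number of writes A := 1 .. A := k issued by P1
    """
    # Every recorded value must be one that P1 actually wrote.
    for i, v in enumerate(x):
        if not 1 <= v <= k:
            return ("range", i)
    # A later read must never return an older value than an earlier read.
    for i in range(len(x) - 1):
        if x[i] > x[i + 1]:
            return ("order", i, i + 1)
    return None


print(find_violation([1, 1, 2, 4, 4], 4))  # None
print(find_violation([1, 3, 2, 4], 4))     # ('order', 1, 2)
```

Checking adjacent pairs is enough, because any pair i < j with x[i] > x[j] implies some adjacent pair in between is also out of order.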

Why this tests for (CMP,RO,WO):
Consider a violation:
P1:          P2:
A := 1       X[ 1 ] := A
…            …
A := p       X[ i ] := A ==> q
…            …
A := q       X[ j ] := A ==> p
then, assuming CMP and RO, WO is violated (formally proved by Collier; however, need to run for “large k”.)
How to formally check for all k ?

Ans: (i) Invoke “data independence”
- Consider a Formal Model of the Architecture
  [Figure: MP Machine with its history streams]
  Wr(A,22), ... Wr(A,33), ...
  Wd(...), ... Wd(...), ...
  Rr1(A), ... Rr2(A), ...
  Rd1(A,33), ... Rd2(A,22), ...
- Invoke address, data, program-length independence (history streams can be alpha-converted to fresh values)
  [Figure: MP Machine with alpha-converted history streams]
  Wr(A,72), ... Wr(A,53), ...
  Wd(...), ... Wd(...), ...
  Rr1(A), ... Rr2(A), ...
  Rd1(A,53), ... Rd2(A,72), ...

(ii) Simulate every violation with a “0/1” violation
(Depiction of i, j, p, q for which there is a violation; the 0/1 value that stands in for each of p and q is shown in brackets)
P1:               P2:
…                 …
A := p  [0]       X[ i ] := A ==> q  [1]
…                 …
A := q  [1]       X[ j ] := A ==> p  [0]
The condition being tested doesn’t depend on the actual values of p and q; it simply cares that q > p

(iii) Construct the following “verification run”
[Figure: the writes A := 0, A := 1 are fed to the MP Machine; the observed reads A => 0, A => 1, A => 0 end in Error]
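The fixed checker for this run is a two-state machine over the stream of values returned by reads of A: once a read has returned 1, a later read returning 0 is an error. The sketch below is one possible rendering of that monitor, together with a hypothetical threshold map `to_zero_one` illustrating how a violation with arbitrary values p < q collapses to a 0/1 violation. Neither function is taken from the Verilog model.

```python
def zero_one_monitor(read_values):
    """Two-state monitor: 'Error' if a read of 0 follows a read of 1."""
    seen_one = False
    for v in read_values:
        if v == 1:
            seen_one = True
        elif v == 0 and seen_one:
            return "Error"
    return "OK"


def to_zero_one(values, q):
    """Assumed abstraction: values below q become 0, values >= q become 1."""
    return [1 if v >= q else 0 for v in values]


print(zero_one_monitor([0, 1, 0]))                   # Error
print(zero_one_monitor([0, 0, 1, 1]))                # OK
# Reads returning q = 9 and then the older value p = 7:
print(zero_one_monitor(to_zero_one([5, 9, 7], 9)))   # Error
```

Because the monitor has a fixed, tiny state space, it can be composed with the design under test and model-checked without writing a design-specific temporal formula.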

Collier’s Test for (CMP,RO,WO, WA):
P1:        P2:              P3:              P4:
A := 1     U[i] := A        X[j] := B        B := 1
A := 2     V[i] := B        Y[j] := A        B := 2
A := 3     U[i+1] := A      X[j+1] := B      …
…          V[i+1] := B      Y[j+1] := A
           …                …
Check that for all i,j : V[i] >= X[j] OR Y[j] >= U[i]
WA: All writes are seen in the same order
(CMP, RO, WO, WA, and program order PO together result in seq consistency. PO is tested separately by a dedicated test similar to these…)
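The pairwise check on the four recorded arrays can be written directly from the condition above. The function name `check_wa` and the sample arrays are hypothetical; the arrays are 0-indexed Python lists.

```python
def check_wa(U, V, X, Y):
    """Return every (i, j) that violates  V[i] >= X[j]  OR  Y[j] >= U[i].

    U, V : values of A and B recorded by P2 (same length)
    X, Y : values of B and A recorded by P3 (same length)
    """
    return [(i, j)
            for i in range(len(U))
            for j in range(len(X))
            if not (V[i] >= X[j] or Y[j] >= U[i])]


# Both observers see the writes in a consistent order: no violation.
print(check_wa(U=[1, 2], V=[1, 2], X=[1, 2], Y=[1, 2]))  # []

# P2 sees the newer A with the older B, P3 the newer B with the older A.
print(check_wa(U=[2], V=[1], X=[2], Y=[1]))              # [(0, 0)]
```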

Simulating this violation with a “0/1” violation
Exists i,j,a1,a2,b1,b2: a2 > a1 /\ b2 > b1 /\ X[ j ] = b2 /\ V[ i ] = b1 /\ U[ i ] = a2 /\ Y[ j ] = a1
P1:          P2:                     P3:                     P4:
…            …                       …                       …
A := a1      U[ i ] := A => a2       X[ j ] := B => b2       B := b1
...          V[ i ] := B => b1       Y[ j ] := A => a1       ...
A := a2                                                      B := b2
…            …                       …                       ...
0/1 values standing in for the above: a1 = 0, b1 = 0, a2 = 1, b2 = 1, so U[ i ] = 1, V[ i ] = 0, X[ j ] = 1, Y[ j ] = 0
Again, the property does not care about the actual values of a1,a2,b1,b2

(iii) Construct the following “verification run”
[Figure: the writes A := 0, A := 1 and B := 0, B := 1 are fed to the MP Machine]
- Choose non-deterministic time “i” to record U[ i ] and V[ i ]
- Choose non-det time “j” to record X[ j ] and Y[ j ]
- Check condition
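As a sanity check of the 0/1 run, the sketch below replaces the MP Machine with an idealized single shared memory and enumerates every interleaving of the four processes. It is a toy stand-in, not the VIS Verilog model: the initial memory contents play the role of A := 0 and B := 0, each observer samples once, and all names are illustrative.

```python
def interleave(seqs):
    """Yield every merge of the sequences that keeps each one's own order."""
    if all(not s for s in seqs):
        yield []
        return
    for idx, s in enumerate(seqs):
        if s:
            rest = seqs[:idx] + [s[1:]] + seqs[idx + 1:]
            for tail in interleave(rest):
                yield [s[0]] + tail


# (kind, address, value-to-write or name-of-recorded-variable)
P1 = [("w", "A", 1)]
P2 = [("r", "A", "U"), ("r", "B", "V")]
P3 = [("r", "B", "X"), ("r", "A", "Y")]
P4 = [("w", "B", 1)]


def run_sc(schedule):
    """Execute a schedule against one shared memory; return recorded values."""
    mem = {"A": 0, "B": 0}   # stands in for the initial writes A := 0, B := 0
    obs = {}
    for kind, addr, arg in schedule:
        if kind == "w":
            mem[addr] = arg
        else:
            obs[arg] = mem[addr]
    return obs


def violates(obs):
    # Negation of the test condition: V < X and Y < U
    return obs["V"] < obs["X"] and obs["Y"] < obs["U"]


schedules = list(interleave([P1, P2, P3, P4]))
bad = [s for s in schedules if violates(run_sc(s))]
print(len(schedules), len(bad))  # 180 0
```

No interleaving over a single memory produces the violating observation, which is the expected result for a sequentially consistent machine; a design that lets P2 and P3 see the two writes in different orders would produce a non-empty `bad` list.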

Why this checks for (CMP,RO,WO, WA):
[Figure: the orders in which P1, P2, P3, P4 see a1, a2, b1, b2, with edges labelled WO and RO; the resulting orders show WA violated!]
Violation: V[i] < X[j] /\ Y[j] < U[i], i.e. b1 < b2 /\ a1 < a2

Our experiments done in the context of the HP Runway SMP bus
[Figure: a CPU with L1 cache, other CPUs, and the “Host” (mem + bus master) on the bus. Transaction: Request; Coherency Responses; Host Data Response (iff no cache has dirty copy); Cache-to-Cache write: Response, with results also posted onto memory]
- Many ordering constraints; e.g. “Delay Host Data Response for 3rd parties, till C2C results posted onto main memory…”
- Many other “real” aspects modeled - e.g. pre-fetching ownership, multiple outstanding misses, hit after miss causes stall (speculation happens here, but is not modeled), …

Results
• Verification of CMP, RO, WO, WA finishes in “reasonable” amounts of time (10 mins to 10 hours depending on model)
• PO verification in progress (one bug unearthed in OUR Verilog model - NOT HP spec!)
• Practical tool ARCHVER will be built
  • Menu-driven choice of orderings to be established
  • Test “automata” semi-automatically compiled
  • Need to provide intuitive feedback upon violations
  • Address / Data / Program-length independence can be syntactically guaranteed (Hojati, Ip, Dill, …)

Conclusions
• Formal methods can be productively used during system-level design
• Tools to support DSM protocol synthesis and verification of memory ordering rules are in the making
• The latter can be used to validate high-level DSM protocols too…
See http://www.cs.utah.edu/formal_verification for further news, or write to uv@cs.utah.edu
