Verifying MP Executions against Itanium Orderings using SAT

Ganesh Gopalakrishnan, Yue Yang, Hemanthkumar Sivaraj
School of Computing, University of Utah, Salt Lake City, UT 84112
Work supported in part by SRC Contract 1031.001 and NSF Award 0219805

Efficient Multiprocessors must have Efficient Shared Memory Systems
• Hide the cost of memory operations by postponing updates
• Increasingly important because CPUs are growing faster than memory systems are

How to build Efficient Shared-memory Multiprocessor Systems?
• Employ weak memory models
• They permit global state updates to be postponed
• Employ aggressive shared memory consistency protocols
• Weak memory models permit shared memory consistency protocols to be aggressive without undue complexity (no speculation, etc.; this remark has to do with how SC is implemented aggressively…)
The focus of this talk is on weak memory models

Weak memory models allow multiple executions...
Program (two CPUs sharing one memory):
  CPU: st c,1 ; st d,2
  CPU: ld d ; ld c
One possible execution:
  st c,1 ; st d,2
  ld d, 2 ; ld c, 0
  Impossible under SC, possible under Itanium
Another execution:
  st c,1 ; st d,2
  ld d, 2 ; ld c, 1
  Possible under SC and under Itanium

Problems with Weak Memory Models
• Hard to understand (easy to misunderstand)
  P: st [x] = 1 ; mf ; ld r1 = [y] <0>
  Q: st.rel [y] = 1
  R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
Is this legal under Itanium? (no)

Post-Si verification of MP Orderings today (oversimplified)
• Assembly programs 1 … n are run repeatedly on the new MP system to catch one interleaving that might reveal a bug
• Every resulting execution (assembly executions 1 … n) is checked against the ordering rules for compliance
• This is done ad hoc
• How to make this formal and efficient?
• How to capitalize on repeated re-runs?

Explanation of Illegal Executions (p. 31 of Itanium App Note; search 251429)
  P: us: st [x] = 1 ; mf: mf ; ul1: ld r1 = [y] <0>
  Q: sr: st.rel [y] = 1
  R: la: ld.acq r2 = [y] <1> ; ul2: ld r3 = [x] <0>
• US >> MF; hence RVr(US) → F(MF)
• MF >> UL1; hence F(MF) → R(UL1)
• …many reasons… hence R(UL1) → RVp(SR)
• If RVr(SR) → R(UL1) and RVr(SR) → UL1 → RVp(SR), WB release atomicity of SR is violated, thus R(UL1) → RVr(SR)
• …five lines of reasons… Hence RVr(SR) → R(LA)
• Since LA >> UL2, R(LA) → R(UL2)
• Another paragraph of reasons: LV(Sr2) → R(UL2) → LV(SR1) → RVp(SR1) → RVq(SR1) → F(MF1) → R(UL1) → RVq(SR2) → RVp(SR2). But this can’t be allowed due to atomicity of SR.

Checking Executions and Providing Explanations (present approach)
  P: st [x] = 1 ; mf ; ld r1 = [y] <0>
  Q: st.rel [y] = 1
  R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
• Published approaches are very labor-intensive paper-and-pencil proofs
• Clearly this can’t scale (a 6-instruction MP program takes 1 page of detailed mathematical proof)
• What about the combinatorics of reasoning about 200 instructions?
• Approaches actually used within the industry involve the use of “checkers”
• Details of these checkers are unknown (How complete? How scalable?)

Our Approach
Inputs:
• MP execution to be checked:
  P: st [x] = 1 ; mf ; ld r1 = [y] <0>
  Q: st.rel [y] = 1
  R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
• Itanium ordering rules written in Higher Order Logic
Flow:
• Mechanical program derivation turns the HOL rules into a checker program
• The checker program turns the execution into a satisfiability problem with clauses carrying annotations
• The Sat solver answers Sat or Unsat
• If Sat: explanation in the form of one possible interleaving
• If Unsat: Unsat core extraction using Zcore
• Find offending clauses
• Trace their annotations
• Determine “ordering cycle”

Largest example tried to date (courtesy S. Zeisset, Intel)
Proc 1:
  st8 [12ca20] = 7f869af546f2f14c
  ld r25 = [45180] <87b5e547172644a8>
  … 58 more instructions…
  st2 [7c2a00] = 4bca
Proc 2:
  ld4 r24 = [733a74] <415e304>
  st4.rel [175984] = 96ab4e1f
  … 67 more instructions…
  ld8 r87 = [56460] <b5c113d7ce4783b1>
• Initially the tool gave a trivial violation
• Diagnosed to be forgotten memory initialization
• Added method to incorporate memory initialization in our tool
• Our tool found the exact same cycle as pointed out by the author of the test
• Sat generation and Sat solving times need improving
Cycle found through our tool: st.rel (line 18, P1) → ld (line 22, P2) → mf → ld (line 30, P2) → st (line 11, P1)

Statistics Pertaining to Case Study
Proc 1:
  st8 [12ca20] = 7f869af546f2f14c
  ld r25 = [45180] <87b5e547172644a8>
  … 58 more instructions…
  st2 [7c2a00] = 4bca
Proc 2:
  ld4 r24 = [733a74] <415e304>
  st4.rel [175984] = 96ab4e1f
  … 67 more instructions…
  ld8 r87 = [56460] <b5c113d7ce4783b1>
• All runs were on a 1.733 GHz Athlon with 1 GB of memory running Red Hat Linux 9
• ~2 minutes to generate Sat instance
• 14,053,390 clauses
• 117,823 variables
• ~1 minute to solve Sat problem: found Unsat
• Unsat Core generation runs fast and gave 23 clauses!
• 23 of the 14M clauses were causing the problem to be Unsat
• Sat time for these 23 clauses … under a second
• Unsat Core’s annotations were traced back to offending instructions and the memory ordering rules that situated them in a “cycle”

The rest of the talk
• Itanium memory model in Higher Order Logic (well, not so high actually…)
• Our HOL specs → translation → “sat-generating checker programs”
• Execution to be checked → translation by above program to Sat
• Each assembly instruction → clauses it generates + annotations
• When Sat, what interleaving explains?
• When Unsat, how to get “core” (root-cause) + annotations on core
• Translating annotations on core to cycle on original program

Itanium memory model in Higher Order Logic (well, not so high actually…)
• The initial focus of our presentation:
• How to model an execution?
• Why use “split stores” in modeling?

Itanium memory model in Higher Order Logic (well, not so high actually…)
Basic problem-modeling idea: Find a “shuffle” of the instructions that explains the observations…
  P0: st [y] = 1 ; ld reg1 = [y] <1>
  P1: ld reg2 = [y] <1>
  Explanation: st [y] = 1 ; ld reg1 = [y] <1> ; ld reg2 = [y] <1>
The basic idea won’t always work …
  P0: st.rel [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
  P1: st.rel [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>
  (on each processor: data dependence from the store to the ld.acq, and ld.acq order from the ld.acq to the later ld)
No shuffle of these sequences respecting the marked orderings satisfies the read values

Problem Modeling…
Idea: Find a shuffle after each store is split into (p+1) copies…. (by the way, this idea has sort of become “standard”)
  P0: st [y] = 1
  P1: st [x] = 2
• st [y] = 1 is split into a local copy for P0, a “remote” copy for P0, and a “remote” copy for P1
• A similar split is applied to st [x] = 2
Now, arrange the split copies…

Problem Modeling…
  P0: st [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
  P1: st [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>
Split copies:
  st [y] = 1 “l”, st [y] = 1 “rp0”, st [y] = 1 “rp1”
  st [x] = 2 “l”, st [x] = 2 “rp0”, st [x] = 2 “rp1”
Now, arrange the split copies…
Explanation…
  st [y] = 1 “l”
  ld.acq r3 = [y] <1>
  st [x] = 2 “l”
  ld.acq r4 = [x] <2>
  st [y] = 1 “rp0”
  st [x] = 2 “rp1”
  ld reg1 = [x] <0>
  st [x] = 2 “rp0”
  ld reg2 = [y] <0>
  st [y] = 1 “rp1”
Dependencies: each ld.acq follows the local copy of the store it reads. Anti-dependencies: each ld that reads 0 precedes the remote copy of the store that would overwrite it.

Back to Itanium memory model in Higher Order Logic through an example
Informal statement: Store-releases to write-back memory become visible to all processors in the same order
Implementation: All copies of a “split st.rel” are visible atomically (the copies of st.rel [x] = 1 form an atomic set)

One standard way of specifying atomicity: All other events “e” are strictly before or strictly after the atomic set
Another standard way of specifying atomicity: If some event “e” is between two events in the atomic set, then “e” also belongs to the atomic set

atomicWBRelease rule (Section 3.3.7.1 of Intel App Note):
  atomicWBRelease(ops,order) =
    Forall (i in ops). (j in ops). (k in ops).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
      (i.wrID = k.wrID) /\ order(i,j) /\ order(j,k)
      ==> (j.wrID = i.wrID)
(In the ordering, j lies between i and k.)
We have reduced the ~36 page Intel App Note to ~3 pages of HOL rules (barring a few simple omissions…)

Basic idea behind Intel’s Formal Spec (which we follow in our formal spec)
Make it look like SC so that people have less trouble understanding!
  legalItanium(ops) = Exists order. (
    requireStrictTotalOrder ops order /\
    requireWriteOperationOrder ops order /\
    requireProgramOrder ops order /\
    requireMemoryDataDependence ops order /\
    requireDataFlowDependence ops order /\
    requireCoherence ops order /\
    requireAtomicWBRelease ops order /\
    requireSequentialUC ops order /\
    requireNoUCBypass ops order /\
    requireReadValue ops order )
  SC(ops) = Exists order. (
    requireStrictTotalOrder ops order /\
    requireProgramOrder ops order /\
    requireReadValue ops order )
The rules between requireStrictTotalOrder and requireReadValue are collectively called “otherOrder”.

But, how do we check executions against such specs?
  legalItanium(ops) = Exists order. (
    requireStrictTotalOrder ops order /\
    requireWriteOperationOrder ops order /\
    requireProgramOrder ops order /\
    requireMemoryDataDependence ops order /\
    requireDataFlowDependence ops order /\
    requireCoherence ops order /\
    requireAtomicWBRelease ops order /\
    requireSequentialUC ops order /\
    requireNoUCBypass ops order /\
    requireReadValue ops order )
  SC(ops) = Exists order. (
    requireStrictTotalOrder ops order /\
    requireProgramOrder ops order /\
    requireReadValue ops order )
Execution 1:
  st c,1 ; st d,2
  ld d, 2 ; ld c, 0
Execution 2:
  st c,1 ; st d,2
  ld d, 2 ; ld c, 1
E.g., which execution is legal under which memory model?
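
For the SC spec alone, the question on this slide can be answered by brute force. The sketch below is not the SAT-based tool described in this talk: it enumerates every shuffle of the two instruction streams that respects program order and replays each one against a memory assumed to start at 0. The `op` type and all function names are illustrative assumptions.

```ocaml
(* Hypothetical mini-model: integer-valued locations, memory initially 0. *)
type op =
  | St of string * int   (* St (location, value stored) *)
  | Ld of string * int   (* Ld (location, value observed) *)

(* All interleavings of two lists that keep each list's internal order
   (this plays the role of requireProgramOrder). *)
let rec interleavings xs ys =
  match xs, ys with
  | [], zs | zs, [] -> [ zs ]
  | x :: xt, y :: yt ->
    List.map (fun r -> x :: r) (interleavings xt ys)
    @ List.map (fun r -> y :: r) (interleavings xs yt)

(* requireReadValue: every load must observe the latest store before it. *)
let read_values_ok shuffle =
  let rec go mem = function
    | [] -> true
    | St (l, v) :: rest -> go ((l, v) :: List.remove_assoc l mem) rest
    | Ld (l, v) :: rest ->
      let cur = try List.assoc l mem with Not_found -> 0 in
      cur = v && go mem rest
  in
  go [] shuffle

(* SC(ops): some program-order-respecting shuffle explains the reads. *)
let legal_sc p0 p1 = List.exists read_values_ok (interleavings p0 p1)

let stores = [ St ("c", 1); St ("d", 2) ]
let execution1 = legal_sc stores [ Ld ("d", 2); Ld ("c", 0) ]
let execution2 = legal_sc stores [ Ld ("d", 2); Ld ("c", 1) ]
```

Under these assumptions `execution1` is false and `execution2` is true. The Itanium side of the question needs split stores and the remaining ordering rules, which this sketch does not cover.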

Itanium memory model in Higher Order Logic (well, not so high actually…)
• Our HOL specs → translation → “sat-generating checker programs”

Transformation of HOL specs to generate constraints

Initial Spec:
  atomicWBRelease(ops,order) =
    forall (i in ops). (j in ops). (k in ops).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
      (i.wrID = k.wrID) /\ order(i,j) /\ order(j,k)
      ==> (j.wrID = i.wrID)

Applying Contrapositive:
  atomicWBRelease(ops,order) =
    forall (i in ops). (j in ops). (k in ops).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
      (i.wrID = k.wrID) /\ ~(j.wrID = i.wrID)
      ==> ~(order(i,j) /\ order(j,k))

After Reducing Quantifier Scopes:
  atomicWBRelease(ops,order) =
    forall (i in ops).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
      forall (k in ops). (i.wrID = k.wrID) ==>
        forall (j in ops). ~(j.wrID = i.wrID) ==>
          ~(order(i,j) /\ order(j,k))
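
The three forms share one propositional skeleton, and a truth table confirms that the rewriting steps preserve meaning. In the sketch below `a` stands for the guards on i and k, `b` for `order(i,j) /\ order(j,k)`, and `c` for `j.wrID = i.wrID`; these abbreviations and the function names are assumptions made for illustration. It checks only the propositional part, not the movement of quantifiers.

```ocaml
let implies p q = (not p) || q

(* Initial spec:           a /\ b ==> c *)
let initial a b c = implies (a && b) c

(* Applying contrapositive: a /\ ~c ==> ~b *)
let contrapositive a b c = implies (a && not c) (not b)

(* Reduced scopes:          a ==> (~c ==> ~b) *)
let nested a b c = implies a (implies (not c) (not b))

let bools = [ true; false ]

(* Compare the three forms on all eight assignments. *)
let equivalent =
  List.for_all (fun a ->
    List.for_all (fun b ->
      List.for_all (fun c ->
        initial a b c = contrapositive a b c
        && contrapositive a b c = nested a b c)
        bools)
      bools)
    bools
```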

Functional (Ocaml) Program Derivation from HOL Specs:

Transformed Spec:
  atomicWBRelease(ops,order) =
    forall (i in ops).
      (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
      forall (k in ops). (i.wrID = k.wrID) ==>
        forall (j in ops). ~(j.wrID = i.wrID) ==>
          ~(order(i,j) /\ order(j,k))

Functional program that generates the constraints (will be automated):
  atomicWBRelease(ops) = forall(i,ops,wb(i))
  wb(i) = if ~((attr_of i.var=WB) & (i.op=StRel) & (i.wrType=Remote))
          then true
          else forall(k,ops,wb1(i,k))
  wb1(i,k) = if ~(i.wrID=k.wrID) then true
             else forall(j,ops,wb2(i,k,j))
  wb2(i,k,j) = if (j.wrID=i.wrID) then true
               else ~(order(i,j) & order(j,k))
  forall(i,S,e(i)) = for all i in S : e(i)
    (* foldr( map (fn i -> e(i)) (S) (&), true) *)
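
A minimal runnable OCaml rendering of the derived program, under stated assumptions: a cut-down `tuple` record, an `attr_of` that treats every variable as write-back, and an `order` relation read off a concrete candidate shuffle. Instead of emitting clauses it evaluates the rule directly, which is enough to show the nested `forall` structure at work. All names are illustrative, not the tool's own.

```ocaml
type opcode = St | StRel | Ld | LdAcq | Mf
type wr_type = Local | Remote | DontCare
type attr = WB | UC

type tuple = { id : int; op : opcode; var : int; wr_id : int; wr_type : wr_type }

(* Assumption: every variable lives in write-back memory. *)
let attr_of (_ : int) = WB

(* Direct evaluation of the transformed spec: wb, wb1 and wb2 nested. *)
let atomic_wb_release ops order =
  List.for_all (fun i ->
    if not (attr_of i.var = WB && i.op = StRel && i.wr_type = Remote) then true
    else
      List.for_all (fun k ->
        if i.wr_id <> k.wr_id then true
        else
          List.for_all (fun j ->
            if j.wr_id = i.wr_id then true
            else not (order i j && order j k))
            ops)
        ops)
    ops

(* order a b holds when a appears strictly before b in the shuffle. *)
let order_of shuffle a b =
  let pos t =
    let rec find n = function
      | [] -> raise Not_found
      | x :: rest -> if x.id = t.id then n else find (n + 1) rest
    in
    find 0 shuffle
  in
  pos a < pos b

(* Two remote copies of one st.rel and an unrelated load. *)
let r0 = { id = 8; op = StRel; var = 1; wr_id = 7; wr_type = Remote }
let r1 = { id = 9; op = StRel; var = 1; wr_id = 7; wr_type = Remote }
let ld = { id = 11; op = LdAcq; var = 1; wr_id = (-1); wr_type = DontCare }
let ops = [ r0; r1; ld ]

let atomic_ok = atomic_wb_release ops (order_of [ r0; r1; ld ])
let interleaved_ok = atomic_wb_release ops (order_of [ r0; ld; r1 ])
```

Here `atomic_ok` is true and `interleaved_ok` is false, because in the second shuffle the load sits between two copies of the same store-release.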

Itanium memory model in Higher Order Logic (well, not so high actually…)
• Our HOL specs → translation → “sat-generating checker programs”
• Execution to be checked → translation by above program to Sat

Have built tool for tuple-generation that addresses many details:
(1) Expansion into tuples with variable address allocation
  P1: St a,1; Ld r1,a <1>; St b,r1 <1>;
  P2: Ld.acq r2,b <1>; Ld r3,a <0>;
Tuples 1 to 9:
  {id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false};
  {id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false};
  {id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false};
  {id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true};
  {id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true};
  {id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true};
  {id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true};
  {id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true};
  {id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true}
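
The store expansion shown in the first three tuples can be sketched as follows. The record mirrors the fields printed on the slide; the function `split_store` and its labelled parameters are hypothetical, and the sketch only covers a store of an immediate value (no register use), not the full tuple-generation tool.

```ocaml
type opcode = St | StRel | Ld | LdAcq | Mf
type wr_type = Local | Remote | DontCare

type tuple = {
  id : int; proc : int; pc : int; op : opcode; var : int; data : int;
  wrID : int; wrType : wr_type; wrProc : int; reg : int; useReg : bool;
}

(* Split one store into (p+1) tuples: one local copy plus one remote copy
   per processor. The id of the local copy doubles as the shared wrID. *)
let split_store ~nprocs ~first_id ~proc ~pc ~op ~var ~data =
  let base =
    { id = first_id; proc; pc; op; var; data; wrID = first_id;
      wrType = Local; wrProc = proc; reg = (-1); useReg = false }
  in
  let remote q =
    { base with id = first_id + 1 + q; wrType = Remote; wrProc = q }
  in
  base :: List.init nprocs remote

(* "St a,1" issued by processor 0 in a two-processor system. *)
let tuples =
  split_store ~nprocs:2 ~first_id:0 ~proc:0 ~pc:0 ~op:St ~var:0 ~data:1
```

With two processors this yields ids 0, 1, 2 with `wrProc` 0, 0, 1, matching the first three tuples listed above.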

How the SAT encoding is achieved...
Example execution:
  st c,1 ; st d,2
  ld d, 2 ; ld c, 0
Break it down into “tuples” (8 tuples obtained):
• Store c viewed at P1 for modeling bypassing
• Store c viewed at P1 for modeling global visibility
• Store c viewed at P2 for modeling global visibility
• Store d viewed at P1 for modeling bypassing
• Store d viewed at P1 for modeling global visibility
• Store d viewed at P2 for modeling global visibility
• Ld d viewed at P2 for modeling read value
• Ld c viewed at P2 for modeling read value
  legalItanium(ops) = Exists order. (
    requireStrictTotalOrder ops order /\
    requireOtherOrderItanium ops order /\
    requireReadValue ops order )
  SC(ops) = Exists order. (
    requireStrictTotalOrder ops order /\
    requireOtherOrderSC ops order /\
    requireReadValue ops order )

Constraint Encoding Approach #1
• n log n approach (“small domain” encoding)
• Attach a word w_t of 2 bits to each tuple t
• Tuple i before Tuple j --> Assert w_i < w_j
• StrictTotalOrder --> Assert that the w_t words are distinct
• Smaller # of Boolean Vars
• Much Harder SAT instances (abandoned for now)
Illustration on 4 tuples:
• Boolean variables (two bits per tuple): x00 x01, x10 x11, x20 x21, x30 x31
• requireStrictTotalOrder ops order: for all i ≠ j, (xi1, xi0) != (xj1, xj0)
• requireOtherOrder ops order and requireReadValue ops order: a system of constraints with primitive constraint (xi1, xi0) < (xj1, xj0)

Constraint Encoding Approach #2
• n × n approach (“e_ij” encoding)
• Assign a matrix position mij for each pair of tuples ti and tj
• Tuple i before Tuple j --> Assert mij true
• StrictTotalOrder --> Assert Irreflexivity, Transitivity, Totality
• Larger # of Boolean Vars
• Easier SAT instances (being pursued now)
Illustration on 4 tuples:
• requireStrictTotalOrder ops order:
• Forall i : ~mii
• Forall i,j with i ≠ j : mij \/ mji
• Forall i,j,k : mij /\ mjk => mik
• requireOtherOrder ops order and requireReadValue ops order: a system of constraints with primitive constraint mij
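
The three StrictTotalOrder axioms translate directly into clauses over the mij variables. The sketch below assumes a DIMACS-style numbering in which mij is variable `i*n + j + 1` and a negative integer is a negated literal; that numbering, and the helper `satisfies` used to sanity-check the clauses, are assumptions rather than the tool's actual layout.

```ocaml
(* Variable number for m_ij, 1-based (an assumed numbering). *)
let var n i j = (i * n) + j + 1

let range n = List.init n (fun i -> i)
let flat_map f xs = List.concat (List.map f xs)

(* Clauses (lists of literals) for requireStrictTotalOrder on n tuples. *)
let strict_total_order n =
  let idx = range n in
  (* Irreflexivity: ~m_ii *)
  let irreflexive = List.map (fun i -> [ -(var n i i) ]) idx in
  (* Totality: m_ij \/ m_ji for each unordered pair i < j *)
  let total =
    flat_map (fun i ->
      flat_map (fun j -> if i < j then [ [ var n i j; var n j i ] ] else []) idx)
      idx
  in
  (* Transitivity: ~m_ij \/ ~m_jk \/ m_ik for all i, j, k *)
  let transitive =
    flat_map (fun i ->
      flat_map (fun j ->
        List.map (fun k -> [ -(var n i j); -(var n j k); var n i k ]) idx)
        idx)
      idx
  in
  irreflexive @ total @ transitive

(* Evaluate the clauses under the assignment m_ij := before i j. *)
let satisfies n before clauses =
  let holds lit =
    let v = abs lit - 1 in
    let b = before (v / n) (v mod n) in
    if lit > 0 then b else not b
  in
  List.for_all (List.exists holds) clauses

let clauses = strict_total_order 4
let linear_ok = satisfies 4 (fun i j -> i < j) clauses
let empty_ok = satisfies 4 (fun _ _ -> false) clauses
```

For 4 tuples this produces 4 + 6 + 64 = 74 clauses. The transitivity group grows cubically with the number of tuples, which is consistent with the concluding remark that transitivity is the main source of complexity.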

Table of Results (somewhat dated…)

SAT-instance generation time for n log n method
  Tuples   Total Order   Other Order
  32       0.2           1.6
  64       1.2           17.1
  128      5.7           179.0

SAT-instance generation time for n × n method
  Tuples   Total Order   Other Order
  32       0.5           0.1
  64       4.3           0.9
  128      34.2          9.0

SAT-checking times
           n log n                          n × n
  Tuples   Monolith  TotalOrd  OtherOrd     Monolith  TotalOrd  OtherOrd
  32       9.6       0.6       4.3          0.33      0.69      0.05
  64       247.17    29.53     37.6         2.73      6.17      0.5
  128      abort     1341      abort        164.8     145.6     351.1

Explaining the results of Sat
• Itanium memory model in Higher Order Logic (well, not so high actually…)
• Our HOL specs → translation → “sat-generating checker programs”
• Execution to be checked → translation by above program to Sat
• Each assembly instruction → clauses it generates + annotations
• When Sat, what interleaving explains?
• When Unsat, how to get “core” (root-cause) + annotations on core
• Translating annotations on core to cycle on original program

Clause Annotations
• Each clause generated by the sat-generating checker program also generates an associated tuple.
• This tuple has information pertaining to the clause’s source.
• Each tuple has the following information:
• The ops involved in generating the clause (up to a maximum of 4 ops could generate a clause)
• The proc value of the processor whose instructions were used to generate this clause (taken from the tuples generated by the gentuple program)
• The pc value of the instruction that was the source for this tuple
• The name of the memory ordering rule whose application generated this tuple (ReadValue, ProgramOrder, Reflexive, etc.)
• The clause annotation looks as follows: < proc, pc, op1, op2, op3, op4, RuleName >

Example execution (Table 18, pg. 31 of App note)
  P: st [x] = 1 ; mf ; ld r1 = [y] <0>
  Q: st.rel [y] = 1
  R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>
• The Sat instance generated for the above example is UNSAT.
• Next few slides show automated approach to detect the root cause cycle.
• We will ignore the reflexive and transitive rules in these slides (they are necessary to force unsat, but useless in building a cycle!!)

Clause annotations for the unsat core for example
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 1; op2 = -1; op3 = -1; op4 = -1; rule = Reflexive
  op1 = 4; op2 = 5; op3 = 6; op4 = -1; rule = TransitiveOrder
  op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
  op1 = 4; op2 = 6; op3 = 8; op4 = -1; rule = TransitiveOrder
  op1 = 4; op2 = 11; op3 = 12; op4 = -1; rule = TransitiveOrder
  op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
  op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = TotalOrder
  op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = TotalOrder
  op1 = 11; op2 = 4; op3 = 8; op4 = -1; rule = TransitiveOrder
  op1 = 11; op2 = 4; op3 = -1; op4 = -1; rule = TotalOrder
  op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue

Op numbering for the example (each number denotes an op; a store has both local and remote ops):
  P: st [x] = 1 (ops 1, 2, 3, 4) ; mf (op 5) ; ld r1 = [y] <0> (op 6)
  Q: st.rel [y] = 1 (ops 7, 8, 9, 10)
  R: ld.acq r2 = [y] <1> (op 11) ; ld r3 = [x] <0> (op 12)

The next slides repeat this op-numbered diagram, each highlighting one group of core annotations.

Ops 4 and 5:
  op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder

Ops 5 and 6:
  op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder

Op 6 (with op 8):
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

Ops 8, 9, 10 with ops 11 and 12:
  op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
  op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

Op 11 (with op 10):
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue

Ops 11 and 12:
  op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder

Op 12 (with op 4):
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
  op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

Good Case-study Illustrating Program Derivation from Formal Specs
• Initial specs: HOL
• Formal derivation of tail-recursive functional programs
• “Code generation” consists of generating Boolean clauses
• Choose Boolean encoding method
• Re-target code generation correspondingly
• Source-level optimizations
• Record known orderings (e.g., “i before j”); these manifest as unit clauses
• Infer others (e.g., “not j before i”) and generate unit clauses for these too
• Prevent generating transitivity axioms that depend on “j before i”
• The use of incremental SAT can perhaps be directed by “functional scripts” that are automatically generated
• Use of Unsat cores to pinpoint errors

Concluding Remarks
• Main source of complexity: the transitivity axiom
• “Lazy” methods for handling transitivity must be investigated
• Hybrid Sat encoding (partly n × n and partly n log n) can also help, as was the experience of Lahiri, Seshia, and Bryant
• Analyzing larger programs:
• Somehow view program in terms of “basic blocks”
• Treat each basic block as a super-instruction
• If a super-instruction is unordered, no need to descend into the basic block
• Exploit incremental Sat when the same litmus tests are rerun
• Try modeling another weak memory model

Extra Slides

Unsat Core generation
• The CNF file generated by the sat-generating program is solved using zchaff.
• If SAT, then we get a satisfying assignment.
• The first n*n variables in the assignment correspond to the n*n variables in our ordering. They can be used to output a valid ordering of the ops.
• If UNSAT, then we need a way to find a “root-cause” for the illegality of the execution.
• We use unsatisfiable core generation to get to the root cause.
• An unsatisfiable core of an unsatisfiable Sat instance is a subset of clauses of the formula such that its conjunction is still UNSAT.
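
When the instance is SAT, the ordering can be read back from the n*n ordering variables. A minimal sketch, assuming the assignment has already been decoded into a Boolean matrix `m` with `m.(i).(j)` true when op i is before op j (the decoding from solver output is not shown, and the function name is hypothetical): in a strict total order, the position of an op equals the number of ops ordered before it.

```ocaml
(* Recover a linear ordering of ops 0..n-1 from the ordering matrix. *)
let ordering_of_matrix m =
  let n = Array.length m in
  (* Number of ops that the assignment places before op j. *)
  let preds j =
    let c = ref 0 in
    for i = 0 to n - 1 do
      if m.(i).(j) then incr c
    done;
    !c
  in
  List.sort (fun a b -> compare (preds a) (preds b)) (List.init n (fun i -> i))

(* Example assignment: op 2 before op 0 before op 1. *)
let m =
  [| [| false; true; false |];
     [| false; false; false |];
     [| true; true; false |] |]

let order = ordering_of_matrix m
```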

Generating Unsatisfiable Core
• Zchaff can be told to generate a resolution trace while checking for Sat.
• Zcore: a tool that takes as input a CNF file and the resolution trace produced by zchaff, and produces an unsatisfiable core.
• Zcore is available as part of zchaff.
• The unsatisfiable core is another CNF file with the reduced set of clauses.
• It can be fed back into zchaff/zcore to generate a potentially smaller unsatisfiable core.
• The process is repeated till a fixed point is reached.
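
The iterate-to-fixed-point loop can be sketched independently of the solver. In the code below `minimize` is the loop itself; `shrink` is only a stand-in for one zchaff/zcore pass, implemented here as a brute-force, deletion-based reduction that is practical for toy inputs only. Neither function reflects how zcore works internally, and all names are assumptions.

```ocaml
(* Brute-force satisfiability over variables 1..nvars (tiny inputs only).
   A clause is a list of non-zero integers; a negative integer is a
   negated variable. *)
let satisfiable nvars clauses =
  let rec try_assign v assignment =
    if v > nvars then
      List.for_all
        (List.exists (fun lit ->
           let b = List.assoc (abs lit) assignment in
           if lit > 0 then b else not b))
        clauses
    else
      try_assign (v + 1) ((v, true) :: assignment)
      || try_assign (v + 1) ((v, false) :: assignment)
  in
  try_assign 1 []

(* Stand-in for one core-extraction pass: drop the first clause whose
   removal leaves the set unsatisfiable, or return the input unchanged. *)
let shrink nvars clauses =
  let rec go kept = function
    | [] -> List.rev kept
    | c :: rest ->
      let without = List.rev_append kept rest in
      if not (satisfiable nvars without) then without
      else go (c :: kept) rest
  in
  go [] clauses

(* Feed the core back in until it stops getting smaller. *)
let rec minimize shrink clauses =
  let core = shrink clauses in
  if List.length core = List.length clauses then clauses
  else minimize shrink core

let core = minimize (shrink 2) [ [ 1; 2 ]; [ 1 ]; [ -1 ]; [ 2; -1 ] ]
```

On this four-clause toy instance the loop stops at the two contradictory unit clauses.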

Mapping back to root-cause
• Clauses in the unsatisfiable core contain the ordering violation information
• A tool homes in on the root cause of the violation
• If the root cause is not something trivial, then the cause is usually a cycle of instructions. Each link in the cycle corresponds to an ordering requirement between the instructions involved.
• If a cycle exists, then Transitivity can be applied to show that Irreflexivity is not satisfied.
• Input to the tool to generate root cause:
• The original set of annotated machine instructions for all processors
• The default values stored in memory locations at the beginning of the execution
• Clause annotations for the clauses that form the unsatisfiable core

Root-cause cycle analysis algorithm
• Each ReadValue rule generates a set of clauses. From the annotations, find the tuples that come from the same ReadValue rule (two different ops will be involved in a rule)
• Extract the ops out of the annotations and get the corresponding instructions (using the proc and pc values)
• From the data used in the ld instruction and the default data value for the corresponding memory address, it can be seen whether the effect of a store is reflected in a load. This way the dependency between a load and a store is established.
• The above is done for all the ReadValue rules in the annotations
• Ops (and the corresponding instructions) on both sides of an mf that form a link in the cycle are inferred from the ProgramOrder rule annotations and the pc values involved.
• The other missing links in the violating cycle are also inferred from the remaining ProgramOrder rule annotations.
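
Once the links have been inferred, closing the cycle is a plain graph search. The sketch below assumes the links are already available as "before" edges between op numbers. The edge list is one illustrative reading of the example's core annotations, not output of the tool, and collapsing ops 7 to 10 (the split copies of the st.rel) into one node is an assumed way to account for AtomicWBRelease.

```ocaml
(* Assumed "before" edges: ProgramOrder 4->5, 5->6, 11->12; ReadValue
   6->8 (the load of 0 precedes the store copy), 10->11 (the load of 1
   follows the store copy), 12->4 (the load of 0 precedes the store). *)
let edges = [ (4, 5); (5, 6); (6, 8); (10, 11); (11, 12); (12, 4) ]

(* Atomicity: treat all copies of the st.rel (ops 7..10) as one node. *)
let rep op = if op >= 7 && op <= 10 then 7 else op
let merged = List.map (fun (a, b) -> (rep a, rep b)) edges

let successors edges node =
  List.map snd (List.filter (fun (a, _) -> a = node) edges)

(* Depth-first search for a path that leaves [start] and returns to it.
   [path] holds the nodes visited so far, most recent first. *)
let find_cycle edges start =
  let rec dfs path node =
    if node = start && path <> [] then Some (List.rev path)
    else if List.mem node path then None
    else
      let rec first = function
        | [] -> None
        | s :: rest ->
          (match dfs (node :: path) s with
           | Some c -> Some c
           | None -> first rest)
      in
      first (successors edges node)
  in
  dfs [] start

let cycle = find_cycle merged 4
```

With these edges the search returns the node sequence 4, 5, 6, 7, 11, 12, which closes back on op 4. Without the merge step (searching `edges` directly) no cycle is found, which illustrates why the atomicity rule has to take part in the explanation.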
