Presenters: Ganesh Gopalakrishnan Xiaofang Chen School of Computing, University of Utah

Scaling Formal Methods toward Hierarchical Protocols in Shared Memory Processors: Annual Review Presentation – April 2007
Intel SRC Customization Award 2005-TJ-1318
Presenters: Ganesh Gopalakrishnan, Xiaofang Chen
School of Computing, University of Utah, Salt Lake City, UT

Project Personnel
• IBM Mentor: Dr. Steven M. German
• Intel Mentor: Dr. Ching-Tsun Chou
• Primary Student: Xiaofang Chen
  • Summer internship planned at IBM T.J. Watson (6/07), where the research discussed here in Project 2 will be furthered
• Other SRC Student: Robert Palmer (work involving TLA+ modeling of communication libraries)
  • Defense May 10; expected to join Intel (6/07)
• 3 other PhD students, 1 MS student, 2 UGs in FV
  • All working on FV of threading / msg-passing software

Multicores are the future! Their caches are visibly central… > 80% of chips shipped will be multi-core (photo courtesy of Intel Corporation).

[Figure: Cluster 1, Cluster 2, and Cluster 3, each with L1 Caches and an L2 Cache + Local Dir, attached through Interfaces to a Global Dir and Main Memory.]
…and the number of organizations of multiprocessor caches is mind-boggling (e.g., imagine 80 cores and deeper hierarchies): Shared / Private, Inclusive / Exclusive.

Protocol design happens in “the thick of things” (many interfaces, constraints of performance, power, testability). From “High-throughput coherence control and hardware messaging in Everest,” by Nanda et al., IBM J. R&D 45(2), 2001.

Future Coherence Protocols
• Cache coherence protocols that are tuned for the contexts in which they are operating can significantly increase performance and reduce power consumption [Liqun Cheng]
• Producer-consumer sharing pattern-aware protocol [Cheng, HPCA07]
  • 21% speedup and 15% reduction in network traffic
• Interconnect-aware coherence protocols [Cheng, ISCA06]
  • Heterogeneous Interconnect
  • Improve performance AND reduce power
  • 11% speedup and 22% wire power savings
• Bottom line: Protocols are going to get more complex!

Designers have poor conceptual tools (e.g., “Informal MSC drawings”). Need better notations and tools.
[Figure: informal message sequence chart among GDir, L1-1, L1-2, and LDir, with states (S), (I), (S: L1-1), (S: L1-2) and messages Swap, Req_S, Broadcast, Fwd_Req, NAck, Gnt_S, Gnt_S.]

Design Abstractions in More Modern Flows • An Interleaving Protocol Model (Murphi or TLA+ are the languages of choice here) • FV here eliminates concurrency bugs • Detailed HDL model • FV here eliminates implementation bugs; however • Correspondence with Interleaving Model is lost • Need more detailed models anyhow • Interleaving Models are very abstract • Monolithic Verification of HDL Code Does not Scale • Design optimizations captured at HDL level • Interleaving model becomes more obsolete • Need an Integrated Flow: • Interleaving -> High level HW View -> Final HDL

Related Work in Formal HW Design
• BlueSpec
  • High level design is expressed using atomic transactions
  • Synthesizes high level designs into hardware implementations
  • Automatic scheduling of high level design steps in hardware
  • May not meet performance goals
• Malik et al.: Formal Architecture and Microarchitecture Modeling for Verification
  • Meant for Instruction Set Processors
• Need Formal theory of Refinement from Interleaving to High level HW Models

Our Goals
• Develop Methodology to Verify “Realistic” Interleaving Models
  • Useful Benchmarks for others
  • Our particular contributions are towards Hierarchical protocols
  • Largely Inspired by Chou et al.’s work (FMCAD’04)
  • Xiaofang Chen’s PhD is wrapping up a nice story here!
• Develop Language and Formal Theory for Higher Level HW Specification & Refinement
  • Ideas largely due to German & Janssen
  • Xiaofang Chen’s PhD work is taking ideas from initial proposal all the way to practical realization!

A summary of our work over Y1-2
• Three progressively better approaches to verify hierarchical cache coherence protocols at the interleaving level
  • A/G method of complementary abstractions (FMCAD’06)
  • Extensions to Non-inclusive hierarchies (TR 06-014)
  • Abstract each level separately (to be submitted)
  • Error-trace checking (to be submitted)
• A theory of transaction based design and verification (writeup finished; initial experiments finished)
• Modular verification of transactions (writeup in progress; initial experiments finished)
Number the projects 1.1, 1.2, 1.3, 1.4, 2, and 3.

Project 1.[1-4] Timeline
• 1.1: FMCAD’06 results
• 1.2: Another hierarchical benchmark (non-inclusive)
• 1.3: Abstraction per level (more scalable)
• 1.4: Automatic Recognition of spurious/real bugs

1.[1-4]: Hierarchical Protocols
[Figure: Remote Cluster 1, Home Cluster, and Remote Cluster 2, each with L1 Caches, an L2 Cache + Local Dir, and a RAC, connected to a Global Dir and Main Memory.]

Abstracted Protocol #1
[Figure: the Home Cluster keeps its L1 Caches and L2 Cache + Local Dir; Remote Cluster 1 and Remote Cluster 2 each appear only as an abstracted L2 Cache + Local Dir’; the RACs, Global Dir, and Main Memory remain.]

Abstracted Protocol #2
[Figure: Remote Cluster 1 keeps its L1 Caches and L2 Cache + Local Dir; the Home Cluster and Remote Cluster 2 each appear only as an abstracted L2 Cache + Local Dir’; the RACs, Global Dir, and Main Memory remain.]

Non-Circular Assume/Guarantee
• We can’t verify this due to state explosion:
  • h ║ r1 ║ r2 ╞ Coh
• Instead:
  • Check-1: h ║ R1 ║ R2 ╞ Coh1 ∧ Guarant1
  • Check-2: H ║ r1 ║ R2 ╞ Coh2 ∧ Guarant2

1.2: We applied the non-circular A/G method to a Non-Inclusive Hierarchical Protocol
• Protocol features
  • Broadcast channels
  • Non-imprecise local dir
• Verification challenges
  • A/G cannot infer local dir from just intra-clusters
  • Coherence may involve multiple L1 caches

Verifying Non-Inclusive Protocols • Inferring “L2.State = Excl” from • Outside the cluster • Inside the cluster • Use history variables to change non-inclusive to inclusive protocols

Experimental Results Reduction is over 65%

1.3: We then tried a “Split Hierarchy Per Level Approach” to using non-circular A/G
[Figure labels: ABS #1 and ABS #2 (L1 Caches, L2 Cache + Local Dir); ABS #3 (three L2 Cache + Local Dir’, three RACs, Global Dir, Main Memory).]

A Sample Scenario
[Figure: Remote Cluster 1, Home Cluster, Remote Cluster 2, with cache states Excl and Invld. Messages: 1. Req_Ex; 2. Fwd Req_Ex; 3. Fwd Req_Ex; 4. Fwd Req_Ex; 5. Grant; 6. Grant.]

Map to Abstracted Protocols
[Figure: Remote Cluster 1 and Remote Cluster 2, with cache states Invld and Excl. Messages: 1. Req_Ex; 2. Fwd Req_Ex; 3. Fwd Req_Ex; 4. Fwd Req_Ex; 5. Grant; 6. Grant.]

Experimental Results Reduction is over 95%!

Project 1.4: Automatic Recognition of Spurious / Real Bugs in these approaches
• Problem statement
  • Given an error trace of the ABS protocol
  • Is it a real bug of the original protocol?
• Solution
  • In the original protocol, use BFS to guide the model checking to match the error trace
  • Reason this works: our abstraction is just projection

Basic Idea of Automatic Recognition
[Figure: the error trace of the Abs. protocol (v1=0, v2=0; then v1=1, v2=2; ……; v1=6, v2=8) guides a directed BFS of the original protocol. From v1=0, v2=0, v3=0 the successors v1=1, v2=2, v3=1; v1=3, v2=1, v3=0; and v1=0, v2=0, v3=3 are each marked keep or drop.]
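The directed search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's tool: states are plain dicts, `successors` is a hypothetical stand-in for the original protocol's transition relation, and a concrete step that changes only hidden variables is assumed to be allowed as a stutter of the abstract trace.

```python
from collections import deque

def project(state, observed):
    """Abstraction by projection: keep only the variables the abstract model sees."""
    return tuple(state[v] for v in observed)

def freeze(state):
    """Hashable form of a dict state, used for the visited set."""
    return tuple(sorted(state.items()))

def replay(init, successors, abstract_trace, observed):
    """Return a concrete path matching abstract_trace, or None if none exists.

    `successors` (hypothetical) maps a concrete state to its next states.
    """
    targets = [project(a, observed) for a in abstract_trace]
    if project(init, observed) != targets[0]:
        return None
    last = len(targets) - 1
    queue = deque([(init, 0, [init])])
    seen = {(freeze(init), 0)}
    while queue:
        state, i, path = queue.popleft()
        if i == last:
            return path  # whole abstract trace matched: the bug is real
        for nxt in successors(state):
            p = project(nxt, observed)
            steps = []
            if p == targets[i + 1]:
                steps.append(i + 1)  # keep: advances along the error trace
            if p == targets[i]:
                steps.append(i)      # keep: stutter (assumed to be allowed)
            for j in steps:          # anything else is dropped
                key = (freeze(nxt), j)
                if key not in seen:
                    seen.add(key)
                    queue.append((nxt, j, path + [nxt]))
    return None  # no concrete run matches: the abstract error is spurious

# Toy transition relation using the values from the slide (illustrative only).
INIT = {"v1": 0, "v2": 0, "v3": 0}

def toy_successors(s):
    if s == INIT:
        return [{"v1": 1, "v2": 2, "v3": 1},
                {"v1": 3, "v2": 1, "v3": 0},
                {"v1": 0, "v2": 0, "v3": 3}]
    return []
```

Calling `replay(INIT, toy_successors, [{"v1": 0, "v2": 0}, {"v1": 1, "v2": 2}], ("v1", "v2"))` returns a two-state concrete path, while an abstract trace that the toy relation cannot reach yields `None`.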

Y3 Plans for Project 1: • Considerable Experience Gained • Three Large Benchmark Protocols (each is 3000+ lines of Murphi Code) • on the web • Have Reduced Verif Complexity of Hier Protocols by 90% • Can Identify Spurious Errors Automatically • All Finite-state • Not Parameterized • No plans for Parameterized • Y3 Plans: Build Tool to support this methodology

Summary of Projects 2 and 3 • Three progressively better approaches to verify hierarchical cache coherence protocols at the interleaving level • A/G method of complementary abstractions (FMCAD’06) • Extensions to deeper, and non-inclusive hierarchies (TR 06-014) • Latest method that abstracts each level separately (to be submitted) • Error-trace checking (to be submitted) • A theory of transaction based design and verification (writeup finished) • Modular verification of transactions (writeup in progress)

Transaction Level HW Modeling The problem addressed: Bridge the gap between high-level specifications and RTL implementations • Global properties cannot be formally verified at RTL Level! • Specifications can be verified, but do they correctly represent the implementations?

Driving Design Benchmark due to German and Geert Janssen

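What changes when moving from a spec to an implementation?
• Atomicity
• Concurrency
• Granularity in modeling
[Figure: in the spec, transition 1 goes directly between home and client; in the implementation, steps 1.1, 1.2, and 1.3 pass between home and client through a router and a buffer.]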

General Mappings between high level transitions and transactions that help implement them
• High Level Transition 1: High Level Transitions take some non-zero unit of time (conceptual)
• Low Level Transitions 1.1, 1.2, 1.3 that help realize 1: Each Low Level Transition takes One Clock Cycle

High-Level and Low-Level Computations
[Figure: high-level transitions 1, 2, 3 shown above the low-level transitions 1.1, 1.2, 1.3; 2.1, 2.2; 3.1, 3.2, 3.3.]

Specification of High and Low Levels
• High level (transition 1): In Murphi as a Guard → Action Rule
• Low level (transitions 1.1, 1.2, 1.3): In HMurphi as Multiple Guard → Action Rules enclosed in a Begin Transaction / End Transaction
• The Guards Decide when each low level transition can fire
• The Maximal Set of Low Level Transitions Enabled in any state is fired concurrently within each clock tick

Transaction
• A transaction is a set of transitions in Impl that corresponds to a transition in Spec

Transaction
  Rule 1
  ……
  Rule n
Endtransaction;

Executions
• Spec: interleaving
  • One enabled transition fires at each step
  • …… 1 2 3 ……
• Impl: concurrent
  • All enabled transitions fire at each step
  • …… {1.1, 2.1} {1.2} {2.2, 3.1, 3.2} ……
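The two step semantics can be contrasted with a small Python sketch. It is an illustration under assumptions, not HMurphi semantics: a rule is modeled as a hypothetical (guard, action) pair over a dict state, every concurrently fired rule reads the state from before the tick, and two rules writing different values to the same variable in one tick are reported as a write/write conflict.

```python
def enabled(rules, state):
    """Names of the rules whose guard holds in `state`."""
    return [name for name, (guard, _action) in rules.items() if guard(state)]

def spec_step(rules, state, choice):
    """Interleaving semantics: exactly one enabled rule fires."""
    guard, action = rules[choice]
    if not guard(state):
        raise ValueError("rule " + choice + " is not enabled")
    new_state = dict(state)
    new_state.update(action(state))
    return new_state

def impl_step(rules, state):
    """Concurrent semantics: all enabled rules fire within one clock tick."""
    fired = enabled(rules, state)
    updates = {}
    for name in fired:
        # Each action reads the pre-tick state, as registers would.
        for var, val in rules[name][1](state).items():
            if var in updates and updates[var] != val:
                raise ValueError("write/write conflict on " + var)
            updates[var] = val
    new_state = dict(state)
    new_state.update(updates)
    return fired, new_state

# Hypothetical rules; the names only echo the slide's numbering.
RULES = {
    "1.1": (lambda s: s["req"] == 1, lambda s: {"buf": s["req"], "req": 0}),
    "2.1": (lambda s: s["ack"] == 0, lambda s: {"ack": 1}),
}
START = {"req": 1, "buf": 0, "ack": 0}
```

With these toy rules, `spec_step(RULES, START, "1.1")` fires only rule 1.1, whereas `impl_step(RULES, START)` fires the set {1.1, 2.1} in a single tick.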

A Few Notations • Observable variables: VH • These are Variables used in both Spec and Impl • Impl has additional internal variables also • A variable v is inactive at a state s if all transactions in Impl that can write to v are quiescent at s

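A Formal Notion of Simulation
• For every concurrent execution of Impl, there exists an interleaving execution of Spec such that the variables in VH ∩ inactive(li) match
[Figure: Impl execution l0, l1, l2, …… with a set {…} of transitions fired at each step, related to a Spec execution h0, h1, h2, …… with transitions t0, t1, t2, ……]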

Simulation Checks
[Figure: the Impl transaction takes I to I’; the Spec transition takes Spec(I) to Spec(I’).]
• I is a reachable state where the commit guard is true
• Guard for Spec transition must hold
• Observable vars changed by either Spec or Impl must match
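One way to read these checks is as a per-commit test, sketched below in Python. The sketch rests on assumptions that the slides do not spell out: Spec(I) is taken to be the projection of the Impl state onto the observable variables, and the function and parameter names are hypothetical.

```python
def check_commit(impl_before, impl_after, spec_guard, spec_action, observable):
    """Check one committing Impl transaction against its Spec transition.

    impl_before / impl_after -- Impl states I and I' (dicts)
    spec_guard, spec_action  -- the Spec rule, over observable variables only
    observable               -- the variables shared by Spec and Impl (VH)
    """
    spec_before = {v: impl_before[v] for v in observable}   # Spec(I), assumed projection
    if not spec_guard(spec_before):
        return False                                        # Spec guard must hold
    spec_after = dict(spec_before)
    spec_after.update(spec_action(spec_before))             # result of the Spec transition
    impl_view = {v: impl_after[v] for v in observable}      # Spec(I'), assumed projection
    # Variables changed by either side must agree; untouched ones agree trivially.
    changed = [v for v in observable
               if spec_after[v] != spec_before[v] or impl_view[v] != spec_before[v]]
    return all(spec_after[v] == impl_view[v] for v in changed)

# Toy rule: a cache line moves from Invld to Excl; "buf" is an Impl-only variable.
def toy_guard(s):
    return s["state"] == "Invld"

def toy_action(s):
    return {"state": "Excl"}
```

For example, `check_commit({"state": "Invld", "buf": 7}, {"state": "Excl", "buf": 0}, toy_guard, toy_action, ("state",))` holds, while an Impl step that ends in a different observable state fails the check.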

Model Checking Approaches • Monolithic • Cross product construction • Compositional • Abstraction • Assume/Guarantee

Compositional Approach • Abstraction • Change read to an access of an input var • Self-sourced read • Add all transitions that write to a var • Assume/Guarantee • Require all writes to var guarantee prop P • Assume P holds on all reads

Example of Abstraction Transaction 1 Transaction … Rule (v1 = d1) => ... … Endtransaction Transaction 2 …… Transaction n

Example of Assume/Guarantee Transaction 1: Request granted State := Excl … Impl.State = Spec.State Data := d Transaction 2: Update Cache

Benchmarks • High level in FMCAD’04 tutorial • Low level provided by German and Janssen • Sizes: • 1 Home node, 1 remote node Sizes are constrained by accessible VHDL tools!

Implementations
• Muv: HMurphi → VHDL
  • Written by German
• Mud:
  • Static analyzer for possible conflicts / dependencies
• VHDL verifier
  • IBM RuleBase

Preliminary Results * This is for datapath = 1 bit * Intel Xeon CPU 3.0GHz, 2GB memory

When Datapath > 1 bit • Cannot check monolithic approach • RuleBase 300 F-F academic license restriction • Decomposed approach • W/W checks not affected

Future Work • Reduce the cost of W/W conflicts checking • Localized reasoning • Apply to pipeline • More benchmarks • Try other VHDL tools • SixthSense etc.

Publications, Software, Models
• FMCAD 2006 paper
• Presentation at Intel
• Journal version of hierarchical coherence protocol verification (under prep)
• TR on Theory of Transaction Based Specification and Verification (under prep)
• Detailed VHDL-level German Protocol developed
• Analysis Framework for HMurphi Developed
• Preliminary Verification Experiments using Cadence IFV, IBM RuleBase, and IBM SixthSense
• Xiaofang Chen’s Summer Internship at IBM T.J. Watson Res. Ctr.
• Robert’s SRC Poster
• Techcon 2007 submission
There will be more publications during 2007-8 following hiatus due to infrastructure build-up (many delays!)
