Applying Formal Methods to Protocol Specifications and System Architecture
Ching-Tsun Chou
Multi-Processor Architecture, Enterprise Platforms Group, Intel Corporation
Disclaimer
The views expressed in this talk are the presenter’s alone and not necessarily those of Intel Corporation.
Why formal methods?
• Architectural specifications contain complex distributed protocols whose correctness is nontrivial to establish
  • Examples: directory-based cache coherence protocols, forward-progress mechanisms, variations of sliding window protocols, …
• Goal: Get protocol specifications correct before implementations commence
  • The earlier a bug is found, the easier it is to fix and the more flexibility there is in possible fixes
• Formal verification (FV) is a body of powerful techniques for achieving this goal
• Formal modeling promotes clear thinking and minimizes misunderstanding and misinterpretation of specifications
  • In the early stages of protocol design, more bugs are found during formal modeling than by model checking
  • Protocol design and formal modeling should go hand in hand
• Formal modeling produces unambiguous “golden models” of at least some aspects of complex protocols
  • Executable reference models can be generated from formal models
• Experience shows that formal methods work
  • They are already a standard industry practice: Intel, Sun, IBM, Compaq, SGI, ...
  • As this talk hopes to demonstrate
Formal vs. simulation-based verification
• Simulation-based verification: Check a small fraction of all possible behaviors of a large model
  + Very large and relatively complete model
  + Model need not be simplified or abstracted
  – Only a very small part of the state space is explored
  – Need to generate tests and collect coverage feedback
  > Results only as good as your tests and checkers
• Formal verification: Check all possible behaviors of a small model
  + All states are exhaustively explored
  + No tests are needed and coverage is 100%
  – Only very small models can be handled
  – Often need drastic simplification and abstraction
  > Results only as good as your models and properties
Moral: There is no free lunch!
Overview of Intel’s Scalability Port (SP) architecture
• Designed for mid-range shared-memory multiprocessors
• Employs a high-speed point-to-point interconnect that provides good scalability for mid-range to high-end systems
  • Shared buses are neither cost-effective nor scalable beyond a limited number of processors due to signaling, thermal, mechanical, and other challenges
• Supports a flexible system architecture
  • Enables everything from cost-optimal small systems to scalable high-end systems
  • Enables system vendors with proprietary system interconnects and components to use Intel building blocks
• An instance of SP was implemented in Intel’s E8870 chipset
Overview of SP cache coherence protocol
• Makes no assumptions whatsoever about the relative timing of events or the ordering of messages
  • Completely asynchronous, event-driven specification
• Directory-based, though the directory:
  • is optional (no directory = null directory)
  • may or may not be physically distributed
• A generalization of the invalidation-based MESI protocol, but different caching agents may perform MESI-state transitions at different times
• Employs mechanisms to resolve conflicts (collisions of requests from different requesters to the same cache line) in a distributed manner
• Table-based specification with ~1,000 rows across all tables
  • More than 20 transaction types, each with its own behavior in isolation and each able to interact with every other transaction type
SP cache coherence protocol validation flow
[Figure: Inputs are the protocol specification, Boolean rules, and non-table-based code. P-tables extracted from the specification are compared (? =) with p-tables generated from the rules to find “easy” bugs in the protocol spec; a formal verification model is model-checked to find “hard” bugs in the protocol spec; a C reference model is used in simulation to find bugs in implementations.]
Properties verified
• Data consistency:
  • If a cache’s state is valid (i.e., S, E, or M), then its data is up to date
• Cache and directory state consistency:
  • If any cache is in state E or M, the other caches must be in I
  • If a presence bit in the directory is 0, the corresponding cache must be in I
  • If the directory state is I, all presence bits are 0 (and hence all caches are I)
  • If the directory state is S, the caches whose presence bits are 1 are in I or S
  • If the directory state is E, exactly one presence bit is 1
• Weak liveness properties: AG EF (cs = CS) for each “control state” cs and each possible value CS of cs, i.e., from every reachable state, some state with cs = CS is still reachable
  • Excellent guard against missing rows in protocol tables and other unexpected cases
  • Detect both global and local deadlocks, but not livelock or starvation
  • Do not rely on fairness assumptions
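The properties above can be illustrated on a toy model. The sketch below uses a deliberately tiny, hypothetical directory protocol with atomic transitions (no messages, no conflicts, two caches, one address). It is not the SP protocol, and all names (`successors`, `violations`, `ag_ef`, the ghost variable `latest`) are illustrative assumptions. It enumerates every reachable state by breadth-first search, checks the data and state consistency invariants in each state, and checks AG EF (cs = CS) for every cache and every MESI value as a backward fixpoint over the explicit state graph.

```python
from collections import deque

N = 2  # number of caching agents in this hypothetical toy configuration

# A state is (caches, presence, dir_state, mem, latest, data):
#   caches    - tuple of 'I'/'S'/'E'/'M', one entry per cache
#   presence  - tuple of directory presence bits (0/1)
#   dir_state - 'I', 'S', or 'E' (E also covers a cache holding the line in M)
#   mem       - value held by memory
#   latest    - ghost variable: the most recently written value
#   data      - tuple of values held by the caches
INIT = (('I',) * N, (0,) * N, 'I', 0, 0, (0,) * N)


def replace(t, i, v):
    """Return tuple t with position i set to v."""
    return t[:i] + (v,) + t[i + 1:]


def successors(s):
    """Atomic (message-free) transitions of the toy directory protocol."""
    caches, pres, d, mem, latest, data = s
    owner = next((j for j in range(N) if caches[j] in ('E', 'M')), None)
    # Memory contents after writing back a Modified owner, if there is one.
    wb_mem = data[owner] if owner is not None and caches[owner] == 'M' else mem
    for i in range(N):
        if caches[i] == 'I':
            # Read miss: downgrade any owner to S, fetch the line from memory.
            shared = owner is not None or d == 'S'
            c = replace(caches, i, 'S' if shared else 'E')
            if owner is not None:
                c = replace(c, owner, 'S')
            yield (c, replace(pres, i, 1), 'S' if shared else 'E',
                   wb_mem, latest, replace(data, i, wb_mem))
        else:
            # Eviction: write back if Modified, then clear the presence bit.
            p = replace(pres, i, 0)
            new_mem = data[i] if caches[i] == 'M' else mem
            yield (replace(caches, i, 'I'), p, d if any(p) else 'I',
                   new_mem, latest, data)
        # Write: invalidate all other copies, then store a fresh value.
        new = 1 - latest
        c = tuple('M' if j == i else 'I' for j in range(N))
        p = tuple(1 if j == i else 0 for j in range(N))
        yield (c, p, 'E', wb_mem, new, replace(data, i, new))


def violations(s):
    """Return the list of invariant violations in state s (empty if none)."""
    caches, pres, d, mem, latest, data = s
    errs = []
    for i in range(N):
        others = [caches[j] for j in range(N) if j != i]
        if caches[i] != 'I' and data[i] != latest:
            errs.append('valid cache holds stale data')
        if caches[i] in ('E', 'M') and any(c != 'I' for c in others):
            errs.append('E/M copy coexists with another valid copy')
        if pres[i] == 0 and caches[i] != 'I':
            errs.append('presence bit 0 but cache not in I')
        if d == 'S' and pres[i] == 1 and caches[i] not in ('I', 'S'):
            errs.append('directory S but cache in E or M')
    if d == 'I' and any(pres):
        errs.append('directory I but a presence bit is 1')
    if d == 'E' and sum(pres) != 1:
        errs.append('directory E without exactly one presence bit')
    return errs


def explore(init):
    """Breadth-first search: map every reachable state to its successors."""
    graph, seen, queue = {}, {init}, deque([init])
    while queue:
        s = queue.popleft()
        graph[s] = set(successors(s))
        for t in graph[s] - seen:
            seen.add(t)
            queue.append(t)
    return graph


def ag_ef(graph, pred):
    """AG EF pred: every reachable state can reach some pred-state.
    Computed as a backward fixpoint over the explicit state graph."""
    can_reach = {s for s in graph if pred(s)}
    changed = True
    while changed:
        changed = False
        for s, succ in graph.items():
            if s not in can_reach and succ & can_reach:
                can_reach.add(s)
                changed = True
    return len(can_reach) == len(graph)


graph = explore(INIT)
bad = [s for s in graph if violations(s)]
live = all(ag_ef(graph, lambda s, i=i, v=v: s[0][i] == v)
           for i in range(N) for v in 'ISEM')
print(f"{len(graph)} reachable states, {len(bad)} violating, AG EF holds: {live}")
```

Because `latest` is a ghost variable, it records the most recent write without influencing any transition; it exists only so that data consistency can be stated as `data[i] == latest`. Deleting a transition (for example, the read-miss case) is one way to see the AG EF check fail, which mirrors how such properties catch missing table rows.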
Results of SP cache coherence protocol FV
• An SP cache coherence protocol has more than 10^33 reachable states for a configuration containing 1 cache-line address, 1 home node, 2 caching nodes, and all 20+ transaction types
• Each property takes on average 4 to 5 hours to model-check on a 700 MHz Pentium III Xeon machine with 4 GB of physical memory
• Many interesting bugs were found in successive versions of the SP cache coherence protocol by both formal modeling and model checking
  • In fact, more bugs were found by the former than by the latter in the early phase of SP protocol design
  • Not surprisingly, most problems were found when SP was first designed and during major revisions (e.g., when new transaction types were added)
  • But even minor revisions could introduce problems
• Moral: As far as cache coherence protocols are concerned, unaided human reasoning should not be trusted
Rule-based table checking flow
[Figure: The specification document in a word processor (e.g., FrameMaker) is converted to HTML; its tables are extracted and flattened into pre-processed tables, which are then post-processed. Separately, tables are generated from the rules. The generated tables and the post-processed tables are compared (? =).]
Why rule-based table checking works
• Tables and rules take two fundamentally different but complementary views:
  • Tables are row-centric: an enumeration of cases (row = case)
  • Rules are column-centric: an expression of relationships between columns
• Comparing the two views against each other minimizes the chance of a bug escaping
  • Ideally, tables and rules should be constructed by two different people
• Expressing complex relationships between visible columns is simplified by means of hidden columns
  • “Cause-and-effect” metaphor: Hidden columns are the “ultimate” but invisible “causes” of visible columns
  • Hidden columns are hidden by existentially quantifying them away
  • Hidden columns also further increase the difference between the two views
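A minimal sketch of the two views, assuming a hypothetical four-column snoop table and a hypothetical hidden column `had_copy` (neither comes from the SP specification). The rules are written column by column; enumerating all satisfying assignments and projecting away the hidden column (existential quantification) generates the table, which is then compared with a hand-written, row-centric table containing one deliberate typo. Plain enumeration with `itertools.product` stands in for a BDD-based enumeration.

```python
from itertools import product

# Hypothetical toy table: a cache's reaction to a snoop request.
# Visible columns: request, current state, next state, snoop response.
REQ = ('Rd', 'RdInv')
CUR = ('I', 'S', 'E', 'M')
NXT = ('I', 'S')
RESP = ('Miss', 'Hit', 'HitM')
HIDDEN = (False, True)  # hidden column "had_copy" (the invisible cause)


def rules(req, cur, nxt, resp, had_copy):
    """Column-centric view: each conjunct relates a few columns."""
    return all([
        had_copy == (cur != 'I'),
        (resp == 'Miss') == (not had_copy),
        (resp == 'HitM') == (cur == 'M'),
        nxt == ('I' if req == 'RdInv' or not had_copy else 'S'),
    ])


def generate():
    """Enumerate satisfying assignments, then existentially quantify away
    the hidden column by projecting onto the visible columns."""
    return {(r, c, n, s)
            for r, c, n, s, h in product(REQ, CUR, NXT, RESP, HIDDEN)
            if rules(r, c, n, s, h)}


# Row-centric view: the table as a (hypothetical) spec writer typed it,
# with a typo in the ('Rd', 'E') row: the next state should be S.
TABLE = {
    ('Rd', 'I', 'I', 'Miss'), ('Rd', 'S', 'S', 'Hit'),
    ('Rd', 'E', 'E', 'Hit'), ('Rd', 'M', 'S', 'HitM'),
    ('RdInv', 'I', 'I', 'Miss'), ('RdInv', 'S', 'I', 'Hit'),
    ('RdInv', 'E', 'I', 'Hit'), ('RdInv', 'M', 'I', 'HitM'),
}

generated = generate()
missing = generated - TABLE  # rows the rules require but the table lacks
extra = TABLE - generated    # rows in the table that violate the rules
print(sorted(missing), sorted(extra))
# prints: [('Rd', 'E', 'S', 'Hit')] [('Rd', 'E', 'E', 'Hit')]
```

Reporting both set differences separates the two typical error classes: missing cases (rows only in `missing`) and wrong cases (rows only in `extra`).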
Results of rule-based table checking
• Coded Boolean rules for the SP protocol tables and checked tables and rules against each other
  • Typically dozens of errors were found before tables and rules agreed
  • Most errors were trivial (e.g., typos), but some were more serious (e.g., missing cases or systematic misunderstandings)
• Maintained the agreement between tables and rules over 2 years and tens of major and minor protocol revisions
  • Changing rules to keep up with tables almost never required more effort than changing the tables themselves
• Rule-based table checking is our first line of defense: it flushes out virtually all “easy” bugs and has very low computational overhead
  • It takes less than 5 minutes to extract all SP protocol tables and verify them against the rules
• We are not advocating that “code review” of tables be eliminated
  • “Code review” is still a must at the beginning
  • We do advocate that insights from “code review” be captured, codified as rules, and reused later when tables are changed
Novel applications of binary decision diagrams
1. Rule-based table generation and checking
  • Boils down to enumerating the satisfying “truth” assignments of Boolean expressions over enumerated types
2. Search for a minimal deadlock-free wormhole routing scheme
  • A wormhole routing scheme is deadlock-free if and only if its channel dependency graph is acyclic, i.e., if and only if the transitive closure of the graph contains no self-loop
  • Hence the search is reducible to a BDD fixpoint computation
  • Details in our FMSD paper
3. Search for fault-tolerant link initialization sequences
  • Details in our FMSD paper
• Observations:
  • “Formal methods thinking” leads to new ways of looking at old problems
  • A little BDD goes a long way:
    • BDD is an efficient representation of Boolean rules (1. above)
    • BDD supports exhaustive search of a “solution space” (2. & 3. above)
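The channel-dependency argument for wormhole routing above can be sketched explicitly. Assumptions: a hypothetical unidirectional ring of `n` nodes with minimal routing, where link `i` runs from node `i` to node `i+1`; the optional "dateline" split into two virtual channels is a textbook scheme chosen only for illustration. Python sets stand in for BDDs, the transitive closure is computed as a least fixpoint, and the code only checks given schemes rather than searching for a minimal one.

```python
def routes(n, dateline):
    """Yield the channel sequence (link, virtual channel) of every route."""
    for s in range(n):
        for d in range(n):
            if s == d:
                continue
            vc, path, node = 0, [], s
            while node != d:
                link = node  # link `node` goes from node to node+1 (mod n)
                if dateline and link == n - 1:
                    vc = 1   # switch virtual channel when crossing the dateline
                path.append((link, vc))
                node = (node + 1) % n
            yield path


def dependency_graph(n, dateline):
    """Edge a -> b if some packet may hold channel a while requesting b."""
    return {(a, b) for p in routes(n, dateline) for a, b in zip(p, p[1:])}


def transitive_closure(edges):
    """Least fixpoint of R = E | (R o E), computed by iteration."""
    closure = set(edges)
    while True:
        step = {(a, d) for (a, b) in closure for (c, d) in edges if b == c}
        if step <= closure:
            return closure
        closure |= step


def deadlock_free(n, dateline):
    """Acyclic dependency graph <=> no self-loop in its transitive closure."""
    closure = transitive_closure(dependency_graph(n, dateline))
    return not any(a == b for a, b in closure)


print(deadlock_free(4, dateline=False), deadlock_free(4, dateline=True))
# prints: False True
```

Without the dateline, the routes 0→2, 1→3, 2→0, and 3→1 create the cycle link 0 → link 1 → link 2 → link 3 → link 0; forcing every packet onto the second virtual channel once it uses link `n-1` breaks that cycle.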
Lessons learned
• Formal modeling steered us toward more precise and concrete protocol specifications than we would have written without it
  • Even an abstract formal model requires one to spell out exactly what one means by each protocol structure and action
  • Formal modeling also turned out to be an excellent way to help architects articulate their ideas
• Formal verification gave us much higher confidence in the correctness of our protocol specifications than we would have had without it
  • Certain distributed protocols (e.g., directory-based cache coherence protocols) are too complex for unaided human reasoning to get right
  • Formal verification makes it less risky to modify protocol specifications
• Architecture definition affords a rich and fruitful area for the application of formal methods
  • Avoid state explosion with a high level of abstraction
  • Get to bugs at the earliest possible stage
  • Encourage architects to choose more “validation-friendly” schemes
• Applying formal methods to a specification enables the exploration of design spaces that are beyond the scope of any particular implementation
  • Especially important for Intel, which defines architectures that will be implemented by multiple vendors over multiple product generations
Acknowledgements
• Mani Azimi, Jay Jayasimha, Akhilesh Kumar, Victor W. Lee, Phanindra K. Mannava, Seungjoon Park, and Aniruddha Vaidya all contributed to the work described above.