Applying Formal Methods to Protocol Specifications and System Architecture
Ching-Tsun Chou
Multi-Processor Architecture, Enterprise Platforms Group, Intel Corporation
Disclaimer The views expressed in this talk are the presenter’s alone and not necessarily those of Intel Corporation
Why formal methods?
• Architectural specifications contain complex distributed protocols whose correctness is nontrivial to establish
  • Examples: directory-based cache coherence protocols, forward-progress mechanisms, variations of sliding window protocols, ...
• Goal: get protocol specifications correct before implementations commence
  • The earlier a bug is found, the easier it is to fix it, and the more flexibility there is in possible fixes
  • Formal verification (FV) is a body of powerful techniques for achieving this goal
• Formal modeling promotes clear thinking and minimizes misunderstanding and misinterpretation of specifications
  • In the early stages of protocol design, more bugs are found during formal modeling than by model checking
  • Protocol design and formal modeling should go hand in hand
• Formal modeling produces unambiguous "golden models" of at least some aspects of complex protocols
  • Executable reference models can be generated from formal models
• Experience shows that formal methods work
  • Already a standard industry practice: Intel, Sun, IBM, Compaq, SGI, ...
  • As this talk hopes to demonstrate
Formal vs. simulation-based verification
• Simulation-based verification: check a small fraction of all possible behaviors of a large model
  + Very large and relatively complete model
  + Model need not be simplified or abstracted
  – Only a very small part of the state space is explored
  – Need to generate tests and collect coverage feedback
  > Results only as good as your tests and checkers
• Formal verification: check all possible behaviors of a small model
  + All states are exhaustively explored
  + No tests are needed and coverage is 100%
  – Only very small models can be handled
  – Often need drastic simplification and abstraction
  > Results only as good as your models and properties
• Moral: there is no free lunch!
Overview of Intel’s Scalability Port (SP) architecture
• Designed for mid-range shared memory multiprocessors
  • Employs a high-speed point-to-point interconnect that provides good scalability for mid-range to high-end systems
  • Shared buses are neither cost-effective nor scalable beyond a limited number of processors, due to signaling, thermal, mechanical, and other challenges
• Supports flexible system architecture
  • Enables systems ranging from cost-optimal small systems to scalable high-end systems
  • Enables system vendors with proprietary system interconnects and components to use Intel building blocks
• An instance of SP was implemented in Intel’s E8870 chipset
Overview of SP cache coherence protocol
• Makes no assumption whatsoever about the relative timing of events or the ordering of messages
  • Completely asynchronous, event-driven specification (see the sketch below)
• Directory-based, though the directory:
  • is optional (no directory = null directory)
  • may or may not be physically distributed
• A generalization of the invalidation-based MESI protocol, but different caching agents may do MESI-state transitions at different times
• Employs mechanisms to resolve conflicts — collisions of requests from different requesters to the same cache line — in a distributed manner
• Table-based specification with ~1,000 rows in all tables
• >20 transaction types, each of which has a different behavior by itself and can interact with every other transaction type
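To make the asynchronous, event-driven style concrete, here is a minimal sketch in Python. It is not the SP specification itself; the names (Msg, AsyncModel, the handler signature, the message kinds) are invented for illustration. The key point it shows is that messages sit in an unordered pool and may be delivered in any order, so no timing or ordering assumptions are built into the model.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Msg:
    src: str    # sending agent (illustrative)
    dst: str    # receiving agent
    kind: str   # e.g. "Snoop", "Ack" (toy message kinds, not SP's)

class AsyncModel:
    """Messages live in an unordered pool; any pending message may fire next,
    so the model encodes no assumptions about timing or delivery order."""
    def __init__(self, handlers):
        # handlers: agent name -> function(state, msg) -> iterable of outgoing Msg
        self.handlers = handlers
        self.state = {a: {} for a in handlers}
        self.pool = set()      # unordered pool: delivery order is unconstrained

    def send(self, msg):
        self.pool.add(msg)

    def step(self):
        """Deliver one arbitrarily chosen pending message. (A model checker
        would explore *all* such choices rather than picking one at random.)"""
        if not self.pool:
            return False
        msg = random.choice(list(self.pool))
        self.pool.remove(msg)
        for out in self.handlers[msg.dst](self.state[msg.dst], msg):
            self.pool.add(out)
        return True
```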
SP cache coherence protocol validation flow
• From the protocol specification, protocol tables (p-tables) are extracted, and a second set of p-tables is generated from boolean rules; comparing the two ("? =") finds "easy" bugs in the protocol spec
• The p-tables, together with non-table-based code, feed both a formal verification model and a C reference model: model checking of the former finds "hard" bugs in the protocol spec, while simulation against the latter finds bugs in implementations
Properties verified
• Data consistency:
  • If a cache's state is valid (i.e., S, E, or M), then its data is up to date
• Cache and directory state consistency (see the invariant sketch below):
  • If any cache is in state E or M, the other caches must be in I
  • If a presence bit in the directory is 0, the corresponding cache must be in I
  • If the directory state is I, all presence bits are 0 (and hence all caches are in I)
  • If the directory state is S, the caches whose presence bits are 1 are in I or S
  • If the directory state is E, exactly one presence bit is 1
• Weak liveness properties: AG EF (cs = CS), for each "control state" cs and each possible value CS of cs
  • Excellent guard against missing rows in protocol tables and other unexpected cases
  • Detect both global and local deadlocks, but not livelock or starvation
  • Do not rely on fairness assumptions
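The cache/directory state consistency properties above can be expressed as a single invariant over an abstract snapshot of the system. The sketch below is a hypothetical Python rendering, not the actual model-checking code: the state names follow the slide, but the function and argument names are invented.

```python
def consistent(dir_state, presence, caches):
    """Invariant over one cache line.
    caches:   dict node -> one of 'I', 'S', 'E', 'M'
    presence: dict node -> 0 or 1 (directory presence bits)
    dir_state: directory state, one of 'I', 'S', 'E'."""
    # If any cache is in E or M, every other cache must be in I.
    for n, st in caches.items():
        if st in ('E', 'M') and any(c != 'I' for m, c in caches.items() if m != n):
            return False
    # If a presence bit is 0, the corresponding cache must be in I.
    if any(presence[n] == 0 and caches[n] != 'I' for n in caches):
        return False
    # Directory state I: all presence bits are 0.
    if dir_state == 'I' and any(presence.values()):
        return False
    # Directory state S: caches whose presence bits are 1 are in I or S.
    if dir_state == 'S' and any(presence[n] and caches[n] not in ('I', 'S') for n in caches):
        return False
    # Directory state E: exactly one presence bit is 1.
    if dir_state == 'E' and sum(presence.values()) != 1:
        return False
    return True
```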
Results of SP cache coherence protocol FV
• The SP cache coherence protocol has >10^33 reachable states for a configuration containing 1 cache-line address, 1 home node, 2 caching nodes, and all >20 transaction types
• Each property takes (on average) 4~5 hours to model-check on a 700 MHz Pentium III Xeon machine with 4 GB of physical memory
• Many interesting bugs were found in successive versions of the SP cache coherence protocol by both formal modeling and model checking
  • In fact, more bugs were found by the former than by the latter in the early phase of SP protocol design
• Not surprisingly, most problems were found when SP was first designed and during major revisions (e.g., when new transaction types were added)
  • But even minor revisions could introduce problems
• Moral: as far as cache coherence protocols are concerned, unaided human reasoning should not be trusted
Rule-based table checking flow (sketched below)
• The specification document, kept in a word processor (e.g., FrameMaker), is converted to HTML
• Protocol tables are extracted from the HTML, flattened into a pre-processed table, and then post-processed
• Independently, a generated table is produced from the boolean rules
• The generated table and the post-processed table are compared ("? ="); any mismatch points to an error in the tables or in the rules
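As a rough sketch of the "generate and compare" step, rows can be enumerated from declarative rules over enumerated column domains and diffed against the rows extracted from the specification document. All names below are assumptions made for the example; the real flow operates on the extracted HTML tables and a richer rule language.

```python
from itertools import product

def generate_rows(domains, rules):
    """domains: dict column -> list of allowed values.
    rules: list of predicates over a row dict; a row is kept iff all hold."""
    cols = list(domains)
    for values in product(*(domains[c] for c in cols)):
        row = dict(zip(cols, values))
        if all(rule(row) for rule in rules):
            yield frozenset(row.items())

def compare(extracted_rows, domains, rules):
    """Return (rows in the spec tables but not derivable from the rules,
               rows required by the rules but missing from the tables)."""
    generated = set(generate_rows(domains, rules))
    extracted = {frozenset(r.items()) for r in extracted_rows}
    return extracted - generated, generated - extracted

# Hypothetical usage: a toy 2-column table constrained by one rule.
domains = {"req": ["Read", "RFO"], "dir_state": ["I", "S", "E"]}
rules = [lambda r: not (r["req"] == "Read" and r["dir_state"] == "E")]
unexpected, missing = compare([{"req": "Read", "dir_state": "I"}], domains, rules)
```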
Why rule-based table checking works
• Tables and rules take two fundamentally different but complementary views:
  • Tables are row-centric: an enumeration of cases (row = case)
  • Rules are column-centric: an expression of relationships between columns
  • By comparing the two views against each other, the chance of a bug escaping is minimized
  • Ideally, tables and rules should be constructed by two different persons
• Expressing complex relationships between visible columns is simplified by means of hidden columns (see the sketch below)
  • "Cause-and-effect" metaphor: hidden columns are the "ultimate" but invisible "causes" of the visible columns
  • Hidden columns are hidden by existentially quantifying them away
  • Hidden columns are used to further increase the difference between the two views
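The hidden-column idea can be illustrated, again with invented names, by writing the rules over both visible and hidden columns and then existentially quantifying the hidden columns away, i.e., projecting the satisfying rows onto the visible columns only.

```python
from itertools import product

def generate_visible_rows(visible, hidden, rules):
    """visible, hidden: dicts column -> domain (lists of values).
    rules: predicates over a full row (visible + hidden columns).
    Hidden columns play the role of invisible 'causes'; they are
    quantified away by keeping only the visible part of each row."""
    cols = list(visible) + list(hidden)
    doms = [visible[c] if c in visible else hidden[c] for c in cols]
    seen = set()
    for values in product(*doms):
        row = dict(zip(cols, values))
        if all(rule(row) for rule in rules):
            # "exists hidden columns such that the rules hold"
            seen.add(frozenset((c, row[c]) for c in visible))
    return seen
```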
Results of rule-based table checking
• Coded boolean rules for the SP protocol tables and checked the rules and tables against each other
  • Typically dozens of errors were found before the tables and rules agreed
  • Most errors were trivial (e.g., typos), but some were more serious (e.g., missing cases or systematic misunderstandings)
• Maintained the agreement between tables and rules over 2 years and tens of major and minor protocol revisions
  • Changing the rules to keep up with the tables almost never required more effort than changing the tables themselves
• Rule-based table checking is our first line of defense: it flushes out virtually all "easy" bugs and has very low computational overhead
  • It takes <5 minutes to extract all SP protocol tables and verify them against the rules
• We are not advocating that "code review" of tables be eliminated
  • "Code review" is still a must at the beginning
  • We do advocate that insights from "code review" be captured and codified as rules, and re-used later when tables are changed
Novel applications of binary decision diagrams
• Rule-based table generation and checking
  • Boils down to enumerating satisfying ("truth") assignments of boolean expressions over enumerated types
• Search for minimal deadlock-free wormhole routing schemes
  • A wormhole routing scheme is deadlock-free iff its channel dependency graph is acyclic, i.e., iff the transitive closure of the graph contains no self-loop (see the sketch below)
  • Hence reducible to a BDD fixpoint computation
  • Details in our FMSD paper
• Search for fault-tolerant link initialization sequences
  • Details in our FMSD paper
• Observations:
  • "Formal methods thinking" leads to new ways of looking at old problems
  • A little BDD goes a long way:
    • BDDs are an efficient representation of boolean rules (1 above)
    • BDDs support exhaustive search of a "solution space" (2 and 3 above)
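For the wormhole-routing check, the acyclicity test reduces to a reachability fixpoint. The sketch below performs that fixpoint on explicit edge sets in Python, with hypothetical channel names; the FMSD work performs the same computation symbolically with BDDs, which is what makes exhaustive search over routing schemes feasible.

```python
def deadlock_free(edges):
    """edges: set of (src_channel, dst_channel) pairs in the channel
    dependency graph. Deadlock-free iff the transitive closure contains
    no self-loop, i.e., iff the graph is acyclic."""
    closure = set(edges)
    while True:
        # One fixpoint step: compose the current closure with itself.
        step = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if step <= closure:
            # Fixpoint reached: check for self-loops.
            return all(a != b for (a, b) in closure)
        closure |= step

# Hypothetical 3-channel examples: c0 -> c1 -> c2 -> c0 forms a cycle.
assert not deadlock_free({("c0", "c1"), ("c1", "c2"), ("c2", "c0")})
assert deadlock_free({("c0", "c1"), ("c1", "c2")})
```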
Lessons learned
• Formal modeling steered us toward more precise and concrete protocol specifications than we would have written without it
  • Even an abstract formal model requires one to spell out exactly what one means by each protocol structure and action
  • Formal modeling also turned out to be an excellent way to help architects articulate their ideas
• Formal verification gave us much higher confidence in the correctness of our protocol specifications than we would have had without it
  • Certain distributed protocols (e.g., directory-based cache coherence protocols) are too complex for unaided human reasoning alone to get correct
  • Formal verification makes it less risky to modify protocol specifications
• Architecture definition affords a rich and fruitful area for the application of formal methods
  • Avoid state explosion with a high level of abstraction
  • Get to bugs at the earliest possible stage
  • Encourage architects to choose more "validation-friendly" schemes
  • Applying formal methods to a specification enables the exploration of design spaces that are beyond the scope of any particular implementation
  • Especially important for Intel, which defines architectures that will be implemented by multiple vendors over multiple product generations
Acknowledgements • Mani Azimi, Jay Jayasimha, Akhilesh Kumar, Victor W. Lee, Phanindra K. Mannava, Seungjoon Park, and Aniruddha Vaidya all contributed to the work described above.