Dynamic Verification of End-to-End Multiprocessor Invariants
Daniel J. Sorin¹, Mark D. Hill², David A. Wood²
¹Department of Electrical & Computer Engineering, Duke University
²Computer Sciences Department, University of Wisconsin-Madison
My Talk in One Slide
• Commercial server availability is important
• System model: Symmetric Multiprocessor (SMP)
• Fault model: mostly transient faults, some permanent
• Recent work developed efficient checkpoint/recovery
  - But we can only recover from hardware errors we detect
  - Many hardware errors are hard to detect
• Proposal: dynamic verification of invariants
  - Online checking of end-to-end system invariants
  - Checking performed with distributed signature analysis
  - Triggers recovery if an invariant is violated
Outline
• Background
  - SMPs and availability
  - Existing hardware error detection
• Invariant checking with distributed signature analysis
• Two invariant checkers
• Evaluation
• Conclusions
Symmetric Multiprocessor (SMP) System Model
[Figure: processors P1-P4 on a shared-wire bus. A cache coherence transaction takes a block from Invalid (I) to Modified (M): issue request, wait for response, receive response.]
Symmetric Multiprocessor (SMP) System Model
[Figure: processors P1-P4 connected by a switched interconnect instead of a shared-wire bus; the same I-to-M cache coherence transaction.]
• Broadcast request not delivered to a subset of nodes
• Broadcast requests delivered out of order to a subset of nodes
Symmetric Multiprocessor (SMP) System Model
[Figure: switched interconnect; the broadcast request arrives at different nodes at different times (t1, t2, t3) relative to the I-to-M transaction's responses.]
• Broadcast request not delivered to a subset of nodes
• Broadcast requests delivered out of order to a subset of nodes
• More chances for incorrect state transitions
Backward Error Recovery
• Can improve availability with backward error recovery (BER)
  - If an error is detected, recover to the pre-fault state
• BER requires:
  - A checkpoint/recovery mechanism
  - Error detection mechanisms
SafetyNet Checkpoint/Recovery
• SafetyNet: an all-hardware scheme [ISCA 2002]
• Periodically takes a logical checkpoint of the multiprocessor
  - MP state: processor registers, caches, memory
  - Incrementally logs changes to caches and memory (sketched below)
• Consistent checkpointing performed in logical time
  - E.g., every 3000 broadcast cache coherence requests
• Can tolerate >100,000 cycles of error detection latency
[Figure: timeline of checkpoints CP1-CP4; older checkpoints are validated, recent ones are pending validation (still detecting errors), and the most recent interval is the active execution.]
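To make the incremental logging concrete, here is a minimal software sketch of the undo-log idea (SafetyNet itself is an all-hardware mechanism; the structure and names below, e.g. log_old_value, are our own illustration): the old value of a line is saved before its first overwrite within a checkpoint interval, and recovery replays the log backward.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES   64
#define LOG_CAPACITY 4096

/* One undo-log record: which line was modified and its pre-write contents. */
typedef struct {
    uint64_t addr;                  /* address of the cache line           */
    uint8_t  old_data[LINE_BYTES];  /* contents before the first overwrite */
} log_entry;

static log_entry log_buf[LOG_CAPACITY];
static int       log_len = 0;

/* Called before the FIRST write to a line within the current checkpoint
 * interval: save the old value so the checkpoint can be restored later. */
void log_old_value(uint64_t addr, const uint8_t *old_data) {
    log_buf[log_len].addr = addr;
    memcpy(log_buf[log_len].old_data, old_data, LINE_BYTES);
    log_len++;
}

/* Recovery: walk the log backward, restoring each line's old contents.
 * restore_line() stands in for writing the data back to cache/memory.  */
void recover_to_checkpoint(void (*restore_line)(uint64_t, const uint8_t *)) {
    for (int i = log_len - 1; i >= 0; i--)
        restore_line(log_buf[i].addr, log_buf[i].old_data);
    log_len = 0;
}

/* Once a checkpoint is validated (all error detection for its interval
 * has completed), its log entries can simply be discarded.             */
void validate_checkpoint(void) { log_len = 0; }
```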
Error Detection
• Error model: mostly due to transient faults
• Example error detection mechanisms:
  - Parity bit on a cache line (sketched below)
  - Checksum on an incoming message
  - Timeout on a cache coherence transaction
• But error detection for servers is still weak. Why?
  - Error detection is often on the critical path and must be fast
  - Fast error detection can't incorporate information from other nodes
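As a concrete instance of one such fast, local mechanism, here is a minimal sketch of an even-parity computation over a cache line (illustrative only; real designs typically keep parity or ECC at word granularity):

```c
#include <stdint.h>

#define LINE_BYTES 64

/* Even parity over a cache line: returns 1 if the number of set bits is
 * odd, so that storing this bit alongside the line makes the total even.
 * A single flipped bit is detected; an even number of flips is not.     */
uint8_t line_parity(const uint8_t line[LINE_BYTES]) {
    uint8_t acc = 0;
    for (int i = 0; i < LINE_BYTES; i++)
        acc ^= line[i];              /* fold all bytes together */
    acc ^= acc >> 4;                 /* fold 8 bits down to 1   */
    acc ^= acc >> 2;
    acc ^= acc >> 1;
    return acc & 1;
}
```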
Why Local Information Isn't Sufficient
[Figure: P1-P4 on a switched interconnect. Initially, P1 holds the block Shared and P2 holds it Owned.]
Why Local Information Isn't Sufficient
[Figure: P1 broadcasts a Request for Exclusive; a fault in one switch corrupts the broadcast's delivery to P2, which holds the block Owned.]
Why Local Information Isn't Sufficient
[Figure: a Data Response is returned to P1; P2 (Owned) never observed the request correctly, and the remaining nodes are Invalid.]
Why Local Information Isn't Sufficient
[Figure: end state: P1 holds the block Modified while P2 still holds a Shared copy, so the system is incoherent.]
• Neither P1 nor P2 can detect that an error has occurred!
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions
Distributed Signature Analysis
• Reduces a long history of events into a small signature
• Signatures map almost-uniquely to event histories
[Figure: each processor (P1, P2, ...) folds its event history (Event 1, Event 2, ..., Event N) into a local signature; a checker compares the signatures periodically in logical time (every 3000 requests).]
Designing Signature Analysis Schemes
• Must devise two functions: Update and Check
  - Signature(Pi) = Update(Signature(Pi), Event)
  - Check(Signature(P1), ..., Signature(PN)) = true if error
• Simple example: check that message inflow = outflow (sketched below)
  - Assume only unicast messages
  - Update: +1 for a receive, -1 for a send
  - Check: true if the sum of all signatures doesn't equal 0
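A minimal C sketch of this inflow = outflow example (the function names sig_update and sig_check are ours, not from the paper):

```c
#include <stdint.h>

#define NUM_NODES 16

typedef enum { MSG_SEND, MSG_RECEIVE } msg_event;

/* Update: fold one event into a node's local signature.
 * For the inflow = outflow invariant the signature is just a counter:
 * +1 for every message received, -1 for every message sent.           */
void sig_update(int64_t *sig, msg_event e) {
    *sig += (e == MSG_RECEIVE) ? 1 : -1;
}

/* Check: run by the system controller over all nodes' signatures.
 * With only unicast messages, every send is matched by exactly one
 * receive, so the signatures must sum to zero; a nonzero sum means a
 * message was lost, duplicated, or misdelivered. Returns 1 on error.  */
int sig_check(const int64_t sig[NUM_NODES]) {
    int64_t sum = 0;
    for (int i = 0; i < NUM_NODES; i++)
        sum += sig[i];
    return sum != 0;
}
```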
Implementing Distributed Signature Analysis
• All components cooperate to perform checking
  - Component = cache controller or memory controller
• Each component contains:
  - A local signature register
  - Logic to compute signature updates
• The system contains:
  - A system controller that performs the Check function
• Use distributed signature analysis for dynamic verification
  - Verify end-to-end invariants
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
  - Message invariant
  - Cache coherence invariant
• Evaluation
• Conclusions
A Message-Level Invariant Checker
• Context: symmetric multiprocessor (SMP)
  - Cache coherence with a broadcast snooping protocol
• Invariant: all nodes see the same total order of broadcast cache coherence requests
• Update: for each incoming broadcast, "add" the Address
  - Not quite this simple (e.g., plain addition doesn't detect reorderings)
• Check: error if all signatures aren't equal (naive sketch below)
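A naive C sketch of this checker (function names are ours): each node folds the address of every broadcast it observes into its signature by modular addition, and the check flags an error if any two signatures differ. As noted above, plain addition is order-insensitive, which is one of the aliasing problems examined next.

```c
#include <stdint.h>

#define NUM_NODES 16

/* Update: fold the address of each incoming broadcast request into the
 * node's signature. Plain modular addition is commutative, so two nodes
 * that saw the same requests in DIFFERENT orders still match -- one of
 * the inherent aliasing problems discussed on the next slide.           */
void msg_sig_update(uint64_t *sig, uint64_t addr) {
    *sig += addr;   /* wraps mod 2^64 */
}

/* Check: all nodes must have seen the same total order of broadcasts,
 * so all signatures must be equal. Returns 1 if an error is detected.  */
int msg_sig_check(const uint64_t sig[NUM_NODES]) {
    for (int i = 1; i < NUM_NODES; i++)
        if (sig[i] != sig[0])
            return 1;
    return 0;
}
```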
Aliasing
• Aliasing occurs if two histories have the same signature
• 3 possible sources of aliasing:
  - Finite resources: b bits can only distinguish 2^b histories
  - A fault in the signature analysis hardware itself
  - An inherent flaw in the scheme
• Examples of inherent aliasing in the previous scheme:
  - Arrival of a message with Address=0 doesn't change the signature
  - Reordering of messages doesn't change the signature
• We solve these aliasing issues in the paper
  - Tricks: hash more than one field of the message, use LFSRs, etc. (see the sketch below)
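As an illustration of those tricks, here is one possible order-sensitive update, sketched under our own assumptions (the slide names the techniques, but this particular construction, including the feedback polynomial, is not taken from the paper): step an LFSR between events so the signature depends on arrival order, and mix in the message type as well as the address so that an Address=0 message still perturbs the signature.

```c
#include <stdint.h>

/* Advance a 64-bit Galois LFSR by one step. The feedback mask is a
 * commonly used maximal-length polynomial; the specific choice is an
 * assumption, not from the paper.                                      */
static uint64_t lfsr_step(uint64_t s) {
    return (s >> 1) ^ (-(s & 1) & 0xD800000000000000ULL);
}

/* Order-sensitive update: the running signature is scrambled by the
 * LFSR before each event is mixed in, so swapping two distinct events
 * yields a different final signature. Hashing the message type along
 * with the address (+1 keeps the event hash nonzero) means even an
 * Address=0 message changes the signature.                             */
void robust_sig_update(uint64_t *sig, uint64_t addr, uint8_t msg_type) {
    uint64_t ev = addr * 0x9E3779B97F4A7C15ULL + msg_type + 1;
    *sig = lfsr_step(*sig) ^ ev;
}
```

With all nodes starting from the same seed, the Check function is unchanged: all signatures must still be equal.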
A Cache Coherence Invariant Checker
• Invariant: all coherence upgrades cause downgrades
  - Upgrade: increase permissions to a block (e.g., none -> read)
  - Downgrade: decrease permissions (e.g., write -> read)
• Update: add the Address for an upgrade, subtract the Address for a downgrade
• Check: error if the sum of all signatures doesn't equal 0 (first-cut sketch below)
• Challenges:
  - There can be more than one downgrade per upgrade
  - The upgrader doesn't know how many downgraders exist
  - See the paper for solutions to these challenges
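A first-cut C sketch of this checker that ignores the multiple-downgrader challenges listed above (function names are ours): upgrades add the block address into the local signature, downgrades subtract it, and the system controller flags an error when the signatures do not cancel.

```c
#include <stdint.h>

#define NUM_NODES 16

/* Update: an upgrade (gaining permission to a block) adds the block's
 * address; a downgrade (losing permission) subtracts it. Arithmetic
 * wraps mod 2^64, so cancellation still works despite overflow.       */
void coh_sig_update(uint64_t *sig, uint64_t addr, int is_upgrade) {
    if (is_upgrade)
        *sig += addr;
    else
        *sig -= addr;
}

/* Check: if every upgrade was matched by its downgrades, the additions
 * and subtractions cancel across the whole system and the signatures
 * sum to zero. A nonzero sum flags a coherence violation, e.g., a
 * writer that never invalidated a stale sharer.                       */
int coh_sig_check(const uint64_t sig[NUM_NODES]) {
    uint64_t sum = 0;
    for (int i = 0; i < NUM_NODES; i++)
        sum += sig[i];
    return sum != 0;
}
```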
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions
Methodology
• Full-system simulation of a 16-processor machine
  - Simics provides functional simulation of everything
  - We added timing simulation for the memory system & SafetyNet
• Commercial workloads running on Solaris 8
  - Database: IBM's DB2 running online transaction processing
  - Static web server: Apache
  - Dynamic web server: Slashdot
  - Java middleware
Detection Coverage
• How do we know if our checkers work?
• Inject errors periodically:
  - Corrupt messages
  - Drop messages
  - Reorder messages
  - Improperly process cache coherence messages
• Result: the global invariant checkers detected all injected errors
Performance
[Figure: performance results; error bars represent +/- one standard deviation.]
Conclusions
• Goal: improve multiprocessor availability
• How? Dynamic verification of end-to-end invariants
  - Implemented with distributed signature analysis
• Results:
  - Detects previously undetectable hardware errors
  - Negligible performance overhead for error-free execution
• Duke FaultFinder Project: http://www.ee.duke.edu/~sorin/faultfinder
• Wisconsin Multifacet Project: http://www.cs.wisc.edu/multifacet/