Dynamic Verification of End-to-End Multiprocessor Invariants
Daniel J. Sorin¹, Mark D. Hill², David A. Wood²
¹Department of Electrical & Computer Engineering, Duke University
²Computer Sciences Department, University of Wisconsin-Madison
My Talk in One Slide
• Commercial server availability is important
• System model: Symmetric Multiprocessor (SMP)
• Fault model: mostly transient faults, some permanent
• Recent work developed efficient checkpoint/recovery
  - But we can only recover from hardware errors we detect
  - Many hardware errors are hard to detect
• Proposal: dynamic verification of invariants
  - Online checking of end-to-end system invariants
  - Checking performed with distributed signature analysis
  - Triggers recovery if an invariant is violated
Outline
• Background
  - SMPs and availability
  - Existing hardware error detection
• Invariant checking with distributed signature analysis
• Two invariant checkers
• Evaluation
• Conclusions
Symmetric Multiprocessor (SMP) System Model
[Figure: processors P1-P4 on a shared-wire bus. A cache coherence transaction takes a block from Invalid (I) to Modified (M): issue request, wait for response, receive response.]
Symmetric Multiprocessor (SMP) System Model
[Figure: processors P1-P4 connected by a switched interconnect instead of a shared-wire bus; the same I-to-M cache coherence transaction.]
• Broadcast request not delivered to a subset of nodes
• Broadcast requests delivered out of order to a subset of nodes
Symmetric Multiprocessor (SMP) System Model
[Figure: switched interconnect; the broadcast request arrives at different nodes at different times (t1, t2, t3) relative to the I-to-M transaction's responses.]
• Broadcast request not delivered to a subset of nodes
• Broadcast requests delivered out of order to a subset of nodes
• More chances for incorrect state transitions
Backward Error Recovery
• Can improve availability with backward error recovery (BER)
  - If an error is detected, recover to the pre-fault state
• BER requires:
  - A checkpoint/recovery mechanism
  - Error detection mechanisms
SafetyNet Checkpoint/Recovery
• SafetyNet: an all-hardware scheme [ISCA 2002]
• Periodically takes a logical checkpoint of the multiprocessor
  - MP state: processor registers, caches, memory
  - Incrementally logs changes to caches and memory (sketched below)
• Consistent checkpointing performed in logical time
  - E.g., every 3000 broadcast cache coherence requests
• Can tolerate >100,000 cycles of error detection latency
[Figure: timeline of checkpoints CP1-CP4; older checkpoints are validated, recent ones are pending validation (still detecting errors), and the most recent interval is the active execution.]
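To make the incremental logging concrete, here is a minimal software sketch of the undo-log idea (SafetyNet itself is an all-hardware mechanism; the structure and names below, e.g. log_old_value, are our own illustration): the old value of a line is saved before its first overwrite within a checkpoint interval, and recovery replays the log backward.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES   64
#define LOG_CAPACITY 4096

/* One undo-log record: which line was modified and its pre-write contents. */
typedef struct {
    uint64_t addr;                  /* address of the cache line           */
    uint8_t  old_data[LINE_BYTES];  /* contents before the first overwrite */
} log_entry;

static log_entry log_buf[LOG_CAPACITY];
static int       log_len = 0;

/* Called before the FIRST write to a line within the current checkpoint
 * interval: save the old value so the checkpoint can be restored later. */
void log_old_value(uint64_t addr, const uint8_t *old_data) {
    log_buf[log_len].addr = addr;
    memcpy(log_buf[log_len].old_data, old_data, LINE_BYTES);
    log_len++;
}

/* Recovery: walk the log backward, restoring each line's old contents.
 * restore_line() stands in for writing the data back to cache/memory.  */
void recover_to_checkpoint(void (*restore_line)(uint64_t, const uint8_t *)) {
    for (int i = log_len - 1; i >= 0; i--)
        restore_line(log_buf[i].addr, log_buf[i].old_data);
    log_len = 0;
}

/* Once a checkpoint is validated (all error detection for its interval
 * has completed), its log entries can simply be discarded.             */
void validate_checkpoint(void) { log_len = 0; }
```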
Error Detection
• Error model: mostly due to transient faults
• Example error detection mechanisms:
  - Parity bit on a cache line (sketched below)
  - Checksum on an incoming message
  - Timeout on a cache coherence transaction
• But error detection for servers is still weak. Why?
  - Error detection is often on the critical path and must be fast
  - Fast error detection can't incorporate information from other nodes
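As a concrete instance of one such fast, local mechanism, here is a minimal sketch of an even-parity computation over a cache line (illustrative only; real designs typically keep parity or ECC at word granularity):

```c
#include <stdint.h>

#define LINE_BYTES 64

/* Even parity over a cache line: returns 1 if the number of set bits is
 * odd, so that storing this bit alongside the line makes the total even.
 * A single flipped bit is detected; an even number of flips is not.     */
uint8_t line_parity(const uint8_t line[LINE_BYTES]) {
    uint8_t acc = 0;
    for (int i = 0; i < LINE_BYTES; i++)
        acc ^= line[i];              /* fold all bytes together */
    acc ^= acc >> 4;                 /* fold 8 bits down to 1   */
    acc ^= acc >> 2;
    acc ^= acc >> 1;
    return acc & 1;
}
```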
Why Local Information Isn't Sufficient
[Figure: P1-P4 on a switched interconnect. Initially, P1 holds the block Shared and P2 holds it Owned.]
Why Local Information Isn't Sufficient
[Figure: P1 broadcasts a Request for Exclusive; a fault in one switch corrupts the broadcast's delivery to P2, which holds the block Owned.]
Why Local Information Isn't Sufficient
[Figure: a Data Response is returned to P1; P2 (Owned) never observed the request correctly, and the remaining nodes are Invalid.]
Why Local Information Isn't Sufficient
[Figure: end state: P1 holds the block Modified while P2 still holds a Shared copy, so the system is incoherent.]
• Neither P1 nor P2 can detect that an error has occurred!
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions
Distributed Signature Analysis
• Reduces a long history of events into a small signature
• Signatures map almost-uniquely to event histories
[Figure: each processor (P1, P2, ...) folds its event history (Event 1, Event 2, ..., Event N) into a local signature; a checker compares the signatures periodically in logical time (every 3000 requests).]
Designing Signature Analysis Schemes
• Must devise two functions: Update and Check
  - Signature(Pi) = Update(Signature(Pi), Event)
  - Check(Signature(P1), ..., Signature(PN)) = true if error
• Simple example: check that message inflow = outflow (sketched below)
  - Assume only unicast messages
  - Update: +1 for a receive, -1 for a send
  - Check: true if the sum of all signatures doesn't equal 0
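A minimal C sketch of this inflow = outflow example (the function names sig_update and sig_check are ours, not from the paper):

```c
#include <stdint.h>

#define NUM_NODES 16

typedef enum { MSG_SEND, MSG_RECEIVE } msg_event;

/* Update: fold one event into a node's local signature.
 * For the inflow = outflow invariant the signature is just a counter:
 * +1 for every message received, -1 for every message sent.           */
void sig_update(int64_t *sig, msg_event e) {
    *sig += (e == MSG_RECEIVE) ? 1 : -1;
}

/* Check: run by the system controller over all nodes' signatures.
 * With only unicast messages, every send is matched by exactly one
 * receive, so the signatures must sum to zero; a nonzero sum means a
 * message was lost, duplicated, or misdelivered. Returns 1 on error.  */
int sig_check(const int64_t sig[NUM_NODES]) {
    int64_t sum = 0;
    for (int i = 0; i < NUM_NODES; i++)
        sum += sig[i];
    return sum != 0;
}
```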
Implementing Distributed Signature Analysis
• All components cooperate to perform checking
  - Component = cache controller or memory controller
• Each component contains:
  - A local signature register
  - Logic to compute signature updates
• The system contains:
  - A system controller that performs the Check function
• Use distributed signature analysis for dynamic verification
  - Verify end-to-end invariants
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
  - Message invariant
  - Cache coherence invariant
• Evaluation
• Conclusions
A Message-Level Invariant Checker
• Context: symmetric multiprocessor (SMP)
  - Cache coherence with a broadcast snooping protocol
• Invariant: all nodes see the same total order of broadcast cache coherence requests
• Update: for each incoming broadcast, "add" the Address
  - Not quite this simple (e.g., plain addition doesn't detect reorderings)
• Check: error if all signatures aren't equal (naive sketch below)
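A naive C sketch of this checker (function names are ours): each node folds the address of every broadcast it observes into its signature by modular addition, and the check flags an error if any two signatures differ. As noted above, plain addition is order-insensitive, which is one of the aliasing problems examined next.

```c
#include <stdint.h>

#define NUM_NODES 16

/* Update: fold the address of each incoming broadcast request into the
 * node's signature. Plain modular addition is commutative, so two nodes
 * that saw the same requests in DIFFERENT orders still match -- one of
 * the inherent aliasing problems discussed on the next slide.           */
void msg_sig_update(uint64_t *sig, uint64_t addr) {
    *sig += addr;   /* wraps mod 2^64 */
}

/* Check: all nodes must have seen the same total order of broadcasts,
 * so all signatures must be equal. Returns 1 if an error is detected.  */
int msg_sig_check(const uint64_t sig[NUM_NODES]) {
    for (int i = 1; i < NUM_NODES; i++)
        if (sig[i] != sig[0])
            return 1;
    return 0;
}
```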
Aliasing
• Aliasing occurs if two histories have the same signature
• 3 possible sources of aliasing:
  - Finite resources: b bits can only distinguish 2^b histories
  - A fault in the signature analysis hardware itself
  - An inherent flaw in the scheme
• Examples of inherent aliasing in the previous scheme:
  - Arrival of a message with Address=0 doesn't change the signature
  - Reordering of messages doesn't change the signature
• We solve these aliasing issues in the paper
  - Tricks: hash more than one field of the message, use LFSRs, etc. (see the sketch below)
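As an illustration of those tricks, here is one possible order-sensitive update, sketched under our own assumptions (the slide names the techniques, but this particular construction, including the feedback polynomial, is not taken from the paper): step an LFSR between events so the signature depends on arrival order, and mix in the message type as well as the address so that an Address=0 message still perturbs the signature.

```c
#include <stdint.h>

/* Advance a 64-bit Galois LFSR by one step. The feedback mask is a
 * commonly used maximal-length polynomial; the specific choice is an
 * assumption, not from the paper.                                      */
static uint64_t lfsr_step(uint64_t s) {
    return (s >> 1) ^ (-(s & 1) & 0xD800000000000000ULL);
}

/* Order-sensitive update: the running signature is scrambled by the
 * LFSR before each event is mixed in, so swapping two distinct events
 * yields a different final signature. Hashing the message type along
 * with the address (+1 keeps the event hash nonzero) means even an
 * Address=0 message changes the signature.                             */
void robust_sig_update(uint64_t *sig, uint64_t addr, uint8_t msg_type) {
    uint64_t ev = addr * 0x9E3779B97F4A7C15ULL + msg_type + 1;
    *sig = lfsr_step(*sig) ^ ev;
}
```

With all nodes starting from the same seed, the Check function is unchanged: all signatures must still be equal.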
A Cache Coherence Invariant Checker
• Invariant: all coherence upgrades cause downgrades
  - Upgrade: increase permissions to a block (e.g., none -> read)
  - Downgrade: decrease permissions (e.g., write -> read)
• Update: add the Address for an upgrade, subtract the Address for a downgrade
• Check: error if the sum of all signatures doesn't equal 0 (first-cut sketch below)
• Challenges:
  - There can be more than one downgrade per upgrade
  - The upgrader doesn't know how many downgraders exist
  - See the paper for solutions to these challenges
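A first-cut C sketch of this checker that ignores the multiple-downgrader challenges listed above (function names are ours): upgrades add the block address into the local signature, downgrades subtract it, and the system controller flags an error when the signatures do not cancel.

```c
#include <stdint.h>

#define NUM_NODES 16

/* Update: an upgrade (gaining permission to a block) adds the block's
 * address; a downgrade (losing permission) subtracts it. Arithmetic
 * wraps mod 2^64, so cancellation still works despite overflow.       */
void coh_sig_update(uint64_t *sig, uint64_t addr, int is_upgrade) {
    if (is_upgrade)
        *sig += addr;
    else
        *sig -= addr;
}

/* Check: if every upgrade was matched by its downgrades, the additions
 * and subtractions cancel across the whole system and the signatures
 * sum to zero. A nonzero sum flags a coherence violation, e.g., a
 * writer that never invalidated a stale sharer.                       */
int coh_sig_check(const uint64_t sig[NUM_NODES]) {
    uint64_t sum = 0;
    for (int i = 0; i < NUM_NODES; i++)
        sum += sig[i];
    return sum != 0;
}
```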
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions
Methodology
• Full-system simulation of a 16-processor machine
  - Simics provides functional simulation of everything
  - We added timing simulation for the memory system & SafetyNet
• Commercial workloads running on Solaris 8
  - Database: IBM's DB2 running online transaction processing
  - Static web server: Apache
  - Dynamic web server: Slashdot
  - Java middleware
Detection Coverage
• How do we know if our checkers work?
• Inject errors periodically:
  - Corrupt messages
  - Drop messages
  - Reorder messages
  - Improperly process cache coherence messages
• Result: the global invariant checkers detected all injected errors
Performance
[Figure: performance results; error bars represent +/- one standard deviation.]
Conclusions
• Goal: improve multiprocessor availability
• How? Dynamic verification of end-to-end invariants
  - Implemented with distributed signature analysis
• Results:
  - Detects previously undetectable hardware errors
  - Negligible performance overhead for error-free execution
• Duke FaultFinder Project: http://www.ee.duke.edu/~sorin/faultfinder
• Wisconsin Multifacet Project: http://www.cs.wisc.edu/multifacet/