Heavy-ion Fault Injections in the Time-triggered Communication Protocol
Håkan Sivencrona, SP
Per Johannessen, Volvo Car Corporation
Mattias Persson & Jan Torin, Chalmers University of Technology
Agenda
• Objective
• Time-triggered Protocol
• Membership Agreement
• Communication Failures
• Heavy-ion Fault Injections
• Experimental Set-up
• Results
• Discussion
• Conclusions
LADC 2003 São Paulo, Brazil
Objective
• Validate the fault hypothesis and fault handling mechanisms of a specific implementation of TTP/C
• Use the results to improve TTP/C and time-triggered systems in general
• Gain experience with safety-critical broadcast buses using fault injection (FI) techniques
• Explore new failure modes of time-triggered communication
Time-Triggered Protocol
• Time Division Multiple Access, TDMA
• For safety-critical applications
• Fault tolerance is mainly implemented as redundant hardware and software mechanisms
• Fault Hypothesis: tolerate any single fault
• Services:
  • Deterministic message sending
  • Clock synchronization
  • Membership service
  • Clique avoidance
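To make the TDMA idea concrete, here is a minimal sketch of a static slot schedule in which each node owns exactly one sending slot per round. The class name `TdmaSchedule`, the four-node round, and the 250 µs slot length are illustrative assumptions, not values from the TTP/C implementation under test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TdmaSchedule:
    node_ids: tuple      # sending order within one TDMA round (hypothetical)
    slot_len_us: int     # slot duration in microseconds (hypothetical value)

    @property
    def round_len_us(self) -> int:
        # One round is one slot per node.
        return len(self.node_ids) * self.slot_len_us

    def sender_at(self, t_us: int) -> int:
        """Return the node that owns the slot containing global time t_us."""
        slot = (t_us % self.round_len_us) // self.slot_len_us
        return self.node_ids[slot]

    def next_slot_start(self, node_id: int, t_us: int) -> int:
        """Return the earliest time >= t_us at which node_id's slot starts."""
        offset = self.node_ids.index(node_id) * self.slot_len_us
        round_start = (t_us // self.round_len_us) * self.round_len_us
        start = round_start + offset
        return start if start >= t_us else start + self.round_len_us


schedule = TdmaSchedule(node_ids=(1, 2, 3, 4), slot_len_us=250)
```

Because the schedule is fixed before run time, every receiver knows which node should be sending at any instant, which is what makes silence or an ill-timed frame detectable as an error.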
Membership Agreement
• Gives a consistent system state
• All nodes have a membership vector
• The cluster’s membership vector includes the nodes that have the same global state
• Every node is represented by a unique bit in the vectors in all nodes
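The one-bit-per-node representation can be sketched with a plain integer used as a bit vector. The helper names and the choice of bit position equal to node number are assumptions made for illustration; the actual bit layout of the controller is not given in the slides.

```python
def set_member(vector: int, node: int) -> int:
    """Mark node as a member by setting its bit."""
    return vector | (1 << node)


def clear_member(vector: int, node: int) -> int:
    """Remove node from the membership by clearing its bit."""
    return vector & ~(1 << node)


def is_member(vector: int, node: int) -> bool:
    """Check whether node's bit is set."""
    return bool((vector >> node) & 1)


def members(vector: int) -> list:
    """List the node numbers whose bits are set."""
    return [n for n in range(vector.bit_length()) if is_member(vector, n)]


def agrees(local_vector: int, received_vector: int) -> bool:
    """Two nodes share the same view only if their vectors are identical."""
    return local_vector == received_vector


# Four nodes (0..3) join, then node 2 is dropped from the local view.
view = 0
for node in range(4):
    view = set_member(view, node)
view_without_2 = clear_member(view, 2)
```

A single differing bit is enough for `agrees` to fail, which is why an asymmetric fault that changes the view of only some receivers can split the cluster.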
Communication Failures
• A node stops transmitting messages
  • Application fault
  • Controller crash/failure
• A message interference in the physical layer
  • Permanent or temporary persistent
  • Transient
• An asymmetric message interpretation
  • Byzantine
  • Omission inconsistent
• … and the system behavior depends on the application
Heavy-ion Fault Injection
• Californium-252 source that radiates high-energy heavy ions, >> 1 MeV
• Causes so-called single event upsets (SEUs) and other effects in the CMOS device
• Can affect locations not accessible with other methods
• Only statistically reproducible
• Low controllability
• …
Experimental Set-up
• System with 4-9 nodes with similar message schedules
• Software that monitors and detects discrepancies
Fault Injection Results
• Null frame: no transmission, e.g. fail silence
• Checksum (CRC) errors: the message has the right format but the wrong content
• Invalid frame: a message that may or may not be readable but is not valid to use
  • In time domain
  • In value domain
• Time discrepancies: timing that is close to the limit of what is acceptable
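The frame categories above can be illustrated with a small receiver-side classifier. This is a sketch under stated assumptions: the 8-byte payload, the ±10 µs acceptance window, and the use of `zlib.crc32` as a stand-in checksum are all hypothetical and do not reproduce the TTP/C frame format or its CRC polynomial. `flip_bit` mimics a single event upset in a stored frame.

```python
import zlib
from enum import Enum
from typing import Optional

PAYLOAD_LEN = 8   # hypothetical payload size in bytes
CRC_LEN = 4       # CRC-32 stand-in; not the TTP/C CRC
WINDOW_US = 10    # hypothetical acceptance window (+/- microseconds)


class FrameStatus(Enum):
    NULL = "null frame"
    INVALID = "invalid frame"
    CRC_ERROR = "checksum error"
    CORRECT = "correct frame"


def build_frame(payload: bytes) -> bytes:
    """Append a big-endian CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(CRC_LEN, "big")


def flip_bit(frame: bytes, bit: int) -> bytes:
    """Return a copy of frame with one bit inverted (emulated upset)."""
    data = bytearray(frame)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)


def classify(frame: Optional[bytes], arrival_offset_us: int = 0) -> FrameStatus:
    """Classify a received frame as null, invalid, CRC error, or correct."""
    if not frame:
        return FrameStatus.NULL                      # nothing was transmitted
    wrong_length = len(frame) != PAYLOAD_LEN + CRC_LEN   # value domain
    out_of_window = abs(arrival_offset_us) > WINDOW_US   # time domain
    if wrong_length or out_of_window:
        return FrameStatus.INVALID
    payload, crc = frame[:PAYLOAD_LEN], frame[PAYLOAD_LEN:]
    if zlib.crc32(payload).to_bytes(CRC_LEN, "big") != crc:
        return FrameStatus.CRC_ERROR                 # right format, wrong content
    return FrameStatus.CORRECT


good = build_frame(b"ABCDEFGH")
```

For example, `classify(flip_bit(good, 3))` reports a checksum error, while the same intact frame arriving 25 µs off schedule is classified as invalid in the time domain.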
CNI-register Error Log Files
[Figure: log file excerpt with the labels "Error diagnosis field", "Invalid frame flagged" and "Correct frame received"]
Example of Logged Data
Results: Fail Silence Violations
• Approximately 12% of all faults were undetected by the FI-node, resulting in a fail silence violation
• More than 90% of these were CRC faults
• The rest were invalid frames
• Approximately 0.1% of all faults were SOS (slightly-off-specification) messages, mainly invalid frames in the time domain
Fault Injection Results in Cluster
• A node stops transmitting messages
  • The FI-node is silent
• Message interference
  • Babbling idiot, needed manual reset of the system
  • Reintegration
• Asymmetric interpretation of messages
  • Asymmetric timing faults: SOS faults in time domain
  • Asymmetric value faults: SOS faults in value domain
• … and the system behavior depends on the protocol implementation and the application
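An asymmetric timing fault can be illustrated with a toy model in which each receiver judges a frame against its own slightly different clock. Everything here is an assumption for illustration: the node names, the clock offsets, and the acceptance window of 10 time units are made up and are not measurements from the experiments.

```python
def accepts(sender_dev: int, receiver_offset: int, window: int) -> bool:
    """A receiver accepts a frame that arrives within +/- window of
    the instant it expects, measured on its own clock."""
    return abs(sender_dev - receiver_offset) <= window


def classify_broadcast(sender_dev: int, receiver_offsets: dict, window: int):
    """Return (summary, per-receiver verdicts) for one broadcast frame."""
    verdicts = {
        node: accepts(sender_dev, offset, window)
        for node, offset in receiver_offsets.items()
    }
    if all(verdicts.values()):
        summary = "accepted by all"
    elif not any(verdicts.values()):
        summary = "rejected by all"
    else:
        summary = "asymmetric"   # receivers disagree: an SOS timing fault
    return summary, verdicts


# Hypothetical clock offsets of three receivers relative to global time.
offsets = {"A": -2, "B": 3, "C": 0}
WINDOW = 10  # hypothetical acceptance window
```

With these values a sender that is 12 units late is rejected by receivers A and C but still accepted by B, so the receivers end up with different views of the same frame. A sender that is on time or grossly late produces a consistent verdict.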
Asymmetric value failure scenario
Asymmetric timing failure scenario
[Figure: time deviation in microticks from own clock, shown for Node 7 and Node 2]
Cluster Size Comparisons
Concerns: Membership vs. Asymmetry
• Faulty node remains undetected in case of SOS faults
• Applications within the minority partition – system safety?
• Protocol membership gives a brittle system
• Reintegration – a possible hazard
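The minority-partition concern can be sketched by grouping nodes according to the membership vector each one currently holds. This is a simplified illustration, not the clique avoidance algorithm of TTP/C; the function names and the example vectors are assumptions.

```python
from collections import defaultdict


def partition_by_view(views: dict) -> list:
    """Group nodes that hold identical membership vectors.
    Returns the groups sorted largest first."""
    groups = defaultdict(list)
    for node, vector in views.items():
        groups[vector].append(node)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


def split_majority(views: dict):
    """Return (majority, minority). If no group holds a strict majority,
    majority is None and every node is listed as minority."""
    largest = partition_by_view(views)[0]
    if 2 * len(largest) <= len(views):
        return None, sorted(views)
    minority = sorted(n for n in views if n not in largest)
    return largest, minority


# Hypothetical views after an asymmetric fault: node 4 alone believes
# that node 3 (bit 3) has failed.
views = {1: 0b1111, 2: 0b1111, 3: 0b1111, 4: 0b0111}
```

Here `split_majority(views)` places node 4 in the minority even though node 4 itself is not the faulty sender, which is the situation the slide questions: a correct node and its application can be forced out and must reintegrate.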
Discussion
[Figure: two layer stacks (Application, Communication Protocol, Physical layer), each with Membership marked, and an arrow labeled "Dependability increase"]
• Active star coupler
• Modified membership agreement protocol
• Algorithms to detect and handle SOS failures
Conclusions TTP/C
• Partitioning due to asymmetric faults should be resolved more smoothly, and perhaps not by forced reintegration
• Stronger fault containment regions are needed
• A larger system/cluster is more resilient against SOS faults
General Conclusions
• Heavy-ion fault injection is efficient in stressing silicon designs to arbitrary failure modes
• High-integrity systems must handle asymmetric and Byzantine faults
• Coverage against arbitrary faults is the only realistic approach for safety-critical systems, but it is difficult to achieve
Questions? Thank you for listening!