The 100 Day Time Bomb
Approximately 100 days after first shipment of a new Mission Critical server, multiple customers experienced a complete system outage. Well, really 99.4, but who’s counting?
I’m Not Dead Yet! • Mission Critical Computing • Five 9’s availability • Comprehensive redundancy • No or few SS-SPOF • No MS-SPOF • Strong will to live
SuhSuhSpof - Whut? • SS-SPOF: Single Server Single Point of Failure • As few components as possible whose failure forces a server restart • Complete CPU failure • Catastrophic IO errors
MS-SPOF: Multi-Server Single Point of Failure • No single failure should result in a complete system outage • Heavy investment to identify and prevent • Primarily hardware focused Dun! Dun! Duuuun!
I got your back! • Elaborate management systems to detect, log, and correct faults to support the strong will to live • Embedded processors on nearly every board • Guidelines • Management faults should not impact servers • Soldier on in the face of errors • Do no harm
To Infinity and Beyond! • Linux Kernel Defect • A poll(-1) would time out after 99.4 days • Fixed about 4 years earlier • But only in newer kernels • The older kernel was reused
It was an itsy bitsy teeny weeny... • [PATCH] fs: sys_poll with timeout -1 bug fix • "If you do a poll() call with timeout -1, the wait will be a big number (depending on HZ) instead of infinite wait, since -1 is passed to the msecs_to_jiffies function." • A 5 line fix resolved it
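The "big number (depending on HZ)" can be sanity-checked with a little arithmetic. The slides do not state the tick rate or the exact tick count, so the values below are assumptions: a wait of 2^31 ticks at 250 ticks per second happens to land on about 99.4 days, which matches the outage timing. The helper name `ticks_to_days` is made up for this sketch.

```python
SECONDS_PER_DAY = 86_400

def ticks_to_days(ticks, hz):
    """Convert a tick count to days for a timer running at hz ticks per second."""
    return ticks / hz / SECONDS_PER_DAY

# Assumed values, not taken from the slides: 2**31 ticks, a few common tick rates.
for hz in (100, 250, 1000):
    print(f"HZ={hz}: {ticks_to_days(2**31, hz):.1f} days")
```

Under these assumptions only the 250 ticks per second case lines up with the 99.4 day figure, which illustrates why the same defect would surface at different times on differently configured kernels.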
This cannot be happening! • A subsystem that used an infinite poll() would abort if the poll returned an error • Due to the defect, the poll would return E_TIMED_OUT
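One way to "soldier on" here is to treat a timeout from a supposedly infinite wait as benign and simply wait again. The following is a minimal sketch of that idea, not the subsystem's actual code: `wait_for_events` and `FlakyPoller` are hypothetical names, and the poller is assumed to follow the convention of Python's `select.poll` objects, where an empty list means the wait timed out.

```python
def wait_for_events(poller, timeout_ms=-1):
    """Block until at least one event is ready.

    An empty result means the wait timed out. With an "infinite" timeout
    that should never happen, so it is treated as spurious and the wait
    is re-armed instead of aborting.
    """
    while True:
        events = poller.poll(timeout_ms)
        if events:
            return events

class FlakyPoller:
    """Hypothetical stand-in for a poller whose infinite wait can time out."""

    def __init__(self, results):
        self._results = list(results)

    def poll(self, timeout_ms):
        # Return the next scripted result; [] simulates an unexpected timeout.
        return self._results.pop(0)

# Two unexpected timeouts followed by a real event on descriptor 3.
print(wait_for_events(FlakyPoller([[], [], [(3, 1)]])))
```

The design choice is that an impossible-looking return value is retried rather than escalated, so a kernel defect of this kind costs one extra loop iteration instead of a management processor restart.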
Bailout! • Management processors will restart, rather than attempt to recover from subsystem failure • Predictable initialization sequence • Lower demand on validation
Did that really hurt? • Poll times out • Management processor restarts • Does not impact server • No harm done?
There’s a little more to it • Another subsystem needs to initialize hardware at AC power on • No AC power on signal provided in hardware • Instead, AC power on was inferred by a vote among the management processors when they restarted • The current power state of the server was not taken into account
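The gap in the voting scheme can be illustrated with a small sketch. The slides do not describe how the vote was counted, so the quorum rule and every name below are assumptions made for illustration: the first function infers AC power on purely from how many management processors restarted, and the second adds the missing check of the servers' current power state.

```python
def ac_power_on_by_vote(restarted, quorum):
    """Infer AC power on when enough management processors restarted together.

    restarted: list of booleans, one per management processor (assumed model).
    """
    return sum(restarted) >= quorum

def ac_power_on_checked(restarted, quorum, server_powered):
    """Same vote, but never report AC power on while any server is running.

    server_powered: list of booleans, one per server (assumed model).
    """
    if any(server_powered):
        return False
    return ac_power_on_by_vote(restarted, quorum)

# Every management processor restarts at once while the servers are still up.
restarted = [True, True, True, True]
servers_up = [True, True]
print(ac_power_on_by_vote(restarted, quorum=3))              # vote alone
print(ac_power_on_checked(restarted, quorum=3, server_powered=servers_up))
```

With the power state included, a simultaneous restart of the management processors is no longer enough on its own to trigger reinitialization of running servers.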
Watch the dominoes fall • System installed at customer site • All boards would hit the bug 99.4 days later • All boards would reset • AC Power on detected (incorrectly) • System state reinitialized, including power • All servers crash: MS-SPOF
Should have seen that coming! • Guidelines not translated to clear requirements • Long term test system restarted at 90+ days • If only, if only, if only