
The 100 Day Time Bomb



Presentation Transcript


  1. The 100 Day Time Bomb

  2. Approximately 100 days after the first shipment of a new Mission Critical server, multiple customers experienced a complete system outage. Well, really 99.4, but who’s counting?

  3. I’m Not Dead Yet! • Mission Critical Computing • Five 9’s availability • Comprehensive redundancy • No or few SS-SPOF • No MS-SPOF • Strong will to live
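To put "Five 9’s availability" in concrete terms, the arithmetic can be sketched as below. This is a minimal illustration, assuming a 365-day year; the function name is made up for the example and is not from the presentation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` nines.

    Five nines means 99.999% available, i.e. an unavailability of 10^-5.
    """
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    # Five nines leaves a budget of roughly five and a quarter minutes per year.
    print(f"{downtime_minutes_per_year(5):.2f} minutes/year")
```

Against a budget that small, a single outage that takes down every server at once uses up many years of allowance.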

  4. SuhSuhSpof - Whut? • SS-SPOF: Single Server Single Point of Failure • As few components as possible whose failure forces a server restart • Complete CPU failure • Catastrophic IO errors

  5. MS-SPOF: Multi-Server Single Point of Failure • No single failure should result in a complete system outage • Heavy investment to identify and prevent • Primarily hardware focused • Dun! Dun! Duuuun!

  6. I got your back! • Elaborate management systems to detect, log, and correct faults to support the strong will to live • Embedded processors on nearly every board • Guidelines • Management faults should not impact servers • Soldier on in the face of errors • Do no harm

  7. What could possibly go wrong?

  8. To Infinity and Beyond! • Linux Kernel Defect • A poll() with timeout -1 would time out after 99.4 days instead of waiting forever • Fixed about 4 years earlier • But only in newer kernels • An older kernel was reused

  9. It was an itsy bitsy teeny weeny... • [PATCH] fs: sys_poll with timeout -1 bug fix • "If you do a poll() call with timeout -1, the wait will be a big number (depending on HZ) instead of infinite wait, since -1 is passed to the msecs_to_jiffies function." • A 5 line fix resolved it
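The "big number (depending on HZ)" can be checked with a little arithmetic. The sketch below is one reading that is consistent with the 99.4 day figure, not something the slides state: it assumes the wait was capped near the largest signed 32-bit value (about 2^31 jiffies) and that the kernel ran with HZ=250. Both values, and the helper name, are assumptions for illustration.

```python
SECONDS_PER_DAY = 86_400

# Assumption: the erroneous wait was capped near the 32-bit signed maximum.
ASSUMED_CAP_JIFFIES = 2**31 - 1


def jiffies_to_days(jiffies: int, hz: int) -> float:
    """Convert a jiffies count to days for a given tick rate (HZ ticks per second)."""
    return jiffies / hz / SECONDS_PER_DAY


if __name__ == "__main__":
    # The HZ values below are common kernel tick rates; which one the
    # affected system used is not given in the presentation.
    for hz in (100, 250, 1000):
        print(f"HZ={hz}: {jiffies_to_days(ASSUMED_CAP_JIFFIES, hz):.1f} days")
```

Under these assumptions, HZ=250 gives about 99.4 days, which matches the interval reported in the talk.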

  10. This cannot be happening! • A subsystem that used an infinite poll() would abort if the poll returned an error • Due to the defect, the poll would return E_TIMED_OUT
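One way to honor "soldier on in the face of errors" here is to treat a timeout from a supposedly infinite wait as spurious and retry instead of aborting. The following is a minimal sketch of that idea, not the subsystem's actual code; `wait_for_events`, `poll_once`, and `log` are hypothetical names introduced for the example.

```python
from typing import Callable, List, Tuple

Event = Tuple[int, int]  # (file descriptor, event mask)


def wait_for_events(
    poll_once: Callable[[], List[Event]],
    log: Callable[[str], None] = print,
) -> List[Event]:
    """Block until poll_once() reports at least one event.

    poll_once is expected to wait indefinitely. If it comes back empty
    anyway (a timeout that should never happen), log it and wait again
    rather than treating it as a fatal error.
    """
    spurious = 0
    while True:
        events = poll_once()
        if events:
            return events
        spurious += 1
        log(f"infinite poll returned no events (occurrence {spurious}); retrying")
```

With Python's standard `select.poll` object (where available), this could be called as `wait_for_events(lambda: poller.poll(None))`, since a timeout of `None` asks for an indefinite wait. Retrying silently forever would hide the defect, which is why the sketch logs each occurrence.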

  11. Bailout! • Management processors restart rather than attempt to recover from a subsystem failure • Predictable initialization sequence • Lower demand on validation

  12. Did that really hurt? • Poll times out • Management processor restarts • Does not impact server • No harm done?

  13. There’s a little more to it • Another subsystem needs to initialize hardware at AC power on • No AC power on signal provided in hardware • The designers chose to identify AC power on by voting when the management processors restarted • The current power state of the server was not taken into account
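The missing check can be sketched as follows. The slides do not give the actual voting rule, so this example assumes the vote was "every management processor restarted together" and shows how also consulting the server power state changes the outcome. `BoardReport` and `looks_like_ac_power_on` are hypothetical names, and the sketch assumes each board can observe whether its server is running.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BoardReport:
    just_restarted: bool   # this board's management processor rebooted recently
    server_powered: bool   # assumption: the board can tell its server is running


def looks_like_ac_power_on(reports: Sequence[BoardReport]) -> bool:
    """Decide whether a simultaneous restart really indicates AC power on.

    Assumed voting rule: all boards restarted together. The extra condition
    is that no server is currently powered, because after a real AC power on
    nothing can already be running.
    """
    if not reports:
        return False
    all_restarted = all(r.just_restarted for r in reports)
    any_server_running = any(r.server_powered for r in reports)
    return all_restarted and not any_server_running
```

In the failure described in the talk, every board restarted while the servers were still up, so a check of this kind would report `False` and leave the power state untouched.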

  14. Watch the dominoes fall • System installed at customer site • All boards would hit the bug 99.4 days later • All boards would reset • AC Power on detected (incorrectly) • System state reinitialized, including power • All servers crash: MS-SPOF

  15. Should have seen that coming! • Guidelines not translated into clear requirements • Long-term test system was restarted at 90+ days • If only, if only, if only

  16. Ooh, that’s gonna leave a mark!
