The 100 Day Time Bomb

Approximately 100 days after the first shipment of a new Mission Critical server, multiple customers experienced a complete system outage. Well, really 99.4 days, but who’s counting?

I’m Not Dead Yet!
• Mission Critical Computing
• Five 9’s availability
• Comprehensive redundancy
• No or few SS-SPOF
• No MS-SPOF
• Strong will to live

SuhSuhSpof - Whut?
• SS-SPOF: Single Server Single Point of Failure
• Keep to a minimum the number of components whose failure forces a server restart
  • Complete CPU failure
  • Catastrophic IO errors

MS-SPOF: Multi-Server Single Point of Failure
• No single failure should result in a complete system outage
• Heavy investment to identify and prevent
• Primarily hardware focused

Dun! Dun! Duuuun!

I got your back!
• Elaborate management systems to detect, log, and correct faults to support the strong will to live
• Embedded processors on nearly every board
• Guidelines
  • Management faults should not impact servers
  • Soldier on in the face of errors
  • Do no harm

What could possibly go wrong?

To Infinity and Beyond!
• Linux kernel defect
• A poll() call with timeout -1 (wait forever) would time out after 99.4 days
• Fixed about 4 years earlier
• But only in newer kernels
• An older kernel was reused

It was an itsy bitsy teeny weeny...

[PATCH] fs: sys_poll with timeout -1 bug fix

If you do a poll() call with timeout -1, the wait will be a big number (depending on HZ) instead of infinite wait, since -1 is passed to the msecs_to_jiffies function.

• A 5 line fix resolved it
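The patch note says the length of the accidental wait depends on HZ. As a rough check of the 99.4-day figure, the sketch below converts a jiffy count into days. The cap of 2^31 - 1 jiffies (the largest signed 32-bit value) and HZ = 250 are assumptions, not stated in the slides; they are simply one combination that reproduces the number.

```python
SECONDS_PER_DAY = 86_400

# Assumption: the "infinite" wait was capped at the largest signed 32-bit
# jiffy count. The slides do not state the actual cap.
ASSUMED_MAX_JIFFIES = 2**31 - 1


def jiffies_to_days(jiffies: int, hz: int) -> float:
    """Convert a tick count to days for a kernel ticking hz times per second."""
    return jiffies / hz / SECONDS_PER_DAY


if __name__ == "__main__":
    for hz in (100, 250, 1000):  # common HZ settings, chosen for illustration
        days = jiffies_to_days(ASSUMED_MAX_JIFFIES, hz)
        print(f"HZ={hz}: {days:.1f} days")
```

Under these assumptions, HZ = 250 gives roughly 99.4 days, while other HZ values give very different durations, which is consistent with the "depending on HZ" remark in the patch note.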

This cannot be happening!
• A subsystem that used an infinite poll() would abort if the poll returned an error
• Due to the defect, the poll would return E_TIMED_OUT
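One way a caller could tolerate this class of defect is to treat a timeout from a supposedly infinite wait as spurious and wait again, instead of aborting. The sketch below is illustrative only and is not the subsystem's actual code; `wait_for_events` and `poll_once` are hypothetical names.

```python
from typing import Callable, List, Tuple

Event = Tuple[int, int]  # (file descriptor, event mask), as select.poll reports


def wait_for_events(
    poll_once: Callable[[], List[Event]],
) -> Tuple[List[Event], int]:
    """Block until poll_once() reports at least one event.

    poll_once is expected to wait "forever". An empty result means the wait
    ended without events, which an infinite wait should never do, so it is
    counted as spurious and the wait is retried rather than treated as fatal.
    Returns the events and the number of spurious timeouts seen.
    """
    spurious_timeouts = 0
    while True:
        events = poll_once()
        if events:
            return events, spurious_timeouts
        spurious_timeouts += 1  # a real system would also log this
```

With Python's `select.poll`, `poll_once` could be something like `lambda: poller.poll(None)`, where `None` requests a blocking wait. The retry keeps the "soldier on" guideline intact even if the underlying wait is not truly infinite.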

Bailout!
• Management processors will restart, rather than attempt to recover from a subsystem failure
• Predictable initialization sequence
• Lower demand on validation

Did that really hurt?
• Poll times out
• Management processor restarts
• Does not impact server
• No harm done?

There’s a little more to it
• Another subsystem needs to initialize hardware at AC power-on
• No AC power-on signal is provided in hardware
• The designers chose to detect AC power-on by voting when the management processors restarted
• The current power state of the server was not taken into account

Watch the dominoes fall
• System installed at customer site
• All boards would hit the bug 99.4 days later
• All boards would reset
• AC power-on detected (incorrectly)
• System state reinitialized, including power
• All servers crash: MS-SPOF
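The chain above hinges on inferring AC power-on from management processor restarts alone. The sketch below contrasts that inference with a variant that also checks whether any server is already running. The majority-vote rule and all names are assumptions for illustration; the slides do not describe the actual voting scheme.

```python
from typing import Sequence


def ac_power_on_by_vote(restarted: Sequence[bool]) -> bool:
    """Infer AC power-on when a majority of management processors restarted.

    Assumption: simple majority. The source does not give the real rule.
    """
    return sum(restarted) * 2 > len(restarted)


def ac_power_on_guarded(
    restarted: Sequence[bool], server_running: Sequence[bool]
) -> bool:
    """Same vote, but never report AC power-on while any server is running."""
    return ac_power_on_by_vote(restarted) and not any(server_running)


if __name__ == "__main__":
    # Every board restarts at about the same time while servers are up.
    restarted = [True, True, True, True]
    server_running = [True, True]
    print(ac_power_on_by_vote(restarted))                  # True: false positive
    print(ac_power_on_guarded(restarted, server_running))  # False
```

When every board hits the same timer-driven bug together, the restart-only vote cannot tell a simultaneous reset from a genuine power-on; the extra power-state check is one possible guard.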

Should have seen that coming!
• Guidelines not translated to clear requirements
• Long-term test system restarted at 90+ days
• If only, if only, if only

Ooh, that’s gonna leave a mark!
