Ten Minutes on Five Nines

Terry Gray, Associate VP, IT Infrastructure, University of Washington <a last minute recruit for…> Common PROBLEMS Group, 6 January 2005

Vision • Systems/Services (and Staff!) characterized as Reliable and Responsive • Reliability = job one • But: I.T. = Inevitable Tensions • We all want: • High MTTF, Performance and Function • Low MTTR and support cost • The art is to balance those conflicting goals • we are jugglers and technology actuaries
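The pull between high MTTF and low MTTR can be made concrete with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function name and the sample hours are assumptions, not figures from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: one failure per 1000 hours, repaired in 1 hour vs. 4 hours.
fast_repair = availability(1000.0, 1.0)
slow_repair = availability(1000.0, 4.0)
print(f"MTTR 1h: {fast_repair:.5f}")
print(f"MTTR 4h: {slow_repair:.5f}")
```

With the same failure rate, a longer repair time alone lowers availability, which is why prevention and fast remediation have to be weighed against each other.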

Success Metrics • Tom’s: Nobody gets hurt; Nobody goes to jail • Terry’s: “Works fine, lasts a long time”; Low ROI (Risk Of Interruption)

Design Tradeoffs • Fault Zone size vs. Economy/Simplicity • Reliability vs. Complexity • Prevention vs. (Fast) Remediation • Security vs. Supportability vs. Functionality • Networks = Connectivity; Security = Isolation • Balancing priorities (security vs. ops vs. function)

Context: A Perfect Storm • Increased dependency on I.T. • Decreased tolerance for outages • Deferred maintenance • Inadequate infrastructure investment • Some extraordinarily fragile applications • Fragmented host management • Increasingly hostile network environment • esp. spam, spyware, social engr attacks • Increasing legal/regulatory liability • Highly de-centralized culture • Growth of portable devices

System Elements • Environmentals (Power, A/C, Physical Security) • Network • Client Workstations (incl. portable devices) • Servers • Applications • Personnel, Procedures, Policy, and Architecture • Failures at one level can trigger problems at another level; need Total System perspective
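One way to see why a Total System perspective matters: if a service needs every layer to be working, its end-to-end availability is at best the product of the layers' availabilities. The sketch below assumes independent failures, which is optimistic given that failures at one level can trigger problems at another. The per-layer figures are hypothetical.

```python
from math import prod


def series_availability(layers: dict[str, float]) -> float:
    """Availability of a chain in which every layer must be up (independence assumed)."""
    for name, value in layers.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability for {name!r} must be in [0, 1]")
    return prod(layers.values())


# Hypothetical per-layer availabilities, not measurements from the talk.
layers = {
    "environmentals": 0.9995,
    "network": 0.9999,
    "client workstations": 0.999,
    "servers": 0.9995,
    "applications": 0.998,
}
total = series_availability(layers)
weakest = min(layers, key=layers.get)
print(f"end-to-end availability: {total:.4f}")
print(f"weakest layer: {weakest}")
```

The end-to-end figure is always lower than that of the weakest layer, so improving a single layer in isolation has limited effect.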

Dimensions • How often is there a user-visible failure? • How many people are affected? • For how long? • How severely?
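These four dimensions can be recorded per outage and rolled up into a simple impact figure. The sketch below is one possible bookkeeping scheme, not a metric proposed in the talk: the `Outage` class, the severity weights, and the sample outages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    users_affected: int      # how many people are affected
    duration_minutes: float  # for how long
    severity: float          # hypothetical weight: 1.0 = total loss, 0.1 = minor degradation

    def impact(self) -> float:
        """Severity-weighted user-minutes lost."""
        return self.users_affected * self.duration_minutes * self.severity


def summarize(outages: list[Outage]) -> dict[str, float]:
    """Roll up frequency (failure count) and total weighted impact."""
    return {
        "failures": len(outages),
        "user_minutes_lost": sum(o.impact() for o in outages),
    }


# Two made-up outages: a long, total outage for a few users
# and a brief, partial one for many users.
outages = [
    Outage(users_affected=50, duration_minutes=120, severity=1.0),
    Outage(users_affected=20000, duration_minutes=5, severity=0.2),
]
summary = summarize(outages)
print(summary)
```

A single uptime percentage would hide the difference between these two events, whereas weighted user-minutes at least keeps breadth, duration, and severity in view.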

Basics • How many nines? • Problem one: what to measure? • How do you reduce behavior of a complex net to a single number? • Difficult for either uptime or utilization metrics • Problem two: data networks are not like phone or power services… • Imagine if phones could assume anyone’s number • Or place a million calls per second!
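For reference, "N nines" translates into a yearly downtime budget by simple arithmetic. The sketch below assumes a 365-day year and treats availability as a single uptime fraction, which is exactly the simplification that Problem one questions. The function name is an assumption for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (e.g. 5 -> 99.999%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):10.2f} minutes/year")
```

Five nines works out to roughly five minutes of downtime per year, but the figure says nothing about what is being measured or how many users are affected.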

Security vs. Reliability • Obviously lack of security is bad… but: • Defense in depth is not free • Each add’l defensive perimeter increases MTTR • Defense-in-depth conjecture (for N layers) • Security: MTTE (exploit) ∝ N**2 • Functionality: MTTI (innovation) ∝ N**2 • Supportability: MTTR (repair) ∝ N**2 • Next-gen threats: firewalls won’t help
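To get a feel for the supportability side of the conjecture, the sketch below reads the N**2 term as "repair time grows with the square of the number of defensive layers" and feeds it into the standard availability formula. This is an interpretation, not the speaker's model: it holds MTTF fixed, so it ignores any security benefit from the extra layers, and the baseline hours are hypothetical.

```python
def availability_with_layers(mttf_hours: float, base_mttr_hours: float, layers: int) -> float:
    """Availability when MTTR is assumed to scale as layers**2 (conjecture, not measured)."""
    if layers < 1:
        raise ValueError("need at least one defensive layer")
    mttr_hours = base_mttr_hours * layers ** 2  # conjectured quadratic growth in repair time
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: a failure every 1000 hours, 1 hour to repair with one layer.
for n in range(1, 5):
    print(f"{n} layer(s): availability {availability_with_layers(1000.0, 1.0, n):.5f}")
```

Under these assumptions each added perimeter costs progressively more availability, which is one way to read "defense in depth is not free".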

Complexity vs. Reliability • How do you measure availability in complex systems? • Death of the Network Utility Model • Organizational vs. geographic networking • SAN virtualization • Web load-leveler appliances • Organizational boundary conditions • Networks: from stochastic to non-deterministic • Subnets with clients and critical servers • Documentation deficiencies

Complex System Failures: Inevitable? • Jan 2004 (?) IEEE Spectrum on Power Grid failures • Point: it will happen, so plan for mitigation

Work in Progress • New trouble-ticket system • New network management system • Next-generation network architecture • Next-generation security architecture • Improving change control process • Improving DRBR process • Lots of work on improving mon/diag tools

In Short… • Expectations are growing (unrealistically?) • Complexity is growing • Few are prepared to pay for true HA • Cultural barriers to change control • Hospitals are a whole other world • Biggest SPoF: power/HVAC • Organizational complexity undermines HA • Both security and lack of it undermine HA • Redundancy can mask failures too well! • With redundancy, must have better tools • Need Ops-centric design, better DRBR • Need application procurement standards
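The two redundancy points above can be sketched together. With independent replicas, availability improves as 1 - (1 - a)**n, but the service stays up while individual replicas fail, so monitoring has to report the degraded state that users never see. The function names and sample values are hypothetical, and the independence assumption breaks down when replicas share a single point of failure such as power or HVAC.

```python
def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of n independent replicas where any one can carry the service."""
    if not 0.0 <= replica_availability <= 1.0 or replicas < 1:
        raise ValueError("availability must be in [0, 1] and replicas >= 1")
    return 1.0 - (1.0 - replica_availability) ** replicas


def service_state(replica_up: list[bool]) -> str:
    """Classify the service so that masked failures are still surfaced to operations."""
    up = sum(replica_up)
    if up == 0:
        return "down"
    if up < len(replica_up):
        return "degraded"  # invisible to users, but redundancy is already spent
    return "healthy"


print(f"two replicas at 0.99 each: {parallel_availability(0.99, 2):.4f}")
print(service_state([True, False]))
```

Alerting on "degraded" rather than only on "down" is the kind of tooling that redundancy makes necessary.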

Questions? Comments?
