Ten Minutes on Five Nines

Ten Minutes on Five Nines. Terry Gray Associate VP, IT Infrastructure University of Washington <a last minute recruit for…> Common PROBLEMS Group 6 January 2005. Vision. Systems/Services (and Staff!) characterized as Reliable and Responsive Reliability = job one


Presentation Transcript


  1. Ten Minutes on Five Nines • Terry Gray, Associate VP, IT Infrastructure, University of Washington • <a last minute recruit for…> Common PROBLEMS Group • 6 January 2005

  2. Vision • Systems/Services (and Staff!) characterized as Reliable and Responsive • Reliability = job one • But: I.T. = Inevitable Tensions • We all want: • High MTTF, Performance and Function • Low MTTR and support cost • The art is to balance those conflicting goals • we are jugglers and technology actuaries
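The tension between high MTTF and low MTTR on slide 2 can be made concrete with the standard steady-state availability ratio, MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function name and the hour figures are assumptions, not numbers from the talk.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a service is up, given mean time to failure and to repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: fails about every 1000 hours, takes 4 hours to repair.
baseline = steady_state_availability(1000.0, 4.0)

# Two ways to improve it: double the MTTF, or halve the MTTR.
longer_mttf = steady_state_availability(2000.0, 4.0)
faster_repair = steady_state_availability(1000.0, 2.0)

print(f"baseline:      {baseline:.5f}")
print(f"double MTTF:   {longer_mttf:.5f}")
print(f"halve MTTR:    {faster_repair:.5f}")
```

With these assumed numbers, doubling MTTF and halving MTTR give the same availability, which is one way to frame the prevention versus fast remediation balance as a cost question.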

  3. Success Metrics • Tom’s: Nobody gets hurt; nobody goes to jail • Terry’s: “Works fine, lasts a long time” • Low ROI (Risk Of Interruption)

  4. Design Tradeoffs • Fault Zone size vs. Economy/Simplicity • Reliability vs. Complexity • Prevention vs. (Fast) Remediation • Security vs. Supportability vs. Functionality • Networks = Connectivity; Security = Isolation • Balancing priorities (security vs. ops vs. function)

  5. Context: A Perfect Storm • Increased dependency on I.T. • Decreased tolerance for outages • Deferred maintenance • Inadequate infrastructure investment • Some extraordinarily fragile applications • Fragmented host management • Increasingly hostile network environment • esp. spam, spyware, social engr attacks • Increasing legal/regulatory liability • Highly de-centralized culture • Growth of portable devices

  6. System Elements • Environmentals (Power, A/C, Physical Security) • Network • Client Workstations (incl. portable devices) • Servers • Applications • Personnel, Procedures, Policy, and Architecture • Failures at one level can trigger problems at another level; need Total System perspective
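One way to see why slide 6 asks for a Total System perspective: if a user needs every layer to be up at once, and the layers fail independently (a simplifying assumption that the slide itself questions, since failures can cascade), end-to-end availability is the product of the per-layer values. The layer figures below are hypothetical.

```python
import math


def series_availability(layer_availabilities) -> float:
    """End-to-end availability when every layer must be up (independent failures assumed)."""
    return math.prod(layer_availabilities)


# Hypothetical per-layer availabilities; none of these come from the talk.
layers = {
    "environmentals": 0.9999,
    "network": 0.9995,
    "client workstation": 0.999,
    "servers": 0.9995,
    "application": 0.998,
}

end_to_end = series_availability(layers.values())
print(f"weakest layer: {min(layers.values()):.4f}")
print(f"end to end:    {end_to_end:.4f}")
```

The end-to-end figure is always lower than the weakest single layer, so a per-layer uptime number overstates what users experience.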

  7. Dimensions • How often is there a user-visible failure? • How many people are affected? • For how long? • How severely?
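The four dimensions on slide 7 can be combined into a rough impact figure. The sketch below is one possible scheme, not the speaker's: it reports failure frequency separately and folds people, duration, and severity into "weighted user-minutes". The `Outage` type, the 0-to-1 severity scale, and the sample incidents are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    users_affected: int      # how many people
    duration_minutes: float  # for how long
    severity: float          # how severely: 0.0 (cosmetic) to 1.0 (total loss), assumed scale


def failures_per_month(outages: list, months: float) -> float:
    """How often there is a user-visible failure."""
    return len(outages) / months


def weighted_user_minutes(outages: list) -> float:
    """People x duration x severity, summed over all outages."""
    return sum(o.users_affected * o.duration_minutes * o.severity for o in outages)


# Hypothetical incident log covering three months.
log = [
    Outage(users_affected=1000, duration_minutes=30, severity=1.0),
    Outage(users_affected=50, duration_minutes=240, severity=0.5),
]

print(failures_per_month(log, months=3))
print(weighted_user_minutes(log))
```

Keeping frequency apart from the weighted total avoids hiding many small failures behind one large number, or the reverse.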

  8. Basics • How many nines? • Problem one: what to measure? • How do you reduce behavior of a complex net to a single number? • Difficult for either uptime or utilization metrics • Problem two: data networks are not like phone or power services… • Imagine if phones could assume anyone’s number • Or place a million calls per second!
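For readers who want the arithmetic behind "how many nines", the sketch below converts a count of nines into allowed downtime per year. It assumes a 365-day year and treats availability as a single uptime fraction, which is exactly the simplification the slide warns about. The function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Three nines -> 0.999, five nines -> 0.99999, and so on."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year implied by a given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


for n in range(2, 6):
    print(f"{n} nines: {availability_from_nines(n):.5f} "
          f"allows {downtime_minutes_per_year(n):.2f} min/year")
```

Five nines works out to roughly 5.26 minutes of downtime per year, which explains the talk's title.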

  9. Security vs. Reliability • Obviously lack of security is bad… but: • Defense in depth is not free • Each add’l defensive perimeter increases MTTR • Defense-in-depth conjecture (for N layers) • Security: MTTE (exploit) ∝ N**2 • Functionality: MTTI (innovation) ∝ N**2 • Supportability: MTTR (repair) ∝ N**2 • Next-gen threats: firewalls won’t help
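Reading the defense-in-depth conjecture as each of the three quantities scaling with N**2 (an interpretation of the slide, and a conjecture in the original, not a measured law), the cost of adding layers can be tabulated relative to a single layer. The function name is hypothetical.

```python
def relative_scaling(layers: int) -> int:
    """Multiplier relative to one layer, under the slide's N**2 conjecture."""
    if layers < 1:
        raise ValueError("need at least one defensive layer")
    return layers ** 2


print("layers  MTTE(exploit)  MTTI(innovation)  MTTR(repair)")
for n in range(1, 5):
    factor = relative_scaling(n)
    # All three grow by the same factor: the security gain (MTTE) is paid for
    # in slower innovation (MTTI) and slower repair (MTTR).
    print(f"{n:>6}  {factor:>13}x  {factor:>16}x  {factor:>12}x")
```

Under this reading, going from two layers to four quadruples time-to-exploit but also quadruples time-to-repair, which is the slide's point that defense in depth is not free.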

  10. Complexity vs. Reliability • How do you measure avail in complex systems? • Death of the Network Utility Model • Organizational vs. geographic networking • SAN virtualization • Web load-leveler appliances • Organizational boundary conditions • Networks: from stochastic to non-deterministic • Subnets with clients and critical servers • Documentation deficiencies

  11. Complex System Failures: Inevitable? • Jan 2004 (?) IEEE Spectrum on Power Grid failures • Point: it will happen, so plan for mitigation

  12. Work in Progress • New trouble-ticket system • New network management system • Next-generation network architecture • Next-generation security architecture • Improving change control process • Improving DRBR process • Lots of work on improving mon/diag tools

  13. In Short… • Expectations are growing (unrealistically?) • Complexity is growing • Few are prepared to pay for true HA • Cultural barriers to change control • Hospitals are a whole other world • Biggest SPoF: power/HVAC • Organizational complexity undermines HA • Both security and lack of it undermine HA • Redundancy can mask failures too well! • With redundancy, must have better tools • Need Ops-centric design, better DRBR • Need application procurement standards
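The redundancy points on slide 13 can be illustrated with the usual parallel-availability formula, 1 - (1 - a)**n, which holds only if the redundant copies fail independently. The second function models a shared single point of failure such as power/HVAC sitting in front of the redundant pair. All numbers and function names here are assumptions for illustration.

```python
def parallel_availability(unit_availability: float, copies: int) -> float:
    """Availability of n redundant copies, assuming independent failures."""
    return 1.0 - (1.0 - unit_availability) ** copies


def with_shared_dependency(shared: float, unit: float, copies: int) -> float:
    """Redundant copies that all depend on one shared element (e.g. power/HVAC)."""
    return shared * parallel_availability(unit, copies)


single = 0.99                                       # hypothetical single server
pair = parallel_availability(single, 2)             # redundant pair on its own
behind_power = with_shared_dependency(0.999, single, 2)  # pair on one power feed

print(f"single server:         {single:.5f}")
print(f"redundant pair:        {pair:.5f}")
print(f"pair, one power feed:  {behind_power:.5f}")
```

The pair can never exceed the availability of the shared element it depends on. The sketch also hints at the slide's warning about masked failures: if one copy dies unnoticed, the service silently drops back to single-server availability, hence the need for better tools.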

  14. Questions? Comments?
