Dependability in the Internet Era


Outline • The glorious past (Availability Progress) • The dark ages (current scene) • Some recommendations

Preview: The Last 5 Years: Availability Dark Ages. Ready for a Renaissance? • Things got better, then things got a lot worse! [Figure: availability (9% to 99.999%) by year, 1950 to 2000, for telephone systems, computer systems, cell phones, and the Internet.]

DEPENDABILITY: The 3 ITIES • RELIABILITY / INTEGRITY: Does the right thing. (Also MTTF >> 1.) • AVAILABILITY: Does it now. (Also 1 >> MTTR.) Availability = MTTF / (MTTF + MTTR) • System Availability: if 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time (0.90 x 0.99 = 0.891). • Holistic vs. Reductionist view [Figure: Security, Integrity, Reliability, Availability.]
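The availability arithmetic on this slide can be sketched in a few lines. This is a minimal illustration, assuming components fail independently and that a transaction needs every component up; the function names are made up for the example.

```python
from math import prod


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(parts) -> float:
    """Availability of a system that needs every part up.

    Assumes the parts fail independently, so the fractions multiply.
    """
    return prod(parts)


# The slide's example: 90% of terminals up and 99% of the database up.
terminals_and_db = series_availability([0.90, 0.99])
print(f"{terminals_and_db:.1%}")  # 89.1%
```

The product form is why system-level availability is always worse than the weakest component.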

Fail-Fast is Good, Repair is Needed • Lifecycle of a module: fail-fast gives short fault latency. • High Availability is low UN-Availability: Unavailability ~ MTTR / MTTF • Improving either MTTR or MTTF gives benefit. • Simple redundancy does not help much.
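A small sketch of why either lever helps: under the approximation Unavailability ~ MTTR / MTTF (valid when MTTF is much larger than MTTR), halving repair time and doubling time-to-failure have the same effect. The numbers below are illustrative assumptions, not measurements.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Exact fraction of time down: MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


def approx_unavailability(mttf: float, mttr: float) -> float:
    """The slide's approximation, good when MTTF >> MTTR."""
    return mttr / mttf


# Hypothetical module: fails every 1,000 hours, takes 1 hour to repair.
base = approx_unavailability(1000.0, 1.0)
halved_mttr = approx_unavailability(1000.0, 0.5)
doubled_mttf = approx_unavailability(2000.0, 1.0)

print(base, halved_mttr, doubled_mttf)
```

Both changes cut unavailability in half, which is why fast repair (or fast failover) matters as much as rare failure.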

Fault Model • Failures are independent, so single fault tolerance is a big win. • Hardware fails fast (dead disk, blue-screen). • Software fails fast (or goes to sleep). • Software is often repaired by reboot: • Heisenbugs • Operations tasks are a major source of outage: • Utility operations • Software upgrades

Disks (RAID): the BIG Success Story • Duplex or Parity: masks faults • Disks @ 1M hours MTTF (~100 years) • But controllers fail, and sites have 1,000s of disks. • Duplexing or parity, and dual path, gives “perfect disks”. • Wal-Mart never lost a byte (thousands of disks, hundreds of failures). • Only software/operations mistakes are left.
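The gap between "1M hours per disk" and "hundreds of failures" is fleet size, and the sketch below makes that concrete. It assumes independent failures; the 24-hour repair time is a made-up figure, and the mirrored-pair formula is the common textbook approximation MTTF^2 / (2 x MTTR), not something stated on the slide.

```python
HOURS_PER_YEAR = 8760


def fleet_failures_per_year(disk_mttf_hours: float, disk_count: int) -> float:
    """Expected disk failures per year across a fleet of independent disks."""
    return disk_count * HOURS_PER_YEAR / disk_mttf_hours


def mirrored_pair_mttf_hours(disk_mttf_hours: float, repair_hours: float) -> float:
    """Textbook approximation for a duplexed pair: MTTF^2 / (2 * MTTR).

    Data is lost only if the second disk fails while the first is being repaired.
    """
    return disk_mttf_hours ** 2 / (2 * repair_hours)


# 1,000 disks at 1M hours each: several failures every year.
print(fleet_failures_per_year(1_000_000, 1000))

# Assumed 24-hour repair window (hypothetical): the pair outlives a single disk by orders of magnitude.
print(mirrored_pair_mttf_hours(1_000_000, 24))
```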

Fault Tolerance vs Disaster Tolerance • Fault-Tolerance: masks local faults • RAID disks • Uninterruptible Power Supplies • Cluster Failover • Disaster Tolerance: masks site failures • Protects against fire, flood, sabotage, ... • Redundant system and service at remote site.

Case Study - Japan: "Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe). • 1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES. • MTTF and share of outages by source: • Vendor (hardware and software): 5 months, 42% • Application software: 9 months, 25% • Communications lines: 1.5 years, 12% • Operations: 2 years, 9.3% • Environment: 2 years, 11.2% • Overall: 10 weeks • To Get 10 Year MTTF, Must Attack All These Areas.
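The per-source MTTFs and the overall 10-week figure are related by adding failure rates. The sketch below is an illustration under the assumption that the sources fail independently; the month-to-week conversion factor is a convenience, not part of the survey.

```python
def combined_mttf(mttfs) -> float:
    """MTTF of a system that fails when any independent source fails.

    Failure rates (1 / MTTF) add, so the combined MTTF is the reciprocal of the sum.
    """
    return 1.0 / sum(1.0 / m for m in mttfs)


# Per-source MTTF from the survey, expressed in months.
mttf_months = {
    "vendor (hardware and software)": 5,
    "application software": 9,
    "communications lines": 18,
    "operations": 24,
    "environment": 24,
}

overall_months = combined_mttf(mttf_months.values())
overall_weeks = overall_months * 52 / 12  # approximate conversion

# Share of the total failure rate contributed by each source.
shares = {name: overall_months / m for name, m in mttf_months.items()}

print(f"overall MTTF is roughly {overall_weeks:.0f} weeks")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The result lands close to the reported 10 weeks, and it shows why fixing one source alone cannot reach a 10-year MTTF: the remaining rates still dominate the sum.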

Case Studies - Tandem Trends • MTTF improved. • Shift from Hardware & Maintenance (from 50% to 10% of outages) to Software (62%) & Operations (15%). • NOTE: Systematic under-reporting of Environment, Operations errors, and Application Software.

Dependability Status circa 1995 • ~4-year MTTF => 5 9s for well-managed sys. Fault Tolerance Works. • Hardware is GREAT (maintenance and MTTF). • Software masks most hardware faults. • Many hidden software outages in operations: • New Software. • Utilities. • Make all hardware/software changes ONLINE. • Software seems to define a 30-year MTTF ceiling. • Reasonable Goal: 100-year MTTF. class 4 today=>class 6 tomorrow.
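To make the "9s" and "class" language concrete, the sketch below converts an availability class (number of nines) into a yearly downtime budget and into the repair time a given MTTF can afford. It is illustrative arithmetic only; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` nines."""
    return MINUTES_PER_YEAR * 10.0 ** -nines


def max_mttr_minutes(mttf_years: float, nines: int) -> float:
    """Longest average repair time that still meets the class, given the MTTF.

    Solves MTTR / (MTTF + MTTR) = 10^-nines for MTTR.
    """
    u = 10.0 ** -nines
    return mttf_years * MINUTES_PER_YEAR * u / (1.0 - u)


for cls in (4, 5, 6):
    print(cls, round(downtime_minutes_per_year(cls), 2), "minutes/year")

# A 4-year MTTF reaches five 9s only if outages average about 21 minutes.
print(round(max_mttr_minutes(4, 5), 1))
```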

What’s Happened Since Then? • Hardware got better • Software got better (even though it is more complex) • RAID is standard, snapshots are becoming standard • Cluster in a box: commodity failover • Remote replication is standard.

Availability (in 9s) by level of management: • Un-managed • Well-managed nodes: masks some hardware failures • Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures • Well-managed GeoPlex: masks site failures (power, network, fire, move, …); masks some operations failures

Progress? • MTTF improved from 1950-1995 • MTTR has not improved much since 1970 (failover) • Hardware and software online change (PnP) is now standard • Then the Internet arrived: • No project can take more than 3 months. • Time to market is everything. • Change is good.

The Internet Changed Expectations • 1990: Phones delivered 99.999%. ATMs delivered 99.99%. Failures were front-page news. Few hackers. Outages last an “hour”. • 2000: Cellphones deliver 90%. Web sites deliver 98%. Failures are business-page news. Many hackers. Outages last a “day”. • This is progress?

Why (1) Complexity • Internet sites are MUCH more complex. • NAP • Firewall/proxy/ipsprayer • Web • DMZ • App server • DB server • Links to other sites • tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os… • Skill level is much reduced

One of the Data Centers (500 servers)

A Schematic of HotMail • ~7,000 servers • 100 backend stores with 120TB (cooked) • 3 data centers • Links to • Passport • Ad-rotator • Internet Mail gateways • … • ~ 1B messages per day • 150M mailboxes, 100M active • ~400,000 new per day.

Why (2) Velocity • No project can take more than 13 weeks. • Time to market is everything • Functionality is everything • Faster, cheaper, badder [Figure: trend of functionality, schedule, and quality.]

Why (3) Hackers • Hackers are a new, increased threat. • Any site can be attacked from anywhere. • Motives include ego, malice, and greed. • Complexity makes it hard to protect sites. • Concentration of wealth makes an attractive target: • Why did you rob banks? • Willie Sutton: ’Cause that’s where the money is! Note: Eric Raymond’s How to Become a Hacker (http://www.tuxedo.org/~esr/faqs/hacker-howto.html) is the positive use of the term; here I mean malicious and anti-social hackers.

How Bad Is It? http://www-iepm.slac.stanford.edu/ Connectivity is poor.

How Bad Is It? http://www-iepm.slac.stanford.edu/pinger/ • Median monthly % ping packet loss for 2/99

Microsoft.Com • Operations mis-configured a router • Took a day to diagnose and repair. • DOS attacks cost a fraction of a day. • Regular security patches.

BackEnd Servers are More Stable • Generally deliver 99.99% • TerraServer, for example: single back-end failed after 2.5 years. • Year 1 through 18 months: down 30 hours in July (hardware stop, auto restart failed, operations failure); down 26 hours in September (backplane failure, I/O bus failure). • Went to 4-node cluster • Fails every 2 months. Transparent failover in 30 sec. • Online software upgrades • So… 99.999% in backend…
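The "99.999% in backend" conclusion follows from the failover numbers on this slide, as the sketch below shows. It assumes a month of 30.5 days, and it assumes the two long outages both fall within one 18-month window, which is one reading of the slide rather than a stated fact.

```python
def availability_from_outages(period_hours: float, outage_hours: float) -> float:
    """Fraction of a period spent up, given total outage time in that period."""
    return 1.0 - outage_hours / period_hours


HOURS_PER_MONTH = 30.5 * 24  # assumed average month length

# Single back-end: 30 + 26 hours down, assumed to fall within 18 months.
single_node = availability_from_outages(18 * HOURS_PER_MONTH, 30 + 26)

# 4-node cluster: one failure every 2 months, masked by a 30-second failover.
cluster = availability_from_outages(2 * HOURS_PER_MONTH, 30 / 3600)

print(f"single back-end: {single_node:.3%}")
print(f"cluster:         {cluster:.5%}")
```

The failure rate of the cluster is higher than that of the single node, but the short repair time is what moves it from two 9s to five.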

eBay: A very honest site http://www2.ebay.com/aw/announce.shtml#top • Publishes operations log. • Has 99% of scheduled uptime • Schedules about 2 hours/week down. • Has had some operations outages • Has had some DOS problems.
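Scheduled downtime still counts from the user's point of view. The sketch below combines the two figures on this slide into a wall-clock availability, assuming the 99% applies only to the hours outside the weekly maintenance window; that interpretation is an assumption.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def wall_clock_availability(scheduled_down_hours_per_week: float,
                            availability_when_scheduled_up: float) -> float:
    """Availability as seen by users, counting planned downtime as down."""
    scheduled_up_fraction = (HOURS_PER_WEEK - scheduled_down_hours_per_week) / HOURS_PER_WEEK
    return scheduled_up_fraction * availability_when_scheduled_up


# About 2 hours/week of planned downtime, 99% of scheduled uptime delivered.
ebay_like = wall_clock_availability(2, 0.99)
print(f"{ebay_like:.1%}")
```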

Not to throw stones but… • Everyone has a serious problem. • The BEST people publish their stats. • The others HIDE their stats (check Netcraft to see who I mean). • We have good NODE-level availability: 5-9s is reasonable. • We have TERRIBLE system-level availability: 2-9s is the goal.

Recommendation #1 • Continue progress on back-ends. • Make management easier (AUTOMATE IT!!!) • Measure • Compare best practices • Continue to look for better algorithms. • Live in fear • We are at 10,000 node servers • We are headed for 1,000,000 node servers

Recommendation #2 • Current security approach is unworkable: • Anonymous clients • Firewall is clueless • Incredible complexity • We can’t win this game! • So change the rules (redefine the problem): • No anonymity • Unified authentication/authorization model • Single-function devices (with simple interfaces) • Only one kind of interface (uddi/wsdl/soap/…).

References
Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N. and A. Reuter (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
Long, D. D., J. L. Carroll, and C. J. Park (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991. 177-186.
Long, D., A. Muir, and R. Golding (1995). “A Longitudinal Study of Internet Host Reliability.” Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, September 1995. 2-9.
http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
http://www2.ebay.com/aw/announce.shtml#top eBay is an excellent benchmark of best Internet practices.
http://www-iepm.slac.stanford.edu/pinger/ Network traffic/quality report; dated, but the others have died off!
