Cloud Computing CS 498: Introduction to Computer Dependability. Dr. Catello Di Martino, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, dimart@illinois.edu
Outline • Motivation for reliable system design • Taxonomy of dependable computing • Fault classes (hardware and software) • Failure sources
Recommended Texts • [Prad96] D.K. Pradhan, ed., Fault Tolerant Computer System Design, Prentice-Hall, 1996 • [John89] B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison Wesley, 1989 • [SiSw92] D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems - Design and Evaluation, Digital Press (distributed by Butterworth), 1992, 2nd edition. • [Lyu95a] M.R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, 1995 • [Lyu95b] M.R. Lyu, ed., Software Fault Tolerance, J. Wiley & Sons, 1995 • [Birm96] K.P. Birman, Building Secure and Reliable Network Applications, Manning, 1996 • [SiSh94] M. Singhal and N.G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994
Science: An Evolving Definition • “The intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment” [Webster dictionary], i.e., through data!
Who is going to analyze this sea of data? Where…how…where is my CLOUD-bat-gizmo? SOURCE: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
2007: Wired announces a new quest • “Even having the data, the real challenge is to be able to use it!” [cit.] • Discovering what we don’t know from the data • Obtaining predictive, actionable insight from data
Future Growth: Data and Computing for Societal Impact
[Figure: diagram with the following recoverable labels]
• Data sources: large volume of data from phones, sensors, and smart cars; individuals and enterprises; HMI; human expertise
• Cloud computing infrastructure: robust computing at low cost, “pay-as-you-go”; analysis; integration
• Intelligent ecosystems: trustworthy, cost-effective, environment-friendly
• Benefits to individuals and society: assuring the security and safety of nations; global vigilance and reach; modern health care; adaptive power grid; efficient transportation (air, ground, sea); innovations; education; research; preservation of water; new-age agriculture
Cloud Computing - The Hype Cycle
• Present: cloud computing is at the peak.
• [Figure: hype-cycle curve annotated with: emerging technologies of cloud computing; challenges in bringing cloud computing to the mainstream; new technologies and tools for early maturation of the cloud; cloud computing goes mainstream]
Source: Gartner, Hype Cycle for Cloud Computing, D. M. Smith Publication, July 2011
Each step of technological advancement requires new solutions at scale.
[Figure: system hardware overview, with the following recoverable labels]
• Cabinets: 276 cabinets; 24 blades per cabinet; 3072 cores per cabinet; Freon/water chiller pipes
• XE6 blade: 4 compute nodes (8 CPUs) per blade; 128 cores per blade; total 5660 blades
• XK7 blade: 4 compute nodes (4 CPUs + 4 NVIDIA GPUs) per blade; total 4192 XK7 blades
• CPUs: 22640 compute nodes; 2 CPUs per node; 16 cores per CPU; total 45280 AMD Opteron 6272
• GPUs: 4192 NVIDIA K20X Kepler GPUs; 2168 CUDA cores per GPU; 6 GB on-board GDDR5 RAM; total 9,088,256 CUDA cores
• Memory: 64 GB RAM per compute node; 32 GB RAM per GPU node; total 1.6 PB (199,990 x 8 GB DDR3)
• Storage: 196 Sonexion 1600 / Lustre file system; total 26 PB file system; 86 disks per drawer; 20,192 2 TB SAS disks; 396 1 TB SSDs
Example of Average Cost per Hour of Downtime • Achieving highly reliable (and secure) systems and applications requires significant, specialized expertise in designing and evaluating such systems, e.g., • How do we detect errors? • Does one need fast recovery, recovery to the exact state prior to the failure, or both? • If processing does not resume exactly where it left off, is that acceptable, damaging, or catastrophic?
System Outages: Major Examples • North-Eastern blackout (Aug 14, 2003) • Dublin Airport shut down after radar system failure (July 10, 2008) • Amazon.com suffers eight-hour outage (July 20, 2008) • Resilience to attacks and faults is crucial to dependable computing
Cloud Computing: Growing Number of Outages
• Providing a higher level of reliability and availability is one of the biggest challenges of cloud computing.
• [Figure: Google Insights for Search trend for "Cloud Computing", 2007-2011, annotated with the outages below]
• Jul 08: Amazon S3 down 8.5 h due to a single bit flip in a gossip message
• Oct 09: MS Azure down 22 h due to a malfunction in the hypervisor
• Feb 11: 40K Gmail accounts down 4 days due to a bug in a storage software update
• Apr 11: Amazon EC2 US East down 4 days due to a network problem and the replica algorithm
Source: R. Iyer, Z. Kalbarczyk, and N. Nakka, “Dependable Computing: Design and Assessment,” Draft of forthcoming text, 2012
Cloud Computing: Security Problems
• Jul’08 - Spammers set up mail-spamming instances in Amazon’s EC2 cloud.
• Apr’09 - Texas datacenter operations are suspended for an FBI investigation.
• Nov’09 - Side-channel attack on Amazon’s EC2 service.
• Dec’09 - Zeus crimeware uses Amazon’s EC2 as a command-and-control server.
• Sep’10 - Google engineer stalked teens, spied on chats.
• Dec’10 - Microsoft BPOS cloud service hit with a data breach.
• Dec’10 - Anonymous hacker group failed to take down Amazon.
• June’11 - Dropbox: authentication bug left cloud storage accounts wide open.
Cloud Reaching Our Houses: Smart Power Grids
[Figure: microgrid controlled through the cloud, with the following recoverable labels: microgrid power lines; local power generators; local load, utilities; microgrid SCADA; control network; state estimation and control of the grid; microgrid cloud; VMs; cloud hypervisor; cloud IaaS]
Dependable Computing
• The original definition of dependability (which stresses the need for justification of trust): dependability is the ability to deliver service that can justifiably be trusted.
• The alternate definition (which provides the criterion for deciding whether the service is dependable): the dependability of a system is the ability to avoid service failures that are more frequent and more severe than is acceptable.
DEPENDABILITY (and SECURITY)
• ATTRIBUTES
 - AVAILABILITY: readiness for correct service
 - RELIABILITY: continuity of correct service
 - SAFETY: absence of catastrophic consequences
 - CONFIDENTIALITY: absence of unauthorized disclosure of data
 - INTEGRITY: absence of improper system alteration
 - MAINTAINABILITY: ability to undergo modifications and repairs
• MEANS: FAULT PREVENTION, FAULT TOLERANCE, FAULT REMOVAL, FAULT FORECASTING
• THREATS: FAULTS, ERRORS, FAILURES
Examples of Computer-related Failures
[Table: each event is classified by FAULTS (design, physical, interaction) and by FAILURES (localized or distributed; availability/reliability, safety, confidentiality). The per-event classification marks were lost in extraction.]
• June 1980: False alerts at the North American Air Defense (NORAD) [Ford 85]
• April 1981: First launch of the Space Shuttle postponed [Garman 81]
• June 1985 - January 1987: Excessive radiotherapy doses (Therac-25) [Leveson & Turner 93]
• November 1988: Internet worm [Spafford 89]
• 15 January 1990: 9-hour outage of the long-distance phone network in the USA [Neumann 95]
• February 1991: Scud missed by a Patriot (Dhahran, Gulf War) [Neumann 95]
• November 1992: Crash of the communication system of the London ambulance service [HA 93]
• 26 and 27 June 1993: Authorization denial of credit card operations in France
• 4 June 1996: The maiden flight of the Ariane 5 launcher ended in a failure (France)
• July 20, 2008: 8-hour outage of Amazon services
Fault Classes
Based on temporal persistence
• Permanent faults, whose presence is continuous and stable.
• Intermittent faults, whose presence is only occasional due to unstable hardware or varying hardware and software states (e.g., as a function of load or activity).
• Transient faults, resulting from temporary environmental conditions.
Based on origin
• Physical faults, stemming from physical phenomena internal to the system, such as threshold changes, shorts, opens, etc., or from external changes, such as environmental, electromagnetic, vibration, etc.
• Human-made faults, which may be either design faults, introduced during system design, modification, or establishment of operating procedures, or interaction faults, which are violations of operating or maintenance procedures.
Fundamental Chain of Dependability
… → fault → (activation) → error → (propagation) → failure → fault → …
• Example 1
 - A short in an integrated circuit is a failure (with respect to the function of the circuit)
 - The consequence (e.g., a line stuck at a Boolean value) is a fault that stays dormant until activated
 - Upon activation (invoking the faulty component by applying an input) the fault becomes active and produces an error
 - If and when the propagated error affects the delivered service (e.g., information content), a failure occurs
• Example 2
 - An error by a programmer leads to a failure to write the correct instruction or data
 - This results in a dormant fault in the written software (e.g., a faulty instruction)
 - Upon activation the fault becomes active and produces an error
 - When the error affects the delivered service, a failure occurs
• Example 3
 - An inappropriate human-system interaction performed by an operator is an external fault (from the system viewpoint)
 - The resulting altered processed data is an error, ……
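The chain in Example 1 can be traced in a few lines of code. The following is a minimal sketch, not part of the original slides: the `AndGate` class, the `observe` function, and the service `(a AND b) OR c` are hypothetical names chosen for illustration. It shows a stuck-at fault that stays dormant for some inputs, produces an error for others, and becomes a failure only when the error reaches the delivered output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AndGate:
    """Two-input AND gate; `stuck_at` models a short that pins the output."""
    stuck_at: Optional[int] = None  # None = fault-free, 0 or 1 = stuck value

    def output(self, a: int, b: int) -> int:
        return (a & b) if self.stuck_at is None else self.stuck_at


def observe(gate: AndGate, a: int, b: int, c: int) -> str:
    """Classify one invocation of the hypothetical service (a AND b) OR c."""
    if gate.stuck_at is None:
        return "no fault"
    if gate.output(a, b) == (a & b):
        # This input does not exercise the fault, so it stays dormant.
        return "dormant fault"
    # The fault is active: the gate output is an error.
    delivered = gate.output(a, b) | c
    expected = (a & b) | c
    if delivered == expected:
        # The OR with c hides the wrong gate output from the user.
        return "error (masked, no failure)"
    return "failure"


faulty = AndGate(stuck_at=0)
for inputs in [(0, 1, 0), (1, 1, 1), (1, 1, 0)]:
    print(inputs, "->", observe(faulty, *inputs))
```

The three inputs walk through the chain in order: the fault is present but dormant, then activated into an error that the OR stage masks, then propagated to the service output as a failure.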
System Stack: Soft Errors, Timing Errors, etc. • Layers, top to bottom: Application, Operating System/Virtual Machine, Architecture, Devices/Circuits • Error sources marked on the stack: soft errors, timing errors
System Stack: Design Errors + Manufacturing Defects • Layers, top to bottom: Application, Operating System/Virtual Machine, Architecture, Devices/Circuits • Error sources marked on the stack: design errors, manufacturing defects
System Stack: Errors in Kernel or Device Drivers • Layers, top to bottom: Application, Operating System/Virtual Machine, Architecture, Devices/Circuits • Error sources marked on the stack: kernel error, driver error
System Stack: Concurrency Bugs and Memory Corruption Errors • Layers, top to bottom: Application, Operating System/Virtual Machine, Architecture, Devices/Circuits • Error sources marked on the stack: concurrency bugs, memory corruption
System Stack • Goal: prevent errors in any layer of the system stack from impacting the application’s functionality • Layers, top to bottom: Application, Operating System/Virtual Machine, Architecture, Devices/Circuits • Diagram labels: errors; user perceives application malfunction
Transient/Soft Errors Cause a Higher Failure Rate than All Other Mechanisms Combined
Real-world implications
• Transient vs. permanent fault ratio: 10:1
• Cell phone (2003 vs. today)
 - 2003: memory size 4 Mbit; Soft Error Rate (SER) = 1000 FIT/Mbit; reset rate: one every 28 years
 - Today: iPhone 4S has 512 MB of memory + 64 GB of solid-state disk; reset rate: one every 10 days
• Router (2003): memory size 100 Gbit; SER = 600 FIT/Mbit; error rate: one every 17 hours
• Laptop at 35,000 ft (2003): memory size 256 MB = 2 Gbit; SER = 600 FIT/Mbit, which becomes ~100,000 FIT/Mbit; error rate: one every 5 hours
FIT: number of failures in 10^9 h of operation
Source: IEEE International Reliability Physics Symposium ’03
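The reset and error rates on this slide follow directly from the FIT definition (failures per 10^9 hours of operation). The sketch below reproduces the arithmetic; the function and variable names are illustrative. It assumes 1 Gbit = 1000 Mbit for the router and laptop, and for the iPhone 4S it assumes that only the 512 MB of memory counts and that the 2003 figure of 1000 FIT/Mbit still applies, since the slide does not state a current SER.

```python
FIT_HOURS = 1e9                # 1 FIT = one failure per 10^9 hours of operation
HOURS_PER_YEAR = 24 * 365.25


def hours_between_errors(memory_mbit: float, ser_fit_per_mbit: float) -> float:
    """Mean time between soft errors, in hours, for a memory of the given size."""
    total_fit = memory_mbit * ser_fit_per_mbit  # failure rates add up across bits
    return FIT_HOURS / total_fit


phone_2003_h = hours_between_errors(4, 1000)        # 4 Mbit at 1000 FIT/Mbit
iphone_4s_h = hours_between_errors(512 * 8, 1000)   # 512 MB = 4096 Mbit (assumed SER)
router_h = hours_between_errors(100 * 1000, 600)    # 100 Gbit at 600 FIT/Mbit
laptop_h = hours_between_errors(2 * 1000, 100_000)  # 2 Gbit at ~100,000 FIT/Mbit

print(f"2003 cell phone: one error every {phone_2003_h / HOURS_PER_YEAR:.1f} years")
print(f"iPhone 4S:       one error every {iphone_4s_h / 24:.1f} days")
print(f"Router:          one error every {router_h:.1f} hours")
print(f"Laptop, 35k ft:  one error every {laptop_h:.1f} hours")
```

Under these assumptions the results land on the slide's rounded figures: about 28 years, 10 days, 17 hours, and 5 hours.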
System View of Assured Computing
• Applications: What can be provided in software and the application itself?
• Middleware, hypervisor/VM
• Reliable and secure communications (system network): What can be provided in the communication layer?
• Operating system: What can be provided in the operating system?
• Hardware (processing elements, memory, storage system): What can be provided in hardware to ensure fail-silent behavior of system components?
• Across layers: How to combine hardware and software techniques: (1) fast detection in hardware, (2) high-efficiency detection and recovery in software
• Assessment: How to assess whether the achieved availability and security meet requirements
Fault Cycle & Dependability Measures
[Figure: timeline of one fault cycle: previous repair → fault occurs → (fault latency) → error, i.e., the fault becomes active → (error latency) → error/failure detection (e.g., parity error) → (repair time) → repair memory → next fault occurs. MTTF, MTTR, and MTBF are marked on the timeline.]
• Reliability: a measure of the continuous delivery of service; R(t) is the probability that the system survives (does not fail) throughout [0, t]; expected value: MTTF (Mean Time To Failure)
• Maintainability: a measure of the service interruption; M(t) is the probability that the system will be repaired within a time less than t; expected value: MTTR (Mean Time To Repair)
• Availability: a measure of the service delivery with respect to the alternation of delivery and interruptions; A(t) is the probability that the system delivers a proper (conforming to specification) service at a given time t; expected value: EA = MTTF / (MTTF + MTTR)
• Safety: a measure of the time to catastrophic failure; S(t) is the probability that no catastrophic failures occur during [0, t]; expected value: MTTCF (Mean Time To Catastrophic Failure)
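A short numeric sketch can make these measures concrete. The availability formula is the one given on the slide. Two things here are assumptions not stated on the slide: MTBF = MTTF + MTTR is a common convention, and the exponential form of R(t) holds only under a constant failure rate. The function names and the example values (MTTF = 1000 h, MTTR = 2 h) are illustrative.

```python
import math


def availability(mttf_h: float, mttr_h: float) -> float:
    """Expected availability EA = MTTF / (MTTF + MTTR), as on the slide."""
    return mttf_h / (mttf_h + mttr_h)


def mtbf(mttf_h: float, mttr_h: float) -> float:
    """Mean time between failures; ASSUMES the convention MTBF = MTTF + MTTR."""
    return mttf_h + mttr_h


def reliability(t_h: float, mttf_h: float) -> float:
    """R(t) = exp(-t / MTTF); ASSUMES a constant failure rate (exponential model)."""
    return math.exp(-t_h / mttf_h)


mttf_h, mttr_h = 1000.0, 2.0  # hypothetical example values
a = availability(mttf_h, mttr_h)
print(f"Availability:       {a:.6f}")
print(f"Downtime per year:  {(1 - a) * 8760:.1f} h")
print(f"MTBF:               {mtbf(mttf_h, mttr_h):.0f} h")
print(f"R(100 h):           {reliability(100, mttf_h):.4f}")
```

Note that availability depends only on the ratio of MTTR to MTTF: a system that fails often but repairs quickly can be as available as one that fails rarely but takes long to repair, even though their reliability differs.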
Why Is Detection Critical?
• Detection prevents error propagation and preempts crashes.
• [Figure: outcomes of an error along the time axis, with the following recoverable labels: error occurs; unactivated error (success); error activated; benign error event; error detected (DETECT); crash latency; program crash; incorrect output]
System View of Dependable Computing
• Applications: What can be provided in software and the application itself?
• Application program interface (API), SIFT middleware
• Reliable communications (system network): What can be provided in the communication layer?
• Operating system: What can be provided in the operating system?
• Hardware (processing elements, memory, storage system): What can be provided in hardware to ensure fail-silent behavior of system components?
• Across layers: How to combine hardware and software fault tolerance techniques: (1) fast error detection in hardware, (2) high-efficiency detection and recovery in software
• Assessment: How to assess whether the achieved availability meets system requirements
How Do We Achieve the Objectives?
• Applications: checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming
• Application program interface (API), SIFT middleware, reliable communications (system network): CRC on messages, acknowledgment, watchdogs, heartbeats, consistency protocols
• Operating system: memory management, detection of process failures, hooks to support software fault tolerance for applications
• Hardware (processing elements, memory, storage system): error-correcting codes, N-of-M and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks)
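As one illustration of the application-level techniques listed here, the sketch below shows software voting (fault masking) over replicated computations. It is a minimal example under stated assumptions, not the course's implementation: the names `majority_vote`, `run_replicated`, and the three `square_*` replicas are hypothetical, and one replica carries a deliberately injected fault.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(results: Sequence[int]) -> int:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        # No strict majority: the voter cannot mask the fault, only report it.
        raise RuntimeError(f"no majority among {list(results)}")
    return value


def run_replicated(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the outputs."""
    return majority_vote([replica(x) for replica in replicas])


def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_faulty(x: int) -> int:
    return x * x + 1  # injected design fault, for illustration only


print(run_replicated([square_a, square_b, square_faulty], 7))
```

With three replicas the voter masks any single faulty result. Masking two simultaneous faults by strict majority would take five replicas, which is the usual trade-off between redundancy cost and the number of faults tolerated.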
Faults, Errors, and Failures in Computing Systems
[Figure: Faults → Errors → Failures; diagram label: Processor]
Faults
• Permanent (hard) faults - Natural failures - Natural radiation - HW design errors
• Transient (soft) faults - Power transients - Switching transients - Natural radiation - Single upsets - Multiple upsets
• Intermittent faults - Natural failures - Power transients
• Software faults - SW design errors - System upgrades - Requirements changes
• External faults
Failure to Meet Requirements
• Reliability, long term - Mission life
• Reliability, short term - Critical functions - Database protection
• Availability
• Detection latencies
• Containment boundaries
• Recovery latencies
• Autonomy
Hardware Fault Models
Levels of abstraction: stuck-at (e.g., s-a-0), module level, functional level, system level
• Example: physical failures in circuits
 - Lines in a gate-level circuit stuck at 0 or 1
 - Faulty contact
 - Transistor stuck open or closed
 - Metal lines open
 - Shorts between adjacent metal lines
• Example: decoder
 - No output lines activated
 - An incorrect line activated instead of the desired line
 - An incorrect line activated in addition to the desired line
• Example: memories
 - One or more cells are stuck at 0 or 1
 - One or more cells fail to undergo a 0-1 or 1-0 transition
 - Two or more cells are coupled: a 1-0 transition in one cell changes the contents of another cell
 - More than one cell is accessed during READ or WRITE
 - A wrong cell is accessed during READ or WRITE
• Example: a parallel processor topology
 - View the machine as a graph: nodes correspond to processors, edges correspond to links
 - Fault model: a processor (node) or link (edge) is faulty
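The memory example can be made concrete with a small simulation. The sketch below models only the first item of the memory fault list, cells stuck at 0 or 1, and detects them by writing and reading back an all-0 and then an all-1 pattern. The `FaultyMemory` class and `find_stuck_cells` function are hypothetical names; this simple test does not cover the transition, coupling, or addressing faults also listed on the slide.

```python
from typing import Dict, Optional


class FaultyMemory:
    """Bit-addressable memory in which selected cells are stuck at a fixed value."""

    def __init__(self, size: int, stuck_at: Optional[Dict[int, int]] = None):
        self.size = size
        self.stuck_at = dict(stuck_at or {})  # address -> stuck value (0 or 1)
        self.cells = [0] * size
        for addr, value in self.stuck_at.items():
            self.cells[addr] = value

    def write(self, addr: int, bit: int) -> None:
        if addr not in self.stuck_at:  # writes to a stuck cell have no effect
            self.cells[addr] = bit

    def read(self, addr: int) -> int:
        return self.cells[addr]


def find_stuck_cells(mem: FaultyMemory) -> Dict[int, int]:
    """Write all-0 then all-1, reading back each time; report mismatching cells."""
    faulty: Dict[int, int] = {}
    for pattern in (0, 1):
        for addr in range(mem.size):
            mem.write(addr, pattern)
        for addr in range(mem.size):
            if mem.read(addr) != pattern:
                faulty[addr] = mem.read(addr)  # value the cell is stuck at
    return faulty


mem = FaultyMemory(8, stuck_at={2: 1, 5: 0})
print(find_stuck_cells(mem))
```

Both patterns are needed because a cell stuck at 0 passes the all-0 check and a cell stuck at 1 passes the all-1 check; each fault is activated only by the opposite value.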