Resiliency – an Emerging “ility”
Dr. Richard Mayer, President, Speed of Light Consulting LLC
July 14, 2011
DOE Smart Grid Report 2009
3.6 Operates Resiliently to Disturbances, Attacks, and Natural Disasters
“Resiliency” refers to the ability of a system to react to events such that problematic consequences are isolated with minimal impact to the remaining system, and the overall system is restored to normal operation as soon as practical. These self-healing actions result in reduced interruption of service to consumers and help service providers more effectively manage the delivery infrastructure. Resiliency includes protection against all hazards, whether accidental or malicious, and needs to span natural disasters, deliberate attack, equipment failures, and human error. A smart grid inherently addresses security from the outset as a requirement for all the elements, and ensures an integrated and balanced approach across the system.
From the point of view of the Nation’s national security, this characteristic is arguably the most important. Resiliency in the face of adverse conditions or aggression, particularly high-consequence events, underlies all aspects of a smart grid and cuts across the other characteristics. Resiliency is embedded in operational culture: policy, procedures, and vigilance. It is embodied through effective risk management, with thorough understanding and management.
Resilience Engineering
http://www.resilience-engineering.org/
The term Resilience Engineering represents a new way of thinking about safety. Whereas conventional risk management approaches are based on hindsight…, Resilience Engineering looks for ways …to create processes that are robust yet flexible, to monitor and revise risk models, and to use resources proactively in the face of disruptions or ongoing production and economic pressures.
(scalable, evolvable systems)
Examples of resilient design:
• Self-annealing, autonomous networks
• Uploadable spacecraft software
• The TSAT communication system had provision for mobile ground control centers with reduced capability, in case the fixed control center was attacked.
Is it resilience or resiliency?
• Dictionary.com says they are the same.
• Resilient = able to absorb or deal with a deformation/disruption and return to a previous/functioning state.
• Resilience = the property of an object/system of being resilient.
• Resiliency = the abstract property of being resilient, e.g., as a subject for study. Probably not needed; resilience would suffice.
Scott Jackson, Architecting Resilient Systems: Accident Avoidance, Survival and Recovery from Disruptions. INCOSE Webinar, December 16, 2009
Definition from Westrum:
• The ability to anticipate a disruption and prevent something bad from happening – avoidance
• The ability to prevent something from getting worse – survival
• The ability to recover from something bad once it has happened – recovery
Note: According to Westrum, a system only needs two of these three abilities to be resilient.
Operational resiliency has three basic descriptive properties (Caralli et al. 2006):
1. ability to change (adapt, expand, conform, contort) when a force is enacted,
2. ability to perform adequately or minimally while the force is in effect,
3. ability to return to a predefined, expected normal state whenever the force relents or is rendered ineffective. [I would add: “or to be returned” by outside intervention]
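Westrum's two-of-three note can be written down as a simple check. The sketch below is only an illustration of that rule; the class name, field names, and the `required` parameter are assumptions made here and do not come from Jackson or Westrum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceAbilities:
    """Hypothetical record of which of Westrum's three abilities a system has."""
    avoidance: bool  # can anticipate a disruption and prevent it
    survival: bool   # can keep something bad from getting worse
    recovery: bool   # can recover once something bad has happened

    def count(self) -> int:
        # Booleans sum as 0/1, so this is the number of abilities present.
        return sum((self.avoidance, self.survival, self.recovery))

    def is_resilient(self, required: int = 2) -> bool:
        # Per the note on the slide: two of the three abilities are enough.
        return self.count() >= required


# A system that cannot avoid the disruption but can survive and recover.
grid = ResilienceAbilities(avoidance=False, survival=True, recovery=True)
print(grid.is_resilient())
```

A real assessment would grade each ability rather than treat it as yes/no; the boolean form is used only to keep the rule visible.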
Distinguish resilience from prevention
• Semantics: compare to elasticity. To a mechanical engineer (but not to ordinary people) a rubber band and an I-beam are both elastic; they retain their original shape after the force is removed.
• I would say that the rubber band is resilient to a point, while the I-beam is insusceptible to a point, then not resilient at all.
• Resilience implies some disruption in performance, and some degree of recovery.
Relationship to other “ilities”: Reliability
Reliability vs. resilience: neither mutually exclusive nor identical.
Reliability is the probability of achieving specified performance over a specified period.
DAG: Reliability requirements address mission reliability and logistics reliability. …Both … typically include resilience to some component failures or even attack by having built-in redundancy and maintenance.
Reliability is calculated on a specific design. Resilience includes going outside the specific design to restore function by external intervention.
The difference could become blurred if you include, for example, software patches to a spacecraft to implement a workaround after a failure. If the workaround was anticipated in the maintenance plan, this could be reliability; if it was developed only in response to an incident, I would call it resilience and not reliability.
So a resiliency requirement might be to have modular and upgradeable software. Thus open-system architecture is an enabler of resilience.
No single-point failures: a common requirement for spacecraft (with always some exceptions). It is a factor in reliability, and also a factor in enabling resilience.
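To make "reliability is calculated on a specific design" concrete, here is a minimal sketch of the standard series/parallel calculation. It assumes independent failures and a constant failure rate (exponential model); those assumptions, the function names, and the numbers are illustrative and are not taken from the DAG or from the slides.

```python
import math
from typing import Iterable


def unit_reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability one unit survives the period, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def series(reliabilities: Iterable[float]) -> float:
    """All units must work: every unit is a single point of failure."""
    return math.prod(reliabilities)


def parallel(reliabilities: Iterable[float]) -> float:
    """Redundant units: the function is lost only if every unit fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


r = 0.9  # assumed reliability of one unit over the mission period
print(series([r, r]))    # about 0.81: two units in a chain
print(parallel([r, r]))  # about 0.99: the same two units as a redundant pair
```

The redundant pair removes the single-point failure, which is the reliability view. Nothing in this arithmetic covers a fix that was developed only after an incident, which is the part the slide assigns to resilience.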
Some Accidents: Instances of (Non)Resilience?
• Power restoration on 9/11: systems worked together. Enablers were there: distributed generators, emergency generators, communications, C&C.
• The resilience was not in any one system.
• Challenger: a resilience problem? If you include program management and NASA “culture” as the SoS of building and operating the shuttle, then perhaps it is.
• Metrolink 111: systemic deficiencies in factors 1 and 2 of Westrum: avoidance and damage control. The rail network was deficient. Train or traffic control was deficient. There should have been prevention in the Metrolink train.
• A system of systems. The engineer is a “system” separate from the train.
Resilience is Primarily for System of Systems
• Resiliency can apply to a system, but it is much more relevant to a system of systems.
• An SoS can have much more resiliency than individual systems acting alone.
• Role of a system: to enable or contribute to SoS resiliency.
• Self-healing is part of resiliency. But SoS intervention is probably a bigger part.
• Most disaster responses of the type discussed by Jackson involve a system of systems working or not working in cooperation.
• Deming on Quality applies to resilience: management thinks 80% of failures are labor’s fault, and labor thinks 80% are management’s fault, and labor is right: management has the ability to do something to prevent 80% of the failures.
• It is easier to change the environment than to change human nature.
Heuristics (Collected by Jackson) $$
• Capacity Heuristics:
  • Absorption/margin, functional redundancy, physical redundancy
  • Context spanning: System should be designed to both worst case and most likely scenarios (Madni)
• Flexibility Heuristics:
  • Self-reorganization (Woods), organizational flexibility
  • Human backup/in-the-loop/in control
  • Predictability, simplicity/complexity avoidance, loose coupling
• Rule from Deming, Quality: If you must change the human or change the environment, change the environment.
$$ Human vs. automated pendulum
Modularity
Heuristics (Continued)
• Tolerance Heuristics:
  • Graceful degradation, drift correction, mobility, prevention, deterrence
• Inter-Element Collaboration Heuristics:
  • The human operator should be informed (Billings)
  • Maximize knowledge between nodes (Billings)
  • Intent awareness: knowledge of the others’ intent and back up each other... (Madni and Billings)
  • No inter-element impediments to collaboration. (Jackson)
Implementing Resilience: ABB Paper on a Smarter Grid
Note the Resilience Enablers
Jackson: Some Conclusions About Disruptions
• Failures may occur when all the components of a system function as designed; these are called systemic failures. The system “failed” because it encountered a situation not envisioned in development and therefore not elaborated in a requirement. (See the context-spanning heuristic.)
• The large number of possible interactions among the elements of a system causes the probability of an accident to be much larger than the individual failures would imply (i.e., greater than the sum of individual failures).
• Example: sneak paths in software or hardware that only get triggered in rare situations, or in failure situations, or after unplanned disruptions. Are they failures?
• It is practically impossible to ferret out all SW “bugs” in today’s systems. Resiliency might mean the capability to restart or go to a safe mode if the SW gets lost (as opposed to finding all the bugs). Is this a SW reliability requirement?
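The "restart or go to a safe mode" idea in the last bullet can be sketched as a small supervisor. This is a minimal illustration under stated assumptions, not how any particular spacecraft or grid controller works: the names `Mode`, `run_with_safe_mode`, `make_flaky`, and the restart limit are all hypothetical.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    NOMINAL = "nominal"
    SAFE = "safe"


def run_with_safe_mode(step: Callable[[], None], max_restarts: int = 2) -> Mode:
    """Run `step`; restart it after an unexpected error, then fall back to safe mode.

    The supervisor does not try to diagnose the bug. It only bounds the damage.
    """
    for _attempt in range(max_restarts + 1):  # first try plus the allowed restarts
        try:
            step()
            return Mode.NOMINAL
        except Exception:
            continue  # treat any unexpected error as "the SW got lost" and restart
    return Mode.SAFE  # restarts exhausted: hold a minimal, known-good state


def make_flaky(failures: int) -> Callable[[], None]:
    """Hypothetical task that fails `failures` times, then works."""
    remaining = [failures]

    def step() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("software lost its state")

    return step


print(run_with_safe_mode(make_flaky(1)))  # one restart is enough
print(run_with_safe_mode(make_flaky(5)))  # restarts run out, safe mode is entered
```

The design choice worth noting is that the fallback is specified without knowing the failure cause, which is what separates it from finding and fixing each bug.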
Comments on Jackson’s Conclusions
• Causes of catastrophes are beyond the domains of traditional reliability, safety and other reductionist approaches – Not entirely. What do we mean by “the cause”? Cause of disruption (terrorist attack)? Causes of damage (not designed to withstand an intense fire, reactor not designed to survive power loss) are not “beyond the domains.”
• My take: Addressing the response may be beyond traditional domains. Resiliency is largely beyond the domains of the traditional approaches, because response most likely must take place at the SoS level.
• Disruptions (e.g., human error) are not the cause of catastrophes; they simply initiate it. This is semantics. The initiation is a cause, but not the only cause, especially of the extent of the damage and the inability to respond. [Train accident in LA. No fail-safe measures in either the train or the rail system.]
• A system can be architected to create resilience. I would say: a system can be architected to enable and achieve a measure of resilience. A System of Systems can be empowered to achieve resilience of its systems.
• The primary aspects of resilience are adaptability, risk and culture. Risk and culture are not properties of a system. Risk assessment can drive creation of resilience.
• Future work includes: (1) validation of heuristics, (2) development of metrics, (3) others I would nominate:
  • Determining and requiring enablers
  • Analyzing resilience as a requirement in SoS architecture and operations.
What to Do with Resilience?
• Resilience is enabled by system requirements: status data, modularity, mobility, intrusion detection, etc., and achieved by SoS architecture and operations.
• Is resilience something to be addressed in its own right, or is it adequately enabled under existing “ilities”? How do you develop system/element requirements to make them resilient?
• How much “extra capacity” can we afford to achieve the objectives? What is the value or cost/benefit equation when lives are at stake?
• An approach to flush out enablers: a scenario study. If the unthinkable happens…
  • What should “we” (the country, company, …) have in place to deal with it?
  • We want to minimize the loss of life.
  • We want all emergency responders to be able to communicate.
  • We want to restore service (power grid, transportation …).
  • We want no permanent loss of our data.
Resources
• INCOSE Resilient Systems Working Group http://www.incose.org/practice/techactivities/wg/rswg/
• http://www.resilience-engineering.org/
• Scott Jackson. Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions. Wiley. 2009
• Cognitive Technologies Laboratory http://www.ctlab.org/
• Ashgate Publishing's Resilience Engineering Perspectives Series: Remaining Sensitive to the Possibility of Failure, which compiles papers from the November 2007. 2nd Symposium on Resilience Engineering in Juan-Les-Pins France. Resilience Engineering: Concepts and Precepts.
• The 3rd International Symposium on Resilience Engineering
• Resilience Engineering: Concepts and Precepts. By Erik Hollnagel, David D. Woods, Nancy Leveson
• CERT – Resiliency Management http://www.cert.org/resiliency/ (emphasis on Information Technology)
• Center for Resilience at the Ohio State University http://www.resilience.osu.edu/CFR-site/aboutus.htm