FAULT TOLERANT POWER SYSTEMS

Carsten Nesgaard
Advisors: Professor Michael A. E. Andersen, Professor Seth R. Sanders
Ext. collaborators:

Overview: The chart shown to the right represents the focal points of the Ph.D. project and reflects the key elements of this presentation.

Increased awareness: Originating in the field of high accuracy software for critical applications, modern fault tolerance applies equally well to hardware systems, since the weakest link within a given system determines the overall reliability. An unreliable power supply degrades system performance even when the remaining system elements are highly reliable.
Consequences of system downtime:
• Inability to carry out financial transactions
• Substantial losses in sales
• Loss of customer services etc.

Fault tolerance (definition): The ability of a system to respond gracefully to an unexpected hardware or software failure. There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure. Many fault tolerant computer systems mirror all operations - that is, every operation is performed on two or more duplicate systems, so if one fails the other can take over. Source: http://www.webopedia.com

Distributions: The following table contains the key functions and parameters concerning distributions in reliability evaluation:
[Table: columns are Distribution, Failure density f(t), Survivor function R(t), Hazard rate λ(t) and Variance σ²; rows are Poisson, Gaussian, Exponential and Weibull. The cell contents are not reproduced in this transcript.]
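Since the table cells are not available here, the following is a minimal sketch of the three functions for two of the listed distributions, using the standard textbook forms rather than anything taken from the slide. The parameter names `lam`, `beta` and `eta` are assumptions chosen for illustration.

```python
import math

def exponential(t, lam):
    """Return (f(t), R(t), hazard) for an exponential distribution with rate lam."""
    survivor = math.exp(-lam * t)      # R(t) = exp(-lam * t)
    density = lam * survivor           # f(t) = lam * exp(-lam * t)
    return density, survivor, lam      # the hazard rate is constant

def weibull(t, beta, eta):
    """Return (f(t), R(t), hazard) for a two-parameter Weibull distribution, t > 0."""
    survivor = math.exp(-((t / eta) ** beta))
    hazard = (beta / eta) * (t / eta) ** (beta - 1)
    return hazard * survivor, survivor, hazard   # f(t) = hazard * R(t)
```

With `beta = 1` the Weibull case reduces to the exponential case with `lam = 1 / eta`, which is a quick sanity check on both functions.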

Network modeling: Assuming the failure rate for each block/component within a given network can be found in MIL-HDBK-217F, the following simplifications can be applied:
• Constant hazard rate ⇒ exponential distribution
• MTBF = reciprocal of failure rate
Reliability network reductions are independent of the distribution used:
$R_{\text{Series}} = \prod_i R_i$
$R_{\text{Parallel}} = 1 - \prod_i (1 - R_i)$
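The series and parallel reductions and the constant hazard rate simplification can be sketched as follows. This is an illustration only: it assumes independent blocks, and the function names and any failure rates passed in are hypothetical rather than values from MIL-HDBK-217F.

```python
import math

def series(reliabilities):
    """Series network: every block must work, so the reliabilities multiply."""
    return math.prod(reliabilities)

def parallel(reliabilities):
    """Parallel network: the system fails only if every block fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

def mtbf(failure_rate):
    """MTBF for a constant hazard rate (failures per hour)."""
    return 1.0 / failure_rate

def reliability(failure_rate, hours):
    """R(t) = exp(-lambda * t) for a constant hazard rate."""
    return math.exp(-failure_rate * hours)

# Two identical supplies in parallel, hypothetical failure rate of 2e-6 per hour.
r_unit = reliability(2e-6, 10_000)
r_redundant = parallel([r_unit, r_unit])
```

Nested networks can be reduced by applying `series` and `parallel` repeatedly from the innermost group outwards.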

System identification: Since no system can be made tolerant to all possible faults, it is essential that critical faults are identified and characterized during the design:
• Critical faults with realistic probability of occurrence
• The level of criticality (component, system, operator etc.)
Two examples of critical failures in a redundant power supply:
• Over-voltage at output (resulting in loss of load)
• Short circuit of the input bus (resulting in loss of power)
From the above-mentioned failures it can be seen that both lead to a loss of the load, thus undermining the concept of redundancy.

System identification:
Fault detection: If critical failure modes cannot be avoided in the design of a given system, it is essential that these failure modes are continuously monitored if fault tolerance within the system is to be maintained.
Fault isolation: If a fault is detected within a given system, the proper precautions must be taken by either dynamic replacement or redundancy. This prevents the propagation of a fault from its origin at one point within the system to a point where it can have a critical effect on a process or a user.
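A minimal sketch of how monitoring and isolation could fit together for the two critical failures mentioned earlier (output over-voltage and a shorted input bus). The thresholds, fault names and unit names are hypothetical, and a real supervisor would also need filtering and timing that are left out here.

```python
from dataclasses import dataclass

@dataclass
class Limits:
    v_out_max: float  # output voltage above this counts as over-voltage [V]
    v_in_min: float   # input bus voltage below this suggests a short [V]

def detect_faults(v_out, v_in, limits):
    """Compare one pair of measurements against the limits and return the faults found."""
    faults = set()
    if v_out > limits.v_out_max:
        faults.add("output_over_voltage")
    if v_in < limits.v_in_min:
        faults.add("input_bus_short")
    return faults

def isolate(unit_faults):
    """Return the names of the units to disconnect: every unit reporting a fault."""
    return sorted(name for name, faults in unit_faults.items() if faults)

limits = Limits(v_out_max=5.5, v_in_min=10.0)
status = {
    "unit_a": detect_faults(5.0, 12.0, limits),
    "unit_b": detect_faults(6.1, 12.0, limits),
}
to_disconnect = isolate(status)
```

Keeping detection and isolation as separate functions mirrors the split between the surveillance task and the action taken on the power path.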

System identification:
Fault prediction (estimation): As opposed to the above-mentioned topics, which must be an integrated part of any fault tolerant system, a system's ability to predict faults based on continuous measurements of key components is a desirable feature made possible mainly by advances in digital controllers.
Redundancy control: Based on the two keywords fault detection and fault isolation, a redundancy control algorithm has been developed using array based logic. A paper describing the approach taken has been submitted to COMPEL2002.
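The algorithm from the submitted paper is not reproduced in this presentation. As a loosely related illustration only, the sketch below uses a plain lookup table indexed by the fault flags of two redundant units; the action names and the two-unit setup are assumptions.

```python
# Index = fault flags of units A and B packed into a 2-bit number (bit 0 = A, bit 1 = B).
# Each entry holds the action for (unit A, unit B).
ACTIONS = [
    ("share", "share"),        # 0b00: both healthy, share the load
    ("isolate", "full_load"),  # 0b01: A faulty, B carries the full load
    ("full_load", "isolate"),  # 0b10: B faulty, A carries the full load
    ("isolate", "isolate"),    # 0b11: both faulty, shut down
]

def redundancy_action(a_faulty, b_faulty):
    """Look up the actions for units A and B from their fault flags."""
    index = (int(b_faulty) << 1) | int(a_faulty)
    return ACTIONS[index]
```

A table of this kind makes every combination of fault flags explicit, so no state is left without a defined response.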

System identification:
[Figure: Redundant network with mutually exclusive block failure rates. The values shown indicate the probability of block success.]
Dividing the three fault parameters into a high power and a low power category, one sees that fault isolation falls into the high power category, whereas detection and prediction of faults fall into the low power category due to the surveillance nature of these topics.

Power system: Based on the system identification of the overall power system the following subjects must be considered:
• Power supply topology (high efficiency, component stress etc.)
• Control scheme
• Redundancy vs. optimised component selection
• Cost price
• Active/passive current sharing in redundant power supplies
• Thermal surveillance
• Probability of malfunction

Power system: In its basic form the Buck topology has no components directly connected across the power input v_g(t). Source: Fundamentals of Power Electronics, second ed., Erickson/Maksimovic.
Based on the data found in MIL-HDBK-217F, a table containing block level failure rates for different converter topologies shall be established.

Redundancy:
• No redundancy (series systems, high quality components)
• Full redundancy (parallel systems, low quality components)
• Partial redundancy
• Standby systems
Reliability / availability: The term reliability relates to a system's ability to stay in the operating state without failure. Thus, reliability is unsuitable as a measure for continuously operated systems that can tolerate failures. To describe the latter type of system the term availability is used. It is interpreted as the probability of finding the system in the operating state at some time into the future.
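One common way to put a number on availability is the steady-state ratio MTBF / (MTBF + MTTR). The sketch below uses that ratio; the mean time to repair (MTTR) is not mentioned in the presentation and is an assumption here, as are the function names and the independence of the redundant units.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time spent in the operating state."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def parallel_availability(availabilities):
    """Availability of fully redundant units, assuming independent failure and repair."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a   # the system is down only if every unit is down
    return 1.0 - unavailable

# Hypothetical supply: fails on average every 999 h and takes 1 h to repair.
a_single = availability(999.0, 1.0)
a_redundant = parallel_availability([a_single, a_single])
```

Unlike reliability, this measure credits the system for returning to service after a repair, which is why it suits continuously operated systems.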

Digital vs. analog control: Surveillance and control of highly reliable power supplies can be performed by either digital or analog circuitry. Traditionally the analog approach has been taken (bandwidth, accuracy etc.). With increased processor speed and lower cost, the digital approach offers a wide variety of sophisticated control schemes that enable ‘intelligent’ redundancy management.

Digital vs. analog control: The main purposes for implementing a digital control scheme in DC/DC converter applications are:
• Possibility of advanced fault detection (location, impact etc.)
• Fault isolation (controlled shut-down, redundancy control etc.)
• Fault estimation based on selected measurement parameters

Digital vs. analog control: The following list of pros and cons concerns the power system's surveillance and control circuitry.
Analog, pros:
• Short reaction time
• High accuracy
Analog, cons:
• Noise and temperature sensitive
• No or very little ‘intelligence’
• Single function surveillance circuitry
Digital, pros:
• Noise margin
• Temperature stability
• Implementation of control algorithms
• Multiple surveillance functions
Digital, cons:
• Discrete values, thus bit errors
• Finite sample time

Digital vs. analog control: In order to test the implementation of different surveillance schemes, a Buck converter has been assembled.
[Figure: Test converter with switches for external fault simulation]
• 4 measurement points for oscilloscope connection
• 4 switches for fault simulation
• Interface to microcontroller incl. various measurement parameters

Chosen approach: Based on this presentation the following basic rules have been deduced:
• Know precisely what the system is supposed to do when working under both normal and abnormal circumstances.
• Group fault causes into different classes, thereby identifying and categorizing all critical failure modes.
• Determine fault containment regions within the system. This is important since fault propagation in any system is to be prevented.
• Determine the application failure margins and balance the level of fault tolerance with the cost of implementation.

Summary: An overview of the main topics within the field of fault tolerant power systems has been presented. These include:
• Identification of power systems
• Probability analysis of power systems
• Digital vs. analog control schemes
• Fault detection, fault isolation and fault prediction
