DCS Report on Warnings and Alarms During Cooling Failure on Feb 4 • 11 Feb, 2009 • K. Grogg
Temperature Alerts • Types of temperature alerts • Software Warning • Appears in alarm panel as RMCX.crateY.tmpz HIGH • No action taken (could add email alert) • Hardware (RMC firmware) Warning • Appears in alarm panel as AlarmStatusCrates Warning • No action taken for Warning_trip_time • Don’t want to be too hasty shutting down • Trip Conditions • Warning Trip Time • Time before the AlarmStatusCrates Warning becomes a Fault • Do not want higher than normal temps for too long • 48 V shut off if in hardware warning for this length of time • Hardware (RMC firmware) Temperature Fault • Appears in alarm panel as alarm ALARM and AlarmStatusCrates Fault • Too hot, 48 V shut off immediately • Rack trip temp • Will shut off rack power if above threshold set by central DCS • We do not want this method of shutting down the crates
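The alert levels above can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not the RMC firmware logic: the `Thresholds` class, the `classify` function, the alert labels, and the temperature values are all hypothetical. Only the 600 s warning trip time comes from these slides, and the ordering (software warning below hardware warning below hardware fault) is assumed from the order in which the alerts appeared during the cooling failure.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical container; temperatures in deg C, time in seconds.
    software_warning: float
    hardware_warning: float
    hardware_fault: float
    warning_trip_time: float


def classify(temp, seconds_in_hw_warning, th):
    """Return (alert_level, cut_48v) for one temperature reading.

    seconds_in_hw_warning is how long the reading has already been
    above the hardware warning threshold.
    """
    if temp >= th.hardware_fault:
        # Too hot: 48 V shut off immediately.
        return "HARDWARE_FAULT", True
    if temp >= th.hardware_warning:
        if seconds_in_hw_warning >= th.warning_trip_time:
            # Warning has lasted the full trip time: it becomes a Fault.
            return "HARDWARE_FAULT", True
        return "HARDWARE_WARNING", False
    if temp >= th.software_warning:
        # Shown in the alarm panel as HIGH; no action taken.
        return "SOFTWARE_WARNING", False
    return "OK", False


# Placeholder temperatures (not the real settings); 600 s is from the slides.
example = Thresholds(30.0, 35.0, 40.0, 600.0)
print(classify(36.0, 100.0, example))
print(classify(36.0, 600.0, example))
```

The rack trip temperature is left out on purpose, since the stated goal is for the RMC to cut the 48 V before the rack does.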
Temperature Threshold Settings • Current values (can be adjusted as needed)
Optimizing Thresholds • Want to keep electronics in a healthy state • It is best for electronics to keep temperature in as narrow a window as possible • We do not want the racks to turn off the crate power • RMCs should control when crates turn off – allows continued monitoring • Also allows for fault records to be kept (not enough time to write the information if the rack shuts off the 48V) • Need to set the software, firmware, and rack temperature and time thresholds with the following goals: • Minimize the time spent at excessive temperatures • Avoid shutting down unnecessarily for smaller fluctuations • Ensure RMCs turn off crate power, not the racks • Tom set current thresholds based on measurements at UW lab • Default in firmware version 2.1 • New suggestions: • Fault threshold ~8-10 deg above nominal • Warning threshold halfway between • 4-5 deg above “normal” likely too low because of normal fluctuations
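As a worked illustration of the new suggestions, the sketch below derives a warning and fault threshold from a nominal temperature. It assumes that "halfway between" means halfway between the nominal temperature and the fault threshold; the function name `suggested_thresholds`, its default offset, and the example nominal value are hypothetical.

```python
def suggested_thresholds(nominal, fault_offset=9.0):
    """Return (warning, fault) thresholds in the same units as nominal.

    fault_offset is the margin above nominal for the fault threshold;
    the slides suggest roughly 8-10 deg. The warning threshold is
    assumed to sit halfway between nominal and fault.
    """
    fault = nominal + fault_offset
    warning = nominal + fault_offset / 2.0
    return warning, fault


# Placeholder nominal temperature of 25 deg with a 10 deg fault margin.
warning, fault = suggested_thresholds(25.0, fault_offset=10.0)
print(warning, fault)  # 30.0 35.0
```

With a 10 deg margin the warning lands 5 deg above nominal, which is at the edge of the 4-5 deg range the slide flags as likely too low, so the per-sensor baseline and its normal fluctuation would still need to be checked.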
Screenshot of Alarm panel • HIGH – Software high temperature threshold reached by at least one temp (a, b, c, d) • AlarmStatusCrates WARNING – At least one temp reached 1st hardware threshold • alarm ALARM – At least one temp reached 2nd hardware threshold • CrateA.vmeX FAULT – Crates off (48V off) • AlarmStatusCrates FAULT – Occurs with ALARM when the hardware Fault threshold has been exceeded, and occurs again when at least one temperature has been above threshold for more than the warning trip time (600 s) • New entries with identical information are overwritten, so only the second occurrence of AlarmStatusCrates FAULT remains in the alarm panel
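The overwrite behavior noted for the alarm panel can be pictured with a dictionary keyed on the entry's content. This is only an illustrative model, assuming that "identical information" means the same source and state; the `record` helper and the timestamps (seconds after the hardware warning) are hypothetical and do not describe how the panel is implemented.

```python
def record(panel, source, state, timestamp_s):
    """Store an entry; an entry with the same (source, state) is overwritten."""
    panel[(source, state)] = timestamp_s
    return panel


panel = {}
record(panel, "AlarmStatusCrates", "WARNING", 0)
record(panel, "alarm", "ALARM", 350)              # hardware fault threshold hit
record(panel, "AlarmStatusCrates", "FAULT", 350)  # first FAULT, with ALARM
record(panel, "AlarmStatusCrates", "FAULT", 600)  # second FAULT, trip time reached

# Only the later FAULT timestamp survives.
print(panel[("AlarmStatusCrates", "FAULT")])  # 600
```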
Current System Configuration • Hardware thresholds: same for all RMCs • Refresh button must be clicked to get up-to-date information
Screenshot of Fault panel • Get temperatures at time of Fault • Get time between warning and fault • Get time Faults occurred • Open Fault panel (not complete information if rack turns off first)
Effects of recent cooling failure • Need to make sure the thresholds are optimal • Cooling lost on 4 Feb, 2009, at about 14:40 • Time between first HIGH (software warning) and AlarmStatusCrates WARNING: • 7:45 to 11:15 (minutes:seconds) • Time between AlarmStatusCrates WARNING and alarm ALARM: • 295–400 s • After 600 s (time over t_warning setting), AlarmStatusCrates indicated FAULT (for the second time; the first time was with alarm ALARM) • Temperatures were still above the hardware warning setting • Typically one or two temps above hardware warning • Typically one temp (Crate A, temp B) above the hardware Fault threshold • Temp B has the lowest threshold (and lowest baseline) • Specific times and temperatures for each RMC are given on the next 9 slides
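To put the quoted intervals on one scale, the sketch below converts the minutes:seconds range to seconds and adds the ranges together. The helper names are hypothetical, and because the slide does not say which RMC produced which end of each range, the sums are only outer bounds on the time from the first software HIGH to each later event.

```python
def mmss_to_seconds(text):
    """Convert a 'minutes:seconds' string such as '7:45' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def add_ranges(a, b):
    """Add two (low, high) ranges end to end."""
    return (a[0] + b[0], a[1] + b[1])


WARNING_TRIP_TIME_S = 600  # from the slides

high_to_warning = (mmss_to_seconds("7:45"), mmss_to_seconds("11:15"))
warning_to_alarm = (295, 400)

high_to_alarm = add_ranges(high_to_warning, warning_to_alarm)
high_to_trip_fault = add_ranges(
    high_to_warning, (WARNING_TRIP_TIME_S, WARNING_TRIP_TIME_S)
)

print(high_to_warning)     # (465, 675)
print(high_to_alarm)       # (760, 1075)
print(high_to_trip_fault)  # (1065, 1275)
```

Read this way, an alert sent at the software warning would have arrived roughly 8 to 11 minutes before the hardware warning and at least about 12 minutes before the first hardware ALARM.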
RMC 1 Temps & Times • 4 Feb, 2009 • All temperatures given for all RMCs are for Crate A, which is consistently higher • Values in red are above thresholds
RMC 2 Temps & Times • 4 Feb, 2009 • When the rack cuts power before the RMC shuts down the crates, there is only a partial fault record
Conclusions • RMC hardware limits are set to trip the 48 V when needed, rather than relying on the rack to turn off the power • Except for RMC 2, crate power was turned off before the racks cut power (rack power stayed on) • Trip temperature limits should be lowered at least a little bit • But don’t want to trip if there is no problem • Slight temperature fluctuations possible – seasonal changes, cooling water fluctuations, precision of reading, additional non-RCT electronics in rack • Excursions up to +3 deg from baseline appear to happen during normal operation • Suggestions from Monika • Lower hardware Warning and Fault thresholds by 2-3 deg. • Send email after software warning (allows some time for action to be taken)
Other Items • This presentation will be put on the RCTSlowControl twiki for reference • A remote UI has been set up on hpwiscms02 (laptop) • Allows monitoring without logging into terminal server • Documentation on setting up a remote UI has been added to twiki
RMC 3 Temps & Times • 4 Feb, 2009
RMC 4 Temps & Times • 4 Feb, 2009
RMC 5 Temps & Times • 4 Feb, 2009
RMC 6 Temps & Times • 4 Feb, 2009
RMC 7 Temps & Times • 4 Feb, 2009
RMC 8 Temps & Times • 4 Feb, 2009
RMC 9 Temps & Times • 4 Feb, 2009