Lesson learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions
On-call service
• CNAF staff are on call on a weekly basis
  • 2–3 times per year for each person
• Must live within 30 minutes of CNAF
• Service phone receives alarm SMSes
• Periodic training on security and intervention procedures
• 3 incidents in the last three years
  • only this last one required the site to be powered off completely
Service Dashboard
What happened on the 9th of March
• 01:08: fire alarm
  • on-call person intervenes and calls the firefighters
• 02:45: fire extinguished
• 03:18: high-temperature warning
  • air conditioning blocked
  • on-call person calls for help
• 04:40: decision taken to shut down the center
• 12:00: chiller put under maintenance
• 17:00: chiller fixed, center can be turned back on
• 21:00: farm back on-line, waiting for storage
10th of March
• 09:00: support call to switch storage back on
• 18:00: center open again for the LHC experiments
• Next day: center fully open again
Chiller power supply
Incident representation
[Diagram: a Control System Head with two control-logic power supplies (Ctrl sys Pow 1 and Ctrl sys Pow 2) feeding six chillers; five of the six chillers share one supply]
Incident examination
• 6 chillers serve the computing room
• 5 of them share the same power supply for the control logic (we did not know that!)
• A fire in one of the control-logic units cut power to 5 chillers out of 6
• 1 chiller was still working and we were not aware of that!
• Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
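The shared control-logic supply described above is a classic single point of failure, and it is easy to detect from an inventory once the wiring is actually recorded. A minimal sketch, with hypothetical names mirroring the pre-incident topology (the mapping of individual chillers to supplies is an assumption; the slide only states that five of six shared one supply):

```python
from collections import defaultdict

# Assumed pre-incident wiring: 5 of 6 chillers shared one control-logic
# power supply -- exactly the dependency the site was not aware of.
control_power = {
    "chiller1": "ctrl_pow_1",
    "chiller2": "ctrl_pow_1",
    "chiller3": "ctrl_pow_1",
    "chiller4": "ctrl_pow_1",
    "chiller5": "ctrl_pow_1",
    "chiller6": "ctrl_pow_2",
}

def single_points_of_failure(mapping, max_loss=1):
    """Return supplies whose failure would stop more than max_loss chillers."""
    by_supply = defaultdict(list)
    for chiller, supply in mapping.items():
        by_supply[supply].append(chiller)
    return {s: sorted(c) for s, c in by_supply.items() if len(c) > max_loss}

spof = single_points_of_failure(control_power)
# ctrl_pow_1 would take out five chillers at once
```

Running the same check against the post-fix wiring (one supply per chiller) would return an empty result.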
Facility monitoring app
Chiller n.4
BLACK: electric power in (kW)
BLUE: water temp. IN (°C)
YELLOW: water temp. OUT (°C)
CYAN: chiller room temp. (°C)
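Metrics like the ones plotted above are what triggered the 03:18 high-temperature warning. A minimal sketch of such a threshold check, with entirely made-up sample readings and an assumed limit (neither the metric values nor the threshold come from the talk):

```python
# Illustrative readings only: (time, water_temp_out_C, room_temp_C).
samples = [
    ("01:00", 14.0, 23.0),
    ("03:00", 17.5, 26.0),
    ("03:18", 21.0, 29.5),  # air conditioning blocked after the fire
]

WATER_OUT_LIMIT_C = 18.0  # assumed alarm threshold

def high_temp_warnings(readings, limit=WATER_OUT_LIMIT_C):
    """Return (time, temp) pairs where the water output temp exceeds the limit."""
    return [(t, temp) for t, temp, _room in readings if temp > limit]

alarms = high_temp_warnings(samples)  # [("03:18", 21.0)]
```

A real monitoring system would of course stream readings continuously and page the on-call phone, but the alarm condition itself is this simple.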
Incident seen from inside
Incident seen from outside
Recovery procedure
• Facility: support call for an emergency intervention on the chillers
  • recovered the burned bus and control logic n.4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to the nodes
• Switch-on order: LSF server, CEs, UIs
  • for a moment we considered upgrading to LSF 9
Failures (1)
• Old WNs
  • BIOS battery exhausted, configuration reset
  • PXE boot, hyper-threading, disk configuration (AHCI)
  • lost IPMI configuration (30% broken)
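When a CMOS battery dies, settings silently revert to factory defaults, so after a mass power cycle it helps to diff each node's BIOS settings against a site baseline. A minimal sketch, assuming a hypothetical baseline and node inventory (setting names like `sata_mode` are illustrative; a real version would collect them via vendor tooling):

```python
# Assumed site baseline for worker-node BIOS settings.
BASELINE = {"boot": "pxe", "hyper_threading": "on", "sata_mode": "ahci"}

def drifted(nodes, baseline=BASELINE):
    """Return node -> settings that differ from the baseline."""
    bad = {}
    for name, settings in nodes.items():
        diff = {k: settings.get(k)
                for k in baseline if settings.get(k) != baseline[k]}
        if diff:
            bad[name] = diff
    return bad

# Hypothetical inventory: wn002's BIOS battery died and its config reset.
nodes = {
    "wn001": {"boot": "pxe", "hyper_threading": "on", "sata_mode": "ahci"},
    "wn002": {"boot": "disk", "hyper_threading": "off", "sata_mode": "ide"},
}

to_fix = drifted(nodes)  # only wn002 needs reconfiguration
```

The same diff-against-baseline approach applies to the lost IPMI configurations.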
Failures (2)
• Some storage controllers were replaced
• 1% of PCI cards (mainly 10 Gbit network) replaced
• Disks, power supplies and network switches were almost undamaged
We fixed our weak point
[Diagram: Control System Head with six independent control-logic power supplies (Ctrl sys Pow 1–6), one per chiller]
We lack an emergency button
• Shutting the center down is not easy: a real "emergency shutdown" procedure is missing
• We could have avoided switching off the whole center if we had had more control
  • depending on the incident, some services may be left on-line
• The person on call cannot know all the site details
Hosted services
• Our computing room hosts services and nodes outside our direct supervision, over which it is difficult to gain full control
• We need an emergency procedure for those too
• We need a better understanding of the SLAs
We benchmarked ourselves
• It took 2 days to get the center back on-line
  • less than one to reopen to the LHC experiments
  • everyone was aware of what to do
• All working nodes rebooted with a solid configuration
  • a few nodes were reinstalled and put back on-line in a few minutes
Lessons learned
• We must have clearer evidence of which chillers are working at any moment (the on-call person does not have it right now)
  • the new dashboard appears to be the right place
• We created a task force to implement a controlled shutdown procedure
  • establish a shutdown order
  • WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches
• In case of emergency, the on-call person is required to take a difficult decision
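The shutdown order above lends itself to a small, testable script. A minimal sketch, assuming hypothetical group names and a pluggable power-off hook (a real version would call out to IPMI or the batch system per host); a `dry_run` mode also addresses the testing problem raised on the next slide:

```python
# Controlled-shutdown order from the slide: least critical first.
SHUTDOWN_ORDER = [
    "worker_nodes",
    "disk_servers",
    "grid_services",
    "non_grid_services",
    "bastions",
    "network_switches",
]

def controlled_shutdown(order, power_off, dry_run=True):
    """Walk groups in order; in dry_run mode only record what would happen."""
    actions = []
    for group in order:
        actions.append(group)
        if not dry_run:
            power_off(group)  # e.g. IPMI "chassis power off" per host
    return actions

# Dry run: produces the plan without touching anything.
plan = controlled_shutdown(SHUTDOWN_ORDER, power_off=lambda g: None)
```

Keeping the order as data rather than hard-coding it means the on-call person can also stop partway, leaving some services on-line depending on the incident.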
Testing the shutdown procedure
• The shutdown procedure we are implementing cannot be easily tested
  • how do we perform a "simulation"?
  • it doesn't sound right to switch the center off just to prove we can do it safely
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?