Lesson learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
• INFN-T1 on-call procedure
• Incident
• Recovery procedure
• What we learned
• Conclusions
On-call service
• CNAF staff are on call on a weekly basis
  • 2–3 times per year for each person
• Must live within 30 minutes of CNAF
• Service phone receives alarm SMSes
• Periodic training on security and intervention procedures
• 3 incidents in the last three years
  • only this last one required the site to be powered off completely
Service Dashboard
What happened on the 9th of March
• 01:08: fire alarm
  • on-call person intervenes and calls the firefighters
• 02:45: fire extinguished
• 03:18: high-temperature warning
  • air conditioning blocked
  • on-call person calls for help
• 04:40: decision taken to shut down the center
• 12:00: chiller put under maintenance
• 17:00: chiller fixed, center can be turned back on
• 21:00: farm back on-line, waiting for storage
10th of March
• 09:00: support call to switch storage back on
• 18:00: center open again for the LHC experiments
• Next day: center fully open again
Chiller power supply
Incident representation
[Diagram: a Control System Head with two control-logic power supplies (Ctrl sys Pow 1 and Ctrl sys Pow 2) feeding six chillers; five of the six chillers share one supply]
Incident examination
• 6 chillers serve the computing room
• 5 of them share the same power supply for the control logic (we did not know that!)
• A fire in one of the control-logic units cut power to 5 chillers out of 6
• 1 chiller was still working and we were not aware of that!
• Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
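The shared control-logic supply described above is a classic single point of failure, and it is easy to detect from an inventory once the wiring is actually recorded. A minimal sketch, with hypothetical names mirroring the pre-incident topology (the mapping of individual chillers to supplies is an assumption; the slide only states that five of six shared one supply):

```python
from collections import defaultdict

# Assumed pre-incident wiring: 5 of 6 chillers shared one control-logic
# power supply -- exactly the dependency the site was not aware of.
control_power = {
    "chiller1": "ctrl_pow_1",
    "chiller2": "ctrl_pow_1",
    "chiller3": "ctrl_pow_1",
    "chiller4": "ctrl_pow_1",
    "chiller5": "ctrl_pow_1",
    "chiller6": "ctrl_pow_2",
}

def single_points_of_failure(mapping, max_loss=1):
    """Return supplies whose failure would stop more than max_loss chillers."""
    by_supply = defaultdict(list)
    for chiller, supply in mapping.items():
        by_supply[supply].append(chiller)
    return {s: sorted(c) for s, c in by_supply.items() if len(c) > max_loss}

spof = single_points_of_failure(control_power)
# ctrl_pow_1 would take out five chillers at once
```

Running the same check against the post-fix wiring (one supply per chiller) would return an empty result.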
Facility monitoring app
Chiller n.4
BLACK: electric power in (kW)
BLUE: water temp. IN (°C)
YELLOW: water temp. OUT (°C)
CYAN: chiller room temp. (°C)
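Metrics like the ones plotted above are what triggered the 03:18 high-temperature warning. A minimal sketch of such a threshold check, with entirely made-up sample readings and an assumed limit (neither the metric values nor the threshold come from the talk):

```python
# Illustrative readings only: (time, water_temp_out_C, room_temp_C).
samples = [
    ("01:00", 14.0, 23.0),
    ("03:00", 17.5, 26.0),
    ("03:18", 21.0, 29.5),  # air conditioning blocked after the fire
]

WATER_OUT_LIMIT_C = 18.0  # assumed alarm threshold

def high_temp_warnings(readings, limit=WATER_OUT_LIMIT_C):
    """Return (time, temp) pairs where the water output temp exceeds the limit."""
    return [(t, temp) for t, temp, _room in readings if temp > limit]

alarms = high_temp_warnings(samples)  # [("03:18", 21.0)]
```

A real monitoring system would of course stream readings continuously and page the on-call phone, but the alarm condition itself is this simple.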
Incident seen from inside
Incident seen from outside
Recovery procedure
• Facility: support call for an emergency intervention on the chillers
  • recovered the burned bus and control logic n.4
• Storage: support call
• Farming: took the chance to apply all security patches and the latest kernel to the nodes
• Switch-on order: LSF server, CEs, UIs
  • for a moment we considered upgrading to LSF 9
Failures (1)
• Old WNs
  • BIOS battery exhausted, configuration reset
  • PXE boot, hyper-threading, disk configuration (AHCI)
  • lost IPMI configuration (30% broken)
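When a CMOS battery dies, settings silently revert to factory defaults, so after a mass power cycle it helps to diff each node's BIOS settings against a site baseline. A minimal sketch, assuming a hypothetical baseline and node inventory (setting names like `sata_mode` are illustrative; a real version would collect them via vendor tooling):

```python
# Assumed site baseline for worker-node BIOS settings.
BASELINE = {"boot": "pxe", "hyper_threading": "on", "sata_mode": "ahci"}

def drifted(nodes, baseline=BASELINE):
    """Return node -> settings that differ from the baseline."""
    bad = {}
    for name, settings in nodes.items():
        diff = {k: settings.get(k)
                for k in baseline if settings.get(k) != baseline[k]}
        if diff:
            bad[name] = diff
    return bad

# Hypothetical inventory: wn002's BIOS battery died and its config reset.
nodes = {
    "wn001": {"boot": "pxe", "hyper_threading": "on", "sata_mode": "ahci"},
    "wn002": {"boot": "disk", "hyper_threading": "off", "sata_mode": "ide"},
}

to_fix = drifted(nodes)  # only wn002 needs reconfiguration
```

The same diff-against-baseline approach applies to the lost IPMI configurations.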
Failures (2)
• Some storage controllers were replaced
• 1% of PCI cards (mainly 10 Gbit network) replaced
• Disks, power supplies and network switches were almost undamaged
We fixed our weak point
[Diagram: Control System Head with six independent control-logic power supplies (Ctrl sys Pow 1–6), one per chiller]
We lack an emergency button
• Shutting the center down is not easy: a real "emergency shutdown" procedure is missing
• We could have avoided switching off the whole center if we had had more control
  • depending on the incident, some services may be left on-line
• The person on call cannot know all the site details
Hosted services
• Our computing room hosts services and nodes outside our direct supervision, over which it is difficult to gain full control
• We need an emergency procedure for those too
• We need a better understanding of the SLAs
We benchmarked ourselves
• It took 2 days to get the center back on-line
  • less than one to reopen to the LHC experiments
  • everyone was aware of what to do
• All working nodes rebooted with a solid configuration
  • a few nodes were reinstalled and put back on-line in a few minutes
Lessons learned
• We must have clearer evidence of which chillers are working at any moment (the on-call person does not have it right now)
  • the new dashboard appears to be the right place
• We created a task force to implement a controlled shutdown procedure
  • establish a shutdown order
  • WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches
• In case of emergency, the on-call person is required to take a difficult decision
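The shutdown order above lends itself to a small, testable script. A minimal sketch, assuming hypothetical group names and a pluggable power-off hook (a real version would call out to IPMI or the batch system per host); a `dry_run` mode also addresses the testing problem raised on the next slide:

```python
# Controlled-shutdown order from the slide: least critical first.
SHUTDOWN_ORDER = [
    "worker_nodes",
    "disk_servers",
    "grid_services",
    "non_grid_services",
    "bastions",
    "network_switches",
]

def controlled_shutdown(order, power_off, dry_run=True):
    """Walk groups in order; in dry_run mode only record what would happen."""
    actions = []
    for group in order:
        actions.append(group)
        if not dry_run:
            power_off(group)  # e.g. IPMI "chassis power off" per host
    return actions

# Dry run: produces the plan without touching anything.
plan = controlled_shutdown(SHUTDOWN_ORDER, power_off=lambda g: None)
```

Keeping the order as data rather than hard-coding it means the on-call person can also stop partway, leaving some services on-line depending on the incident.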
Testing the shutdown procedure
• The shutdown procedure we are implementing cannot be easily tested
  • how do we perform a "simulation"?
  • it doesn't sound right to switch the center off just to prove we can do it safely
• How do other sites address this?
• Should periodic BIOS battery replacements be scheduled?