Lesson learned after our recent cooling problem



  1. Lesson learned after our recent cooling problem Michele Onofri, Stefano Zani, Andrea Chierici HEPiX Spring 2014

  2. Outline • INFN-T1 on-call procedure • Incident • Recovery Procedure • What we learned • Conclusions Andrea Chierici

  3. INFN-T1 on-call procedure

  4. On-call service • CNAF staff are on-call on a weekly basis • 2-3 times per year for each person • Must live within 30 minutes of CNAF • Service phone receives alarm SMSes • Periodic training on security and intervention procedures • 3 incidents in the last three years • only this last one required the site to be totally powered off Andrea Chierici
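The alarm chain described above (facility probes paging the service phone with alarm SMSes) is only sketched at a high level in the slides. Below is a minimal illustration of the idea in Python; the gateway URL, phone number, temperature threshold and HTTP API are placeholders, not the actual CNAF setup.

```python
import urllib.parse
import urllib.request

# Placeholders: the real CNAF SMS gateway, phone number and thresholds are not in the slides.
SMS_GATEWAY_URL = "https://sms-gateway.example.org/send"
ONCALL_PHONE = "+39XXXXXXXXXX"
ROOM_TEMP_LIMIT_C = 30.0


def send_sms(number: str, text: str) -> None:
    """Post an alarm message to an SMS gateway (illustrative HTTP API)."""
    data = urllib.parse.urlencode({"to": number, "text": text}).encode()
    urllib.request.urlopen(SMS_GATEWAY_URL, data=data, timeout=10)


def check_room_temperature(room_temp_c: float) -> None:
    """Page the on-call phone when a facility probe reports a too-high room temperature."""
    if room_temp_c > ROOM_TEMP_LIMIT_C:
        send_sms(ONCALL_PHONE, f"HIGH TEMP WARNING: computing room at {room_temp_c:.1f} C")


if __name__ == "__main__":
    check_room_temperature(31.5)  # example reading only
```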

  5. Service Dashboard Andrea Chierici

  6. Incident

  7. What happened on the 9th of March • 1.08am: fire alarm • On-call person intervenes and calls the firefighters • 2.45am: fire extinguished • 3.18am: high temperature warning • Air conditioning blocked • On-call person calls for help • 4.40am: decision is taken to shut down the center • 12.00pm: chiller under maintenance • 5.00pm: chiller fixed, center can be turned back on • 9.00pm: farm back on-line, waiting for storage Andrea Chierici

  8. 10th of March • 9.00am: support call to switch storage back on • 6.00pm: center open again for LHC experiments • Next day: center fully open again Andrea Chierici

  9. Chiller power supply Andrea Chierici

  10. Incident representation [diagram: control-system head with two control-logic power supplies (Ctrl sys Pow 1, Ctrl sys Pow 2) feeding chillers 1-6] Andrea Chierici

  11. Incident examination • 6 chillers for the computing room • 5 share the same power supply for the control logic (we did not know that!) • Fire in one of the control logic units cut power to 5 chillers out of 6 • 1 chiller was still working and we weren’t aware of that! • Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done. Andrea Chierici

  12. Facility monitoring app Andrea Chierici

  13. Chiller n.4 • BLACK: electric power in (kW) • BLUE: water temp. IN (°C) • YELLOW: water temp. OUT (°C) • CYAN: Ch. room temp. (°C) Andrea Chierici

  14. Incident seen from inside Andrea Chierici

  15. Incident seen from outside Andrea Chierici

  16. Recovery Procedure

  17. Recovery procedure • Facility: support call for an emergency intervention on the chiller • recovered the burned bus and control logic n.4 • Storage: support call • Farming: took the chance to apply all security patches and the latest kernel to the nodes • Switch-on order: LSF server, CEs, UIs • For a moment we were thinking about upgrading to LSF 9 Andrea Chierici
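The switch-on order above (LSF server first, then CEs, then UIs) is essentially a staged restart: each group has to answer before the next one is started. The sketch below is one way to encode that ordering; the host names, the ssh-port reachability probe and the retry interval are assumptions, not the tooling actually used at CNAF.

```python
import socket
import time

# Hypothetical hosts and probe ports; the real CNAF inventory is not part of the slides.
SWITCH_ON_ORDER = [
    # (stage name, [(host, tcp_port_to_probe), ...])
    ("LSF server", [("lsf-master.example.infn.it", 22)]),
    ("CEs",        [("ce01.example.infn.it", 22), ("ce02.example.infn.it", 22)]),
    ("UIs",        [("ui01.example.infn.it", 22)]),
]


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap 'is it back?' probe: can we open a TCP connection (here: ssh on port 22)?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_stage(checks, retry_s: float = 30.0) -> None:
    """Block until every host of the current stage answers before the next stage starts."""
    while not all(port_open(h, p) for h, p in checks):
        time.sleep(retry_s)


def staged_switch_on() -> None:
    for stage, checks in SWITCH_ON_ORDER:
        # The actual power-on was done by operators/IPMI; this sketch only enforces the ordering.
        print(f"== switch on {stage}, then wait until every host answers")
        wait_for_stage(checks)


if __name__ == "__main__":
    staged_switch_on()
```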

  18. Failures (1) • Old WNs • BIOS battery exhausted, configuration reset • PXE boot, hyper-threading, disk configuration (AHCI) • lost IPMI configuration (30% broken) Andrea Chierici
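Re-applying the lost IPMI settings on affected worker nodes can be scripted in-band with ipmitool (run on the node itself, which assumes the IPMI kernel drivers are loaded). The sketch below is illustrative only; the LAN channel and addresses are example values, not CNAF's.

```python
import subprocess

# Example LAN channel; many BMCs use channel 1, but this is an assumption.
IPMI_CHANNEL = "1"


def restore_ipmi_lan(ipaddr: str, netmask: str, gateway: str) -> None:
    """Re-apply a static IPMI LAN configuration in-band with ipmitool,
    e.g. after an exhausted BIOS battery wiped the BMC settings."""
    for args in (
        ["ipsrc", "static"],
        ["ipaddr", ipaddr],
        ["netmask", netmask],
        ["defgw", "ipaddr", gateway],
    ):
        subprocess.run(["ipmitool", "lan", "set", IPMI_CHANNEL, *args], check=True)


if __name__ == "__main__":
    # Example values only; the real addressing plan is site-specific.
    restore_ipmi_lan("192.168.10.42", "255.255.255.0", "192.168.10.1")
```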

  19. Failures (2) • Some storage controllers were replaced • 1% of PCI cards (mainly 10 Gbit network) were replaced • Disks, power supplies and network switches were almost undamaged Andrea Chierici

  20. What we learned

  21. We fixed our weak point [diagram: control-system head with a separate control-logic power supply (Ctrl sys Pow 1-6) for each of the six chillers] Andrea Chierici

  22. We lack an emergency button • Shutting the center down is not easy: a real “emergency shutdown” procedure is missing • We could have avoided switching off the whole center if we had had more control • Depending on the incident, some services may be left on-line • The on-call person can’t know all the site details Andrea Chierici

  23. Hosted services • Our computing room hosts services and nodes outside our direct supervision, over which it is difficult to have full control • We need an emergency procedure for those too • We need a better understanding of the SLAs Andrea Chierici

  24. Conclusions

  25. We benchmarked ourselves • It took 2 days to get the center back on-line • less than one day to reopen to the LHC experiments • everyone was aware of what to do • All worker nodes rebooted with a solid configuration • A few nodes were reinstalled and put back on-line in a few minutes Andrea Chierici

  26. Lesson learned • We must have clearer evidence of which chillers are working at any given moment (the on-call person does not have it right now) • The new dashboard appears to be the right place • We created a task-force to implement a controlled shutdown procedure • Establish a shutdown order • WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches • In case of emergency, the on-call person is required to take a difficult decision Andrea Chierici
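As a rough illustration of the controlled shutdown procedure the task-force is working on, the sketch below walks the stages in the order listed above, asking each group of hosts to power itself off before moving on to the next group. Host names are placeholders, root ssh access is assumed, and network switches are left to their management interface rather than ssh.

```python
import subprocess

# Hypothetical host lists grouped by the order proposed in the slide;
# the real CNAF inventory and host names are not part of the slides.
SHUTDOWN_ORDER = [
    ("worker nodes",               ["wn001.example.infn.it", "wn002.example.infn.it"]),
    ("disk-servers",               ["ds01.example.infn.it"]),
    ("grid and non-grid services", ["ce01.example.infn.it", "ui01.example.infn.it"]),
    ("bastions",                   ["bastion1.example.infn.it"]),
    # Network switches go last, typically via their management interface rather than ssh.
]


def halt(host: str, dry_run: bool = True) -> None:
    """Ask a node to power itself off over ssh (assumes root key-based access)."""
    cmd = ["ssh", "-o", "ConnectTimeout=10", f"root@{host}", "shutdown", "-h", "now"]
    if dry_run:
        print("DRY RUN:", " ".join(cmd))
        return
    subprocess.run(cmd, check=False)


def controlled_shutdown(dry_run: bool = True) -> None:
    """Walk the stages in order, so dependent services go down before what they depend on."""
    for stage, hosts in SHUTDOWN_ORDER:
        print(f"== stage: {stage}")
        for h in hosts:
            halt(h, dry_run=dry_run)


if __name__ == "__main__":
    controlled_shutdown(dry_run=True)  # flip to False only during a real emergency
```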

  27. Testing the shutdown procedure • The shutdown procedure we are implementing can’t be easily tested • How do we perform a “simulation”? • It doesn’t sound right to switch the center off just to prove we can do it safely • How do other sites address this? • Should periodic BIOS battery replacements be scheduled? Andrea Chierici
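One low-risk partial check, far short of actually switching the center off, is a “paper test” of the shutdown plan: verify that every host in the inventory belongs to exactly one stage and print the planned order without touching any machine. The inventory and stage contents below are hypothetical.

```python
# A "paper test" of the shutdown plan: check that every host in the inventory is assigned
# to exactly one stage and print the planned order, without touching any machine.
# Hypothetical inventory and stage contents; the real CNAF lists are not in the slides.
INVENTORY = {"wn001", "wn002", "ds01", "ce01", "ui01", "bastion1", "sw-core1"}
STAGES = [
    ("worker nodes",               {"wn001", "wn002"}),
    ("disk-servers",               {"ds01"}),
    ("grid and non-grid services", {"ce01", "ui01"}),
    ("bastions",                   {"bastion1"}),
    ("network switches",           {"sw-core1"}),
]


def check_plan() -> None:
    planned = [h for _, hosts in STAGES for h in hosts]
    assert len(planned) == len(set(planned)), "a host appears in more than one stage"
    missing = INVENTORY - set(planned)
    assert not missing, f"hosts with no shutdown stage: {sorted(missing)}"
    for i, (name, hosts) in enumerate(STAGES, start=1):
        print(f"stage {i}: {name} -> {sorted(hosts)}")


if __name__ == "__main__":
    check_plan()
```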
