Alarms in CC
A brief overview of alarm management in the computer centre
Introduction
• All computer centre alarms are presented to the Operators 24/7
• An alarm means that an action is required
• The action can range from simply logging the event to calling out an expert at 2:00 AM
Computer alarms workflow
[Diagram: alarm workflow by contract type. Labels: Contract type E (if not administered by SysAdmins); Contract type D]
A typical use case
• Log each event (alarm)
• Do not analyze the situation
• Apply procedures, step by step
• Tools must be provided and access granted
• Fix and/or escalate problems
• Operators are allowed to call people outside working hours
[Diagram labels: LAS, OPM, SDB]
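The use case above can be sketched as a small handling loop: log the alarm, run the procedure steps in order, and escalate on the first step that fails. This is only an illustration; the function name `handle_alarm`, the shape of `steps`, and the return values are assumptions, not part of the actual tools.

```python
from datetime import datetime, timezone


def handle_alarm(alarm, steps, log):
    """Log the alarm, apply the procedure step by step, escalate on failure.

    `steps` is a list of (description, action) pairs, where each action is a
    callable returning True on success. All names here are hypothetical.
    """
    log.append((datetime.now(timezone.utc), alarm, "received"))
    for description, action in steps:
        ok = action()
        outcome = "ok" if ok else "failed"
        log.append((datetime.now(timezone.utc), alarm, f"{description}: {outcome}"))
        if not ok:
            # The operator does not analyze the failure; the problem is escalated.
            return "escalate"
    return "fixed"
```

Every step is logged whether it succeeds or not, so the log records exactly what was tried before an expert is called.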
Use case: steps 1 & 2
• OPM (web server) returns a ranked list of matching procedures
• The Operator selects the most appropriate one
• LAS is a web-based GUI
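The slides do not say how OPM ranks its matches. As a rough sketch under an assumed scoring rule (a procedure that names the affected node outranks a generic one for the same alarm), ranking could look like this. The dictionary keys, alarm names, and node names are hypothetical.

```python
def rank_procedures(procedures, alarm, node):
    """Return titles of procedures covering `alarm`, best match first.

    Hypothetical scoring: 2 if the procedure lists the node, 1 otherwise.
    """
    matches = []
    for proc in procedures:
        if alarm not in proc["alarms"]:
            continue  # the procedure does not cover this alarm at all
        score = 2 if node in proc["nodes"] else 1
        matches.append((score, proc["title"]))
    # Highest score first; ties are broken alphabetically by title.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [title for _, title in matches]


procedures = [
    {"title": "generic disk cleanup", "alarms": {"disk_full"}, "nodes": set()},
    {"title": "node001 disk cleanup", "alarms": {"disk_full"}, "nodes": {"node001"}},
    {"title": "high load", "alarms": {"high_load"}, "nodes": {"node001"}},
]
ranked = rank_procedures(procedures, "disk_full", "node001")
```

The Operator still makes the final choice; the ranking only orders the candidates.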
Use case: steps 3 & 4
• SDB (Service DB) lists Service Managers
• Use the short URL provided at the bottom of each page to reference them in procedures!
• Procedure content
  • List of nodes (applies to)
  • One entry per alarm
  • Commands to type are highlighted
  • Support links to SDB
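The procedure content listed above maps naturally onto a small data structure: a list of nodes, one entry per alarm, the commands to type, and a support link. The sketch below is an assumption about shape only; the class names, field names, and sample values are hypothetical and do not describe the real OPM format.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmEntry:
    alarm: str
    commands: list          # commands the operator types, in order
    support_url: str = ""   # short SDB URL for the Service Manager (left empty here)


@dataclass
class Procedure:
    applies_to: list                              # nodes the procedure covers
    entries: dict = field(default_factory=dict)   # alarm name -> AlarmEntry

    def add_entry(self, entry):
        # Enforce "one entry per alarm".
        if entry.alarm in self.entries:
            raise ValueError(f"duplicate entry for alarm {entry.alarm!r}")
        self.entries[entry.alarm] = entry

    def commands_for(self, node, alarm):
        """Return the commands for this node and alarm, or None if not covered."""
        if node not in self.applies_to:
            return None
        entry = self.entries.get(alarm)
        return entry.commands if entry else None


proc = Procedure(applies_to=["node001", "node002"])
proc.add_entry(AlarmEntry(alarm="disk_full", commands=["df -h"]))
```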
Service Managers’ Controls
• Not covered
• Do-it-yourself (tuning, corrective actions, etc.)
• We pay for a number of alarms per month
• Assistance needed (out of working hours, h/w faults, etc.)
Providing procedures
• In general:
  • Only Service Managers know what to do if anything goes wrong on their service(s)
  • Simple or urgent actions → Operators
    • e.g. reboot machine, take it out of production, ...
  • More complex solutions → SysAdmins
    • e.g. regenerate certificates, look in log files, ...
• Different Service Managers:
  • Application SM: service-related procedures
  • Infrastructure SM: machine-related procedures
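The split of responsibilities above can be illustrated with a simple lookup. The action names come from the examples on the slide; the function `assign` and the fallback to the Service Manager for unlisted actions are assumptions made for this sketch.

```python
# Example actions taken from the slide; the sets are illustrative, not exhaustive.
OPERATOR_ACTIONS = {"reboot machine", "take out of production"}
SYSADMIN_ACTIONS = {"regenerate certificates", "look in log files"}


def assign(action):
    """Return who carries out the action in this hypothetical routing."""
    if action in OPERATOR_ACTIONS:
        return "Operators"   # simple or urgent actions
    if action in SYSADMIN_ACTIONS:
        return "SysAdmins"   # more complex solutions
    # Assumed fallback: anything not covered goes back to the Service Manager.
    return "Service Manager"
```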
Providing procedures: GUI
• http://cern.ch/service-cc-opm/ (demo)
• Hints / restrictions:
  • Quick help for the impatient (6 steps)
  • Start from the proposed template (Operators)
  • Save locally, edit, upload new procedure
  • Validation!
  • Further edits can be done online (IE vs FF)
Useful links
• Service Managers Guidelines: http://cern.ch/it-div-fio-sao/guides/SM_guidelines.htm
• Lemon alarm system: http://cern.ch/lemon-status/
• Using the SysAdmin Service: http://cern.ch/service-cc-sysadmin/SM_guidelines.htm