Operation team at CCIN2P3
Suzanne Poulat – suzanne@in2p3.fr
Overview
• Operation team
• Organisation
• Operation's role
• Services outside working hours
• Tools
• Monitored services
• Examples
Operation team
• Two groups: Support and Operation
• Support (9 people):
  • general user support
  • dedicated staff for the LHC experiments
  • help desk (Xhelp)
  • opening the CC to collaborations and other sciences
• Operation: details follow
Organisation
• Ten people in the group
  • Two for Grid coordination
  • Four for Operation
  • Four operators in shifts covering 08:00 AM to 09:00 PM, 7 days a week
• On a weekly basis:
  • One person on operation duty (often 1.5)
  • The others work on development, monitoring, or administrative tasks
Operation's role
• Check the availability of all services (storage, CPU, …)
• Optimize service usage
• Ensure that CCIN2P3's commitments to the experiments and Grid VOs are met
• Organize scheduled shutdowns
• Coordinate actions during unscheduled downtimes
• Monitor and manage the tape libraries
• Create and manage accounts and AFS space
• Organize the "on duty" service
Services outside working hours
• On-site night security guard from 6 PM to 8 AM and on weekends
  • No computing actions: alerting and messaging only
• 1 on-duty engineer (evenings, weekends)
  • Corrective actions if possible (documentation, training)
  • Otherwise calls an expert, if available
• Weekend: 1 operator on site (10 AM – 5 PM)
  • First low-level actions
  • Otherwise calls the on-duty engineer
• The result is "best effort" coverage
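The escalation path on this slide can be sketched as a small lookup. This is only an illustration, not a tool used at CCIN2P3: the function name `escalation_chain` and the role labels are hypothetical, and the weekday working hours (8 AM to 6 PM) are an assumption inferred from the security guard's hours.

```python
from datetime import datetime


def escalation_chain(when: datetime) -> list[str]:
    """Return, in order, the roles contacted for an incident at `when`.

    Hypothetical helper: hours and role names are read off the slide,
    except weekday working hours (08:00-18:00), which are assumed.
    """
    is_weekend = when.weekday() >= 5  # Saturday = 5, Sunday = 6
    hour = when.hour

    # Assumed normal working hours: the regular team handles the incident.
    if not is_weekend and 8 <= hour < 18:
        return ["operation team"]

    chain = []
    if is_weekend and 10 <= hour < 17:
        # An operator is on site and attempts a first low-level action.
        chain.append("weekend operator")
    else:
        # The guard takes no computing action, only alerts and messages.
        chain.append("security guard (alerting only)")

    # The on-duty engineer tries a corrective action, then calls an expert.
    chain.append("on-duty engineer")
    chain.append("expert (if available)")
    return chain
```

For example, `escalation_chain(datetime(2024, 1, 6, 12))` (a Saturday at noon) starts with the weekend operator, while a weekday night starts with the security guard. The last entry is never guaranteed, which is why the coverage is "best effort".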
Tools
• Monitoring tool: NGOP -> Nagios
• Remote Logging Service: RLS
• E-mail
• Tickets from local and grid users: Xhelp, interfaced with GGUS at CC
• Web pages on the current state of services
• Wiki for documentation, recipes, shutdowns, post-mortem analysis
• Log of the daily production: ELog
• Ticket web page for tape and drive incidents (~50 incidents per month: 10 drives, 40 tapes, of which 2 with data loss)
• Scripts to analyse faulty tapes
Monitored services
• BQS
• Storage: HPSS, dCache, AFS
• Grid: CE, SRM, TOP BDII
• Databases
• Others: tape libraries, Saphir (privileges and location of services)
• Workers and all servers
Nagios
SMURF
Anastasie – Running jobs
Xhelp
Xhelp (2)
• ~320 tickets per month = 10 to 20 tickets per day
Xhelp (3)
Implementations
• Operation wiki
• Nagios monitoring
• Ovax
• User database interface
• Robotics incidents
• On-duty tools
Questions?