On-line Computing M&O

LHCC RRB SG, 16 Sep 2004
P. Vande Vyvre, CERN/PH, for the 4 LHC DAQ project leaders

Introduction
• Questions raised by RRB Scrutiny Group:
  • System managers profiles
  • Number of system managers
  • M&O budget category
  • Replacement profile of computer/network equipment
• Common answer from 4 LHC experiments
• See also presentation by A. Ceccucci to RRB SG in April 2003 on M&O for Online Computing

System managers profiles
• Continuity is needed for Level-2 and supervisor personnel

System management effort (1)
• Estimates based on LCG guidelines: fixed number of boxes (PC, network switch, storage element) per system manager
• Differences between online and offline systems:
  • Wide variety of equipment used as a single system
  • Various PCs with different configurations (trigger farms, dataflow, control, monitoring, file servers)
  • Variety compounded by staged procurements
  • Very large and highly loaded network (e.g. event building)
  • Failure of any part of the online system will either partially reduce data-taking efficiency (e.g. loss of an HLT sub-farm) or interrupt data taking (failure of the central controller), i.e. a complete, coherent system has to be run
• Dedicated team with appropriate skills needed to ensure reliability and optimal capacity of the online systems

System management effort (2)
• Manpower from collaboration?
  • LHC collaborations are very large, but attempts to find suitably qualified effort for system management have failed to meet even today’s needs
  • Most people (physicists, engineers) do not have the right profile
  • Institutes that have people with proper qualifications are not prepared to locate them at CERN for adequate periods
• Full operation
  • 24/7 cover at Level-1; normal working hours at Level-2 + service piquet
  • At least 5 people at Level-1 and 5 people at Level-2, reduced by some overlap
  • Shift crew will contribute to Level-1
• Provisional estimates, to be adapted (2008-9) following experience of running the system and better knowledge of its reliability
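The slide does not say how the figure of at least 5 people for 24/7 Level-1 cover was derived. One plausible reading is simple shift arithmetic, sketched below. The 40-hour working week and the function name `min_staff_for_continuous_cover` are assumptions made for illustration and do not come from the presentation.

```python
import math

# Hours to be covered each week for 24/7 operation.
HOURS_PER_WEEK = 24 * 7  # 168


def min_staff_for_continuous_cover(hours_per_person_per_week: float) -> int:
    """Smallest head count that can cover every hour of the week.

    Assumes one person on duty at a time and ignores leave, training
    and handover overlap, so the result is a lower bound.
    """
    if hours_per_person_per_week <= 0:
        raise ValueError("hours_per_person_per_week must be positive")
    return math.ceil(HOURS_PER_WEEK / hours_per_person_per_week)


if __name__ == "__main__":
    # Assumed 40-hour working week (not stated on the slide).
    print(min_staff_for_continuous_cover(40))
```

With the assumed 40-hour week this gives 168 / 40 = 4.2, rounded up to 5, which is consistent with the "at least 5 people" quoted for Level-1.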

System management effort (3)
Total effort in FTEs (Level-1 and Level-2 + Supervisor)

M&O budget category
• M&O A
  • Request of CERN management and RRB
  • No other identified source

Replacement of equipment (1)
• Equipment: PCs, network, and storage used for dataflow and online trigger
• Motivations:
  • Reliability of equipment as it ages
  • Maintainability after a few years (3-year warranty)
  • Suitability of old equipment to follow evolution of operating system and to work with new equipment
• Need to follow Operating System (OS) evolution:
  • Security patches
  • New PCs (staged installation) not supported by old OS versions
  • Old OS versions not supported
  • Code will continue to be developed with dependencies on the OS and compiler versions
  • Online trigger code based on, or using, offline code developed for the current OS version

Replacement of equipment (2)
• Categories
  • Disk and fileservers: lower reliability and very rapid evolution. 3 years
  • PCs: 4 years
    • Replacement cost will not directly follow Moore’s Law: I/O performance limitations, new multi-core architecture might require major increase in system memory
  • Network
    • Central switch: 5 years (= period of maintenance by manufacturer)
    • Smaller peripheral switches: 4 years (shorter warranty but less critical)
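The replacement periods above translate directly into a replacement calendar. The sketch below is illustrative only: the periods are those quoted on the slide, while the installation year 2007, the horizon 2017 and the function names are hypothetical values chosen for the example. It also treats each category as installed in a single year, whereas the real procurements are staged.

```python
from typing import Dict, List

# Replacement periods in years, as quoted on the slide.
REPLACEMENT_PERIOD_YEARS: Dict[str, int] = {
    "disk and file servers": 3,
    "PCs": 4,
    "central switch": 5,
    "peripheral switches": 4,
}


def replacement_years(install_year: int, period: int, horizon: int) -> List[int]:
    """Years (up to and including `horizon`) in which equipment first
    installed in `install_year` falls due for replacement."""
    return list(range(install_year + period, horizon + 1, period))


def annual_replacement_fraction(period: int) -> float:
    """Steady-state fraction of the installed base replaced per year,
    assuming purchases are spread evenly over the period."""
    return 1.0 / period


if __name__ == "__main__":
    # Hypothetical installation year and planning horizon.
    for category, period in REPLACEMENT_PERIOD_YEARS.items():
        years = replacement_years(2007, period, 2017)
        share = annual_replacement_fraction(period)
        print(f"{category}: replace in {years}, about {share:.0%} of base per year")
```

The steady-state fraction (one third of disk servers, one quarter of PCs per year, and so on) is only a first approximation of the yearly budget, since the slide notes that replacement cost will not directly follow Moore’s Law.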

Previous practice
• LEP and fixed target era:
  • Computers were complete systems qualified by a commercial company
  • Maintenance contract paid by experiments
  • System managers in experiments (some CERN staff)
  • CERN had operator staff in the computing center and in groups giving support to experiments
• LHC era:
  • Components tested, qualified and assembled into complete systems by the experiments
  • Overall system much larger and more complex than previously
  • Very few operators at CERN directly employed by CERN
