
On-line Computing M&O

On-line Computing M&O. LHCC RRB SG, 16 Sep 2004. P. Vande Vyvre, CERN/PH, for 4 LHC DAQ project leaders. Introduction. Questions raised by RRB Scrutiny Group: system managers profiles; number of system managers; M&O budget category; replacement profile of computer/network equipment.


Presentation Transcript


  1. On-line Computing M&O. LHCC RRB SG, 16 Sep 2004. P. Vande Vyvre, CERN/PH, for 4 LHC DAQ project leaders. P. Vande Vyvre CERN-PH

  2. Introduction
     • Questions raised by RRB Scrutiny Group:
       • System managers profiles
       • Number of system managers
       • M&O budget category
       • Replacement profile of computer/network equipment
     • Common answer from 4 LHC experiments
     • See also presentation by A. Ceccucci to RRB SG in April 2003 on M&O for Online Computing

  3. System managers profiles
     • Continuity is needed for Level-2 and supervisor personnel

  4. System management effort (1)
     • Estimates based on LCG guidelines: fixed number of boxes (PC, network switch, storage element) per system manager
     • Differences between online and offline systems:
       • Wide variety of equipment used as a single system
       • Various PCs with different configurations (trigger farms, dataflow, control, monitoring, file servers)
       • Variety compounded by staged procurements
       • Very large and highly loaded network (e.g. event building)
       • Failure of any part of the online system will either partially reduce the efficiency of data taking (e.g. loss of an HLT sub-farm) or interrupt data taking (failure of the central controller), i.e. we have to run a complete, coherent system
     • Dedicated team with appropriate skills needed to ensure reliability and optimal capacity of the online systems
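The "fixed number of boxes per system manager" rule on slide 4 amounts to a simple ratio. The sketch below is only an illustration of that arithmetic: the inventory figures and the `boxes_per_manager` value are hypothetical placeholders, since the slides quote neither the actual LCG ratio nor the experiments' box counts.

```python
import math


def managers_needed(box_counts, boxes_per_manager):
    """Return the number of system managers implied by a fixed
    boxes-per-manager ratio, rounded up to whole people."""
    total_boxes = sum(box_counts.values())
    return math.ceil(total_boxes / boxes_per_manager)


# Hypothetical inventory: none of these numbers come from the slides.
inventory = {"pc": 1000, "network_switch": 50, "storage_element": 50}

# Hypothetical ratio: the real LCG guideline value is not given in the source.
print(managers_needed(inventory, boxes_per_manager=100))
```

Such a ratio gives only a baseline. The slide's argument is that online systems are more heterogeneous and more tightly coupled than offline farms, so a box count alone understates the skills and effort required.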

  5. System management effort (2)
     • Manpower from collaboration?
       • LHC collaborations are very large, but attempts to find suitably qualified effort for system management have failed to meet even today’s needs
       • Most people (physicists, engineers) do not have the right profile
       • Institutes that have people with the proper qualifications are not prepared to locate them at CERN for adequate periods
     • Full operation
       • 24/7 cover at Level-1, normal working hours at Level-2 + service piquet
       • At least 5 people at Level-1 and 5 people at Level-2, reduced by some overlap
       • Shift crew will contribute to Level-1
     • Provisional estimates, to be adapted (2008-9) following experience of running the system and better knowledge of the system reliability

  6. System management effort (3)
     • Total effort in FTEs (Level-1 and Level-2 + Supervisor)

  7. M&O budget category
     • M&O A
       • Request of CERN management and RRB
       • No other identified source

  8. Replacement of equipment (1)
     • Equipment: PCs, network, and storage used for dataflow and online trigger
     • Motivations:
       • Reliability of equipment as it ages
       • Maintainability after a few years (3-year warranty)
       • Suitability of old equipment to follow evolution of operating system and to work with new equipment
     • Need to follow Operating System (OS) evolution:
       • Security patches
       • New PCs (staged installation) not supported by old OS versions
       • Old OS versions not supported
       • Code will continue to be developed with dependencies on the OS and compiler versions
       • Online trigger code is based on, or uses, offline code developed for the current OS version

  9. Replacement of equipment (2)
     • Categories
       • Disk and fileservers: lower reliability and very rapid evolution. 3 years
       • PCs: 4 years
         • Replacement cost will not directly follow Moore’s Law: I/O performance limitations, new multi-core architecture might require major increase in system memory
       • Network
         • Central switch: 5 years (= period of maintenance by manufacturer)
         • Smaller peripheral switches: 4 years (shorter warranty but less critical)
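The replacement periods on slide 9 can be turned into a rough replacement calendar. The following is a minimal sketch: the lifetimes are the ones quoted on the slide, while the installation year, the planning horizon, and the function and key names are assumptions chosen purely for illustration. Staged procurement, which the slides mention, would spread these dates over several years.

```python
# Replacement periods in years, as quoted on slide 9.
LIFETIME_YEARS = {
    "disk_and_fileservers": 3,
    "pcs": 4,
    "central_switch": 5,
    "peripheral_switches": 4,
}


def replacement_years(install_year, lifetime, last_year):
    """Years in which equipment installed in `install_year` falls due
    for replacement, up to and including `last_year`."""
    return list(range(install_year + lifetime, last_year + 1, lifetime))


def annual_replacement_fraction(lifetime):
    """Steady-state share of the installed base replaced each year."""
    return 1.0 / lifetime


# Hypothetical single installation in 2007, planned through 2017.
for category, lifetime in LIFETIME_YEARS.items():
    print(category, replacement_years(2007, lifetime, 2017))
```

In steady state a period of N years means roughly 1/N of that category is renewed each year, for example a quarter of the PCs annually.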

  10. Previous practice
      • LEP and fixed target era:
        • Computers were complete systems qualified by a commercial company
        • Maintenance contracts paid by experiments
        • System managers in experiments (some CERN staff)
        • CERN had operator staff in the computing center and in groups giving support to experiments
      • LHC era:
        • Components tested, qualified and assembled into complete systems by the experiments
        • Overall system much larger and more complex than previously
        • Very few operators at CERN directly employed by CERN
