DRAFT

Computing Fabric (CERN), Status and Plans

Bernd Panzer-Steindel, CERN, IT
View of different Fabric areas

• Installation, Configuration + monitoring, Fault tolerance
• Automation, Operation, Control
• Infrastructure: Electricity, Cooling, Space
• Batch system (LSF)
• Storage system (AFS, CASTOR)
• Network
• Benchmarks, R&D, Architecture
• GRID services !?
• Prototype, Testbeds
• Purchase, Hardware selection, Resource planning

Coupling of components through hardware and software

Bernd Panzer-Steindel, CERN, IT
Current relationship of the Fabric to other projects

• LCG : hardware resources, manpower resources
• CERN IT : main Fabric provider
• Collaboration with India : monitoring, Quality of Service
• openlab : 10 Gbit networking, new CPU technology, possibly new storage technology
• GDB working groups : site coordination, common fabric issues
• SERCO : sysadmin outsourcing
• EDG WP4 : installation, configuration, monitoring, fault tolerance
• GRID technology and deployment : common fabric infrastructure, Fabric GRID interdependencies
• External network

Bernd Panzer-Steindel, CERN, IT
Preparations for the LCG-1 service

Two parallel, coupled approaches :

1. Use the prototype to install pilot LCG-1 production services, with the corresponding tools and configurations of the different middleware packages (EDG, VDT, etc.).
2. 'Attach' the Lxbatch production worker nodes carefully, in a non-intrusive way, to the GRID services. This covers both service nodes and worker nodes; the focus here is on the worker nodes, increasing in size from Pilot 1 (50 nodes, 10 TB) to the service in July (200 nodes, 20 TB).

Bernd Panzer-Steindel, CERN, IT
Fabric Milestones for the LCG-1 service

• Production Pilot 1 starts              15.01.2003
• Production Pilot 2 starts              17.04.2003
• LCG-1 initial service                  01.07.2003
• 7 days acceptance test                 04.08.2003

• Lxbatch job scheduler pilot            03.02.2003
• Lxbatch replica manager pilot          01.09.2003
• Lxbatch merges into LCG-1              17.10.2003
• 30 days acceptance test                28.10.2003

• Fully operational LCG-1 service & distributed production environment   24.11.2003

Bernd Panzer-Steindel, CERN, IT
Integration of the milestones with the GD area

• Pilot-1 service – February 1, 2003
  50 machines (CE), 10 TB (SE). Runs middleware currently on LCG testbeds. Initial testbed at CERN.
• Add 1 remote site by February 28, 2003
• Pilot-2 service – March 15, 2003
  100 machines (CE), 10 TB (SE). CERN service will run full prototype of the WP4 installation and configuration system.
• Add 1 US site to pilot – March 30, 2003
• Add 1 Asian site to pilot – April 15, 2003
• Add 2-3 more EU and US sites – April/May 2003
• Service includes 6-7 sites – June 1, 2003
• LCG-1 initial production system – July 2003
  200 machines (CE), 20 TB (SE). Uses full WP4 system with fully integrated fabric infrastructure. Global service has 6-7 sites on 3 continents.

Fabrics project plan : http://lcg.web.cern.ch/LCG/PEB/Planning/PBS/LCG.mpp

Bernd Panzer-Steindel, CERN, IT
Status and plans, Fabric area : Infrastructure

• Vault conversion complete; migration of equipment from the centre has started
• Plans for the upgrade to 2.5 MW cooling and electricity supply are progressing well

Worries :
• Financing of this exercise
• CPU power consumption development; performance per Watt is improving very little

http://lcg.web.cern.ch/LCG/C-RRB/2002-05/RRB2_Report1610.doc
https://web11.cern.ch/it-support-mrp/B513 Upgrade/
http://ref.cern.ch/CERN/IT/C5/2002/038/topic.html

Bernd Panzer-Steindel, CERN, IT
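The power-per-Watt worry can be made concrete with a rough capacity estimate. The sketch below is purely illustrative: the per-node power draws are hypothetical assumptions, not figures from this report; only the 2.5 MW upgrade target comes from the slide.

```python
# Illustration only: how many nodes a 2.5 MW envelope accommodates for a
# range of HYPOTHETICAL per-node power draws (the wattages are assumptions,
# not measured CERN figures).
UPGRADE_KW = 2500  # planned cooling/electricity capacity from the slide

for watts_per_node in (150, 250, 400):  # hypothetical dual-CPU node draws
    nodes = UPGRADE_KW * 1000 // watts_per_node
    print(f"{watts_per_node:3d} W/node -> room for ~{nodes:,} nodes")
```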
Status and plans, Fabric area : Operation, Control

• EDG WP4, 6 FTE. The time schedule for delivery of installation, configuration, fault tolerance and monitoring is aligned with the milestones of the LCG-1 service.
• Successful introduction of a new Linux certification team (all experiments + IT); RH 7.3.1 just released. Important also for site coordination (GDB WG4). The Linux team grows next year from 3 to 4 (later 5) FTE.
• The outsourcing contract (SERCO) for system administration ends in December 2003 and will be replaced by insourcing: ~10 technical engineers over the next years.

Bernd Panzer-Steindel, CERN, IT
Status and plans, Fabric area : Networking

Network in the computer centre : 3COM and Enterasys equipment, 14 routers, 147 switches (Fast Ethernet and Gigabit)

• Stability : 29 interventions in 6 months (resets, hardware failures, software bugs, etc.)
• Traffic : constant aggregate load of ~400 MB/s, no overload (~10 % load)

10 Gbit equipment :
• tests until mid 2003
• integration into the prototype mid 2003
• partial integration into the backbone mid 2004
• full 10 Gbit backbone mid 2005

Refer to C5

Bernd Panzer-Steindel, CERN, IT
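A quick back-of-envelope check of the figures above, using only the numbers quoted on this slide (a sketch, not an official capacity statement):

```python
# Derived figures from the quoted network numbers (rough sketch).
routers, switches = 14, 147
interventions, months = 29, 6
devices = routers + switches

print(f"interventions per month:             {interventions / months:.1f}")                # ~4.8
print(f"device-months between interventions: {devices * months / interventions:.0f}")      # ~33

aggregate_mb_s = 400   # constant aggregate traffic from the slide
load_fraction = 0.10   # quoted ~10% load
print(f"implied usable capacity: ~{aggregate_mb_s / load_fraction / 1000:.0f} GB/s aggregate")
```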
Status and plans, Fabric area : Batch system

Node stability : 7 reboots per day + 0.7 hardware interventions per day (mostly IBM disk problems), with ~700 nodes running batch jobs at ~65% CPU utilization over the last 6 months

• Successful introduction of share queues in LSF, optimizing general throughput
• Continuous work on Quality of Service (user interference, problem disentanglement)

Statistics and monitoring : http://it-div-fio-is.web.cern.ch/it-div-fio-is/Reports/Weekly_lsf_stats(all%20groups).xls

• General survey of batch systems during 2004
• Based on the recommendations of the survey, a possible installation of a new batch system is scheduled for 2005

Bernd Panzer-Steindel, CERN, IT
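The aggregate stability numbers translate into per-node figures as follows (a rough sketch, assuming the reboot and intervention rates are spread uniformly over the ~700 nodes):

```python
# Per-node view of the quoted aggregate batch-farm figures (sketch).
nodes = 700
reboots_per_day = 7
hw_interventions_per_day = 0.7
cpu_utilization = 0.65

print(f"mean days between reboots per node:  {nodes / reboots_per_day:.0f}")           # ~100
print(f"mean days between HW fixes per node: {nodes / hw_interventions_per_day:.0f}")  # ~1000
print(f"delivered CPU (node-equivalents):    {nodes * cpu_utilization:.0f}")           # ~455
```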
Status and plans, Fabric area : Storage (I)

CASTOR HSM system : 8 million files, 1.8 PB of data today

Hardware stability :
• ~ one intervention per week on one tape drive (STK 9940A)
• ~ one tape with recoverable problems per 2 weeks (to be sent to STK HQ)
• ~ one disk server reboot per week (out of ~200 disk servers in production)
• ~ one disk error per week (out of ~3000 disks in production)

20 new tape drives (9940B) have arrived and are in heavy use right now (IT Computing DCs and ALICE DC)

The new disk server generation doubles the performance and solves the tape server / disk server 'impedance matching' problem (disk I/O should be much faster than tape I/O)

Bernd Panzer-Steindel, CERN, IT
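A few derived numbers from the CASTOR figures above (a sketch; decimal units assumed, 1 PB = 10^15 bytes, and the error rates assumed uniform across the hardware):

```python
# Derived CASTOR figures (sketch, decimal units).
files = 8e6
data_bytes = 1.8e15
disks, disk_errors_per_week = 3000, 1
servers, server_reboots_per_week = 200, 1

print(f"average file size:                {data_bytes / files / 1e6:.0f} MB")        # ~225
print(f"disk errors per disk per year:    {52 * disk_errors_per_week / disks:.3f}")  # ~0.017
print(f"weeks between reboots per server: {servers / server_reboots_per_week:.0f}")  # ~200
```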
Status and plans, Fabric area : Storage (II)

• Focus is on consolidation : stager rewrite, improved error recovery and redundancy, stability. The IT and ALICE DCs are very useful here.
• Details of the storage access methods need to be defined and implemented by March 2003 (application I/O, transport mechanism, CASTOR interfaces, replica management middleware, etc.)
• A survey of common storage solutions will start in July 2003; recommendations will be reported in July 2004
• Tests and prototype installations are planned from July 2004 to June 2005
• Deployment of the storage solution for LHC will start in July 2005

Bernd Panzer-Steindel, CERN, IT
Status and plans, Fabric area : Resources

Common planning for the 2003 resources (CPU, disk) has been established, combining the PEB (Physics Data Challenges), the LCG Prototype (Computing Data Challenges) and general resources (COCOTIME).

• Very flexible policy to 'move' resources between the different areas, to achieve the highest possible resource optimization
• IT physics base budget for CPU and disk resources : 1.75 million SFr in 2003
• Advancement of 2004 purchases for the prototype is needed
• Non-trivial exercise, with continuous adaptation necessary; CERN purchasing procedures don't make it easier

http://doc.cern.ch/AGE/current/askArchive.php?a02155/a02155s1t3/transparencies/slides.ppt

Bernd Panzer-Steindel, CERN, IT
Dual P4 node == 1300 SI2000 == 3000 SFr == 2.3 SFr/SI2000 Bernd Panzer-Steindel, CERN, IT
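The unit-cost figure above comes from simple division; combined with the 1.75 MSFr 2003 base budget from the previous slide it also gives an upper bound on purchasable capacity (an upper bound only, since that budget must cover disk as well; a sketch, not a procurement plan):

```python
# Price/performance arithmetic for the dual P4 node figure (sketch).
node_price_sfr = 3000
node_si2000 = 1300
sfr_per_si2000 = node_price_sfr / node_si2000
print(f"{sfr_per_si2000:.1f} SFr/SI2000")    # ~2.3

budget_sfr = 1.75e6                          # 2003 IT physics base budget (CPU + disk)
print(f"upper bound if spent on CPU only: {budget_sfr / sfr_per_si2000:,.0f} SI2000 "
      f"(~{budget_sfr / node_price_sfr:,.0f} nodes)")
```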
Status and plans, Fabric area : Architecture (I)

Architecture validation flow (diagram) :
• Inputs : computing model of the experiments, Data Challenges, experiment-specific and IT base figures
• Benchmark and analysis framework
• Components : LINUX, CASTOR, AFS, LSF, EIDE disk servers, Ethernet, etc.
• Benchmark and performance cluster (current architecture and hardware)
• Architecture validation criteria : reliability, performance, functionality
• R&D activities (background) : iSCSI, SAN, Infiniband, cluster technologies
• PASTA investigation

Bernd Panzer-Steindel, CERN, IT
Status and plans, Fabric area : Architecture (II)

• Regular checkpoints for the architecture verification
  • Computing data challenges (IT, ALICE mass storage); physics data challenges (no real I/O stress yet -- analysis)
• Collecting the stability and performance measurements of the commodity hardware in the different fabric areas
  • Verifying interdependencies and limits
  • Definition of Quality of Service
• Regular (mid 2003, 2004, 2005) reports on the status of the architecture
• TDR report finished by mid 2005

Bernd Panzer-Steindel, CERN, IT
LCG personnel in the Fabrics area

3 staff, 2 fellows, 5 cooperants/students, 1 external (PPARC, France, Spain, Israel)

• Sysadmin support in the testbeds
• Security, CA
• Installation and monitoring
• Remote reset and console
• Fault tolerance
• HSM interface to GridFTP
• Benchmarks and cluster R&D

Bernd Panzer-Steindel, CERN, IT
Conclusions

• Architecture verification okay so far
• Stability and performance of commodity equipment is satisfactory (good)
• Major 'stress' (I/O) on the systems comes from the Computing DCs and the currently running experiments, not from the LHC physics productions

Worries :
• Computer centre infrastructure (finance and power)
• Analysis model and facility
• Quality of Service measurements
• Constraints imposed by the middleware

Remark : Things are driven by the market, not by pure technology -- possible paradigm changes

Bernd Panzer-Steindel, CERN, IT