
Tier-1 Status


Presentation Transcript


  1. Tier-1 Status Andrew Sansum GRIDPP20 12 March 2008

  2. Tier-1 Capacity delivered to WLCG (2007) [chart with RAL's contribution highlighted]

  3. Tier-1 CPU Share by 2007 MoU

  4. Wall Time

  5. CPU Use by VO (2007) [chart: ATLAS, ALICE, CMS, LHCb]

  6. Experiment Shares (2008)

  7. Grid Only
  • Non-Grid access to the Tier-1 has now ended. Only special cases (contact us if you believe you are one) now have access to:
    • UIs
    • Job submission
  • Until end of May 2008:
    • IDs will be maintained (disabled)
    • Home directories will be maintained online
    • Mail forwarding will be maintained
  • After end of May 2008:
    • IDs will be deleted
    • Home filesystem will be backed up
    • Mail spool will be backed up
    • Mail forwarding will stop
  • AFS service continues for BaBar (and just in case)

  8. Reliability
  • February outage mainly due to the power failure plus 8 hours of network downtime
  • January/December outages mainly CASTOR problems over the Christmas period (despite multiple callouts)
  • Out-of-hours on-call will help, but some problems take time to diagnose/fix

  9. Power Failure: Thursday 7th February 13:00
  • Work on the power supply had been under way since December
  • Down to 1 transformer (from 2) for extended periods (weeks) – increased risk of disaster
  • Single transformer running at maximum operating load
  • No problems until the work finished and the casing was closed – a control line was crushed and the power supply tripped
  • Total loss of power to the whole building; first power interruption for over 3 years
  • Restart (effort > 200 FTE hours):
    • Most Global/National/Tier-1 core systems up by Thursday evening
    • Most of CASTOR and part of batch up by Friday
    • Remaining batch on Saturday
    • Still problems to iron out in CASTOR on Monday/Tuesday
  • Lessons:
    • Communication was prompt and sufficient but ad hoc
    • Broadcast unavailable as RAL runs GOCDB (now fixed by caching)
    • Careful restart of disk servers was slow and labour intensive (but worked) – will not scale
  See: http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/

  10. Hardware: Disk
  • Production capacity: 138 servers, 2800 drives, 850TB (usable)
  • 1.6PB capacity delivered in January by Viglen:
    • 91 Supermicro 3U servers with dual AMD 2220E (2.8GHz) dual-core CPUs, 8GB RAM, IPMI
      • 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 250GB WD HDD
      • 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750GB WD HDD
    • 91 Supermicro 3U servers with dual Intel E5310 (1.6GHz) quad-core CPUs, 8GB RAM, IPMI
      • 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 400GB Seagate HDD
      • 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750GB Seagate HDD
  • Acceptance test running – scheduled to be available end of March
  • 5400 spinning drives after the planned phase-out in April (expect a drive failure roughly every 3 days – rough arithmetic sketched below)
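
  For context on that failure estimate, a minimal arithmetic sketch; the annualised failure rate used here is an assumed typical value for SATA drives of that era, not a figure from the talk:

```python
# Rough check of the "drive failure every ~3 days" expectation.
# AFR is an assumed ~2.5% annualised failure rate, not a number from the slides.
drives = 5400                    # spinning drives after the April phase-out
afr = 0.025                      # assumed annualised failure rate

failures_per_year = drives * afr
print(f"~{failures_per_year:.0f} failures/year, "
      f"one every ~{365 / failures_per_year:.1f} days")
# -> ~135 failures/year, one every ~2.7 days
```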

  11. Hardware: CPU
  • Production: about 1500 KSI2K on 600 systems
  • Recently upgraded about 50% of capacity to 2GB/core
  • Recent procurement (approximately 3000 KSI2K – but YMMV) delivered and under test (rough per-core figures sketched below):
    • Streamline: 57 x 1U servers (114 systems, 3 racks), each system:
      • dual Intel E5410 (2.33GHz) quad-core CPUs
      • 2GB/core, 1 x 500GB HDD
    • Clustervision: 56 x 1U servers (112 systems, 4 racks), each system:
      • dual Intel E5440 (2.83GHz) quad-core CPUs
      • 2GB/core, 1 x 500GB HDD
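
  Purely as an illustration, splitting the quoted ~3000 KSI2K evenly across the new systems gives a rough per-system and per-core figure (the even split is my assumption; the slide only quotes the aggregate):

```python
# Back-of-the-envelope split of the ~3000 KSI2K procurement.
# Assumes an even spread over all systems; only the aggregate is quoted on the slide.
total_ksi2k = 3000
systems = 114 + 112              # Streamline + Clustervision
cores = systems * 2 * 4          # dual quad-core per system

print(f"{systems} systems, {cores} cores")
print(f"~{total_ksi2k / systems:.1f} KSI2K/system, ~{total_ksi2k / cores:.2f} KSI2K/core")
# -> 226 systems, 1808 cores; ~13.3 KSI2K/system, ~1.66 KSI2K/core
```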

  12. Hardware: Tape
  • Tape drives:
    • 8 x 9940B drives – used on the legacy ADS/dCache service, phase out soon
    • 18 T10K tape drives and associated servers delivered; 15 in production, remainder soon
    • Planned bandwidth: 50MB/s per drive
    • Actual bandwidth: 8–80MB/s – a work in progress (see the aggregate estimate below)
  • Media:
    • Approximately 2PB on site
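
  For scale, the aggregate bandwidth implied by those per-drive figures, assuming (my simplification) that all 15 production T10K drives stream at once:

```python
# Aggregate tape bandwidth implied by the per-drive numbers on the slide.
# Assumes all in-production drives stream concurrently -- an upper-bound illustration.
drives = 15
planned_mb_s = 50
observed_mb_s = (8, 80)          # per-drive range quoted on the slide

print(f"planned aggregate: {drives * planned_mb_s} MB/s")
print(f"observed aggregate range: {drives * observed_mb_s[0]}-{drives * observed_mb_s[1]} MB/s")
# -> planned 750 MB/s; observed anywhere between 120 and 1200 MB/s
```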

  13. Hardware: Network
  [Diagram: RAL site and Tier-1 network. Tier-1 CPU+disk groups and the ADS caches sit behind stacks of Nortel switches (2–6 x 5510 plus a 5530 each, and a stack of 4 x 5530) feeding a Force10 C300 8-slot router (64 x 10Gb). 10Gb/s links run to the OPN router (10Gb/s to CERN) and, via Router A and the firewall with a 10Gb/s bypass, to the site access router (10Gb/s to SJ5). The RAL Tier-2 connects at N x 1Gb/s, a 1Gb/s test link goes to Lancaster, and the Oracle systems hang off the core.]

  14. RAL Links
  [Diagram: map of RAL network links, with a legend marking each link as "implemented", "implement soon" or "never".]

  15. Backplane Failures (Supermicro)
  • 3 servers suffered "burnt out" backplanes
    • 2 of which set off VESDA
    • 1 called out the fire brigade
  • Safety risk assessment: urgent rectification needed
  • Good response from supplier/manufacturer
  • PCB fault in a "bad batch"
  • Replacement nearly complete

  16. Machine Rooms
  • Existing machine room:
    • Approximately 100 racks of equipment
    • Getting close to power/cooling capacity
  • New machine room:
    • Work still proceeding close to schedule
    • 800 m² can accommodate 300 racks + 5 robots
    • 2.3MW power/cooling capacity (some UPS) – see the per-rack estimate below
    • Scheduled to be available for September 2008
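
  A rough per-rack power budget implied by those headline figures, assuming the full 2.3MW is shared evenly across 300 racks and ignoring the robots and cooling overhead (both assumptions are mine):

```python
# Average power budget per rack in the new machine room (illustrative only).
total_kw = 2300                  # 2.3MW power/cooling capacity
racks = 300

print(f"~{total_kw / racks:.1f} kW per rack on average")
# -> ~7.7 kW/rack
```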

  17. CASTOR Memory Lane
  [Timeline diagram, 4Q05–1Q08 ("happy days!"), charting CASTOR milestones at RAL, in approximate order: CASTOR1 tests OK; CASTOR2 core running but hard to install, with awkward dependencies; CMS on CASTOR for CSA06 – encouraging; production service declared; 2.1.2 bad – problems with functionality and performance ("it doesn't work!"); ATLAS on CASTOR; 2.1.3 good but missing functionality; CSA07 encouraging; OC committees note improvement but remain concerned; service stopped for extended upgrade; 2.1.4 upgrade goes well – disk1 support; LHCb on CASTOR; CSA08 reasonably successful.]

  18. Growth in Use of CASTOR

  19. CASTOR Test Architecture
  [Diagram: three CASTOR test instances – Preproduction, Development and Certification Testbed – each backed by a single (variable) disk server, with name server + vmgr and shared services, tape servers, Oracle databases for the name server/vmgr, stager, DLF and repack, and the corresponding stager, DLF, repack and LSF daemons.]

  20. CASTOR Production Architecture
  [Diagram: production CASTOR deployment. Two name servers plus vmgr sit on a shared Oracle NS+vmgr database with shared services. Separate stager instances for CMS, ATLAS and LHCb, plus a "repack and small user" instance, each have their own Oracle stager and DLF databases, stager/DLF/LSF daemons (and repack on the small-user instance) and their own disk servers (a single disk server for the repack/small-user instance). A pool of tape servers serves the whole system.]

  21. ATLAS Data Flow Model
  [Diagram: ATLAS data flow between the T0, this T1, partner T1s, T2s and the farm, mapped onto the D0T1, D1T0, D1T1 and D0T0 service classes. Flows shown include RAW from the T0 (T0Raw) and simulated RAW (simRaw), ESD/ESD1/ESD2, AODm/AODm1/AODm2 and TAG data, plus the StripInput/simStrip streams.]

  22. CMS Dataflow
  [Diagram: CMS pool layout – all pools are disk0tape1. FarmRead (50 LSF slots per server) serves the batch farm, with tape recall and disk-to-disk copies; WanIn (8 LSF slots per server) handles transfers from T0, T1 & T2 with disk-to-disk copies; WanOut (16 LSF slots per server) feeds T1 & T2.]

  23. CMS Disk Server Tuning: CSA06/CSA07
  • Problem: network performance too low
    • Increase default/maximum TCP window size
    • Increase TCP ring buffers and tx queue
    • Ext3 journal changed to data=writeback
  • Problem: performance still too low
    • Reduce number of gridftp slots per server
    • Reduce number of streams per file
  • Problem: PhEDEx transfers now time out
    • Reduce FTS slots to match disk pools
  • Problem: servers sticky or crash with OOM
    • Limit total TCP buffer space
    • Protect low memory
    • Aggressive cache flushing
  • See: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Disk_Server_Tuning
  (A sketch of how this kind of kernel tuning can be applied follows below.)
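
  The slide lists the knobs but not the values. As an illustration only, a minimal sketch of applying this style of Linux TCP/VM tuning by writing under /proc/sys; the specific values are assumed, order-of-magnitude examples, not the settings RAL actually deployed (the ext3 data=writeback change is a mount option and is not covered here):

```python
#!/usr/bin/env python
"""Illustrative sketch of CSA06/CSA07-style disk-server network tuning.

The parameter names are standard Linux sysctls; the values are assumed
examples rather than the settings RAL used. Requires root to apply.
"""

# sysctl name -> value (illustrative only)
TUNING = {
    "net/core/rmem_max": "8388608",               # raise maximum receive window
    "net/core/wmem_max": "8388608",               # raise maximum send window
    "net/ipv4/tcp_rmem": "4096 262144 8388608",   # min/default/max receive buffers
    "net/ipv4/tcp_wmem": "4096 262144 8388608",   # min/default/max send buffers
    "net/ipv4/tcp_mem":  "196608 262144 393216",  # cap total TCP buffer pages (OOM protection)
    "vm/lowmem_reserve_ratio": "256 256 32",      # protect low memory
    "vm/dirty_ratio": "10",                       # flush dirty pages more aggressively
    "vm/dirty_background_ratio": "3",
}

def apply_tuning(settings, dry_run=True):
    """Write each value to its /proc/sys file (or just print it in dry-run mode)."""
    for name, value in settings.items():
        path = "/proc/sys/" + name
        if dry_run:
            print(f"would write '{value}' to {path}")
        else:
            with open(path, "w") as f:
                f.write(value)

if __name__ == "__main__":
    apply_tuning(TUNING, dry_run=True)   # set dry_run=False to actually apply
```

  In practice settings like these would normally live in /etc/sysctl.conf so they survive a reboot; the dry-run default keeps the sketch safe to execute.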

  24. 3Ware Write Throughput

  25. CCRC08 Disk Server Tuning
  • Migration rate to tape very bad (5–10MB/s) when concurrent with writing data to disk
    • Was OK in CSA06 (50MB/s per server) – Areca servers
  • 3Ware 9550 performance terrible under concurrent read/write (2MB/s read, 120MB/s write)
    • 3Ware appears to prioritise writes
  • Tried many tweaks, most with little success, except:
    • Either: changing the I/O elevator to anticipatory
      • Downside – write throughput reduced
      • Good under benchmarking – testing in production this week
    • Or: increasing the block device read-ahead
      • Read throughput high but erratic under test
      • But seems OK in production (30MB/s per server)
  See: http://www.gridpp.rl.ac.uk/blog/2008/02/29/3ware-raid-controllers-and-tape-migration-rates/
  (Both workarounds are per-block-device settings – see the sketch below.)
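
  Both workarounds can be applied through the per-device files under /sys/block. A minimal, hedged sketch follows; the device name and the read-ahead size are illustrative assumptions, and the anticipatory elevator only exists on 2.6-era kernels such as those in use at the time:

```python
#!/usr/bin/env python
"""Sketch of the two CCRC08 workarounds, applied to one block device.

Device name and read-ahead value are illustrative assumptions; the
"anticipatory" elevator was later removed from Linux. Requires root.
"""

DEVICE = "sdb"   # hypothetical data array exported by the 3ware controller

def set_elevator(device, scheduler="anticipatory"):
    # Option 1: switch the I/O scheduler so reads are not starved by writes
    with open(f"/sys/block/{device}/queue/scheduler", "w") as f:
        f.write(scheduler)

def set_readahead(device, read_ahead_kb=4096):
    # Option 2: raise the block-device read-ahead (kilobytes; 4096 KB is an assumed example)
    with open(f"/sys/block/{device}/queue/read_ahead_kb", "w") as f:
        f.write(str(read_ahead_kb))

if __name__ == "__main__":
    # Mirror the slide's "either/or": apply one workaround, not necessarily both.
    set_elevator(DEVICE)
    # set_readahead(DEVICE)
```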

  26. CCRC (CMS – WanIn)
  [Plot: WanIn pool during CCRC08 – network in/out, PhEDEx rate, migration queue, Tier-0 rate and CPU load, with traffic around the 300MB/s mark.]

  27. CCRC (WanOut)
  [Plot: WanOut pool during CCRC08 – network in/out, PhEDEx rate and CPU load, shown before and after replication, with rates around 300MB/s.]

  28. CASTOR Plans for May CCRC08
  • Still problems:
    • Optimising end-to-end transfer performance remains a balancing act
    • Hard to manage the complex configuration
  • Working on:
    • ALICE/xrootd deployment
    • Preparation for the 2.1.6 upgrade
    • Installation of Oracle RACs (resilient Oracle services for CASTOR)
    • Provisioning and configuration management

  29. dCache Closure
  • Agreed with the UB that we would give 6 months' notice before terminating the dCache service
  • dCache closure announced to the UB for May 2008
  • ATLAS and LHCb working to migrate their data – migration slower than hoped
  • Service much reduced in size now (10–12 servers remain) and operational overhead much lower
  • Migration of the remaining non-LHC experiments delayed by the low priority of non-CCRC work
  • Work on the Gen instance of CASTOR will recommence shortly
  • Pragmatically, closure may be delayed by several months until Minos and the tiny VOs have migrated

  30. Termination of GridPP Use of the ADS Service
  • GridPP funding and use of the old legacy Atlas Datastore service is scheduled to end at the end of March 2008
    • No GridPP access via the "tape" command after this
    • Also no access via the C-callable VTP interface
  • RAL will continue to operate the ADS service and experiments are free to purchase capacity directly from the Datastore team
  • Pragmatically, closure cannot happen until:
    • dCache ends (it uses the ADS back end)
    • CASTOR is available for small VOs
  • Probably 6 months away

  31. Conclusions
  • Hardware for the 2008 MoU is in the machine room and moving satisfactorily through acceptance
    • Volume not yet a problem, but warning signs are starting to appear
  • CASTOR situation continues to improve:
    • Reliable during CCRC08
    • Hardware performance improving – the tape migration problem is reasonably understood and partly solved, with scope for further improvement
    • Progressing various upgrades
  • Remaining Tier-1 infrastructure essentially problem free
  • Availability fair but stagnating; need to progress:
    • Incident response staff
    • On-call
    • Disaster planning and National/Global/Cluster resilience
  • Concerned that we have still not seen all experiment use cases
