Overview (figure slides not reproduced here): UK Tier 2 reported CPU – Historical View to Q1 09; UK Tier 2 reported CPU – Q1 2009; SouthGrid Sites Accounting as reported by APEL; Job distribution; Site Upgrades since GridPP21.
SouthGrid Status Pete Gronbech: 2nd April 2009 GridPP22 UCL
Site Upgrades since GridPP21
• RALPPD: increase of 640 cores (1568 kSI2K) + 380 TB
• Cambridge: 32 cores (83 kSI2K) + 20 TB
• Birmingham: 64 cores on the PP cluster and 128 cores on the HPC cluster, adding ~430 kSI2K
• Bristol: original cluster replaced by new quad-core systems (16 cores), plus an increased share of the HPC cluster; 53 kSI2K + 44 TB
• Oxford: extra 208 cores, 540 kSI2K + 60 TB
• JET: extra 120 cores, 240 kSI2K
New Totals Q1 09: SouthGrid GridPP

Site         CPU (kSI2K)   Storage (TB)
EDFA-JET             483            1.5
Birmingham           700           90
Bristol              120           55
Cambridge            455           60
Oxford               972          160
RALPPD              2815          633
Totals              5545          999.5
Network rate capping
• Oxford recently had its network link rate-capped to 100 Mb/s.
• This was a result of sustained 300-350 Mb/s traffic caused by CMS commissioning testing.
• As it happens, that test completed at the same time as we were capped, so we passed it, and normal use is not expected to be this high.
• Oxford's JANET link is actually 2 x 1 Gb/s links, which had become saturated (a simple utilisation check is sketched below).
• The short-term solution is to rate-cap only JANET traffic, to 200 Mb/s; all other on-site traffic remains at 1 Gb/s.
• The long-term plan is to upgrade the JANET link to 10 Gb/s within the year.
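As an illustration of how a site might spot this kind of saturation itself, here is a minimal sketch that estimates outbound utilisation from the Linux interface counters; the interface name, sample interval and warning threshold are assumptions, not Oxford's actual monitoring.

#!/usr/bin/env python
"""Estimate outbound link utilisation from /proc/net/dev counters.

A minimal sketch: interface name and threshold are illustrative only.
"""
import time

IFACE = "eth0"        # hypothetical uplink interface
WARN_MBPS = 300.0     # warn when sustained traffic approaches the cap

def tx_bytes(iface):
    """Return the transmitted-bytes counter for one interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[8])   # 9th field is TX bytes
    raise ValueError("interface %s not found" % iface)

def utilisation_mbps(iface, interval=10):
    """Sample the counter twice and return the average rate in Mb/s."""
    start = tx_bytes(iface)
    time.sleep(interval)
    end = tx_bytes(iface)
    return (end - start) * 8.0 / interval / 1e6

if __name__ == "__main__":
    rate = utilisation_mbps(IFACE)
    status = "WARN" if rate > WARN_MBPS else "OK"
    print("%s: %s TX %.1f Mb/s" % (status, IFACE, rate))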
SPEC benchmarking
• Purchased the SPEC CPU2006 benchmark suite.
• Ran it using the HEPiX scripts, i.e. in the HEP-SPEC06 configuration.
• Using the HEP-SPEC06 benchmark should provide a level playing field (a capacity roll-up sketch follows below).
• In the past sites could choose any one of the many published values on the SPEC benchmark site.
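To make the "level playing field" point concrete, a small sketch of how per-node HEP-SPEC06 scores roll up into a site capacity figure; the node counts, per-node scores and the ~4 HS06 per kSI2K conversion factor are illustrative assumptions, not measured values.

#!/usr/bin/env python
"""Aggregate per-node HEP-SPEC06 scores into a site capacity figure.

A sketch only: batches, scores and the HS06-to-kSI2K factor are assumptions.
"""

# (number of nodes, HEP-SPEC06 score per node) -- hypothetical batches
BATCHES = [
    (20, 36.0),   # e.g. an older purchase of dual quad-core boxes
    (26, 69.0),   # e.g. a newer purchase
]

HS06_PER_KSI2K = 4.0   # assumed conversion factor

def site_totals(batches):
    """Return (total HS06, equivalent kSI2K) for a list of node batches."""
    total_hs06 = sum(count * score for count, score in batches)
    return total_hs06, total_hs06 / HS06_PER_KSI2K

if __name__ == "__main__":
    hs06, ksi2k = site_totals(BATCHES)
    print("Total: %.0f HEP-SPEC06 (~%.0f kSI2K equivalent)" % (hs06, ksi2k))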
Staff Changes
• Jon Waklin and Yves Coppens left in February 2009.
• Kashif Mohammad started in January 2009 as the deputy coordinator for SouthGrid.
• Chris Curtis will replace Yves, starting in May; he is currently doing his PhD on the ATLAS project.
• The Bristol post will be advertised; it is jointly funded by IS and GridPP.
Resilience
• What do we mean by resilience?
• The ability to maintain high availability and reliability of our grid service.
• Guard against failures:
  • Hardware
  • Software
Hardware Failures
• The hardware
  • Critical servers: good quality equipment
    • Dual PSUs
    • Dual mirrored system disks, and RAID for storage arrays
  • All systems have 3-year maintenance with an on-site spares pool (disks, PSUs, IPMI cards).
  • Similar kit is bought for servers, so hardware can be swapped between them.
  • IPMI cards allow remote operation and control.
• The environment
  • UPS for critical servers
  • Network-connected PDUs for monitoring and power switching
  • Professional computer room / rooms
  • Air conditioning: need to monitor the temperature
• Actions based on the above environmental monitoring (see the temperature-trip sketch after this list)
  • Configure your UPS to shut systems down in the event of sustained power loss.
  • Shut the cluster down in the event of high temperature.
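A minimal sketch of the temperature-trip idea above; the sensor file, trip point, sampling interval and shutdown action are hypothetical placeholders, not our production script.

#!/usr/bin/env python
"""Shut worker nodes down if the machine-room temperature stays too high.

A sketch: read_room_temp and drain_and_poweroff are placeholders.
"""
import subprocess
import time

TEMP_LIMIT_C = 28.0   # assumed trip point
SAMPLES = 3           # require several consecutive high readings
INTERVAL_S = 60

def read_room_temp():
    """Return the room temperature in Celsius (placeholder implementation)."""
    with open("/var/run/room_temp") as f:   # hypothetical sensor file
        return float(f.read().strip())

def drain_and_poweroff():
    """Stop accepting jobs and power the worker nodes off (placeholder)."""
    subprocess.call(["logger", "Temperature trip: shutting down worker nodes"])
    # a real version would close the batch queues and IPMI-poweroff the WNs here

if __name__ == "__main__":
    high = 0
    while True:
        if read_room_temp() > TEMP_LIMIT_C:
            high += 1
        else:
            high = 0
        if high >= SAMPLES:
            drain_and_poweroff()
            break
        time.sleep(INTERVAL_S)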
Hardware (continued)
• Having guarded against hardware failure, if it does happen we need to ensure rapid replacement.
• Restore from backups or reinstall.
• Automated installation system: PXE, kickstart, cfengine (a sketch follows below).
• Good documentation.
• Duplication of critical servers:
  • Multiple CEs
  • Virtualisation of some services allows migration to alternative VM servers (MON, BDII and CEs).
• Less reliance on external services: could set up a local WMS and top-level BDII.
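As a sketch of what "automated installation" can look like in practice (not our actual install server), the script below generates per-host PXELINUX boot entries that point each machine at a kickstart profile; the host list, MAC addresses, URLs and directory layout are invented for illustration.

#!/usr/bin/env python
"""Generate per-host PXELINUX entries pointing at a kickstart profile.

A sketch of the idea only: hosts, paths and kernel arguments are assumptions.
"""
import os

TFTP_DIR = "pxelinux.cfg"                                 # hypothetical output dir
KICKSTART_URL = "http://install.example.org/ks/%s.cfg"    # hypothetical URL

# hostname -> (MAC address, kickstart profile) -- example data only
HOSTS = {
    "wn001": ("00:16:3e:aa:bb:01", "worker-node"),
    "se01":  ("00:16:3e:aa:bb:02", "storage-node"),
}

ENTRY = """default install
label install
  kernel vmlinuz
  append initrd=initrd.img ks=%s
"""

def pxe_filename(mac):
    """PXELINUX looks for a file named 01-<mac-with-dashes>."""
    return "01-" + mac.lower().replace(":", "-")

def write_configs(hosts, outdir=TFTP_DIR):
    """Write one boot-config file per host into the TFTP config directory."""
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    for host, (mac, profile) in sorted(hosts.items()):
        path = os.path.join(outdir, pxe_filename(mac))
        with open(path, "w") as f:
            f.write(ENTRY % (KICKSTART_URL % profile))
        print("wrote %s for %s" % (path, host))

if __name__ == "__main__":
    write_configs(HOSTS)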
Software Failures
• The main cause of loss of availability is software failure:
  • Misconfiguration
  • Fragility of the gLite middleware
  • OS problems: disks filling up, service failures (e.g. NTP)
• Good communications can help solve problems quickly: mailing lists, wikis, blogs, meetings.
• Good monitoring and alerting (Nagios etc.); an example check follows this list.
• Learn from mistakes: update systems and procedures to prevent reoccurrence.
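As an example of the kind of alerting meant here, a minimal Nagios-style plugin that flags a filesystem filling up; the mount point and thresholds are assumptions, while the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL) is the standard Nagios plugin one.

#!/usr/bin/env python
"""Nagios-style check for a filesystem filling up.

A sketch: mount point and thresholds are assumptions.
"""
import os
import sys

MOUNT = "/var"        # hypothetical partition to watch
WARN, CRIT = 80, 90   # percent-used thresholds

def percent_used(path):
    """Return the percentage of the filesystem at `path` that is in use."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return 100.0 * (total - free) / total

if __name__ == "__main__":
    used = percent_used(MOUNT)
    if used >= CRIT:
        print("CRITICAL: %s %.1f%% full" % (MOUNT, used))
        sys.exit(2)
    elif used >= WARN:
        print("WARNING: %s %.1f%% full" % (MOUNT, used))
        sys.exit(1)
    print("OK: %s %.1f%% full" % (MOUNT, used))
    sys.exit(0)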
Recent example
• Many SAM failures, with occasional passes.
• All test jobs pass; almost all ATLAS jobs pass.
• Error logs revealed messages about the proxy not being valid yet!
• NTP on the SE head node had stopped, AND cfengine had been switched off on that node (so there was no automatic check and restart). A clock-offset check is sketched below.
• A SAM test always gets a new proxy; if it got through the WMS and onto our cluster, into a reserved express-queue slot, within about 4 minutes, the lagging clock made the proxy appear not yet valid and the test failed.
• In this case the SAM tests were not accurately reflecting the usability of our cluster, BUT they were showing a real problem.
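A sketch of a check that would have caught this earlier: compare the local clock against an NTP reference and alert on drift before it is large enough to make fresh proxies look "not valid yet". The reference server name and the allowed offset are assumptions.

#!/usr/bin/env python
"""Warn if the local clock has drifted from an NTP reference.

A sketch: server and tolerance are assumptions.
"""
import socket
import struct
import time

NTP_SERVER = "pool.ntp.org"   # hypothetical reference server
MAX_OFFSET_S = 5.0            # assumed tolerance before we alert
NTP_EPOCH_DELTA = 2208988800  # seconds between 1900 (NTP) and 1970 (Unix)

def ntp_time(server):
    """Return the server's clock as Unix seconds via a minimal SNTP query."""
    packet = b"\x1b" + 47 * b"\0"    # LI=0, VN=3, Mode=3 (client request)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5)
    try:
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(48)
    finally:
        sock.close()
    transmit_secs = struct.unpack("!I", data[40:44])[0]   # transmit timestamp
    return transmit_secs - NTP_EPOCH_DELTA

if __name__ == "__main__":
    offset = time.time() - ntp_time(NTP_SERVER)
    status = "OK" if abs(offset) <= MAX_OFFSET_S else "WARNING"
    print("%s: local clock offset %.1f s from %s" % (status, offset, NTP_SERVER))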
Conclusions
• These systems are extremely complex.
• Automatic configuration and good monitoring can help, but systems need careful tending.
• Sites should adopt best practice and learn from others.
• We are improving, but it's an ongoing task.