This International Symposium on Grid Computing 2007 presentation by Jeremy Coles discusses high-level metrics, availability and reliability, current activities, problems and issues, and the future of HEP grid computing in the UK, with a focus on the transition to the LHC era.
HEP Grid Computing in the UK: moving towards the LHC era. International Symposium on Grid Computing 2007. Jeremy Coles, J.Coles@rl.ac.uk, 27th March 2007
1 High-level metrics and results
2 Availability and reliability
3 Current activities
4 Problems and issues
5 The future
6 Summary
Progress in deploying CPU resources. UKI is ~23% (~20% at GridPP17) of the contribution to EGEE. Max EGEE = 46795; Max UKI = 10393. http://goc.grid.sinica.edu.tw/gstat/UKI.html
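As a quick arithmetic check on the headline share, the two maxima quoted on the slide can be divided directly. This is only a sketch: the function name is illustrative, and the two maxima need not have been recorded at the same moment, so the result is expected to land near the quoted ~23% rather than match it exactly.

```python
def share_percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Maxima as quoted on the slide (gstat figures).
MAX_EGEE = 46795
MAX_UKI = 10393

uki_share = share_percent(MAX_UKI, MAX_EGEE)
print(f"UKI share of EGEE maximum: {uki_share:.1f}%")
```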
Estimated utilisation based on gstat job slots/usage. [Chart annotation: Guess what happened]
Usage of CPU is still dominated by LHCb. [Chart: usage as of two weeks ago.] Note the presence of the new VOs camont and totalep and also many non-LHC HEP collaborations. LHCb has been a major user of the CPU resources. There will be increasing competition as ATLAS and CMS double their MC targets every 3-6 months in 2007. https://gfe03.hep.ph.ic.ac.uk:4175/cgi-bin/load
2006 Outturn. Many sites have seen large changes this year. For example: 1) Glasgow’s new cluster (chart dates: August 28, September 1, October 13); 2) RAL PPD – new CPU commissioned. Figures shown: T2 disk 55%, T2 CPU 119%, T1 disk 48%, T1 CPU 64%.
LCG Disk Usage 21.5% 122.3%
Storage accounting has become more stable. http://www.gridpp.ac.uk/storage/status/gridppDiscStatus.html [Chart label: 2007] ATLAS and LHCb have been steadily increasing their stored data across UK sites. A new issue is dealing with full Storage Elements.
We need to address the KSI2K/TB ratios with future purchases. “The LCG experiments need well defined ratios of Tier-2 CPU to Disk (KSI2K/TB) which are about 2 for ATLAS, 3 for CMS, and almost entirely CPU for LHCb.”
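The ratios in the quote can be turned into a simple purchasing check. The sketch below is illustrative only: the function names and the example site figures (600 KSI2K, 150 TB) are hypothetical, and LHCb is left out because the quote gives no numeric ratio for it.

```python
# Target Tier-2 CPU-to-disk ratios (KSI2K per TB) as quoted on the slide.
TARGET_KSI2K_PER_TB = {"ATLAS": 2.0, "CMS": 3.0}


def cpu_disk_ratio(cpu_ksi2k: float, disk_tb: float) -> float:
    """Current ratio of CPU (KSI2K) to disk (TB) at a site."""
    if disk_tb <= 0:
        raise ValueError("disk_tb must be positive")
    return cpu_ksi2k / disk_tb


def disk_needed_tb(cpu_ksi2k: float, experiment: str) -> float:
    """Disk (TB) needed to match the target ratio for the given CPU."""
    return cpu_ksi2k / TARGET_KSI2K_PER_TB[experiment]


# Hypothetical site, not taken from the talk.
site_cpu, site_disk = 600.0, 150.0
current = cpu_disk_ratio(site_cpu, site_disk)
shortfall_atlas = disk_needed_tb(site_cpu, "ATLAS") - site_disk
print(f"ratio {current:.1f} KSI2K/TB, ATLAS disk shortfall {shortfall_atlas:.0f} TB")
```

A ratio above the target means the site is CPU-heavy for that experiment, so the next purchase would lean towards disk.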
Overall stability is improving but this picture is not seen very often! [Screenshot annotation: All SAM results green!] This lasted for a single iteration of the tests. A production grid should have this as the norm.
The GridView availability figures need cross-checking, but we will use them to start setting Tier-2 targets to improve performance. “…BDII failures in SRM and SE tests which were the problem. More than that, it was not failures in the site designated BDII, but that the BDII had been hardcoded to sam-bdii.cern.ch - and all the failures were from this component.” ScotGrid Blog 23rd March. https://gus.fzk.de/pages/ticket_details.php?ticket=19989 Targets: April >80%; June >85%; July >90%; Sept+ >95%.
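The stepped targets above lend themselves to a small lookup. The sketch below makes two assumptions that are not in the talk: a month with no stated target (May, August) inherits the most recent earlier target, and months before April have no target at all. The function names are illustrative.

```python
import bisect

# (month number in 2007, availability target in percent), as listed on the slide.
TARGETS = [(4, 80.0), (6, 85.0), (7, 90.0), (9, 95.0)]


def target_for_month(month: int):
    """Return the target in force for a month, or None before April.

    Assumption: a target stays in force until the next one is listed.
    """
    months = [m for m, _ in TARGETS]
    index = bisect.bisect_right(months, month) - 1
    if index < 0:
        return None
    return TARGETS[index][1]


def meets_target(month: int, availability_percent: float) -> bool:
    """True if availability is strictly above the target (targets are '>')."""
    target = target_for_month(month)
    return target is None or availability_percent > target


print(target_for_month(8), meets_target(6, 86.0))
```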
From a user perspective things are nowhere near stable so we need other measures – these are UK ATLAS test results starting Jan 2007: http://hepwww.ph.qmul.ac.uk/~lloyd/atlas/atest.php … but note that there is no special priority for these test jobs. [Plot: average job success vs date]
UK ATLAS test average success rate. [Plot: average job success vs date; annotations: more sites and tests introduced; system failures]
One example of a service problem encountered: matching on one RB is too slow. Do we spend time investigating the underlying problem? [Plot: average job success vs date]
Current networking (related) activities
LAN tests: example rfio test on DPM at Glasgow (shown at WLCG workshop). Other sites are beginning such testing.
T0 -> T1: still uncovering problems with CASTOR.
T1 -> T2: target rate 300Mb/s or better.
T2 -> T2 (intra T2): target rate 200Mb/s or better reading / writing.
2008 targets: ~1Gb/s.
Bottleneck: Tier-1 outbound.
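To give a feel for what these rates mean in practice, the sketch below estimates how long a sustained transfer would take. It assumes Mb/s means megabits per second and TB means 10^12 bytes, and it ignores protocol overhead; the 1 TB dataset size is a hypothetical example, not a figure from the talk.

```python
def transfer_hours(size_tb: float, rate_mbps: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in Mb/s.

    Assumptions: 1 TB = 1e12 bytes, 1 Mb/s = 1e6 bits per second.
    """
    if rate_mbps <= 0:
        raise ValueError("rate_mbps must be positive")
    bits = size_tb * 8e12
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 3600.0


# Rates taken from the slide; the 1 TB size is illustrative.
for label, rate in [("T1 -> T2", 300), ("T2 -> T2", 200), ("2008 target", 1000)]:
    print(f"{label}: {transfer_hours(1.0, rate):.1f} h per TB")
```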
GridMon – network monitoring. Network monitoring control nodes have been placed at each site to provide a “background” for network analysis. http://gridmon3.dl.ac.uk/gridmon/graph.html [Plots: Tier-2 examples; T1 example]
Issues that we are facing
• As job loads increase, the information system is showing signs that it does not scale well [query static information at site].
• The regular middleware (m/w) releases have made things easier, but several problems have been noted recently (e.g. with gLite 3.0 r16, DPM & Torque/Maui – the first “major” gLite update).
• Site administrators are having to learn new skills quickly to keep up with growing resource levels and stricter availability requirements. Improved monitoring is likely to be a theme for this year.
• Workarounds are required in several areas, especially to compensate for the lack of VOMS-aware middleware – such as enabling job priorities on batch queues and access control on storage (the ATLAS ACL request caused concern and confusion!).
• Memory per job vs core vs 64-bit may become an issue, as could available WN disk space, unless we have a clear strategy.
• Setting up disk pools also requires careful thought since disk quotas are not settable (disks are filling up and sites fail SAM tests).
• The middleware is still not available for SL4 32-bit, let alone SL4 64-bit.
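Since quotas cannot be set, one stopgap a site might use is a simple fill check that flags pools before they fill completely. The sketch below is a generic illustration and not a description of any GridPP or DPM tool: the pool names, sizes and the 95% threshold are hypothetical, and only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def fill_fraction(used_bytes: float, total_bytes: float) -> float:
    """Fraction of a file system that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def nearly_full(pools: dict, threshold: float = 0.95) -> list:
    """Names of pools whose fill fraction is at or above the threshold."""
    return sorted(
        name
        for name, (used, total) in pools.items()
        if fill_fraction(used, total) >= threshold
    )


def local_fill_fraction(path: str) -> float:
    """Fill fraction of the file system holding path, via the standard library."""
    usage = shutil.disk_usage(path)
    return fill_fraction(usage.used, usage.total)


# Hypothetical pools given as (used, total) in GB.
example_pools = {
    "pool_a": (9_700, 10_000),
    "pool_b": (4_000, 10_000),
    "pool_c": (9_600, 10_000),
}
flagged = nearly_full(example_pools)
print(flagged)
```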
Tier-1 organisation going forward in 2007. [Organisation chart; labels in extraction order: Grid Services; Grid/Support; Fabric (H/W and OS); CASTOR SW/Robot; 1x storage; 2x experiment; 1x PPS; 1x vacancy; 1x H/W; 1x monitoring; 1x H/W disc support; 1x OS support; 2x HW (disc) support; 1x service manager; 1x H/W manager; 1x CASTOR SRM; 1x general; 1x LSF; + additional effort; Project Management (several parts); Machine Room operations; Database Support (1x); Networking Support]
Disk and CASTOR
Problems throughout most of 2006 with new disk purchases. Drives were ejected from arrays under normal load. Many theories were put forward and tests conducted. WD, using analyzers on SATA interconnects at two sites, traced the problem to the drive head staying in one place for too long! Fixed with a “return following reposition” firmware update. Disk is now all deployed or ready to be deployed.
Successes
• Experiments see good rates (progress pushed by CSA06)
• More reliable than dCache
• Success with disk1tape0 service
• Bug-fix release should solve many problems
CASTOR: ongoing problems
• Garbage collection does not always work
• Under heavy load, jobs are submitted to the wrong servers (ones not in the correct storage class) -> jobs hang. Jobs in the LSF queue build up -> can be catastrophic!
• File systems within a disk pool fill unevenly
• Tape migration sometimes loops -> mount/unmount without any work done
• A significant number of disk-to-disk copies hang
• Unstable releases – need more testing
• Lack of admin tools (configuration is risky and time consuming) and of a good user interface
• No way to throttle job load
• Logging is inadequate (problem resolution is difficult)
• ….
Some of these are being addressed, but the current situation has impacted experiment migration to CASTOR and requires more support than anticipated.
CPU efficiency – still a concern at T1 and some T2 sites. ~90% CPU efficiency due to i/o bottlenecks is OK. Concern that this fell to ~75%. [Chart labels: 47%; target] Each experiment needs to work to improve their system/deployment practice, anticipating e.g. hanging gridftp connections during batch work.
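The slide does not define CPU efficiency; the sketch below assumes the usual accounting definition, CPU time divided by wall-clock time, to show how a job stalled on I/O drags the figure down. The job durations are hypothetical.

```python
def cpu_efficiency_percent(cpu_seconds: float, wall_seconds: float) -> float:
    """CPU efficiency as CPU time over wall-clock time, in percent.

    Assumption: this is the usual accounting definition; the talk does not
    spell out how its figures were computed.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 100.0 * cpu_seconds / wall_seconds


# Hypothetical one-hour job that spends 15 minutes waiting on a hung transfer.
wall = 3600.0
stalled = 900.0
efficiency = cpu_efficiency_percent(wall - stalled, wall)
print(f"{efficiency:.0f}%")
```

Under this definition, time spent waiting on a hanging gridftp connection counts as wall-clock time with no CPU use, which is why such stalls show up directly in the efficiency figure.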
[Plot: number of aborted jobs; annotations: home dir full; problem solved]
6-month focus – with LHC start-up in mind
1. Improve monitoring at (especially) T2 sites and persuade more sites to join in with cross-site working
2. Strengthen experiment & deployment team interaction
3. Site readiness reviews. These have started (site visits plus questionnaire)
4. Continue site testing – now using experiment tools
5. Address the KSI2K/TB ratios
The Future 1: GridPP3 Project Approved
[Timeline figure, 2001 to 2008: GridPP1, GridPP2, GridPP2+, GridPP3; EDG, EGEE-I, EGEE-II, EGI ?; LHC: 900GeV first collisions, 14TeV collisions, LHC data taking]
Two weeks ago, funding for GridPP3 was announced: £25.9m of new money plus contingency etc (£30m project).
GridPP1: Sep 2001 – Sep 2004, £17.0m or £5.7m/year.
GridPP2: Sep 2004 – Sep 2007, £15.9m or £5.3m/year.
GridPP2+: Sep 2007 – Apr 2008
GridPP3: Apr 2008 – Apr 2011, £25.9m or £7.2m/year.
In the current UK funding environment this is a very good outcome. The total funding is consistent with a “70%” scenario presented to the project review body.
UK e-Infrastructure. Users get common access, tools, information and nationally supported services through NGS. [Diagram labels: HPCx + HECToR; regional and campus grids; community grids; integrated internationally; LHC; VRE, VLE, IE; ISIS TS2]
The Future 2: Working with the UK National Grid Service -> EGI
NGS components
• Heterogeneous hardware & middleware
• Computation services: based on GT2
• Data services: SRB, Oracle, OGSA-DAI
• NGS portal, P-Grade
• BDII using GLUE schema: from 1995
• RB - gLite WMS-LB: deployed Feb 2007
Interoperability
• NGS RB+BDII currently configured to work with:
• core NGS nodes (Oxford, Leeds, Manchester, RAL)
• Other NGS sites to follow soon
• GridPP sites reporting to RAL
GridPP resources
• GridPP and the NGS VO (ngs.ac.uk)
• YAIM problem for DNS-style VO names. Workaround is to edit GLUE static information & restart GRIS but sites
Issues: Direction of CE, policies, …
With input from Matt Viljoen
1 Metrics – CPU on track and disk to catch up as demand increases
2 Availability – many sources. Working on targets.
3 Ongoing work – e.g. monitoring and bandwidth (WAN & LAN) testing
4 Tier-1 is shaping up. Fabric better but CASTOR a concern
5 GridPP3 is funded. Interoperation with NGS is progressing
6 Still lots to do … many areas not mentioned!
Acknowledgments & references: This talk relies on contributions from many sources but notably from talks given at GridPP18: http://www.gridpp.ac.uk/gridpp18/. Most material is linked from here: http://www.gridpp.ac.uk/deployment/links.html. Our blogs and wiki (http://www.gridpp.ac.uk/w/index.php?title=Special:Categories&article=Main_Page) may also be of interest.