Grid Deployment & Operations in the UK Wednesday 3rd May ISGC 2006, Taipei Jeremy Coles GridPP Production Manager UK&I Operations for EGEE J.Coles@rl.ac.uk
Overview 1 Background to e-Science – The UK Grid Projects NGS & GridPP 2 The deployment and operations models and vision 3 GridPP performance measures 4 Progress in GridPP against LCG requirements 5 Future plans 6 Summary
UK e-Science • National initiatives began in 2001 • UK e-Science programme • Application-focused/led developments • Varying degrees of “infrastructure” … • ‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’ • John Taylor • Director General of Research Councils • Office of Science and Technology http://www.rcuk.ac.uk/escience/
UK e-Infrastructure directions (diagram) • Regional and campus grids • VRE, VLE, IE • HPCx + HECToR • LHC • ISIS TS2 • Community grids, integrated internationally • Users get common access, tools, information and nationally supported services through the NGS
Applications • Thermodynamic integration • Molecular dynamics • Systems biology • Neutron scattering • Econometric analysis • Climate modelling • Nano-particles • Protein folding • Ab-initio protein structure prediction • Radiation transport (radiotherapy) • IXI (medical imaging) • Biological membranes • Micromagnetics • Archaeology • Text mining • Lattice QCD (analysis) • Astronomy (VO services) • Many, but not all, applications cover traditional computational sciences • Both user-installed and pre-installed software • Several data-focused activities • Common features are • Distributed data and/or collaborators • Not just pre-existing large collaborations • Explicitly encourage new users • Common infrastructure/interfaces
The UK & Ireland contribution to EGEE SA1 – deployment & operations • Consisted of 3 partners in EGEE-I: • The National Grid Service (NGS) • Grid-Ireland • GridPP – composed of 4 regional Tier-2s and a Tier-1 as per the LCG Tier model • Grid-Ireland focus: • National computational grid for Ireland built over the Higher Education Authority network • Central operations from Dublin • Has developed an auto-build system for EGEE components • In EGEE-II: • NGS and Grid-Ireland unchanged • The lead institute in each of the GridPP Tier-2s becomes a partner.
Focus: GridPP structure (organisation chart) • Boards: Oversight Committee, Collaboration Board, Project Management Board, Deployment Board, Tier-2 Board, Tier-1 Board, User Board • Coordination: Tier-1 Manager, Tier-1 Technical Coordinator, Production Manager, Tier-2 Coordinators for NorthGrid, SouthGrid, ScotGrid and London • Support: Helpdesk support, Catalogue support, VOMS support, Tier-2 support and Site Administrators at each Tier-2, Tier-1 support & administrators, Storage Group, Networking Group
GridPP structure and work areas Example activities from across these areas • Supporting dCache • Supporting DPM • Developing plug-ins • Constructing data views • Supporting network testing • Running core services • Ticket process management • Pre-production service • UK testzone • Pre-release testing • Deployment of new hardware • Information exchange • Maintaining site services • Maintaining production services • LCG service challenges • GridPP challenges • Monitoring use of resources • Reporting • Running helpdesks • Interoperation – parallel deployment • Updating project plans • Agreeing resource allocations • Checking project direction • Tracking documentation • VO interaction/support • Portal development Recent output from SOME areas follows….
How effectively are resources being used? A Tier-1-developed script uses one simple measure: sum(CPU time) / sum(wall time). Low efficiencies for 2005 were generally a few jobs making the situation look bad; in 2006 the main problems were with SEs. http://www.gridpp.rl.ac.uk/stats/
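The measure itself is just a ratio of two sums. The following Python sketch shows it applied to per-job accounting records; it is an illustration only, not the actual Tier-1 script, and the record field names (site, cpu_s, wall_s) are hypothetical.

from collections import defaultdict

def site_efficiency(jobs):
    """Return sum(CPU time) / sum(wall time) per site."""
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for job in jobs:
        cpu[job["site"]] += job["cpu_s"]    # CPU seconds consumed by the job
        wall[job["site"]] += job["wall_s"]  # wall-clock seconds occupied
    # Skip sites with no recorded wall time to avoid division by zero.
    return {s: cpu[s] / wall[s] for s in wall if wall[s] > 0}

jobs = [
    {"site": "RAL-Tier1", "cpu_s": 3600, "wall_s": 4000},
    {"site": "Lancaster", "cpu_s": 500, "wall_s": 5000},  # one stalled job drags the site down
]
print(site_efficiency(jobs))  # {'RAL-Tier1': 0.9, 'Lancaster': 0.1}

A handful of stalled jobs with large wall time and little CPU is enough to pull a site's quarterly figure down, which is exactly the 2005 effect described above.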
RTM data views - efficiency What are the underlying reasons for the big differences in overall efficiency? (Data shown for Q4 2005) http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html
RTM data views - usage Does the usage distribution make sense? (Data shown for Q4 2005) http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html
RTM data views – job distribution Operations needs to check mappings and discover why some sites are not used (a simple check is sketched below). (Data shown for Q4 2005) http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html
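One way to start automating that check, sketched below with made-up inputs rather than real RTM or information-system data, is to compare the sites advertising resources against the sites that actually received jobs.

# Hypothetical inputs: a site list from the information system and
# per-site job counts from accounting/RTM records.
advertised_sites = {"RAL-Tier1", "Lancaster", "Cambridge", "Durham"}
jobs_per_site = {"RAL-Tier1": 5120, "Lancaster": 830}

# Sites that advertise resources but ran nothing in the period.
unused = sorted(advertised_sites - set(jobs_per_site))
print("Sites with no recorded jobs:", unused)  # ['Cambridge', 'Durham']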
Site performance measures • Storage provided • Scheduled downtime • Estimated occupancy • SFT failures • Tickets & responsiveness • # VOs supported • + others….. • WHAT MAKES A SITE BETTER (beyond manpower)? • Need more data over longer periods • Ideally need more automated data collection! • Importance will increase in meeting MoU/SLA targets • How reliable are the metrics? (An illustrative scoring sketch follows.)
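As an illustration of how such measures could be folded into one comparable number, the sketch below combines normalised metrics with weights; both the weights and the field names are hypothetical, not anything GridPP agreed.

# Illustrative only: weighted combination of normalised (0..1) site metrics.
METRIC_WEIGHTS = {
    "sft_pass_rate": 0.4,     # fraction of Site Functional Tests passed
    "occupancy": 0.3,         # estimated CPU occupancy
    "uptime": 0.2,            # 1 - scheduled-downtime fraction
    "ticket_response": 0.1,   # ticket responsiveness score
}

def site_score(metrics):
    """Weighted sum of per-site metrics; missing metrics count as 0."""
    return sum(w * metrics.get(k, 0.0) for k, w in METRIC_WEIGHTS.items())

print(round(site_score(
    {"sft_pass_rate": 0.95, "occupancy": 0.7, "uptime": 0.98, "ticket_response": 0.8}
), 3))  # 0.866

How reliable such a score is depends entirely on the reliability of the inputs, which is the open question above.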
Meeting the LCG challenge – Example: Tier-2 individual transfer tests

Example rates from throughput tests (Mb/s; blank entries were not measured):

Site         From RAL Tier-1   To RAL Tier-1
Lancaster    ~800              –
Manchester   350               150
Edinburgh    156               440
Glasgow      166               331
Birmingham   289               461
Oxford       252               456
Cambridge    118               74
Durham       84                193
QMUL         397               172
IC-HEP       –                 –
RAL-PPD      –                 388

Initial focus was on getting SRMs understood and deployed….. • Big variation in what sites could achieve: internal networking configuration issues; site connectivity (and contention); SRM setup and level of optimisation • Rates to RAL were generally better than from RAL: availability and setup of gridFTP servers at Tier-2s; SRM setup and level of optimisation • Scheduling tests was not straightforward: availability of local site staff; status of hardware deployment; availability of the Tier-1; need to avoid first tests during certain periods (local impacts) (A rate-measurement sketch follows.) http://wiki.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Tests
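An individual rate measurement of the kind tabulated above takes only a few lines to script. The sketch below shows one plausible shape for such a test, not the harness actually used: endpoints and file size are placeholders, and globus-url-copy is invoked with its usual tuning options (parallel streams, TCP buffer size).

import subprocess
import time

SRC = "gsiftp://se.example-tier2.ac.uk/data/testfile"  # placeholder endpoint
DST = "gsiftp://se.example-tier1.ac.uk/data/testfile"  # placeholder endpoint
FILE_SIZE_BYTES = 1024 ** 3  # 1 GiB test file (assumed known in advance)

start = time.time()
# -p: parallel TCP streams, -tcp-bs: TCP buffer size in bytes.
subprocess.run(["globus-url-copy", "-p", "8", "-tcp-bs", "2097152", SRC, DST], check=True)
elapsed = time.time() - start

print(f"{FILE_SIZE_BYTES * 8 / elapsed / 1e6:.0f} Mb/s")

Repeating the same measurement in both directions is what exposes the to-RAL/from-RAL asymmetry noted above.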
Meeting the LCG challenge – Example: Tier-1 & Tier-2 combined transfer tests • Early attempts revealed unexplained dropouts • The dropouts were later traced to a firewall • A rate cap at RAL was introduced for later tests • Tests were repeated to check the RAL capping • The rate was stretched further by using an OPN link to Lancaster http://wiki.gridpp.ac.uk/wiki/SC4_Aggregate_Throughput
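The dropouts were initially spotted by eye on the throughput plots. A simple automated check, sketched here with made-up rate samples, would flag any interval that falls well below the recent average:

def find_dropouts(samples_mbps, window=6, threshold=0.5):
    """Yield indices where the rate is below threshold * mean of the previous window samples."""
    for i in range(window, len(samples_mbps)):
        baseline = sum(samples_mbps[i - window:i]) / window
        if baseline > 0 and samples_mbps[i] < threshold * baseline:
            yield i

rates = [450, 460, 440, 455, 450, 445, 90, 430, 450]  # invented Mb/s samples
print(list(find_dropouts(rates)))  # [6] - the 90 Mb/s dropout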
Meeting the LCG challenge – Tier-1 & Tier-2 combined transfer tests (rerun) http://wiki.gridpp.ac.uk/wiki/SC4_Aggregate_Throughput
GridPP operations: What is next? • SRM deployments are now stable and focus has shifted to improving site configurations and optimisations • Sites are now more comfortable with the release/reporting process but concerns remain – gLite 3.0 • We need to continue improving site transfer performance but also extend the tests to include such things as sustained simultaneous reading and writing (a sketch follows this list) • Several sites are receiving new equipment – we need to ensure a smooth deployment; 64-bit machines are being deployed in some cases • GridPP mapped its Tier-2s to experiments for closer working and “proving” of the Tier-2 capabilities. Some progress already but much more needed • Data is becoming available for understanding the performance of sites but integration and automation are far from ideal • The installation of network monitoring “boxes” at UK sites • Security – several areas, but extending the ROC security challenge and implementing an approach for joint logging are in progress • More interoperation (and joint operations) with NGS
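For the sustained simultaneous read/write tests mentioned in the list above, a minimal driver could run an inbound and an outbound transfer at the same time. The sketch below is hypothetical: the endpoints are placeholders and the production tests may be driven quite differently.

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder endpoints; in a real test these would be SRM/gridFTP URLs at the sites.
INBOUND = ["globus-url-copy", "gsiftp://remote.example.ac.uk/data/in", "file:///data/in"]
OUTBOUND = ["globus-url-copy", "file:///data/out", "gsiftp://remote.example.ac.uk/data/out"]

def run(cmd):
    # check=True makes a failed transfer raise, so the test run fails loudly.
    return subprocess.run(cmd, check=True)

# Read and write concurrently to stress disk and network in both directions.
with ThreadPoolExecutor(max_workers=2) as pool:
    for future in [pool.submit(run, INBOUND), pool.submit(run, OUTBOUND)]:
        future.result()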
Summary 1 UK e-Science has a broad vision, with NGS a central part 2 There will be increasing interoperation between UK activities 3 The UK particle physics grid remains one of the largest projects 4 Operational focus will shift to performance measures 5 Progress is being made towards the LHC pilot service, but not always smoothly 6 There are clear areas where further work is required