Grid Deployment & Operations: EGEE, LCG and GridPP



  1. Grid Deployment & Operations: EGEE, LCG and GridPP 20th September 2005 Jeremy Coles GridPP Production Manager UK&I Operations Manager for EGEE J.Coles@rl.ac.uk

  2. Overview 1 Project Background (to EGEE, LCG and GridPP) 2 The middleware and its deployment 3 Structures developed in response to operating a large grid 4 How the infrastructure is being used 5 Particular problems being faced 6 Summary

  3. A reminder of the Enabling Grids for E-sciencE project • €32 million of EU funding over 2 years, starting 1st April 2004 • 48% service activities (Grid Operations, Support and Management; Network Resource Provision) • 24% middleware re-engineering (Quality Assurance, Security, Network Services Development) • 28% networking (Management; Dissemination and Outreach; User Training and Education; Application Identification and Support; Policy and International Cooperation) • Emphasis in EGEE is on operating a production grid and supporting the end-users (from Bob Jones's talk at AHM 2004)

  7. The UK & Ireland contribution to SA1 – deployment & operations • Consists of 3 partners: • Grid Ireland • The National Grid Service (NGS) – Leeds/Manchester/Oxford/RAL • GridPP – currently the lead partner, based on a Tier-2 structure within the LHC Computing Grid project (LCG) [see T Doyle's talk tomorrow, 11am, CR2] • The UKI structure: • Regional Operations Centre (ROC) – helpdesk, communications, liaison with other ROCs and CICs, monitoring of resources • Core Infrastructure Centre (CIC) – team takes “shifts” to monitor core services and follow up on site problems

  8. GridPP is a major contributor to the growth of EGEE resources

  9. When sites join EGEE the ROC … • Records site details in a central Grid Operations Centre Database (GOCDB), with access controlled by grid certificate • Ensures that the site has agreed to and signed the Acceptable Use and Incident Response procedures • Runs tests against the site to ensure that the setup is correctly configured. NB: page access requires an appropriate grid certificate

  10. Experience has revealed growing requirements for the GOCDB • ROC manager control – to be able to update site information and to change the monitoring status for, or remove, sites • A structure that allows easy population of structured views (such as accounting according to regional structures) • The ability to differentiate pure production sites from test resources (e.g. preproduction services)

  11. EGEE middleware is still evolving based on operational needs [layered-stack diagram: applications and user interfaces at the top; application-level services (application monitoring, Resource Broker, data management) built on EU DataGrid “collective” services; “basic” services (user access, information system and schema, data transfer, security) provided by VDT (Condor, Globus, GLUE); system software – local schedulers (PBS, Condor, LSF, …), file systems (NFS, dCache-SRM, DPM, …), operating systems (Scientific Linux, RHEL, …); hardware – computing clusters, network resources, data storage]

  12. An overview of the (changing) middleware release process [diagram: certification is run daily; releases are re-certified and shipped with updated release notes, user guides and installation guides by the EIS/GIS/CIC teams] • Major releases every 3 months on fixed dates – deployment is mandatory, but sites (RCs) proceed at their own pace • Client releases (user space) every month • Service releases – deployment optional • Site deployment of middleware: • YAIM – bash script; simple and transparent; much preferred by administrators • QUATTOR – steep learning curve but allows tighter control over installation • Tensions: patches & functionality vs stability; porting to “non-standard” LCG operating systems
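As a rough illustration of the YAIM approach, the sketch below drives a YAIM-style install from Python. The site-info.def variables, script paths and arguments shown are assumptions for the sketch (they varied between releases), not a fixed API.

# Sketch of driving a YAIM-style install -- paths, variable names and
# script arguments are illustrative assumptions, not the exact interface.
import subprocess
from pathlib import Path

# Hypothetical minimal site-info.def: key=value pairs read by the YAIM bash functions.
SITE_INFO = """\
SITE_NAME=UKI-EXAMPLE-SITE
CE_HOST=ce.example.ac.uk
SE_HOST=se.example.ac.uk
WN_LIST=/opt/lcg/yaim/etc/wn-list.conf
VOS="atlas cms lhcb dteam"
"""

def deploy_node(node_type: str, site_info: Path) -> None:
    """Install and configure one node type (e.g. 'WN') via the YAIM scripts."""
    scripts = Path("/opt/lcg/yaim/scripts")  # assumed install location
    subprocess.run([scripts / "install_node", site_info, node_type], check=True)
    subprocess.run([scripts / "configure_node", site_info, node_type], check=True)

if __name__ == "__main__":
    cfg = Path("site-info.def")
    cfg.write_text(SITE_INFO)
    deploy_node("WN", cfg)  # worker node: the simplest case

The appeal noted on the slide is visible here: the whole site configuration lives in one flat file, at the cost of the fine-grained control a tool like QUATTOR gives.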

  13. A mixed infrastructure is inevitable and local variations must be manageable • Releases take time to be adopted – how will more frequent updates be tagged and handled? • Grid Ireland has a completely different deployment model from GridPP (central vs site-based)

  14. Additional components are added, such as for managed storage [diagram: a grid Storage Element – an SRM interface and “handlers” in front of tape (or disk) storage, with file metadata and access control] • The Storage Resource Management (SRM) interface provides a protocol for large-scale storage systems on the grid • Clients can retrieve and store files, and control file lifetimes and file space • Sites will need to offer an SRM-compliant storage element to VOs • These SEs are basically filesystem mount points on specific servers • Few solutions are available, and deployment at test sites has proved time-consuming (integration at sites, understanding the hardware setup; documentation is improving)
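From the client side, a typical interaction with an SRM-fronted storage element at the time went through the lcg-utils command-line tools. A minimal wrapper is sketched below; the endpoint, storage path and VO name are made up for illustration.

# Illustrative wrapper around the lcg-utils tools; the SURL and VO are assumptions.
import subprocess

def store_file(local_path: str, vo: str = "dteam") -> str:
    """Copy a local file to an SRM storage element and return the target URL."""
    surl = f"srm://se.example.ac.uk/dpm/example.ac.uk/home/{vo}/demo.dat"  # hypothetical SURL
    subprocess.run(["lcg-cp", "--vo", vo, f"file://{local_path}", surl], check=True)
    return surl

def fetch_file(surl: str, local_path: str, vo: str = "dteam") -> None:
    """Retrieve a file from the storage element back to local disk."""
    subprocess.run(["lcg-cp", "--vo", vo, surl, f"file://{local_path}"], check=True)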

  17. Once sites are part of the grid they are actively monitored • The Site Functional Tests (SFTs) are a series of jobs reporting whether a site can perform basic transfers, publish the required information, and so on • These have recently been updated because certain “critical tests” gave a misleading impression of a site • The tests are being used (and expanded) by Virtual Organisations (VOs) to select stable sites (to improve efficiency) • They have proved very useful to sites and can now be run by them on demand
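In spirit, an SFT job is a batch of named checks whose pass/fail results are published centrally. The stripped-down sketch below conveys the shape of it; the individual check commands are placeholders, not the real SFT test set.

# Simplified picture of a Site Functional Test job: run named checks on the
# worker node and report pass/fail per test. Commands here are placeholders.
import subprocess

CHECKS = {
    "job-submission": ["true"],           # trivially passes: the job ran at all
    "software-version": ["lcg-version"],  # assumed CLI reporting middleware version
    "replica-copy": ["lcg-cp", "--help"], # stand-in for a real transfer test
}

def run_sft() -> dict:
    results = {}
    for name, cmd in CHECKS.items():
        try:
            ok = subprocess.run(cmd, capture_output=True).returncode == 0
        except FileNotFoundError:
            ok = False
        results[name] = "PASS" if ok else "FAIL"
    return results

if __name__ == "__main__":
    for test, status in run_sft().items():
        print(f"{test}: {status}")  # the real SFT publishes these to a central page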

  18. The tests form part of a suite of information used by the Core Infrastructure Centres (CICs) • There are currently 5 CICs in EGEE • Introduction of a CIC on Duty rota (whereby each CIC oversees EGEE operations for 1 week at a time) saw a great improvement in grid stability • Available information is captured in a Trouble Ticket and sent to problem sites (and their ROC) informing them that there is a problem • Tickets are automatically escalated if not resolved • Core services are monitored in addition to sites
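The automatic escalation mentioned on the slide can be pictured as a simple deadline rule: a trouble ticket that sits unresolved moves up a level. The levels and deadlines below are hypothetical, not the project's actual policy.

# Sketch of automatic ticket escalation: site -> ROC -> CIC-on-duty.
from dataclasses import dataclass
from datetime import datetime, timedelta

LEVELS = ["site", "ROC", "CIC-on-duty"]
DEADLINE = timedelta(days=3)  # assumed response window per level

@dataclass
class Ticket:
    site: str
    opened: datetime
    level: int = 0      # index into LEVELS
    resolved: bool = False

def escalate_if_stale(ticket: Ticket, now: datetime) -> None:
    """Bump an unresolved ticket up one level once its cumulative deadline passes."""
    if not ticket.resolved and now - ticket.opened > DEADLINE * (ticket.level + 1):
        if ticket.level < len(LEVELS) - 1:
            ticket.level += 1
            print(f"{ticket.site}: escalated to {LEVELS[ticket.level]}")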

  19. Good, reliable and easy-to-access information has been extremely useful to sites and ROC staff • At a glance we can see for each site: • whether it passes or fails the functional tests • if there are configuration errors (via “sanity checks”) • what middleware version is deployed • the total job slots available and used, as published by the site • basic storage information • average and maximum published job slots, showing deviations
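One plausible way to surface the job-slot “deviations” mentioned above is to compare the latest published count against the site's recent average; the 20% threshold here is an illustrative choice, not a project standard.

# Flag a site whose newest published slot count deviates from its recent mean.
def slot_deviation(history: list[int], threshold: float = 0.2) -> bool:
    """Return True if the latest sample deviates from the mean of the earlier
    samples by more than `threshold` (as a fraction)."""
    *past, latest = history
    if not past:
        return False
    mean = sum(past) / len(past)
    return mean > 0 and abs(latest - mean) / mean > threshold

# Example: a site that normally publishes ~200 slots suddenly publishes 40.
print(slot_deviation([210, 195, 200, 205, 40]))  # True -> worth investigating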

  20. With a rapidly growing number of sites and geographic coverage many tools have had to evolve

  21. And new ones developed. EGEE and LCG metrics are an increasing area of focus – how else are we to manage?

  22. We need to develop a better understanding of grid dynamics [plot annotations: Is this the result of a loss of the Tier-1 scheduler, or just a problem with the tests? Is this several sites with large farms upgrading?]

  23. The good news is that UKI is currently the largest contributor to EGEE resources

  24. … and resource usage is growing (at 55% for August, and 26% for the period from June 2004) • Utilisation may worry some people, but note that the majority of resources are being deployed for High Energy Physics experiments, which will ramp up usage quickly in 2007 • Recent activity is partly due to a Biomedical data challenge in August

  25. Several sites have been running full for July/August. The plot below is for the Tier-1 in August

  26. However full does not always mean well used! • The plot shows weighted job efficiencies for the ATLAS VO in July 2005 • Straight line structures show jobs which ran for a period of time before blocking on an external resource and eventually being killed by an elapsed time limit • Clusters at low efficiency probably show performance problems on external storage elements • Many problems seen here are NOW FIXED
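“Weighted job efficiency” is not defined on the slide; a common convention is CPU time divided by wall-clock time, with long jobs weighted more heavily. The sketch below works under that assumption.

# Aggregate efficiency = total CPU time / total wall time (wall-time weighted).
def weighted_efficiency(jobs: list[tuple[float, float]]) -> float:
    """jobs: (cpu_seconds, wall_seconds) pairs."""
    total_wall = sum(wall for _, wall in jobs)
    if total_wall == 0:
        return 0.0
    return sum(cpu for cpu, _ in jobs) / total_wall

# A job that blocks on external storage burns wall time but little CPU,
# dragging the weighted efficiency down -- the "straight line" failure mode.
print(weighted_efficiency([(3500, 3600), (120, 36000)]))  # ~0.09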

  27. … and some sites have specific scheduling requirements [diagram: a PBS batch server (pbs_server) holding the configuration, job queue and state table; users interact via qsub/qdel/qstat; a scheduler plug-in exchanges job and node start/stop/status messages with the server; jobs execute on hosts running pbs_mom, possibly across an additional cluster] • Grid scheduling (using user-specified requirements to select resources) vs local policies (the site “prefers” certain VOs)

  28. The user community is expanding, creating new problems • Over 900 users in some 60+ VOs • UK sites support about 10 VOs • Opening up resources for non-traditional site VOs/users requires effort • Negotiation between VOs and the regional sites has required the creation of an “Operational Advisory Group” • New Acceptable Use policies that apply across countries and are agreeable (and actually readable) are taking time to develop

  30. Aggregation of job accounting is recording VO usage, but … [diagram: sites publish accounting data to the GOC, which presents a web summary view] • Not all batch systems are covered • Not all sites are publishing data • Farm normalisation factors are not consistent • Publishing across grids is yet to be tackled (but the solution in EGEE does use a GGF schema)
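The normalisation problem flagged above comes down to scaling raw CPU seconds by a per-farm benchmark factor before summing across sites. The factors and record layout below are made up for illustration.

# Normalise CPU usage by a per-farm benchmark factor before aggregating per VO.
FARM_FACTOR = {  # hypothetical benchmark rating per farm (SpecInt-style units)
    "RAL-Tier1": 1.40,
    "UKI-SiteA": 0.95,
    "UKI-SiteB": 1.10,
}

def normalised_usage(records: list[tuple[str, str, float]]) -> dict[str, float]:
    """records: (farm, vo, raw_cpu_seconds) -> normalised CPU seconds per VO."""
    totals: dict[str, float] = {}
    for farm, vo, cpu in records:
        totals[vo] = totals.get(vo, 0.0) + cpu * FARM_FACTOR.get(farm, 1.0)
    return totals

print(normalised_usage([("RAL-Tier1", "atlas", 1000.0), ("UKI-SiteA", "atlas", 1000.0)]))
# {'atlas': 2350.0} -- comparable across farms only once both are scaled

If sites apply inconsistent factors (or none), the aggregated totals silently mix incompatible units, which is exactly the inconsistency the slide complains about.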

  31. GridPP data is reasonably complete for recent months • Note the usage by non-particle-physics organisations – this is what the EGEE grid is all about

  32. Support is proving difficult because the project is so large and diverse [diagram: users, site administrators and experiments/VOs reach support through many parallel channels – GOSC (Footprints), the LCG-ROLLOUT and TB-SUPPORT mailing lists, the Grid-Ireland helpdesk (Request Tracker), GGUS (Remedy), the UKI ROC ticket tracking system (Footprints), Savannah bug tracking, the Tier-1 helpdesk (Request Tracker), the CIC-on-duty, various regional services and individual sites] • This is ONLY the view for the UKI operations centre. There are 9 ROCs

  33. The EGEE model uses a central helpdesk facility and Ticket Process Managers [diagram: a user needing help e-mails vo-user-support@ggus.org; the e-mail is automatically converted into a GGUS ticket, which can be addressed to TPM VO Support only, the TPM only, or both] • Ticket Process Manager (TPM): monitors ticket assignments, directs tickets to the correct support unit, and notifies users of specific actions and ticket status • TPM VO Support: people from the VOs; they receive and follow VO-related tickets, solve or forward VO-specific problems, and recognise grid-related problems, assigning them to specific support units or back to the TPM • Support units include: VO support units, ROC support units, middleware support units, other-grids support units, the CIC support unit and mailing lists
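The first routing decision, as described on the slide, is simply who sees a new ticket: the central TPM, the VO's own TPM team, or both. A sketch of that decision follows; the categorisation rule is an illustrative assumption.

# Sketch of first-line GGUS ticket routing; the heuristic is illustrative.
SUPPORT_UNITS = ["ROC", "middleware", "other-grids", "CIC"]

def route_ticket(body: str, vo: str | None) -> list[str]:
    """Decide which ticket process managers see a new ticket first."""
    recipients = []
    if vo:  # a VO named in the mail -> that VO's TPM team
        recipients.append(f"TPM-VO:{vo}")
    if not vo or "grid" in body.lower():  # generic or grid-level problems -> central TPM
        recipients.append("TPM")
    return recipients

print(route_ticket("My grid job fails at submission", vo="biomed"))
# ['TPM-VO:biomed', 'TPM'] -- the TPM later assigns a unit from SUPPORT_UNITS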

  34. The EGEE model uses a central helpdesk facility and ticket process managers, but … • Some users are confused – mixed messages • The central GGUS facility is taking time to become stable • Ticket Process Managers are difficult to provide – EGEE funding did not account for them • VOs still have independent support lists and routes – especially the larger VOs • Linking up ROC helpdesks is taking time • Getting VOs to populate their follow-up lists is not happening quickly • Mailing lists are very active on their own!

  35. Interoperability is another area to be developed • In terms of: • Operations • Support • Job submission • Job monitoring • … • Currently the VOs/experiments develop their own solutions to this problem.

  36. Some other areas which are talks in themselves! • Security • Getting all sites to adopt best practices: checking patches, checking port changes, reviewing log files • Scanning for grid-wide intrusion • Network monitoring • Aggregation of data from site “network boxes” • Mediator for integrated network checks
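As a taste of the “reviewing log files” practice, the sketch below scans a gatekeeper-style log for repeated authentication failures per certificate DN. The log line format and threshold are assumptions for illustration.

# Minimal log-review sketch: count authentication failures per DN.
from collections import Counter
import re

FAIL_RE = re.compile(r"authentication failed for (?P<dn>/C=\S+)")  # assumed line format

def suspicious_dns(log_lines: list[str], threshold: int = 5) -> list[str]:
    """Return DNs with at least `threshold` authentication failures."""
    fails = Counter(m.group("dn") for line in log_lines
                    if (m := FAIL_RE.search(line)))
    return [dn for dn, n in fails.items() if n >= threshold]

# Feed it a day's log and hand anything it returns to the site security contact.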

  37. Going forward, one of the main drivers pushing the service is a series of service challenges in LCG • Main UK site connected to CERN via UKLIGHT • Up to 650 Mb/s sustained transfers • 3 “Tier-2” centres deployed an SRM and managed sustained data transfer rates of up to 550 Mb/s over SJ4; one was connected via UKLIGHT

  38. Summary 1 UK&I has a strong presence in EGEE and LCG 2 Our grid management tools are now evolving rapidly 3 Grid utilisation is improving – we are starting to look at the dynamics 4 Growing focus areas include support and interoperation (and gLite!) 5 There is a lot of work not covered here! Fabric, security, networking … 6 Come and visit the GridPP (PPARC) and CCLRC stands!

  39. gLite vs LCG-2 components [comparison diagram] • VOMS and MyProxy are shared • Catalogue and access control: LFC (LCG) vs FIREMAN (gLite) • Workload management: Resource Broker (LCG) vs gLite WLM • Information systems are independent (a BD-II each), but the R-GMAs can be merged (security ON) • Accounting: APEL (LCG) vs DGAS (gLite) • Separate UIs; gLite adds the gLite-IO service • At a site, the LCG CE and the gLite-CE use the same batch system and WNs • FTS is shared: FTS for LCG uses the user proxy, while gLite uses a service certificate • The SRM-SE is shared: data from LCG is owned by VO and role, while the gLite-IO service owns gLite data
