The EGEE infrastructure
Dr. Ian Bird, CERN, SA1 Activity Manager
EGEE’07 Conference, Budapest, 2nd October 2007

Outline
Overall status and usage
Progress in the past year
• Growth in resources and use
• Security activities
• Operations
• Pre-production service
• Certification and testing
• Network support
• SLAs
Monitoring advances
Expectations for the next year
• New services
• EGEE-III
Summary
EGEE'07; 2nd October 2007

The EGEE Infrastructure
Test-beds & Services
• Production Service
• Pre-production service
• Certification test-beds (SA3)
• Training infrastructure (NA4)
Support Structures & Processes
• Operations Coordination Centre
• Regional Operations Centres
• Global Grid User Support
• EGEE Network Operations Centre (SA2)
• Operational Security Coordination Team
• Training activities (NA3)
Security & Policy Groups
• Joint Security Policy Group
• EuGridPMA (& IGTF)
• Grid Security Vulnerability Group
• Operations Advisory Group (+NA4)

Resources

Increasing workloads
Still expect a factor 5 increase for LHC experiments over the next year
32%

Use of the infrastructure
EGEE: ~250 sites, >45000 CPU
24% of the resources are contributed by groups external to the project
~>20k simultaneous jobs

Operations progress
Progress/success: production service, Oct ’06 to Sep ’07:
• Number of sites: ~190 => ~240 (x1.25 increase)
• Average number of jobs/month for the preceding 12 months: 0.97 million => 2.46 million (x2.5 increase)
• Peak number of jobs in the preceding 12 months: 1.45 million (June 06) => 3.11 million (May 07) (x2.14 increase)
• Number of CPUs: ~32,000 => ~46,000 (x1.44 increase)
• Increase in the number of teams involved in grid operations (CODs):
  • The work is now shared by the two teams who are on duty (it used to be a primary/backup set-up where the backup only came on-line as needed). This is a better way to work, as the teams no longer have such long breaks between shifts (used to be ~10 weeks)

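As a quick cross-check of the growth figures quoted above, the ratios can be recomputed from the before/after values. This is only an illustrative sketch: the dictionary keys and the `growth_factor` helper are names chosen here, and the inputs are the rounded ("~") numbers from the slide, so small differences from the quoted factors are expected.

```python
# Before/after figures as quoted on the slide (Oct '06 to Sep '07).
# Site and CPU counts are approximate ("~") in the source.
metrics = {
    "sites": (190, 240),
    "average jobs/month (millions)": (0.97, 2.46),
    "peak jobs/month (millions)": (1.45, 3.11),
    "CPUs": (32_000, 46_000),
}


def growth_factor(before, after):
    """Return the multiplicative increase after/before."""
    return after / before


# Hypothetical helper output: one growth factor per metric.
factors = {name: growth_factor(b, a) for name, (b, a) in metrics.items()}

for name, factor in factors.items():
    print(f"{name}: x{factor:.2f}")
```

The recomputed ratios land close to the factors on the slide; the site ratio comes out slightly above x1.25, which is consistent with both site counts being approximate.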
SFT to SAM
Migrated from SFT to SAM
• Massive improvements in standardizing the framework
• Anyone can now easily contribute tests
• Now easier for people to run their own instance of the service
• SAM now used in one way or another by all the LHC experiments
• Started generating site availability reports

Operations progress
Successful releases of major updates to many central operations services (GOCDB, CIC Portal, GGUS)
• CIC Portal new features include raising of alarms and masking of unnecessary alarms (leading to less time wasted by CODs)
• RSS feed for CIC Portal alarms so that site administrators can monitor their own sites
• Major update to GOCDB which included many new, useful features
• Still a few bugs to fix
Implementation of failover for most central operations services
• Still needed for GOC database
• Improvements still needed for other operations services (for example CIC Portal)

Operations progress
Implementation of a formalized grid middleware release process
• Moved from “big bang” releases to incremental updates
• Formal, documented process now in place which is handled by teams rather than single-point-of-failure individuals
• For details: http://egee-pre-production-service.web.cern.ch/egee-pre-production-service/index.php?dir=./release/
Release of WMS: better performance and reliability compared with the RB
Full deployment of FTS service
Process implemented to track the most urgent/important grid issues by the ROCs
• These are passed to the TCG where appropriate and have resulted in significant improvements, for example standardization and improvement of middleware logging
Interoperability with OSG in production
• CMS now submit jobs to both grids (EGEE and OSG) through a single WMS
Moved to SL4 version of WN. Other services coming soon.

Issues for Operations
Need to improve reliability of ‘user’ services (users care about successful jobs and this involves many grid middleware services)
• Resilience to glitches
• Identification and treatment of SPoFs
Not clear how far the current COD structure can scale
Central operations services (Gstat, GOCDB, CIC portal, etc.) are all now interdependent and heavily used for day-to-day operations
• The failover and upgrade mechanisms need to be improved to keep down-time to a minimum
Still need to keep improving the release notes; they are still a major cause of deployment issues
Need dedicated interoperability testing
VOViews and Job Priorities are confusing for many sites

Operational Security
Operational Security Coordination Team (OSCT)
Successes:
• The OSCT will provide its first security training event during EGEE'07. All service managers and site administrators are welcome
Issues:
• The OSCT is looking for additional experts to contribute to its activities; people with an interest in security should contact the team
Progress:
• The OSCT is gradually introducing SAM security tests to check for known security issues at the sites
• Note: it uses special tests in SAM, securely transported and visible only to the OSCT

JSPG
The successes during the last year have been:
Updating the top-level Security Policy to make it simpler and more general
• Generalisation and simplification of the policies has been needed to achieve interoperable (identical) policies between EGEE, OSG, NDGF and others
New policies: Site Operations, VO Operations, Pilot Jobs
Issues:
Need to review and update several older policy documents (in EGEE-II) to remove duplications and ambiguities
Work on the next revision of the full policy set to make it even more general and applicable to more Grids in the world of EGI and NGIs (in EGEE-III)

GSVG
The EGEE-II Grid Security Vulnerability issue handling is now approved and in use
Deliverable DSA1.3, which includes a summary of the GSVG strategy, has been approved by the PEB and accepted by the EU
• This allows the disclosure of issues concerning EGEE middleware when they reach the Target Date for resolution
The Risk Assessment team is handling Security Vulnerability issues and carrying out Risk Assessments
Since GSVG started (end 2005):
• 122 issues analysed (1 – 2 per week)
• 62 open (42 are sw bugs); 60 closed (25 bug fixes, 7 operational)
• 1 extremely critical, 9 high risk (2 open)

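The GSVG counts above can be sanity-checked with a little arithmetic. The sketch below assumes "end 2005" means 31 December 2005 and uses the date of this talk as the end point; both dates are assumptions made for illustration, not values stated by GSVG.

```python
from datetime import date

# Counts quoted on the slide.
analysed = 122
open_issues = 62
closed_issues = 60

# Open plus closed should account for every analysed issue.
total_accounted = open_issues + closed_issues

# Assumed interval: "end 2005" taken as 2005-12-31, up to the talk date.
start = date(2005, 12, 31)   # assumption, the slide only says "end 2005"
talk = date(2007, 10, 2)
elapsed_days = (talk - start).days
weeks = elapsed_days / 7

# Average number of issues analysed per week over the assumed interval.
rate_per_week = analysed / weeks

print(f"accounted for: {total_accounted} of {analysed}")
print(f"issues per week: {rate_per_week:.2f}")
```

Under these assumed dates the average falls inside the "1 – 2 per week" range given on the slide.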
User Support – 1
Technical and procedural efforts:
Lots of technical improvements:
• New search engine
• Ticket linking
• Subscription to tickets
• Local helpdesks
• Reporting tools
Bidirectional interface with OSG user support
TPM first line support works smoothly now
Clear distinction between Services and Software Support Units
Still responsiveness issues when problems leave the influence sphere of SA1

User Support – 2
Documentation:
Transparent development through the ESC shopping list: https://savannah.cern.ch/projects/esc/
A prototype for better quality ticket submission is available on https://iwrgustrain.fzk.de/pages/ticket1.php
• Put comments in shopping list ticket #102127 or send them to ggus-info@cern.ch
Rigorous ticket progress reporting and monitoring is now possible in:
• https://gus.fzk.de/pages/download_escalation_reports.php
• http://goc.grid.sinica.edu.tw/gocwiki/TPM_monitoring_reports
A collection of information sources is assembled in https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport

User Support – 3
Communication:
With all Grid Sites, including OSG, weekly at the Operations meeting
With ROCs, VOs and GGUS developers monthly at the ESC
With GGUS developers fortnightly in the Shopping List review that defines the content of the (monthly) GGUS Releases
VOs begin to realise the importance of strong user support (CHEP'07)
Workshop to establish and improve the connection between grid and VO user support: see the SA1 session on Thursday afternoon in "VO managers and ROC managers issues"

Pre-production service
• Pre-production service is now ~27 sites in 16 countries
• Provides access to some 3000 CPU
• Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
• Weekly update cycle
• Try to get a good feeling for the quality of the release or updates before general release to production
• Larger sites gain experience on PPS before going to production
• Services may be initially demonstrated in this environment
  • Before further development
• New VOs: adapt their applications & gain experience
  • (e.g. DILIGENT)

Pre-production service
Issue: the service is not used at the level that was intended
Many issues for LHC experiments
• Lack of effort
• Difficulty to test complex software stacks not in production environment
Slows down deployment process, but good for sites to get a pre-deployment look at changes, new services, etc.
Cannot justify the cost?
Discussion on future of PPS during this conference

Progress in Certification
Handling change with the established process
• 1 release per week
• Process evolving, based on experience
• 289 Patches in 2007
  • Corresponding to 820 closed bugs
• ~10 Patches are in work in parallel
  • Limited by resources
• Patch certification sees more partner participation
  • 18 Patches certified by external partners
  • We have to increase this
Extensive use of the “Experimental Services” process
• Only way to address scalability and stability of core services
• For the WMS the service moved outside CERN
  • Service run by INFN
  • Verification of checkpoint releases by Imperial College

Certification & Testing
Improved test coverage
• Especially Data Management
• Pre-certification release to interested user communities
  • Very early feedback to developers
Extensive use of virtual test-beds
Changes:
YAIM-4 configuration support tool
• Independent releasable modules per component
• Opened YAIM for developers and site admins
• Major refactoring of the tool
• Removed almost all legacy Python configuration
Move to ETICS
• Difficult transition
• Sometimes 3 build systems involved in one release

Porting
Move to SL4 and VDT-1.6 (32 and 64 bit)
• Much delayed
• Revised plan and plan for restructuring gLite
• Still in progress (but getting close)
• WN and UI (32-bit) are in production
• LCG-CE has been ported to SL4 + VDT-1.6
  • Will reach PPS in 2 weeks (including DGAS support)
• WMS/LB gLite 3.1 / SL4 version
  • Certification in about 2 months
• BDII released to PPS
• DPM and LFC have been tested internally on SL4 (32- and 64-bit)
  • Just waiting for the yaim component to complete certification
• FTS-2 SL4 pilot service is planned for October
  • Release and deployment at T1s in January
• VOBOX prototype has been set up during summer
  • 1-2 months
• Glite-PX
  • Finalising configuration (1 month)

Porting – cont.
• Glite-MON
  • Need config for tomcat 5.5
• glite-SE_classic
  • Just started working, but simple
• Glite-VOMS
  • Being processed as patch #1322
  • ~2 months
Strategy for 64-bit is prioritised:
• WN + Torque client
• DPM-disk
• UI
• Other services depending on 64-bit advantage
Currently the 64-bit WN + torque is undergoing runtime testing
• Management scripts need to be updated to accommodate packages which must be installed 32/64

Status
Plan (May)
• As shown at the EGEE review
• Problems to move to gLite-3.1 (including ETICS)
• Addressed by the PMB-endorsed “gLite restructuring plan”
[Timeline chart, Nov to Aug: “gLite restructuring plan” (WMS, gLite-CE on SL4 & VDT1.6) and Revised Plan (UI/WN on SL4 & VDT1.6)]
• WN fall back: SL3 code on VDT1.2 on SL4, on PPS
• UI fall back: SL3 code on VDT1.2 on SL4
• WN-3.1 SL4 released, UI very close
• 90+% of all components build
• FTS, DPM, LFC, …: move to SL4 & VDT1.6 independently when they are ready
• Fall-back solutions delivered on time, minimal impact on sites
• Build with 3.1 build system

Progress around networking (SA2)
The EGEE Network Operations Centre (ENOC):
• 65% of EGEE certified sites covered (72% in Europe)
  • Receiving incidents & maintenance notices from NRENs
  • Linking them with detected troubles on the EGEE infrastructure
• Monitoring of the sites’ network availability
  • Web interface presenting the results
  • Data available via HTTP/XML for other usages (Nagios, COD)
  • https://ccenoc.in2p3.fr/DownCollector/
  • Improve the network monitoring within EGEE but performance data still missing!
• LHC Optical Private Network operational model
  • Critical for the reliability of LCG
  • Ongoing formalization of the roles, functions & processes
  • In collaboration with LCG and NRENs
More info and details:
• Dedicated network session on Wed. morning (11:00-12:30)
• https://ccenoc.in2p3.fr/

Progress in SLAs
SLA working group put in place
A draft SLA document has been produced for discussion
• Based on experience in other projects and ROCs
• Some outstanding issues still to be addressed: see discussion at this conference
• Metrics have not yet been agreed
So far mainly addresses agreements between sites and ROCs
Covers:
• Responsibilities (ROCs and Sites)
• Hardware and connectivity
• Services to be provided
• Service hours
• Availability
• Support: general and for VOs
• Service continuity and security
• Service reporting & reviewing

Monitoring landscape
Domain / Monitoring / Tools in use:
• Grid Applications / Application monitoring / Experiment Dashboards, ...
• Grid Middleware (central services, site services) / Grid Services monitoring / Gstat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
• Local resources (site) / Local monitoring / Lemon/SLS, Nagios, Ganglia, ...
3 Monitoring Working Groups

Monitoring working groups
• System Management
  • Fabric management
  • Best Practices
  • Security
  • …
• Grid Services
  • Grid sensors
  • Transport
  • Repositories
  • Views
  • …
• System Analysis
  • Application monitoring
  • …
Goal: improve overall reliability of sites and services

High Level Model

Prototype site implementation

Nagios display

Treemap visualization

Changes to services
SL4
• SL4 deployment in progress
• Will have 64-bit versions of many components
• Need to ensure next ports are not showstoppers (SL5, …)
WMS
• gLite WMS is replacing the old RBs; RBs will not be supported
CE
• Strategy for the CE has been agreed:
  • LCG-CE has been ported to SL4
  • CREAM and gLite-CE were both shown to provide basic performance levels
  • Cannot afford to bring both to production: focus on CREAM, expect a deployable version in early 2008

Service evolution
Pilot jobs & gLexec
• Pilot jobs are a reality (and have been for some time); need to ensure correct audit and/or identity management
• gLexec can be used to authorize users via LCAS/LCMAPS
Job priorities
• Has been an ongoing issue: desire to base priorities on VOMS roles/groups
• Short-term “simple” solution caused many problems and has now been fixed; this seems to be sufficient for ~ next year
• Longer term: re-look at end-to-end authn/authz with real use cases

Grid Operations in EGEE-3
No major changes: consolidation of existing activities
• Overall level of effort is ~25% reduced from EGEE-II
Will continue with the 5 major tasks that we currently have:
• Grid management
• Grid operations & support
• User support
• Operational security
• General and admin tasks
Emphasis on improving reliability, robustness, usability, support
All ROCs will do all key operational tasks
• Operator on Duty
• TPMs and GGUS support effort
• Security coordination (OSCT)
Suppression of some sub-tasks
• Have seen no justified case for regional certification
• Porting tasks are in SA3

Specific areas to address
Monitoring and oversight should evolve towards automation
• Between EGEE-3 and EGI must reduce operations effort (by factor 2?)
• Need to have a plan for automation, alarms, etc.
Service Level Agreements
• Part of the overall effort of QA
• Categorization of sites; different deployment scenarios
Integration of operations with existing and embryonic National Grid Infrastructures
• Transition plan to EGI/NGI; need to understand what NGIs will do
Integrating new VOs into the infrastructure

New VO support
Based on discussion in Stockholm workshop:
Catch-all/regional VOs
• Regions agree to support any VO with users in the region
• All JRUs/NGIs commit a certain fraction of their resources
Pool of additional “seed resources”
• 75 k€ requested for CPU and disk
• To be installed at a maximum of 3 sites who guarantee access and a high level of service to new VOs
VO managers group will identify new VOs eligible for project support
Core services for new VOs assigned to a set of sites that have agreed to provide this: round-robin if no existing relationship
VOs must provide an “ID card”: the full set of information needed by sites

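The round-robin rule for core services can be sketched as follows. This is a minimal illustration only, not the project's actual procedure: the function name `assign_core_sites`, the site labels, and the VO names are all hypothetical.

```python
from itertools import cycle


def assign_core_sites(new_vos, hosting_sites, existing=None):
    """Assign each new VO a site to host its core services.

    A VO with an existing relationship keeps that site; all other VOs
    are spread over the volunteering sites in round-robin order.
    """
    if not hosting_sites:
        raise ValueError("at least one hosting site is required")
    existing = existing or {}
    rotation = cycle(hosting_sites)  # endless site-A, site-B, ... sequence
    assignment = {}
    for vo in new_vos:
        if vo in existing:
            assignment[vo] = existing[vo]
        else:
            assignment[vo] = next(rotation)
    return assignment


# Hypothetical example: three volunteering sites, four new VOs,
# one of which already has a relationship with a site.
sites = ["site-A", "site-B", "site-C"]
vos = ["vo1", "vo2", "vo3", "vo4"]
result = assign_core_sites(vos, sites, existing={"vo2": "site-C"})
print(result)
```

In this sketch a VO with an existing relationship does not consume a turn in the rotation; whether it should is a policy choice the slide does not settle.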
Challenges
Organizational
• Many grids (campus, national, …)
• NGI/EGI
• Devolution: from a central model to a fully distributed one
How will a VO get dependability of services in this scenario?
Usage-related
• Scale
• Reliability
• Usability

Challenges
Middleware
• Complexity of the full distribution
• Time for porting, etc.
• Time to go to production: unrealistic expectations
  • E.g. gLite WMS
• The biggest technical challenge for EGEE?

Summary
EGEE infrastructure has continued to grow
• Sites, resources, usage
Still need to scale workloads by at least x5 in the next year for LHC
Major challenges: reliability, usability, manageability have improved, but not enough
Significant efforts in monitoring to try and help
But must focus on stabilizing what we have and not trying to add too much
Significant progress in the past year
The start-up of LHC will be a major test of the infrastructure