gLite Middleware Status Frédéric Hemmer, CERN LCG SC2 October 14, 2005
Outline • gLite processes • Middleware Work plan and Milestones • Baseline services • New versions • gLite usage • Manpower situation • Open questions • Achievements • Concerns & Risks LCG SC2 - October 14, 2005
gLite Processes • Architecture Definition • Based on Design Team work • Associated implementation work plan • Design description of Service defined in the Architecture document • Really is a definition of interfaces • Yearly cycle • Implementation Work plan • Prototype testbed deployment for early feedback • Progress tracked monthly at the EMT • EMT defines release contents • Based on work plan progress • Based on essential items needed • So far mainly for HEP experiments and BioMed • Decide on target dates for TAGs • Taking into account enough time for integration & testing • Integration Team produces Release Candidates based on received TAGs • Build, Smoke Test, Deployment Modules, configuration • Iterate with developers • Testing Team • Test Release Candidates on a distributed testbed (CERN, Hannover, Imperial College) • Raise Critical bugs as needed • Iterate with Integrators & Developers • Once a Release Candidate has passed functional tests • Integration Team produces documentation, release notes and final packaging • Announce the release on the gLite Web site and the glite-discuss mailing list • Deployment on Pre-production Service and/or Service Challenges • Feedback from a larger number of sites and different levels of competence • Raise Critical bugs as needed • Critical bugs fixed with Quick Fixes when possible • Deployment on Production of a selected set of Services • Based on the needs (deployment, applications) • Today FTS clients, R-GMA, VOMS As of the beginning of 2005, the focus has been on essential (simple) services and bug fixing, e.g. FTS, R-GMA
gLite Release Process (diagram) • Stages: Development, Integration, Testing (coordinated with Applications and Operations), Deployment • Labels: Software Code, Deployment Packages, Integration Tests (Pass/Fail), Testbed Deployment, Functional Tests (Pass/Fail), Fix, Installation Guide, Release Notes, etc.
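The pass/fail loop of the release process can be sketched as a small iteration, purely as an illustration. The names `ReleaseCandidate`, `run_release_process`, the two check callbacks and the `max_iterations` cap are hypothetical and not part of any gLite tooling; the only thing taken from the slide is that a failure in integration tests or functional tests sends the candidate back for a fix, and a pass leads to documentation and release notes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReleaseCandidate:
    """Hypothetical stand-in for a release candidate built from developer tags."""
    tag: str
    iteration: int = 1
    log: List[str] = field(default_factory=list)


def run_release_process(
    rc: ReleaseCandidate,
    integration_ok: Callable[[ReleaseCandidate], bool],
    functional_ok: Callable[[ReleaseCandidate], bool],
    max_iterations: int = 5,
) -> bool:
    """Loop over fix iterations until both test stages pass.

    Returns True when the candidate is released, False when the
    (assumed) iteration budget runs out first.
    """
    while rc.iteration <= max_iterations:
        if not integration_ok(rc):
            rc.log.append(f"iteration {rc.iteration}: integration tests failed, fix requested")
        elif not functional_ok(rc):
            rc.log.append(f"iteration {rc.iteration}: functional tests failed, fix requested")
        else:
            rc.log.append(f"iteration {rc.iteration}: passed, write installation guide and release notes")
            return True
        rc.iteration += 1
    return False


# Illustrative run: integration passes from the 2nd iteration, functional tests from the 3rd.
rc = ReleaseCandidate(tag="example-tag")
released = run_release_process(
    rc,
    integration_ok=lambda c: c.iteration >= 2,
    functional_ok=lambda c: c.iteration >= 3,
)
for entry in rc.log:
    print(entry)
```

The sketch makes one point of the slides concrete: every failed stage costs a full iteration with the developers, which is why late or poorly tested tags stretch the time needed to close a release.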
Work Plan in practice • Monthly meeting • Design team – real International collaboration • Focus on mid term Middleware evolution • Gather together Middleware architects, security and operations (recently) • From EU and US (Condor, Globus, …) • Drives the work of Middleware re-engineering in the medium term • Weekly meetings • R-GMA • Between LCG Deployment and EGEE UK team • Focus on fixing major R-GMA deficiencies preventing its use for accounting • gLite Monitoring • NA1, NA4, SA1, JRA1 • Review the progress of the Preproduction service • Review the list of critical bugs • EMT • EGEE JRA1 internal meeting • Cluster heads + security • Follow up critical bugs directly with developers' reps • Establish the planning for new releases • Oversees deliverables • Attempts to coordinate activities • Daily Meetings • FTS • Between LCG SC3 team & EGEE Data Management • Focus on short term FTS issues for SC3
CE Revisited by the Design Team Miron Livny, February 2005: gLite is not “just” a software stack, it is a “new” framework for international collaborative middleware development. Much has been accomplished in the first year. However, this is “just” the first step.
Work Plan & Milestones
gLite Releases and Planning (timeline figure, April 2005 to Feb 2006, with a “Today” marker) • gLite 1.0: Condor-C CE, gLite I/O, R-GMA, WMS, L&B, VOMS, Single Catalog • gLite 1.1: File Transfer Service, Metadata catalog • gLite 1.1.1: Special Release for SC (File Transfer Service) • gLite 1.1.2: Special Release for SC (File Transfer Service) • gLite 1.2: File Transfer Agents, Secure Condor-C • gLite 1.3: File Placement Service, FTS multi-VO, Refactored RGMA & CE • gLite 1.4: VOMS for Oracle, SRMcp for FTS, WMproxy, LBproxy, DGAS • gLite 1.4.1: Service Release • gLite 1.5: Functionality Freeze, Release Date • Quick Fixes shown: QF1.0.12_01_2005, QF1.0.12_02_2005, QF1.0.12_03_2005, QF1.0.12_04_2005, QF1.1.0_05_2005, QF1.1.0_06_2005, QF1.1.0_07_2005, QF1.1.0_08_2005, QF1.1.0_09_2005, QF1.1.0_10_2005, QF1.1.2_11_2005, QF1.1.2_12_2005, QF1.1.2_13_2005, QF1.2.0_14_2005, QF1.2.0_15_2005, QF1.1.2_16_2005, QF1.3.0_17_2005, QF1.3.0_18_2005, QF1.3.0_19_2005, QF1.3.0_20_2005, QF1.3.0_21_2005, QF1.3.0_22_2005, QF1.3.0_23_2005
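To see how the Quick Fixes on this timeline distribute over releases, the identifiers can be grouped with a few lines of Python. This is a sketch under an assumption the slides do not state: that an identifier such as `QF1.3.0_22_2005` reads as "QF", the release it applies to, a running number, and the year. The regular expression and the function names `parse_qf` and `group_by_release` are illustrative only.

```python
import re
from collections import defaultdict

# Assumed layout: QF<release>_<running number>_<year>  (an interpretation, not a documented format)
QF_PATTERN = re.compile(r"^QF(\d+\.\d+\.\d+)_(\d+)_(\d{4})$")

QUICK_FIXES = [
    "QF1.0.12_01_2005", "QF1.0.12_02_2005", "QF1.0.12_03_2005", "QF1.0.12_04_2005",
    "QF1.1.0_05_2005", "QF1.1.0_06_2005", "QF1.1.0_07_2005", "QF1.1.0_08_2005",
    "QF1.1.0_09_2005", "QF1.1.0_10_2005", "QF1.1.2_11_2005", "QF1.1.2_12_2005",
    "QF1.1.2_13_2005", "QF1.2.0_14_2005", "QF1.2.0_15_2005", "QF1.1.2_16_2005",
    "QF1.3.0_17_2005", "QF1.3.0_18_2005", "QF1.3.0_19_2005", "QF1.3.0_20_2005",
    "QF1.3.0_21_2005", "QF1.3.0_22_2005", "QF1.3.0_23_2005",
]


def parse_qf(identifier):
    """Split a Quick Fix identifier into (release, running number, year)."""
    match = QF_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised Quick Fix identifier: {identifier!r}")
    release, number, year = match.groups()
    return release, int(number), int(year)


def group_by_release(identifiers):
    """Map each release to the sorted running numbers of its Quick Fixes."""
    groups = defaultdict(list)
    for identifier in identifiers:
        release, number, _year = parse_qf(identifier)
        groups[release].append(number)
    return {release: sorted(numbers) for release, numbers in groups.items()}


by_release = group_by_release(QUICK_FIXES)
for release, numbers in by_release.items():
    print(release, len(numbers), numbers)
```

Under that reading, the 23 Quick Fixes listed on the slide spread over five base releases, with the largest group attached to 1.3.0.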
6.1 Security Detailed Milestones (From Work Plan)
6.2 Security Detailed Milestones; 7.1 Common Infrastructure Detailed Milestones
7.2 Specific Services Detailed Milestones
7.2 Specific Services Detailed Milestones (II)
7.2 Specific Services Detailed Milestones (III); 8 Additional Services
Baseline Services and gLite Mapping
Other Services • DGAS • Accounting System • AMGA • Generic Metadata catalog (ARDA, PTF, gLite, others) • Will be in the next release of gLite and follow its processes • File Placement Service • Catalog Interactions • No global scheduling/routing (Data scheduler) • G-Pbox • Policy engine • CREAM • Web Services based CE
Development and Release Plan • Development is essentially driven by the Work Plan • But priority is always given to bug fixes • A “Most Critical Bugs” list is reviewed weekly • Involving SA1, Pre-production responsible, ARDA responsible and EGEE Management • There are ~50 Critical bugs • The defect tracking system unfortunately mixes bugs on already released and development software • With gLite 1.4, 50% of the critical bugs are opened by the developers and/or integrators/testers • In practice subsystems evolve differently: • FTS/FPS improvements are jointly decided with the Service Challenges Team • Very detailed work plan available on the LCG SC Wiki • R-GMA improvements are partly driven by experience running on the Production Service and by ARDA feedback, and are discussed on a weekly basis • WMS improvements are driven by various feedback, from production usage, Task Forces, Design Team meetings, etc. • Catalog improvements are essentially driven by usage on the prototype, on the preproduction service, and close contacts with the Bio-Medical and Diligent communities
Development and release plan for future versions: FTS • Developers are part of the Service • Detailed work plan agreed and reviewed regularly between gLite, SC3 and (some) Experiments (TF) • https://uimon.cern.ch/twiki/bin/view/EGEE/DMFtsWorkPlan (wiki page screenshot; sections shown: New Features, Service, General; page dated 07 Oct 2005 - 12:55 - GavinMcCance)
gLite Services used or intended to be used by Experiments • ALICE • WMS, LB, CE • FTS • ATLAS • WMS (WMProxy), LB, CE • G-PBOX (not released) • FTS • CMS • WMS (WMProxy), LB, CE • FTS (Phedex) • R-GMA (CMS Dashboard) • LHCb • WMS, LB, CE • FTS • AMGA • All • VOMS
Plans for Testing and certification of new versions • Deliver updated services • According to work plan • Following usual processes • FTS: First testing on SC3 • All services through the usual processes • JRA1 Integration & Testing • Certification by SA1 (SFT, Certification Testbeds) • Deployment on Pre-production Service • Eventually deployment on Production Service • Additionally other communities contribute to the testing • ARDA • BioMedical Community • DILIGENT • Some services will just be prototyped • CE with WSS • CE with GT4-GRAM – Condor-C • CREAM, G-PBOX
A Note on Integration and Testing • This is a concern area • Integration and Testing are under tremendous pressure • Software is advertised by developers as being fully working • Pressure from applications to have it deployed • Target dates for software tags are not respected • Components usually do not build/execute properly • Documentation (deployment) is usually weak • Takes much longer than originally planned to close a release • In the meantime critical bugs are opened on earlier versions • Requiring Integration and Testing of Quick Fixes • Integration & Testing diversity matrix is increasing • Services using MySQL 4.x, Oracle 9 & 10, Tomcat, gSoap 2.6, 2.7, Axis 1.1 & 1.2 • Castor, Castor2, dCache, DPM SRMs (all complex) • Upgrade testing from previous gLite versions • External dependencies increasing • Currently 97 external dependencies • This leaves very little time for adding new tests
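The growth of the "diversity matrix" can be made concrete with a quick count. The sketch below assumes, for illustration only, that the databases, SOAP stacks and storage systems named on the slide vary independently and that every combination would need a test run; in practice each service uses only some of them, so the result is an upper bound. Tomcat is left out because the slide lists no versions for it.

```python
from itertools import product

# Axes taken from the slide; treating them as independent is an assumption.
databases = ["MySQL 4.x", "Oracle 9", "Oracle 10"]
soap_stacks = ["gSoap 2.6", "gSoap 2.7", "Axis 1.1", "Axis 1.2"]
storage_systems = ["Castor", "Castor2", "dCache", "DPM"]


def build_matrix(*axes):
    """Return every combination of one value per axis."""
    return list(product(*axes))


matrix = build_matrix(databases, soap_stacks, storage_systems)
print(len(matrix))   # 3 * 4 * 4 = 48 combinations
print(matrix[0])     # ('MySQL 4.x', 'gSoap 2.6', 'Castor')

# Adding upgrade testing from, say, two previous versions (hypothetical count)
# multiplies the matrix again.
with_upgrades = build_matrix(databases, soap_stacks, storage_systems, ["fresh", "upgrade-A", "upgrade-B"])
print(len(with_upgrades))  # 48 * 3 = 144
```

Each new supported backend or version multiplies rather than adds to the number of configurations, which is consistent with the slide's remark that little time is left for writing new tests.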
Status of the gLite Middleware: Certification • gLite Certification Testbed • Goal is to certify a pure gLite environment • Prior to deployment in the preproduction service • Reinstalled from scratch with every new release • Installed exclusively according to gLite documentation • Most critical bugs are raised from experience on this testbed • Composed of ~30 nodes providing the following functionality from gLite 1.2 & 1.3 • VOMS • R-GMA • Fireman (Oracle and MySQL) • LCG BDII • A few RBs • 3 CEs • 10 Worker Nodes • FTS and FTA • Mixed LCG/gLite Testbed • Goal is to verify gLite-LCG interoperability (gLite 1.3 – LCG 2.6.0) • Hybrid system which will probably reflect the reality • gLite WMS to LCG CE job submission • gLite WMS interaction with LFC • LCG RB to gLite CE job submission • gLite and LCG CEs sharing the same Worker Nodes • Composed of 8 nodes
Status of the gLite Middleware: Pre-Production Service • Resources • ~200 CPU (October) • Jobs ~1M (submitted to WMS) • CERN: PPS has got access to the Production Cluster via an LSF queue • Currently self-limited to 50 running jobs • Extensible to 1400 CPU • CNAF: Access to Production Cluster in preparation, planned for the beginning of October • Then ~150 slots will be available • Among the major computing centres, CERN and CNAF are the only ones currently granting access to production facilities • Site table (columns: #CPUs, #Job Submit; values as extracted): IN2P3, Lyon na; FZK, Karlsruhe na na; UOA, Athens; UOM, Thessaloniki 2; UPATRAS, Patras 3 50194; CNAF, Bologna 4 423289; NIKHEF, Amsterdam na; CYFRONET, Krakow na; LIP, Lisbon 2; CESGA, S. de Compostela 2; IFIC, València na; PIC, Barcelona 3 30863; CERN, Geneva 50 546853; ASGC, Taipei na • Source: A. Retico, N. Thackray, Joint Ops Workshop, Sep. 27, 2005
Status of the gLite Middleware: Pre-Production Service • Core Services (site map figure) • Service categories: Work Flow Management, VO Management, Information System, Catalogues, Authentication, Data Management • Services shown: WMS + LB, VOMS, BDII, R-GMA, MyProxy, FTS, Fireman(My), IO(DPM), IO(castor) • Sites: IN2P3, Lyon; FZK, Karlsruhe; UOA, Athens; UOM, Thessaloniki; UPATRAS, Patras; CNAF, Bologna; NIKHEF, Amsterdam; CYFRONET, Krakow; LIP, Lisbon; CESGA, S. de Compostela; IFIC, València; PIC, Barcelona; CERN, Geneva; ASGC, Taipei • NIKHEF • Currently phasing out from the PPS activity • Only available VOMS service for a long time • They introduced the “Star Trek VO” concept • So … Thanks • ASGC • Joining the PPS • They start running a WMS + LB service • Welcome! • Source: A. Retico, N. Thackray, Joint Ops Workshop, Sep. 27, 2005
Status of the gLite Middleware: Pre-Production Service • Monitoring information from various tools is collected in R-GMA archiver • Summary generator calculates overall status of each monitored object (site, CE, ...) - update: 1h • Metric generator calculates numerical value for each monitored object + aggregation (CE → site → region → grid) - update: 1 day • Source: P. Nyczyk, Joint Ops Workshop, Sep. 27, 2005
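The aggregation chain CE → site → region → grid can be illustrated with a short recursive roll-up. The slide does not say which aggregation function the metric generator uses, so the arithmetic mean below is an assumption, as are the function name `aggregate`, the nested-dictionary layout and all site, region and CE names.

```python
from statistics import mean


def aggregate(node):
    """Roll a metric up a hierarchy.

    A leaf (a CE) is a plain number; any other node is a dict of children
    whose aggregated values are averaged. Using the mean is an assumption,
    the actual metric generator may weight or combine values differently.
    """
    if isinstance(node, (int, float)):
        return float(node)
    return mean(aggregate(child) for child in node.values())


# Hypothetical grid: region -> site -> CE -> metric value in [0, 1]
grid = {
    "region-A": {
        "site-1": {"ce-1": 1.0, "ce-2": 0.5},
        "site-2": {"ce-3": 0.0},
    },
    "region-B": {
        "site-3": {"ce-4": 1.0},
    },
}

print(aggregate(grid["region-A"]["site-1"]))  # site level:   0.75
print(aggregate(grid["region-A"]))            # region level: 0.375
print(aggregate(grid))                        # grid level:   0.6875
```

With an unweighted mean, a single failing CE at a small site pulls its region down as much as a whole large site would, so a production metric would likely weight by capacity; that refinement is outside what the slide describes.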
Status of the gLite Middleware: Production and Service Challenges (FTS, VOMS) • FTS clients have been distributed as of LCG 2.6.0 • VOMS has been distributed as of LCG 2.6.0
Manpower situation: Partners
Manpower Situation: May 2005
EGEE-II • New Activity SA3 merging • SA1 certification • JRA1 Integration and Testing • most of JRA1 Data Management • JRA1 Leader and deputy reassigned to other functions • JRA1 Leadership transferred to INFN • SA1 & JRA1 CERN will merge under Ian Bird’s Leadership • Technical Coordination Group • Includes representation of the Experiments Task Forces
Open questions • Many foreseen software improvements could not be done • Common logging • Major changes in existing components • Third party software would not use this anyway • Common configuration • Proposal made, reluctance from development • Understaffing of integration team • Diversity of languages • Need one solution per language • End to end coherent data security • Understood and planned, but not yet released • Difficult to balance site concerns, user wishes and third party packages • EGEE-II reorganization • Not yet clear how the merging of LCG certification and EGEE Testing & Integration will happen • Not clear how effective the new JRA1 Activity will be • Not clear how well the new Technical Coordination Group will work • What will happen to the Design Team?
Achievements • gLite releases • Are now regularly taking place, including documentation • Maintenance mechanisms in place (Critical Bugs, QFs) • Including prioritization (driven by the main customers) • A software process is in place • Ironing out software defects before releases • FTS • Very close collaboration between the development and the service challenge teams • FTS deployed and new requirements are handled quickly • R-GMA • Although not yet 100% satisfactory, focused regular meetings have helped improve R-GMA reliability • Design Team • Small international group of competent people understanding each other • Task Queue, Condor-C integration in WMS, Storage Index, Data & Job Management security models, WSS, future VO scheduler, etc. • VDT • VOMS, LB, CEMon are scheduled (using NMI processes) • Collaboration in particular with University of Wisconsin/Madison • Not only Condor, also NMI, relations with OSG, etc. • Significant (not reported) manpower dedicated to gLite related issues
Concerns and Risks • Technical • VOMS • Is regularly the cause of delays in releasing • VOMS admin does not seem to be staffed from the project • Risk that VOMS might be replaced by something else • RFIO from Castor & DPM • Prevents gLite I/O from talking to both SRMs • Makes it difficult for people to switch between production data and PPS • Leads to frustration using the Grid solution; some may be tempted to develop their own solutions • Fireman vs. LFC • Perceived as LCG vs. EGEE syndrome • Risk that non-HEP will lose confidence • Yaim vs. gLite deployment modules • Perceived as LCG vs. EGEE syndrome • Will be discussed at the next EGEE conference • Integration/Test process perceived as too slow • Not true: the delivered software quality is pretty low, requiring many iterations with developers and many testing cycles, and staffing is low • Testing cluster is the heaviest user of Quattor at CERN • But prevents operations and others from hitting the same problems • Risk that Developers will try to bypass formal processes
Concerns and Risks (II) • Security • Has recently become a problem – but security is at the heart of many services • Delays in delivering libraries and code • E.g. delegation, g{s,l}exec • Causing related delays with other services relying on these libraries • Somewhat strange, as this activity was working very well at the beginning of the project • Managerial • No way to reassign personnel to work on priorities • Communication with partners • And sometimes within partners • Some weekly activity reports are very weak • Do not reflect what is happening • The most effective way to work is through escalation • Causing (sometimes) unneeded stress and frustration • Prototype understaffed • Causing it to be often not functional • Testing/Integration understaffed (to fulfill their duties) • See related slide • People at the limit of burnout even before LHC starts • Multiple reporting lines • Causing a lot of overhead
Concerns & Risks (III) • Partners • Work is going on which is never really reported upon, but which gets “sold” to others • E.g. CREAM, G-Pbox, SGAS • Sometimes this work is in direct competition with other parts of the project • Causing frustration and an unclear message for potential users • Agreed work is sold to experiments before it is released • E.g. WMproxy, G-Pbox • Causing frustration and a bad reputation for the software stack • Developers tend to report to their line management, not to the project • Weakens the project as a collaborative effort with common goals • Reduces the overall efficiency • New code is not (always) exposed on the “prototype” testbed • Contributes (sometimes) to the delivery of poorly tested code • Too much liberty in choosing external dependencies • 97 external dependencies, this surely could be reduced • Software stack too complex, difficult to debug and manage • Lack of (Unit) Testing • Resulting in poor software quality before integration • Causing additional delays in delivery
LCG 2Q05 report • Architecture and Design documents have been delivered • Interim Release on prototype has been late • PPS unlikely to deploy it • Refocused work on fixing Critical bugs • gLite 1.2 will be produced in July 2005. • There will be a gLite 1.3 beginning of August • FTS with multi-VO support • WMproxy with bulk job submission support • Incomplete Milestone 1.5.2.15 has been completed • File Transfer Service • Metadata Catalog • Not the jointly agreed PTF, ARDA, etc… • “Private” releases to address FTS issues observed from Service challenges
LCG 2Q05 (II)
Comments to the LCG 2Q05 report “components for workload management, in particular the computing element (CE), the workload management system (WMS), the logging and bookkeeping system (LB), the DGAS accounting system, as well as on the VOMS and proxy-renewal security components.” It would be nice to see some of the details of what these components are. This is referring to the effort INFN has provided. Let me detail: • CE: Two areas of work. The first is essentially integrating Condor-C and BLAH (Batch Local Ascii Helper Protocol), the interface by which Condor-C on the CE side interfaces to batch systems such as LSF or PBS (not Condor). This work is on-going as new requirements or problems are found. I can expand if you want. Note this work is integrated back in the official VDT distribution. The other area of work is CREAM, a completely new Web Services based CE, but we are not using this. • WMS: The workload management system has been evolved to support Condor-C submission, to interface to R-GMA, and to provide a Web Services based interface to bulk job submission (collection, parametric jobs, etc.). It also provides multi-catalog support (LFC/DLI, Fireman/SI). This work should now be distributed as part of gLite 1.4. User description can be found in the updated manual pages at: https://edms.cern.ch/document/590869/1 • Logging and Bookkeeping: Logging and Bookkeeping has been evolved to improve performance and to provide a Web Services interface dubbed LBproxy. LB is the mechanism used by the WMS to keep track of jobs during their lifetime but is also a more general mechanism that can be used by other components to keep state information. LB is scheduled to be part of VDT. Documentation can be found at https://edms.cern.ch/file/571273/1/LB-guide.pdf. • DGAS: DGAS is a distributed accounting system which originates from EDG but was never installed on LCG. It was agreed with SA1 to evaluate it.
DGAS documentation is available at https://edms.cern.ch/file/571271/1/EGEE-DGAS-HLR-Guide.pdf, https://edms.cern.ch/file/571271/1/EGEE-DGAS-Gianduia-guide.pdf and https://edms.cern.ch/file/571271/1/EGEE-DGAS-PA-Guide.pdf. DGAS is being coordinated with APEL and will feed records to APEL. • VOMS: VOMS has been evolved to comply with relevant RFCs, correct bugs and provide Oracle support as requested by LCG. VOMS will be part of VDT and is already being built as part of NMI. • Proxy-renewal: The mechanism to renew proxies that are about to expire for running jobs has been provided in BLAHP on the Condor-C CE side. This will also be part of the VDT distribution. I can’t find this document: http://egee-jra1.web.cern.ch/egee-jra1/Bugs.doc • The page was moved to http://cern.ch/egee-jra1/glite/Bugs/Status.htm
Comments to the LCG 2Q05 report (II) • 1. The WMS efficiency had been improving during the quarter, but now looks to be falling again. See http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php. It is important to somehow decouple intrinsic middleware performance tests, but these were on a very small scale as reported in the quarterly report. Has the scale of these WMS tests (decoupled from deployment tests) improved? Do you have a target for the middleware itself? What is the current status and what are your targets (in terms of tested delivery to the deployment area)? These statistics are really for GD to comment on. They do not always reflect the reality as they mix very different problems. The testing activity reports on success rates based on storms of 1000 jobs. With RetryCount=3, the failure rate is 0.14%. Also the LCG Test suite is run against the release (successfully). There are no performance targets yet; testing is rather functional testing with the resources we have. I fully agree that performance targets would be very useful, but so far we have tried to get deployment, basic functionality and regression tests delivered instead. • 2. Also relating to WMS and the resources associated to a known problem area... The JDL from the job is not passed to the local system. Hence there is no way for the local scheduler to use info from the Grid scheduler. This is a limitation from a (shared) site viewpoint (attempting to balance Grid and local jobs). The short term solution is to set up separate batch queues. (It is not a limitation for the experiments (affects efficiency)). It is noted as a requirement and it is intended that this will be delivered in Year 2 of JRA1 for the WMS. What is the status of this part of the development? (It seems to me this is important not just for EGEE but also for LCG as [shared] Tier-2 sites start to be incorporated into Service Challenges etc.) This is referring to 7.2.8, forward requirements to local batch systems.
There were many discussions on the rollout list, but no real agreement has been reached yet on this (difficult) subject. It is not obvious how to forward something consistent to different batch systems. There has been a recent evolution at the last HEPiX meeting, where WMS developers started discussions with system administrators (see http://www.slac.stanford.edu/conf/hepix05/talks/wednesday/prelz/). A conclusion is that some of the requirements could be implemented by the November time scale (developer estimates). • 3. The current Bugs list is excellent as a live document giving an idea of where the priorities lie. The problem (for me) is that having read it I am left unclear as to how bad the problems may be and/or who is working on them etc. You might categorise them by adding two columns 1. (perceived) depth of problem 2. resources associated. This can then be used to assess/demonstrate where you anticipate most work is needed in the short term. All these problems are critical. The PPS people consider for example (and correctly IMHO) lack of documentation as critical. These are reviewed on a weekly basis and non-critical problems are downgraded. It would be very difficult and tedious to track the resources. Additionally, as it is not easy to move effort around, I wonder about the real usefulness of this. I could provide some more subjective information in the quarterly report, but it will then become very detailed. Would a “top 10 problems” be acceptable?
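The answer to question 1 quotes a 0.14% failure rate for storms of 1000 jobs with RetryCount=3. A short calculation shows what that figure could imply per attempt. It rests on assumptions that are not in the report: that attempts fail independently with the same probability, and that RetryCount=3 allows up to four attempts in total (one submission plus three retries); the three-attempt reading is computed as well. All function names are illustrative.

```python
def overall_failure(per_attempt, attempts):
    """Probability that every one of `attempts` independent tries fails."""
    return per_attempt ** attempts


def per_attempt_failure(overall, attempts):
    """Invert overall_failure: per-attempt failure probability implied by the overall rate."""
    return overall ** (1.0 / attempts)


def expected_failures(jobs, overall):
    """Expected number of finally failed jobs in a storm of `jobs` jobs."""
    return jobs * overall


OBSERVED = 0.0014  # 0.14% from the report
STORM_SIZE = 1000  # jobs per storm, from the report

print(expected_failures(STORM_SIZE, OBSERVED))  # about 1.4 failed jobs per storm

for attempts in (3, 4):  # two possible readings of RetryCount=3 (assumption)
    p = per_attempt_failure(OBSERVED, attempts)
    print(attempts, round(p, 3), overall_failure(p, attempts))
```

Under these assumptions a sub-percent overall failure rate is compatible with a per-attempt failure probability in the region of ten to twenty percent, which illustrates why the answer cautions that such statistics mix very different problems.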
LCG 3Q05 report • Test plan delivered as scheduled • https://edms.cern.ch/document/473264 • Additional gLite releases have been produced • gLite 1.3 now including • Bug fixes • FTS multi-VO support • Secure Fireman catalog (Oracle and MySQL) • File Placement Service • Not including WMproxy • gLite 1.4 • Bug fixes • FTS SRMcp support • WMproxy • DGAS Accounting
LCG 3Q05 (II)
Summary • Although many concerns still exist • In particular Integration and Testing • gLite releases have been produced • Tested, Documented, with installation and release notes • And subsystems used on • Service Challenges • Pre-Production Services • Production Service • And other communities • DILIGENT, BioMedical • gLite processes are in place • Closely monitored by various bodies • Hiding many technical problems from the end user • It must be realized however that deploying newly released code takes some time; no risk should be taken of breaking production