gLite Status Erwin Laure, Deputy EGEE Middleware Manager On behalf of and with contributions from all JRA1
Contents • Integration and Testing • Status of gLite components • Interoperability with OSG/Grid3 • SA1 requirements follow-up 2nd EGEE Conference, Den Haag 2
Integration Overview • Activities split in four main areas: • The Build Servers and the Integration Infrastructure • Quality Assurance • Packaging and Installation • Configuration and service instrumentation • A precise release process is followed as described in the project SCM Plan • https://edms.cern.ch/document/446241 • The guidelines for development, quality and configuration are described in the Developer’s Guide and the Configuration Guidelines Proposal • https://edms.cern.ch/document/468700 • https://edms.cern.ch/document/486630 2nd EGEE Conference, Den Haag 3
Build System • One nightly build server on RH Linux 3.0 • Clean builds out of HEAD every night of all components • Packages (tarballs and RPMS) are published to the gLite web site • Tagged every night and totally reproducible • One continuous build server on RH Linux 3.0 • Incremental builds out of HEAD every 60 minutes • Results published to CruiseControl web site • One nightly build server on Windows XP • Clean builds every night of all components (Java components build, C/C++ not yet) • No results published yet. Goal is to have the clients available on Windows XP for gLite 1.0 • Integration builds are produced every week • Based on developers' tags or nightly build tags • Guaranteed to build • Contain additional documentation (release notes, installation and configuration instructions) • Official base for tests • All packages (tarballs and RPMS) plus installation and configuration scripts are published on the gLite web site
Quality Assurance • Quality assurance tools are integrated in the build system and CVS • Coding guidelines: Jalopy (Java), CodeWizard (C/C++) • Unit tests: JUnit (Java), CppUnit (C++) • Coverage: Clover (Java), gCov (C++) • Reports are not yet published, but will soon be added to the gLite web site. Some of them are currently available from the CruiseControl servers • For the moment we generate only warnings, but we can raise the quality requirements at any time and prevent commits or builds if necessary (to be agreed within the project) 2nd EGEE Conference, Den Haag 5
Packaging and Installation • Source tarballs, binary tarballs and RPMS are automatically generated at every build • MSI packages for Windows will be created soon for those components already building on Windows (all Java components) • External dependencies are repackaged only if really necessary, otherwise we only use officially released RPMS • Python RPM installation scripts are available with every build. They can be used to easily install all components required by a node, with all their dependencies, out of the gLite web repository • Quattor RPM templates are also automatically produced with the build. Currently used internally by the test team • Documentation, installation and configuration scripts form a deployment module • Deployment modules exist in different granularities: service and node
Configuration • A common configuration model has been proposed • Guidelines and prototypes are available • The guiding principles are: • Limit the number of configuration files as much as possible: at the moment typically only two XML configuration files and one script per node are necessary. Also limit the number of environment variables and modifications to PATH and LD_LIBRARY_PATH • Group the parameters by function and scope: three levels are used, user parameters (to be supplied by the sysadmin/user), optimization parameters (default values are provided) and system parameters (better not to touch) • Unify the interfaces and build instrumentation and monitoring into the services from the beginning: we have proposed a single service instrumentation interface with different implementations depending on language and platform. Migration has started, but the issue is still controversial
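The three-level grouping of parameters described above can be illustrated with a minimal Python sketch. The XML layout, the element and attribute names (`config`, `parameter`, `scope`, `default`) and the parameter names are assumptions invented for this example; they are not the actual gLite configuration schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file: element names, attribute names and
# parameter names are illustrative assumptions, not the gLite schema.
EXAMPLE_XML = """
<config service="example-service">
  <parameter name="site.name" scope="user"/>
  <parameter name="pool.size" scope="optimization" default="10"/>
  <parameter name="install.prefix" scope="system" default="/opt/glite"/>
</config>
"""

SCOPES = ("user", "optimization", "system")


def load_parameters(xml_text, user_values):
    """Group parameters by scope and fill in their values.

    User parameters must be supplied by the sysadmin; optimization and
    system parameters fall back to the defaults given in the file.
    """
    root = ET.fromstring(xml_text)
    grouped = {scope: {} for scope in SCOPES}
    for param in root.findall("parameter"):
        name = param.get("name")
        scope = param.get("scope")
        if scope not in grouped:
            raise ValueError(f"unknown scope {scope!r} for {name}")
        if scope == "user":
            if name not in user_values:
                raise KeyError(f"user parameter {name} must be supplied")
            grouped[scope][name] = user_values[name]
        else:
            value = param.get("default")
            # Optimization defaults may be overridden; system values are
            # always left at their defaults in this sketch.
            if scope == "optimization" and name in user_values:
                value = user_values[name]
            grouped[scope][name] = value
    return grouped
```

Calling `load_parameters(EXAMPLE_XML, {"site.name": "CERN"})` returns one dictionary per scope, so an installation tool could prompt only for the user-level entries.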
Release Process • Components (DM, IT/CZ, UK): Developers announce every Friday which components are available for a release according to the development plan. Components are tagged and the list is sent to the Integration Team • Integration Builds (Integration Team): The Integration Team puts together all components, verifies consistency and dependencies, and adds or updates the service deployment modules (installation and configuration scripts). The build is tagged as IyyyyMMdd • Test (Test Teams): The integrated build is deployed in the testbeds and validated with functional and regression tests; test suites are updated, packaged and added to the build • Pre-production (SA1): If the build is suitable for release, release notes and installation guides are updated, the build is retagged (RCx, v.0.x.0) and published on the gLite web site for release to SA1
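The IyyyyMMdd tag convention for integration builds can be sketched in a few lines of Python. The function names are hypothetical helpers for this example; the slides state only the tag pattern itself.

```python
import re
from datetime import date


def integration_tag(build_date):
    """Format a build date as an integration build tag (IyyyyMMdd)."""
    return build_date.strftime("I%Y%m%d")


def parse_integration_tag(tag):
    """Recover the build date from a tag, rejecting malformed tags."""
    match = re.fullmatch(r"I(\d{4})(\d{2})(\d{2})", tag)
    if match is None:
        raise ValueError(f"not an integration build tag: {tag!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)
```

Because the date is written year first with fixed width, sorting the tags as plain strings also sorts the builds chronologically.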
Next Steps • Complete all deployment modules for RC1 • Complete the configuration files and scripts and thoroughly verify that all guidelines are respected • Write full release notes for all services • Write full installation and configuration instructions for all services • Go through a number of verification and feedback iterations with Test Team and SA1 • Release final gLite 1.0 RCx to testing in January 2005
Testing process • Distributed testing testbed across three sites • All run a binary compatible version of Red Hat Enterprise Linux • CERN: SLC3 • NIKHEF: CentOS 3.2 • RAL: Scientific Linux • Deploy and test Integration builds • Automatic installation at all sites using quattor or kickstart • gLite component installation via deployment modules • Configuration using post installation configuration scripts • gLite Testsuites • Build validation run on all RPMs from nightly builds • Functional tests run on distributed testbed • http://cern.ch/egee-jra1-testing-reports/ • Bug reporting - savannah • https://savannah.cern.ch/bugs/?group=jra1mdw • In addition to this structured testing, all components discussed in the following have been tested since May on the prototype installation by application users and developers
Testing status (1/4) • WMS • Successfully deployed a WMS on the testbed using official gLite rpms and following available instructions on Friday 19 Nov • Basic job submission works • Testing using Fake BDII – no information system integration yet • Next integration build to be deployed across all sites • Updating and testing post installation configuration scripts to produce a correct and reproducible deployment of the WMS. • Major bugs: • 5383: The glite-job-* commands do not work with a VOMS proxy. The AC signature is not verified correctly. There appears to be an incompatibility between the information returned from voms-proxy-info and what the WMS expects.
Testing status (2/4) • CE • Successfully deployed with pbs on testbed • Basic job submission via blah to pbs works, no further testing yet • Will deploy CE from next integration build at all sites • L&B • Successfully deployed at CERN • Will test L&B at different site in next integration build • No extensive testing yet (dependency on WMS) • R-GMA • Successfully deployed across the distributed testbed • Under test, no current showstopper bugs 2nd EGEE Conference, Den Haag 12
Testing status (3/4) • gLite IO • Successfully deployed and tested with a castor SRM • Test beginning with dCache SRM at RAL • Performance and stress testing underway • Catalogs, FPS, FTS • Initial test development beginning on prototype testbed • No deployment modules available yet • AliEn components • Extensive testing of job submission on prototype testbed early on • Many bugs reported and solved • Installation on testbed was difficult due to lack of comprehensive instructions • No deployment modules available yet for AliEn components
Testing status (4/4) • VOMS • Still no successful server installation on RHEL • voms-proxy-* client tools installed on testbed • Testing voms proxies with WMS and RH7.3 server • Major bugs: 5582, 5505 • #5505, #5582: voms-proxy-* commands are not backwards compatible with grid-proxy-* commands • #5489, voms proxies created from expired certificates • Package Manager • No testing begun yet • No deployment modules available yet • Accounting • No testing begun yet • No deployment modules available yet 2nd EGEE Conference, Den Haag 14
Updated Schedule for pre-production service • gLite I/O – Available • Logging & Bookkeeping, WMS, CE, WN – In testing – end November • R-GMA – In testing – mid December • CE-Notification – In integration/dev (WMS) – January • Replica, File, Combined Catalog – In development – January • File Transfer Service – In integration – January • File Placement Service – In integration – January • VOMS – In integration/testing – January • UI – In integration – January • AliEn TQ & CE – In integration – see following discussion • Package Manager – Discussions w/experiments, deployment – prototype exists • Grid Access – Prototype exists – discussions on semantics ongoing • Accounting (HLR) – In integration – Prototype exists • Job Provenance – Proof of concept exists • Data Scheduler – In development
Potential Services for RC1 • [Diagram labels: WSDL clients, APIs, CLIs | AliEn shell • GAS prototype • VOMS • R-GMA | AliEn ldap • PM prototype • PKI, GSI, myProxy • AliEn FC • Local RC • GK, Condor-C, Blahp, CEMon | AliEn CE • Generic Interface • DGAS • AliEn SE, glite-I/O, gridFTP, SRM • FTS • FPS | AliEn DS • DS • WMS | AliEn TQ • L&B]
Components for RC1 & Open Issues • Workload management system (WMS) • Task queue and information supermarket (ISM) • Works in push and pull mode • ISM adaptor for CEMon • Query of File Catalog • Computing Element (CE) • Globus gatekeeper, condor-C, blahp to LSF and PBS • CEMon (pull component) • Security, user mapping: LCAS/LCMAPS, DAS • Logging and Bookkeeping (L&B) • Information Service: R-GMA • pre-WS version Blue: deployed on prototype and released to integration and testing Orange: in development 2nd EGEE Conference, Den Haag 17
Components for RC1 & Open Issues • Catalogs • AliEn file catalog, local replica catalogs (on Oracle and mySQL) • Fireman interface • Messaging system for updating the FC • Data Management • File Placement/File Transfer Service • glite-I/O • Data Scheduler • VOMS • Installation on SL3 • DGAS Accounting
Components for RC1 & Open Issues • Package Manager • Grid Access Service (GAS) • Prototype • AliEn • Task queue, CE, SE, shell • What is the impact of Alice deployment?
Impact of Alice deployment • JRA1 has to support the deployment of “prototype software stack” on 30+ Alice sites • Unclear whether this means *all* of the prototype software stack • Prototype is an evolving system with components dumped there but not necessarily fully interworking • Need a precise description of what Alice really wants to deploy • Software will not have gone through full JRA1 integration & testing • Building in build system, but no configuration adaptations and only partial documentation • JRA1 is not a deployment project • Will get additional manpower • Most of the work is supposed to be done by Alice • Any help from SA1, in particular for user support? • Deployment experience will be fed back into the integration and testing process • Additional manpower is expected to improve documentation, packaging etc. while working on the deployment • How much effort from JRA1 integration and testing teams is needed and when?
Priorities for RC1 • With the current manpower it will not be possible to provide all components with the required level of integration and testing • Need prioritization from SA1 and NA4 • Alice deployment will feed back to integration and testing (in particular AliEn components) • Dedicated effort from JRA1 and Alice exists • How much effort from integration and testing teams is needed and when? • Possible scenarios: • Focus JRA1 integration and testing on AliEn components to support deployment • Deployment cannot start “today” • What to do with components that are not a priority for Alice but are for other applications? • Evolution of the LCG-2 based pre-production service will be delayed • Continue delivery to pre-production service as planned • Start Alice deployment without going through full integration and testing • Can start “today” • Dedicated deployment team works on integration and testing • Minimize involvement of integration and testing team • Unclear whether feedback will be in time for RC1 • But RC1.1 can be tagged any time the deployment exercise produces results
Interoperability with other Grids • Interoperability mainly needed at resource level • Same physical resource should be exploitable in different Grids • Approach • Reduce requirements on sites • CE: globus gatekeeper • SE: SRM • Close connection with other projects • OSG • Uses EGEE architecture and design documents as basis for their blueprint • Common members in design teams (probably needs enforcement)
SA1 requirements • What has been done since Cork: • Platform support agreed: For this year, platform support will be mainly oriented to Linux-based 32/64-bit platforms (RH Enterprise 3.x or another binary-compatible distribution based on the sources of RH 3.0, such as Scientific Linux, CERN Linux, etc.). Windows remains a secondary platform • Integration infrastructure: RHES 3.0 and Windows XP • Testing testbed: SLC3 (CERN), CentOS 3.2 (NIKHEF), Scientific Linux (RAL) • Operational requirements defined • Update from SA1 on handling of external dependency requirements (detailed note distributed) • Installation/configuration/release mgt requirements being implemented by JRA1
Main SA1 requirements being implemented • Source and binary RPMs and tarballs (diff packaging formats) being delivered based on component, services and nodes (different granularity). Windows packages will come soon. • Mw configuration presented in integration slides • External dependencies common to all components: • as much as possible only standard, official RPMS • When we need to modify something, the RPMS are installed in the $GLITE_LOCATION/externals directory to avoid conflicting with other existing packages • Relocatable packages: not all yet fully compliant, automatic test suite to check it (see testing web page) • Reduce components needed on WNs: 3 high-level components needed (gLite I/O client, LB client, R-GMA clients) plus the WMS checkpointing library and/or AliEn clients 2nd EGEE Conference, Den Haag 24
In progress • Common administrative interfaces for all grid services • Proposal exists • Standardized error/log messages • Proposal exists • One accounting interface for all services • Needs more discussion with SA1, in particular wrt current accounting • Traceability: logging information and operations API have to allow tracing activities of jobs back to the source • ongoing • Scalability: services deployable in a scalable way • Distributed services (e.g. catalogs), site autonomy • Critical services must be redundant • Reduce the need for ‘single central’ services • Service/site autonomy: avoid single point of failure (timeouts, retries, work locally and resynchronize later) • Site autonomy one of the guiding principles • Exception handling: services have to be prepared to handle non-standard situations in a graceful way • ongoing
Summary • JRA1 intends to have software for release candidate 1 by end of December • An intense integration and testing period follows • The first release of gLite is due at the end of March 2005 • Prioritization needed with SA1 and NA4
Links • JRA1 homepage • http://egee-jra1.web.cern.ch/egee-jra1/ • Architecture document • https://edms.cern.ch/document/476451/ • Release plan • https://edms.cern.ch/document/468699 • Prototype installation • http://egee-jra1.web.cern.ch/egee-jra1/Prototype/testbed.htm • Test plan • https://edms.cern.ch/document/473264/ • Design document • https://edms.cern.ch/document/487871/ • gLite homepage: • http://www.glite.org/ 2nd EGEE Conference, Den Haag 27
Backup Slides • The following slides show details on • SA1 requirements follow-up • the gLite components
SA1 requirements follow-up • What has been done since Cork: several SA1-JRA1 requirement meetings: • Platform support • Operational requirements (new) • Update on handling of external dependency requirements • Contents of this presentation: • Update on the status of the previously defined requirements (presented at Cork) • Presentation and status update of operational requirements 2nd EGEE Conference, Den Haag 29
SA1 requirements • Mw delivery process: • Tarball -> SA1 certification -> JRA1 packages exactly the certified version • Present status: RPMs delivered (tarballs exist, but not tested) • Short term plan: tarball / RPMs / Windows packages • Granularity of delivered components: • Keep the granularity of components and packages as fine as possible • Present status: based on services and nodes • A service is a group of components providing some functionality (WMS, CE, I/O Server, etc) • A node is a set of services to be deployed together
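The service and node granularity above (a service groups components, a node groups services) can be sketched as a small lookup in Python. All service, component and node names below are placeholders chosen for this example; they do not reflect the real gLite package names or deployment modules.

```python
# Hypothetical deployment data: service, component and node names are
# placeholders for this example only, not real gLite package names.
SERVICES = {
    "wms": ["wms-core", "security-libs"],
    "lb": ["lb-server", "security-libs"],
    "io-server": ["io-daemon", "security-libs"],
}

NODES = {
    "wms-node": ["wms", "lb"],
    "io-node": ["io-server"],
}


def components_for_node(node_name):
    """Expand a node into the components of all its services.

    Components shared by several services are listed once, in the order
    in which they are first needed.
    """
    components = []
    for service in NODES[node_name]:
        for component in SERVICES[service]:
            if component not in components:
                components.append(component)
    return components
```

An installation script working at node granularity would walk such a list, while a script working at service granularity would read a single `SERVICES` entry.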
SA1 requirements • Release management: • Quick turnaround for bugs and security patches; bug fixes provided to all versions run by SA1 • Present status: weekly integration builds which are release candidates; only bug fix experience is with the prototype, none with SA1 yet • Short term plan: we need some experience • Deployment scenarios: • JRA1 will deliver deployment recommendations for services as part of a release, and define the minimum running requirements for the entire system • Present status: included in the installation guide 2nd EGEE Conference, Den Haag 31
SA1 requirements • Mw configuration (1): • Keep mw installation and configuration separated • Present status: separated installation and configuration scripts • Mw configuration (2): • Provide simple and tool independent configuration mechanism • Present status: Python configuration scripts • Short term plan: A common configuration method is being finalized for all services (depending on language, platform and technology used). See Integration slides 2nd EGEE Conference, Den Haag 32
SA1 requirements • Mw configuration (3): • JRA1 will provide a standard set of configuration files and documentation with examples that SA1 can use to design tools. Format to be agreed between SA1-JRA1 • Present status: Python configuration scripts • Short term plan: See Integration slides • Mw configuration (4): • Classify the configuration parameters and give sensible default values • Present status: Default values given; no classification of parameters done yet • Short term plan: See Integration slides 2nd EGEE Conference, Den Haag 33
SA1 requirements • Deployment platforms: • For this year, platform support will be mainly oriented to Linux-based 32/64-bit platforms (RH Enterprise 3.x or another binary-compatible distribution based on the sources of RH 3.0, such as Scientific Linux, CERN Linux, etc.). Windows remains a secondary platform • Present status: • Integration infrastructure: RHES 3.0 and Windows XP • Testing testbed: SLC3 (CERN), CentOS 3.2 (NIKHEF), Scientific Linux (RAL)
SA1 requirements • Worker nodes: • Reduce to the minimum the components to be run in the Worker Nodes and make them easily portable • Present status: • Only clients need to be run on WNs, no services • Third party software (1): • Avoid using multiple versions of the same libraries, tools, external programs • Present status: external dependencies are common to all components • as much as possible only standard, official RPMS • When we need to modify something, the RPMS are installed in the $GLITE_LOCATION/externals directory to avoid conflicting with other existing packages 2nd EGEE Conference, Den Haag 35
SA1 requirements • Packaging: • All EGEE software should be relocatable • Present status: Automatic test run nightly to check it. Not all RPMs are compliant yet: • http://egee-jra1-testing-reports.web.cern.ch/egee-jra1-testing-reports/deployment_testing/installation/latest.html • Short term plan: All packages will be made relocatable • Software distribution: • JRA1 will provide a release packaged in the native format of all supported platforms and a tarball. Both source and binaries would be provided • Present status: First component release composed of source and binary RPMs and tarballs • Short term plan: distribute also Windows packages 2nd EGEE Conference, Den Haag 36
SA1 requirements • External dependency handling: • The middleware should indicate all dependencies on external packages • Any external package not supplied with the OS should be made available separately • If a required external package conflicts with the OS, the conflict shall be noted and effort made to remove the conflict. The same applies to external packages that require patches. • Avoid using multiple versions of the same libraries, tools, external programs • Present status: external dependencies are common to all components • as much as possible only standard, official RPMS • When we need to modify something, the RPMS are installed in the $GLITE_LOCATION/externals directory to avoid conflicting with other existing packages • All of this is included in the build system
SA1 operational requirements • Common administrative interfaces: • Grid components should have a common administrative interface (monitoring, alarms; no central entity managing all services, instead minimum basic common set of APIs implemented by all the services) • Status: planned • Service state: • Keep service state in an “external” repository (separated from the service itself). It can be a repository per service, doesn’t need to be a central one (if it is a DB or not, implementation detail) • Status: plans to publish service status into R-GMA 2nd EGEE Conference, Den Haag 38
SA1 operational requirements • Verification of Service Level Agreements (SLA): • Accounting information provided has to be extensible, as the SLAs can change over time. We want the services to keep enough information to verify the SLAs; how we do this is not clear, maybe via a combination of monitoring and accounting • Status: SLA handling unclear at the moment • Standardized error/log messages and files: • Log level has to be adaptable for debugging. Minimal level has to guarantee audit trails. The logging message should be identifiable: originator, time stamp. • Status: proposal worked out
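The log message requirements above (adjustable level, identifiable originator and time stamp) can be illustrated with Python's standard `logging` module. This is only a sketch of the idea; the message layout and the helper name `make_service_logger` are assumptions and do not represent the JRA1 proposal.

```python
import logging

# Assumed layout for this example: time stamp, originator, level, message.
LOG_FORMAT = "%(asctime)s %(name)s %(levelname)s %(message)s"


def make_service_logger(originator, level=logging.INFO, stream=None):
    """Create a logger whose records carry a time stamp and originator.

    The level can be lowered to DEBUG at run time for debugging, while
    the default level still records the events kept for auditing.
    """
    logger = logging.getLogger(originator)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate output on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A service would create its logger once, for example `log = make_service_logger("example-service")`, and an administrator could later call `log.setLevel(logging.DEBUG)` to get more detail without touching the message format.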
SA1 operational requirements • Accounting: • Services need ONE accounting interface • Status: need more interaction with SA1, in particular wrt existing accounting infrastructure • VO policies: • A mechanism to express VO policies, transfer them to the sites and implement them is needed. • Status: Being done in context of CE development 2nd EGEE Conference, Den Haag 40
SA1 operational requirements • Complex resource usage policies: • The middleware has to be able to deal with complex resource usage policies • Status: being addressed in new CE developments • Traceability: • The logging information and the operations API have to allow tracing activities of jobs back to the source (this implies, for privacy and security reasons, that a subset of these APIs needs a strong authentication and authorization mechanism) • Status: ongoing
SA1 operational requirements • Scalability: • Services have to be designed so that they can be deployed in a scalable way. Services should be scalable in a way that is transparent to the user. • Status: work towards distributed services (e.g. catalogs) • Redundancy: • Critical services must be redundant. Critical services need to be quickly restarted and keep most of the state information. State information of services has to be kept in persistent, easy to localize storage • Status: being done in design of WMS, data scheduler, catalogs, etc. 2nd EGEE Conference, Den Haag 42
SA1 operational requirements • Service/site autonomy: • Avoid single points of failure. If a central service fails, it should not affect any of the local services running on the site. When the central service is back, the site reconnects, resynchronizes and continues normal operation, instead of just crashing because the central service died. Need for timeouts and retries. • Status: being done by design of catalogs, etc. Site autonomy is one of the guiding principles • Exception handling: • Services have to be prepared to handle non-standard situations in a graceful way • Status: needs to be done
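The site autonomy behaviour described above (retry, keep working locally, resynchronize later) can be sketched in Python as follows. The class and exception names are hypothetical, and the `send_to_central` callable is assumed to raise `CentralServiceUnavailable` when a call fails or times out; none of this is taken from a gLite service.

```python
import time


class CentralServiceUnavailable(Exception):
    """Raised by the (assumed) transport when a call fails or times out."""


class LocalService:
    """Accepts updates locally even while the central service is down."""

    def __init__(self, send_to_central, max_retries=3, retry_delay=1.0,
                 sleep=time.sleep):
        self._send = send_to_central
        self._max_retries = max_retries
        self._retry_delay = retry_delay
        self._sleep = sleep
        self.pending = []  # updates accepted locally, not yet synchronized

    def _try_send(self, update):
        for attempt in range(self._max_retries):
            try:
                self._send(update)
                return True
            except CentralServiceUnavailable:
                if attempt + 1 < self._max_retries:
                    self._sleep(self._retry_delay)
        return False

    def record(self, update):
        """Never fail because the central service is unreachable."""
        # Queue behind earlier pending updates to preserve their order.
        if self.pending or not self._try_send(update):
            self.pending.append(update)

    def resynchronize(self):
        """Push queued updates in order; stop at the first failure."""
        while self.pending:
            if not self._try_send(self.pending[0]):
                return False
            self.pending.pop(0)
        return True
```

In this sketch the local service keeps answering `record` calls during an outage and a periodic `resynchronize` call drains the queue once the central service is reachable again.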
SA1 operational requirements • VOs: • Adding and removing a VO has to become a lightweight operation. • Status: need specific discussion w/SA1 • Batch systems: • First one: LSF; Second one: being investigated by SA1 (Torque, Maui) • Status: LSF and PBS currently supported
WMS
CE Works in push and pull mode Site policy enforcement Exploit new globus GK and CondorC (close interaction with globus and condor team) CEA … Computing Element Acceptance JC … Job Controller MON … Monitoring LRMS … Local Resource Management System 2nd EGEE Conference, Den Haag 46
Data Management Scheduled data transfers (like jobs) Reliable file transfer Site autonomy SRM based storage
Storage Element Interfaces • SRM interface • Management and control • SRM (with possible evolution) • Posix-like File I/O • File Access • Open, read, write • Not real posix (like rfio) • [Diagram labels: User; Control, SRM interface; POSIX API, File I/O; rfio, dcap, chirp, aio; dCache, NeST, Castor, Disk]
Catalogs • File Catalog • Filesystem-like view on logical file names • Keeps track of sites where data is stored • Conflict resolution • Replica Catalog • Keeps information at a site • (Meta Data Catalog) • Attributes of files on the logical level • Boundary between generic middleware and application layer • [Diagram labels: Replica Catalog Site A and Replica Catalog Site B (LFN, GUID, SURL); Metadata Catalog (Metadata); File Catalog (GUID, Site ID, LFN)]
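The split between a file catalog (logical names and sites) and per-site replica catalogs (physical locations) can be sketched with plain dictionaries in Python. The class names, method names and the example LFN, GUID and SURL values are assumptions made for this illustration; they are not the Fireman interface or any other gLite API.

```python
class FileCatalog:
    """Maps a logical file name (LFN) to a GUID and to the sites holding it."""

    def __init__(self):
        self._guid_by_lfn = {}
        self._sites_by_guid = {}

    def register(self, lfn, guid, site_id):
        self._guid_by_lfn[lfn] = guid
        self._sites_by_guid.setdefault(guid, set()).add(site_id)

    def lookup(self, lfn):
        guid = self._guid_by_lfn[lfn]
        return guid, sorted(self._sites_by_guid[guid])


class ReplicaCatalog:
    """Kept at one site: maps a GUID to the storage URLs (SURLs) there."""

    def __init__(self, site_id):
        self.site_id = site_id
        self._surls_by_guid = {}

    def add_replica(self, guid, surl):
        self._surls_by_guid.setdefault(guid, []).append(surl)

    def replicas(self, guid):
        return list(self._surls_by_guid.get(guid, []))


def resolve(lfn, file_catalog, replica_catalogs):
    """Resolve an LFN to its SURLs, grouped by site."""
    guid, sites = file_catalog.lookup(lfn)
    return {site: replica_catalogs[site].replicas(guid) for site in sites}
```

With a file registered at sites `A` and `B`, `resolve` first asks the file catalog which sites hold the GUID and only then consults each site's own replica catalog, so the physical locations stay under the control of the site.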
Information and Monitoring • R-GMA for • Information system and system monitoring • Application Monitoring • No major changes in architecture • But re-engineer and harden the system • Co-existence and interoperability with other systems is a goal • E.g. MonaLisa • [Diagram, e.g. D0 application monitoring; labels: Job wrapper (x3), MPP (x3), DbSP; MPP – Memory Primary Producer, DbSP – Database Secondary Producer]