190 likes | 310 Views
ALICE job submission system - use and status of the CREAM-CE. Patricia Méndez Lorenzo (IT/GS) ALICE Offline Week (18th March 2009). Introduction. ALICE is interested in the deployment of the CREAM-CE service at all sites which provide support to the experiment
E N D
ALICE job submission system - use and status of the CREAM-CE Patricia Méndez Lorenzo (IT/GS) ALICE Offline Week (18th March 2009)
Introduction • ALICE is interested in the deployment of the CREAM-CE service at all sites which provide support to the experiment • GOAL: Deprecation of the WMS use in benefit of the direct CREAM-CE submission • WMS submission mode to CREAM-CE not required • ALICE has began to test the CREAM-CE since the beginning of Summer 2008 into the real production environment • For the time being, ALICE is the only LHC experiment performing stress and real tests to the CREAM-CE This talk will focus on the ALICE experiences using CREAM-CE, the expectations, future plans and requirements for all the sites ALICE Offline Week -- CREAM-CE Use and Status for ALICE
The CREAM-CE • CREAM (Computing Resource Execution And Management) lightweight service for job management operations at the CE level • Called to be the replacement of the current LCG-CE • Submission procedures allowed by CREAM: • Submissions to CREAM via WMS • Via generic clients which allow direct submission • The submission method depends basically on the experiment computing model • Normally pilot based follows the direct submission mode approach (4 LHC experiments) • Bulk submissions of real jobs follows the WMS submission approach (CMS) ALICE Offline Week -- CREAM-CE Use and Status for ALICE
Direct Submission to CREAM-CE • Extra elements required for direct submission • Proxy renewal mechanism (required by CMS and ATLAS) • Responsible to automatically renew the user proxy if expiring • Already (recently) available • The lack of this element is not a showstop for ALICE • 48h voms extensions ensured by the security team@CERN • Enough to run production/analysis jobs without any addition extension ALICE Offline Week -- CREAM-CE Use and Status for ALICE
The 1st test phase • Performed in summer 2008 at FZK (T1 site, Germany) • Tests operated through a second VOBOX parallel to the already existing service at the T1 (operating in WMS submission mode) • Access to the local CREAM-CE was ensured through the PPS infrastructure • Initially 30 CPUs • Moved to the ALICE production queue in few weeks (production setup) • Intensive functionality and stability tests from July to September 2008 • Production stopped to create and ALICE CREAM module into AliEn and to allow the site to upgrade their system • Excellent support from the CREAM-CE developers and the site admins • Specially Massimo Sgaravatto (INFN-Padova) and Angela Poschlad (GridKa T1 site) 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 5
Results of the 1st test phase Running on the production queue Running on PPS nodes More than 55000 jobs successfully executed through the CREAM-CE in the mentioned period No interventions in the VOBOX required in the testing phase CREAM-CE used to distribute real (standard) ALICE jobs 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 6
Implementation into AliEn (I) • Creation of a new CREAM module • Specific for CREAM-CE submissions • Available since AliEn v2-16 • In parallel with the usual LCG module (restricted to WMS submissions only) • Change on the jdl construction • The current ALICE jdl contained the outputsandbox field which specifies the standard outputs of the job agents • CREAM-CE requires a new jdl field which declares the gridftp server where to retrieve the standard outputs • ALICE PROCEDURE: to remove the outputsandbox field of the jdl files created by the CREAM module • Only available in case of submission in debug mode 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 7
Implementation into AliEn (II) • gridftp server is required • Required to retrieve the standard outputs of the job agents • Sites are free to decide ist implementation (proposal: VOBOX) • 200 GB of space required • It will be used ONLY if the submission has been done in debug mode • Change on the proxy renewal mechanism • Submision optimization purpose • The user proxy will be renewed only once per hour • In previous AliEn version this procedure was executed BEFORE each agent submission • The procedure has been implemented ALSO in LCG.pm 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 8
The 2nd test phase • After a debug phase of the CREAM module in January 2009, the new CREAM module in production the 19th of February (2nd testing phase started) • Stability and performance are currently the most important test issues at the sites providing CREAM-CE • The deployment of a 2nd VOBOX ensures that the production will continue on parallel through the WMS • A unique VOBOX would require a dedicated babysitting of the system (not realistic) • Feedback of all issues are directly provided to the CREAM developers • As of today, 11 sites are providing CREAM CE 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 9
Site queues Status of the queues 2nd VOBOX VOBOX with clients General Status 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 10
Status of the sites (I) • FZK • Minor actions required during the 2nd phase test • Delete some sandbox directories (hitting file limit again 32K subdirs) • Procedure not neccessary in the next CREAM versions • 46530 jobs since the 19th of Feb through the FZK CREAM-CE • RAL • No special actions reported by the site for service maintenance • 2678 jobs executed using the local CREAM-CE • Kolkata • Debugging phase performed directly with the developer (Massimo Sgaravatto) • In production from 9th of March Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 11
Status of the sites (II) • CERN • Two CEs have been provided the 9th of March to ALICE for testing • In production since the 10th of March (voalice03 used for this production) • SLC5 WNs behind the CREAM-CE • 17247 jobs since the 10th of March • GSI • Still pending the setup of a 2nd VOBOX • The CREAM-CE performing well • CNAF • CREAM-CE ready to enter production at the end of February • After some instabilities observed last week (lack of automatic purge, entered the production back the 13th of March) • Info provider of the CREAM-CE showing certain instabilities Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 12
Status of the sites (III) • KISTI • Instabilities at the VOBOX level prevents the full setup of the local CREAM-CE in production • CREAM-CE system performing well • ATHENS • The CREAM-CE is working but the site cannot be put in production • No CREAM clients on the VOBOX • IHEP • CREAM-CE is not working yet (siter admin working on) • Missing infrastructure - no 2nd VOBOX (it will be provided next week) Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 13
Status of the sites (IV) • SARA • System tested yesterday evening with some few jobs • Still in testing phase • Torino • System in production since last week • Already 744 jobs executed through the local CREAM system • Subatech • 2nd vobox already provided, the setup of the CREAM-CE is ongoing Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 14
Reminder: How to provide CREAM-CE services for ALICE • During the last October pre-GDB meeting it was explicitly mentioned: • Unlikely to be deployable as an lcg-CE replacement on this timescale (downtime period), but we can continue with rollout in parallel. • In addition during the November pre-GDB meeting it was concluded: • The lcg-CE replacement will required the WMS submission in place and the resolution of the proxy renewal issue (among more other points related to the service performance) • It was encouraged however the deployment of the system in parallel to the LCG-CE Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 15
Reminder: How to provide CREAM-CE services for ALICE (II) • The parallel LCG-CE vs. CREAM-CE setup in terms of ALICE computing model means the deployment of a 2nd VOBOX • Each VOBOX is able to submit to a specific backend • One VOBOX LCG-CE OR CREAM-CE submission: replacement approach • Two VOBOXES LCG-CE AND CREAM-CE submission: parallel approach • This is a temporary solution during the parallel running phase • As soon as the replacement is ensured and the LCG-CE is deprecated ALICE will not required a 2nd VOBOX • Remarks for the 2nd VOBOX deployment • Its setup is not sign with blood • Each case can be studied individually • BUT! Sites with important Storage capability for ALICE should be included in the list of sites providing a 2nd VOBOX Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 16
Reminder: How to provide CREAM-CE services for ALICE (III) • Setup of the ALICE production queue behind the CREAM-CE • This procedure puts the CREAM-CE directly in production • GridFTP server • Required to retrieve the job (agent) outputs • Removed from the VOBOX in January 2008 with the deployment of the gLite3.1 VOBOX • It was not longer required by the 4 LHC experiments at that time • No specific wish for the placement of this service • It can be provided into the VOBOX but this site decision Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 17
Future Plans • Small changes in the CREAM module are still needed • The current implementation of the CREAM-CE via CLI allows the declaration of a single queue only • Sites can provide several queues per site (moreover T0/T1 sites) • The implementation of submission to several queues must be done to the application level • PROPOSAL for ALICE (in 3 lines of code): • Definition of a range per queue at the LDAP level • Calculation of a random number before each agent submission • Assignment of a queue based on the random number/range matchmaking Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 18
Conclusions • The ALICE experience with the current CREAM-CE service is very positive • Stable (and maintenance-free) operation is achieved quickly after the initial debugging period • High performance and scalability (FZK 2000+ parallel jobs) served by a single CREAM-CE • Excellent support provided by the developers • Special thanks to Massimo Sgravatto (INFN Padova) • ALICE is working with all sites to install a CREAM-CE • In full production before start of data taking Site queues 18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE 19