Site Migration to WMS
ALICE TF Meeting, 30/10/08
WMS Migration (I)
• Security issues have been found at several sites that are still using the LCG-RB
• The LCG-RB has been a deprecated service for months
• Goal: migration to the gLite 3.1 WMS at all sites
  • Scheduled for the middle of November
  • This is only a medium-term approach while waiting for the CREAM-CE deployment
• Current situation: all sites have migrated except:
  • French confederation: still waiting for at least 1 WMS for ALICE in the country
  • Birmingham: they have confirmed the NIKHEF approach, i.e. access to the local VOBOX restricted to 2 persons (Latchezar and me)
  • Madrid: a WMS will be provided by the 5th of November
  • UNAM: they have confirmed a WMS for ALICE use
  • Kolkata: no news
• In addition: NIKHEF has ensured a WMS for ALICE in November. In the meantime the site has been configured with the SARA WMS
WMS Migration (II)
• The WMS migration requires some tuning of the ALICE submission approach
• We must include a new field in the pilot JDLs
  • This is what we have: RetryCount = 0; (deep resubmission)
  • This is what we still miss: ShallowRetryCount = 0; (shallow resubmission)
• Difference: the resubmission is deep when the job fails after it has started running on the WN, and shallow otherwise
• The lack of this attribute exposed us to a problem at CERN last week:
  • An error in the CE YAIM configuration at CERN mapped all alicesgm users to non-existing accounts
  • By default the ShallowRetryCount value is set to 10, so pilots were resubmitted up to 10 times before aborting
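As an illustration of the JDL change above, here is a minimal Python sketch that writes a pilot JDL with both retry attributes set to 0. Only RetryCount and ShallowRetryCount come from the slide; the helper name `build_pilot_jdl` and the executable name `pilot.sh` are hypothetical, and a real ALICE pilot JDL carries more attributes than shown here.

```python
def build_pilot_jdl(executable, retry_count=0, shallow_retry_count=0):
    """Return a minimal JDL description as text.

    Assumption: only the two retry attributes are taken from the slides;
    the Executable value is a placeholder for illustration.
    """
    attributes = [
        ("Executable", f'"{executable}"'),
        # Deep resubmission: job failed after it started running on the WN.
        ("RetryCount", str(retry_count)),
        # Shallow resubmission: job failed before it started running.
        ("ShallowRetryCount", str(shallow_retry_count)),
    ]
    body = "\n".join(f"  {name} = {value};" for name, value in attributes)
    return "[\n" + body + "\n]"


if __name__ == "__main__":
    print(build_pilot_jdl("pilot.sh"))
```

Setting both values explicitly avoids relying on the service default of 10 shallow retries mentioned above.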
WMS Migration (III)
• The WMS provides jobs with a new feature (not available in the RB):
  • If the required queue is not available, the job does not die. It is kept for a certain (configurable) time and resubmitted (case 1)
  • This is also the case if the WMS is temporarily overloaded (case 2)
• Given the ALICE submission approach, this can create a mess
  • At CERN this time has been configured and reduced to 2h
  • It is working fine
• This configuration is per service, not per VO
  • This is an issue if ALICE shares the WMS with other VOs which have opposite requirements (!)
WMS Migration (IV)
• Case 1
  • Workaround: SAM
  • Implementation of a new test, related to the WMS sensor
  • A dummy job will be sent every 30 min (1h) to each site
  • Once the job is submitted, its status will be checked 10 min later
  • If it is still «waiting», the WMS is most probably overloaded
  • The WMS will then be removed from the VOBOX
• Case 2
  • A «drain flag» definition is foreseen for the WMS
  • In this case, if one WMS is overloaded, the submission will pass automatically to the 2nd WMS defined
  • This holds only if the list of WMS follows a load-balancing approach
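The decision logic of the case 1 test can be sketched as follows. This is only an illustration under assumptions: the function names, the shape of `probe_results` and the literal status string "waiting" are hypothetical, and the real SAM test would obtain the job status from the WMS itself.

```python
def wms_looks_overloaded(job_status, minutes_since_submission, grace_minutes=10):
    """True if the dummy job is still waiting once the grace period has passed."""
    still_waiting = job_status.strip().lower() == "waiting"
    return still_waiting and minutes_since_submission >= grace_minutes


def prune_wms_list(wms_list, probe_results, grace_minutes=10):
    """Return the WMS hosts that should stay configured in the VOBOX.

    probe_results maps a WMS host to (job_status, minutes_since_submission).
    Hosts without a probe result are kept unchanged.
    """
    kept = []
    for wms in wms_list:
        if wms in probe_results:
            status, minutes = probe_results[wms]
            if wms_looks_overloaded(status, minutes, grace_minutes):
                continue  # drop a WMS that is most probably overloaded
        kept.append(wms)
    return kept


if __name__ == "__main__":
    results = {"wms1": ("Waiting", 10), "wms2": ("Running", 10)}
    print(prune_wms_list(["wms1", "wms2", "wms3"], results))
```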
WMS Migration (V)
• The current failover mechanism is not enough for WMS use
  • Use RB1. If it fails…
  • Use RB2. If it fails…
  • Use RB3
• The current mechanism is not able to identify an overloaded WMS
• To exploit the full potential of the drain flag feature we should instead:
  • Use RB1 OR RB2 OR RB3. If all these WMS fail…
  • Use RB4 OR RB5 OR RB6
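The grouped failover proposed above can be sketched in Python. This is a simplified model, not the ALICE implementation: the `submit` callable is a hypothetical stand-in for the real submission, and picking endpoints in random order inside a group is an assumption used here to mimic load balancing.

```python
import random


def submit_with_group_failover(groups, submit, rng=random):
    """Try every endpoint of a group before falling back to the next group.

    groups: list of lists of endpoint names, in priority order.
    submit: callable taking an endpoint and returning True on success
            (hypothetical stand-in for the real submission).
    Returns the endpoint that accepted the job, or None if all failed.
    """
    for group in groups:
        candidates = list(group)
        rng.shuffle(candidates)  # assumption: random order within a group
        for endpoint in candidates:
            try:
                if submit(endpoint):
                    return endpoint
            except Exception:
                # A failing endpoint is treated like a refused submission.
                continue
    return None


if __name__ == "__main__":
    groups = [["RB1", "RB2", "RB3"], ["RB4", "RB5", "RB6"]]
    # Pretend that only the second group is healthy.
    print(submit_with_group_failover(groups, lambda ep: ep in {"RB4", "RB5", "RB6"}))
```

Unlike the strict RB1, RB2, RB3 chain, a whole group is exhausted before the second group is touched, so one overloaded WMS in the first group does not block submission.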
WMS Migration (VI)
• The scheme defined above is now implemented at CERN and in Torino
• LDAP configuration
  • wms1;wms2,wms3;wms4 (1st group: wms1;wms2, 2nd group: wms3;wms4)
• In the VOBOX, this means the following:
  • $HOME/alien-logs/wms103.cern.ch;wms109.cern.ch.vo.conf
• This file looks like:
  [
  VirtualOrganisation = "alice";
  WMProxyEndpoints = {"https://wms103.cern.ch:7443/glite_wms_wmproxy_server","https://wms109.cern.ch:7443/glite_wms_wmproxy_server"};
  MyProxyServer = "myproxy.cern.ch";
  ]
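The mapping from the LDAP value to the per-group configuration file can be sketched as follows. The reading that "," separates groups and ";" separates the WMS inside a group is inferred from the example on the slide; the function names are hypothetical, and the endpoint URL pattern simply repeats the one shown in the file above.

```python
def parse_wms_groups(ldap_value):
    """Split 'wms1;wms2,wms3;wms4' into [['wms1', 'wms2'], ['wms3', 'wms4']]."""
    groups = []
    for group in ldap_value.split(","):
        members = [host.strip() for host in group.split(";") if host.strip()]
        if members:
            groups.append(members)
    return groups


def conf_file_name(group):
    """File name used for one group, e.g. 'a;b.vo.conf'."""
    return ";".join(group) + ".vo.conf"


def render_vo_conf(group, vo="alice", myproxy="myproxy.cern.ch"):
    """Render the configuration text for one group of WMS hosts."""
    endpoints = ",".join(
        f'"https://{host}:7443/glite_wms_wmproxy_server"' for host in group
    )
    return (
        "[\n"
        f'VirtualOrganisation = "{vo}";\n'
        f"WMProxyEndpoints = {{{endpoints}}};\n"
        f'MyProxyServer = "{myproxy}";\n'
        "]\n"
    )


if __name__ == "__main__":
    for group in parse_wms_groups("wms103.cern.ch;wms109.cern.ch"):
        print(conf_file_name(group))
        print(render_vo_conf(group))
```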