Preliminary tests with co-scheduling and the Condor parallel universe
Marian ZUREK for WP2
ETICS All Hands meeting, Bologna, October 23-25, 2006
What's the …
• Context
• Use case
• Past
• Condor / NMI setup
• Results
• gLite-specific issues
• Next steps
• Discussion
Context
• Testing of the gLite software stacks is highly manual, which motivated automating the process and making it easily reproducible
• In the future, the system tests should become part of the release process (reports stored in the DB, easily accessible for trend analysis, performance studies, bug reproduction, etc.)
Service Overview
[Architecture diagram: clients (via browser, command-line tools, NMI client wrapper), Web Application, Web Service, NMI Scheduler, WNs, Build/Test Artefacts, Report DB, Project DB, ETICS Infrastructure, Continuous Builds]
Use case for gLite
• Six services have to be deployed on six different nodes: UI, CE, WMSLB, VOMS_mysql, RGMA, WN
• There are interdependencies between them:
  • UI: [RGMA, VOMS_mysql, WMSLB, CE, WN]
  • CE: [WMSLB, WN]
  • WMSLB: [VOMS_mysql]
  • VOMS_mysql: [RGMA]
  • WN: [CE]
  • RGMA: []
• No auto-discovery is possible: the order of service start-up must be preserved and the run-time environment defined (see the ordering sketch below)
• Successful submission of a "real job" requires all of the services to be operational
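To make the start-up ordering concrete, below is a minimal Python sketch (not the ETICS implementation) that derives a start order from such a dependency map using Kahn's algorithm. For illustration it uses the reduced, acyclic map later proposed in flow-spec.yaml (CE depending only on WMSLB, WN omitted), since the full map above lists CE and WN as depending on each other and therefore admits no strict linear order; function names are illustrative.

# start_order.py -- illustrative sketch only, not the ETICS code.
# Derive a service start-up order from a dependency map (Kahn's algorithm).

def start_order(deps):
    """deps maps each service to the list of services it depends on."""
    remaining = {svc: set(d) for svc, d in deps.items()}
    order = []
    while remaining:
        ready = sorted(svc for svc, d in remaining.items() if not d)
        if not ready:
            raise ValueError("circular dependency among: %s" % sorted(remaining))
        for svc in ready:
            order.append(svc)
            del remaining[svc]
        for d in remaining.values():
            d.difference_update(ready)
    return order

if __name__ == "__main__":
    # Reduced map from flow-spec.yaml (acyclic); the CE/WN pair from the slide
    # would have to be brought up together instead of being ordered.
    deps = {
        "UI": ["RGMA", "VOMS_mysql", "WMSLB", "CE"],
        "CE": ["WMSLB"],
        "WMSLB": ["VOMS_mysql"],
        "VOMS_mysql": ["RGMA"],
        "RGMA": [],
    }
    print(start_order(deps))   # ['RGMA', 'VOMS_mysql', 'WMSLB', 'CE', 'UI']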
Back to the past
• The gLite software stack requires administration rights on the target node, so a root-enabled schema has been developed to address this
• root-enabled jobs should be run only on predefined sets of hosts
• The service installation should reflect its operational status by writing to a file, e.g. /etc/nmi/publish_services.list:
  • runs_VOMS_server="true", timeOut=3600
  • runs_RGMA_server="true"
• The timeOut (expressed in seconds) defines the service operational time; after the timeOut the node is released. If no timeOut is given, the machine is released immediately after the job has finished (see the parsing sketch below)
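A minimal sketch of how a node-side script could read the publish file and keep the node alive for the requested time. The key/value syntax is inferred from the two example lines above, and the function name is purely illustrative; this is not the actual root-enabled schema code.

# publish_services.py -- illustrative sketch only.
# Parse /etc/nmi/publish_services.list and keep the node up for timeOut seconds.
import re
import time

PUBLISH_FILE = "/etc/nmi/publish_services.list"   # path taken from the slide

def parse_publish_file(path=PUBLISH_FILE):
    """Return (services, timeout): services the node runs and the requested
    operational time in seconds (0 = release immediately after the job)."""
    services, timeout = [], 0
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # e.g. runs_VOMS_server="true", timeOut=3600
            for key, value in re.findall(r'(\w+)\s*=\s*"?([^",\s]+)"?', line):
                if key.startswith("runs_") and value.lower() == "true":
                    services.append(key[len("runs_"):])
                elif key == "timeOut":
                    timeout = max(timeout, int(value))
    return services, timeout

if __name__ == "__main__":
    services, timeout = parse_publish_file()
    print("node publishes:", services, "- keep alive for", timeout, "s")
    time.sleep(timeout)   # node is released once the timeout expires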
Condor / NMI setup
• Experiment performed on a predefined set of nodes:
  • special STARTD expressions define the Condor VMx availability
  • the nodes remain available for regular submissions
  • synchronisation via Condor chirp messages (see the sketch below)
• Custom scratching mechanism (outside NMI/Condor):
  • watchdog style (an external process monitors the node's "limbo" state)
• Initial trouble with lost/stuck jobs was resolved with extra wait time
• Node down-time < 10 minutes
• Very good candidate for virtualisation, as no re-installation is needed (a simple VM restart is enough)
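For illustration, a minimal sketch of one way chirp-based synchronisation between parallel-universe nodes could look: each node publishes "my service is up" as a job attribute and polls for the attributes of the services it depends on. The ServiceUp_<name> attribute scheme and the helper functions are invented for this sketch, and it assumes condor_chirp is usable from inside the job; this is not the code referenced on the slide.

# chirp_sync.py -- illustrative sketch only.
import subprocess
import time

def chirp(*args):
    """Run condor_chirp and return its stdout (empty string on failure)."""
    try:
        return subprocess.run(["condor_chirp", *args],
                              capture_output=True, text=True,
                              check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return ""

def announce_up(service):
    # Publish that this node's service is operational (attribute name is illustrative).
    chirp("set_job_attr", f"ServiceUp_{service}", "True")

def wait_for(dependencies, poll=10, timeout=3600):
    # Block until every dependency has announced itself, or give up at the deadline.
    deadline = time.time() + timeout
    pending = set(dependencies)
    while pending and time.time() < deadline:
        pending = {d for d in pending
                   if chirp("get_job_attr", f"ServiceUp_{d}").lower() != "true"}
        if pending:
            time.sleep(poll)
    return not pending

if __name__ == "__main__":
    # Example for the WMSLB node: wait for VOMS_mysql, start the service, announce.
    if wait_for(["VOMS_mysql"]):
        # ... start the WMSLB service here ...
        announce_up("WMSLB")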
Results
• Using NMI and the Condor parallel universe we were able to address the scenario described above
• The delays were minimal and the experimental timeOuts were adjusted for optimal performance
• The developed code can be consulted in CVS, module: org.etics.nmi.system-tests
• Non-conditional persistency: the node on which the service runs remains operational for a predefined amount of time
  • a sleep is appended to the code
  • expiry-time communication via the NMI/Hawkeye module
Results
• Conditional persistency: the node should be frozen in case the job fails (not implemented yet, but easy)
• Failure propagation: should one of the parallel tests fail, the whole job flow is immediately aborted
  • the set of parallel job nodes exits immediately when the node_0 job exits (node_0 being the "last" in the chain)
• The output format definition is up to the submitter
Results
• Context and name spaces are assured thanks to the Condor/NMI design
• A tester wanting to use their own (external) service instance (e.g. a VOMS_server): possible, but reproducibility is not guaranteed
• Multi-site / across-firewall tests: possible (see Andy's talk)
• Is a test job different from a standard build submission? Not from the WP2 point of view
• Proposal of a YAML format for the dependency definitions (see flow-spec.yaml)
flow-spec.yaml
# First, a list of all jobs.
---
- UI
- CE
- WMSLB
- VOMS_mysql
- RGMA
# Now, mapping the job name to its script.
---
UI: UI.sh
CE: CE.sh
WMSLB: WMSLB.sh
VOMS_mysql: VOMS_mysql.sh
RGMA: RGMA.sh
# Now, a hash mapping each job to its dependencies at the nodeid discovery
# stage.
---
UI: [RGMA, VOMS_mysql, WMSLB, CE]
CE: [WMSLB]
WMSLB: [VOMS_mysql]
VOMS_mysql: [RGMA]
RGMA: []
# Timeouts for nodeid discovery stage.
---
UI: 35
CE: 25
WMSLB: 15
VOMS_mysql: 10
RGMA: 0
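A minimal sketch of how the proposed spec could be consumed, assuming PyYAML is available: it loads the four YAML documents and pairs each job with its script, dependencies and discovery timeout (the ordering sketch shown earlier could then consume the dependency map). The function name is illustrative and this is not part of the proposal itself.

# read_flow_spec.py -- illustrative sketch only; assumes PyYAML is installed.
import yaml

def load_flow_spec(path="flow-spec.yaml"):
    """Return (jobs, scripts, deps, timeouts) from the four YAML documents."""
    with open(path) as fh:
        docs = [d for d in yaml.safe_load_all(fh) if d is not None]
    jobs, scripts, deps, timeouts = docs
    return jobs, scripts, deps, timeouts

if __name__ == "__main__":
    jobs, scripts, deps, timeouts = load_flow_spec()
    for job in jobs:
        print("%-12s script=%-16s deps=%-35s discovery timeout=%ss"
              % (job, scripts[job], deps.get(job, []), timeouts[job]))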
gLite/general issues
• Do we adopt the YAML format?
• Do we need to create temporary CA servers, or do we expect this from the testers/code submitters?
  • pass-phrase problem
• Do we write the site-info.def file upfront, or do we assume future auto-discovery?
Next steps
• Virtualisation using the WoD (WindowsOnDemand) service
  • initial assessment very positive
  • customized installation à la ETICS WN
  • candidate for the "freeze" scenario: one can programmatically export/import the VM
  • free as of today, paid in the future (should we run a dedicated/private server?)
• Virtualisation using VMware
  • base installation (Alberto can say much more)
  • an API exists
• Virtualisation with Condor: see Andy's talk
Next steps
• Demo for the PM12 (review)?
Discussion
• Q & A