Preliminary tests with co-scheduling and the Condor parallel universe
Marian ZUREK for WP2
ETICS All Hands meeting, Bologna, October 23-25, 2006
What's the …
• Context
• Use case
• Past
• Condor / NMI setup
• Results
• gLite-specific issues
• Next steps
• Discussion
Context
• Testing of the gLite software stacks is highly manual, which motivated automating the process and making it easily reproducible
• In the future, the system tests should become part of the release process (reports stored in the DB, easily accessible for trend analysis, performance studies, bug reproduction, etc.)
Service Overview
[Architecture diagram: clients (via browser, command-line tools, NMI client wrapper), Web Application, Web Service, NMI Scheduler, WNs, Build/Test Artefacts, Report DB, Project DB, ETICS Infrastructure, Continuous Builds]
Use case for gLite
• Six services have to be deployed on six different nodes: UI, CE, WMSLB, VOMS_mysql, RGMA, WN
• There are interdependencies between them:
  • UI: [RGMA, VOMS_mysql, WMSLB, CE, WN]
  • CE: [WMSLB, WN]
  • WMSLB: [VOMS_mysql]
  • VOMS_mysql: [RGMA]
  • WN: [CE]
  • RGMA: []
• No auto-discovery is possible: the order of service start-up must be preserved and the run-time environment defined (see the ordering sketch below)
• Successful submission of a "real job" requires all of the services to be operational
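To make the start-up ordering concrete, below is a minimal Python sketch (not the ETICS implementation) that derives a start order from such a dependency map using Kahn's algorithm. For illustration it uses the reduced, acyclic map later proposed in flow-spec.yaml (CE depending only on WMSLB, WN omitted), since the full map above lists CE and WN as depending on each other and therefore admits no strict linear order; function names are illustrative.

# start_order.py -- illustrative sketch only, not the ETICS code.
# Derive a service start-up order from a dependency map (Kahn's algorithm).

def start_order(deps):
    """deps maps each service to the list of services it depends on."""
    remaining = {svc: set(d) for svc, d in deps.items()}
    order = []
    while remaining:
        ready = sorted(svc for svc, d in remaining.items() if not d)
        if not ready:
            raise ValueError("circular dependency among: %s" % sorted(remaining))
        for svc in ready:
            order.append(svc)
            del remaining[svc]
        for d in remaining.values():
            d.difference_update(ready)
    return order

if __name__ == "__main__":
    # Reduced map from flow-spec.yaml (acyclic); the CE/WN pair from the slide
    # would have to be brought up together instead of being ordered.
    deps = {
        "UI": ["RGMA", "VOMS_mysql", "WMSLB", "CE"],
        "CE": ["WMSLB"],
        "WMSLB": ["VOMS_mysql"],
        "VOMS_mysql": ["RGMA"],
        "RGMA": [],
    }
    print(start_order(deps))   # ['RGMA', 'VOMS_mysql', 'WMSLB', 'CE', 'UI']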
Back to the past
• The gLite software stack requires administration rights on the target node, so a root-enabled schema has been developed to address this
• root-enabled jobs should be run only on predefined sets of hosts
• The service installation should reflect its operational status by writing to a file, e.g. /etc/nmi/publish_services.list:
  • runs_VOMS_server="true", timeOut=3600
  • runs_RGMA_server="true"
• The timeOut (expressed in seconds) defines the service operational time; after the timeOut the node is released. If no timeOut is given, the machine is released immediately after the job has finished (see the parsing sketch below)
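A minimal sketch of how a node-side script could read the publish file and keep the node alive for the requested time. The key/value syntax is inferred from the two example lines above, and the function name is purely illustrative; this is not the actual root-enabled schema code.

# publish_services.py -- illustrative sketch only.
# Parse /etc/nmi/publish_services.list and keep the node up for timeOut seconds.
import re
import time

PUBLISH_FILE = "/etc/nmi/publish_services.list"   # path taken from the slide

def parse_publish_file(path=PUBLISH_FILE):
    """Return (services, timeout): services the node runs and the requested
    operational time in seconds (0 = release immediately after the job)."""
    services, timeout = [], 0
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # e.g. runs_VOMS_server="true", timeOut=3600
            for key, value in re.findall(r'(\w+)\s*=\s*"?([^",\s]+)"?', line):
                if key.startswith("runs_") and value.lower() == "true":
                    services.append(key[len("runs_"):])
                elif key == "timeOut":
                    timeout = max(timeout, int(value))
    return services, timeout

if __name__ == "__main__":
    services, timeout = parse_publish_file()
    print("node publishes:", services, "- keep alive for", timeout, "s")
    time.sleep(timeout)   # node is released once the timeout expires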
Condor / NMI setup
• Experiment performed on a predefined set of nodes:
  • special STARTD expressions define the Condor VMx availability
  • the nodes remain available for regular submissions
  • synchronisation via Condor chirp messages (see the sketch below)
• Custom scratching mechanism (outside NMI/Condor):
  • watchdog style (an external process monitors the node's "limbo" state)
• Initial trouble with lost/stuck jobs was resolved with extra wait time
• Node down-time < 10 minutes
• Very good candidate for virtualisation, as no re-installation is needed (a simple VM restart is enough)
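For illustration, a minimal sketch of one way chirp-based synchronisation between parallel-universe nodes could look: each node publishes "my service is up" as a job attribute and polls for the attributes of the services it depends on. The ServiceUp_<name> attribute scheme and the helper functions are invented for this sketch, and it assumes condor_chirp is usable from inside the job; this is not the code referenced on the slide.

# chirp_sync.py -- illustrative sketch only.
import subprocess
import time

def chirp(*args):
    """Run condor_chirp and return its stdout (empty string on failure)."""
    try:
        return subprocess.run(["condor_chirp", *args],
                              capture_output=True, text=True,
                              check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return ""

def announce_up(service):
    # Publish that this node's service is operational (attribute name is illustrative).
    chirp("set_job_attr", f"ServiceUp_{service}", "True")

def wait_for(dependencies, poll=10, timeout=3600):
    # Block until every dependency has announced itself, or give up at the deadline.
    deadline = time.time() + timeout
    pending = set(dependencies)
    while pending and time.time() < deadline:
        pending = {d for d in pending
                   if chirp("get_job_attr", f"ServiceUp_{d}").lower() != "true"}
        if pending:
            time.sleep(poll)
    return not pending

if __name__ == "__main__":
    # Example for the WMSLB node: wait for VOMS_mysql, start the service, announce.
    if wait_for(["VOMS_mysql"]):
        # ... start the WMSLB service here ...
        announce_up("WMSLB")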
Results
• Using NMI and the Condor parallel universe we were able to address the scenario described above
• The delays were minimal and the experimental timeOuts were adjusted for optimal performance
• The developed code can be consulted in CVS, module: org.etics.nmi.system-tests
• Non-conditional persistency: the node on which the service runs remains operational for a predefined amount of time
  • a sleep is appended to the code
  • expiry-time communication via the NMI/Hawkeye module
Results
• Conditional persistency: the node should be frozen in case the job fails (not implemented yet, but easy)
• Failure propagation: should one of the parallel tests fail, the whole job flow is immediately aborted
  • the set of parallel job nodes exits immediately when the node_0 job exits (node_0 being the "last" in the chain)
• The output format definition is up to the submitter
Results
• Context and name spaces are assured thanks to the Condor/NMI design
• A tester wanting to use their own (external) service instance (e.g. a VOMS_server): possible, but reproducibility is not guaranteed
• Multi-site / across-firewall tests: possible (see Andy's talk)
• Is a test job different from a standard build submission? Not from the WP2 point of view
• Proposal of a YAML format for the dependency definitions (see flow-spec.yaml)
flow-spec.yaml
# First, a list of all jobs.
---
- UI
- CE
- WMSLB
- VOMS_mysql
- RGMA
# Now, mapping the job name to its script.
---
UI: UI.sh
CE: CE.sh
WMSLB: WMSLB.sh
VOMS_mysql: VOMS_mysql.sh
RGMA: RGMA.sh
# Now, a hash mapping each job to its dependencies at the nodeid discovery
# stage.
---
UI: [RGMA, VOMS_mysql, WMSLB, CE]
CE: [WMSLB]
WMSLB: [VOMS_mysql]
VOMS_mysql: [RGMA]
RGMA: []
# Timeouts for nodeid discovery stage.
---
UI: 35
CE: 25
WMSLB: 15
VOMS_mysql: 10
RGMA: 0
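A minimal sketch of how the proposed spec could be consumed, assuming PyYAML is available: it loads the four YAML documents and pairs each job with its script, dependencies and discovery timeout (the ordering sketch shown earlier could then consume the dependency map). The function name is illustrative and this is not part of the proposal itself.

# read_flow_spec.py -- illustrative sketch only; assumes PyYAML is installed.
import yaml

def load_flow_spec(path="flow-spec.yaml"):
    """Return (jobs, scripts, deps, timeouts) from the four YAML documents."""
    with open(path) as fh:
        docs = [d for d in yaml.safe_load_all(fh) if d is not None]
    jobs, scripts, deps, timeouts = docs
    return jobs, scripts, deps, timeouts

if __name__ == "__main__":
    jobs, scripts, deps, timeouts = load_flow_spec()
    for job in jobs:
        print("%-12s script=%-16s deps=%-35s discovery timeout=%ss"
              % (job, scripts[job], deps.get(job, []), timeouts[job]))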
gLite/general issues
• Do we adopt the YAML format?
• Do we need to create temporary CA servers, or do we expect this from the testers/code submitters?
  • pass-phrase problem
• Do we write the site-info.def file upfront, or do we assume future auto-discovery?
Next steps
• Virtualisation using the WoD (WindowsOnDemand) service
  • initial assessment very positive
  • customized installation à la ETICS WN
  • candidate for the "freeze" scenario: one can programmatically export/import the VM
  • free as of today, paid in the future (should we run a dedicated/private server?)
• Virtualisation using VMware
  • base installation (Alberto can say much more)
  • an API exists
• Virtualisation with Condor: see Andy's talk
Next steps
• Demo for the PM12 (review)?
Discussion
• Q & A