Watchdog: A job monitoring solution inside the EELA-2 Infrastructure
Riccardo Bruno, Roberto Barbera, Elisa Ingrà
INFN Sez. Catania (Italy)
2nd EELA-2 Conference, Choroni (Venezuela), 25-27.11.2009

Job Monitoring in gLite
[Figure: jobs, WMS, CEs, WNs and the Output Sandbox; the state of a job on a WN is marked "?"]
Before gLite v3.1 no job monitoring systems were available:
• Jobs running on the WNs are treated as black boxes
• No prompt retrieval of the job status (Done/Abort/…)
• The Output Sandbox is available only after the WMS recognizes job completion
• This situation was not good for jobs requiring very long computation times

Analysis
• Need
  • Get in touch with the jobs running on the WN (especially long-running jobs) to monitor and control their execution
• How
  • Perform job control and monitoring through grid services, in the least invasive way for the application
• Observations
  • Almost all Grid jobs are piloted by a main shell script, used to:
    • Get precious info in case of faults
    • Pilot complex batch workflows
  • Both AMGA and SE+LFC can be used as a basic Grid Info System
  • lfc-* and lcg-* tools are already available for Grid file management
  • The mdcli AMGA command can be used by jobs on the WNs
  • The cp command can be used when a shared file system is mounted on the WN
  • The latency of the CLI tools is very low compared with the duration of long-running jobs

Requirements
• Monitor job execution in a timely manner by watching the files the job produces while it executes on the WN
  • File snapshots are reported to LFC+SE, AMGA servers or mounted shared file systems
  • The monitoring tool should be configurable according to the user's needs
  • The monitoring tool consists only of bash script files
  • A few shell environment variables configure the monitoring behavior
• Control the job execution by accessing the WN directly
  • User commands can be sent to the WN
  • The monitoring can be changed while the Grid job runs
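
As a rough illustration of the first requirement, the sketch below polls a list of files and copies a timestamped snapshot into a report directory whenever a file has changed. It assumes the mounted shared file system mode, where a plain cp is enough. The variable names WD_FILES, WD_INTERVAL and WD_REPORT_DIR and both function names are hypothetical, not the actual watchdog.conf settings, and the snapshot file name only imitates the timestamp_PID_name pattern that appears in the snapshot listing of these slides.

```shell
# Hypothetical settings; the real watchdog.conf variable names are not shown in the slides.
WD_FILES="${WD_FILES:-job.out job.err}"           # space-separated files to watch
WD_INTERVAL="${WD_INTERVAL:-30}"                  # seconds between two checks
WD_REPORT_DIR="${WD_REPORT_DIR:-./wd_snapshots}"  # stands in for a mounted shared FS

# Copy one file to the report directory, but only if its checksum changed
# since the previous snapshot of the same file.
snapshot_if_changed() {
    local file="$1" base sum_file sum
    [ -f "$file" ] || return 0
    mkdir -p "$WD_REPORT_DIR"
    base=$(basename "$file")
    sum_file="$WD_REPORT_DIR/.${base}.sum"
    sum=$(cksum < "$file")
    if [ ! -f "$sum_file" ] || [ "$sum" != "$(cat "$sum_file")" ]; then
        cp "$file" "$WD_REPORT_DIR/$(date +%y%m%d%H%M%S)_$$_${base}"
        printf '%s\n' "$sum" > "$sum_file"
    fi
}

# Run a fixed number of watch cycles over all configured files.
watch_cycles() {
    local cycles="$1" i=0 f
    while [ "$i" -lt "$cycles" ]; do
        for f in $WD_FILES; do    # unquoted on purpose: split the list on spaces
            snapshot_if_changed "$f"
        done
        sleep "$WD_INTERVAL"
        i=$((i + 1))
    done
}
```

A pilot script could call `watch_cycles 100 &` before starting the application. Comparing checksums keeps unchanged files from being copied again on every cycle.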

The Watchdog
• The Watchdog consists of a set of shell scripts to be included in the JDL InputSandbox and then called by the pilot script.
• Watchdog features:
  • It starts in the background before the Grid job runs
  • It runs as long as the main job
  • The monitoring process can be controlled until the pilot script has finished
  • Easily configurable and customizable
  • It does not compromise the CPU power of the WN
  • It can be used with MPI jobs
  • Files may be reported fully or partially (only the latest changes)
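
The partial reporting mode (only the latest changes) can be pictured with the sketch below, which prints just the bytes appended to a file since the previous call. It is an assumption about how such a mode could work, not the Watchdog's actual code; the function name report_partial and the state file that stores the last reported size are hypothetical.

```shell
# Print only the bytes appended to FILE since the previous call.
# STATE is a small file that remembers how many bytes were already reported.
report_partial() {
    local file="$1" state="$2" last=0 size
    [ -f "$state" ] && last=$(cat "$state")
    size=$(wc -c < "$file")
    size=$((size + 0))                    # strip any padding from the wc output
    if [ "$size" -lt "$last" ]; then
        last=0                            # file was truncated: report it from the start
    fi
    if [ "$size" -gt "$last" ]; then
        # tail -c +N starts at byte N; head caps the output at the size measured above
        tail -c +"$((last + 1))" "$file" | head -c "$((size - last))"
        echo "$size" > "$state"
    fi
}
```

Calling `report_partial job.out job.out.offset` on every watch cycle yields successive chunks that, concatenated in order, rebuild the file as long as it is only appended to.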

WD Main Components
• watchdog.sh
  • The WD core script; it is responsible for job monitoring, file snapshot reporting and user command execution
• watchdog.ctrl
  • This script controls the execution of the WD core script: it can start, stop, pause and resume the WD. It can also be used to alter the time interval, add/remove files to watch and change the reporting strategy (full/partial)
• watchdog.conf
  • This script contains all the environment variables needed to configure the WD
The use of AMGA reporting requires more files
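
The slides do not show how watchdog.ctrl is implemented. One conventional way to get start, stop, pause and resume for a background shell loop is a PID file plus POSIX signals, sketched below. Everything here is an assumption made for illustration: the names wd_loop, wd_ctrl and WD_PIDFILE are hypothetical, and the real script may rely on flag files instead.

```shell
WD_PIDFILE="${WD_PIDFILE:-./WDPID}"   # hypothetical location of the PID file

# Placeholder for the monitoring loop; a real loop would take file snapshots here.
wd_loop() {
    while :; do
        sleep 1
    done
}

# Control the background loop: start | pause | resume | stop
wd_ctrl() {
    case "$1" in
        start)
            wd_loop > /dev/null 2>&1 &
            echo "$!" > "$WD_PIDFILE"
            ;;
        pause)
            kill -s STOP "$(cat "$WD_PIDFILE")"
            ;;
        resume)
            kill -s CONT "$(cat "$WD_PIDFILE")"
            ;;
        stop)
            pid=$(cat "$WD_PIDFILE")
            kill -s TERM "$pid" 2>/dev/null
            kill -s CONT "$pid" 2>/dev/null   # a paused loop must resume to receive TERM
            # wait succeeds only when this same shell started the loop
            wait "$pid" 2>/dev/null || true
            rm -f "$WD_PIDFILE"
            ;;
        *)
            echo "usage: wd_ctrl start|pause|resume|stop" >&2
            return 2
            ;;
    esac
}
```

With this layout a pilot script would call `wd_ctrl start` before the application and `wd_ctrl stop` after it, mirroring the start and stop calls of watchdog.ctrl.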

WD Additional Components
• getinfo.sh / setinfo.sh, getcontent.sh / setcontent.sh (AMGA)
  • Utilities to get/set the information reported by the WD from/to the AMGA metadata catalog
• uuencode / uudecode (sharutils) (AMGA)
  • Executables needed by the WD to encode binaries and multiline text content into the AMGA metadata catalog in Base64 text format
  • In EELA-2 (prod VO) available in:
    • $VO_PROD_VO_EU_EELA_EU_SW_DIR
• wdcli
  • CLI application that lets the user interact with the WD
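
To show why an encoder is needed for AMGA reporting, the sketch below turns an arbitrary file into Base64 text that fits in a single text attribute and decodes it back. It uses uuencode -m and uudecode -o from sharutils when they are installed; the fallback to the coreutils base64 command is an addition of this sketch, not something the Watchdog is stated to do, and the function names wd_encode and wd_decode are hypothetical.

```shell
# Choose the encoder once: sharutils if present, otherwise coreutils base64
# (the fallback is an assumption of this sketch).
if command -v uuencode > /dev/null 2>&1 && command -v uudecode > /dev/null 2>&1; then
    # uuencode -m writes Base64 text (with begin/end marker lines) to stdout
    wd_encode() { uuencode -m "$1" "$(basename "$1")"; }
    # uudecode -o writes the decoded bytes to the given output file
    wd_decode() { uudecode -o "$2" "$1"; }
else
    wd_encode() { base64 "$1"; }
    wd_decode() { base64 -d "$1" > "$2"; }
fi
```

Usage: `wd_encode snapshot.bin > snapshot.txt` produces text that can be stored as a metadata value, and `wd_decode snapshot.txt restored.bin` recreates the original bytes.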

WD Usage
• Configure the Watchdog by setting up the watchdog.conf file
• Applications using the Watchdog MUST include the files watchdog.sh, watchdog.ctrl, watchdog.conf, uuencode and uudecode (the last two in case of AMGA reporting), or add VO_PROD_VO_EU_EELA_EU_SW_DIR to the PATH on the WN
• Call watchdog.ctrl from the pilot script

App JDL:

    Type = "Job";
    JobType = "Normal";
    Executable = "/bin/bash";
    StdOutput = "file.out";
    StdError = "file.err";
    InputSandbox = {"watchdog.sh", "watchdog.ctrl", "watchdog.conf", "uuencode", "uudecode", "AppPilotScript.sh"};
    OutputSandbox = {"MyApp.out", "MyApp.err", "watchdog.log", "watchdog.err"};
    Arguments = "AppPilotScript.sh";

AppPilotScript.sh:

    #!/bin/sh
    …
    # prepare and start the watchdog
    PATH=${VO_PROD_VO_EU_EELA_EU_SW_DIR}/:${PATH}:.
    chmod +x watchdog.*
    ./watchdog.ctrl start
    # run the application
    …
    # use ./watchdog.ctrl to control the WD at any time
    # stop the watchdog and wait until it completes
    ./watchdog.ctrl stop

WD Interaction
[Figure: the WD Exe DIR on the WN (watchdog.conf, watchdog.sh) exchanges file snapshots, flags and commands (CMD, OUT, ERR) with the WD Control DIR on LFC/AMGA or a mounted shared FS]
Listing shown in the figure:

    <BASEPATH>/6-tPC2d2knO7m6GP2XC7-Q
        _watchdog/
            091002232421_wdcli_cmd1.cmd
            091002232421_wdcli_cmd1.err
            091002232421_wdcli_cmd1.out
            ...
            091002232729_wdcli_cmd7.cmd
            091002232729_wdcli_cmd7.err
            091002232729_wdcli_cmd7.out
            WDEND or WDPID
            WDENV
            WDHST
            cmdlist/
                wdcli_cmd8
        091002231841_13156_file.err
        091002231853_13156_file.out
        091002231904_13156_watchdog.err
        …
        091002232836_13156_watchdog.log
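
The listing suggests that pending user commands wait in cmdlist/ and that each executed command leaves a timestamped .cmd, .out and .err file behind. The sketch below reproduces that flow on a local directory. It is an interpretation of the figure rather than the Watchdog's actual code: the function name process_commands is hypothetical, and the real tool reaches the control directory through LFC, AMGA or a shared FS instead of a local path.

```shell
# Run every pending command file in CTRL_DIR/cmdlist and keep its text,
# standard output and standard error in CTRL_DIR under a timestamped name.
process_commands() {
    local ctrl_dir="$1" cmd_file name stamp
    for cmd_file in "$ctrl_dir"/cmdlist/*; do
        [ -f "$cmd_file" ] || continue      # also skips the unmatched glob
        name=$(basename "$cmd_file")
        stamp=$(date +%y%m%d%H%M%S)
        # stdin comes from /dev/null so that a command cannot wait for user input
        sh "$cmd_file" < /dev/null \
            > "$ctrl_dir/${stamp}_${name}.out" \
            2> "$ctrl_dir/${stamp}_${name}.err"
        # moving the file out of cmdlist/ marks the command as executed
        mv "$cmd_file" "$ctrl_dir/${stamp}_${name}.cmd"
    done
}
```

Calling `process_commands "$BASEPATH/$JOBID/_watchdog"` once per watch cycle would execute each queued command exactly once; both variables in that call are placeholders.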

wdcli
• CLI to ease user interaction with the WD
  • Prompt: 20091124164201 wd>
• Uses the watchdog.conf file to get the user configuration
• Principal commands:
  • set: Set MODE (LFC/AMGA/mounted shared FS)
  • show jobs: Get the list of monitored jobs
  • Attach to a monitored job
  • show snapshots: Get the list of file snapshots
  • View the snapshot content
  • Get generic info: ENV, PID, CE, WN, Proxy …
  • exec: Execute a given command
    • Interactive commands are not allowed
    • It is possible to call the watchdog.ctrl command (use the -n option!)

WD in EELA-2
• Presented for the first time at E2GRIS1 in Itacuruca (Brazil)
  • G-HMMER/G-InterProScan
    • Bioinformatics – Get semi-real-time info to be published on the Web
  • CrossFire
    • Civil Protection – Get semi-real-time info to view the simulation output
• Presented for the second time at E2GRIS2 in Querétaro (Mexico)
  • HeMoLab
    • Bioinformatics – Long-running jobs, check output files while running
  • AeroVANT
    • Engineering – Long-running jobs, get data while running
  • BioMD
    • Bioinformatics – Long-running job, monitor the simulation
  • Seismic Sensors (planned)
    • Earth Science – Monitor the job execution
  • Cinefilia
    • Recommender Systems – Monitor the computation

Conclusions
• WD mainly used for:
  • Job monitoring (long-running jobs)
  • Checking/getting the data produced by a job
• WD used as:
  • A debugging helper tool
  • An application component (CrossFire)
• WD is easy to integrate but needs a precise configuration
  • EELA-2 has 2 different AMGA servers using different access rights (EU and LA)
  • EELA-2 does not have the sharutils (uuencode/uudecode) package installed on the WNs. These tools are available under the WN path VO_PROD_VO_EU_EELA_EU_SW_DIR; alternatively, put the 'uu**code' commands in the InputSandbox
  • In EELA-2 several WNs were using a different BDII, so some users were unable to easily retrieve the snapshot content (LFC)

Future
• Improve the user interaction
  • Improve wdcli (given its success at E2GRIS2)
  • Create tools to easily build web-based front ends
  • Provide tools to reconstruct a file monitored incrementally
• Ease the application integration (AMGA)
  • Become independent of uuencode/uudecode
  • Provide watchdog.conf file templates for VOs
• Improve the monitoring
  • Provide independent watch cycles for each file
  • Provide a sandboxing mechanism for file I/O from/to the WN

Questions?
www.eu-eela.eu