Watchdog: A job monitoring solution inside the EELA-2 Infrastructure

Presentation Transcript


  1. Watchdog: A job monitoring solution inside the EELA-2 Infrastructure Riccardo Bruno, Roberto Barbera, Elisa Ingrà INFN Sez. Catania (Italy) 2nd EELA-2 Conference Choroni (Venezuela), 25-27.11.2009

  2. Job Monitoring in gLite
     [Diagram labels: Jobs, WNs, CE, WMS, Output SandBox, "?"]
     Before gLite v3.1 no job monitoring systems were available:
     • Jobs running on the WNs are treated as black boxes
     • No prompt job status retrieval (Done/Abort/…)
     • The Output Sandbox is available only after the WMS recognizes job completion
     • This situation is a poor fit for jobs that need very long computation times

  3. Analysis
     • Need
         • Get in touch with the jobs running on the WN (especially long-running jobs) to monitor and control their execution
     • How
         • Perform job control and monitoring through Grid services, in the least invasive way for the application
     • Observations
         • Almost all Grid jobs are piloted by a main shell script, which can:
             • Collect precious information in case of faults
             • Pilot complex batch workflows
         • Both AMGA and SE+LFC can be used as a basic Grid information system
             • lfc-* and lcg-* tools are already available for Grid file management
             • The mdcli AMGA command can be used by jobs on the WNs
             • The cp command can be used when the WN has a shared file system
         • The latency of the CLI tools is very low compared to the duration of long-running jobs

  4. Requirements
     • Monitor job execution by periodically watching the files the job produces while it runs on the WN
         • File snapshots are reported to LFC+SE, AMGA servers, or mounted shared file systems
         • The monitoring tool should be configurable according to the user's needs
         • The monitoring tool consists only of bash script files
         • A few shell environment variables configure the monitoring behavior
     • Control the job execution by acting directly on the WN
         • User commands can be sent to the WN
         • The monitoring can be changed while the Grid job runs
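The periodic-snapshot requirement can be sketched for the simplest reporting mode, a mounted shared file system reached with plain cp. This is only an illustration under stated assumptions: WD_FILES, WD_TARGET, WD_INTERVAL, snapshot_once and watch_job are hypothetical names, not the variables or functions of the real Watchdog scripts.

```shell
#!/bin/bash
# Hypothetical sketch: copy watched files to a shared directory at each cycle.
# WD_FILES, WD_TARGET and WD_INTERVAL are illustrative names, not Watchdog's own.
WD_FILES="${WD_FILES:-file.out file.err}"
WD_TARGET="${WD_TARGET:-/tmp/wd_snapshots}"
WD_INTERVAL="${WD_INTERVAL:-60}"

snapshot_once() {
    # Copy each watched file that currently exists, prefixing a timestamp.
    local stamp f
    stamp=$(date +%y%m%d%H%M%S)
    mkdir -p "$WD_TARGET"
    for f in $WD_FILES; do
        [ -f "$f" ] && cp "$f" "$WD_TARGET/${stamp}_$(basename "$f")"
    done
    return 0
}

watch_job() {
    # Take snapshots while the process with the given PID is alive,
    # then one final snapshot after it exits.
    local job_pid=$1
    while kill -0 "$job_pid" 2>/dev/null; do
        snapshot_once
        sleep "$WD_INTERVAL"
    done
    snapshot_once
}
```

A pilot script could start the application in the background and call `watch_job $!`. Reporting to LFC+SE or AMGA would replace the cp call with the corresponding Grid tools.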

  5. The Watchdog
     • The Watchdog consists of a set of shell scripts to be included in the JDL InputSandbox and then called by the pilot script.
     • Watchdog features:
         • It starts in the background before the Grid job runs
         • It runs for as long as the main job
         • The monitoring process can be controlled until the pilot script finishes
         • It is easily configurable and customizable
         • It does not compromise the CPU power of the WN
         • It can be used with MPI jobs
         • Files may be reported fully or partially (only the latest changes)
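Partial reporting (only the latest changes) can be illustrated by remembering how many bytes of a file were already reported and emitting only what was appended since. The sketch below is an assumption about how such a strategy might look, not the Watchdog's actual code; the function name report_new_bytes and the `.wdoff` offset-file convention are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of partial reporting: print only the bytes appended
# to a file since the previous call. The <file>.wdoff offset file is an
# illustrative convention, not part of the real Watchdog.
report_new_bytes() {
    local file=$1 offfile="$1.wdoff" last=0 size
    [ -f "$file" ] || return 0
    [ -f "$offfile" ] && last=$(cat "$offfile")
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -lt "$last" ]; then
        last=0   # file was truncated or replaced: report it from the start
    fi
    if [ "$size" -gt "$last" ]; then
        # tail -c +N starts at byte N (1-based); head -c caps the output at
        # the size measured above, so later appends go to the next cycle.
        tail -c +"$((last + 1))" "$file" | head -c "$((size - last))"
    fi
    echo "$size" > "$offfile"
}
```

Calling `report_new_bytes file.out` at each monitoring cycle yields successive fragments; concatenating them in order reconstructs the file, as long as it is only ever appended to.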

  6. WD Main Components
     • watchdog.sh
         • The WD core script; it is responsible for job monitoring, file snapshot reporting, and user command execution
     • watchdog.ctrl
         • This script controls the execution of the WD core script: it can start, stop, pause, and resume the WD. It can also alter the time interval, add or remove files to watch, and change the reporting strategy (full/partial)
     • watchdog.conf
         • This script contains all the environment variables needed to configure the WD
     AMGA reporting requires additional files.

  7. WD Additional Components
     • getinfo.sh / setinfo.sh, getcontent.sh / setcontent.sh (AMGA)
         • Utilities to set and get WD-reported information in the AMGA metadata catalog
     • uuencode / uudecode (sharutils) (AMGA)
         • Executables the WD needs to encode binaries and multiline text content into the AMGA metadata catalog in Base64 text format
         • In EELA-2 (prod VO) they are available in $VO_PROD_VO_EU_EELA_EU_SW_DIR
     • wdcli
         • CLI application that lets the user interact with the WD
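The reason for Base64 encoding is that a metadata attribute holds a single text value, while snapshots may be binary or span several lines. The sketch below shows the round trip using the `base64` command in place of uuencode/uudecode (a deliberate substitution for illustration, since sharutils may not be installed). The function names encode_value and decode_value are hypothetical, and the step of storing the value in AMGA is left out.

```shell
#!/bin/bash
# Hypothetical sketch: turn file content into a single-line Base64 string
# and back. Uses the base64 command instead of uuencode/uudecode.
encode_value() {
    # Strip the line wrapping so the result fits in one text attribute.
    base64 < "$1" | tr -d '\n'
}

decode_value() {
    # Recreate the original bytes from the single-line Base64 string.
    printf '%s' "$1" | base64 --decode
}
```

For example, `enc=$(encode_value file.out)` yields a value without newlines, and `decode_value "$enc" > file.copy` restores the original content.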

  8. WD Usage
     • Configure the Watchdog by editing the watchdog.conf file
     • Applications using the Watchdog MUST include the files watchdog.sh, watchdog.ctrl, and watchdog.conf, plus uuencode and uudecode in case of AMGA reporting (or add VO_PROD_VO_EU_EELA_EU_SW_DIR to the PATH on the WN)
     • Call watchdog.ctrl from the pilot script
     App JDL:
         Type = "Job";
         JobType = "Normal";
         Executable = "/bin/bash";
         StdOutput = "file.out";
         StdError = "file.err";
         InputSandbox = {"watchdog.sh", "watchdog.ctrl", "watchdog.conf", "uuencode", "uudecode", "AppPilotScript.sh"};
         OutputSandbox = {"MyApp.out", "MyApp.err", "watchdog.log", "watchdog.err"};
         Arguments = "AppPilotScript.sh";
     AppPilotScript.sh:
         #!/bin/sh
         …
         # prepare and start the watchdog
         PATH=${VO_PROD_VO_EU_EELA_EU_SW_DIR}/:${PATH}:.
         chmod +x watchdog.*
         ./watchdog.ctrl start
         # run the application
         …
         # use ./watchdog.ctrl to control the WD at any time
         # stop the watchdog and wait for it to complete
         ./watchdog.ctrl stop

  9. WD Interaction
     [Diagram labels: WN, Exe DIR, watchdog.sh, watchdog.conf, WD CMD, OUT, ERR, CMD, Flags, File snapshots, WD Control DIR, LFC/AMGA, Mounted Sh FS]
     Reporting area layout for job 6-tPC2d2knO7m6GP2XC7-Q:
         <BASEPATH>/6-tPC2d2knO7m6GP2XC7-Q
             _watchdog/                          (WD Control DIR)
                 091002232421_wdcli_cmd1.cmd
                 091002232421_wdcli_cmd1.err
                 091002232421_wdcli_cmd1.out
                 ...
                 091002232729_wdcli_cmd7.cmd
                 091002232729_wdcli_cmd7.err
                 091002232729_wdcli_cmd7.out
                 WDEND or WDPID                  (Flags)
                 WDENV
                 WDHST
                 cmdlist/
                     wdcli_cmd8
             091002231841_13156_file.err         (File snapshots)
             091002231853_13156_file.out
             091002231904_13156_watchdog.err
             …
             091002232836_13156_watchdog.log
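The file names on this slide suggest that pending user commands wait in cmdlist/ and that each executed command leaves a timestamped .cmd, .out and .err triple in the control directory. The sketch below illustrates such a cycle under that reading; it is an assumption, since the source does not show how watchdog.sh handles commands, and process_commands is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical sketch: run each pending command file and store its results.
# The layout mirrors the slide (cmdlist/ plus <timestamp>_<name>.cmd/.out/.err),
# but the real watchdog.sh logic is not shown in the source.
process_commands() {
    local ctrl_dir=$1 cmd name stamp
    for cmd in "$ctrl_dir"/cmdlist/*; do
        [ -f "$cmd" ] || continue
        name=$(basename "$cmd")
        stamp=$(date +%y%m%d%H%M%S)
        # stdin comes from /dev/null so an interactive command cannot block
        bash "$cmd" < /dev/null \
            > "$ctrl_dir/${stamp}_${name}.out" \
            2> "$ctrl_dir/${stamp}_${name}.err" || true
        # Moving the file out of cmdlist/ marks the command as executed.
        mv "$cmd" "$ctrl_dir/${stamp}_${name}.cmd"
    done
    return 0
}
```

Calling `process_commands "$BASEPATH/$JOBID/_watchdog"` once per monitoring cycle (both variables being placeholders) would pick up any command a client dropped into cmdlist/ since the previous cycle.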

  10. wdcli
      • CLI that eases user interaction with the WD
          • 20091124164201 wd>
      • Uses the watchdog.conf file to get the user configuration
      • Principal commands:
          • set: set the MODE (LFC/AMGA/mounted shared FS)
          • show jobs: get the list of monitored jobs
              • Attach to a monitored job
          • show snapshots: get the list of file snapshots
              • View the snapshot content
              • Get generic info: ENV, PID, CE, WN, Proxy …
          • exec: execute a given command
              • Interactive commands are not allowed
              • The watchdog.ctrl command can be called (use the -n option!)

  11. WD in EELA-2
      • Presented for the first time at E2GRIS1 in Itacuruçá (Brazil)
          • G-HMMER/G-InterProScan
              • Bioinformatics: get semi-real-time info to be published on the Web
          • CrossFire
              • Civil Protection: get semi-real-time info to view the simulation output
      • Presented for the second time at E2GRIS2 in Querétaro (Mexico)
          • HeMoLab
              • Bioinformatics: long-running jobs, check output files while running
          • AeroVANT
              • Engineering: long-running jobs, get data while running
          • BioMD
              • Bioinformatics: long-running job, monitor the simulation
          • Seismic Sensors (planned)
              • Earth Science: monitor the job execution
          • Cinefilia
              • Recommender Systems: monitor the computation

  12. Conclusions
      • WD mainly used for:
          • Job monitoring (long-running jobs)
          • Checking/getting job-produced data
      • WD used as:
          • A debugging helper tool
          • An application component (CrossFire)
      • WD is easy to integrate but needs a precise configuration
          • EELA-2 has 2 different AMGA servers using different access rights (EU and LA)
          • EELA-2 does not have the sharutils (uuencode/uudecode) package installed on the WNs. These tools are available under the WN path VO_PROD_VO_EU_EELA_EU_SW_DIR; alternatively, put the ‘uu**code’ commands in the InputSandbox
          • In EELA-2 several WNs were using a different BDII, so some users could not easily retrieve the snapshot content (LFC)

  13. Future
      • Improve the user interaction
          • Improve wdcli (given its success at E2GRIS2)
          • Create tools to easily build web-based front ends
          • Provide tools to reconstruct a file monitored incrementally
      • Ease the application integration (AMGA)
          • Remove the dependency on uuencode/uudecode
          • Provide watchdog.conf file templates for VOs
      • Improve the monitoring
          • Provide independent watching cycles for each file
          • Provide a sandboxing mechanism for file I/O from/to the WN

  14. Questions? www.eu-eela.eu
