
Direct gLExec integration with PanDA



Presentation Transcript


  1. Direct gLExec integration with PanDA. Fernando H. Barreiro Megino, Simone Campana, Ramon Medrano (CERN IT-ES-VOS)

  2. Introduction
     • The WLCG Grid Job and Worker Node Security Assessment (http://cern.ch/go/7vK9) requires the use of gLExec
     • gLExec acts as a light-weight 'gatekeeper':
       • Takes grid credentials as input
       • Considers local site policies to authenticate and authorize the credentials
       • Switches to a new identity and execution sandbox to run a given command
     • gLExec usage comes for free with GlideInWMS, but alternatively we can integrate gLExec directly with the PanDA pilot
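
The gatekeeper behaviour described above can be sketched as a small helper that prepares a gLExec call without running it. This is only an illustration under stated assumptions: the binary location `/usr/sbin/glexec` and the helper name `build_glexec_call` are hypothetical and not taken from the slides, while `GLEXEC_CLIENT_CERT` and `GLEXEC_SOURCE_PROXY` are the environment variables gLExec reads for the target user's credential.

```python
import os


def build_glexec_call(target_proxy, command, glexec_bin="/usr/sbin/glexec", base_env=None):
    """Return (argv, env) for running `command` under the identity mapped from `target_proxy`.

    glexec_bin is an assumed install location; sites may place the binary elsewhere.
    """
    # Start from the caller's environment unless an explicit one is supplied.
    env = dict(os.environ if base_env is None else base_env)
    # Credential that gLExec authenticates and authorizes against site policy.
    env["GLEXEC_CLIENT_CERT"] = target_proxy
    # Proxy file that gLExec copies into the target account for the payload to use.
    env["GLEXEC_SOURCE_PROXY"] = target_proxy
    # gLExec takes the command to run (and its arguments) as its own arguments.
    argv = [glexec_bin] + list(command)
    return argv, env


argv, env = build_glexec_call("/tmp/user.proxy", ["/usr/bin/id"])
```

On a worker node where gLExec is installed, passing the result to `subprocess.call(argv, env=env)` would perform the actual identity switch; the sketch stops short of that so it can be read and tested anywhere.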

  3. Common Analysis Framework [Architecture diagram. Columns: client side, server side, Grid resources. Labelled components: VO-specific client, (optional) client service, PanDA Server, PanDA monitor and Dashboard historical views, PanDA Pilot Factories, GlideIn WMS, Computing Element, glideIns, PanDA pilot, glexec, job transformations, Data Adaptor, Data Mgmt Services. Legend: PanDA components; VO-specific, external components; GlideInWMS components.]

  4. Direct gLExec integration: Stage 1 [Diagram. Labelled components: MyProxy, PanDA Server, Computing Element, PanDA pilots, user proxies, gLExec, jobs.]
     Implementation steps:
     1. MyProxy & gLExec standalone tests
     2. Refactoring of MyProxyUtils
     3. gLExec SAM test
     4. Integration into PanDA pilot

  5. Step 1: MyProxy & gLExec standalone tests
     • Gain experience in using MyProxy and gLExec
     • Upload and download proxies, and test delegation between Fernando and Ramon
     • Test gLExec through bsub on lxbatch (i.e. no pilot involved yet), switching identity from Fernando to Ramon
       • Run id, voms-proxy-info …
       • Copy a file to and from the EOS SCRATCHDISK
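
A standalone test of this kind might assemble its commands roughly as follows. The sketch only builds argument lists; the server name `myproxy.example.org`, the account name `ramon`, the output path, and both function names are hypothetical placeholders, and the exact options used in the real tests are not given in the slides.

```python
def myproxy_retrieve_argv(server, username, out_path, lifetime_hours=12):
    """Build a myproxy-logon command line that downloads a delegated proxy.

    -s: MyProxy server, -l: account name under which the proxy was stored,
    -o: output file, -t: requested lifetime in hours,
    -n: no passphrase prompt (authenticate with the caller's own credential).
    """
    return [
        "myproxy-logon",
        "-s", server,
        "-l", username,
        "-o", out_path,
        "-t", str(lifetime_hours),
        "-n",
    ]


def identity_check_commands(proxy_path):
    """Commands to run as the target user to confirm the identity switch worked."""
    return [
        ["id"],  # shows the local account the payload ended up in
        ["voms-proxy-info", "-all", "-file", proxy_path],  # shows the grid identity
    ]


retrieve = myproxy_retrieve_argv("myproxy.example.org", "ramon", "/tmp/target.proxy")
checks = identity_check_commands("/tmp/target.proxy")
```

Keeping the command construction separate from execution makes it easy to log exactly what was run on the batch node, which helps when comparing results between sites.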

  6. Step 2: Refactoring of MyProxyUtils
     • A few years ago Jose Caballero wrote MyProxyUtils
       • Python wrapper to MyProxy and gLExec
       • Some set-up for the PanDA pilot
     • Re-factored the library:
       • Added use of gLExec tools that were not available at the time
         • Environment wrap/unwrap scripts (http://cern.ch/go/h6Qz)
         • Creation of a secure sandbox through mkgltempdir (http://cern.ch/go/s6pt)
       • Reviewed the logic (e.g. removed the chmod 777 of the pilot directory before identity switching)
       • Effort in imposing coding standards: a side-objective of our activity is to help Paul Nilsson improve the overall pilot code
     • Validated MyProxyUtils by repeating the previous standalone tests
       • Problems setting the target-directory mode in mkgltempdir (to be followed up)
       • Ulrich had to fix the installation of the gLExec tools and myproxy-logon on part of the lxbatch WNs
       • Are we the first people to use the auxiliary gLExec tools?
     [Slide callout: pylint check]

  7. Step 3: gLExec SAM test
     • The existing gLExec SAM test is not very complete
       • It switches from a proxy to the same proxy and does nothing as the target user
     • We would like to extend the test
       • Use MyProxy to download a different target credential
       • Check that the gLExec tools are installed at every site
         • EGI sites should have them (Maarten Litmaath)
         • OSG sites do not have mkgltempdir (Dave Dykstra, Igor Sfiligoi)
         • Alternatively the script could be shipped with the pilot
       • Check that the full gLExec workflow succeeds at every site
     • Alessandro pointed us to the SAM gLExec code and Ramon will implement the changes

  8. Step 4: Integration with PanDA pilot. Once the user job is downloaded, switch identity as soon as possible. [Pilot workflow diagram. Labelled steps: Collect local info, Check proxy, Signal handler, Check local space, Abort and clean up, Job recovery, Additional cleanup, Get job, Fork sub process, Setup job, Transfer input, Execute payload, Monitoring loop (Check workDir), Check log size, Looping check, Multi-job loop, Transfer output, Clean up. Other labels: Pilot, gLExec.]

  9. Step 4: Integration with PanDA pilot
     • This is the hardest part of the Stage 1 model implementation
     • The guidance of Paul Nilsson is important
     • Improving the pilot's long-term sustainability
       • Getting familiar with the pilot code
       • Re-factoring the code for easier maintenance
       • Breaking up the main pilot module (~4500 lines) is a lot of work
         • Splitting the main pilot module (pilot.py) into two modules
         • Moving functions/classes between the two modules or to a utility library when needed
         • Fixing warnings and errors that arise from the separation
       • Modularizing the code
     • Need to serialize/de-serialize the Python environment (variables, constants…) in order to share it through gLExec
       • Ramon wrote a configuration manager
         • Dictionary-like way to store configuration values
         • In-depth serializable (json or pickle, depending on the available libraries)
         • Thread/multi-process safe
     • Solve permission problems that will arise from running parts of the pilot under two different users and sandboxes (e.g. merging log files)
     • Testing will not be trivial
     • Still a lot of work to be done
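
The kind of configuration manager described above could look roughly like the following. This is a minimal sketch, not Ramon's implementation: the class name `ConfigStore`, its methods, and the one-byte format tag are all hypothetical, and the locking shown covers threads only (sharing between processes would additionally need something like file locking).

```python
import pickle
import threading

try:
    import json
except ImportError:  # very old interpreters shipped without a json module
    json = None


class ConfigStore(object):
    """Dictionary-like, lock-protected store that can be dumped to and restored from bytes."""

    def __init__(self, values=None):
        self._lock = threading.RLock()
        self._values = dict(values or {})

    def __getitem__(self, key):
        with self._lock:
            return self._values[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._values[key] = value

    def __contains__(self, key):
        with self._lock:
            return key in self._values

    def dumps(self):
        """Serialize with json when available, otherwise pickle; the first byte tags the format."""
        with self._lock:
            if json is not None:
                return b"J" + json.dumps(self._values, sort_keys=True).encode("utf-8")
            return b"P" + pickle.dumps(self._values)

    @classmethod
    def loads(cls, blob):
        """Rebuild a store from dumps() output."""
        kind, payload = blob[:1], blob[1:]
        if kind == b"J":
            if json is None:
                raise ValueError("json payload but no json module available")
            return cls(json.loads(payload.decode("utf-8")))
        # pickle can execute code on load: only accept blobs from a trusted writer.
        return cls(pickle.loads(payload))


# Usage: the first identity dumps its settings, the second one restores them.
config = ConfigStore({"queue": "ANALY_TEST", "debug": False})  # hypothetical values
config["workdir"] = "/tmp/pilot_work"
restored = ConfigStore.loads(config.dumps())
```

Preferring json where possible is a deliberate choice in this sketch: the blob crosses an identity boundary, and a plain-data format is safer to read back than pickle.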

  10. Drawbacks of the Stage 1 model
     • Users have to manage their proxy on MyProxy
       • What if it expires?
     • ATLAS uses a large number of pilot certificates
       • Users would have to delegate to all pilot certificates
     • Worker nodes are hammering the MyProxy server
       • John Hover successfully tested MyProxy at ~25 Hz (March 2012), which means over 2M accesses per day should be possible. The test conditions are currently unknown
     • The MyProxy server is a single point of failure
     • These shortcomings could be solved by the model proposed for Stage 2
       • This is a proposal and has not been approved yet
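
The "over 2M accesses per day" figure follows directly from the measured rate, assuming (as the slide implicitly does) that the ~25 Hz rate could be sustained around the clock. The function name below is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def accesses_per_day(rate_hz):
    """Accesses per day if the given request rate were sustained for 24 hours."""
    return rate_hz * SECONDS_PER_DAY


print(accesses_per_day(25))  # 2160000
```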

  11. Stage 2: PanDA server caching and client integration. Disclaimer: this model is only a proposal and has not been discussed or presented for approval. [Diagram. Labelled components: MyProxy, PanDA client, PanDA Server with proxy cache, Computing Element, PanDA pilots, gLExec, jobs. Labelled flows as numbered on the slide: 1. User proxy and job; 2. User proxy; 2. Job; 3. User proxy; 4. User proxy and job; 5. Notification: proxy about to expire.]

  12. Questions?
