Condor at Brookhaven


  1. Condor at Brookhaven Xin Zhao, Antonio Chan Brookhaven National Lab CondorWeek 2009 Tuesday, April 21

  2. Outline • RACF background • RACF condor batch system • USATLAS grid job submission using condor-g

  3. RACF • Brookhaven (BNL) is a multi-disciplinary DOE lab. • The RHIC and ATLAS Computing Facility (RACF) provides computing support for BNL activities in HEP, NP, Astrophysics, etc. • RHIC Tier0 • USATLAS Tier1 • Large installation • 7000+ CPUs, 5+ PB of storage, 6 robotic silos with a capacity of 49,000+ tapes • Storage and computing to grow by a factor of ~5 by 2012.

  4. New Data Center Rising • The new data center will increase floor space by a factor of ~2 in the summer of 2009.

  5. BNL Condor Batch System • Introduced in 2003 to replace LSF. • Steep learning curve – much help from Condor staff. • Extremely successful implementation. • Complex use of job slots (formerly VMs) to determine job priority (queues), eviction, suspension, and back-filling policies, as sketched below.
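To make the slot-based policy idea concrete, here is a minimal condor_config sketch of per-slot START/SUSPEND/PREEMPT expressions that reserve one slot for suspendable back-fill work. It is illustrative only, not the actual RACF policy; the IsBackfill job attribute is invented for this example.

  # Illustrative sketch only -- NOT the actual RACF configuration.
  # An 8-slot node where slot 8 accepts only back-fill jobs; "IsBackfill"
  # is a hypothetical job attribute invented for this example.
  NUM_SLOTS = 8

  # Slots 1-7 take normal jobs; slot 8 takes only self-declared back-fill.
  START = (SlotID <  8 && TARGET.IsBackfill =!= True) || \
          (SlotID == 8 && TARGET.IsBackfill =?= True)

  # Suspend back-fill work after it has run for 6 hours ...
  SUSPEND  = (SlotID == 8) && ((time() - EnteredCurrentActivity) > 6 * 3600)
  CONTINUE = False

  # ... and evict it if it stays suspended for more than 1 hour.
  PREEMPT  = (SlotID == 8) && (Activity == "Suspended") && \
             ((time() - EnteredCurrentActivity) > 3600)

On the submit side, a back-fill job would add the matching line +IsBackfill = True to its submit description.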

  6. Condor Queues • Originally designed for vertical scalability • Complex queue priority configuration per core • Maintainable on the older hardware, which had fewer cores • Changed to horizontal scalability in 2008 • More and more multi-core hardware now • Simplified queue priority configuration per core • Reduces administrative overhead

  7. Condor Policy for ATLAS (old)

  8. ATLAS Condor configuration (old)

  9. Condor Policy @ BNL

  10. ATLAS Condor configuration (new)

  11. Condor Queue Usage

  12. Job Slot Occupancy (RACF) • Left-hand plot is for 01/2007 to 06/2007. • Right-hand plot is for 06/2007 to 05/2008. • Occupancy remained at 94% between the two periods.

  13. Job Statistics (2008) • Condor usage by RHIC experiments increased by 50% (in terms of number of jobs) and by 41% (in terms of cpu time) since 2007. • PHENIX executed ~50% of its jobs in the general queue. • General queue jobs amounted to 37% of all RHIC Condor jobs during this period. • General queue efficiency increased from 87% to 94% since 2007.

  14. Near-Term Plans • Continue integration of Condor with Xen virtual systems. • OS upgrade to 64-bit SL5.x – any issues with Condor? • Condor upgrade from 6.8.5 to stable series 7.2.x • Short on manpower – open Condor admin position at BNL. If interested, please talk to Tony Chan.

  15. Condor-G Grid Job Submission (PanDA Job Flow) • BNL, as the USATLAS Tier1, provides support to the ATLAS PanDA production system.

  16. One critical service is to maintain PanDA autopilot submission using Condor-G • Very large number (~15,000) of concurrent pilot jobs submitted as a single user • Need to maintain a very high submission rate • Autopilot attempts to always keep a set number of pending jobs in every queue of every remote USATLAS production site • Three Condor-G submit hosts in production • Quad-core Intel Xeon E5430 @ 2.66 GHz, 16 GB memory, and two 750 GB SATA drives (mirrored disks)
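For reference, a pilot submission through Condor-G boils down to a grid-universe submit description along the lines of the sketch below. The gatekeeper contact, script name, arguments, and proxy path are placeholders, not the actual autopilot configuration (the gatekeeper name is assumed from the truncated "gridgk04.racf.bn" entry in the condor_status output on slide 19).

  # Minimal Condor-G submit description for a pilot (illustrative only;
  # the real autopilot submit files are more elaborate).
  universe       = grid
  grid_resource  = gt2 gridgk04.racf.bnl.gov/jobmanager-condor
  executable     = pilot.sh
  arguments      = --site BNL_ATLAS
  output         = pilot.$(Cluster).$(Process).out
  error          = pilot.$(Cluster).$(Process).err
  log            = pilot.log
  x509userproxy  = /tmp/x509up_pilot
  queue 50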

  17. Weekly OSG Gratia Job Count Report for USATLAS VO • We work closely with the Condor team to tune Condor-G for better performance. Many improvements have been suggested and implemented by the Condor team.

  18. New Features and Tuning of Condor-G Submission (not a complete list)

  19. The gridmanager publishes resource ClassAds to the collector, so users can easily query the grid job submission status for all remote resources.

$> condor_status -grid
Name                  Job Limit  Running  Submit Limit  In Progress
gt2 atlas.bu.edu:211       2500      376           200            0
gt2 gridgk04.racf.bn       2500        1           200            0
gt2 heroatlas.fas.ha       2500      100           200            0
gt2 osgserv01.slac.s       2500      611           200            0
gt2 osgx0.hep.uiuc.e       2500        5           200            0
gt2 tier2-01.ochep.o       2500      191           200            0
gt2 uct2-grid6.mwt2.       2500     1153           200            0
gt2 uct3-edge7.uchic       2500        0           200            0

  20. Nonessential jobs • Condor assumes every job is important: it carefully holds and retries jobs that fail. • A pile-up of held jobs often clogs Condor-G and prevents it from submitting new jobs. • A new job attribute, Nonessential, is introduced. • Nonessential jobs are aborted instead of being put on hold. • Well suited for “pilot” jobs • Pilots are job sandboxes, not the real job payload; pilots themselves are not as essential as real jobs. • The job payload connects to the PanDA server through its own channel; the PanDA server knows its status and can abort it directly if needed.
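In a submit description like the sketch on slide 16, marking a pilot as nonessential amounts to setting the job ClassAd attribute named on this slide; a minimal sketch, using the generic +Attribute submit syntax (exact usage may depend on the Condor-G version):

  # Let failing pilots be aborted rather than put on hold.
  +Nonessential = True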

  21. GRID_MONITOR_DISABLE_TIME • New configurable Condor-G parameter • Controls how long Condor-G waits, after a grid monitor failure, before submitting a new grid monitor job • The old default value of 60 minutes is too long • Job submission quite often pauses during the wait, so the submission rate cannot be sustained at a high level • The new value is 5 minutes • Much better submission rate seen in production • The Condor-G developers plan to trace the underlying grid monitor failures in the Globus context
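On the submit host this is a one-line condor_config change; assuming the parameter is expressed in seconds (as in the Condor manual), the 5-minute value looks like:

  # Wait only 5 minutes (instead of the old 60-minute default) after a
  # grid monitor failure before submitting a new grid monitor job.
  GRID_MONITOR_DISABLE_TIME = 300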

  22. Separate throttle for limiting jobmanagers based on their role • Job submission won't compete with job stage-out/removal • Globus bug fix • The GRAM client (inside the GAHP) stops receiving connections from remote jobmanagers for job status updates. • We ran a cron job to periodically kill the GAHP server to clear up the connection issue, at the cost of a slower job submission rate. • The new Condor-G binary compiles against newer Globus libraries; so far so good, but we need more time to verify.
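The related per-resource gridmanager throttles live in condor_config on the submit host. The sketch below uses knob names from the Condor manual of that era, with values echoing the Job Limit / Submit Limit columns of the condor_status -grid output on slide 19; the role-based split of jobmanagers mentioned above is handled inside the gridmanager rather than by any knob shown here.

  # Per-resource limits on a Condor-G submit host (illustrative values).
  # "Job Limit" column in condor_status -grid:
  GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE  = 2500
  # "Submit Limit" column:
  GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 200
  # Cap on concurrent gt2 jobmanager processes per remote resource:
  GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE     = 10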

  23. Some best practices in Condor-G submission • Reduce the frequency of voms-proxy renewal on the submit host • Condor-G aggressively pushes out new proxies to all jobs • Frequent renewal of the voms proxy on the submit hosts slows down job submission • Avoid hard-killing jobs (-forcex) from the client side • This reduces job debris on the remote gatekeepers • On the other hand, on the remote gatekeepers we need to clean up debris more aggressively
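A rough illustration of these practices; the command lines below are examples, not the exact RACF procedures, and the proxy path and job id are placeholders.

  # Renew the voms proxy on a relaxed schedule (e.g. a few times a day
  # from cron) with a long validity, so Condor-G is not constantly
  # pushing fresh proxies to every job.
  voms-proxy-init -voms atlas -valid 96:00 -out /tmp/x509up_pilot

  # Remove jobs normally so the gridmanager can clean up remote state;
  # avoid the hard kill, which leaves debris on the gatekeeper.
  condor_rm 1234.0
  # (avoid: condor_rm -forcex 1234.0)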

  24. Near-Term Plans • Continue the good collaboration with the Condor team for better performance of Condor/Condor-G in our production environment.
