Condor at Brookhaven Xin Zhao, Antonio Chan Brookhaven National Lab CondorWeek 2009 Tuesday, April 21
Outline • RACF background • RACF Condor batch system • USATLAS grid job submission using Condor-G
RACF • Brookhaven National Laboratory (BNL) is a multi-disciplinary DOE lab. • The RHIC and ATLAS Computing Facility (RACF) provides computing support for BNL activities in HEP, NP, Astrophysics, etc. • RHIC Tier0 • USATLAS Tier1 • Large installation • 7,000+ CPUs, 5+ PB of storage, 6 robotic silos with a capacity of 49,000+ tapes • Storage and computing to grow by a factor of ~5 by 2012.
New Data Center Rising • The new data center will increase floor space by a factor of ~2 in the summer of 2009.
BNL Condor Batch System • Introduced in 2003 to replace LSF. • Steep learning curve – much help from Condor staff. • Extremely successful implementation. • Complex use of job slots (formerly VMs) to determine job priority (queues), eviction, suspension, and back-filling policies.
Condor Queues • Originally designed for vertical scalability • Complex per-core queue priority configuration • Manageable on older hardware with fewer cores • Changed to horizontal scalability in 2008 • Multi-core hardware is now the norm • Simplified per-core queue priority configuration • Reduces administrative overhead (a minimal illustrative policy sketch follows)
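For readers unfamiliar with slot-based policies, the fragment below is a minimal sketch of how a per-slot start/suspend/preempt policy can be expressed in condor_config. It is illustrative only: the actual RACF configuration is far more complex, and the group name, slot layout, and thresholds here are made up. The macros themselves (NUM_SLOTS, STARTD_SLOT_ATTRS, START, SUSPEND, CONTINUE, PREEMPT) are standard Condor startd knobs.

# Hypothetical condor_config fragment on a worker node (illustration only).
# Slot 1 runs high-priority jobs; slot 2 back-fills with general-queue jobs.
NUM_SLOTS = 2
# Make each slot's Activity visible to the other slot's policy expressions
# (exported as Slot1_Activity, Slot2_Activity, ...).
STARTD_SLOT_ATTRS = Activity
# The accounting group name below is hypothetical.
START = ((SlotID == 1) && (TARGET.AccountingGroup =?= "group_highprio")) || (SlotID == 2)
# Only the back-fill slot is ever suspended.
WANT_SUSPEND = (SlotID == 2)
# Suspend the back-fill job while slot 1 is busy; resume when it is not.
SUSPEND = (SlotID == 2) && (Slot1_Activity =?= "Busy")
CONTINUE = (Slot1_Activity =!= "Busy")
# Evict a back-fill job that has been suspended for over an hour.
PREEMPT = (SlotID == 2) && (Activity == "Suspended") && (CurrentTime - EnteredCurrentActivity > 3600)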
Job Slot Occupancy (RACF) • Left-hand plot is for 01/2007 to 06/2007. • Right-hand plot is for 06/2007 to 05/2008. • Occupancy remained at 94% between the two periods.
Job Statistics (2008) • Condor usage by RHIC experiments increased by 50% (in number of jobs) and by 41% (in CPU time) since 2007. • PHENIX executed ~50% of its jobs in the general queue. • General queue jobs amounted to 37% of all RHIC Condor jobs during this period. • General queue efficiency increased from 87% to 94% since 2007.
Near-Term Plans • Continue integration of Condor with Xen virtual systems. • OS upgrade to 64-bit SL5.x – any issues with Condor? • Condor upgrade from 6.8.5 to stable series 7.2.x • Short on manpower – open Condor admin position at BNL. If interested, please talk to Tony Chan.
Condor-G Grid Job Submission: PanDA Job Flow • BNL, as the USATLAS Tier1, provides support to the ATLAS PanDA production system.
One critical service is maintaining PanDA autopilot submission using Condor-G • Very large number (~15,000) of concurrent pilot jobs, all submitted as a single user • Need to sustain a very high submission rate • Autopilot attempts to always keep a set number of pending jobs in every queue of every remote USATLAS production site • Three Condor-G submit hosts in production (a hedged submit-file sketch follows) • Quad-core Intel Xeon E5430 @ 2.66GHz, 16 GB memory, and two 750 GB SATA drives (mirrored)
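For context, a single pilot submitted through Condor-G uses the grid universe; a minimal submit description might look like the sketch below. The gatekeeper hostname, executable, and proxy path are hypothetical placeholders, and in production the autopilot generates these submit files programmatically rather than by hand.

# Hypothetical Condor-G submit description for one pilot job
# (hostname and paths are made up for illustration).
universe      = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-condor
executable    = pilot_wrapper.sh
output        = pilot_$(Cluster).$(Process).out
error         = pilot_$(Cluster).$(Process).err
log           = pilot.log
x509userproxy = /tmp/x509up_u12345
queue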
Weekly OSG Gratia Job Count Report for USATLAS VO • We work closely with the Condor team to tune Condor-G for better performance. Many improvements have been suggested and implemented by the Condor team.
New Features and Tuning of Condor-G Submission (not a complete list)
Gridmanager publishes resource ClassAds to the collector, so users can easily query the grid job submission status for all remote resources:

$> condor_status -grid
Name                  Job Limit   Running   Submit Limit   In Progress
gt2 atlas.bu.edu:211       2500       376            200             0
gt2 gridgk04.racf.bn       2500         1            200             0
gt2 heroatlas.fas.ha       2500       100            200             0
gt2 osgserv01.slac.s       2500       611            200             0
gt2 osgx0.hep.uiuc.e       2500         5            200             0
gt2 tier2-01.ochep.o       2500       191            200             0
gt2 uct2-grid6.mwt2.       2500      1153            200             0
gt2 uct3-edge7.uchic       2500         0            200             0
Nonessential jobs • Condor assumes every job is important: it carefully holds and retries failed jobs • A pile-up of held jobs often clogs Condor-G and prevents it from submitting new jobs • A new job attribute, Nonessential, is introduced • Nonessential jobs are aborted instead of being put on hold • Well suited for "pilot" jobs (see the sketch below) • A pilot is a job sandbox, not the real job payload; pilots themselves are not as essential as real jobs • The job payload connects to the PanDA server through its own channel, so the PanDA server knows its status and can abort it directly if needed
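As a rough illustration, marking a pilot as nonessential amounts to one extra line in its submit description (added to a grid-universe submit file like the earlier sketch). The attribute spelling below reflects our understanding; check the Condor-G documentation for your version.

# Hypothetical fragment of a pilot's submit description.
# With this attribute set, Condor-G aborts the pilot on failure
# rather than placing it on hold, so held pilots no longer pile
# up and clog the submit host.
+Nonessential = True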
GRID_MONITOR_DISABLE_TIME • New configurable Condor-G parameter • Controls how long Condor-G waits, after a grid monitor failure, before submitting a new grid monitor job • Old default value of 60 minutes is too long • New job submission quite often pauses during the wait, so job submission cannot be sustained at a high rate • New value is 5 minutes (see the config sketch below) • Much better submission rate seen in production • Condor-G developers plan to trace the underlying grid monitor failures in the Globus context
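On the submit host this amounts to a one-line condor_config change; assuming the parameter takes its value in seconds, the 5-minute setting would look like the sketch below.

# Hypothetical condor_config fragment on a Condor-G submit host.
# Wait only 5 minutes (instead of the old 60-minute default) after
# a grid monitor failure before starting a new grid monitor job.
GRID_MONITOR_DISABLE_TIME = 300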
Separate throttles for limiting jobmanagers based on their role • Job submission no longer competes with job stage-out/removal (see the config sketch below) • Globus bug fix • GRAM client (inside the GAHP) stops receiving connections from the remote jobmanager for job status updates • We ran a cron job to periodically kill the GAHP server to clear up the connection issue, at the cost of a slower job submission rate • New Condor-G binary compiles against newer Globus libraries; so far so good, but we need more time to verify
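For orientation, the per-resource throttles that Condor-G exposes in condor_config are sketched below with illustrative values (2500 and 200 match the Job Limit and Submit Limit columns in the condor_status -grid output shown earlier). The role-separated jobmanager throttle described above may be controlled by additional, version-specific knobs, so treat this as a sketch rather than the exact configuration.

# Hypothetical condor_config fragment on a Condor-G submit host.
# Cap the number of jobs submitted to any one remote resource.
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 2500
# Cap the number of submissions in flight to any one resource.
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 200
# Cap the number of globus jobmanager processes per remote gatekeeper.
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10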
Some best practices in Condor-G submission • Reduce the frequency of VOMS proxy renewal on the submit host • Condor-G aggressively pushes new proxies out to all jobs • Frequent renewal of the VOMS proxy on the submit hosts slows down job submission • Avoid hard-killing jobs (condor_rm -forcex) from the client side • Reduces job debris on the remote gatekeepers • On the other hand, on the remote gatekeepers we need to clean up debris more aggressively (see the command sketch below)
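A rough sketch of what these practices look like on the command line; the validity period and cluster ID are placeholders.

# Create a longer-lived VOMS proxy so renewals are infrequent
# (96-hour validity is a placeholder; use what your VO allows).
$> voms-proxy-init -voms atlas -valid 96:00

# Prefer a normal remove, which lets Condor-G clean the job up
# on the remote gatekeeper ...
$> condor_rm 1234.0

# ... over a hard kill, which forgets the job immediately and
# tends to leave debris behind on the gatekeeper.
$> condor_rm -forcex 1234.0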
Near-Term Plans • Continue the good collaboration with the Condor team for better performance of Condor/Condor-G in our production environment.