Condor Usage at Brookhaven National Lab
Alexander Withers (talk given by Tony Chan)
RHIC Computing Facility
Condor Week - March 15, 2005
About Brookhaven National Lab • One of a handful of laboratories supported and managed by the U.S. government through the DOE. • Multi-disciplinary lab with 2,700+ employees; Physics is the largest department. • The Physics Dept. has its own computing division (30+ FTEs) to support physics (HEP) projects. • RHIC (nuclear physics) and ATLAS (HEP) are the largest projects currently supported.
Computing Facility Resources • Full-service facility: central/distributed storage, large Linux Farm, robotic system for data storage, data backup, etc. • 6+ PB permanent tape storage capacity. • 500+ TB central/distributed disk storage capacity. • 1.4 million SpecInt2000 aggregate computing power in the Linux Farm.
History of Condor at Brookhaven • First looked at Condor in 2003 as a replacement for LSF and in-house batch software. • Installed 6.4.7 in August 2003. • Upgraded to 6.6.0 in February 2004. • Upgraded to 6.6.6 (with 6.7.0 startd binary) in August 2004. • User base grew from 12 (April 2004) to 50+ (March 2005).
Overview of Computing Resources • Total of 2,750 CPUs (growing to 3,400+ in 2005). • Two central managers, one acting as a backup. • Three specialized submit machines, each handling ~600 simultaneous jobs on average. • 131 of the execute nodes can also act as submission nodes. • One monitoring/CondorView server.
Overview of Computing Resources, cont. • Six GLOBUS gateway machines for remote job submission. • Most machines run SL 3.0.2 on the x86 platform; some still run RH 7.3. • Running 6.6.6 with the 6.7.0 startd binary to take advantage of the multiple-VM feature.
Overview of Configuration • Computing resources divided into 6 pools. • Two configuration models: • Split pool resources into two parts and restrict which jobs can run in each part. • More complex version of the Bologna Batch System. • A pool uses one or both of these models. • Some pools employ user priority preemption. • Use “drop queue” method to fill fast machines first. • Have tools to easily reconfigure nodes. • All jobs use vanilla universe (no checkpointing).
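The talk does not spell out how the "drop queue" method is implemented on the pool side, so the sketch below shows only an illustrative per-job alternative: a vanilla-universe submit file whose rank expression steers jobs toward faster machines via the standard KFlops/Mips startd benchmark attributes. The executable and file names are hypothetical.

    # Illustrative vanilla-universe submit file (hypothetical file names).
    universe    = vanilla
    executable  = reco.sh
    output      = reco.$(Process).out
    error       = reco.$(Process).err
    log         = reco.log
    # Prefer machines with higher benchmark scores, i.e. favor fast machines
    # from the job's side (KFlops and Mips are standard startd attributes).
    rank        = KFlops + Mips
    queue 10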
Two Part Model • Nodes are assigned one of two tasks irrespective of Condor: analysis or reconstruction. • Within Condor, a node advertises itself as either an analysis node or a reconstruction node. • A job must advertise itself in the same manner to match with an appropriate node. • Only certain users may run reconstruction jobs but anyone can run an analysis job.
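A minimal sketch of how such a split can be expressed, assuming a custom machine attribute (here called NodeType) and a matching job attribute (JobType); the actual attribute names used at BNL are not given in the talk. Startd configuration first, then the corresponding submit file fragment.

    # condor_config.local on an analysis node (hypothetical attribute names).
    NodeType     = "analysis"
    STARTD_EXPRS = NodeType
    # Only start jobs that declare themselves as analysis jobs.
    # Restricting reconstruction jobs to certain users could additionally
    # test TARGET.Owner in the START expression on reconstruction nodes.
    START        = (TARGET.JobType =?= "analysis")

    # Submit file fragment for an analysis job.
    universe     = vanilla
    +JobType     = "analysis"
    requirements = (NodeType =?= "analysis")
    queue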
[Diagram: two-part model. Execute nodes range from slow (group 1) to fast (group 5), with each node's VMs (vm1, vm2) designated analysis or reconstruction; an example reconstruction job requests group <= 2. Policy on these nodes: no suspension, no preemption, a job starts whenever a CPU is free.]
A More Complex Version of the Bologna Model • Dual-CPU nodes, each configured with 8 VMs. • Only two jobs run at a time (one per CPU). • Four job categories, each with its own priority (two VMs per category; see diagram below). • A high-priority VM suspends a randomly chosen VM of lower priority. • The random selection prevents the same VM from always being suspended.
[Diagram: VM priority tiers on each node: MC (vm1/vm2, lowest), Low (vm3/vm4), Medium (vm5/vm6), High (vm7/vm8). Nodes again range from slow (group 1) to fast (group 5). Example: a reconstruction job requesting group == 3 matches a medium-priority VM (vm5/vm6). Policy: lower-priority VMs are suspended rather than preempted; a job starts if a CPU is free or the job is of higher priority.]
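A heavily simplified configuration sketch of the suspension scheme, assuming the Condor 6.x virtual-machine knobs (NUM_VIRTUAL_MACHINES, and STARTD_VM_EXPRS for vmN_-prefixed cross-VM attributes). The real BNL configuration, including per-category expressions and the random choice of victim, is more involved and is not given in the talk.

    # condor_config.local on a dual-CPU execute node (sketch only).
    NUM_CPUS             = 2
    NUM_VIRTUAL_MACHINES = 8
    # Publish each VM's state/activity into every other VM's ad
    # (visible as vmN_State, vmN_Activity).
    STARTD_VM_EXPRS      = State, Activity
    # Suspend rather than preempt when a higher-priority VM becomes busy.
    WANT_SUSPEND         = True
    # Example only: the low-priority VMs (vm3/vm4) suspend while a
    # high-priority VM (vm7/vm8) is busy.
    SUSPEND              = ( (VirtualMachineID == 3 || VirtualMachineID == 4) && \
                             (vm7_Activity == "Busy" || vm8_Activity == "Busy") )
    CONTINUE             = ( $(SUSPEND) == False )
    PREEMPT              = False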
Issues We've Had to Deal With • Tuned parameters to alleviate scalability problems: • MATCH_TIMEOUT • MAX_CLAIM_ALIVES_MISSED • Panasas (a proprietary file system) creates kernel threads with whitespace in the process name, which broke an fscanf() in procapi.C; Panasas has since fixed the bug. • High-volume users can dominate the pool; partially solved with PREEMPTION_REQUIREMENTS.
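The corresponding central-manager settings might look like the sketch below. The numeric values are illustrative, not the ones used at BNL, and the preemption expression only follows the pattern of Condor manual examples of that era (RemoteUserPrio vs. SubmittorPrio); exact attribute names should be checked against the installed version's documentation.

    # Negotiator/startd tuning (illustrative values only).
    # Allow more time for a match to complete before the startd gives up on it.
    MATCH_TIMEOUT           = 300
    # Tolerate more missed keep-alives from the schedd before a claim is dropped.
    MAX_CLAIM_ALIVES_MISSED = 10
    # Let a much better-priority user preempt a heavy user's running jobs.
    PREEMPTION_REQUIREMENTS = ( RemoteUserPrio > SubmittorPrio * 1.2 )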
Issues We've Had to Deal With, cont. • DAGMan problems (latency, termination); switched from DAGMan to plain Condor. • Created our own ClassAds and job ads to build batch queues and handy management tools (i.e., our own version of condor_off). • Modified CondorView to meet our accounting & monitoring requirements.
Issues Not Yet Resolved • Need a job ClassAd attribute that gives the user's primary group --> better control over cluster usage. • Transfer output files for debugging when a job is evicted. • Need an option to force the schedd to release its claim after each job. • Allow the schedd to set a mandatory periodic_remove policy to avoid manual cleanup.
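Until such a schedd-level knob exists, the policy has to live in each submit file; a minimal per-job sketch, with purely illustrative time limits and a hypothetical executable name, might look like this.

    # Per-job cleanup policy in the submit file (illustrative thresholds, not BNL policy).
    universe        = vanilla
    executable      = analysis.sh
    # Remove the job if it has been held for more than a day (JobStatus 5 = Held)
    # or running for more than three days (JobStatus 2 = Running).
    periodic_remove = ( (JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 86400) || \
                        (JobStatus == 2 && (CurrentTime - EnteredCurrentStatus) > 259200) )
    queue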
Issues Not Yet Resolved, cont. • The shadow seems to make a large number of NIS calls; possibly a problem with address caching in vanilla-universe shadows? • Need Kerberos support to comply with security mandates. • Interested in Condor on Demand (COD), but lack of functionality prevents more usage. • Need more (and more effective) cluster management tools (condor_off works?).
Near-Term Plans & Summary • Waiting for the 6.8.x series (late 2005?) to upgrade. • Scalability concerns as usage rises. • High availability becomes more critical as usage rises. • Integration of BNL Condor pools with external pools, but concerned about security. • Need some of the functionality listed above for a meaningful upgrade and improved cluster management capability.