Central Reconstruction System on the RHIC Linux Farm in Brookhaven Laboratory HEPIX - BNL October 19, 2004 Tomasz Wlodek - BNL
Background • Brookhaven National Laboratory (BNL) is a multi-disciplinary research laboratory funded by the US government. • BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments – Brahms, Phenix, Phobos and Star. • The RHIC Computing Facility (RCF) was formed in the mid-1990s to address the computing needs of the RHIC experiments.
Background (cont.) • The data collected by the RHIC experiments are written to tape in the HPSS mass storage facility, to be reprocessed at a later time. • To automate the process of data reconstruction, a home-grown batch system (“Old CRS”) was developed. • “Old CRS” manages data staging from and to the HPSS system, schedules jobs and monitors their execution. • As the farm has grown in size, the old CRS system no longer scales well and needs to be replaced.
RCF facility (photos): server racks and HPSS storage.
The RCF Farm – batch systems at present: the reconstruction farm (CRS system) and the data analysis farm (LSF batch system), separated by a “Berlin wall”.
New CRS - requirements • Stage input files from HPSS and Unix file systems (and, in the future, the Grid) • Stage output files to HPSS and Unix file systems (and, in the future, the Grid) • Capable of running mass production with a high degree of automation • Error diagnostics • Bookkeeping
Condor as the scheduler of the new CRS system • Condor comes with DAGMan – a meta-scheduler which allows one to build “graphs” of interdependent batch jobs. • DAGMan allows us to construct jobs consisting of several subjobs which perform the data staging operations and the data reconstruction separately (a minimal DAG sketch is shown below) … • … which in turn allows us to optimize the staging of data tapes and minimize the number of tape mounts.
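As an illustration (not taken from the talk), the DAGMan input file for a single CRS job with two input files might look roughly as follows; all node and file names are hypothetical:

    # crs_reco_0042.dag (hypothetical names)
    # One parent node per input file stages data; the main node reconstructs.
    JOB stage_file_A stage_file_A.submit
    JOB stage_file_B stage_file_B.submit
    JOB main_reco    main_reco.submit
    # The reconstruction node runs only after all staging nodes succeed.
    PARENT stage_file_A stage_file_B CHILD main_reco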
New CRS batch system (architecture): CRS jobs interact with an HPSS interface (which talks to HPSS), a MySQL database, and a logbook server.
Anatomy of a CRS job: parent jobs (1 per input file) feeding one main job. Each CRS job consists of several subjobs. Parent jobs (one per input file) are responsible for locating the input data and – if necessary – staging them from tape to the HPSS disk cache. The main job is responsible for the actual data reconstruction and is executed if and only if all parent jobs completed successfully. A sketch of how such a DAG might be generated follows below.
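A rough sketch, in Python, of how such a DAG could be generated; the function and file-naming conventions are illustrative, not the actual CRS code:

    # Hypothetical sketch: build the DAGMan description for one CRS job.
    def write_crs_dag(job_name, input_files, dag_path):
        lines, parents = [], []
        for i, f in enumerate(input_files):
            node = "stage_%d" % i            # one parent node per input file
            parents.append(node)
            lines.append("JOB %s stage_%d.submit" % (node, i))
        lines.append("JOB main %s_main.submit" % job_name)
        # The main (reconstruction) node depends on every staging node.
        lines.append("PARENT %s CHILD main" % " ".join(parents))
        open(dag_path, "w").write("\n".join(lines) + "\n")

    write_crs_dag("reco_0042", ["run42_seg0.dat", "run42_seg1.dat"], "reco_0042.dag")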
Lifecycle of the Main Job: First check whether all parent jobs completed successfully. If not, perform error diagnostics and recovery and update the job databases. If yes, import the input files from NFS or HPSS to local disk, run the user’s executable, then check its exit code and check that all required data files are present, and finally export the output files. At all stages of execution the job keeps track of the production status and updates the job/file databases.
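In Python pseudo-form, this control flow could look roughly like the sketch below; the function names, the .root output convention and the report() helper are assumptions, not the actual CRS implementation:

    import os, shutil, subprocess

    def report(msg):
        # Placeholder for the logbook/database updates performed at every stage.
        print(msg)

    # Hypothetical sketch of the main-job control flow.
    def run_main_job(input_files, executable, output_dir):
        # 1. Did all parent (staging) jobs deliver their files?
        for f in input_files:
            if not os.path.exists(f):
                report("missing input %s - running diagnostics and recovery" % f)
                return 1
            shutil.copy(f, ".")              # import to local disk
        # 2. Run the user's executable and check its exit code.
        rc = subprocess.call([executable] + [os.path.basename(f) for f in input_files])
        if rc != 0:
            report("executable failed with exit code %d" % rc)
            return rc
        # 3. Export the output files (here assumed to end in .root).
        for out in os.listdir("."):
            if out.endswith(".root"):
                shutil.copy(out, output_dir)
        report("main job finished successfully")
        return 0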
Lifecycle of the HPSS interface subjob: Check whether HPSS is available. If not, notify the system about the error and perform error diagnostics (for example: can resubmitting the request help?). If HPSS is available, submit a “stage file” request. If the stage is not successful, go back to error diagnostics; if it is, notify the CRS system and update the databases.
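A minimal sketch of this retry logic, assuming the HPSS hsi client is used for the transfer (an assumption; the real interface may differ):

    import subprocess, time

    def report(msg):
        print(msg)

    def hpss_available():
        # Placeholder: in reality this would probe the HPSS servers.
        return True

    # Hypothetical sketch of the staging subjob; the hsi command is an assumption.
    def stage_file(hpss_path, local_path, max_attempts=3):
        for attempt in range(max_attempts):
            if not hpss_available():
                report("HPSS unavailable, waiting before retrying")
                time.sleep(600)
                continue
            rc = subprocess.call(["hsi", "get", local_path, ":", hpss_path])
            if rc == 0:
                report("stage successful - notifying CRS and updating databases")
                return True
            report("stage failed (code %d) - can resubmitting help?" % rc)
        report("giving up - notifying the CRS system about the error")
        return False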
Logbook manager • Each job should provide its own logfile, for monitoring and debugging purposes. • It is easy to keep a logfile for an individual job running on a single machine. • CRS jobs, however, consist of several subjobs which run independently, on different machines and at different times. • In order to synchronize the record keeping, a dedicated logbook manager is needed, responsible for compiling the reports from individual subjobs into one human-readable logfile.
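The core of such a logbook manager could be as simple as the sketch below (a hypothetical illustration; the log directory and message format are assumptions):

    import time

    # Hypothetical sketch: merge reports from subjobs running on different
    # hosts at different times into one human-readable logfile per CRS job.
    def append_report(job_id, subjob, host, message, logdir="."):
        line = "%s %s@%s: %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"),
                                   subjob, host, message)
        with open("%s/%s.log" % (logdir, job_id), "a") as logfile:
            logfile.write(line)

    append_report("reco_0042", "stage_0", "rcrs0123", "stage successful")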
CRS Database • In order to keep track of the production, CRS is interfaced to a MySQL database. • The database stores the status of each job and subjob, the status of the data files, HPSS staging requests, open data transfer links, and statistics of completed jobs.
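For illustration, a status update might be written through the MySQL-Python bindings roughly as follows; the host, table and column names are hypothetical, not the actual CRS schema:

    import MySQLdb  # MySQL-Python bindings

    # Hypothetical sketch: record a subjob status change in the CRS database.
    def update_subjob_status(job_id, subjob, status):
        conn = MySQLdb.connect(host="localhost", db="crs",
                               user="crs", passwd="...")
        cur = conn.cursor()
        cur.execute("UPDATE subjobs SET status=%s WHERE job_id=%s AND name=%s",
                    (status, job_id, subjob))
        conn.commit()
        conn.close()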
The RCF Farm – future: both the reconstruction farm and the analysis farm run under Condor.
Condor as a batch system for HEP • Condor has several nice features which make it very useful as a batch system (for example DAGs). • Submitting a very large number of Condor jobs (of order 10k) from a single node can put a very high load on the submit machine, leading to a potential Condor meltdown. • This is not a problem when many users submit jobs from many machines, but for centralized services (data reconstruction, MC production, …) it can become a very difficult issue.
Status of Condor – cont. • People from the Condor team were extremely helpful in solving Condor problems – when they were on site (at BNL). • Remote support (by e-mail) is slow.
Management barrier • So far, HEP experiments have been relatively small (approx. 500 people) by business standards. • Everybody knows everybody. • This is going to change – the next generation of experiments will have thousands of physicists.
Management barrier – cont. • In such big communities it will become increasingly hard to introduce new software products. • Users have an (understandable) tendency to be conservative and do not want changes in the environment in which they work. • Convincing users to switch to a new product and aiding them in the transition will become a much more complex task than inventing the product itself!