US CMS Testbed
Large Hadron Collider • Supercollider on French-Swiss border • Under construction, completion in 2006. (Based on slide by Scott Koranda at NCSA)
Compact Muon Solenoid • Detector / Experiment for LHC • Search for Higgs Boson, other fundamental forces
Still Under Development • Developing software to process the enormous amount of data generated • For testing and prototyping, the detector is being simulated now • Simulating events (particle collisions) • We’re involved in the United States portion of the effort
Storage and Computational Requirements • Simulating and reconstructing millions of events per year, batches of around 150,000 (about 10 CPU months) • Each event requires about 3 minutes of processor time • A single run will generate about 300 GB of data
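As a rough sanity check on those numbers (taking the stated 3 CPU-minutes per event at face value), a batch of 150,000 events does indeed come to roughly 10 CPU-months:

```python
# Back-of-the-envelope check of the slide's figures (assumed, not measured, values).
events_per_batch = 150_000
minutes_per_event = 3              # ~3 CPU-minutes of processor time per event

cpu_minutes = events_per_batch * minutes_per_event
cpu_days = cpu_minutes / 60 / 24
cpu_months = cpu_days / 30

print(f"{cpu_days:.0f} CPU-days (~{cpu_months:.1f} CPU-months)")
# -> 312 CPU-days (~10.4 CPU-months), consistent with "about 10 CPU months" per batch
```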
Before Condor-G and Globus • Runs are hand-assigned to individual sites • Manpower-intensive to organize run distribution and collect results • Each site has staff managing their runs • Manpower-intensive to monitor jobs, CPU availability, disk space, etc.
Before Condor-G and Globus • Use existing tool (MCRunJob) to manage tasks • Not “Grid-Aware” • Expects reliable batch system
UW High Energy Physics: A special case • Was a site being assigned runs • Modified its local configuration to flock to the UW Computer Science Condor pool • When possible, used the standard universe to increase the number of available computers • Used 30,000 CPU hours during one week
Our Goal • Move the work onto “the Grid” using Globus and Condor-G
Why the Grid? • Centralize management of simulation work • Reduce manpower at individual sites
Why Condor-G? • Monitors and manages tasks • Reliability in an unreliable world
Lessons Learned • The grid will fail • Design for recovery
The Grid Will Fail • The grid is complex • The grid is new and untested • Often beta, alpha, or prototype. • The public Internet is out of your control • Remote sites are out of your control
The Grid is Complex • Our system has 16 layers • A minimal Globus/Condor-G system has 9 layers • Most layers stable and transparent • MCRunJob → Impala → MOP → condor_schedd → DAGMan → condor_schedd → condor_gridmanager → gahp_server → globus-gatekeeper → globus-job-manager → globus-job-manager-script.pl → local batch system submit → local batch system execute → MOP wrapper → Impala wrapper → actual job
Design for Recovery • Provide recovery at multiple levels to minimize lost work • Be able to start a particular task over from scratch if necessary • Never assume that a particular step will succeed • Allocate lots of debugging time
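One way to read "never assume that a particular step will succeed" in code: wrap every step in bounded retries, and keep the whole task cheap to restart from scratch. This is only an illustrative sketch; the step functions and retry policy below are hypothetical, not the testbed's actual tooling, which leans on Condor, Condor-G, and DAGMan for most of this.

```python
import random
import time

def with_retries(step, attempts=3, delay=1):
    """Run one step of a grid task, retrying transient failures a bounded number of times."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except RuntimeError as err:
            print(f"{step.__name__}: attempt {attempt}/{attempts} failed: {err}")
            if attempt == attempts:
                raise              # permanent failure: surface it so the task can be redone from scratch
            time.sleep(delay)

# Placeholder steps standing in for stage-in, remote execution, and stage-out.
def stage_input():
    if random.random() < 0.3:      # pretend the wide-area network flakes out now and then
        raise RuntimeError("transfer timed out")

def run_remote_job():
    pass

def stage_output():
    pass

for step in (stage_input, run_remote_job, stage_output):
    with_retries(step)
```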
Now • Single master site sends jobs to distributed worker sites. • Individual sites provide configured Globus node and batch system • 300+ CPUs across a dozen sites. • Condor-G acts as reliable batch system and Grid front end
How? MOP. • Monte Carlo Distributed Production System • Pretends to be local batch system for MCRunJob • Repackages jobs to run on a remote site
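In outline, "repackaging" means wrapping each task MCRunJob produces in a script that pulls its input from the master site, runs the real work, and ships the results back. A hypothetical sketch of that idea follows; it is not MOP's actual code, and fetch_from_master and push_to_master stand in for whatever file-transfer tool a site uses.

```python
def repackage(job_script, input_files, output_dir):
    """Wrap a job written for a local batch system so it can run unattended
    at a remote worker site (illustration only, not MOP's implementation)."""
    stage_in = "\n".join(f"fetch_from_master {f}" for f in input_files)
    return f"""#!/bin/sh
# --- stage in: pull input files from the master site ---
{stage_in}
# --- run the real work (the job MCRunJob thinks went to a local batch system) ---
./{job_script}
# --- stage out: ship results back to the master site ---
push_to_master {output_dir}
"""

print(repackage("cmsim_run.sh", ["geometry.dat", "events.in"], "results/"))
```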
CMS Testbed Big Picture • (Diagram) The master site runs MCRunJob, MOP, DAGMan, and Condor-G; Globus carries jobs to the worker sites, where Condor manages the real work
DAGMan, Condor-G, Globus, Condor • DAGMan - Manages dependencies • Condor-G - Monitors the job on master site • Globus - Sends jobs to remote site • Condor - Manages job and computers at remote site
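To make the division of labor concrete, a single job of that era was described to Condor-G in a submit file naming the remote Globus gatekeeper, which in turn handed the job to that site's local batch system. A minimal sketch follows; the hostname and file names are made up, and the globus-universe attributes shown reflect Condor-G as it was then (current HTCondor versions spell this differently).

```python
# Write a minimal Condor-G submit description (hypothetical job and site names).
submit_description = """\
# Send the job through Globus to a remote gatekeeper,
# which hands it to that site's local batch system (Condor, in this example).
universe        = globus
globusscheduler = gatekeeper.example.edu/jobmanager-condor
executable      = mop_wrapper.sh
arguments       = run_000123
output          = run_000123.out
error           = run_000123.err
log             = run_000123.log
queue
"""

with open("run_000123.sub", "w") as f:
    f.write(submit_description)

# Normally MOP/DAGMan, not a person, would then run:  condor_submit run_000123.sub
```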
Recovery: Condor • Automatically recovers from machine and network problems on the execute cluster
Recovery: Condor-G • Automatically monitors for and retries a number of possibly transient errors • Recovers from a down master site, down worker sites, and a down network • After a network outage, can reconnect to still-running jobs
Recovery: DAGMan • If a particular task fails permanently, notes it and allows easy retry • Can retry automatically, but we don’t
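For reference, a DAGMan retry is a one-line declaration per node; the testbed leaves it off and resubmits failed nodes by hand instead. A hypothetical two-node fragment (node and file names invented), written out from Python to keep the example self-contained:

```python
# Write a minimal DAG: simulate a batch of events, then archive the output.
dag = """\
JOB  simulate  simulate.sub
JOB  archive   archive.sub
PARENT simulate CHILD archive
# DAGMan could retry a failed node automatically, e.g.:
#   RETRY simulate 3
# We leave this off and retry failed nodes manually.
"""

with open("run_000123.dag", "w") as f:
    f.write(dag)

# Submitted with:  condor_submit_dag run_000123.dag
```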
Globus • Globus software is under rapid development • Use old software and miss important updates • Use new software and deal with version incompatibilities
Fall of 2002: First Test • Our first run gave us two weeks to do about 10 days of work (given available CPUs at the time) • We had problems: power outage (several hours), network outages (up to eleven hours), worker site failures, full disks, Globus failures
It Worked! • The system recovered automatically from many problems • Relatively low human intervention: approximately one full-time person
Since Then • Improved automatic recovery for more situations • Generated 1.5 million events (about 30 CPU years) in just a few months • Currently gearing up for even larger runs starting this summer
Future Work • Expanding the grid with more machines • Use Condor-G’s scheduling capabilities to automatically assign jobs to sites • Officially replace the previous system this summer
Thank You! • http://www.cs.wisc.edu/condor • adesmet@cs.wisc.edu