
The First CMS Data Challenge (~1998/99)

An overview of the first data challenge of the CMS experiment, the problems it faced, and the use of Condor to run the simulation and reconstruction programs, with notes on the environment, the history, and the efforts of the CMS team.



Presentation Transcript


  1. The First CMS Data Challenge (~1998/99) Using Condor

  2. Disclaimer • Official presentations of those activities are no longer available… • It was a long time ago • The machines used have long since been decommissioned • Files were lost with the decommissioned disks • Only fragments of information are still around • Mostly on “printed” slides and unlinked Web pages • And … my memory is not what it was at that time … • However I could find some information and, unsurprisingly, a number of “well known” names • Any list of them would certainly forget somebody, so I’ll avoid making one, • but PAOLO MAZZANTI is worth mentioning!

  3. Environment and History • CMS Simulation program (CMSIM) using Geant3 (Fortran) • Different versions in rapid development • Objectivity was the CMS database choice at that time • First CMS reconstruction programs using C++ • SunOS and HP-UX were the CMS basic operating systems • But Linux was rapidly growing • And we had a legacy of many Digital Alphas from LEP • ~Year 2000: INFN started to fund PC farms • In 1999 INFN launched the INFN-Grid project • The MONARC project was running at CERN • Then … we were flooded by GRID and Tiers

  4. The Data Challenge start … • From the minutes of a meeting of 14 May 1998: • Need to generate 360,000 events of single muons (3 different momenta) and 30,000 events of Higgs -> 2 muons (3 different masses) • To be done over Condor, starting June 98 (a sketch of what such a submission could have looked like follows this slide) • The CMSIM code had been ported from SUN to Alpha: it needed to be “linked” with the Condor libraries • Local test runs of the Higgs simulation gave ~1.4 min/event on both Alpha and SUN (with ~5 min of program initialization): >700 hours of CPU time for that sample (30,000 events × 1.4 min/event = 42,000 min = 700 hours) • From another meeting of 13 May 1998: • Planning the National (INFN) Condor pool (~57 machines available) • CMSIM was one of the possible applications over the WAN • GARFIELD (electric-field simulation of the CMS Muon Detectors’ DC cells) would run only locally (checkpoint file too big! Smaller than a typical mail attachment today …)
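The slides do not show how the jobs were actually described to Condor. As an illustration only, a minimal submit file for Condor’s “standard universe” (the mode that provides the checkpointing discussed later) might have looked like the sketch below, assuming the CMSIM executable had been relinked with Condor’s condor_compile wrapper (e.g. condor_compile f77 -o cmsim.alpha …). All file names and values are illustrative, not taken from the 1998 production.

    # Hypothetical Condor submit description (standard universe, checkpointing
    # enabled by relinking with condor_compile); names are illustrative.
    universe     = standard
    executable   = cmsim.alpha
    requirements = (OpSys == "OSF1")         # target the Digital Alpha nodes
    input        = cards_$(Process).ffread   # Geant3/FFREAD data cards per job
    output       = run_$(Process).out
    error        = run_$(Process).err
    log          = production.log
    queue 30                                 # e.g. one job per dataset

Running condor_submit on such a file queues 30 jobs at once, each picking up its own input file via the $(Process) macro.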

  5. The challenge … before starting • We (CMS Bologna) were already using Condor!

  6. Method Used to Produce the Drift Times (9/12/1998 report, P. Capiluppi) • Full simulation on ALPHA machines • The Bologna Condor facility was used • Four tracks considered for each x, track angle, and B • For each track we assumed the drift time is given by: • 50% one electron • 40% two electrons • 10% three electrons

  7. [Figure from the 9/12/1998 report, P. Capiluppi: drift lines when Bw = 0.3 T]

  8. And we did start • A strange (to my mind) CMS Simulation statement (dated 20 Apr 1998): • The objective was to measure the throughput (in terms of CMS simulated events per hour) of our Condor Pool … At the beginning we had some compatibility problems between the CERN Library and the Condor libraries, but the Condor Team promptly solved them. This has to be stressed again: the support from the Condor team is very good! • Indeed, in that period we (CMS Italy) started to support the Condor team (a concrete, even if small, contribution) • The number of machines running the simulation under Condor ranged from 9 to 19! Some 40% of the jobs were checkpointed (note that in the CMS case the checkpoint file was of the order of 66 MB!).

  9. The real challenge (1/2) • CMSIM jobs were mostly CPU intensive • Very small I/O compared to the CPU time required to simulate the (carefully chosen) number of events/job • Executable of the order of 140 MBytes • Some of the simulation programs required access to input data (via RPC, not NFS, even in the “local” environment of Bologna) • Small in size in any case: ~130 KBytes/event read, and the same amount written • Some of the jobs had a larger I/O: ~600 KBytes/event • Propagation of the random seed for the simulation among the jobs • Required careful bookkeeping (hand-made at that time; a sketch of what it had to guarantee follows this slide) • Coordination between different activities over the Condor Pool(s) • We were not the only users, and some of the time constraints of the production required coordination • In particular, when going to the national WAN implementation, we faced large fluctuations in the response time and in the consistency of the local machines • Well known, nowadays, in the Grid …
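The seed bookkeeping was done by hand in 1998; the Python sketch below is only a plausible reconstruction of the invariant it had to enforce: every job gets a disjoint random seed, and the assignment is recorded so a job can be reproduced. The base seed, stride, and ledger format are illustrative, not historical.

    # Illustrative seed ledger: give each job a distinct seed and write the
    # assignment down, so any failed job can be rerun with the same seed.
    import csv

    def assign_seeds(n_jobs, base_seed=12345, stride=1000,
                     ledger_path="seed_ledger.csv"):
        seeds = [base_seed + i * stride for i in range(n_jobs)]
        with open(ledger_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["job_id", "seed"])
            for job_id, seed in enumerate(seeds):
                writer.writerow([job_id, seed])
        return seeds

    # e.g. one seed per 4000-event dataset of the October 1998 production
    print(assign_seeds(30))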

  10. The real challenge (2/2) • Going from SUN OS to Alpha OS required some different configuration and, of course, recompilation • Some of the CMSIM Fortran packages for a CMS sub-detector could not be ported, so they were dropped • Fortunately not important for the physics scope • All the jobs were submitted via a single SUN station • Limited resources for the input and output of the many jobs • A complicated procedure to make the Alpha executable available • A single point of failure for all the simulations • And all the participating persons had to have a local account and coordinate among themselves • The results of the simulation had to be available to all of CMS • Some GBytes of data over AFS? Not possible at that time: procedures were needed to get the data exported (FTS) and permanently stored (a trivial local tape system)

  11. The successful Challenge • Looking back at the (lost) Web pages: available in Bologna (Oct 1998) • 30 datasets, each of 4000 events, of single muon signal at 4 GeV • 30 datasets, each of 4000 events, of single muon signal at 25 GeV • 30 datasets, each of 4000 events, of single muon signal at 200 GeV • 30 datasets, each of 1000 events, of Higgs events at the planned masses • All the data were produced in a remarkably short time, given the resources that CMS had dedicated to the experiment in Bologna • As an example, a dataset was produced in 3 days over the Condor Pool, against 17 days on a dedicated machine! • Condor proved to be VERY robust against machine crashes and network interruptions • We experienced both network and machine crashes: in both cases we could recover the “running” jobs without human intervention (more or less …) • The checkpointing of Condor was a key feature in this scenario

  12. And we continued … (Bo+Pd) • October 1999 report • 15 days on 6 SUNs in the Condor Pool of Padova • The same effort on the Bologna Pool

  13. The machines (resources) used • Bologna Condor Pool • 19 Digital Alpha Unix 4.0x • 3 HP-UX 10.20 • 8 PC Linux 2.0.30 • We used these! • 2 SGI IRIX 6.2 or 6.3 • 1 SUN Solaris 2.5 • Located in two WAN-connected sites: RPC access • The INFN WAN Condor pool • 48 Digital Alpha (various Unix releases) • 14 HP-UX • 17 PC Linux • 2 SGI IRIX • 1 SUN Solaris

  14. Performance evaluations (CMSIM on Condor) • A Computer-Science thesis by Ilaria Colleoni (1998-99) (co-tutor: C. Grandi) • An attempt to numerically evaluate the running of CMSIM on Condor • With “real” simulation jobs of different computing loads • Single muons (4 GeV, 25 GeV, 200 GeV) • Higgs (-> 2 muons) at different masses • CPU time/job: from ~4 hours up to ~45 hours • Both in a local Condor Pool (Bologna) and in the INFN WAN Condor environment • Alpha platform used, but the submitting machine was a SUN • Checkpointing enabled (executable ~140 MB) • All I/O operations (when needed) via RPC

  15. Single muon events, Local Pool • Increasing computational load for the different momenta • 4 GeV, 25 GeV, 200 GeV • Comparison of the CPU time on Condor with locally-run identical simulations • Normalization of the CPU time on Condor to account for the different CPU power of the nodes used (+ some other considerations, like memory, etc.); a sketch of such a normalization follows this slide
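The exact normalization used in the thesis is not reported here. A plausible form, assuming node i contributed CPU time t_i and has relative speed s_i with respect to a reference machine of speed s_ref (e.g. from a CPU benchmark of the time), would be:

    T_{norm} = \sum_i t_i \cdot \frac{s_i}{s_{ref}}

i.e. each node’s CPU time is rescaled to the reference machine before summing, so runs on heterogeneous Alphas become comparable to a single-machine run.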

  16. Single muon events, WAN Pool • Same kind of jobs • Would have required ~a week to execute on the Local Pool • Got the results in ~3 days • Same normalization of the CPU time • Estimate of the WAN running load

  17. Some (historical) Issues • During that “first Data Challenge” we faced the “data” problem for the first time: • We were worried about the I/O of the jobs, over the LAN and WAN • And we discovered that the simulation jobs are so CPU intensive that it was a negligible problem, even with the bandwidths of the time • It might be a problem with current CPUs • But we had to cope with the disk space of the submitting machine • And then we had to find a way to make the produced data available for access (copies) • Nowadays we know that the real problem is not the distributed computing, but the distribution of data accesses • Another point was the predictability of the Condor System • I remember long discussions with Miron and Paolo (in his office), trying to understand whether Condor could be a solution for “Distributed Analysis” • Is it solved?

  18. Conclusion • CMS (Bologna) started at that time to use “distributed computing” to perform a “simulation challenge” • We found everything (mostly) ready, thanks to Condor • And it was a success! • CMS (at large) has gone through many “computing, data and analysis challenges” since then • Many of them were successful (and we hope we will be successful with the “real challenge” of “real data”) • However, from that exercise in 1998-99 we learnt a lot: • Distributed services, coordination, etc. • And, very importantly: the robustness of the underlying software! • That (modest) Data Challenge was the precursor of a Grid activity that, since then, has taken most of our time …

  19. First evaluations (Ilaria) • Running the production • Problems • People • Pools • Resources • Results (Bo+Pd) • Some issues • Historical (Miron & Paolo presentations) • Dependencies of the available Condor (CPU vs I/O) • Predictability of the results, or simulation vs analyses • Conclusions • First “distributed” CMS challenge • Grid precursor
