The First CMS Data Challenge (~1998/99) Using Condor
Disclaimer • Official presentations of those activities are no longer available… • Long time ago • The machines used have long since been decommissioned • Files lost on the decommissioned disks • Only fragments of information are still around • Mostly on “printed” slides and unlinked Web pages • And … my memory is not what it was at that time … • However I could find some information and, unsurprisingly, a number of “well known” names • Any list of them would certainly forget somebody, so I’ll avoid making one, • but PAOLO MAZZANTI deserves to be mentioned !
Environment and History • CMS Simulation program (CMSIM) using Geant3 (Fortran) • Different versions in rapid development • Objectivity was the CMS database choice at that time • First CMS reconstruction programs using C++ • SUN OS and HP Unix were the CMS basic operating systems • But Linux was rapidly growing • And we had a legacy of many Digital Alphas from LEP • Around the year 2000 INFN started to fund PC farms • In 1999 INFN launched the INFN-Grid project • The MONARC project was running at CERN • Then … we were flooded by GRID and Tiers
The Data Challenge start … • From the minutes of a meeting of 14 May 1998: • Need to generate 360,000 events of single muons (3 different momenta) and 30,000 events of Higgs -> 2 muons (3 different masses) • To be done over Condor, starting June 98 • CMSIM code has been ported from SUN to Alpha: needs to be “linked” with the Condor libraries • Local running of tests on the Higgs simulation gave ~1.4 min/event on both Alpha and SUN (with ~5 min of program initialization): >700 hours of CPU time for that sample of events • From another meeting of 13 May 1998: • Planning the National (INFN) Condor pool (~57 machines available) • CMSIM is one of the possible applications over WAN • GARFIELD (electric field simulation of the CMS Muon Detector DC cells) would run only locally (checkpoint file too big! Less than a typical mail attachment today …)
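The “>700 hours” figure follows directly from the per-event timing quoted in those minutes. A minimal back-of-the-envelope check in Python (the number of jobs used to account for the initialization overhead is an assumption for illustration, not the real job split):

    # Rough check of the ">700 hours of CPU time" estimate from the 14 May 1998 minutes.
    # Per-event and initialization times are taken from the slide; the number of jobs
    # is a hypothetical value, used only to show the size of the init overhead.
    EVENTS = 30_000          # Higgs -> 2 muons events (3 masses combined)
    MIN_PER_EVENT = 1.4      # ~1.4 min/event measured on both Alpha and SUN
    INIT_MIN_PER_JOB = 5.0   # ~5 min of program initialization per job
    N_JOBS = 90              # assumed number of Condor jobs (illustrative)

    cpu_minutes = EVENTS * MIN_PER_EVENT + N_JOBS * INIT_MIN_PER_JOB
    print(f"Total CPU time: {cpu_minutes / 60:.0f} hours")  # ~708 hours, i.e. ">700 hours"

The 360,000 single-muon events then add their own CPU cost on top of this.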
The challenge … before starting • We (CMS Bologna) were already using Condor!
3 Method Used to Produce the Drift Times • Full simulation on ALPHA machines; • Bologna Condor facility used; • Four tracks for each x, , B considered. • For each track we assumed the drift time is given by: • 50% one electron • 40% two electrons • 10% three electrons 9/12/1998 Report P. Capiluppi
5 • Drift Lines when Bw = 0.3 T (figure) 9/12/1998 Report P. Capiluppi
And we did start • A strange (to my mind) CMS Simulation statement (dated 20 Apr 1998) • The objective was to measure the throughput (in terms of CMS simulated events per hour) of our Condor Pool … At the beginning we had some compatibility problems between the CERN Library and the Condor libraries, but the Condor Team promptly solved them. This has to be stressed again: the support from the Condor team is very good! • Indeed we (CMS Italy) started in that period to support the Condor team (concretely, even if with a small contribution) • The number of machines running simulation under Condor ranged from 9 to 19! About 40% of the jobs were checkpointed (note that in the CMS case the checkpoint file was of the order of 66 MB!).
The real challenge (1/2) • CMSIM jobs were mostly CPU intensive • Very small I/O, compared to the CPU time required by the simulation of the (carefully chosen) number of events/job • Executable of the order of 140 MBytes • Some of the Simulation programs required access to input data (via RPC, not NFS, even in the “local” environment of Bologna). • Small in size in any case: ~130 KBytes/event read, the same amount written • Some of the jobs had a larger I/O: ~600 KBytes/event • Propagation of the random seed for the Simulation among the jobs • Required careful bookkeeping (hand-made at that time; a sketch is shown below) • Coordination between different activities over the Condor Pool(s) • We were not the only users, and some of the time constraints of the production required coordination • In particular, when going to the national WAN implementation, we faced large fluctuations in response time and in the consistency of the local machines • Well known nowadays in the Grid …
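A minimal sketch of the kind of seed bookkeeping described above, here automated in Python purely for illustration (the seed values, the stride and the file names are assumptions, not the actual 1998 procedure, which was kept by hand):

    # Illustrative seed bookkeeping: assign each simulation job its own random
    # seed and record which seed produced which dataset. The real 1998
    # bookkeeping was done by hand; seeds, stride and names here are invented.
    import csv

    BASE_SEED = 12345        # assumed starting seed
    SEED_STRIDE = 1000       # assumed spacing between job seeds (illustrative)
    N_JOBS = 30              # e.g. the 30 datasets of one single-muon sample

    with open("seed_bookkeeping.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["job_id", "dataset", "random_seed"])
        for job_id in range(N_JOBS):
            seed = BASE_SEED + job_id * SEED_STRIDE
            writer.writerow([job_id, f"single_mu_4GeV_{job_id:02d}", seed])

    # Each job is then started with its own seed (e.g. via its steering cards),
    # and the table records which seed produced which output dataset.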
The real challenge (2/2) • Moving from SUN OS to Alpha OS required a different configuration and, of course, recompilation • Some of the CMSIM Fortran packages for a CMS sub-detector could not be exported, so they were dropped • fortunately not important for the physics scope • All the jobs were submitted via a single SUN station • Limited resources for the input and output of the many jobs • Complicated procedure to make the Alpha executable available • Single point of failure for all the simulations • And all the participating people had to have a local account and coordinate among themselves • The results of the simulation had to be available to all of CMS • Some GBytes of data over AFS? Not possible at that time: procedures to get the data exported (FTS) and permanently stored (a trivial local tape system)
The successful Challenge • Looking back at the (lost) Web pages: available in Bologna (Oct 1998) • 30 datasets, each of 4000 events, of single muon signal at 4 GeV • 30 datasets, each of 4000 events, of single muon signal at 25 GeV • 30 datasets, each of 4000 events, of single muon signal at 200 GeV • 30 datasets, each of 1000 events, of Higgs events at the planned masses • All the data were produced in a remarkably short time, given the resources dedicated to the Experiment in Bologna • As an example, a dataset was produced in 3 days over the Condor Pool, against 17 days on a dedicated machine ! • Condor proved to be VERY robust against machine crashes and network interruptions • We experienced both network and machine crashes: in both cases we could recover the “running” jobs without human intervention (more or less …) • The checkpointing of Condor was a key feature in this scenario
And we continued … (Bo+Pd) • October 1999 Report • 15 days on 6 SUNs in the Condor Pool of Padova • Same effort on the Bologna Pool
The machines (resources) used • Bologna Condor Pool • 19 Digital Alpha Unix 4.0x • 3 HP-UX 10.20 • 8 PC Linux 2.0.3.0 • We used them! • 2 SGI IRIX 6.2 or 6.3 • 1 SUN Solaris 2.5 • Located in two WAN-connected sites: RPC access • The INFN WAN Condor pool • 48 Digital Alpha (various Unix releases) • 14 HP-UX • 17 PC Linux • 2 SGI IRIX • 1 SUN Solaris
Performance evaluations (CMSIM on Condor) • A Computer-Science Thesis by Ilaria Colleoni (1998-99) (Co-tutor: C. Grandi) • Attempt to numerically evaluate the running of CMSIM on Condor • With “real” simulation jobs of different computing loads • Single Muons (4 GeV, 25 GeV, 200 GeV) • Higgs (2 muons) of different masses • CPU times/job: from ~4 hours up to ~45 hours • Both in a Local Condor Pool (Bologna) and in the INFN WAN Condor environment • Alpha platform used, but the submitting machine was a SUN • Checkpointing enabled (executable ~140 MB) • All I/O operations (when needed) via RPC
Single muon events Local Pool • Increasing computational load for the different momenta • 4 GeV, 25 GeV, 200 GeV • Comparison of the CPU time on Condor with locally-run identical simulations • Normalization of the CPU time on Condor accounting for the different CPU power of the nodes used (+ some other considerations, like memory, etc.; a sketch of this normalization follows below)
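A minimal sketch of that kind of CPU-time normalization, rescaling each measured time by the relative power of the node it ran on (node names and power factors below are invented for illustration; the real values are those used in the thesis):

    # Illustrative normalization of Condor CPU times to a reference node.
    # Node names and relative power factors are invented; the idea is simply
    # measured_time_on_node * (node_power / reference_power).
    relative_power = {       # CPU power relative to the reference machine
        "alpha01": 1.00,
        "alpha02": 0.85,
        "linux01": 0.60,
    }

    jobs = [                 # (node the job ran on, measured CPU hours)
        ("alpha01", 4.2),
        ("alpha02", 5.1),
        ("linux01", 7.3),
    ]

    for node, cpu_hours in jobs:
        normalized = cpu_hours * relative_power[node]  # equivalent hours on the reference node
        print(f"{node}: {cpu_hours:.1f} h measured -> {normalized:.1f} h normalized")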
Single muon events WAN Pool • Same kind of jobs • Would have required about a week to execute on the Local Pool • Got the results in ~3 days • Same normalization of the CPU time • Estimate of the WAN running load
Some (historical) Issues • During that “first Data Challenge” we faced the “data” problem for the first time: • We were worried about the I/O of the jobs, over the LAN and the WAN • And we discovered that simulation jobs are so CPU intensive that it was a negligible problem, even with the bandwidths of that time • It might be a problem with current CPUs • But we had to cope with the disk space of the submitting machine • And then we had to find a way to make the produced data available for access (copies) • Nowadays we know that the real problem is not the distributed computing, but the distribution of data accesses • Another point was the predictability of the Condor System • I remember long discussions with Miron and Paolo (in his office), trying to understand if Condor could be a solution for “Distributed Analysis” • Is it solved?
Conclusion • CMS (Bologna) started at that time to use “distributed computing” to perform a “simulation challenge” • We found everything (mostly) ready, thanks to Condor • And it was a success ! • CMS (at large) has gone through many “computing, data and analysis challenges” since then • Many of them were successful (and we hope we will be successful with the “real challenge” of “real data”) • However, from that exercise in 1998-99 we learnt a lot: • Distributed Services, Coordination, etc. • And, very importantly: the robustness of the underlying software ! • That (modest) Data Challenge was the precursor of a Grid activity that, since then, has taken most of our time …
First evaluations (Ilaria) • Running the production • Problems • People • Pools • Resources • Results (Bo+Pd) • Some issues • Historical (Miron & Paolo presentations) • Dependencies of the available Condor (CPU vs I/O) • Predictability of the results, or simulation vs analyses • Conclusions • First “distributed” CMS challenge • Grid precursor