History of the National INFN Pool. P. Mazzanti, F. Semeria, INFN – Bologna (Italy). European Condor Week 2006, Milan, 29-Jun-2006.
Our first experience (1997) • Monte Carlo event generation. • WA92 experiment at CERN: Beauty search in fixed target experiment. • Working conditions: a dedicated farm of 3 Alpha VMS and 6 DecStation Ultrix. • Results: 22000 events/day (0 dead time).
Then Condor came... • Production Condor Pool: • 23 DEC Alpha (18 Bologna, 2 CNAF (Bologna), 2 Turin, 1 Rome) • 4 HP • 6 DecStation Ultrix • 5 Pentium Linux
Then Condor came… (cont.) The throughput of the 23 Alpha subset of the pool: 75000 to 100000 events/day plus 15000 events/day with the pool in Madison. We got x5 the production at zero cost!
Give me a calculator… • At INFN: 1000 PCs used 8 hours/day by their owners (16 hours/day idle). • 1000 × 16 = 16000 hours ≈ 1.8 years. • 1.8 CPU-years wasted each day!
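The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name and the 365-day year are assumptions made here, not part of the original talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (assumption)


def idle_cpu_years_per_day(n_pcs: int, idle_hours_per_day: float) -> float:
    """Return the idle CPU time accumulated in one day, expressed in CPU-years."""
    idle_hours = n_pcs * idle_hours_per_day
    return idle_hours / HOURS_PER_YEAR


# Figures from the slide: 1000 PCs, each idle 16 hours/day.
wasted = idle_cpu_years_per_day(1000, 16)
print(f"{wasted:.1f} CPU-years idle per day")  # 1.8 CPU-years idle per day
```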
The ‘Condor on WAN’ INFN Project • Approved by the Computing Committee in February 1998. • Goal: install Condor on the INFN WAN and evaluate its effectiveness for INFN's computational needs. • 30 people involved.
The Condor INFN Project (cont.) The INFN Structure • 27 sites. • More than 10 experiments in nuclear and sub-nuclear physics. • Hundreds of researchers involved. • Distributed and heterogeneous resources (a good framework for a grid…).
The Condor INFN Project (cont.) The first example in Europe of a national distributed computing environment
Collaboration • INFN and Computer Science Dept. of the University of Wisconsin, Madison • Coordinators for the project: • for Madison: Miron Livny • for INFN: Paolo Mazzanti.
General usage policy Each group of people must be able to maintain full control over their own machines.
General usage policy (cont.) A Condor job sent from a machine of a group must have the maximum access priority on the machines of the same group.
Subpools • Rank expression: a resource owner can give priority to requests from selected groups:
    GROUP_ID = "My_Group"
    RANK = target.GROUP_ID == "My_Group"
• From the group's point of view, its machines form a pool of their own: a subpool.
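The effect of such a rank expression can be illustrated with a toy model in Python. This is not Condor code or a Condor API: the `Job` and `Machine` classes, `pick_job`, and the group names are hypothetical, and the sketch only mimics the idea that a machine prefers jobs whose GROUP_ID matches its own group.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    group_id: str  # stands in for the GROUP_ID attribute advertised by the job


@dataclass
class Machine:
    name: str
    preferred_group: str  # stands in for "My_Group" in the RANK expression

    def rank(self, job: Job) -> float:
        # Mimics RANK = target.GROUP_ID == "My_Group": 1.0 when true, else 0.0.
        return 1.0 if job.group_id == self.preferred_group else 0.0


def pick_job(machine: Machine, queue: List[Job]) -> Optional[Job]:
    """Pick the highest-ranked job; ties are broken by queue order."""
    # max() returns the first maximal element, so earlier jobs win ties.
    return max(queue, key=machine.rank, default=None)


# Hypothetical queue and machine, for illustration only.
queue = [Job("cms_sim", "CMS"), Job("clue_mc", "CLUE"), Job("qcd_int", "Theory")]
machine = Machine("alpha01", preferred_group="CLUE")
print(pick_job(machine, queue).name)  # clue_mc
```

A machine whose group has no job in the queue still serves the others in order, which is what lets each group see its machines as a subpool while the rest of the pool can use them when they are free.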
Checkpoint Server Domains • The network could be a concern in a computing environment distributed over a WAN. • Policy: a job should run within its own checkpoint server domain whenever local resources are available.
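The policy can be sketched as a simple placement rule. The Python below is a toy model rather than Condor configuration: `choose_machine`, the machine names, and the fallback to a remote domain when no local machine is idle are assumptions made for illustration.

```python
from typing import Dict, Optional


def choose_machine(job_domain: str, idle_machines: Dict[str, str]) -> Optional[str]:
    """Choose an idle machine for a job.

    idle_machines maps machine name -> checkpoint server domain.
    Machines in the job's own domain are preferred, so checkpoints
    stay close to their server and do not cross the WAN.
    """
    for name, domain in idle_machines.items():
        if domain == job_domain:
            return name
    # Assumed fallback: no local machine is idle, so take any idle machine.
    return next(iter(idle_machines), None)


# Hypothetical machines, for illustration only.
idle = {"to1": "Torino", "bo1": "Bologna", "bo2": "Bologna"}
print(choose_machine("Bologna", idle))  # bo1
print(choose_machine("Pavia", idle))    # to1 (no idle machine in Pavia)
```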
The INFN-WAN Pool (2002) • ALPHA/OSF1: 107 • INTEL/LINUX: 122 • SUN/SOLARIS: 6 • INTEL/WNT: 1 • Total: 235
Applications • Simulation of the CMS detector. • MC event production for CMS. • Simulation of Cherenkov light in the atmosphere (CLUE). • MC integration in perturbative QCD. • Dynamic chaotic systems. • Extra-solar planet orbits. • Stochastic differential equations. • Maxwell equations.
Simulation of Cherenkov light in the atmosphere (CLUE) • Without Condor (1 Alpha): 20000 events/week. • With Condor: 350000 events in 2 weeks (gain: x9).
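The quoted gain follows from comparing weekly event rates. A minimal check, using a hypothetical helper named `speedup`:

```python
def speedup(baseline_per_week: float, events: float, weeks: float) -> float:
    """Ratio of the new weekly event rate to the baseline weekly rate."""
    return (events / weeks) / baseline_per_week


# Figures from the slide: 20000 events/week on one Alpha,
# 350000 events in 2 weeks with Condor.
gain = speedup(20000, 350000, 2)
print(gain)  # 8.75, which the slide rounds to x9
```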
Dynamic chaotic systems • Computations based on complex matrices (multiplication, inversion, determinants, etc.). • Very CPU-bound, with little output and no input. • Gains with Condor compared to the single Alpha used before: x3.5 to x10.
MC integration in perturbative QCD • CPU-bound • No input, very small output • Gains with Condor: x10.
Maxwell Equations • 201 jobs, each with a different value of an input parameter. • Output: 401 numbers/job. • Gains with Condor compared to the only Alpha available: x11.
The Pool Today • 8 checkpoint servers: Bologna, Milano, Torino, Pavia, Trieste, Padova, LNGS, Napoli. • 270 CPUs. • 45.5 CPU-years used from January to June 25th -> 91 CPU-years/year.
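The yearly figure can be reproduced by extrapolating the usage so far. The sketch below assumes the period runs from 1 January to 25 June 2006 (the year of the talk) and shows two estimates: treating the period as half a year, which gives the slide's 91, and a day-by-day pro-rata estimate, which comes out slightly higher.

```python
from datetime import date

used_cpu_years = 45.5  # figure from the slide

# Assumed period: 1 January to 25 June 2006, both days included.
start, end = date(2006, 1, 1), date(2006, 6, 25)
days_elapsed = (end - start).days + 1

# Treating the period as half a year gives the slide's figure.
half_year_estimate = used_cpu_years * 2

# Scaling by the actual number of days elapsed.
pro_rata_estimate = used_cpu_years * 365 / days_elapsed

print(days_elapsed)                 # 176
print(half_year_estimate)           # 91.0
print(round(pro_rata_estimate, 1))  # 94.4
```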
Why does the pool not grow? Why is Condor not installed on all PCs? • Is it difficult to install? • Is it difficult to use? • Is it difficult to maintain? • Do we prefer to buy new machines?
An automatic installation tool • Three types of installation: • Server: binaries and libraries only • Client: configuration files only • Full: client + server • RPM files are built. • Web interface: http://www.bo.infn.it/calcolo/condor/infn-installation-tool-6.6.7.html
Server installation • Only binaries and libraries. • Usually done on NFS or AFS servers, which export bin and lib to the clients.
Client installation • Installs configuration files using data specified through the web interface. • Creates startup and shutdown scripts for the Condor daemons. • Adds the path to the binaries (from the ‘server’ installation) to the user's PATH.
Full installation • Client + server. • The whole Condor distribution and the configuration files are on the same machine. • NFS and AFS are not required.
Conclusion • The INFN Condor Pool was the first ‘pre-grid’ wide-area distributed computing system. • It is still used by people outside ‘big science’.
Conclusion (cont.) BUT: why not Condor on each PC? We did not find the answer in 10 years…