Large scale simulations on the EGEE Grid Jiri Chudoba, FZU and CESNET, Prague EGEE Summer School, Budapest, 13.7.2005 www.eu-egee.org EGEE is a project funded by the European Union under contract IST-2003-508833
Contents • Single job vs. big production • ATLAS Data Challenges • Job rates, distributions • Problems – global, local • Outlook
From 1 Job to Big Productions • basic job submission on the EGEE grid: • edg-job-submit my.jdl • edg-job-status -i myjobs • edg-job-get-output -i myjobs • ~100 jobs – still manageable with a couple of shell scripts with loops around the basic commands • ~1000 jobs – need a small database to be able to resubmit failed jobs • LCG experiments require much more – complicated systems built around the basic commands • ATLAS Requirements for 2008 • CPU: 51 MSI2K (about 34 000 of today's processors) • Storage: disk 25 000 TB, tape 17 000 TB
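For the "~100 jobs" scale, a minimal sketch of such a shell loop around the basic commands could look as follows; the file names are arbitrary and the exact options may differ between UI versions:

    #!/bin/sh
    # Submit 100 copies of the same JDL and keep the returned job IDs in one file.
    IDFILE=myjobs
    for i in `seq 1 100`; do
        edg-job-submit -o $IDFILE my.jdl
    done

    # Later: check the status of all jobs and retrieve the output of finished ones.
    edg-job-status --noint -i $IDFILE > status.log
    grep -c "Done (Success)" status.log              # rough count of successful jobs
    edg-job-get-output --noint -i $IDFILE --dir ./output

At the ~1000-job scale the job IDs and their states would instead go into a small database, so that failed jobs can be picked out and resubmitted.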
Example: ATLAS Data Processing • ATLAS (A Toroidal LHC Apparatus) experiment at the Large Hadron Collider at CERN will start taking data in 2007 • proton-proton collisions at 14 TeV center-of-mass energy with a collision rate of 10^9 Hz, storage rate 200 Hz • Total amount of "raw" data: 1 PB/year • Analysis Object Data target size: 100 kB per event • Each collaborator must have transparent access to data • ~2000 collaborators, ~150 institutes, 34 countries
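For orientation, these numbers hang together roughly as follows (assuming ~10^7 seconds of data taking per year, which is not stated on the slide): 200 Hz × 10^7 s ≈ 2×10^9 stored events per year, so 1 PB/year of raw data corresponds to about 0.5 MB per raw event, and the 100 kB AOD target gives roughly 200 TB/year of analysis data.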
ATLAS Productions • Large scale tests of readiness of many components • ATLAS Data Challenge 1 (2002) – production sites independent, no grid tools • drawbacks: long delays for some jobs although other sites were already idle; at least 1 ATLAS person needed per site • ATLAS DC2 (2004) – first big test of the ATLAS production system • upgrade of the LCG middleware during production, several versions of the ATLAS software • ATLAS Rome production – simulations for the ATLAS Physics Workshop in Rome (June 2005)
ATLAS Production System • ATLAS uses 3 grids: • LCG (= EGEE) • NorduGrid (evolved from EDG) • GRID3 (US) • plus the possibility of local batch submission → 4 interfaces • Input and output data must be accessible from all grids • jobs vary in their requirements on I/O, CPU time and RAM • ATLAS developed a custom system from several components
ATLAS Production System Overview (diagram): jobs defined in the production database (prodDB, with metadata in AMI) are taken by Windmill supervisors, which talk via SOAP/Jabber to the grid-specific executors – Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3) and an LSF executor for local batch; data are handled by the Don Quijote data management system (dms) on top of the per-grid RLS catalogues.
ATLAS Production Rates
DC2 on LCG
Production Rates • July – September 2004: DC2 GEANT4 simulation (long jobs), LCG/EGEE : GRID3 : NorduGrid = 40 : 30 : 30 • October – December 2004: DC2 digitization and reconstruction (short jobs) • February – May 2005: Rome production (mix of jobs), LCG/EGEE : GRID3 : NorduGrid = 65 : 24 : 11 • Condor-G improved the efficiency of LCG site usage
ATLAS Rome Production: countries (sites) • Austria (1) • Canada (3) • CERN (1) • Czech Republic (2) • Denmark (3) • France (4) • Germany (1+2) • Greece (1) • Hungary (1) • Italy (17) • Netherlands (2) • Norway (2) • Poland (1) • Portugal (1) • Russia (2) • Slovakia (1) • Slovenia (1) • Spain (3) • Sweden (5) • Switzerland (1+1) • Taiwan (1) • UK (8) • USA (19) 22 countries 84 sites 17 countries; 51 sites 7 countries; 14 sites
Rome Production Statistics • 73 data sets containing 6.1M events simulated and reconstructed (without pile-up) • Total simulated data: 8.5M events • Pile-up added later (done for 1.3M events, 50k of them reconstructed)
Rome Production: Number of Jobs (as of 17 June 2005)
Critical Services • RB, BDII, RLS, SE, UI, DQ, MyProxy, DB servers • just 1 combined machine with RB, UI, BDII and DQ at the beginning of DC2 • quickly evolved into a complex system of many services running on many machines • RB – several machines, 1 RB at CERN with a reliable disk array; submission can switch to another RB if one machine has problems (but some jobs may be lost) • UI – 1 or 2 machines per submitter, up to 1000 jobs handled by 1 instance of Lexor; big memory requirements • BDII – 2 machines behind a DNS alias • RLS – single point of failure, cannot be replaced by another machine; problems at the beginning of May stopped production and data replication for several days; the missing features are implemented in the new catalogues
Critical Servers (cont.) • DQ server for data management • production DB – Oracle server shared with other clients, service guaranteed by the CERN DB group • other DBs required by the ATLAS software: geometryDB, conditionsDB • MySQL servers: the hard limit of 1000 connections to a server was hit during the Rome production; replica servers were quickly introduced and the code was changed to select between them • SE – problems if input data are on an SE which is down
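A minimal sketch of the replica-selection idea, assuming a client-side list of MySQL replicas; the host names are placeholders and the real ATLAS code chose between replicas in its own way:

    #!/bin/sh
    # Pick the first geometryDB/conditionsDB replica that answers a ping.
    REPLICAS="dbr1.example.org dbr2.example.org dbr3.example.org"
    DBHOST=""
    for h in $REPLICAS; do
        if mysqladmin -h $h ping >/dev/null 2>&1; then
            DBHOST=$h
            break
        fi
    done
    [ -n "$DBHOST" ] || { echo "no DB replica reachable" >&2; exit 1; }
    echo "using DB replica $DBHOST"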
Monitoring • Production overview: via prodDB – ATLAS specific • Grid monitors: • GOC monitor: http://goc.grid-support.ac.uk/gridsite/monitoring/ • Site Functional Tests • BDII monitors (several): • http://hpv.farm.particle.cz/chudoba/atlas/lcg/bdii/html/latest.html • http://www.nordugrid.org/applications/prodsys/lcg2-atlas.php • http://www.mi.infn.it/~gnegri/rome_bdii.htm
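Besides the web monitors, the information a BDII publishes can be inspected directly with ldapsearch; a sketch, where the BDII host is a placeholder, port 2170 and the base DN are the usual LCG-2 settings, and the attributes follow the Glue schema:

    # List, per computing element, the number of free CPUs and waiting jobs.
    ldapsearch -x -LLL -H ldap://bdii.example.org:2170 -b "mds-vo-name=local,o=grid" \
        '(GlueCEStateFreeCPUs=*)' GlueCEUniqueID GlueCEStateFreeCPUs GlueCEStateWaitingJobs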
Monitoring (cont.) • GRIDICE • ATLAS VO view: http://gridice2.cnaf.infn.it:50080/gridice/vo/vo_details.php?voName=atlas • ATLAS production view: http://atlfarm003.mi.infn.it/~negri/cgi-bin/rome_jm.cgi
Black Holes • A wrongly configured site can attract many jobs, since it "processes" them in a very short time • Protection: • ATLAS sw not found – the job sends a mail and sleeps for 4 hours (a sketch of this guard follows below) • often caused by NFS server problems • automatic site exclusion from the BDII if it does not pass the SFT • SFT runs once a day – too long a delay for big production sites • sometimes the error is caused by an external site • possibility to include/exclude sites by hand (2 persons in different time zones to cover almost 24 hours/day) • since spring 2005, possibility to select which tests are critical • VO dependent selection! • statistics from the ATLAS prodDB
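A sketch of the "mail and sleep" guard as it could look at the start of the job wrapper; the software path, release number and mail address are illustrative, not the actual ATLAS transformation code:

    #!/bin/sh
    # If the ATLAS software area is not visible (often an NFS problem on the site),
    # do not fail fast: warn and sleep for 4 hours so the site cannot eat jobs quickly.
    RELEASE_DIR=$VO_ATLAS_SW_DIR/software/10.0.1    # illustrative release path
    if [ ! -d "$RELEASE_DIR" ]; then
        echo "ATLAS sw not found on `hostname`" | mail -s "possible black hole" shift@example.org
        sleep 14400
        exit 1
    fi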
Jobs Validation • output files are marked by a suffix with the attempt number • only when the job is validated (by parsing its log file) are the output files renamed and the job marked as validated • renaming the physical files would be difficult (they may already be on tape!), so only the entry in the catalogue is changed (see the sketch below) • disk and tape space gets polluted with files from failed attempts; no systematic clean-up done
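A sketch of such a catalogue-only "rename" using the lcg-utils alias commands against the RLS; the logical file names and the validation test are illustrative, and the real production system did this through its own tools:

    #!/bin/sh
    LFN_TMP=lfn:rome.004100.recon.root.2        # name written by attempt number 2 (illustrative)
    LFN_OK=lfn:rome.004100.recon.root           # final name after validation

    if grep -q "successful run" athena.log; then     # illustrative log-file check
        GUID=`lcg-lg --vo atlas $LFN_TMP`            # GUID of the physical file
        lcg-aa --vo atlas $GUID $LFN_OK              # register the validated name
        lcg-ra --vo atlas $GUID $LFN_TMP             # drop the attempt-suffixed alias
    fi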
Monitor of Blocked CPUs • started by running qstat on the local farm • later extended to more sites using globus-job-run (see the sketch below) • only some sites scanned, LSF not supported • http://www-hep2.fzu.cz/~chudoba/atlas/lcg/last-bad.html
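A sketch of the globus-job-run based scan; the CE host names are placeholders, and it only works where the batch system answers qstat (hence "LSF not supported"):

    #!/bin/sh
    # Run qstat remotely on each CE head node; jobs with long wall time but almost
    # no CPU time used are candidates for blocked CPUs.
    for CE in ce1.example.cz ce2.example.it; do
        echo "=== $CE ==="
        globus-job-run $CE /bin/sh -c "qstat" 2>/dev/null
    done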
Jobs Distribution • some sites had many jobs in their queues while others had free resources • different ranking expressions were tried, based on the estimated response time (ERT), number of waiting jobs, ... (see the JDL sketch below) • the local sites of the submitters were better used • problems if a site publishes incorrect information (local MDS or BDII stuck or died) • per-VO information is missing; the new Glue schema should help • Condor-G bypassed the RB – better distribution • LHCb approach: submit many jobs as placeholders, the actual content of a job is defined only when the job starts
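One concrete form such ranking experiments take is the Rank expression in the JDL; a sketch that prefers sites with free CPUs and short queues over the default ERT-based ranking (the attribute names are from the Glue schema, the expression itself is illustrative and not the one used in the production):

    #!/bin/sh
    cat > ranked.jdl <<'EOF'
    Executable    = "/bin/hostname";
    StdOutput     = "out.log";
    OutputSandbox = {"out.log"};
    // default ranking is roughly: Rank = -other.GlueCEStateEstimatedResponseTime;
    Rank = other.GlueCEStateFreeCPUs - other.GlueCEStateWaitingJobs;
    EOF
    edg-job-list-match ranked.jdl    # list the computing elements this JDL matches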
Local Problems (experience from 2 CZ sites) • load on the local SE when many jobs start at once • no crash, but an enormous increase of wall-clock time per job • dCache installed on a new machine, under tests now • stability of the NFS server • solved by several kernel upgrades • disk array crash • a backplane problem led to data loss • RLS entries for the lost files were removed • HyperThreading • not used on nodes for ATLAS production • some performance increase in simulation tests, but it introduces more dependence on the other jobs running on the same node
Local Problems (cont.) • job distribution based on the default expression for the ERT • no new jobs for the bigger site if some jobs were already running, although some CPUs were still free • OK after a change in the evaluation of the ERT • remote SE overloaded or down • input data on an SE with problems – the lcg-cp command blocks; a timeout was introduced later in Lexor (see the sketch below) • predefined SE for output not available – lcg-cr blocks • misbehaved jobs had an infinite loop writing to a log file and took all available space on the local disk • this crashed the other job on the WN too • the $EDG_JOB_SCRATCH definition was missing on a reinstalled WN, so jobs filled the shared /home partition
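A sketch of the two workarounds mentioned above, as they might look in a job wrapper: a watchdog around lcg-cp (the command itself could hang when the SE was down), and a fallback when $EDG_JOB_SCRATCH is not defined; the timeout, LFN and fallback directory are illustrative, not what Lexor actually used:

    #!/bin/sh
    # Use the per-job scratch area if defined, otherwise a node-local directory,
    # so a misconfigured WN does not let the job fill the shared /home partition.
    WORKDIR=${EDG_JOB_SCRATCH:-/tmp/atlasjob.$$}
    mkdir -p $WORKDIR && cd $WORKDIR || exit 1

    # Copy one input file with a 30 minute watchdog around lcg-cp.
    lcg-cp --vo atlas lfn:dc2.minbias.0001.root file:$PWD/minbias.0001.root &
    CP_PID=$!
    ( sleep 1800; kill $CP_PID 2>/dev/null ) &
    WATCHDOG=$!
    wait $CP_PID; RC=$?
    kill $WATCHDOG 2>/dev/null
    [ $RC -eq 0 ] || { echo "input copy failed or timed out" >&2; exit 1; }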
Special Jobs (DC2) • pile-up jobs combine a signal event with several background events • required > 1 GB RAM per job and 700 GB of input data with background events • the data were copied to selected sites in advance using DQ • a special "InputHint" was used for these jobs • each job used 8 input files with minimum-bias events (~250 MB each), downloaded from the close SE, and 1 input file with the signal • selected sites allowed ATLAS jobs only on machines with enough RAM (a JDL sketch follows below)
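A sketch of how the memory constraint can be expressed at submission time via the Glue schema; the file names and values are illustrative, and the ATLAS-specific "InputHint" mechanism is not reproduced here:

    #!/bin/sh
    cat > pileup.jdl <<'EOF'
    Executable    = "run_pileup.sh";
    InputSandbox  = {"run_pileup.sh"};
    StdOutput     = "pileup.log";
    OutputSandbox = {"pileup.log"};
    // only worker nodes publishing at least 1 GB of RAM (value is in MB)
    Requirements = other.GlueHostMainMemoryRAMSize >= 1024;
    EOF
    edg-job-submit -o pileupjobs pileup.jdl

The minimum-bias files themselves would be fetched from the close SE with lcg-cp, along the lines of the earlier copy sketch.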
Pileup jobs
Conclusions • ATLAS DC2 and the Rome production were done on 3 grids; LCG/EGEE had the biggest share • a rate of several thousand jobs/day was achieved • the middleware is not yet mature enough; many problems were met, but they were solved by workarounds and ad hoc fixes • productions require a lot of manpower • mostly covered by ATLAS • good support from the EIS team • some services managed by other CERN groups (DB, RLS, BDII, CASTOR, ...)
Outlook • only large scale tests such as the Data Challenges can find certain problems and bottlenecks in the system – they must continue • Service Challenges • several phases, starting with tests of basic services and adding more • phase 3, during the 2nd half of 2005, will include the LHC experiments • gLite components should solve some problems (WMS, new data catalogues) • ATLAS DC3 (Computing System Commissioning) in 2006 • LHC data taking starts already in 2007 – we must have a reliable, well tested system by then
Thanks to • my colleagues from the ATLAS production team for the good cooperation leading to such good results • special thanks to Gilbert Poulard for the charts • the organizers of this EGEE school