ALICE COMPUTING IN 2004: PHYSICS DATA CHALLENGE III
Massimo Masera (masera@to.infn.it)
Commissione Scientifica Nazionale I, 22 June 2004
Outline
• The framework: status of AliRoot for the Physics Data Challenge III (PDC III)
• The production environment: AliEn
• AliEn as a meta-Grid: the use of LCG and Grid.It in the PDC III
• Current status of the PDC III
  • Phase I is finished
  • Phase II is starting
  • Phase III
• Conclusions
ALICE Physics Data Challenges
(table of past data challenges, including the ALICE Physics Data Challenge 2003)
AliRoot
AliRoot layout (diagram): the STEER module with the AliSimulation, AliReconstruction and AliAnalysis drivers and the ESD; the Virtual MC interface to the G3, G4 and FLUKA transport codes; event generators and related packages (HIJING, PYTHIA6, DPMJET, ISAJET, MEVSIM, PDF, EVGEN); detector modules (ITS, TPC, TRD, TOF, PHOS, EMCAL, PMD, MUON, RICH, ZDC, FMD, START, CRT, STRUCT); analysis packages (HBTAN, HBTP, RALICE); everything built on ROOT, with AliEn as the production environment and the interface with the world.
Simulation and Reconstruction in AliRoot
• In the present data challenge, simulation and reconstruction are steered by two classes: AliSimulation and AliReconstruction
• Simple user interface:
  AliSimulation sim;
  sim.Run();
  AliReconstruction rec;
  rec.Run();
• Goal: run the standard simulation and reconstruction for all detectors
• This is the simplest example:
  • the number of events and the config file can be set (see the sketch below)
  • merging and region of interest are also implemented
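A minimal sketch of a steering macro along these lines, with the number of events and the config file set explicitly; the constructor argument and the event count passed to Run() are assumptions based on the slide, not verified against the AliRoot version used in the PDC.

  // runSimRec.C -- hedged sketch of a simulation + reconstruction macro
  #include <Rtypes.h>
  #include "AliSimulation.h"
  #include "AliReconstruction.h"

  void runSimRec(Int_t nEvents = 10)
  {
    // Simulation: passing the config file name to the constructor is an
    // assumed convention (Config.C is the usual AliRoot default).
    AliSimulation sim("Config.C");
    sim.Run(nEvents);          // number of events, as mentioned on the slide

    // Reconstruction over the events just produced
    AliReconstruction rec;
    rec.Run();
  }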
Event Summary Data (ESD)
• The AliESD class is essentially a container for data: it provides no analysis functions
• It is the result of the reconstruction, carried out systematically via batch/Grid jobs
• It is meant to be the starting point for the analysis
• At reconstruction time, it can be used to exchange information among the different reconstruction steps
Event Summary Data (ESD)
• The following detectors currently contribute to the ESD: ITS, TPC, TRD, TOF, PHOS and MUON
• (diagram: the TPC, TRD and ITS trackers, the ITS stand-alone tracker, the ITS vertexers, TOF, PHOS and MUON all feed the ESD, which is written to a file)
• The ESD structure is sufficient for the "kinds of physics" to be tried in DC2004: strangeness, charm, HBT, jets
• All the objects stored in the ESD are accessed via abstract interfaces, i.e. they do not depend on sub-detector code (a reading sketch follows below)
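As an illustration of the ESD as a plain data container, a hedged sketch of a macro that loops over reconstructed tracks; the file, tree and branch names (AliESDs.root, esdTree, ESD) and the accessors (GetNumberOfTracks, GetTrack, GetP) are assumptions about the AliRoot conventions of the time, not taken from the slide.

  // readESD.C -- hedged sketch of reading the ESD as a plain data container
  #include <TFile.h>
  #include <TTree.h>
  #include "AliESD.h"
  #include "AliESDtrack.h"

  void readESD(const char* fileName = "AliESDs.root")     // assumed file name
  {
    TFile* file = TFile::Open(fileName);
    TTree* tree = (TTree*)file->Get("esdTree");           // assumed tree name
    AliESD* esd = 0;
    tree->SetBranchAddress("ESD", &esd);                  // assumed branch name

    for (Long64_t iEvent = 0; iEvent < tree->GetEntries(); iEvent++) {
      tree->GetEntry(iEvent);
      // pure getters: no analysis code lives inside AliESD itself
      for (Int_t iTrack = 0; iTrack < esd->GetNumberOfTracks(); iTrack++) {
        AliESDtrack* track = esd->GetTrack(iTrack);
        printf("event %lld, track %d: P = %f GeV/c\n", iEvent, iTrack, track->GetP());
      }
    }
    file->Close();
  }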
Example: primary vertex (a sketch of this factory pattern follows below)
• STEER directory (interfaces and sub-detector independent code): AliESD, AliReconstruction, AliESDVertex, the abstract AliVertexer and the AliReconstructor::CreateVertexer factory method
• ITS directory: AliITSReconstructor::CreateVertexer returns one of the concrete vertexers:
  • AliITSVertexerIons: 3-D information for central Pb-Pb events
  • AliITSVertexerZ: p-p and peripheral events
  • AliITSVertexerFast: just a gaussian smearing of the generated vertex
  • AliITSVertexerTracks: high-precision vertexer using reconstructed tracks (pp D0)
• (the diagram flags part of this as NEW code)
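A hedged sketch of the factory pattern the slide describes: the detector-independent steering code asks an abstract AliReconstructor for a vertexer and only ever talks to the AliVertexer interface. The class and method names follow the slide; the signatures and the FindVertexForCurrentEvent call are illustrative assumptions, not the exact AliRoot declarations.

  // vertexerFactory.cxx -- illustrative sketch of the CreateVertexer factory
  #include "AliReconstructor.h"      // declares the virtual CreateVertexer (per the slide)
  #include "AliITSReconstructor.h"   // ITS implementation of the factory method
  #include "AliVertexer.h"           // abstract vertexer interface
  #include "AliESDVertex.h"

  void runVertexing()
  {
    // STEER-style code holds only abstract pointers: no ITS types appear here.
    AliReconstructor* reconstructor = new AliITSReconstructor();
    AliVertexer* vertexer = reconstructor->CreateVertexer();        // assumed signature

    // Which concrete class comes back (VertexerIons, VertexerZ, ...) is an
    // ITS-side decision; the caller only sees the AliVertexer interface.
    AliESDVertex* vertex = vertexer->FindVertexForCurrentEvent(0);  // assumed call

    delete vertex;
    delete vertexer;
    delete reconstructor;
  }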
AliRoot: present situation
• Major changes in the last year:
  • new multi-file I/O in full production
  • new coordinate system
  • new simulation and reconstruction "drivers" (the AliSimulation and AliReconstruction classes)
  • first version of the ESD and of the analysis framework
  • improvements in reconstruction and simulation
• However, the system is still evolving:
  • ESD: the design philosophy is still evolving
  • introduction of FLUKA and of the new geometrical modeller
  • development of the analysis framework
  • raw data for all the detectors (already available for ITS and TPC)
  • introduction of the condition database infrastructure
AliEn and the Grid ALICE CALCOLO 2004
The ALICE Production Environment: AliEn
(architecture diagram: AliEn core components and services built largely from external software; a MySQL RDBMS behind a database proxy and the DBI/DBD/ADBI layers; the file and metadata catalogue; LDAP-based configuration manager and authentication; Resource Broker, Computing Elements, Storage Elements, Package Manager and Logger; a low-level Perl core and Perl modules; high-level interfaces for user applications via a C/C++/Perl API, CLI, GUI and web portal, with SOAP/XML between services; V.O. packages and commands on top)
• Standards are now emerging for the basic building blocks of a Grid
• There are millions of lines of code in the open-source domain dealing with these issues
• Why not use them to build the minimal Grid that does the job?
  • fast development of a prototype, no problem in exploring new roads, restarting from scratch, etc.
  • hundreds of users and developers
  • immediate adoption of emerging standards
• An example: AliEn by ALICE (5% of the code developed in-house, 95% imported)
AliEn Timeline (2001-2005)
• Start (2001), followed by the first production (distributed simulation): functionality + simulation
• Physics Performance Report (mixing & reconstruction): interoperability + reconstruction
• 10% Data Challenge (analysis, 2004-2005): performance, scalability, standards + analysis
From AliEn to a Meta-Grid
• The Workload Management follows a "pull" model: a server holds a master queue of jobs, and it is up to the CE providing the CPU cycles to contact it and ask for a job (a conceptual sketch follows below)
• The system is integrated with a large-scale job submission and bookkeeping system tuned for Data Challenge productions, with job splitting, statistics, pie charts, automatic resubmission, etc.
• The job monitoring model requires no "sensors" installed on the WN: it is the job wrapper itself that talks to the server
• Several Grid infrastructures are (becoming) available: LCG, Grid.It, possibly others
  • lots of resources but, in principle, different middlewares
• The pull model is well suited to implementing higher-level submission systems, since it requires no knowledge of the periphery, which may be very complex
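A purely conceptual sketch of the pull model described above, not actual AliEn code (which is implemented as Perl SOAP services): the agent running at a CE repeatedly asks the central task queue for work and runs whatever it gets. The fetchJob and executeJob helpers are hypothetical placeholders.

  // pull_model.cxx -- conceptual sketch of a pull-model CE agent (not AliEn code)
  #include <string>
  #include <thread>
  #include <chrono>
  #include <cstdio>

  // Hypothetical placeholder: in a real system this would be a SOAP/HTTP call
  // to the central master queue, returning an empty string if no job matches.
  std::string fetchJob() { return ""; }

  // Hypothetical placeholder: run the job wrapper, which also reports
  // monitoring information back to the server (no sensors on the WN).
  void executeJob(const std::string& jobSpec) { std::printf("running %s\n", jobSpec.c_str()); }

  int main()
  {
    while (true) {
      std::string job = fetchJob();    // the CE pulls work; the server never pushes
      if (job.empty()) {
        // nothing to do: back off and ask again later
        std::this_thread::sleep_for(std::chrono::seconds(60));
        continue;
      }
      executeJob(job);
    }
  }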
From AliEn to a Meta-Grid: design strategy
• Use AliEn as a general front-end
  • owned and shared resources are exploited transparently
• Minimize the points of contact between the systems
  • no need to re-implement services, etc.
  • no special services required to run on remote CEs/WNs
• Make full use of the services the Grids provide: data catalogues, scheduling, monitoring...
  • let the Grids do their job (they should know how)
• Use high-level tools and APIs to access Grid resources
  • their developers put a lot of abstraction effort into hiding the complexity and shielding the user from implementation changes
Available resources for PDC III
• Several AliEn "native" sites (some rather large): Bari, CERN, CNAF, Catania, Cyfronet, FZK, JINR, LBL, Lyon, OSC, Prague, Torino
• LCG-2 core sites: CERN, CNAF, FZK, NIKHEF, RAL, Taiwan (more than 1000 CPUs)
  • at CNAF and Catania, the same resources can be accessed either through LCG/Grid.It or through AliEn
• Grid.It sites: LNL.INFN, PD.INFN and several smaller ones (about 400 CPUs, not including CNAF)
• Implementation: LCG resources are managed through a "gateway", an AliEn client (CE+SE) sitting on top of an LCG User Interface
  • the whole of LCG computing is seen as a single, large AliEn CE associated with a single, large SE
Software installation
• Both AliEn and AliRoot are installed via LCG jobs (installAliEn.sh/jdl, installAlice.sh/jdl)
  • the jobs perform some checks, download the tarballs, uncompress them, build the environment script and publish the relevant tags
• A single command gets the list of available sites, sends the jobs everywhere and waits for completion; a full update on LCG-2 + Grid.It (16 sites) takes about 30 minutes
• Manual intervention is still needed at a few sites (e.g. CERN/LSF)
• Ready for integration into the AliEn automatic installation system
• Misconfiguration of the experiment-software shared area caused most of the trouble in the beginning
(diagram: installation jobs dispatched from an LCG UI to NIKHEF, Taiwan, RAL, CNAF, TO.INFN, …)
AliEn, Genius & EDG/LCG
(diagram: the user submits jobs to the AliEn server; AliEn CEs/SEs run them natively, while an AliEn CE sitting on an LCG UI forwards jobs to the LCG RB and the LCG CEs/SEs; the LCG LFN registered in the LCG catalogue is used as the AliEn PFN in the AliEn catalogue)
• LCG-2 is one CE of AliEn, which integrates LCG and non-LCG resources
  • if LCG-2 can run a large number of jobs, it will be used heavily
  • if LCG-2 cannot, AliEn selects other resources and LCG-2 will be used less
Physics Data Challenge III
PDC III schema
(diagram: production of RAW under AliEn job control; shipment of RAW to CERN (data transfer); reconstruction of RAW in all Tier-1s and Tier-2s; analysis)
Phases of the ALICE Physics Data Challenge 2004
• Phase 1: production of underlying events using heavy-ion MC generators
  • status: completed
• Phase 2: mixing of signal events into the underlying events
  • status: starting
• Phase 3: analysis of signal + underlying events
  • goal: test the ALICE data analysis model
  • status: will begin in about two months
(diagram: a signal-free underlying event is merged with a signal event to produce a mixed event; a sketch of the merging call follows below)
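A hedged sketch of how this mixing might be steered from AliSimulation, using the merging support mentioned earlier; the MergeWith call, its arguments and the file names are assumptions about the interface, not the actual PDC configuration.

  // mergeSignal.C -- hedged sketch of signal / underlying-event merging
  #include <Rtypes.h>
  #include "AliSimulation.h"

  void mergeSignal(Int_t nSignalEvents = 10)
  {
    // Simulate the signal events with a signal-specific configuration
    // (the config file name is an assumed placeholder).
    AliSimulation sim("ConfigSignal.C");

    // Merge each signal event with a previously produced underlying event;
    // the file name and the signals-per-background argument are assumptions.
    sim.MergeWith("../underlying/galice.root", 1);

    sim.Run(nSignalEvents);
  }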
Statistics for Phase 1 of the ALICE PDC 2004
• Number of jobs:
  • Central 1 (long, 12 hours): 20 k
  • Peripheral 1 (medium, 6 hours): 20 k
  • Peripheral 2 to 5 (short, 1 to 3 hours): 16 k
• Number of files:
  • AliEn file catalogue: 3.8 million (no degradation in performance observed)
  • CERN Castor: 1.3 million
• File size:
  • total: 26 TB
• CPU work:
  • total: 285 MSI-2k hours
  • LCG: 67 MSI-2k hours
Phase I: one Pb-Pb event corresponds to 36 files in the AliEn catalogue (screenshot of the catalogue)
Phase 1 resource statistics
• 27 production centres, 12 major producers, no single site dominating the production
• the individual contribution of the sites not displayed is at the level of Bari's
• see slide 28 for a comparison between AliEn and LCG sites
• Italian contribution: more than 40%
Phase 1 CPU profile
(plot annotations: "1,000,000 files", "Castor problem")
• aiming for sustained running (as allowed by resource availability): average 450 CPUs, maximum 1450 CPUs (the peak does not appear in the plot because of the binning)
Problems with Phase I
• Two months of delay, mainly due to the delayed release of LCG-2
• No SE in LCG-2 + poor storage availability at the LCG sites
  • natural solution for PDC Phase I: all files migrated to Castor at CERN
• Castor-related problems:
  • initial lack of storage with respect to the requests (30 TB… not yet available)
  • a 300,000-file limit above which the system performance dropped
  • server reinstallation in March
• LCG: most of the problems are related to the configuration of the sites
  • software management tools are still rudimentary
  • large sites often have tighter security restrictions and other idiosyncrasies
  • investigating and fixing problems is hard and time-consuming
• The most difficult part of the management is monitoring LCG through a "keyhole"
  • only integrated information is available natively
  • MonALISA for AliEn, GridICE for LCG
LCG / AliEn: situation at the end of round 1
• Statistics after round 1 (ended April 4): job distribution (LCG 46%)
• Alice::CERN::LCG is the interface to LCG-2; Alice::Torino::LCG is the interface to Grid.It
• In the 2nd round AliEn was used more because of the lack of storage (continuous stop/start of the production)
Phase 2 layout
(diagram: the user submits jobs to the AliEn server; they run on AliEn CEs/SEs and, through an AliEn CE on an LCG UI and the LCG RB, on LCG CEs/SEs; output is copied and registered with edg-copy-and-register and stored in CERN Castor, and the LCG LFN, of the form lcg://host/<GUID>, is used as the AliEn PFN)
• Phase 2 is about to start
• Mixing of the underlying events with signal events (jets, muons, J/ψ)
• We plan to fully use the LCG data management tools; we may have storage problems at the local SEs
Problems with Phase II
• Phase II will generate a lot (about 1 M) of rather small (~7 MB) files
  • we would need an extra stager at CERN, but this is not available at the moment
  • we could use some TB of disk space, but this too is not available
  • we are testing an AliEn plug-in that uses tar to bunch the small files together (a sketch of the idea follows below)
  • the space available on the local LCG storage elements seems very low… we will see
• Preparation of the LCG-2 JDL is more complicated, because of the use of the data management features
• This has introduced a two-week delay; we hope to start soon
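A rough sketch of the idea behind such a plug-in, written here as a simple ROOT macro that shells out to tar; this is only an illustration of the approach, not the actual AliEn plug-in, and the archive and directory names are made-up placeholders.

  // bunchFiles.C -- illustrative sketch: bundle many small output files into one tar archive
  #include <TSystem.h>
  #include <TString.h>
  #include <TError.h>

  void bunchFiles(const char* outputDir = "./eventOutput",
                  const char* archive   = "eventOutput.tar.gz")
  {
    // Bundling ~7 MB files into a single archive reduces the number of
    // entries the catalogue and the mass storage system have to handle.
    TString cmd;
    cmd.Form("tar czf %s -C %s .", archive, outputDir);
    Int_t rc = gSystem->Exec(cmd.Data());
    if (rc != 0) {
      Error("bunchFiles", "tar failed with code %d", rc);
      return;
    }
    // The single archive (instead of many small files) would then be
    // registered in the storage element / file catalogue.
    Printf("created %s from %s", archive, outputDir);
  }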
Phase 3 layout
(diagram: the user query is split by AliEn into sub-jobs over the LFNs (lfn 1 … lfn 8), dispatched to AliEn CEs/SEs and, through an AliEn CE on an LCG UI and the LCG RB, to LCG CEs/SEs)
• Phase 3 is foreseen in two months
  • analysis of signal + underlying events
  • test of the ALICE data analysis model
• AliEn job-splitting tests with ARDA in September
• An ARDA workshop is being held today at CERN
ARDA, EGEE, gLite, LCG…
• ARDA was an RTAG (September 2003) devoted to analysis:
  • it found AliEn "the most complete system among all considered"
  • it became an LCG project; setting-up meeting: January 2004
• ARDA is interfaced to the EGEE middleware (gLite), disclosed on May 18; a prototype with the EGEE middleware is due by September 2004
• gLite is presently based on the AliEn shell, the Wisconsin CE, the Globus gatekeeper, VOMS, GAS, …
• Next steps (F. Hemmer, PEB, June 7): integration of R-GMA and of the EDG WMS (developed by INFN)
• Support of LCG-2 is maintained until EGEE satisfies the requirements of the experiments (PEB, June 7)
• This picture now seems to be reversed (EGEE all-activities meeting, June 18):
  • LCG-2 will evolve into LCG-3, focused on production
  • gLite will evolve in parallel, focused on development and analysis
  • these parallel lines will occasionally converge: gLite components will be merged into the LCG-x middleware as soon as they are completed
• This major change within the LCG project occurred abruptly, without a prior discussion at PEB and SC2 level and without the approval of the experiments
• ALICE will examine the situation in its next offline week (starting June 28)
Conclusions
We will have an additional Data Challenge
• The difficult start of the ongoing DC taught us a lesson: we cannot go 18 months without testing our "production capabilities"
• In particular we have to maintain the readiness of:
  • the code (AliRoot + middleware)
  • the ALICE distributed computing facilities
  • the LCG infrastructure
  • the human "production machinery": getting all the partners into "production mode" took a non-negligible effort
• We have to plan carefully the size and the physics objectives of this data challenge
ALICE Physics Data Challenges (updated table; the newly added data challenges are marked NEW)
ALICE Offline Timeline 2004-2006 (diagram): PDC04 and its analysis in 2004 (AliRoot for PDC04 ready); design and development of new components and PDC06 preparation (pre-challenge '06) through 2005, with AliRoot for PDC06 ready; PDC06, the final development of AliRoot and the preparation for first data taking in 2006.
Conclusions
• Several problems and difficulties… however, our DC is progressing and Phase I is concluded
  • the DC is carried out entirely on the Grid
• AliEn
  • the tools are OK for running the DC and controlling the resources
  • feedback from the CEs and WNs proved essential for spotting problems early
  • centralized and compact master services allow for fast upgrades
  • data management worked just fine (provided that the underlying MSS systems work well)
  • the file catalogue works great: 4 M entries and no noticeable performance degradation
  • AliEn as a meta-grid works well, across three grids, and this is a success in itself
• The INFN contribution to the DC and to the grid activities of the experiment is substantial
  • more than 40% of the CPU cycles were provided by INFN sites; the efficiency was very high and the cooperation of the site managers was prompt
  • the interface between AliEn and LCG/Grid.It was developed in Italy
• We are going to use the LCG SEs for Phase II
  • possible bottleneck for Phase II: lack of local storage resources
• Analysis:
  • AliEn job splitting
  • we hope to test the first ARDA prototype in the Fall