User experience with the ALICE Tier2/3 @GSI
A. Andronic, A. Kalweit, A. Manafov, A. Kreshuk, C. Preuss, D. Miskowiec, J. Otwinowski, K. Schwarz, M. Ivanov, A. Marin, M. Zynovyev, P. Braun-Munzinger, P. Malzacher, S. Radomski, S. Masciocchi, T. Roth, V. Penso, W. Schoen (ALICE-GSI)
Outline • Introduction: about GSI, about ALICE • The GSI Tier2/3 • The GSIAF • The GSI Lustre cluster • The Grid@GSI • Conclusions
GSI: Gesellschaft für Schwerionenforschung (German Institute for Heavy Ion Research) • ~1000 employees • ~1000 guest scientists • Budget: ~95 million Euro
GSI as of today and the future FAIR facility (site overview)
ALICE Collaboration: > 1000 members, ~30 countries, ~100 institutes. ALICE@GSI: large participation in TPC and TRD • detector calibration • physics analysis
The ALICE Grid Map: sites in Europe, Africa, Asia and North America
ALICE Tier2/3 @GSI: size and ramp-up plans. The listed capacity is for the Tier 2 (fixed via the WLCG MoU), plus 1/3 on top for the Tier 3. http://lcg.web.cern.ch/LCG/C-RRB/MoU/WLCGMoU.pdf
What we want to provide: a mixture of
• a Tier 2
• a Tier 3
• a PROOF farm with local storage: GSIAF
integrated in the standard GSI batch farm (GSI, FAIR). We want to be able to readjust the relative size of the different parts on request.
Investment plans at GSI: ALICE Tier 2 (chart of GSI investment and FAIR ALICE T2 contributions over time)
GSI – current setup (schematic):
• 1 Gbps links to CERN and GridKa (Grid third-party copy)
• test cluster (10 TB) and ALICE::GSI::SE::xrootd (80 TB)
• vobox, Grid CE, LCG RB/CE
• GSI batch farm, ALICE cluster: 160 nodes / 1500 cores for batch, 20 nodes for GSIAF (PROOF/batch), with directly attached disk storage (81 TB)
• GSI batch farm: common batch queue
• Lustre cluster, 150 TB
Present Status
• ALICE::GSI::SE::xrootd
• 75 TB disk on file servers (16 FS with 4–5 TB each)
• 3U chassis, 12 × 500 GB disks, RAID 5
• 6 TB user space per server
• Batch Farm/GSIAF
• gave up the concept of ALICE::GSI::SE_tactical::xrootd: not good to mix local and Grid access; cryptic file names make non-Grid access difficult
• nodes dedicated to ALICE (Grid + local), used by FAIR/Theory if free
• ~1500 cores in total:
• 15 × 4 = 60 cores, 8 GB RAM, 2 TB disk + system (D-Grid)
• 25 × 8 = 200 cores, 16 GB RAM, 2 TB disk in a RAID 5 (ALICE)
• 40 × 8 = 320 cores, 32 GB RAM, 2 TB disk in a RAID 5 (D-Grid)
• 7 × 16 × 8 = 896 cores, 16 GB RAM, 2 × 128 GB disks in a RAID mirror (blades) (ALICE)
• on all machines: Debian Etch, 64-bit
PROOF – user experience
• PROOF cluster: 20 × 8 = 160 workers
• used heavily for code development and debugging, as it provides fast response on large statistics
• for example, ~1.4 TB of data is processed in ~20 minutes for a very CPU-intensive analysis
• overall, the users are very happy with it
• (almost) everything is allowed – we can still handle it with 6–8 active users
• all machines see an NFS-mounted disk, so users can use their own libraries
• large disk space (Lustre + local disks), so intermediate results can be studied at many points
(a minimal session sketch follows below)
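To illustrate the workflow behind these numbers, here is a minimal sketch of what a GSIAF-style PROOF session could look like from the ROOT prompt. The master host name, data paths and selector name are hypothetical placeholders, not the actual GSI settings.

```cpp
// proof_example.C -- minimal PROOF session sketch
// (host name, data paths and selector are hypothetical placeholders)
#include "TProof.h"
#include "TChain.h"

void proof_example()
{
   // Connect to the PROOF master node
   TProof *p = TProof::Open("proofmaster.gsi.de");
   if (!p) return;

   // Build a chain of ESD trees from files the workers can see
   // (e.g. on the mounted Lustre cluster)
   TChain *chain = new TChain("esdTree");
   chain->Add("/lustre/alice/sim/run001/AliESDs.root");
   chain->Add("/lustre/alice/sim/run002/AliESDs.root");

   // Run a user-supplied TSelector on all workers in parallel
   chain->SetProof();
   chain->Process("MyAnalysisSelector.C+");
}
```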
Installation
• shared NFS dir, visible by all nodes:
• xrootd (version 2.9.0, build 20080621-0000)
• ROOT (521-01-alice and 519-04)
• AliRoot (head)
• all compiled for 64-bit
• reason: fast software change cycle; disadvantage: possible NFS stales
• started to build Debian packages of the used software to install locally
(a sketch of how a session picks up the shared installation follows below)
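As a sketch of how an interactive session can pick up the shared installation described above, the snippet below loads the core AliRoot analysis libraries from an NFS path. The mount point is a placeholder, not the actual GSI directory; the library names are the standard AliRoot ones.

```cpp
// load_aliroot.C -- pick up the shared 64-bit AliRoot build from NFS
// (the NFS path below is a hypothetical placeholder)
#include "TSystem.h"

void load_aliroot()
{
   // Make the shared library directory known to the dynamic loader
   gSystem->AddDynamicPath("/nfs/alice/aliroot/head/lib/tgt_linuxx8664gcc");

   // Load the standard AliRoot analysis libraries
   gSystem->Load("libSTEERBase");
   gSystem->Load("libESD");
   gSystem->Load("libAOD");
   gSystem->Load("libANALYSIS");
   gSystem->Load("libANALYSISalice");
}
```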
Configuration
• setup: one standalone, high-end 32 GB machine as xrd redirector and PROOF master; the cluster nodes act as xrd data servers and PROOF workers; AliEn SE; Lustre
• so far no authentication/authorization
• managed via Cfengine, a platform-independent computer administration system (main functionality: automatic configuration)
• xrootd.cf, proof.conf, TkAuthz.Authorization, access control, Debian-specific init scripts for start/stop of the daemons (for the latter also Capistrano and LSF methods for fast prototyping)
• all configuration files are under version control (Subversion)
PROOF cluster – issues
• Still, there are some problems:
• transparency for users: “It runs fine locally, but crashes on PROOF – how do I find where the problem is?”
• fault tolerance: much progress in the last year, but still our problem #1
• the worst case is that misbehaviour of one user session can kill the whole cluster
• this happens rarely, but needs manual administrator intervention
The upgraded (alpha) GSI Lustre cluster
• running Lustre 1.6.4.3, Debian 2.6.22 kernel
• 27 (17) object storage servers, in “fail-out” mode
• roughly 135 (80) TB volume (RAID 5)
• Ethernet connections (27 (17) × 1 Gbit/s); bonding tested (2 × 1 Gbit/s per OSS), but the hardware is not available
• ~1500 (400) ALICE client CPUs
Other talks: W. Schoen, St. Louis (2007) and CERN (2008, HEPiX); S. Masciocchi (CERN, 2008)
The ALICE Analysis Train
The concept (ROOT, ALICE):
• experimental data have a large volume (~200 kB/event)
• all data are stored in ROOT format
• the data analysis is dominated by input/output latencies
• idea: load the data once and run many analyses on it (the “train”, sketched below)
• the ALICE Analysis Framework (A. Morsch, A. Gheata, et al.)
The GSI analysis train:
• 12 physics analyses (CPU/total time ~0.75)
• reads simulated events from Lustre
• runs as batch jobs on the local farm
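A minimal, self-contained sketch of the train idea (not the actual AliAnalysisManager framework code): each event is read from disk once, and every “wagon” analysis works on the same in-memory data, which is why the CPU/total-time ratio improves when analyses are combined. File paths and task names are placeholders.

```cpp
// train_sketch.C -- "load data once, run many analyses" (conceptual sketch;
// the real GSI train is built with the AliRoot analysis framework)
#include <vector>
#include "TChain.h"

// One "wagon" = one analysis acting on the event already in memory
typedef void (*Wagon)(TChain &chain, Long64_t entry);

void ptSpectrumTask(TChain &, Long64_t) { /* placeholder analysis 1 */ }
void flowTask      (TChain &, Long64_t) { /* placeholder analysis 2 */ }

void train_sketch()
{
   TChain chain("esdTree");
   chain.Add("/lustre/alice/sim/run001/AliESDs.root");  // placeholder paths
   chain.Add("/lustre/alice/sim/run002/AliESDs.root");

   std::vector<Wagon> wagons;
   wagons.push_back(ptSpectrumTask);
   wagons.push_back(flowTask);             // ... up to the 12 GSI analyses

   for (Long64_t i = 0; i < chain.GetEntries(); ++i) {
      chain.GetEntry(i);                   // I/O happens only once per event
      for (size_t w = 0; w < wagons.size(); ++w)
         wagons[w](chain, i);              // all analyses share the loaded event
   }
}
```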
Speed results: because the analysis is I/O dominated, the CPU/total-time ratio improves with the train.
Performance (with 17 file servers)
• total number of events/s versus number of parallel jobs, for data on Lustre and for data on one local disk (each node filled with MC jobs)
• saturation due to network limitation!
• at 4000 events/s, analysing 10^9 events (one year @ LHC) takes ~2.5 × 10^5 s, i.e. about 3 days
Network traffic-1 (with 17 FS) • 10 Gbit connection • switch giffwsx41 (the best one) • 20 nodes. No problems on the 10 Gbit links.
Network traffic-2 (with 17 FS) • file server lxfsd011 • 1 Gbit connection, for each of the current 17 file servers. Very close to saturation on the 1 Gbit links!
Network traffic again (with 27 FS) • 10 Gbit connection, switch giffwsx41 • now the data traffic is better distributed
Next Generation Cluster
• soon available: running Lustre 1.6.5 (move to version 1.8.x when available)
• 35 object storage servers
• initially 160 TB volume, later 600 TB
• MDS: 2 servers in a high-availability configuration
• Ethernet connections (100 × 1 Gbit/s)
• ~1400 ALICE client CPUs
• ~4000 GSI client CPUs in total
• quotas will be enabled
(Walter Schoen, Thomas Roth)
ALICE Grid jobs computed at GSI: > 50000 (GSI: 1%). Job efficiency at GSI: 80.6%.
Conclusions
• Coexistence of interactive and batch processes (PROOF analysis on staged data and Grid user/production jobs) on the same machines can be handled!
• LSF batch processes are “re-niced” to give PROOF processes a higher priority (LSF parameter)
• the number of jobs per queue can be increased/decreased
• queues can be enabled/disabled
• jobs can be moved from one queue to other queues
• currently at GSI each PROOF worker is also an LSF batch node
• Optimised I/O: various methods of data access (local disk, file servers via xrd, mounted Lustre cluster) have been investigated systematically; see the sketch below. Method of choice: Lustre and eventually an xrd-based SE. Local disks are no longer used for PROOF at GSIAF.
• PROOF nodes can be added/removed easily
• the administrative overhead with local disks is larger compared to a file cluster
• Extend the GSI T2 and GSIAF according to the promised ramp-up plan
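For reference, the three data-access methods compared above look roughly like this from a ROOT macro; host names and paths are hypothetical placeholders, not the real GSI endpoints.

```cpp
// access_methods.C -- the three access paths investigated at GSIAF, as seen
// from ROOT (host and path names are hypothetical placeholders)
#include "TFile.h"

void access_methods()
{
   // 1) Local disk on the worker node
   TFile *fLocal  = TFile::Open("/data/local/AliESDs.root");

   // 2) xrootd storage element: the redirector forwards to a data server
   TFile *fXrootd = TFile::Open("root://xrd-redirector.gsi.de//alice/sim/AliESDs.root");

   // 3) Lustre mounted on every node: a plain POSIX path, no extra protocol
   TFile *fLustre = TFile::Open("/lustre/alice/sim/AliESDs.root");
}
```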
Acknowledgements: The Team
A. Andronic, A. Kalweit, A. Manafov, A. Kreshuk, C. Preuss, D. Miskowiec, J. Otwinowski, K. Schwarz, M. Ivanov, A. Marin, M. Zynovyev, P. Braun-Munzinger, P. Malzacher, S. Radomski, S. Masciocchi, T. Roth, V. Penso, W. Schoen (ALICE-GSI, IT-GSI)
The ALICE computing model (1/2) • pp • Quasi-online data distribution and first reconstruction at T0 • Further reconstructions at the T1’s • AA • Calibration, alignment and pilot reconstructions during data taking • Data distribution and first reconstruction at T0 during the four months after the AA run • Further reconstructions at the T1’s • One copy of RAW at T0 and one distributed over the T1’s
The ALICE computing model (2/2) • T0 • First pass reconstruction, storage of one copy of RAW, calibration data and first-pass ESD’s • T1 • Reconstructions and scheduled analysis, storage of the second collective copy of RAW and one copy of all data to be kept, disk replicas of ESD’s and AOD’s • T2 • Simulation and end-user analysis, disk replicas of ESD’s and AOD’s
The Transition Radiation Detector (electron identification)
• 18 supermodules
• 6 radial layers
• 5 longitudinal stacks
• 540 chambers
• 750 m² active area
• 28 m³ of gas
Each chamber: ≈ 1.45 × 1.20 m², ≈ 12 cm thick (incl. radiator and electronics)
In total: 1.18 million readout channels
TRD assembly and installation: 4 supermodules are installed
Present Status
• ALICE::GSI::SE::xrootd
• 75 TB disk on file servers (16 FS with 4–5 TB each)
• 3U chassis, 12 × 500 GB disks, RAID 5
• 6 TB user space per server
• Batch Farm/GSIAF
• gave up the concept of ALICE::GSI::SE_tactical::xrootd: not good to mix local and Grid access; cryptic file names make non-Grid access difficult
• nodes dedicated to ALICE (Grid + local), used by FAIR/Theory if free
• 1500 cores in total:
• 160 boxes with 1200 cores (to a large extent funded by D-Grid), each with 2 × 2-core 2.67 GHz Xeon, 8 GB RAM, 2.1 TB local disk space on 3 disks plus a system disk
• additionally 24 new boxes, each with 2 × 4-core 2.67 GHz Xeon, 16 GB RAM, 2.0 TB local disk space on 4 disks including system
• further machines with up to 2 × 4 cores and 32 GB RAM, and Dell blade centres
• on all machines: Debian Etch, 64-bit