A recap of the ALICE T1/T2 Workshop in Lyon, focusing on middleware, storage software, and grid operations for ALICE computing activities, covering the key messages, themes, and future developments discussed at the event.
ALICE T1/T2 workshop • 4-6 June 2013 • CCIN2P3, Lyon • Famous last words
ALICE T1/T2 workshops • Yearly event • 2011 – CERN • 2012 – KIT (Germany) • 2013 – CCIN2P3 (Lyon) • Aims to gather middleware and storage software developers, grid operations and network experts, and site administrators involved in ALICE computing activities
Some stats of the Lyon workshop • 46 registered participants, 45 attended • Good attendance; clearly these venues are still popular and needed • 24 presentations over 5 sessions • 9 general talks on operations, software and procedures • 15 site-specific • Appropriate number of coffee and lunch breaks, plus social events • Ample time for questions (numerous) and discussion (lively), true workshop style
Themes • Operations summary • WLCG middleware/services • Monitoring • Networking: LHCONE and IPv6 • Storage: xrootd v4 and EOS • CVMFS and AliRoot • Site operations, upgrades and (new) projects, gripes (actually none…)
Messages digest from the presentations • The original slides are available on the workshop Indico page • Operations • A successful year for ALICE and Grid operations – smooth and generally problem-free, incident handling is mature and fast • No changes foreseen to the operations principles and communication channels • 2013/2014 (LHC LS1) will be years of data reprocessing and infrastructure upgrades • The focus is on analysis – how to make it more efficient
Messages (2) • WLCG middleware • CVMFS installed on many sites, leverage ALICE deployment and tuning through the existing Task Force • The WLCG VO-box is there and everyone should update • All EMI-3 products can be used • SHA-2 is on the horizon, services must be made compatible • glExec – hey, it is still alive! • Agile Infrastructure – IaaS, SaaS (for now) • OpenStack (Cinder, Keystone, Nova, Horizon, Glance) • Management through Puppet (Foreman, MPM, PuppetDB, Hiera, git) … and Facter • Storage with Ceph • All of the above – prototyping and tests, ramping up
Messages (3) • Site dashboard • http://alimonitor.cern.ch/siteinfo/issues.jsp • Get on the above link and start fixing, if you are on the list • LHCONE • The figure speaks for itself • All T2s should get involved • Instructions and expert lists are in the presentation
Messages (4) • IPv6 and ALICE • IPv4 address space is almost depleted, IPv6 is being deployed (CERN and 3 ALICE sites already) • Not all services are IPv6-ready – testing and adjustment are needed • Cool history of the network bandwidth evolution • Xrootd 4.0.0 • Complete client rewrite, new caching, non-blocking requests (client call-backs), new user classes for metadata and data operations, IPv6-ready • Impressive speedup for large operations • API redesigned, no backward compatibility, some CLI commands change names • ROOT plugin ready and being tested • Mid-July release target
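To make the non-blocking, call-back style of the rewritten client more concrete, here is a minimal sketch using the XrdCl C++ classes; the server URL and file path are placeholders, and a real application would wait for the call-back (e.g. on a semaphore) before exiting.

// Minimal sketch of the non-blocking (call-back) style of the rewritten
// xrootd client, using the XrdCl C++ API. URL and path are placeholders.
#include <XrdCl/XrdClFile.hh>
#include <XrdCl/XrdClXRootDResponses.hh>
#include <iostream>

// Handler invoked by the client when the asynchronous Open() completes.
class OpenHandler : public XrdCl::ResponseHandler {
public:
  virtual void HandleResponse(XrdCl::XRootDStatus *status,
                              XrdCl::AnyObject    *response) {
    std::cout << "Open finished: "
              << (status->IsOK() ? "OK" : status->ToStr()) << std::endl;
    delete status;
    delete response;
  }
};

int main() {
  XrdCl::File file;
  OpenHandler handler;

  // Non-blocking: the call returns immediately, the handler is called back later.
  XrdCl::XRootDStatus st =
    file.Open("root://xrootd.example.org//alice/some/file.root",
              XrdCl::OpenFlags::Read, XrdCl::Access::None, &handler);
  if (!st.IsOK())
    std::cerr << "Could not send the request: " << st.ToStr() << std::endl;

  // A real application would wait for the call-back here before exiting.
  return 0;
}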
Messages (5) • EOS • Main disk storage manager at CERN, 45 PB deployed, 32 PB used (9.9/8.3 PB for ALICE) • Designed to work with cheap storage servers, uses software RAID (RAIN), ppm-level probability of file loss • Impressive array of control and service tools (built with operations in mind) • Even more impressive benchmarks… • Site installation – read the pros/cons carefully to decide if it is a good fit for you • Support – best effort, as for xrootd
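Since EOS is accessed through the standard xrootd protocol, a site or user can probe an instance with the same client classes; below is a minimal sketch that stats a file on a hypothetical EOS endpoint (the host name and path are placeholders, not the actual CERN instance names).

// Minimal sketch: querying an EOS instance over the standard xrootd
// protocol with XrdCl::FileSystem. Endpoint and path are placeholders.
#include <XrdCl/XrdClFileSystem.hh>
#include <iostream>

int main() {
  XrdCl::URL url("root://eos.example.org");   // hypothetical EOS head node
  XrdCl::FileSystem fs(url);

  XrdCl::StatInfo *info = 0;
  XrdCl::XRootDStatus st = fs.Stat("/eos/alice/some/file.root", info);

  if (st.IsOK() && info) {
    std::cout << "size:  " << info->GetSize() << " bytes\n"
              << "mtime: " << info->GetModTime() << std::endl;
  } else {
    std::cerr << "stat failed: " << st.ToStr() << std::endl;
  }

  delete info;
  return 0;
}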
Messages (6) • ALICE production and analysis software • AliRoot is “one software to rule them all” in ALICE offline • >150 developers; analysis: ~1M SLOC; reconstruction, simulation, calibration, alignment, visualization: ~1.4M SLOC; supported on many platforms and flavors • In development since 1(8)998 • Sophisticated MC framework with embedded physics generators, using G3 and G4 • Incorporates the full calibration code, which is also run online and in the HLT (shared code) • Fully encapsulates the analysis; a lot of work on improving it, more quality and control checks needed • Efforts to reduce memory consumption in reconstruction • G4 and Fluka in MC
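As an illustration of how user analysis plugs into this framework, here is a minimal sketch of an analysis task skeleton based on the AliAnalysisTaskSE interface; the class name and histogram are purely illustrative.

// Minimal sketch of an ALICE analysis task skeleton (AliAnalysisTaskSE
// interface); class name and histogram are illustrative only.
#include "AliAnalysisTaskSE.h"
#include "AliVEvent.h"
#include "TH1F.h"
#include "TList.h"

class AliAnalysisTaskToy : public AliAnalysisTaskSE {
public:
  AliAnalysisTaskToy(const char *name = "toy")
    : AliAnalysisTaskSE(name), fOutput(0), fHistMult(0) {
    DefineOutput(1, TList::Class());              // one output container
  }

  virtual void UserCreateOutputObjects() {        // called once per worker
    fOutput = new TList();
    fOutput->SetOwner(kTRUE);
    fHistMult = new TH1F("hMult", "track multiplicity", 100, 0., 1000.);
    fOutput->Add(fHistMult);
    PostData(1, fOutput);
  }

  virtual void UserExec(Option_t *) {             // called for every event
    AliVEvent *event = InputEvent();
    if (!event) return;
    fHistMult->Fill(event->GetNumberOfTracks());
    PostData(1, fOutput);
  }

  virtual void Terminate(Option_t *) {}           // called once at the end

private:
  TList *fOutput;   //! output list, not streamed
  TH1F  *fHistMult; //! multiplicity histogram, not streamed
  ClassDef(AliAnalysisTaskToy, 1);
};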
Messages (7) • CVMFS – timeline and procedures • Mature, scalable and supported product • Used by all other LHC experiments (and beyond) • Based on the proven CernVM family • Enabling technology for clouds, CernVM as a user interface, Virtual Analysis Facilities, opportunistic resources, volunteer computing, and part of Long-Term Data Preservation • April 2014 – CVMFS on all sites, the only method of software distribution for ALICE
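As a hedged illustration of what “CVMFS on all sites” means in practice, here is a minimal worker-node sanity check that the ALICE repository is mounted and readable; the path follows the usual /cvmfs/alice.cern.ch convention and should be adapted to the local setup.

// Minimal sketch of a worker-node sanity check that the ALICE CVMFS
// repository is mounted and readable. The path follows the usual
// /cvmfs/<repository> convention; adjust to the local configuration.
#include <sys/stat.h>
#include <dirent.h>
#include <iostream>

int main() {
  const char *repo = "/cvmfs/alice.cern.ch";

  // stat() also triggers the autofs mount if the repository is configured.
  struct stat st;
  if (stat(repo, &st) != 0 || !S_ISDIR(st.st_mode)) {
    std::cerr << repo << " is not mounted" << std::endl;
    return 1;
  }

  // Open the directory to verify that it is actually readable.
  DIR *dir = opendir(repo);
  if (!dir) {
    std::cerr << repo << " is mounted but not readable" << std::endl;
    return 1;
  }
  closedir(dir);

  std::cout << repo << " looks OK" << std::endl;
  return 0;
}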
Sites Messages (1) • UK • GridPP (T1 + 19 sites); RAL, Oxford and Birmingham for ALICE • Smooth operation, ALICE can (and does) run beyond its pledge, occasional problems with job memory • Small-scale cloud tests • RMKI_KFKI • Shared CMS/ALICE (170 cores, 72 TB disk) • Good resources delivery • Fast turnaround of experts; good documentation on operations is a must (done)
Sites Messages (2) • KISTI • Extended support team of 8 people • Tape system tested with RAW data from CERN • Network still to be debugged, but not a showstopper • CPU to be ramped up x2 in 2013 • Well on its way to becoming the first T1 since the big T1 bang • NDGF • Lose some (PDC), gain some more cores (CSC) • Smooth going; dCache will stay and will get location information to improve efficiency • The 0.0009 (reported, not real) efficiency at DCSC/KU is still a mystery; however, it hurts NDGF as a whole and must be fixed
Sites Messages (3) • Italy • New head honcho – Domenico Elia (grazie Massimo!) • Funding is tough; National Research Projects help a lot with manpower, PON helps with hardware in the south • 6 T2s and a T1 – smooth delivery and generally no issues • Torino is a hotbed of new technology – clouds (OpenNebula, GlusterFS, OpenWRT) • TAF is open for business, completely virtual (surprise!) • Prague • The city is (partially) under water • Currently 3.7k cores and 2 PB disk, shared LHC/D0, contributes ~1.5% of the ALICE+ATLAS Grid resources • Stable operation, distributed storage • Funding situation is degrading
Sites Messages (4) • US • LLNL+LBL resources purchasing is complementary and fits well to cover changing requirements • CPU pledges fulfilled, SE a bit underused but on the rise • Infestation of the ‘zombie grass’ jobs; this is California, something of this sort was to be expected… • Possibility for tape storage at LBL (potential T1) • France • 8 T2s, 1 T1, providing 10% of WLCG power, steady operation • Emphasis on common solutions for services and support • All centres are in LHCONE (7+7 PB in/out have already passed through it) • Flat resources provisioning for the next 4 years
Sites Messages (5) • India (Kolkata) • Provides about 1.2% of ALICE resources • Innovative cooling solution, all issues of the past solved, stable operation • Plans for steady resources expansion • Germany • 2 T2s, 1 T1 – the largest T1 in WLCG, provides ~50% of ALICE T1 resources • Good centre names: Hessisches Hochleistungsrechenzentrum Goethe Universität (requires an IQ of 180 to say it) • The T2s have heterogeneous installations (both batch and storage), support many non-LHC groups, are well integrated in the ALICE Grid, smooth delivery
Sites Messages (6) • Slovakia • In ALICE since 2006 • Serves ALICE/ATLAS/HONE • Upgrades planned for air-conditioning and power, later CPU and disk; expert support is a concern • Reliable and steady resources provision • RDIG • RRC-KI (toward T1): hardware (CPU/storage) rollout, service installation and validation, personnel in place, pilot testing with ATLAS payloads • 8 T2s + JRAF + PoD@SPbSU deliver ~5% of the ALICE Grid resources, historically support all LHC VOs • Plans for steady growth and site consolidation • As all others, reliable and smooth operation
Victory! I work at a T1! How are you so cool under pressure?