ALICE T1/T2 workshop
4-6 June 2013, CCIN2P3 Lyon
Famous last words
Some stats
• 46 registered participants, we have counted 45 in the room. Still looking for the mystery No. 46…
• Good attendance; clearly these venues are still popular and needed
• 24 presentations over 5 sessions
  • 9 general, on operations, software and procedures
  • 15 site-specific
• The 'KIT' session format worked well this time too
• An appropriate number of coffee and lunch breaks, plus social events
• Ample time for questions (numerous) and discussion (lively), true workshop style
• With one notable exception (LB), all presenters respected the allotted time
Themes
• Operations summary
• WLCG middleware/services
• Monitoring
• Networking: LHCONE and IPv6
• Storage: xrootd v4 and EOS
• CVMFS and AliRoot
• Site operations, upgrades, (new) projects and gripes (actually none…)
Messages digest from the talks (Renaud* and Latchezar's take)
• We have tried not to trivialize the message of the various presentations; it is still better to look at the original slides
• Operations
  • A successful year for ALICE and Grid operations: smooth and generally problem-free, with mature and fast incident handling
  • No changes foreseen to the operations principles and communication channels
  • 2013/2014 (LHC LS1) will be years of data reprocessing and infrastructure upgrades
  • The focus is on analysis and how to make it more efficient

* Renaud has graciously accepted to be blamed for all incorrect statements
Messages (2)
• WLCG middleware
  • CVMFS is installed on many sites; ALICE deployment and tuning leverage the existing TF
  • The WLCG VO-box is available and everyone should update
  • All EMI-3 products can be used
  • SHA-2 is on the horizon; services must be made compatible
  • glExec – hey, it is still alive!
• Agile Infrastructure – IaaS, SaaS (for now)
  • OpenStack (Cinder, Keystone, Nova, Horizon, Glance)
  • Management through Puppet (Foreman, MPM, PuppetDB, Hiera, git) … and Facter
  • Storage with Ceph
  • All of the above: prototyping and tests, ramping up
Messages (3)
• Site dashboard
  • http://alimonitor.cern.ch/siteinfo/issues.jsp
  • Go to the link above and start fixing, if your site is on the list
• LHCONE
  • The figure (in the original presentation) speaks for itself
  • All T2s should get involved
  • Instructions and expert lists are in the presentation
Messages (4)
• IPv6 and ALICE
  • The IPv4 address space is almost depleted; IPv6 is being deployed (at CERN and 3 ALICE sites already)
  • Not all services are IPv6-ready; testing and adjustment are needed (see the probe sketched after this list)
  • A cool history of the network bandwidth evolution
• Xrootd 4.0.0
  • Complete client rewrite: new caching, non-blocking requests (client call-backs), new user classes for metadata and data operations, IPv6-ready
  • Impressive speedup for large operations
  • API redesigned, no backward compatibility, some CLI commands change names
  • ROOT plugin ready and being tested
  • Mid-July release target
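As an illustration of the kind of IPv6-readiness testing mentioned above, here is a minimal probe a site could run against its own services. It only checks DNS (AAAA) resolution and a TCP connect over IPv6; the host/port pairs listed are placeholders, not an official ALICE test list.

#!/usr/bin/env python
# Minimal IPv6-readiness probe: checks whether a service host publishes an
# AAAA record and accepts TCP connections over IPv6.
# The SERVICES list below is a placeholder, not an official ALICE test set.
import socket

SERVICES = [
    ("voalice11.cern.ch", 8084),   # hypothetical VO-box / proxy endpoint
    ("alimonitor.cern.ch", 80),    # MonALISA web front-end
]

def ipv6_ready(host, port, timeout=5.0):
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False, "no AAAA record"
    family, socktype, proto, _, sockaddr = infos[0]
    s = socket.socket(family, socktype, proto)
    s.settimeout(timeout)
    try:
        s.connect(sockaddr)
        return True, "connected via %s" % sockaddr[0]
    except (socket.timeout, socket.error) as exc:
        return False, str(exc)
    finally:
        s.close()

if __name__ == "__main__":
    for host, port in SERVICES:
        ok, detail = ipv6_ready(host, port)
        print("%-25s:%-5d %-4s %s" % (host, port, "OK" if ok else "FAIL", detail))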
Messages (5)
• EOS
  • Main disk storage manager at CERN: 45 PB deployed, 32 PB used (9.9/8/3 ALICE)
  • Designed to work with cheap storage servers; uses software RAID (RAIN), ppm probability of file loss
  • Impressive array of control and service tools (designed with operations in mind)
  • Even more impressive benchmarks…
  • Site installation: read the pros and cons carefully to decide whether it is a good fit for you
  • Support: best effort, xrootd-style
Messages (6)
• ALICE production and analysis software
  • AliRoot is "one software to rule them all" in ALICE offline
  • >150 developers; analysis: ~1M SLOC; reconstruction, simulation, calibration, alignment and visualization: ~1.4M SLOC; supported on many platforms and flavors
  • In development since 1(8)998
  • Sophisticated MC framework with (multiple) embedded generators, using G3 and G4
  • Incorporates the full calibration code, which is also run online and in the HLT (code sharing)
  • Fully encapsulates the analysis; a lot of work on improving it, more quality and control checks needed
  • Efforts to reduce memory consumption in reconstruction
  • G4 and Fluka in MC
Messages (7)
• CVMFS – timeline and procedures
  • Mature, scalable and supported product
  • Used by all the other LHC experiments (and beyond)
  • Based on the proven CernVM family
  • Enabling technology for clouds, CernVM as a user interface, Virtual Analysis Facilities, opportunistic resources and volunteer computing; part of Long-Term Data Preservation
  • April 2014: CVMFS on all sites, the only method of software distribution for ALICE (a minimal readiness check is sketched below)
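To make the April 2014 "CVMFS on all sites" milestone concrete, here is a minimal sketch of the kind of check a site admin might run: it verifies that the standard ALICE repository (alice.cern.ch) is mounted and that the stock cvmfs_config probe succeeds. Treat it as an illustration, not an official validation script.

#!/usr/bin/env python
# Minimal sketch of a CVMFS readiness check for the ALICE repository.
# The repository name and 'cvmfs_config probe' call follow standard CVMFS
# conventions; this is an illustration, not an official site test.
import os
import subprocess

REPO = "alice.cern.ch"
MOUNTPOINT = os.path.join("/cvmfs", REPO)

def repo_mounted():
    """True if the autofs/CVMFS mount responds and is non-empty."""
    try:
        return len(os.listdir(MOUNTPOINT)) > 0
    except OSError:
        return False

def probe_repo():
    """Run the standard 'cvmfs_config probe' for the ALICE repository."""
    try:
        return subprocess.call(["cvmfs_config", "probe", REPO]) == 0
    except OSError:          # cvmfs_config not installed
        return False

if __name__ == "__main__":
    print("mount check : %s" % ("OK" if repo_mounted() else "FAIL"))
    print("probe check : %s" % ("OK" if probe_repo() else "FAIL"))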
Sites Messages (1)
• UK
  • GridPP T1 + 19; RAL, Oxford and Birmingham for ALICE
  • Smooth operation; ALICE can (and does) run beyond its pledge; occasional problems with job memory
  • Small-scale test of cloud resources
• RMKI_KFKI
  • Shared CMS/ALICE (170 cores, 72 TB disk)
  • Good resources delivery
  • Fast turnaround of experts; good documentation on operations is a must (done)
Sites Messages (2)
• KISTI
  • Extended support team of 8 people
  • Tape system tested with RAW data from CERN
  • Network still to be debugged, but not a showstopper
  • CPU to be ramped up x2 in 2013
  • Well on its way to becoming the first T1 since the big T1 bang
• NDGF
  • Lose some cores (PDC), gain some more (CSC)
  • Smooth going; dCache will stay and will get location information to improve efficiency
  • The 0.0009 efficiency at DCSC/KU is still a mystery, and it hurts NDGF as a whole
Sites Messages (3)
• Italy
  • New head honcho: Domenico Elia (grazie Massimo!)
  • Funding is tough; National Research Projects help a lot with manpower, PON helps with hardware in the south
  • 6 T2s and the T1: smooth delivery and generally no issues
  • Torino is a hotbed of new technology: clouds (OpenNebula, GlusterFS, OpenWRT)
  • The TAF is open for business, completely virtual (surprise!)
• Prague
  • The city is (partially) under water
  • Currently 3.7k cores and 2 PB of disk, shared LHC/D0; contributes ~1.5% of the ALICE+ATLAS Grid resources
  • Stable operation, distributed storage
  • The funding situation is degrading
Sites Messages (4)
• US
  • LLNL+LBL resource purchasing is complementary and fits well to cover changing requirements
  • CPU pledges fulfilled; the SE is a bit underused but on the rise
  • Infestation of 'zombie grass' jobs – this is California…
  • Possibility of tape storage at LBL (potential T1)
• France
  • 8 T2s and 1 T1, providing 10% of WLCG power; steady operation
  • Emphasis on common solutions for services and support
  • All centres are in LHCONE (7+7 PB have already passed through it)
  • Flat resource provisioning for the next 4 years
Sites Messages (5)
• India (Kolkata)
  • Provides about 1.2% of the ALICE resources
  • Innovative cooling solution; all issues of the past solved, stable operation
  • Plans for steady resource expansion
• Germany
  • 2 T2s and 1 T1 – the largest T1 in WLCG, providing ~50% of the ALICE T1 resources
  • Good centre names: Hessisches Hochleistungsrechenzentrum Goethe Universität (requires a 180 IQ to say it)
  • The T2s have heterogeneous installations (both batch and storage), support many non-LHC groups and are well integrated in the ALICE Grid; smooth delivery
Sites Messages (6)
• Slovakia
  • In ALICE since 2006
  • Serves ALICE/ATLAS/HONE
  • Upgrades planned for air-conditioning and power, later for CPU and disk; expert support is a concern
  • Reliable and steady resource provision
• RDIG
  • RRC-KI (toward T1): hardware (CPU/storage) rollout, service installation and validation, personnel in place, pilot testing with ATLAS payloads
  • 8 T2s + JRAF + PoD@SPbSU deliver ~5% of the ALICE Grid resources and historically support all LHC VOs
  • Plans for steady growth and site consolidation
  • Like all the others, reliable and smooth operation
Victory! I work at a T1! How are you so cool under pressure?
At the end
• On behalf of all participants:
  • Thank you, Renaud, and thanks to the CCIN2P3 crew for the flawless organization
…and the future
• There is still a visit of the computing centre this afternoon for those who signed up
• The next workshop (in one year's time) needs a host
• And now, to lunch…