ATLAS Computing
Alessandro De Salvo, CCR Workshop, 18-5-2016
2015 Data Taking
[Plots: collected data in 2015 and 2016, with > 92% and > 93% data-taking efficiency; pile-up distribution for 2015]
LHC Upgrade Timeline
The challenge to computing repeats periodically!
[Timeline figure, 2009 to ~2037: HLT readout rate 0.4 kHz, 30 fb-1 (Run 1); HLT readout rate 1 kHz, 150 fb-1 (Run 2); 300 fb-1 (Run 3); HLT readout rate 5-10 kHz (HL-LHC)]
The data rate and volume challenge
[Same LHC upgrade timeline figure as on the previous slide]
• In ~10 years, the number of events per second increases by a factor of 10
• More events to process
• More events to store
The data complexity challenge
[Same LHC upgrade timeline figure as on the previous slides]
• In ~10 years, the luminosity increases by a factor of 10
• More complex events
Pile-up challenge
• Higher pile-up means:
  • a linear increase of the digitization time
  • a factorial increase of the reconstruction time
  • slightly larger events
  • much more memory
• The average pile-up will be:
  • <mu> = 30 in 2016
  • <mu> = 35 in 2017
  • …
  • <mu> = 200 at the HL-LHC (in ~10 years)
A toy scaling model is sketched below.
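As an illustration only, here is a minimal Python sketch of such a scaling model; the coefficients and the power-law exponent are invented placeholders, not ATLAS measurements.

```python
# Toy model of how per-event CPU time could scale with pile-up <mu>.
# All coefficients are made up for illustration; they are NOT ATLAS numbers.

def digitization_time(mu, t0=0.1, slope=0.02):
    """Digitization grows roughly linearly with pile-up."""
    return t0 + slope * mu

def reconstruction_time(mu, t0=1.0, coeff=0.01, power=2.5):
    """Reconstruction grows much faster than linearly (combinatorics in
    tracking); modelled here as a power law for simplicity."""
    return t0 + coeff * mu ** power

for mu in (30, 35, 80, 200):
    print(f"<mu>={mu:3d}  digi ~{digitization_time(mu):5.1f} s/event  "
          f"reco ~{reconstruction_time(mu):7.1f} s/event")
```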
Simulation – full and fast
[Figure: full vs fast simulation, trading speed against precision; time/event distribution shown on a log scale]
Simulation
• Simulation is CPU intensive
• Integrated Simulation Framework (ISF): mixing of full Geant4 and fast simulation within the same event
• Work in progress; the target is 2016
A conceptual sketch of per-particle routing follows.
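Purely as a conceptual sketch (not the actual ISF code or API), this shows the idea of routing particles within one event either to the detailed or to a fast simulator; the routing rule and thresholds are invented.

```python
# Conceptual sketch of mixing full and fast simulation within one event.
# Names and thresholds are invented for illustration; this is not ISF code.

from dataclasses import dataclass

@dataclass
class Particle:
    pdg_id: int
    energy_gev: float

def full_sim(p: Particle) -> str:
    return f"full Geant4 sim for pdg {p.pdg_id} ({p.energy_gev:.1f} GeV)"

def fast_sim(p: Particle) -> str:
    return f"fast parametrised sim for pdg {p.pdg_id} ({p.energy_gev:.1f} GeV)"

def simulate_event(particles):
    results = []
    for p in particles:
        # Toy routing rule: send low-energy electrons/photons to the fast
        # simulation, everything else to full Geant4.
        if abs(p.pdg_id) in (11, 22) and p.energy_gev < 50.0:
            results.append(fast_sim(p))
        else:
            results.append(full_sim(p))
    return results

print("\n".join(simulate_event([Particle(11, 20.0), Particle(211, 5.0)])))
```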
Reconstruction
• Reconstruction is memory hungry and requires non-negligible CPU (40% w.r.t. simulation, 20% of the ATLAS CPU usage)
• AthenaMP: multi-processing reduces the memory footprint (see the sketch below)
• Code and algorithm optimization largely reduced the CPU needs of reconstruction [4]
[Figures: Athena memory profile (MP vs serial, against 2 GB/core); running jobs (single-core vs multi-core); reconstruction time (s/event) vs time]
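As a rough sketch of the AthenaMP idea (not the actual AthenaMP implementation), a fork-based worker pool lets child processes share read-only memory pages with the parent via copy-on-write, so N workers cost far less memory than N independent jobs:

```python
# Minimal sketch of fork-based event processing: children share read-only
# pages (e.g. geometry, conditions) with the parent via copy-on-write.
# Placeholder data and functions only; this is not AthenaMP code.

import os
from multiprocessing import Pool

# Large read-only data loaded once in the parent before forking;
# a placeholder list standing in for geometry/conditions payloads.
CONDITIONS = list(range(1_000_000))

def reconstruct(event_id: int) -> str:
    # Workers read CONDITIONS without duplicating it in memory.
    cond = CONDITIONS[event_id % len(CONDITIONS)]
    return f"event {event_id} reconstructed in pid {os.getpid()} (cond={cond})"

if __name__ == "__main__":
    with Pool(processes=4) as pool:          # one worker per core
        for line in pool.map(reconstruct, range(8)):
            print(line)
```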
Derivations
• New analysis model for Run 2: group data format (DAOD) produced using a train model
• Production of 84+ DAOD species by 19 trains on the Grid
• Available 24h after data reconstruction at Tier-0
• Vital for quick turnaround and robustness of analyses
• From 2015 onwards, ATLAS results are based on DAODs
The train idea is sketched below.
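A minimal sketch of the train concept, assuming a generic event loop: one pass over the reconstructed input feeds several derived output streams at once. Stream names and selections below are invented for illustration, not real ATLAS derivations.

```python
# Toy sketch of a derivation "train": one read of the input feeds many
# DAOD output streams ("carriages") in a single pass over the events.

carriages = {
    "DAOD_STREAM_A": lambda ev: ev["n_leptons"] >= 2,
    "DAOD_STREAM_B": lambda ev: ev["met_gev"] > 100.0,
    "DAOD_STREAM_C": lambda ev: ev["n_jets"] >= 4,
}

def run_train(events):
    outputs = {name: [] for name in carriages}
    for ev in events:                     # single pass over the input
        for name, selection in carriages.items():
            if selection(ev):
                outputs[name].append(ev)  # a real train would also slim/thin
    return outputs

events = [
    {"n_leptons": 2, "met_gev": 20.0, "n_jets": 2},
    {"n_leptons": 0, "met_gev": 150.0, "n_jets": 5},
]
for name, out in run_train(events).items():
    print(name, len(out), "events")
```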
Hardware trend and implications
• Clock speed has stalled (bad)
• Transistor density keeps increasing (good)
• Memory per core diminishes:
  • WLCG: 2 GB/core
  • Xeon Phi: 60 cores, 16 GB
  • Tesla K40: 2880 cores, 16 GB
• Multi-processing (AthenaMP) alone will no longer be sufficient
• Future framework: multi-threading and parallelism
Future Framework Requirements Group
• Established jointly between Trigger/DAQ and Computing
• Examines the needs of a future framework satisfying both offline and HLT use cases
• Reported in December: https://cds.cern.ch/record/1974156/
[Cartoon: Run 3 multi-threaded reconstruction; colours represent different events, shapes different algorithms, all in one process running multiple threads]
A minimal threading sketch of this inter-event parallelism follows.
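As a minimal sketch only (not the actual ATLAS or Gaudi framework), a thread pool processing several events concurrently inside one process illustrates the cartoon: one process, multiple threads, multiple events in flight.

```python
# Minimal sketch of inter-event parallelism in a single process: several
# events in flight at once, each walked through the same algorithm chain
# by a pool of threads. Not the real multi-threaded Athena framework.

from concurrent.futures import ThreadPoolExecutor
import threading

ALGORITHMS = ("tracking", "calo_clustering", "jet_finding")  # placeholder names

def process_event(event_id: int) -> str:
    steps = []
    for alg in ALGORITHMS:                        # same algorithm chain per event
        steps.append(f"{alg}[{threading.current_thread().name}]")
    return f"event {event_id}: " + " -> ".join(steps)

with ThreadPoolExecutor(max_workers=4) as pool:   # one process, many threads
    for result in pool.map(process_event, range(6)):
        print(result)
```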
Timeline
• We want to have a multi-threaded framework in place for Run 3
• This allows gaining experience with multi-threaded running before the HL-LHC
• Thus most of the development should be done by the start of LS2, which is now only 2 years away
• At the end of Run 2 we should have a functional multi-threaded prototype ready for testing
Leveraging opportunistic resources
• Almost 50% of ATLAS production at peak rate relies on opportunistic resources
• Today most opportunistic resources are accessible via Grid interfaces/services
• Enabling the utilization of non-Grid resources is a long-term investment (beyond opportunistic use)
[Plot, 01/05/14 to 01/03/15: number of cores for ATLAS running jobs; pledge at ~100k cores; AWS burst labelled at 450k]
Grid and off-Grid resources
• Grid technologies were very successful for us, but the global community did not fully buy into them
• We have a dedicated network of sites, using custom software and serving (mostly) the WLCG community
• Finding opportunistic/common resources:
  • High Performance Computing centres (https://en.wikipedia.org/wiki/Supercomputer)
  • Opportunistic and commercial cloud resources (https://en.wikipedia.org/wiki/Cloud_computing)
• On a cloud you ask for resources through a defined interface and get access to and control of a (virtual) machine, rather than a job slot on the Grid (see the sketch below)
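As an illustration of that "defined interface", here is a hedged sketch using the openstacksdk Python client; the cloud name, image and flavour are placeholders, and this is not how ATLAS actually provisions cloud resources.

```python
# Sketch: requesting a virtual machine from an OpenStack-like cloud through
# its API, in contrast to submitting a job to a Grid computing element.
# Cloud name, image and flavour below are placeholders for illustration.

import openstack

conn = openstack.connect(cloud="my-cloud")      # credentials from clouds.yaml

image = conn.compute.find_image("worker-image")   # placeholder image name
flavor = conn.compute.find_flavor("m1.large")     # placeholder flavour name

server = conn.compute.create_server(
    name="atlas-worker-001",
    image_id=image.id,
    flavor_id=flavor.id,
)
server = conn.compute.wait_for_server(server)
print("VM ready:", server.name, server.status)
```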
(Opportunistic) cloud resources
• We invested a lot of effort in enabling the usage of cloud resources
• For example, the ATLAS HLT farm at the CERN ATLAS pit (P1) was instrumented with a cloud interface in order to run simulation: Sim@P1
• The HLT farm was later dynamically reconfigured to run reconstruction on multi-core resources (Reco@P1); we expect to be able to do the same with other clouds
[Plot, 07/09/14 to 04/10/14: number of events vs time for T2s, T1s and CERN P1; about 20M events/day, with P1 contributing approximately 5% of the 4-day sum]
HPCs
• High Performance Computers were designed for massively parallel applications (different from the HEP use case), but we can parasitically benefit from empty cycles that others cannot use (e.g. single-core job slots)
• The ATLAS production system has been extended to leverage HPC resources
• 24h test at the Oak Ridge Titan system (#2 world HPC machine, 299,008 cores): ATLAS event generation used 200,000 CPU hours on 90k parallel cores (equivalent to 70% of our Grid resources)
• Mira @ Argonne: Sherpa generation using 12,244 nodes with 8 threads per node, i.e. 97,952 parallel Sherpa processes
• The goal is to validate as many workflows as possible; today approximately 5% of ATLAS production runs on HPCs
[Plot: running cores vs time, reaching 10,000 running cores]
A minimal MPI-style sketch of this embarrassingly parallel workload follows.
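As a minimal sketch, assuming mpi4py is available on the machine, this shows how an embarrassingly parallel event-generation payload can be fanned out over many HPC ranks; the generator call is a placeholder, not real Sherpa or ATLAS steering.

```python
# Sketch of fanning out independent event-generation tasks over HPC ranks
# with MPI: each rank generates its own slice of events in parallel.
# Run with e.g.:  mpirun -n <ranks> python generate_events.py

from mpi4py import MPI

def generate(seed: int, n_events: int) -> int:
    # Placeholder for a real generator call, seeded per rank.
    return n_events

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

TOTAL_EVENTS = 10_000_000
per_rank = TOTAL_EVENTS // size

produced = generate(seed=rank, n_events=per_rank)
total = comm.reduce(produced, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} ranks produced {total} events in total")
```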
Challenges in HPC utilisation
• Blue Gene: PowerPC architecture
• Restrictive site policies: inbound/outbound connectivity, limits on #jobs / #threads
Networking
• Networking is the one item that will most probably continue its progress and evolution further:
  • in terms of bandwidth increase
  • in terms of new technologies
Content Delivery Networking
[Slide from T. Wenaus]
Storage endpoints
• 75% of the available Tier-2 storage is in ~30 sites; there is a large disparity in the size of Tier-2s
• It is more efficient to have larger and fewer storage endpoints
• Two possible categories: 'cache-based' and 'large' Tier-2s
  • Some Tier-2s are already larger than some Tier-1s
• Storage endpoints < 300 TB: do not plan an increase of storage (pledges) in the next years, or aggregate with other endpoint(s) to form an entity larger than 300 TB
How might ATLAS computing look in the future?
• Even today, our CPU capacity would fit into one supercomputing centre
• The future: allocations at HPC centres, commercial and academic clouds, computing on demand, Grid technologies
• With the network evolution, 'local' storage becomes redefined:
  • consolidate/federate storage into a few endpoints
  • data caching and remote data access (see the sketch below)
• The main item that does not have many solutions and imposes a severe constraint is our data storage: we need reliable and permanent storage under ATLAS control
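As a small illustration of remote data access, assuming PyROOT and an XRootD-readable file (the URL below is a placeholder, not a real dataset), a file can be read directly over the network instead of from local storage:

```python
# Sketch of remote data access: open a ROOT file over the XRootD protocol
# instead of copying it to local storage first. The URL is a placeholder.

import ROOT

url = "root://some-storage-endpoint.example.org//atlas/daod/sample.root"
f = ROOT.TFile.Open(url)            # streams the file remotely over XRootD
if f and not f.IsZombie():
    tree = f.Get("CollectionTree")  # typical ATLAS xAOD tree name
    print("entries:", tree.GetEntries() if tree else "tree not found")
else:
    print("could not open", url)
```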
Conclusions
• The computing model has been adapted to Run 2
• The 2015 data processing and distribution has been a success
• The 2016 data taking has started smoothly
• No big changes are envisaged for Run 3
• In the future: more efficient usage of opportunistic resources and a reorganization of the global storage facilities