HPC computing at CERN - use cases from the engineering and physics communities Michal HUSEJKO, Ioannis AGTZIDIS IT/PES/ES
Agenda • Introduction – Where we are now • Applications used at CERN that require HPC infrastructure • Use cases (Engineering) • Ansys Mechanical • Ansys Fluent • Physics HPC applications • Next steps • Q&A
Introduction • Some 95% of our applications are served well by bread-and-butter machines • We (CERN IT) have invested heavily in the Agile Infrastructure (AI), including a layered approach to responsibilities, virtualization, and a private cloud • There are certain applications, traditionally called HPC applications, which have different requirements • Even though these applications sail under the common HPC name, they differ from one another and have different requirements • These applications need detailed requirements analysis
Scope of talk • We have contacted our user community and started gathering user requirements on a continuous basis • We have started a detailed system analysis of our HPC applications to gain knowledge of their behavior • In this talk I would like to present the progress and the next steps • At a later stage, we will look at how the HPC requirements can fit into the IT infrastructure
HPC applications • Engineering applications: • Used at CERN in different departments to model and design parts of the LHC machine • The IT-PES-ES section supports the user community of these tools • Tools used for: structural analysis, fluid dynamics, electromagnetics, multiphysics • Major commercial tools: Ansys, Fluent, HFSS, Comsol, CST • but also open source: OpenFOAM (fluid dynamics) • Physics simulation applications • PH-TH Lattice QCD simulations • BE LINAC4 plasma simulations • BE beam simulations (CLIC, LHC, etc.) • HEP simulation applications for theory and accelerator physics
Agenda • Introduction – Where we are now • Applications used at CERN that require HPC infrastructure • Use cases (Engineering) • Ansys Mechanical • Ansys Fluent • Physics HPC applications • Next steps • Q&A
Use case 1: Ansys Mechanical • Where? • LINAC4 Beam Dump System • Who? • Ivo Vicente Leitao, Mechanical Engineer (EN/STI/TCD) • How? • Ansys Mechanical for design modeling and simulations (stress and thermal structural analysis)
How does it work? • Ansys Mechanical • Structural analysis: stress and thermal, steady and transient • Finite Element Method • We have a physical problem defined by differential equations • It is impossible to solve it analytically for a complicated structure (problem) • We divide the problem into subdomains (elements) • We solve the differential equations numerically at selected points (nodes) • Then, by means of approximation functions, we project the solution onto the global structure (illustrated in the sketch below) • The example has 6.0 million (6M0) mesh nodes • Compute intensive • Memory intensive Use case 1: Ansys Mechanical
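To make the recipe above concrete, here is a minimal, generic textbook illustration of the finite element method in one dimension (not Ansys, and not the beam dump model): solve -u'' = f on [0,1] with u(0) = u(1) = 0 using linear elements. NumPy is assumed only for the small linear solve.

```python
# Minimal 1D FEM sketch: split the domain into elements, assemble the
# element contributions, solve for the nodal values, and the piecewise
# linear shape functions then give the solution everywhere in between.
import numpy as np

n_el = 8                                  # number of elements (subdomains)
nodes = np.linspace(0.0, 1.0, n_el + 1)
h = nodes[1] - nodes[0]
f = lambda x: 1.0                         # constant source term, for simplicity

K = np.zeros((n_el + 1, n_el + 1))        # global stiffness matrix
b = np.zeros(n_el + 1)                    # global load vector
for e in range(n_el):                     # assemble element by element
    ke = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    fe = f(0.5 * (nodes[e] + nodes[e + 1])) * h / 2.0 * np.ones(2)
    idx = [e, e + 1]
    K[np.ix_(idx, idx)] += ke
    b[idx] += fe

# Apply the Dirichlet boundary conditions u(0) = u(1) = 0
interior = slice(1, n_el)
u = np.zeros(n_el + 1)
u[interior] = np.linalg.solve(K[interior, interior], b[interior])

print(u)   # nodal values; the exact solution here is u(x) = x(1-x)/2
```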
Simulation results • Measurement hardware configuration: • 2x HP 580 G7 servers (4x E7-8837, 512 GB RAM, 32 cores each), 10 Gb low-latency Ethernet link • Time to obtain a single-cycle 6M0 solution: • 8 cores -> 63 hours to finish the simulation, 60 GB RAM used during the simulation • 64 cores -> 17 hours to finish the simulation, 2x 200 GB RAM used during the simulation • The user is interested in 50 cycles: this would need 130 days on 8 cores, or 31 days on 64 cores • It is impossible to get simulation results for this case in a reasonable time on a standard user engineering workstation Use case 1: Ansys Mechanical
Challenges • Why do we care? • Every day we face users asking us how to speed up some engineering application • Challenges • Problem size and complexity are challenging user workstations in terms of computing power, memory size, and file I/O • This can be extrapolated to other engineering HPC applications • How to solve the problem? • Can we use the current infrastructure to provide a platform for these demanding applications? • … or do we need something completely new? • … and if something new, how could it fit into our IT infrastructure? • So, let's have a look at what is happening behind the scenes
Analysis tools • Standard Linux performance monitoring tools used: • Memory usage: sar • Memory bandwidth: Intel PCM (Performance Counter Monitor, open source) • CPU usage: iostat, dstat • Disk I/O: dstat • Network traffic monitoring: netstat • Monitoring scripts are started from the same node where the simulation job is started; collection of the measurement results is done automatically by our tools (a wrapper sketch follows below).
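A minimal sketch of the kind of wrapper script implied here, not our actual tooling: it starts the standard monitors on the node where the job is launched, runs the solver, then stops the monitors and keeps their logs. The solver command line is a placeholder; only standard sysstat/dstat options are used.

```python
# Start sar/iostat/dstat alongside a job, terminate them when the job ends.
import subprocess, shlex, sys

SAMPLE_SECONDS = 5
monitors = {
    "memory_sar":    f"sar -r {SAMPLE_SECONDS}",      # memory usage
    "cpu_io_iostat": f"iostat -x {SAMPLE_SECONDS}",   # CPU and disk utilisation
    "all_dstat":     f"dstat -tcmd {SAMPLE_SECONDS}", # combined time-stamped view
}

def run_job(job_cmd: str) -> int:
    procs, logs = [], []
    for name, cmd in monitors.items():
        log = open(name + ".log", "w")
        procs.append(subprocess.Popen(shlex.split(cmd), stdout=log))
        logs.append(log)
    try:
        return subprocess.call(shlex.split(job_cmd))  # blocking solver run
    finally:
        for p in procs:
            p.terminate()
        for log in logs:
            log.close()

if __name__ == "__main__":
    sys.exit(run_job(" ".join(sys.argv[1:]) or "sleep 10"))
```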
Multi-core scalability • Measurement info: • LINAC4 beam dump system, single-cycle simulation • 64c@1TB: 2 nodes (quad-socket Westmere E7-8837, 512 GB each), 10 Gb iWARP • Results: • The Ansys Mechanical simulation scales well beyond a single multi-core box • Greatly improved number of jobs/week, or simulation cycles/week (quantified in the sketch below) • Next steps: scale on more than two nodes and measure the impact of MPI • Conclusion • Multi-core platforms are needed to finish the simulation in a reasonable time Use case 1: Ansys Mechanical
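A quick back-of-the-envelope check of what "scales well" means here, using only the 63 h and 17 h per-cycle run times reported on the "Simulation results" slide:

```python
# Scaling check from the reported run times: 63 h on 8 cores, 17 h on 64 cores.
t8, t64 = 63.0, 17.0                       # hours per simulation cycle
cores8, cores64 = 8, 64

speedup = t8 / t64                         # ~3.7x shorter wall time
efficiency = speedup / (cores64 / cores8)  # ~46% parallel efficiency
cycles_per_week_8  = 7 * 24 / t8           # ~2.7 cycles/week
cycles_per_week_64 = 7 * 24 / t64          # ~9.9 cycles/week

print(f"speedup x{speedup:.1f}, efficiency {efficiency:.0%}")
print(f"cycles/week: {cycles_per_week_8:.1f} -> {cycles_per_week_64:.1f}")
```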
Memory requirements • In-core/out-of-core simulations (avoiding costly file I/O) • In-core = most of the temporary data is stored in RAM (the solver can still write to disk during the simulation) • Out-of-core = files on the file system are used to store temporary data • The preferable mode is in-core, to avoid costly disk I/O accesses, but this requires more RAM and more memory bandwidth • Ansys Mechanical (and some other engineering applications) has limited scalability • It depends heavily on the solver and the user problem • All commercial engineering applications use some licensing scheme, which can skew the choice of platform • Conclusion: • We are investigating whether we can spread the required memory over multiple dual-socket systems, or whether 4-socket systems are necessary for some HPC applications (a sizing sketch follows below) • Certain engineering simulations seem to be limited by memory bandwidth; this also has to be considered when choosing a platform Use case 1: Ansys Mechanical
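A hypothetical sizing sketch of the dual-socket versus 4-socket question, assuming the solver's working set distributes evenly across nodes (which is exactly what is being investigated): the 400 GB figure comes from the 2x 200 GB run reported earlier, and the node memory sizes are illustrative examples only.

```python
# How many nodes of a given RAM size are needed to keep a 400 GB run in-core?
import math

in_core_ram_gb = 400          # working set of the 64-core run (2 x 200 GB)

for node_ram_gb, label in [(128, "dual-socket, 128 GB"),
                           (256, "dual-socket, 256 GB"),
                           (512, "quad-socket, 512 GB")]:
    nodes = math.ceil(in_core_ram_gb / node_ram_gb)
    print(f"{label}: {nodes} node(s) to hold {in_core_ram_gb} GB in-core")
```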
Disk I/O impact • Ansys Mechanical • BE CLIC test system • Two Supermicro servers (dual E5-2650, 128 GB), 10 Gb iWARP back to back • Disk I/O impact on speedup: two configurations compared • Measured with sar and iostat • The application spends a lot of time in iowait • Using an SSD instead of an HDD increases jobs/week by almost 100% • Conclusion: • We need to investigate more cases to see if this is a marginal case or something more common (an iowait sampling sketch follows below) Use case 3: Ansys Mechanical
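For readers who want to reproduce the iowait observation: we used sar and iostat, but a self-contained sketch with the psutil library (an assumption, not part of our tooling) shows the same quantity directly.

```python
# Sample the CPU iowait share while a job is running (Linux only).
import psutil

def sample_iowait(duration_s: int = 60, interval_s: int = 5):
    samples = []
    for _ in range(duration_s // interval_s):
        cpu = psutil.cpu_times_percent(interval=interval_s)
        samples.append(getattr(cpu, "iowait", 0.0))  # iowait exists on Linux
    print(f"mean iowait over {duration_s}s: {sum(samples)/len(samples):.1f}%")

if __name__ == "__main__":
    sample_iowait()
```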
Agenda • Introduction – Where we are now • Applications used at CERN that require HPC infrastructure • Use cases (Engineering) • Ansys Mechanical • Ansys Fluent • Physics HPC applications • Next steps • Q&A
Use case 2: Fluent CFD • Computational Fluid Dynamics (CFD) application, Fluent (now provided by Ansys) • Beam dump system at the PS Booster • Heat is generated inside the dump and it needs to be cooled to prevent it from melting or breaking because of mechanical stresses • Extensively parallelized MPI-based software • Performance characteristics similar to other MPI-based software: • Importance of low latency for short messages • Importance of bandwidth for medium and large messages (both effects are illustrated by the ping-pong sketch below)
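To see why latency dominates for short messages and bandwidth for large ones, a generic MPI ping-pong microbenchmark is enough; this is a sketch using mpi4py and NumPy (both assumptions, not part of Fluent), not Fluent's actual communication pattern.

```python
# Run with two ranks, e.g.: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100

for size in (8, 1024, 1024 * 1024):          # 8 B, 1 KiB, 1 MiB messages
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    dt = (time.perf_counter() - t0) / (2 * REPS)   # one-way message time
    if rank == 0:
        # small messages: dt is dominated by latency
        # large messages: size/dt approaches the link bandwidth
        print(f"{size:>8} B: {dt*1e6:8.1f} us/msg, {size/dt/1e6:8.1f} MB/s")
```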
Interconnect network latency impact • Ansys Fluent • CFD "heavy" test case from the CFD group (EN-CV-PJ) • 64c@1TB: 2 nodes (quad-socket Westmere E7-8837, 512 GB each), 10 Gb iWARP • Speedup beyond a single node can be diminished by a high-latency interconnect • The graph shows good scalability beyond a single box with 10 Gb low-latency Ethernet, and dips in performance when node-to-node MPI is switched to 1 Gb • Next step: perform MPI statistical analysis (size and type of messages, computation vs. communication) Use case 2: Fluent
Memory bandwidth impact • Ansys Fluent: • Measured with Intel PCM • Supermicro Sandy Bridge server (dual E5-2650), 102.5 GB/s peak memory bandwidth • Observed peaks of a "few" seconds demanding 57 GB/s (5 s sampling period); this is very close to the numbers measured with the STREAM synthetic benchmark on this platform • Memory bandwidth measured with Intel PCM at the memory controller level (a STREAM-like sketch follows below) • Next step: check the impact of memory speed on solution time Use case 2: Fluent
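For context, this is the kind of sustained-bandwidth figure STREAM produces. The real STREAM benchmark is a multi-threaded C/OpenMP code; the rough NumPy sketch below (an illustration, not what was used) runs its "copy" kernel single-threaded, so it reports a per-thread rather than per-node number.

```python
# STREAM-copy-style measurement: time a = b over large arrays.
import numpy as np
import time

N = 50_000_000                 # ~400 MB per double-precision array
a = np.zeros(N)
b = np.random.rand(N)

reps = 5
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(a, b)            # STREAM "copy" kernel: a[i] = b[i]
dt = (time.perf_counter() - t0) / reps

bytes_moved = 2 * N * 8        # read b and write a, 8 bytes per element
print(f"copy bandwidth: {bytes_moved / dt / 1e9:.1f} GB/s (single thread)")
```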
Analysis done so far • We have invested our time in building a first generation of tools to monitor different system parameters • Multi-core scalability (Ansys Mechanical) • Memory size requirements (Ansys Mechanical) • Memory bandwidth requirements (Fluent) • Interconnect network (Fluent) • File I/O (Ansys Mechanical) • Redo some parts • Westmere 4 sockets -> Sandy Bridge 4 sockets • Next steps: • Start detailed interconnect monitoring using MPI tracing tools (Intel Trace Analyzer and Collector)
Agenda • Introduction – Where we are now • Applications used at CERN that require HPC infrastructure • Use cases (Engineering) • Ansys Mechanical • Ansys Fluent • Physics HPC applications • Next steps • Q&A
Physics HPC applications • PH-TH: • Lattice QCD simulations • BE LINAC4 plasma simulations: • plasma formation in the Linac4 ion source • BE CLIC simulations: • preservation of the luminosity over time, under the effects of dynamic imperfections such as vibrations, ground motion, and failures of accelerator components
Lattice QCD • MPI-based application with inline assembly in the most time-critical parts of the program • Main objectives are to investigate: • The impact of memory bandwidth on performance • The impact of the interconnection network on performance (comparison of 10 Gb iWARP and InfiniBand QDR)
BE LINAC4 plasma studies • MPI-based application • Users are requesting a system with 250 GB of RAM for 48 cores • Main objective is to investigate: • Scalability of the application beyond 48 cores, in order to spread the memory requirement over more cores than 48 (a quick sizing estimate follows below)
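A quick, hypothetical sizing estimate of what "spreading the memory requirement" means, assuming the application scales and its working set distributes evenly; the 64 GB / 16-core node is an example, not an actual procurement option.

```python
# 250 GB / 48 cores request spread over example dual-socket nodes.
import math

total_ram_gb, requested_cores = 250, 48
per_core_gb = total_ram_gb / requested_cores
print(f"per-core footprint: {per_core_gb:.1f} GB/core")    # ~5.2 GB/core

node_ram_gb, node_cores = 64, 16
nodes_for_ram = math.ceil(total_ram_gb / node_ram_gb)       # 4 nodes for the RAM
print(f"{nodes_for_ram} x {node_ram_gb} GB nodes -> "
      f"{nodes_for_ram * node_cores} cores, "
      f"{nodes_for_ram * node_ram_gb} GB total")             # 64 cores, 256 GB
```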
Clusters • To better understand the requirements of CERN physics HPC applications, two clusters have been prepared • Investigate scalability • Investigate the importance of interconnect, memory bandwidth and file I/O • Test configuration: • 20x Sandy Bridge dual-socket nodes with a 10 Gb iWARP low-latency link • 16x Sandy Bridge dual-socket nodes with Quad Data Rate (40 Gb/s) InfiniBand
Agenda • Introduction – Where we are now • Applications used at CERN that require HPC infrastructure • Use cases (Engineering) • Ansys Mechanical • Ansys Fluent • Physics HPC applications • Next steps • Q&A
Next steps • An activity has started to better understand the requirements of CERN HPC applications • The standard Linux performance monitoring tools give us a very detailed overview of system behavior for different applications • Next steps are to: • Refine our approach and our scripts to work at a higher scale (first target: 20 nodes) • Gain more knowledge about the impact of the interconnection network on MPI jobs
Thank you Q&A