This presentation discusses the impact of new HPC architectures on code development for material science simulations, with a focus on preparing codes for future exascale systems. It presents a pragmatic approach to the challenges of exascale computing and highlights the expected outcomes and impact of these efforts.
New HPC architectures landscape and impact on code developments. Carlo Cavazzoni, Cineca & MaX
Enabling Exascale Transition • GOAL: "modernize" community codes and make them ready to best exploit future exascale systems for material science simulations (MaX Obj. 1.4) • CHALLENGE: there is not yet a solution that fits all needs, and this is common to all computational science domains • STRATEGY: a pragmatic approach based on building knowledge about exascale-related problems, running proofs of concept to field-test solutions, and finally deriving best practices that can consolidate into real solutions for the full applications and make their way into the public code releases • OUTCOME: new, validated code versions, with libraries and modules publicly available beyond MaX, and extensive dissemination activities • IMPACT: modern codes, exploitation of today's HPC systems, benefits for other application fields as well as for technology providers.
Changes in the road-map to Exascale. While describing the company's exascale strategy and other topics Intel is presenting at the SC17 conference, Intel Data Center Group GM Trish Damkroger offhandedly mentioned that the Knights Hill product is dead. More specifically, she said the chip will be replaced in favor of "a new platform and new microarchitecture specifically designed for exascale."
Exascale: how serious is the situation? Peak performance of 10^18 Flops = number of FPUs (Moore's law) x FPU performance (Dennard's law). Working hypothesis: 10^9 FPUs delivering 10^9 Flops each, arranged either as 10^5 FPUs in 10^4 servers or as 10^4 FPUs in 10^5 servers (written out below). Exascale architectures: heterogeneous.
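Written out, the working hypothesis is just the product of the two factors above, and both server/FPU splits give the same totals:

$$
10^{18}\ \mathrm{Flops} \;=\; \underbrace{10^{9}\ \mathrm{FPUs}}_{\text{number of FPUs}} \times \underbrace{10^{9}\ \tfrac{\mathrm{Flops}}{\mathrm{FPU}}}_{\text{FPU performance}},
\qquad
10^{9}\ \mathrm{FPUs} \;=\; 10^{4}\ \text{servers}\times 10^{5}\ \tfrac{\text{FPUs}}{\text{server}} \;=\; 10^{5}\ \text{servers}\times 10^{4}\ \tfrac{\text{FPUs}}{\text{server}} .
$$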
General Considerations • Exascale is not (only) about scalability and Flops performance! • In an exascale machine there will be ~10^9 FPUs; bringing data in and out will be the main challenge. • 10^4 nodes, but 10^5 FPUs inside each node! • There is no silver bullet (so far) • heterogeneity is here to stay • deeper memory hierarchies
Exascale… some guesses • From GPUs to specialized cores (tensor cores) • Specialized memory modules (HBM) • Specialized non-volatile memory (NVRAM) • Performance modelling (a minimal example follows) • Refactor codes to better fit architectures with specialized HW • Avoid wrong turns!
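To make the performance-modelling bullet concrete, here is a minimal roofline-style sketch in C. The peak-Flops, bandwidth, and kernel numbers are placeholders, not measurements from any MaX code.

```c
#include <stdio.h>

/* Minimal roofline-style estimate: a kernel is either compute-bound or
 * memory-bound, so its runtime is bounded below by the larger of the two. */
static double roofline_time(double flops, double bytes,
                            double peak_flops, double peak_bw)
{
    double t_compute = flops / peak_flops;   /* seconds if compute-bound   */
    double t_memory  = bytes / peak_bw;      /* seconds if bandwidth-bound */
    return t_compute > t_memory ? t_compute : t_memory;
}

int main(void)
{
    /* Placeholder hardware numbers: ~3 TFlop/s per socket, ~400 GB/s of HBM. */
    double peak_flops = 3.0e12, peak_bw = 4.0e11;

    /* Hypothetical kernel: 1e12 flops touching 5e11 bytes. */
    double t = roofline_time(1.0e12, 5.0e11, peak_flops, peak_bw);
    printf("estimated lower bound: %.3f s\n", t);
    return 0;
}
```

Comparing the two bounds also tells you whether refactoring should target data movement or arithmetic, which is the question the specialized-HW bullets above are really about.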
Paradigm and co-design • Identify the latency and throughput subroutines/modules/classes in the application workflow • Re-factor and map them to the HW: latency code on the host, throughput code on the KNL accelerator • Map and overlap communication with the latency and throughput work (see the sketch below) • Heterogeneous
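A generic sketch of the "map and overlap" step with non-blocking MPI: the halo exchange (latency part) is started, the bulk of the computation (throughput part) proceeds while it is in flight, and only the boundary update waits. The kernels and sizes are hypothetical, not taken from any MaX application.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N   1000000   /* interior points per rank (throughput part)   */
#define NB  1024      /* boundary/halo points per rank (latency part) */

/* Toy kernels standing in for the real throughput and latency code. */
static void compute_interior(double *u, int n) {
    for (int i = 0; i < n; ++i) u[i] = u[i] * 0.5 + 1.0;
}
static void compute_boundary(double *u, const double *halo, int nb) {
    for (int i = 0; i < nb; ++i) u[i] += halo[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double *u         = calloc(N,  sizeof *u);
    double *halo_send = calloc(NB, sizeof *halo_send);
    double *halo_recv = calloc(NB, sizeof *halo_recv);

    MPI_Request req[2];
    /* Start the halo exchange (latency) ... */
    MPI_Irecv(halo_recv, NB, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(halo_send, NB, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
    /* ... and overlap it with the bulk (throughput) computation. */
    compute_interior(u, N);
    /* Only the boundary update has to wait for the communication. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo_recv, NB);

    if (rank == 0) printf("step done on %d ranks\n", size);
    free(u); free(halo_send); free(halo_recv);
    MPI_Finalize();
    return 0;
}
```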
MaX Activities • Programming paradigms, libraries, co-design • Profiling: hot spots, performance issues, bottlenecks, Flops and Watts efficiency • DSLs, kernel libraries, modules, performance models • New architectures (vendors), new MPI and OpenMP standards, new paradigms (OmpSs, CUDA) • New, more efficient code versions; libraries/modules shared across codes, communities and vendors • Feedback to scientists/developers (WP1) • Dissemination of best practices: schools and workshops, collaborations, e.g. CoE/FET/PRACE
Performance modelling, results: absolute time estimates for MnSi (bulk, 64 atoms, 14 k-points).
CoE questions: what is my target architecture? How can I cope with GPUs, many-cores, FPGAs? I like homogeneous architectures, why should I care about heterogeneous ones? Heterogeneity is here to stay! The answer: DSLs and kernel libraries, modularization, APIs, encapsulation, separation of concerns (sketched below). DSL: Sirius, CheSS (SIESTA), SDDK, FFTXlib (QE, YAMBO), LAXlib (QE, ~YAMBO), FLEUR-LA (FLEUR). Kernel lib: ELPA (QE, YAMBO, FLEUR).
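One minimal way to picture encapsulation and separation of concerns in C: the physics code calls a narrow kernel interface, and the backend (CPU, GPU, FPGA) is selected once behind it. The interface and backend names below are hypothetical, not the actual APIs of the libraries listed above.

```c
#include <stdio.h>

/* Narrow kernel interface: the caller never sees which backend runs it. */
typedef struct {
    const char *name;
    void (*fft_forward)(double *data, int n);   /* hypothetical kernel entry */
} fft_backend;

/* Reference CPU backend (placeholder body standing in for real work). */
static void fft_forward_cpu(double *data, int n) {
    for (int i = 0; i < n; ++i) data[i] *= 1.0;
}

/* A GPU or FPGA backend would provide the same entry point; here it simply
 * falls back to the CPU version so the sketch stays self-contained. */
static void fft_forward_accel(double *data, int n) { fft_forward_cpu(data, n); }

static const fft_backend backends[] = {
    { "cpu",   fft_forward_cpu   },
    { "accel", fft_forward_accel },
};

int main(void) {
    double data[8] = {0};
    /* Backend selection is a single, encapsulated decision point. */
    const fft_backend *b = &backends[0];
    b->fft_forward(data, 8);
    printf("ran kernel on backend: %s\n", b->name);
    return 0;
}
```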
One size does not fit all • 10^9 FPUs to leverage • The best algorithm for 1 FPU is not the best algorithm for 10^9 FPUs • Implement the best algorithm for each scale, e.g. the two FFT and data-distribution schemes in QE 6.2 • Autotuning: choose the best at runtime (a minimal sketch follows)
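A minimal sketch of runtime autotuning, assuming two interchangeable variants of the same kernel: benchmark each once on a representative problem, then use the winner for the rest of the run. This is generic C, not the actual QE 6.2 selection logic.

```c
#include <stdio.h>
#include <time.h>

/* Two placeholder implementations of the same operation; which one wins
 * depends on the machine, the problem size and the data distribution. */
static void algo_a(double *x, int n) { for (int i = 0; i < n; ++i) x[i] += 1.0; }
static void algo_b(double *x, int n) { for (int i = n - 1; i >= 0; --i) x[i] += 1.0; }

typedef void (*algo_fn)(double *, int);

/* Time one candidate on a small representative problem. */
static double bench(algo_fn f, double *x, int n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    f(x, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    static double x[1 << 20];
    algo_fn candidates[] = { algo_a, algo_b };

    /* Autotune once, then reuse the chosen variant for the production run. */
    algo_fn best = candidates[0];
    double best_t = bench(candidates[0], x, 1 << 20);
    double t = bench(candidates[1], x, 1 << 20);
    if (t < best_t) { best = candidates[1]; best_t = t; }

    printf("selected variant %s (%.3e s)\n", best == algo_a ? "A" : "B", best_t);
    best(x, 1 << 20);   /* production call uses the tuned choice */
    return 0;
}
```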
Beyond Modularization • QE libraries: FFTXlib, LAXlib • reused by other codes • distilled into mini-apps
Scaling-out YAMBO. A single GW calculation has run on 1000 Intel Knights Landing (KNL) nodes of the new Tier-0 MARCONI KNL partition, corresponding to 68,000 cores and ~3 PFlops. The simulation, related to the growth of complex graphene nanoribbons on a metal surface, is part of an active research project combining computational spectroscopy with cutting-edge experimental data from teams in Austria, Italy, and Switzerland. Simulations were performed exploiting computational resources granted by PRACE (via call 14). http://www.max-centre.eu/2017/04/19/a-new-scalability-record-in-a-materials-science-application/
Planning for Exascale: performance models, co-design, POCs, code re-factoring • Today: Yambo @ 3 PFlops, socket perf 3-5 TFlops • Pre-exascale: Yambo @ 10-15 PFlops, socket perf 10-15 TFlops • Exascale: Yambo @ 50 PFlops, socket perf 20-40 TFlops (a rough consistency check follows)
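As a rough consistency check (plain arithmetic, assuming the application occupies the whole partition), the socket counts implied by these targets are:

$$
N_{\text{sockets}} \;\approx\; \frac{\text{target performance}}{\text{socket performance}}:\qquad
\frac{3\ \mathrm{PFlops}}{3\ \text{to}\ 5\ \mathrm{TFlops}} \approx 600\ \text{to}\ 1000,
\qquad
\frac{50\ \mathrm{PFlops}}{20\ \text{to}\ 40\ \mathrm{TFlops}} \approx 1250\ \text{to}\ 2500 .
$$

The first figure is consistent with the ~1000-node KNL run on Marconi reported above, so the exascale target asks for roughly the same node count at 10x the per-socket performance.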
Marconi, a convergent HPC solution • Cloud/Data Proc. Scale Out: 792 Lenovo NeXtScale servers, Intel E5-2697 v4 Broadwell (216 Ethernet nodes for cloud HT INFN, 216 Ethernet nodes for cloud HPC/DP, 360 QDR nodes for Tier-1) • HPC: 100 Lenovo NeXtScale servers, Intel E5-2630 v3 Haswell, QDR + Nvidia K80 • 2300 Lenovo Stark servers, >7 PFlops, Intel Skylake, 24 cores @ 2.1 GHz, 196 GByte per node • 3600 Intel/Lenovo servers, >11 PFlops, Intel Phi code name Knights Landing, 68 cores @ 1.4 GHz, single-socket node: 96 GByte DDR4 + 16 GByte MCDRAM • 720 Lenovo NeXtScale servers, Intel E5-2697 v4 Broadwell, 18 cores @ 2.3 GHz, 128 GByte per node • Storage: Lenovo GSS + SFA12K + IBM Flash, >30 PByte
Cineca "sustainable" roadmap toward exascale • 5x: >250 PF + >20 PF, ~8 MW • 5x: 50 PF + 10 PF, ~4 MW • 2x (latency cores): 11 PF + 9 PF, 3.5 MW • 5x, 1x (latency cores): 2 PF, 1 MW • 20x, 10x (in total): 100 TF, 1 MW • Paradigm change • Pre-exascale
What does 5x really mean? Peak performance? Linpack? HPCG? Time to solution? Energy to solution? Time to science? A combination of all of the above? We need to define the right metric!
The data centers at the Science Park. ECMWF DC main characteristics: • 2 power lines up to 10 MW (one a backup of the other) • Expansion to 20 MW • Photovoltaic cells on the roofs (500 MWh/year) • Redundancy N+1 (mechanical and electrical) • 5 x 2 MW DRUPS • Cooling: 4 dry coolers (1850 kW each), 4 groundwater wells, 5 refrigerator units (1400 kW each) • Peak PUE 1.35 / maximum annualized PUE 1.18. Site layout: electrical substation (HV/MV), outdoor chillers + mechanics, diesel generators, ECMWF plants, ECMWF DC 1, ECMWF DC 2, INFN DC, CINECA DC, ECMWF expansion. INFN – CINECA DC main characteristics: • up to 20 MW (one power line a backup of the other) • Possible use of Combined Heat and Power Fuel Cell technology • Redundancy strategy under study • Cooling, still under study: dry coolers, groundwater wells, refrigerator units • PUE < 1.2 – 1.3. Plant rooms: outdoor chillers, electrical plant rooms, DRUPS rooms, mechanical plant rooms, POP 1 + POP 2 switch rooms, gas storage rooms, general utilities.
HPC and Verticals: value delivered to users • Applications integration (Meteo, Astro, Materials, Visit, Repo, Eng., Analytics, etc.) • Big Data • Accelerated computing • Co-design • 3D Viz • Cloud services • AI • HW infrastructure (clusters, storage, network, devices) • Toward an end-to-end optimized infrastructure
Exascale “node”, according to Intel https://www.hpcwire.com/2018/01/25/hpc-ai-two-communities-future/
Memory! Analysis of memory allocation during an SCF cycle, and of memory bandwidth usage on the different types of memory. Critical behaviour: the code is slowed down; better memory access patterns and communications are needed (a generic illustration follows).
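As a generic illustration of what a better memory access pattern means, the textbook fix is to block (tile) loops so data is reused while it still sits in fast memory instead of being streamed repeatedly. The sketch below is not taken from the profiled code.

```c
#include <stdio.h>

#define N    2048
#define TILE 64   /* chosen so a TILE x TILE block fits in fast memory */

/* Naive transpose-accumulate: the column-wise access of 'in' walks memory
 * with stride N and wastes bandwidth on every cache line it touches. */
static void naive(double *out, const double *in) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i * N + j] += in[j * N + i];
}

/* Blocked version: the same arithmetic, but each TILE x TILE block of 'in'
 * is reused while it is still resident in cache/HBM. */
static void blocked(double *out, const double *in) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; ++i)
                for (int j = jj; j < jj + TILE; ++j)
                    out[i * N + j] += in[j * N + i];
}

int main(void) {
    static double a[N * N], b[N * N];
    naive(b, a);
    blocked(b, a);
    printf("done (b[0] = %f)\n", b[0]);
    return 0;
}
```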
Exascale system. Al Gara's vision for the unification of the "3 pillars" of HPC is currently underway: "The convergence of AI, data analytics and traditional simulation will result in systems with broader capabilities and configurability as well as cross pollination."
RMA with Intel MPI 2017 • RMA as a substitute for Alltoall (sketched below) • Source code and data shared with Intel • BDW, 36 MPI
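A minimal sketch of the RMA idea with MPI-3 one-sided operations: each rank exposes its receive buffer through a window and puts its contribution directly into the matching slot on every peer, replacing the MPI_Alltoall call. This is generic code, not the source actually shared with Intel.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One element per destination rank, as in MPI_Alltoall with count = 1. */
    double *sendbuf = malloc(size * sizeof *sendbuf);
    double *recvbuf;   /* exposed to remote puts through the window */
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)(size * sizeof *recvbuf), sizeof *recvbuf,
                     MPI_INFO_NULL, MPI_COMM_WORLD, &recvbuf, &win);

    for (int p = 0; p < size; ++p) sendbuf[p] = rank * 100.0 + p;

    /* Each rank writes its contribution directly into slot 'rank' of every
     * peer's window instead of going through a collective Alltoall. */
    MPI_Win_fence(0, win);
    for (int p = 0; p < size; ++p)
        MPI_Put(&sendbuf[p], 1, MPI_DOUBLE, p, rank, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    printf("rank %d received first element %.1f\n", rank, recvbuf[0]);

    MPI_Win_free(&win);
    free(sendbuf);
    MPI_Finalize();
    return 0;
}
```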
Bologna Big Data Science Park • Protezione Civile and the regional agency for development and innovation • CINECA & INFN exascale supercomputing center • ECMWF Data Center • Conference and education center • Big Data Foundation • Agenzia Nazionale Meteo • University centers • «Ballette innovation and creativity center» • IOR biobank • ENEA center • Competence center Industry 4.0
New Cineca HPC infrastructure design point • D.A.V.I.D.E. (prototype) • Marconi A4, A3, A2, A1 (OPA) • GSS on OPA, GSS • Login and gateway nodes • PPI4HPC & EuroHPC • External connectivity: Internet, HBP, Eurofusion, PRACE/EUDAT, CNAF • ViZ • ETH core + Mellanox gateway IB + ETH 25/100 Gbit • Tape, fibre switch • Ex-PICO 5100, NFS servers • Cloud, Big Data, AI, interactive and data-processing cluster (Mellanox FDR servers) • FEC servers, TMS
Al Gara (Intel): the same architecture will cover HPC, AI, and data analytics through configuration, which means there needs to be a consistent software story across these different hardware backends to address HPC plus AI workloads.
D.A.V.I.D.E. Intelligenza Artificiale: Dall'Università alle Aziende (Artificial Intelligence: from University to Companies), Bologna. http://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-datasheet.pdf
HPC and Verticals: value delivered to users • Applications integration (Meteo, Astro, Materials, Visit, Repo, Eng., Analytics, etc.) • Big Data • Accelerated computing • AI • Co-design • 3D Viz • High-throughput connectors to other infrastructures • Procurement • HW infrastructure (clusters, storage, network, devices) • Toward an end-to-end optimized infrastructure
Power projection: peak performance (DP) @ 10 MW
Technical Project: goal of the procurement • New PRACE Tier-0 system • Target: 5x increase of system capability • Maximize efficiency (capability/W) • Sustain production for at least 3 years • Integrated in the current infrastructure • Possibly hosted in the same data center as ECMWF