Trends and Perspectives for HPC infrastructures. Carlo Cavazzoni, CINECA. Outline: HPC resources in Europe (PRACE); today's HPC architectures; technology trends; CINECA roadmaps (toward 50 PFlops); EuroExa project.
Trends and Perspectives for HPC infrastructures
Carlo Cavazzoni, CINECA
Outline
• HPC resources in Europe (PRACE)
• Today's HPC architectures
• Technology trends
• CINECA roadmaps (toward 50 PFlops)
• EuroExa project
The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in the efficient use of the resources is available through participating centers throughout Europe. Available resources are announced for each Call for Proposals.
Peer-reviewed open access:
• PRACE Projects (Tier-0)
• PRACE Preparatory (Tier-0)
• DECI Projects (Tier-1)
HPC pyramid:
• Tier 0: European
• Tier 1: National
• Tier 2: Local
TIER-0 systems, PRACE regular calls
• CURIE (GENCI, France): Bull cluster, Intel Xeon, Nvidia cards, InfiniBand network
• FERMI (CINECA, Italy) & JUQUEEN (Juelich, Germany): IBM BGQ, Power processors, custom 5D torus network
• MARENOSTRUM (BSC, Spain): IBM iDataPlex, Intel Xeon nodes, InfiniBand network
• HERMIT (HLRS, Germany): Cray XE6, AMD processors, custom 3D torus network, 1 PFlops
• SuperMUC (LRZ, Germany): IBM iDataPlex, Intel Xeon nodes, InfiniBand network
HPC Architectures: two models
Hybrid:
• Server class processors:
  • Server class nodes
  • Special purpose nodes
• Accelerator devices:
  • Nvidia
  • Intel
  • AMD
  • FPGA
Homogeneous:
• Server class nodes:
  • Standard processors
• Special purpose nodes:
  • Special purpose processors
Networks
Standard/switched:
• InfiniBand
Special purpose/topology:
• BGQ
• CRAY
• TOFU (Fujitsu)
• TH Express-2 (Tianhe-2)
Programming Models
Fundamental paradigms:
• Message passing
• Multi-threading
Consolidated standards: MPI & OpenMP
New task-based programming models
Special purpose for accelerators:
• CUDA
• Intel offload directives
• OpenACC, OpenCL, etc.
NO consolidated standard
Scripting: Python
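Dennard scaling law (downscaling)
Unprimed quantities refer to the old VLSI generation, primed quantities to the new VLSI generation (L = feature size, V = voltage, F = frequency, D = transistor density, P = power).
Classical scaling (does not hold anymore!):
• L' = L / 2
• V' = V / 2
• F' = F * 2
• D' = 1 / L'^2 = 4 * D
• P' = P
Scaling today:
• L' = L / 2
• V' = ~V
• F' = ~F * 2
• D' = 1 / L'^2 = 4 * D
• P' = 4 * P
The power crisis! Core frequency and performance no longer grow following Moore's law.
The programming crisis! The number of cores is increased to keep the evolution of architectures on Moore's law.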
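The two sets of scaling relations above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the usual dynamic-power model (power per transistor proportional to C * V^2 * F, with capacitance C proportional to the feature size L), which is not stated on the slide, and the function name is hypothetical.

```python
# Relative change in power per unit area when shrinking one VLSI generation.
# Assumed model (not on the slide): dynamic power per transistor ~ C * V^2 * F,
# with the capacitance C proportional to the feature size L.

def power_density_factor(l_factor, v_factor, f_factor):
    """Return P'/P per unit area for scaling factors L'/L, V'/V and F'/F."""
    c_factor = l_factor                          # C scales with the feature size
    per_transistor = c_factor * v_factor ** 2 * f_factor
    density_factor = 1.0 / l_factor ** 2         # D' = 1 / L'^2
    return per_transistor * density_factor

# Classical Dennard scaling: the voltage halves together with the feature size.
classical = power_density_factor(l_factor=0.5, v_factor=0.5, f_factor=2.0)

# Scaling today: the voltage stays roughly constant.
today = power_density_factor(l_factor=0.5, v_factor=1.0, f_factor=2.0)

print(f"classical scaling: P'/P = {classical}")
print(f"scaling today:     P'/P = {today}")
```

Under this assumed model the classical case gives P' = P and the constant-voltage case gives P' = 4 * P, in line with the values on the slide.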
Moore’s Law
Economic and market law
Stacy Smith, Intel’s chief financial officer, later gave some more detail on the economic benefits of staying on the Moore’s Law race. The cost per chip “is going down more than the capital intensity is going up,” Smith said, suggesting Intel’s profit margins should not suffer because of heavy capital spending. “This is the economic beauty of Moore’s Law.” And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt said the company has test chips running on that technology. “We are projecting similar kinds of improvements in cost out to 10 nanometers,” he said. So, despite the challenges, Holt could not be induced to say there’s any looming end to Moore’s Law, the invention race that has been a key driver of electronics innovation since first defined by Intel’s co-founder in the mid-1960s. (From WSJ)
It is all about the number of chips per Si wafer!
But!
14 nm VLSI, 0.54 nm Si lattice: 300 atoms!
"There will be still 4~6 cycles (or technology generations) left until we reach 11 ~ 5.5 nm technologies, at which we will reach downscaling limit, in some year between 2020-30" (H. Iwai, IWJT2008).
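The "4~6 generations until 11 ~ 5.5 nm" estimate can be roughly reproduced with a short calculation. This is a sketch under two assumptions that are not on the slide: a starting node of 45 nm (typical of production around 2008, when the quote was made) and a linear shrink of 1/sqrt(2) per generation, so that transistor density doubles each time. The function name is hypothetical.

```python
import math

def feature_size_after(start_nm, generations, shrink=1 / math.sqrt(2)):
    """Feature size after a number of generations, each scaling L by `shrink`."""
    return start_nm * shrink ** generations

START_NM = 45.0  # assumption: node in production around 2008, not from the slide

for generations in (4, 5, 6):
    size = feature_size_after(START_NM, generations)
    print(f"{generations} generations from {START_NM:.0f} nm: {size:.2f} nm")
```

With these assumptions, four generations lead to about 11 nm and six generations to about 5.6 nm, close to the range quoted above.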
What about Applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to 1 / (1 − P), where P = parallel fraction.
1,000,000 cores: P = 0.999999, serial fraction = 0.000001
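To make the numbers on this slide concrete, here is a minimal sketch of Amdahl's law in its standard finite-core form, speedup = 1 / ((1 − P) + P / N). The function names are hypothetical; only P and the core count come from the slide.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup on `cores` cores for a code with the given parallel fraction P."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

def amdahl_limit(parallel_fraction):
    """Upper bound on the speedup for an unlimited number of cores: 1 / (1 - P)."""
    return 1.0 / (1.0 - parallel_fraction)

P = 0.999999       # parallel fraction from the slide
CORES = 1_000_000  # core count from the slide

print(f"limit for infinite cores: {amdahl_limit(P):,.0f}")
print(f"speedup on {CORES:,} cores: {amdahl_speedup(P, CORES):,.0f}")
```

With P = 0.999999 the asymptotic limit is about one million, but on exactly one million cores the speedup is only about half of that, because the serial part and the parallel part then take roughly the same time.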
Architectural trends
• Peak performance: Moore's law
• FPU performance: Dennard's law
• Number of FPUs: Moore + Dennard
• Application parallelism: Amdahl's law
HPC Architectures: two models
Hybrid, but…
Homogeneous, but…
Which 100 PFlops systems will we see? My guess:
• IBM (hybrid): Power8 + Nvidia GPU
• Cray (homo/hybrid): with Intel only!
• Intel (hybrid): Xeon + MIC
• Arm (homo): only Arm chips, but…
• Nvidia/Arm (hybrid): Arm + Nvidia
• Fujitsu (homo): SPARC, high density, low power
• China (homo/hybrid): with Intel only
• Room for AMD console chips
Chip Architecture
Strongly market driven: mobile, TV sets, screens, video/image processing
• Intel: new architecture to compete with ARM; less Xeon, but PHI
• ARM: main focus on low-power mobile chips (Qualcomm, Texas Instruments, Nvidia, ST, etc.); new HPC market, server market
• NVIDIA: GPU alone will not last long; ARM+GPU, Power+GPU
• Power: embedded market; Power+GPU, only chance for HPC
• AMD: console market; still some chance for HPC
Tier 1 CINECA
Procurement Q2014
High-level system requirements
• Electrical power consumption: 400 kW
• Physical size of the system: 5 racks
• Peak system performance (CPU+GPU): on the order of 1 PFlops
• Peak system performance (CPU only): on the order of 300 TFlops
Tier 1 CINECA
High-level system requirements
• CPU architecture: Intel Xeon Ivy Bridge
• Number of cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz
  The choice of frequency and number of cores depends on the socket TDP, the system density and the cooling capacity.
• Number of servers: 500 - 600 (Peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clk = 345 TFlops)
  The number of servers in the system may depend on the cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and CPU+GPU nodes.
• GPU architecture: Nvidia K40
• Number of GPUs: >500 (Peak perf = 700 * 1.43 TFlops = 1 PFlops)
  The number of GPU cards in the system may depend on the cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and CPU+GPU nodes.
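The peak-performance figures on this slide are simple products, and a short sketch makes it easy to try other configurations. The function names are hypothetical; the inputs are the values quoted on the slide (8 Flop/clk per core, 1.43 TFlops per K40 card), and the 600 servers and 700 GPUs are the counts used in the slide's own formulas.

```python
def cpu_peak_tflops(servers, sockets, cores, ghz, flop_per_clk=8):
    """Peak CPU performance in TFlops: servers * sockets * cores * GHz * Flop/clk."""
    gflops = servers * sockets * cores * ghz * flop_per_clk
    return gflops / 1000.0

def gpu_peak_tflops(gpus, tflops_per_gpu=1.43):
    """Peak GPU performance in TFlops for a number of identical cards."""
    return gpus * tflops_per_gpu

# Configuration used in the slide's formula: 12 cores at 3 GHz.
slide_cpu = cpu_peak_tflops(servers=600, sockets=2, cores=12, ghz=3.0)

# The two core/frequency options listed in the requirements.
option_12_cores = cpu_peak_tflops(servers=600, sockets=2, cores=12, ghz=2.4)
option_8_cores = cpu_peak_tflops(servers=600, sockets=2, cores=8, ghz=3.0)

slide_gpu = gpu_peak_tflops(gpus=700)

print(f"CPU, 12 cores @ 3.0 GHz: {slide_cpu:.1f} TFlops")
print(f"CPU, 12 cores @ 2.4 GHz: {option_12_cores:.1f} TFlops")
print(f"CPU,  8 cores @ 3.0 GHz: {option_8_cores:.1f} TFlops")
print(f"GPU, 700 x K40:          {slide_gpu:.0f} TFlops")
```

Note that the slide's 345 TFlops combines 12 cores with 3 GHz. The two options actually listed (12 cores @ 2.4 GHz, or 8 cores @ 3 GHz as a lower bound for ">3 GHz") give roughly 276 and 230 TFlops with 600 servers.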
Tier 1 CINECA
High-level system requirements
• Identified vendors: IBM, Eurotech
• DRAM memory: 1 GByte/core
  The option of having a subset of nodes with a larger amount of memory will be requested.
• Local non-volatile memory: >500 GByte SSD/HD, depending on the cost and the configuration of the system
• Cooling: liquid cooling system with free-cooling option
• Scratch disk space: >300 TByte (provided by CINECA)