
State-of-the-Art and Road Map of CINECA HPC Infrastructure

Explore the cutting-edge HPC infrastructure of CINECA, including the Eurora and Fermi clusters, and discover the future roadmap for data-centric HPC services.



Presentation Transcript


  1. CINECA HPC Infrastructure: state of the art and road map Carlo Cavazzoni, HPC department, CINECA

  2. Installed HPC Engines
     • Eurora (Eurotech): hybrid cluster, 64 nodes, 1024 SandyBridge cores, 64 K20 GPUs, 64 Xeon Phi coprocessors, 150 TFlops peak
     • FERMI (IBM BG/Q): 10240 nodes, 163840 PowerA2 cores, 2 PFlops peak
     • PLX (IBM iDataPlex): hybrid cluster, 274 nodes, 3288 Westmere cores, 548 NVIDIA M2070 (Fermi) GPUs, 300 TFlops peak

  3. FERMI @ CINECA, PRACE Tier-0 System
     • Architecture: 10 BG/Q frames
     • Model: IBM BG/Q
     • Processor type: IBM PowerA2, 1.6 GHz
     • Computing cores: 163840
     • Computing nodes: 10240
     • RAM: 1 GByte/core
     • Internal network: 5D Torus
     • Disk space: 2 PByte of scratch space
     • Peak performance: 2 PFlop/s
     Available for ISCRA & PRACE calls for projects.

  4. The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in efficient use of the resources is available through participating centres throughout Europe. Available resources are announced for each Call for Proposals.
     Peer-reviewed open access:
     • Tier-0 (European): PRACE Projects, PRACE Preparatory Access
     • Tier-1 (National): DECI Projects
     • Tier-2 (Local)

  5. Blue Gene/Q packaging hierarchy, from chip to system:
     1. Chip: 16 P cores
     2. Single Chip Module
     3. Compute card: one chip module, 16 GB DDR3 memory
     4. Node card: 32 compute cards, optical modules, link chips, torus
     5a. Midplane: 16 node cards
     5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots
     6. Rack: 2 midplanes
     7. System: 20 PF/s
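
     As a rough cross-check against the FERMI numbers on slide 3 (assuming one PowerA2 chip per compute card, at the 204.8 GFlops peak quoted on slide 8):

     \[ 10\ \text{racks} \times 2\ \text{midplanes} \times 16\ \text{node cards} \times 32\ \text{compute cards} = 10240\ \text{nodes} \]
     \[ 10240\ \text{nodes} \times 204.8\ \text{GFlops} \approx 2.1\ \text{PFlops peak} \]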

  6. BG/Q I/O architecture (diagram): BG/Q compute racks → BG/Q I/O nodes (PCIe) → InfiniBand switch → file system servers → InfiniBand SAN.

  7. I/O drawers: 8 I/O nodes, attached via PCIe. At least one I/O node is required for each partition/job, so the minimum partition/job size is 64 nodes (1024 cores).

  8. PowerA2 chip, basic info
     • 64-bit RISC processor, Power instruction set (Power1…Power7, PowerPC)
     • 16 cores + 1 + 1 (17th processor core for system functions), 1.6 GHz
     • 4 floating point units per core and 4-way multithreading
     • 32 MByte cache, system-on-a-chip design
     • 16 GByte of RAM at 1.33 GHz
     • Peak performance 204.8 GFlops, power draw of 55 watts
     • 45 nanometer copper/SOI process (same as Power7), water cooled

  9. PowerA2 FPU: each core's FPU has four pipelines, which can execute scalar floating point instructions, four-wide SIMD instructions, or two-wide complex-arithmetic SIMD instructions. Six-stage pipeline; a maximum of eight concurrent floating point operations per clock, plus a load and a store.
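
     A quick way to see where the 204.8 GFlops peak on the previous slide comes from (a back-of-the-envelope check, counting the eight flops per clock per core stated above):

     \[ 16\ \text{cores} \times 8\ \text{flops/clock} \times 1.6\ \text{GHz} = 204.8\ \text{GFlops} \]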

  10. EURORA: #1 in the Green500 List, June 2013
     What does EURORA stand for? EURopean many integrated cORe Architecture.
     What is EURORA? A prototype project funded by the PRACE 2IP EU project (grant agreement number RI-283493), co-designed by CINECA and EUROTECH.
     Where is EURORA? EURORA is installed at CINECA.
     When was EURORA installed? March 2013.
     Who is using EURORA? All Italian and EU researchers, through the PRACE prototype grant access programme.
     3,200 MFLOPS/W – 30 kW
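
     For context, the two efficiency figures above are consistent (assuming the 3,200 MFLOPS/W refers to sustained Linpack performance per watt, as used by the Green500 ranking):

     \[ 3200\ \text{MFLOPS/W} \times 30\ \text{kW} \approx 96\ \text{TFlops sustained} \]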

  11. Why EURORA? (project objectives)
     • Address today's HPC constraints: Flops/Watt, Flops/m2, Flops/Dollar.
     • Efficient cooling technology: hot-water cooling (free cooling); measure power efficiency, evaluate PUE & TCO.
     • Improve application performance: at the same rate as in the past (~Moore's Law); new programming models.
     • Evaluate hybrid (accelerated) technology: Intel Xeon Phi; NVIDIA Kepler.
     • Custom interconnection technology: 3D Torus network (FPGA); evaluation of accelerator-to-accelerator communications.

  12. EURORA prototype configuration
     • 64 compute cards
     • 128 Xeon SandyBridge CPUs (2.1 GHz, 95 W and 3.1 GHz, 150 W)
     • 16 GByte DDR3 1600 MHz per node
     • 160 GByte SSD per node
     • 1 FPGA (Altera Stratix V) per node
     • IB QDR interconnect
     • 3D Torus interconnect
     • 128 accelerator cards (NVIDIA K20 and Intel PHI)

  13. Node card (photo): Xeon Phi and K20 accelerator cards.

  14. Node Energy Efficiency Decreases!

  15. HPC Service

  16. CINECA infrastructure: HPC engines, services and workloads (diagram)
     HPC engines:
     • FERMI (IBM BGQ): #12 Top500, 2 PFlops peak, 163840 cores, 163 TByte RAM, Power 1.6 GHz
     • Eurora (Eurotech hybrid): #1 Green500, 0.17 PFlops peak, 1024 x86 cores, 64 Intel PHI, 64 NVIDIA K20
     • PLX (IBM x86 + GPU): 0.3 PFlops peak, ~3500 x86 procs, 548 NVIDIA GPUs, 20 NVIDIA Quadro, 16 fat nodes
     HPC workloads: PRACE, LISA, projects, agreements, ISCRA, training, labs, industry
     Data processing workloads (on FERMI and PLX): high throughput, visualization, big memory, DB, web services, data movers, processing
     HPC data store: Workspace 3.6 PByte, Repository 1.8 PByte, Tape 1.5 PByte
     HPC Cloud (NUBES FEC): cloud services, web, archive, FTP
     External data sources: PRACE, EUDAT, labs, projects
     Network: custom interconnects, IB, GbE, Fibre, linking FERMI, EURORA, PLX, the store, Nubes and the Internet

  17. CINECA services • High Performance Computing • Computational workflow • Storage • Data analytics • Data preservation (long term) • Data access (web/app) • Remote Visualization • HPC Training • HPC Consulting • HPC Hosting • Monitoring and Metering • … For academia and industry

  18. Road Map

  19. (Data centric) infrastructure (Q3 2014) (diagram)
     • Core data store: Repository 5 PByte, Tape 5+ PByte, new storage
     • Core data processing: FERMI, new analytics cluster, x86 cluster, data movers, Workspace 3.6 PByte
     • Scale-out data processing: visualization, big memory, DB, web services, web, archive, FTP
     • Cloud service: SaaS apps, analytics apps, parallel apps
     • External data sources: PRACE, EUDAT, other data sources; internal data sources: laboratories, Human Brain Project

  20. New Tier 1 CINECA procurement, Q3 2014. High-level system requirements:
     • Electrical power drawn: 400 kW
     • Physical size of the system: 5 racks
     • Peak performance of the system (CPU+GPU): on the order of 1 PFlops
     • Peak performance of the system (CPU only): on the order of 300 TFlops

  21. Tier 1 CINECA, high-level system requirements
     • CPU architecture: Intel Xeon Ivy Bridge
     • Cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz. The choice of frequency and core count depends on the socket TDP, on the density of the system and on the cooling capacity.
     • Number of servers: 500–600 (peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clk = 345 TFlops). The number of servers may depend on cost or on the geometry of the configuration, i.e. how many nodes are CPU-only and how many are CPU+GPU.
     • GPU architecture: NVIDIA K40
     • Number of GPUs: >500 (peak perf = 700 * 1.43 TFlops = 1 PFlops). The number of GPU cards may depend on cost or on the geometry of the configuration, i.e. how many nodes are CPU-only and how many are CPU+GPU.
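
     Written out, the two peak estimates in the requirements follow the usual peak-performance formula (the 8 Flop/clk per core and the 1.43 TFlops per K40 are the figures quoted in the slide):

     \[ P_{\text{CPU}} = N_{\text{servers}} \times N_{\text{sockets}} \times N_{\text{cores}} \times f \times \text{Flop/clk} = 600 \times 2 \times 12 \times 3\ \text{GHz} \times 8 \approx 345\ \text{TFlops} \]
     \[ P_{\text{GPU}} = N_{\text{GPUs}} \times P_{\text{K40}} = 700 \times 1.43\ \text{TFlops} \approx 1\ \text{PFlops} \]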

  22. Tier 1 CINECA, high-level system requirements (continued)
     • Identified vendors: IBM, Eurotech
     • DRAM memory: 1 GByte/core. The option of a subset of nodes with a larger amount of memory will be requested.
     • Local non-volatile memory: >500 GByte SSD/HD, depending on cost and on the system configuration
     • Cooling: liquid cooling with a free-cooling option
     • Scratch disk space: >300 TByte (provided by CINECA)

  23. Roadmap 50PFlops

  24. Roadmap to Exascale (architectural trends)

  25. HPC Architectures: two models
     • Hybrid:
       - Server-class processors: server-class nodes, special purpose nodes
       - Accelerator devices: NVIDIA, Intel, AMD, FPGA
     • Homogeneous:
       - Server-class nodes: standard processors, special purpose nodes
       - Special purpose processors

  26. Architectural trends
     • Peak performance: Moore's law
     • FPU performance: Dennard scaling
     • Number of FPUs: Moore + Dennard
     • Application parallelism: Amdahl's law
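
     One schematic way to read this mapping (not on the slide, just the standard decomposition): peak performance factors into the number of FPUs times per-FPU performance; with Dennard scaling gone the per-FPU term is roughly flat, so growth has to come from the FPU count, while Amdahl's law bounds how much of it applications can exploit.

     \[ P_{\text{peak}} \approx N_{\text{FPU}} \times \underbrace{f_{\text{clock}} \times \text{ops/cycle}}_{\text{FPU performance}} \]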

  27. Programming models
     • Fundamental paradigms: message passing, multi-threading
     • Consolidated standards: MPI & OpenMP
     • New task-based programming models
     • Special purpose, for accelerators: CUDA, Intel offload directives, OpenACC, OpenCL, etc. – no consolidated standard
     • Scripting: Python
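
     A minimal sketch of the consolidated MPI + OpenMP hybrid model mentioned above (illustrative only, not CINECA code): each MPI rank computes a partial sum with OpenMP threads, and the ranks then reduce to a global result.

     /* hybrid.c - compile e.g. with: mpicc -fopenmp hybrid.c -o hybrid */
     #include <mpi.h>
     #include <omp.h>
     #include <stdio.h>

     int main(int argc, char **argv)
     {
         int rank, nranks, provided;
         /* Request threaded MPI; only the master thread makes MPI calls */
         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nranks);

         const long n = 1000000;           /* elements per rank (arbitrary) */
         double local = 0.0, global = 0.0;

         /* Thread-level parallelism inside the node (OpenMP) */
         #pragma omp parallel for reduction(+:local)
         for (long i = 0; i < n; i++)
             local += 1.0 / (double)(rank * n + i + 1);

         /* Process-level parallelism across nodes (MPI) */
         MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

         if (rank == 0)
             printf("ranks=%d threads/rank=%d sum=%f\n",
                    nranks, omp_get_max_threads(), global);

         MPI_Finalize();
         return 0;
     }

     A typical launch uses one MPI rank per node and one OpenMP thread per core, e.g. mpirun -np 4 ./hybrid with OMP_NUM_THREADS set to the per-node core count.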

  28. But! 14 nm VLSI → 0.54 nm Si lattice → 300 atoms! There are still some 4–6 cycles (or technology generations) left until we reach 11–5.5 nm technologies, at which point we will hit the downscaling limit, some year between 2020 and 2030 (H. Iwai, IWJT2008).

  29. Thank you

  30. Dennard scaling law (downscaling)
     Old VLSI generations (Dennard scaling holds): L' = L/2, V' = V/2, F' = 2F, D' = 1/L'^2 = 4D, P' = P.
     New VLSI generations: the voltage no longer scales, so these relations do not hold anymore: L' = L/2, V' ≈ V, F' ≈ 2F, D' = 1/L'^2 = 4D, P' = 4P. The power crisis!
     The core frequency and performance therefore no longer grow with Moore's law; instead the number of cores is increased to keep the evolution of the architectures on the Moore's-law track. The programming crisis!
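
     A standard way to see the power crisis sketched above (a textbook relation, not on the slide): chip power scales with transistor density, per-device capacitance, voltage squared and frequency,

     \[ P \propto D \times C \times V^2 \times F \]

     With classic Dennard scaling (D' = 4D, C' = C/2, V' = V/2, F' = 2F) the factors cancel, 4 * 1/2 * 1/4 * 2 = 1, so P' = P; once the voltage stops scaling (V' ≈ V) the same shrink gives 4 * 1/2 * 1 * 2 = 4, i.e. P' = 4P.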

  31. Moore’s Law Economic and market law Stacy Smith, Intel’s chief financial officer, later gave some more detail on the economic benefits of staying on the Moore’s Law race. The cost per chip “is going down more than the capital intensity is going up,” Smith said, suggesting Intel’s profit margins should not suffer because of heavy capital spending. “This is the economic beauty of Moore’s Law.” And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt said the company has test chips running on that technology. “We are projecting similar kinds of improvements in cost out to 10 nanometers,” he said. So, despite the challenges, Holt could not be induced to say there’s any looming end to Moore’s Law, the invention race that has been a key driver of electronics innovation since first defined by Intel’s co-founder in the mid-1960s. From WSJ It is all about the number of chips per Si wafer!

  32. What about applications? In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law): the maximum speedup tends to 1 / (1 − P), where P is the parallel fraction. Example: 1,000,000 cores, P = 0.999999 (serial fraction = 0.000001).
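
     Plugging the slide's numbers into Amdahl's law in its finite-core form (the formula is implied, not written, on the slide):

     \[ S(N) = \frac{1}{(1 - P) + P/N}, \qquad S(10^6) = \frac{1}{10^{-6} + 0.999999 \cdot 10^{-6}} \approx 5 \times 10^{5} \]

     So even with a parallel fraction of 0.999999, one million cores deliver only about half of the ideal speedup, and the limit for N → ∞ is 1/(1 − P) = 10^6.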

  33. HPC Architectures: hybrid, but… / homogeneous, but… (two models). Which 100 PFlops systems will we see? My guess:
     • IBM (hybrid): Power8 + NVIDIA GPU
     • Cray (homogeneous/hybrid): with Intel only
     • Intel (hybrid): Xeon + MIC
     • ARM (homogeneous): ARM chips only, but…
     • NVIDIA/ARM (hybrid): ARM + NVIDIA
     • Fujitsu (homogeneous): SPARC, high density, low power
     • China (homogeneous/hybrid): with Intel only
     • Room for AMD console chips

  34. Chip architecture: strongly market driven (mobile, TV sets, screens, video/image processing).
     • Intel: new architectures to compete with ARM; less Xeon, but PHI
     • ARM: main focus on low-power mobile chips (Qualcomm, Texas Instruments, NVIDIA, ST, etc.); new HPC and server markets
     • NVIDIA: GPU alone will not last long; ARM+GPU, Power+GPU
     • Power: embedded market; Power+GPU, the only chance for HPC
     • AMD: console market; still some chance for HPC
