The New Era of Coprocessor in Supercomputing (并行计算中协处理应用的新时代) Marc XAB, M.A. (桜美林大学大学院 / J. F. Oberlin University Graduate School), Country Manager 5/07/2013 @ BAH! Oil & Gas, Rio de Janeiro, Brazil Super Micro Computer Inc., Rua Funchal, 418, São Paulo – SP www.supermicro.com/brazil
Company Overview • San Jose (headquarters); Fremont facility • Revenues: FY10 $721M, FY11 $942M, FY12 $1B • Global footprint: >70 countries, 700 customers, 6,800 SKUs • Production: facilities in the US, EU and Asia • Engineering: 70% of the workforce in engineering; SSI member • Market share: #1 in the server channel • Corporate focus: leader in energy-efficient, HPC and application-optimized systems • Fortune 2012 "100 Fastest-Growing Companies"
COPROCESSOR (协处理器) • A coprocessor is a computer processor used to supplement the functions of the primary processor (the CPU). • Operations performed by the coprocessor may include floating-point arithmetic, graphics, signal processing, string processing, encryption or I/O interfacing with peripheral devices. • Math coprocessor – a chip that handles floating-point operations and mathematical computations in a computer. • Graphics Processing Unit (GPU) – a separate card that handles graphics rendering and can improve performance in graphics-intensive applications, such as games. • Secure cryptoprocessor – a dedicated computer-on-a-chip or microprocessor for carrying out cryptographic operations, embedded in packaging with multiple physical security measures that give it a degree of tamper resistance. • Network coprocessor (网络协处理器), and others.
Case Study – Submerged Liquid Cooling ("Submerged Supermicro Servers Accelerated by GPUs") • Supermicro 1U (single CPU) with two coprocessors • Fans and heat sinks removed; SSDs and an updated BIOS; reversed handles • No requirement for room-level cooling • Operates at PUE ~ 1.12 • ~25 kW per rack is the breakpoint at which submerged liquid cooling becomes more cost-efficient than conventional air cooling [Chart: cost efficiency vs. kW per rack, comparing air cooling and submerged liquid cooling, with the crossover at ~25 kW per rack.]
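For context, Power Usage Effectiveness (PUE) is the ratio of total facility power to the power delivered to the IT equipment. The figures below are a rough illustration of what a PUE of about 1.12 implies for a fully loaded rack, not measurements from the case study itself:

```latex
\mathrm{PUE} = \frac{P_{\text{facility}}}{P_{\text{IT}}},
\qquad
\mathrm{PUE} \approx 1.12 \;\Rightarrow\; P_{\text{facility}} \approx 1.12 \times 25\ \mathrm{kW} = 28\ \mathrm{kW}
```

In other words, for every watt delivered to the servers, only about 0.12 W is spent on cooling and power delivery.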
Tesla: 2–3x Faster Every 2 Years [Chart: double-precision GFLOPS per watt (and core count in thousands) by GPU generation – T10 (2008), Fermi with 512 cores (2010), Kepler (2012), Maxwell (2014); vertical scale up to ~16 DP GFLOPS per watt.]
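As a rough sanity check on the trend (my own figures from published board specifications, not data from the slide), a Fermi-class Tesla M2090 peaks at about 665 DP GFLOPS at ~225 W, while a Kepler-class K20X peaks at about 1,310 DP GFLOPS at ~235 W:

```latex
\frac{665\ \mathrm{GFLOPS}}{225\ \mathrm{W}} \approx 3.0\ \mathrm{GFLOPS/W},
\qquad
\frac{1310\ \mathrm{GFLOPS}}{235\ \mathrm{W}} \approx 5.6\ \mathrm{GFLOPS/W}
```

That is roughly a 1.9x improvement in energy efficiency per generation, consistent with the 2–3x-every-two-years claim.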
GPU Supercomputer Momentum [Chart: number of GPU-accelerated systems on the Top500 by year, 2008–2013, rising roughly 4x to 52 systems on the June 2012 list; annotated with the launch of the first double-precision Tesla GPU (2008) and the Fermi launch.]
Case Study – PNNL • Expects the supercomputer to rank among the world's 20 fastest machines. • Research targets include climate and environmental science, chemical processes, biology-based fuels that can replace fossil fuels, and new materials for energy applications. Supermicro FatTwin™ with 2x Intel Xeon Phi (MIC) 5110P per node
Case Study – PNNL Supermicro FatTwin™ with 2x MIC 5110P per node • Theoretical peak processing speed of 3.4 petaflops • 42 racks / 195,840 cores • 1,440 compute nodes with conventional processors plus Intel Xeon Phi "MIC" coprocessors • 128 GB memory per node • FDR InfiniBand network • 2.7-petabyte shared parallel file system (60 GB/s read/write)
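A rough back-of-envelope check (my own arithmetic, assuming each Xeon Phi 5110P peaks at roughly 1.01 DP TFLOPS; the host-CPU share is inferred, not stated on the slide):

```latex
\underbrace{1440 \times 2 \times 1.01\ \text{TFLOPS}}_{\text{Xeon Phi coprocessors}} \approx 2.9\ \text{PFLOPS},
\qquad
3.4\ \text{PFLOPS} - 2.9\ \text{PFLOPS} \approx 0.5\ \text{PFLOPS (host CPUs)}
```

On these assumptions, the coprocessors supply the large majority of the quoted 3.4-petaflop theoretical peak.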
Programming Paradigm • Intel Xeon Phi (MIC) – "made easier": the programming model and its optimizations are shared with standard Intel Xeon processors, so existing code and tools largely carry over. • CUDA (Compute Unified Device Architecture) – a parallel computing platform and programming model that gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA-capable GPUs. • In either case, the aim is to keep the programming model no more complicated than the application requires.
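As a minimal illustration of the offload model referred to above, the sketch below is a generic CUDA vector-add (my own example, not code from the presentation): the host allocates device memory, copies data across PCI-E to the coprocessor, launches a kernel, and copies the result back.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers.
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device (coprocessor) buffers.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    // Copy inputs over PCI-E, launch the kernel, copy the result back.
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The Xeon Phi path is analogous in spirit: compute-heavy loops are offloaded to the coprocessor, but with a compilation and threading model shared with the host Xeon.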
Keynotes • This is a new era of hybrid computing – heterogeneous architectures with PCI-E based coprocessors. • Specialized (application-optimized) design is required for GPU/MIC applications and for future HPC scalability. • There is more to come on the industry roadmap: new technologies, power management and system architectures. • Configurable cooling and power for energy efficiency and performance are increasingly critical. • The trend toward heterogeneous architectures poses many challenges for system builders and software developers in making efficient use of the resources. • The programming paradigm and the investment it requires are an important part of the selection criteria.
HPC Coprocessor Applications – massively parallel architecture accelerates scientific and engineering applications • Oil and Gas / Seismic: seismic imaging, seismic interpretation, reservoir modeling, seismic inversion • Weather and Climate: weather, atmospheric and ocean modeling, space sciences • Scientific Simulation: computational fluid dynamics, materials science, molecular dynamics, quantum chemistry • Creation & Design: mechanical design and simulation, structural mechanics, electronic design automation • Data Mining: data-parallel mathematics, extending Excel with OLAP for planning and analysis, database and data-analysis acceleration • Imaging and Computer Vision: medical imaging, visualization and docking, filmmaking and animation • Computational Finance: options pricing, risk analysis, algorithmic trading
Hybrid Computing: Pioneer to Mainstream [Timeline, 2008–2013, of Supermicro GPU/MIC platforms.] Where it started: the Tesla S1070 standalone box with 4 GPUs and 7U GPU blades (20 CPUs + 20 GPUs, ultra-high-efficiency GPGPU); followed by workstation/4U, 2U 4-GPU, 1U 2-GPU/MIC, 1U 3-GPU and 1U 4-GPU systems ("the fastest 1U server in the world"), 2U GPU with QDR IB onboard and PCI-E x16, the X9 (UP) and X9 (DP) generations, 2U Twin and 1U Twin™ ("the most powerful PSC"), 2U 6-GPU/MIC and 1U 4-GPU/MIC; and today the FatTwin™ 4-node (3 GPUs or MICs per node) and FatTwin™ 2-node (8 GPUs or MICs per node) systems for density and efficiency, supporting NVIDIA Kepler and Intel Xeon Phi.
Communication Between Coprocessors • Conventional model: in existing CPU-GPU heterogeneous architectures, inter-node GPU-to-GPU data travels via the CPU and an InfiniBand (IB) Host Channel Adapter (HCA) and switch, or another proprietary interconnect. • TCA (Tightly Coupled Accelerators) approach: the PEACH2 chip enables direct data transfer between cooperating GPUs in separate nodes of a TCA cluster. • Implementation example: schematic of the PEARL network within a CPU/GPU cluster. Source: Tsukuba University
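For reference, here is a minimal sketch (my own, assuming a standard CUDA + MPI cluster; not code from the TCA project) of the conventional inter-node GPU-to-GPU path described above, where data is staged through host memory and sent over the IB fabric by MPI:

```cuda
#include <cstdlib>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* dbuf;                          // buffer in GPU memory
    float* hbuf = (float*)malloc(bytes);  // staging buffer in host memory
    cudaMalloc(&dbuf, bytes);

    if (rank == 0) {
        // GPU -> host memory -> IB fabric (via CPU and HCA) -> remote host
        cudaMemcpy(hbuf, dbuf, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(hbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // remote host memory -> GPU on the receiving node
        MPI_Recv(hbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(dbuf, hbuf, bytes, cudaMemcpyHostToDevice);
    }

    cudaFree(dbuf);
    free(hbuf);
    MPI_Finalize();
    return 0;
}
```

Approaches such as PEACH2/TCA (and, in the CUDA ecosystem, GPUDirect-style transfers) aim to remove the host-memory staging steps shown here.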
Designing GPU/MIC-Optimized Systems • Performance: PCI-E lane arrangement, PCB placement, interconnect • Mechanical design: mounting, location, space utilization • Thermal: airflow, fan speed control, location, noise control • Power support: PSU efficiency, wattage options, power management, number and location of power connectors
Summary • Coprocessor and Applications • Performance and Efficiency • Top500 & Green500 • Hybrid Computing & HPC • GPU/MIC Optimized Systems • Design Considerations • Performance • Mechanical Design • Thermal & Cooling • Power Support
Thank You! Marc XAB marc.xab@supermicro.com
Conference Puzzle How do you put an ELEPHANT in a refrigerator?