Reconfigurable Computing: FPGAs for Ultrascale Science Sandia National Laboratories

Craig Ulmer, SNL/CA, cdulmer@sandia.gov • Keith Underwood, SNL/NM • SOS-8 Workshop, April 14, 2004

Motivation: CPU Efficiency Trend • While CPU performance has been increasing, processing efficiency has been decreasing • Efficiency metric: MFLOPS/MHz/Mtransistors [Figure: efficiency by processor]
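The efficiency metric on this slide can be sketched in a few lines. The function name and the two processors below are hypothetical placeholders chosen only to show the arithmetic; they are not data from the talk.

```python
def efficiency(mflops: float, clock_mhz: float, transistors_millions: float) -> float:
    """Return MFLOPS per MHz per million transistors (the slide's metric)."""
    if clock_mhz <= 0 or transistors_millions <= 0:
        raise ValueError("clock rate and transistor count must be positive")
    return mflops / clock_mhz / transistors_millions


# Hypothetical numbers, for illustration only (not measurements).
older_cpu = efficiency(mflops=100.0, clock_mhz=100.0, transistors_millions=2.0)
newer_cpu = efficiency(mflops=4000.0, clock_mhz=2000.0, transistors_millions=50.0)

print(f"older: {older_cpu:.3f}  newer: {newer_cpu:.3f}")
```

With these made-up inputs the newer part delivers far more MFLOPS, yet scores lower on the metric because clock rate and transistor count grew faster than throughput.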

Looking Ahead • For commodity clusters, should we be nervous? • Significant increases in technology effort • Diminishing returns • Should we depend on CPU manufacturers for HPC? • Sandia has many HPC interests • Investigate computing alternatives and accelerators • FPGAs: Modern Reconfigurable Computing

Outline • Reconfigurable computing • Use FPGAs to accelerate computations • Strategy and examples • Approaches to scientific computing • Challenges for ultrascale science • Double-precision floating-point performance • System integration and network aspects

Reconfigurable Computing Background “Soft Hardware”

Computing Spectrum: Software, Soft-Hardware, Hardware • Software: General-purpose CPU • Easily reprogrammed • Low cost • Fundamental bottlenecks • Soft-Hardware: Field-Programmable Gate Arrays (FPGAs) • Reconfigurable hardware • Medium cost • Speedup potential • Hardware: Application-Specific Integrated Circuit (ASIC) • Not modifiable • High cost • Extremely fast [Diagrams: CPU pipeline (Fetch, Decode, Registers, Execute, Memory, Writeback); reconfigurable dataflow circuit; fixed-function circuit]

Reconfigurable Hardware Devices • Devices that can be programmed to emulate hardware circuitry • Tile architecture • Logic blocks (LBs) • Routing elements • Field-Programmable Gate Arrays • Fine granularity • LBs are bit-level operators • Commercial trend • Coarse granularity • LBs are ALUs, FPUs • QuickSilver, Pact XPP, ClearSpeed [Diagram: tiled array of logic blocks]

Common Acceleration Techniques • Processing concurrency • Hardware pipelines • Custom memory interactions • Partial evaluation • Key: Designing in Hardware [Diagram: external SRAM banks and internal SRAM]
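Three of the techniques above can be illustrated with a rough software model. This is a minimal sketch under stated assumptions: an idealized pipeline that accepts one item per clock with no stalls, and a constant multiplier reduced to shifts and adds as a stand-in for partial evaluation. All function names are hypothetical.

```python
def sequential_cycles(n_items: int, stages: int) -> int:
    """Cycles when each item passes through all stages before the next starts."""
    return n_items * stages


def pipelined_cycles(n_items: int, stages: int) -> int:
    """Cycles for an ideal pipeline: fill latency, then one result per clock."""
    return 0 if n_items == 0 else stages + n_items - 1


def concurrent_pipelined_cycles(n_items: int, stages: int, lanes: int) -> int:
    """Cycles when the work is split evenly across several identical pipelines."""
    per_lane = -(-n_items // lanes)  # ceiling division
    return pipelined_cycles(per_lane, stages)


def specialize_multiplier(constant: int):
    """Partial evaluation: fix one operand ahead of time, leaving shifts and adds."""
    if constant < 0:
        raise ValueError("sketch handles non-negative constants only")
    shifts = [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]

    def multiply(x: int) -> int:
        return sum(x << s for s in shifts)

    return multiply


times10 = specialize_multiplier(10)  # 10 = 0b1010, so x*10 = (x << 3) + (x << 1)
print(sequential_cycles(1000, 5), pipelined_cycles(1000, 5),
      concurrent_pipelined_cycles(1000, 5, 4), times10(7))
```

In hardware the specialized multiplier costs two shifted wires and one adder rather than a general multiplier, which is where partial evaluation saves logic.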

Reconfigurable Computing for Ultrascale Science: HPC Strategy and Examples • Enhancing HPC Performance

HPC Strategy at Sandia for RC • RC resources work best as accelerators in HPC • Clusters are inexpensive & work well for many applications • Add RC devices to enhance performance • Port key portions of algorithms to RC hardware • Focus on hotspots and inner loops • Move data to/from FPGAs in pipelined fashion
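Why the strategy targets hotspots and pipelined data movement can be seen with an Amdahl-style estimate. The sketch below is a simplified model, not a result from the talk: it assumes the hotspot is a fixed fraction of runtime, and that pipelined transfers fully overlap with FPGA computation. The function name and the sample numbers are hypothetical.

```python
def offload_speedup(total_s: float, hotspot_fraction: float, kernel_speedup: float,
                    transfer_s: float = 0.0, overlapped: bool = True) -> float:
    """Estimate whole-application speedup from moving one hotspot to an FPGA."""
    host_s = total_s * (1.0 - hotspot_fraction)        # code left on the CPU
    kernel_s = total_s * hotspot_fraction / kernel_speedup  # accelerated hotspot
    if overlapped:
        # Pipelined transfers hide behind computation (or the reverse).
        accel_s = max(kernel_s, transfer_s)
    else:
        # Transfers and computation happen one after the other.
        accel_s = kernel_s + transfer_s
    return total_s / (host_s + accel_s)


# Hypothetical workload: 100 s run, 90% in the hotspot, 20x faster in hardware,
# 3 s of data movement to and from the FPGA.
streamed = offload_speedup(100.0, 0.9, 20.0, transfer_s=3.0, overlapped=True)
blocking = offload_speedup(100.0, 0.9, 20.0, transfer_s=3.0, overlapped=False)
print(f"streamed: {streamed:.2f}x  blocking: {blocking:.2f}x")
```

Even with an arbitrarily fast kernel, this model caps the speedup at 1 / (1 - hotspot_fraction), which is why the choice of what to port matters as much as how fast the hardware runs.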

Scientific Computing Examples • Pattern recognition • ATLAS project at CERN • Reduced 2500 CPUs to 120 nodes with FPGAs • Visualization • Vizard II project at University of Tübingen • Direct volume rendering for 512³ datasets • Molecular dynamics (MD) • Preliminary work at Los Alamos National Laboratory • 20 cells in an FPGA yield 5.69 GFLOPS • Computational fluid dynamics (CFD) analysis for jet engines • Smith and Schnore at GE Global Research

Challenges • Hard to program • Hardware design • Must be significant parallelism • Limited chip capacity • Lack of HPC building blocks • Our users need DP-FP • System integration • How do we add to our clusters? [Diagram labels: LANL, Academia, Industry; Keith Underwood, SNL/NM; Craig Ulmer, SNL/CA]

Reconfigurable Computing for Ultrascale Science: Double-Precision Floating-Point Cores • Addressing the need for HPC building blocks

Double-Precision Floating-Point Cores • Floating point has been a historical weakness for FPGAs • FP cores consume significant amounts of hardware • Previous FPGAs lacked capacity • Significant improvements in recent commercial FPGAs • Increased capacity, faster clocks, and better building blocks • Keith Underwood at SNL/NM • Re-evaluating FP performance in FPGAs • Constructing high-speed DP-FP cores

Peak Performance Results • From Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” in FPGA’04

Double-Precision Multiply Performance Trends

Reconfigurable Computing for Ultrascale Science: Networking Aspects • Addressing capacity and system integration issues

Data Exchange: Multi-Gigabit Transceivers (MGTs) • How do we rapidly move data into/out of FPGA? • Xilinx Virtex-II/Pro FPGA has MGTs • Channel data rates: 3.125 Gbps • Up to 24 channels • V2/ProX: twenty 10 Gbps channels • Configured for different physical layers • InfiniBand, FC, GigE, 10GigE • S-ATA, PCI-Express, HT [Diagram: FPGA fabric connected to pins through Rocket I/O MGTs]
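The channel counts and line rates quoted above multiply out to an aggregate figure per direction. The sketch below does that arithmetic; the 8b/10b coding efficiency of 0.8 is an assumption (that line code is common on several of the listed physical layers, but the slide does not state which coding applies), and the function name is hypothetical.

```python
def aggregate_gbps(channels: int, line_rate_gbps: float,
                   coding_efficiency: float = 1.0) -> float:
    """Aggregate one-direction bandwidth across all transceiver channels."""
    return channels * line_rate_gbps * coding_efficiency


# Figures from the slide: 24 channels at 3.125 Gbps, or twenty 10 Gbps channels.
v2pro_raw = aggregate_gbps(24, 3.125)
v2prox_raw = aggregate_gbps(20, 10.0)

# Assumption: 8b/10b line coding, so 8 payload bits per 10 bits on the wire.
v2pro_payload = aggregate_gbps(24, 3.125, coding_efficiency=0.8)

print(f"V2Pro raw: {v2pro_raw} Gbps, payload (8b/10b assumed): "
      f"{v2pro_payload:.1f} Gbps ({v2pro_payload / 8:.2f} GB/s)")
print(f"V2ProX raw: {v2prox_raw} Gbps")
```

These are peak link rates only; protocol headers and flow control reduce what an application would actually see.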

Importance of MGTs • Increase raw capacity • Connect FPGAs together • MGTs provide fat pipes • Cables, not PCB traces • System integration • Connect FPGA to SAN • Implement NI in FPGA • FPGA is global resource [Diagrams: FPGAs with computational circuits linked by MGT channels; CPUs with NICs and an FPGA with its own NI attached to a system area network]

Recent Sandia Work: SNL OpenTOE • At Sandia we are interested in connecting FPGAs to SANs • Main target: InfiniBand • Must implement network protocols for reliable transfer • Initial work: GigE and TCP • Implemented GigE core and basic TCP offload engine [Diagram: FPGA with computational circuits and an NI built from a TCP core, GigE IP core, and MGT Tx/Rx]
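One small piece of work that any TCP implementation has to perform per segment is the 16-bit ones' complement Internet checksum. The sketch below shows that calculation in software as an illustration of the kind of operation a TCP offload engine moves into hardware; it is not taken from the SNL OpenTOE design, and it omits the TCP pseudo-header that a real segment checksum also covers.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit ones' complement checksum used by IP, TCP, and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
print(hex(checksum))

# A receiver recomputes over payload plus checksum and expects zero.
verified = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
print(verified)
```

The additions are independent 16-bit sums with carry folding, which maps naturally onto a hardware adder running at line rate.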

Concluding Remarks • Improvements in commercial FPGAs make RC attractive • FPGAs provide better sustained performance than CPUs • FPGA performance growing faster than Moore’s Law • Near-term strategy: accelerator-based approach • Offload key operations into hardware • Sandia National Labs investigating RC for HPC acceleration • Enabling scientific computing through fast DP FP cores • Addressing system integration/capacity issues via network
