Reconfigurable Computing: HPC Network Aspects

Craig Ulmer (8963), cdulmer@sandia.gov
Mitch Sukalski (8961)
David Thompson (8963)

Pete Dean R&D Seminar, December 11, 2003
FPGAs are promising… but what's the catch? Three main challenges must be addressed before FPGAs can be applied to practical scientific computing.
RC Challenge #1: Floating Point
• Most FPGA fabric is fine-grained
• Floating-point units are large
  • A 32b FP unit occupies ~1,000 CLBs
• Commercial capacity is improving
  • 2000: 6,000 CLBs
  • 2003: 40,000 CLBs (max: 220,000)
  • Even so, at ~1,000 CLBs per unit, a 40,000-CLB device holds only a few dozen 32b FP units
• Keith Underwood at Sandia/NM
  • LDRD: working on high-speed 64b floating-point cores
[Figure: 32b FP unit mapped onto a Xilinx V2P7]
RC Challenge #2: Design Tools
• Hardware design is non-trivial
  • Computations are micromanaged, clock by clock
  • Not appropriate for most scientists
• Need languages and APIs that are easy to use
• Maya Gokhale at LANL
  • Streams-C: C-like language for HW design (sketch below)
  • Pipelines/unrolls loops
  • Schedules access to external memory
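Streams-C has its own annotations and stream datatypes, so the snippet below is only a plain-C sketch of the programming model this slide describes: a loop over streaming data that a compiler in this family could pipeline or unroll into hardware. The kernel and all names are hypothetical.

    /* Plain-C sketch of a stream-style kernel (not actual Streams-C
     * syntax). A tool in this family would pipeline/unroll the loop
     * into hardware and schedule the external-memory accesses. */
    #include <stdio.h>

    #define N 8

    /* Hypothetical kernel: running sum of scaled input samples. */
    static void stream_kernel(const int in[N], int out[N])
    {
        int acc = 0;
        for (int i = 0; i < N; i++) {   /* candidate for pipelining */
            acc += 3 * in[i];
            out[i] = acc;
        }
    }

    int main(void)
    {
        int in[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        int out[N];
        stream_kernel(in, out);
        for (int i = 0; i < N; i++)
            printf("%d ", out[i]);
        putchar('\n');
        return 0;
    }

The point of such languages is that the scientist writes the loop; the tool decides how many copies of the loop body to instantiate and when each memory access fires.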
RC Challenge #3: High-Speed I/O
• FPGAs have large internal computational power
• How do we get data into and out of the FPGA?
• How do we connect to our existing HPC machines?
• Mitch Sukalski, David Thompson, Craig Ulmer
  • LDRD: connect FPGAs to high-performance SANs
[Figure: FPGAs with an unanswered question mark for their connection to the outside world]
Outline
• Where we have been: networking FPGAs using external NI cards
• Where we are going: networking FPGAs using internal transceivers
• Project status: early details
Previous Work: where we've been…
Networking Earlier FPGAs
• Previous generations of FPGAs were like blank ASICs
  • Configurable logic and pins
• Attach a network card to an FPGA card
  • Communication over the PCI bus
• Examples:
  • Virginia Tech: Myrinet
  • Washington U. in St. Louis: ATM (inline)
  • Clemson University: Gigabit Ethernet
  • Georgia Tech: Myrinet
[Figure: CPU and FPGA cards attached to a NIC over a shared PCI bus]
GRIM Project at Georgia Tech
• Add multimedia devices to a cluster
• Message layer connects CPUs, memory, and peripherals
  • Myrinet between hosts, PCI within hosts
• Celoxica RC-1000 FPGA card
  • Virtex FPGA (1M logic gates)
  • Four SRAM banks
  • PCI w/ PMC
[Figure: cluster hosts (CPUs, RAID, Ethernet) joined by control & switching logic; the GRIM FPGA card sits on PCI with four SRAM banks (SRAM 0-3)]
FPGA Organization
[Figure: FPGA card memory holds application data plus incoming and outgoing message queues; on the FPGA, a frame circuit (communication library API, memory API, user circuit API) surrounds a canvas holding user circuits 1..n. A hypothetical sketch of this message-queue pattern follows.]
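A rough host-side sketch of how this organization might be driven. Every name below is hypothetical (this is not the actual GRIM communication library API), and card memory is stood in for by an ordinary array; it only illustrates the pattern of queues in card memory plus a frame that routes messages to user circuits.

    /* Hypothetical illustration of the message-queue organization:
     * the host enqueues messages into card memory, and the frame
     * circuit routes each one to the user circuit it names. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define QUEUE_SLOTS   4
    #define PAYLOAD_WORDS 4

    typedef struct {
        uint32_t target_circuit;            /* which user circuit (1..n) */
        uint32_t payload[PAYLOAD_WORDS];    /* application data          */
    } msg_t;

    static msg_t incoming[QUEUE_SLOTS];     /* stand-in for card memory  */
    static unsigned head, tail;

    /* Host side: place a message in the incoming queue. */
    static int enqueue(uint32_t circuit, const uint32_t *words)
    {
        if ((head + 1) % QUEUE_SLOTS == tail)
            return -1;                      /* queue full                */
        incoming[head].target_circuit = circuit;
        memcpy(incoming[head].payload, words,
               sizeof incoming[head].payload);
        head = (head + 1) % QUEUE_SLOTS;
        return 0;
    }

    /* Frame side: pop each message and hand it to its circuit. */
    static void dispatch(void)
    {
        while (tail != head) {
            printf("frame: message -> user circuit %u\n",
                   incoming[tail].target_circuit);
            tail = (tail + 1) % QUEUE_SLOTS;
        }
    }

    int main(void)
    {
        uint32_t data[PAYLOAD_WORDS] = {0xDEADBEEF, 1, 2, 3};
        enqueue(1, data);
        enqueue(2, data);
        dispatch();
        return 0;
    }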
Lessons Learned
• Frame provides a simple OS
  • Isolates users from the board
  • Portability
• Dynamically manage resources
  • Card memory
  • Computational circuits
• PCI bottleneck
  • Distance between NI and FPGA
  • PCI difficult to work with
[Figure: host CPU sends the message "Use Circuit F on $C0000000"; the frame services page faults for card SRAM pages (A, B, C) and function faults for circuits (E, F, G), with circuits X and Y already resident. A toy model of the function-fault idea follows.]
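The "function fault" idea in the figure parallels a virtual-memory page fault: if a message names a circuit that is not currently configured on the canvas, it gets loaded before the message is dispatched. The toy model below uses invented names and a print statement in place of a real bitstream load; the project's actual handling is not shown in these slides.

    /* Toy model of function faults: requesting a non-resident circuit
     * triggers a (simulated) bitstream load before dispatch, just as a
     * page fault pulls in a missing page. */
    #include <stdio.h>

    typedef struct {
        char name;
        int  resident;   /* is this circuit configured on the canvas? */
    } circuit_t;

    static circuit_t canvas[] = { {'E', 1}, {'F', 0}, {'G', 1} };

    static void use_circuit(char name)
    {
        for (unsigned i = 0; i < sizeof canvas / sizeof canvas[0]; i++) {
            if (canvas[i].name != name)
                continue;
            if (!canvas[i].resident) {
                /* function fault: stand-in for loading a bitstream */
                printf("fault: configuring circuit %c\n", name);
                canvas[i].resident = 1;
            }
            printf("dispatch: \"Use Circuit %c on $C0000000\"\n", name);
            return;
        }
        printf("error: unknown circuit %c\n", name);
    }

    int main(void)
    {
        use_circuit('F');   /* faults, then runs */
        use_circuit('F');   /* already resident  */
        return 0;
    }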
Network Features of Recent FPGAs: where we're going…
Network Improvements
• Recent FPGAs have special built-in cores
  • High-speed transceivers, dedicated processors
• Idea: build our NI inside the FPGA
  • FPGA becomes a networked compute resource
  • Removes the PCI bottleneck
[Figure: CPUs with NICs and an FPGA (with on-chip NI Tx/Rx and user-defined computational circuits) attached directly to a system area network]
Xilinx Virtex-II/Pro FPGA
• Up to four PowerPC 405 cores
  • Embedded version of PPC, 300-400 MHz
• Multiple gigabit transceivers
  • Run at 600 Mbps to 3.125 Gbps
  • Up to twenty-four transceivers
• Additional cores
  • Distributed internal memory
  • Arrays of 18b multipliers
  • Digital clock multipliers, PLLs
[Figure: Xilinx V2P20 die]
Multi-Gigabit Transceivers: Rocket I/O
• Flexible, high-speed transceivers
• Can be configured to connect with different physical layers
  • InfiniBand, GigE, FC, 10GigE, Aurora
• Note: low-level interface (commas, disparity, clock mismatches), illustrated below
[Figure: each Rocket I/O block sits between the FPGA fabric and a differential pin pair; Tx path: CRC, 8B/10B encoder, Tx FIFO, serializer; Rx path: deserializer, clock recovery, elastic buffer, 8B/10B decoder, CRC check]
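One concrete consequence of that low-level interface: the receiver sees only an unframed bit stream until it aligns on a K28.5 comma character. Rocket I/O performs this alignment in hardware; the sketch below merely illustrates the idea in C, scanning for the standard 7-bit comma patterns (0011111 and 1100000) that 8B/10B coding guarantees occur only at the start of a 10-bit symbol.

    /* Illustration of comma-based word alignment (done in hardware by
     * the transceiver). Finding a comma pattern fixes the 10-bit
     * symbol boundary within the recovered serial stream. */
    #include <stdio.h>
    #include <string.h>

    static const char comma_neg[] = "0011111";   /* K28.5 prefix, RD- */
    static const char comma_pos[] = "1100000";   /* K28.5 prefix, RD+ */

    /* Return the bit offset of the 10-bit symbol boundary, or -1. */
    static int find_alignment(const char *bits)
    {
        size_t n = strlen(bits);
        for (size_t i = 0; i + 7 <= n; i++) {
            if (!strncmp(bits + i, comma_neg, 7) ||
                !strncmp(bits + i, comma_pos, 7))
                return (int)(i % 10);
        }
        return -1;   /* no comma seen yet: stay unaligned */
    }

    int main(void)
    {
        /* Three garbage bits, then a K28.5 symbol (RD-): 0011111010 */
        const char *stream = "1010011111010";
        printf("symbol boundary at bit offset %d\n",
               find_alignment(stream));
        return 0;
    }

Running disparity tracking and elastic buffering for clock mismatches are further chores the transceiver handles; the slide's point is that the user logic above the MGT must still speak this vocabulary.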
Why MGTs Are Important
• Direct connection to networks
  • Same chip, different network
  • Removes PCI from the equation
• Fast connections between FPGAs
  • Reduces analog design issues
  • Chain FPGAs together
  • Reduce pin count
• Update: Virtex-II Pro X
  • Now 2.488 Gbps to 10.3125 Gbps
  • Chips have either 8 or 20 transceivers
[Figure: eye diagram, 3.125 Gbps over 44" of FR4; from Xilinx, http://www.xilinx.com/products/virtex2pro/mgtcharacter.htm]
Hard PowerPC Core
• PowerPC 405
  • 16KB instruction / 16KB data caches
  • Real and virtual memory modes
  • GCC is available
• Multiple memory ports for the core
  • On-chip memory (OCM)
  • Processor Local Bus (PLB)
• User-defined memory map (sketch below)
  • Connect memory blocks or cores
  • External memory cores available
[Figure: PowerPC with I-cache and D-cache attached to the Processor Local Bus (PLB) and the On-Chip Memory (OCM) interface]
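A hedged sketch of what a user-defined memory map means for software: code on the PPC405 reaches a custom core through ordinary loads and stores at whatever bus address the designer assigned it. The base address and register layout below are invented for illustration, not taken from any real design.

    /* Hypothetical memory-mapped user core on the PLB. The address
     * and register layout are assumptions; a real design chooses them
     * when wiring the core into the memory map. */
    #include <stdint.h>

    #define USER_CORE_BASE 0xC0000000u   /* hypothetical bus address */
    #define REG_DATA   (*(volatile uint32_t *)(USER_CORE_BASE + 0x0))
    #define REG_CTRL   (*(volatile uint32_t *)(USER_CORE_BASE + 0x4))
    #define REG_STATUS (*(volatile uint32_t *)(USER_CORE_BASE + 0x8))

    /* Hand an operand to the core, start it, and wait for the result. */
    uint32_t run_user_core(uint32_t operand)
    {
        REG_DATA = operand;              /* write the operand          */
        REG_CTRL = 1;                    /* kick off the computation   */
        while ((REG_STATUS & 1) == 0)
            ;                            /* spin on the done bit       */
        return REG_DATA;                 /* read back the result       */
    }

The volatile qualifier is the essential detail: it forces each access to actually reach the bus instead of being optimized away or reordered by the compiler.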
System on a Chip (SoC)
• Designing with cores
  • Customize the system
• New tools rapidly connect cores
  • Library of cores & buses
  • Saves on wiring legwork
[Figure: a commercial SoC design assembled in Xilinx Platform Studio]
Current Status
• Exploring V2P
  • New architecture, new tools
• Two reference boards
  • ML300 (V2P7-6)
  • Avnet (V2P20-6)
• Transceiver work
  • Raw transmission over fiber
  • Working towards IB
http://cdulmer.ran.sandia.gov