Reconfigurable Computing: A First Look at the Cray-XD1
Craig Ulmer
Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger
Orgs: 8961 & 8963
September 1, 2004
Outline
• Reconfigurable computing refresher
• Progress update
• Cray XD1
  • Architecture
  • General message passing
  • Reconfigurable computing and the XD1
Reconfigurable Computing
• Use reconfigurable hardware devices to implement key computations in hardware

    double doX(double *a, int n)
    {
        int i;
        double x;

        x = 0;
        for (i = 0; i < n; i += 3) {
            x += a[i] * a[i+1] + a[i+2];
            /* … */
        }
        /* … */
        return x;
    }

[Figure: dataflow pipeline for the loop body — a[i] and a[i+1] feed a multiplier, a[i+2] feeds an adder, and a Z^-1 delay accumulates x]
First Year Progress
• Computation (Underwood, SNL/NM)
  • Double-precision floating-point cores
• Communication
  • Multi-gigabit Transceiver (MGT) interface
  • Gigabit Ethernet work
• Early application experiments
  • Simplified isosurfacing
  • Networked pattern matching
Peak Floating-Point Performance
[Figure: peak floating-point performance trends, from Underwood, "FPGAs vs. CPUs: Trends in Peak Floating-Point Performance," FPGA'04]
Connecting FPGAs to the Network Fabric
• Modern FPGAs feature multi-gigabit transceivers
• Experimented with GigE, Myrinet 2000, and IB
• Implemented a TCP Offload Engine (TOE) in hardware
• Working on OpenTOE and OpenGigE cores (a software checksum reference follows below)

[Figure: block diagrams of the SNL_OpenGigE and SNL_OpenTOE cores — socket and TCP interfaces, incoming/outgoing data queues, IP header and CRC generation, SEQ generation, ACK and timeout monitors, ping and ARP reply logic with ARP cache/decode, MAC framer/pad, and Rx/Tx align over Rocket I/O MGTs (GT_Ethernet_2, MGT control)]
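For reference, the IP/TCP checksums the TOE must generate in hardware are the standard one's-complement Internet checksum. A minimal software version (plain RFC 1071 C, shown only as a reference point — not the core's actual HDL) looks like this:

    #include <stdint.h>
    #include <stddef.h>

    /* RFC 1071 Internet checksum: one's-complement sum of 16-bit words.
     * Software reference only; SNL_OpenTOE produces the equivalent
     * result with dedicated hardware logic. */
    uint16_t internet_checksum(const void *data, size_t len)
    {
        const uint16_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {                 /* sum 16-bit words */
            sum += *p++;
            len -= 2;
        }
        if (len)                          /* pad a trailing odd byte */
            sum += *(const uint8_t *)p;
        while (sum >> 16)                 /* fold carries back in */
            sum = (sum & 0xFFFFu) + (sum >> 16);
        return (uint16_t)~sum;
    }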
NDA Notice
• We have an NDA with Cray Canada
• The XD1 we have on loan is an early beta system
Cray XD1 Overview
• Dense MP system
• 12 AMD Opterons on 6 blades
• 6 Xilinx Virtex-II Pro FPGAs
• InfiniBand-like interconnect
• 6 SATA hard drives
• 4 PCI-X slots
• 3U rack
Individual Blade
[Figure: blade architecture — two Opterons, each with DDR memory, joined by 6.4 GB/s HT; 3.2 GB/s HT links (one routed through the "Einstein" chip) connect to two RAP NIs, and each RAP NI reaches the RapidArray fabric (24 4x IB ports) over a 2 GB/s 4x IB link. All data rates are aggregates (i.e., 3.2 GB/s = 1.6 GB/s + 1.6 GB/s).]
Message Passing
• MPICH 1.2.5
  • Latency: 2.25 μs
  • Bandwidth: 1.3 GB/s (82% of the HT-IB link)
• RapidArray message layer
  • Open source
  • Message passing, RDMA
  • Global address space

[Figure: MPI bandwidth (million bytes/s) vs. message size (bytes), plotted against the 1.6 GB/s HT limit and PCI-X 133]
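Figures like these are typically gathered with a ping-pong microbenchmark; a minimal sketch of such a test in generic MPI-1 C (not Cray-specific code — buffer size and iteration count are arbitrary choices):

    #include <mpi.h>
    #include <stdio.h>

    /* Two-rank ping-pong: half the round-trip time of a small message
     * approximates latency; a large message approximates bandwidth. */
    int main(int argc, char **argv)
    {
        enum { LEN = 1 << 20, ITERS = 100 };
        static char buf[LEN];
        int rank, i;
        MPI_Status st;
        double t0, dt;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        dt = MPI_Wtime() - t0;

        if (rank == 0)   /* two transfers per iteration */
            printf("%.1f MB/s\n", 2.0 * LEN * ITERS / dt / 1e6);

        MPI_Finalize();
        return 0;
    }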
System Administration
• Active manager
  • Synchronize each node's OS
  • Partition blade functionality
  • Control access rights
• Embedded processor
  • Monitors health (heartbeats)
  • Can restart nodes
• Issues?
Connecting to the "Einstein" Accelerator
[Figure: the FPGA's HT interface connects through the RAP NI to the host port (1.6+1.6 GB/s) and to the fabric port toward the IB network (1.6+1.6 GB/s); user-defined circuits in the FPGA also reach four QDR2 SRAM interfaces, each with 2 MB of SRAM]
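How user code reaches these interfaces is Cray-specific (and covered by the NDA above), but memory-mapped access from the Opteron is the natural model. A purely hypothetical sketch — the device path and register offset below are made up for illustration, not real XD1 interfaces:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FPGA_DEV  "/dev/rapidarray_fpga0"  /* placeholder path   */
    #define APERTURE  4096                     /* one page of CSRs   */
    #define REG_CTRL  0x00                     /* placeholder offset */

    int main(void)
    {
        int fd = open(FPGA_DEV, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Map the FPGA's control/status aperture into user space. */
        volatile uint32_t *regs = mmap(NULL, APERTURE,
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        regs[REG_CTRL / 4] = 1;   /* e.g., start the user circuit */

        munmap((void *)regs, APERTURE);
        close(fd);
        return 0;
    }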
Example: Random Number Generator
• Monte Carlo app in need of good random numbers
• Mersenne Twister, implemented in the FPGA
• FPGA pushes values to host memory
  • 301 vs. 101 million integers/s
  • ~1.2 GB/s

[Figure: the FPGA-resident RNG pushes random numbers through the NI into host memory for the CPU]
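The recurrence the hardware evaluates is standard MT19937. For comparison, a compact software version in C (the kind of generator behind a CPU number like 101 M integers/s, though our exact baseline implementation may differ):

    #include <stdint.h>

    /* MT19937 (Matsumoto & Nishimura). The FPGA core evaluates the
     * same recurrence but streams results to host memory over HT. */
    #define N 624
    #define M 397

    static uint32_t mt[N];
    static int mti = N;

    void mt_seed(uint32_t s)
    {
        mt[0] = s;
        for (mti = 1; mti < N; mti++)
            mt[mti] = 1812433253u * (mt[mti-1] ^ (mt[mti-1] >> 30)) + mti;
    }

    uint32_t mt_next(void)
    {
        uint32_t y;
        int i;

        if (mti >= N) {           /* regenerate all 624 state words */
            for (i = 0; i < N; i++) {
                y = (mt[i] & 0x80000000u) | (mt[(i+1) % N] & 0x7fffffffu);
                mt[i] = mt[(i+M) % N] ^ (y >> 1)
                        ^ ((y & 1u) ? 0x9908b0dfu : 0u);
            }
            mti = 0;
        }
        y = mt[mti++];
        y ^= y >> 11;             /* tempering */
        y ^= (y << 7)  & 0x9d2c5680u;
        y ^= (y << 15) & 0xefc60000u;
        y ^= y >> 18;
        return y;
    }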
General XD1 Comments

Good:
• Reconfigurable computing
  • FPGA in memory
  • Fast local memory
• Other accelerators (ClearSpeed)
• Global address space
• Value-added features

Not-so-Good:
• Opteron limits (40b PA)
• Vendor lock-in
  • Incompatible network
• All-in-one box?
  • Density vs. reliability
• Current NI is a bottleneck
Friendly Users?
• We have a month left on the evaluation
• Could use feedback from other users

http://cdulmer.ran.sandia.gov/xd1
cdulmer@sandia.gov