Reconfigurable Computing: A First Look at the Cray-XD1. Craig Ulmer. Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger Orgs: 8961 & 8963. September 1, 2004. Outline. Reconfigurable computing refresher Progress update Cray XD1 Architecture
Reconfigurable Computing: A First Look at the Cray-XD1
Craig Ulmer
Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger
Orgs: 8961 & 8963
September 1, 2004

Outline
• Reconfigurable computing refresher
• Progress update
• Cray XD1
  • Architecture
  • General message passing
  • Reconfigurable Computing and the XD1

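Reconfigurable Computing
• Use reconfigurable hardware devices to implement key computations in hardware

    double doX(double *a, int n) {
        int i;
        double x;
        x = 0;
        for (i = 0; i < n; i += 3) {
            x += a[i] * a[i+1] + a[i+2];
            …
        }
        …
        return x;
    }

[Figure: dataflow circuit with inputs a[i], a[i+1], and a[i+2], a multiplier (*), an adder (+), and an accumulating adder (+) with a Z^-1 delay element]
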
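For readers who want to run the loop from this slide, here is a minimal C sketch of only the visible part of doX (the elided "…" portions are not reconstructed). The name do_x, the use of size_t, and the choice to ignore trailing elements that do not form a full triple are assumptions made for illustration, not part of the original deck. The comments map each statement to the multiplier, adder, and accumulator named in the figure.

```c
#include <stddef.h>

/* Sums a[i] * a[i+1] + a[i+2] over consecutive triples of a[0..n-1].
 * Assumption: trailing elements that do not form a full triple are ignored,
 * so the loop never reads past the end of the array. */
double do_x(const double *a, size_t n)
{
    double x = 0.0;                          /* accumulator (the Z^-1 feedback) */

    for (size_t i = 0; i + 2 < n; i += 3) {
        double product = a[i] * a[i + 1];    /* multiplier stage */
        double sum = product + a[i + 2];     /* adder stage */
        x += sum;                            /* accumulate stage */
    }
    return x;
}
```

In hardware, the three stages in the loop body can operate as a pipeline on successive triples; the sequential C version only shows what is computed, not how an FPGA would schedule it.
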
First Year Progress
• Computation (Underwood SNL/NM)
  • Double-precision Floating Point Cores
• Communication
  • Multi-gigabit Transceiver (MGT) interface
  • Gigabit Ethernet work
• Early application experiments
  • Simplified isosurfacing
  • Networked pattern matching

Peak Floating-Point Performance
[Figure] From Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” in FPGA’04

Connecting FPGAs to the Network Fabric
• Modern FPGAs feature multi-gigabit transceivers
• Experimented with GigE, Myrinet 2000, and IB
• Implemented TCP Offload Engine (TOE) in hardware
• Working on OpenTOE and OpenGigE cores

[Figure: block diagram of the SNL_OpenTOE and SNL_OpenGigE cores. Labeled blocks: Socket I/F, TCP I/F, Outgoing Data Queue, Incoming Data Queue, IP Header, CRC, MAC Framer, Pad, CRC Gen, Timeout Monitor, SEQ Gen, ACK Monitor, Ping, Ping Reply, ARP, ARP Cache, ARP Reply, Decode, Align, MGT Control, GT_Ethernet_2, Tx and Rx Rocket I/O MGT]

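The slide names a CRC Gen stage in the Gigabit Ethernet path but does not describe it. As a hedged illustration of what such a stage computes, the sketch below is a bit-serial software version of the standard Ethernet CRC-32 (reflected polynomial 0xEDB88320). The function name crc32_ieee is hypothetical, and nothing here reflects how the SNL cores implement the check in hardware, where a wider parallel form would be more typical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-32 as used for the Ethernet frame check sequence.
 * Illustrative software model only; not taken from the SNL cores. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;              /* standard initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                      /* fold in the next byte */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;   /* reflected polynomial */
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;                /* standard final inversion */
}
```

The conventional check value for this CRC is 0xCBF43926 for the nine ASCII bytes "123456789", which is a convenient way to validate any reimplementation.
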
NDA Notice
• We do have an NDA with Cray Canada
• The XD1 we have on loan is an early Beta system

Cray XD1 Overview
• Dense MP system
• 12 AMD Opterons on 6 blades
• 6 Xilinx Virtex-II/Pro FPGAs
• InfiniBand-like interconnect
• 6 SATA hard drives
• 4 PCI-X slots
• 3U Rack

Individual Blade
[Figure: blade block diagram. Components: two Opterons, each with DDR Memory; “Einstein” Chip; two RAP NIs, each attached to a RapidArray Fabric (24 4x IB Ports). Link labels: HT: 6.4 GB/s, “HT”: 3.2 GB/s, HT: 3.2 GB/s, 4x IB: 2 GB/s]
* All data rates are aggregates (i.e., 3.2 GB/s = 1.6 GB/s + 1.6 GB/s)

Message Passing
• MPICH 1.2.5
  • Latency: 2.25 μs
  • Bandwidth: 1.3 GB/s (82% of HT-IB link)
• RapidArray message layer
  • Open source
  • MP, RDMA
  • Global address space

[Figure: MPI Bandwidth. x-axis: Message Size (Bytes); y-axis: Bandwidth (Million Bytes/s); reference levels labeled 1.6GB/s HT and PCI-X 133]

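To relate the latency and bandwidth figures on this slide to message size, here is a minimal sketch using the common linear cost model t(n) = latency + n / bandwidth. The model itself, the function names, and the reading of 1 GB/s as 10^9 bytes/s are assumptions for illustration; the deck reports only the two measured endpoints and a plot.

```c
/* Linear cost model for a point-to-point message (an assumed model,
 * not a measurement): time = latency + bytes / peak bandwidth. */
double transfer_time_s(double bytes, double latency_s, double bandwidth_Bps)
{
    return latency_s + bytes / bandwidth_Bps;
}

/* Bandwidth actually achieved for a message of the given size. */
double effective_bandwidth_Bps(double bytes, double latency_s,
                               double bandwidth_Bps)
{
    return bytes / transfer_time_s(bytes, latency_s, bandwidth_Bps);
}

/* Message size at which the model reaches half of peak bandwidth. */
double half_bandwidth_size_bytes(double latency_s, double bandwidth_Bps)
{
    return latency_s * bandwidth_Bps;
}
```

Plugging in the slide's numbers (2.25e-6 s and 1.3e9 bytes/s), this model puts the half-bandwidth point at 2925 bytes, so messages much smaller than a few kilobytes would be dominated by latency rather than link speed.
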
System Administration
• Active manager
  • Synchronize each node’s OS
  • Partition blade functionality
  • Control access rights
• Embedded processor
  • Monitors health (heartbeats)
  • Can restart nodes
• Issues?

Connecting to the “Einstein” Accelerator
[Figure: FPGA block diagram. Labeled blocks: User-defined Circuits; four QDR2 I/F blocks, each with 2MB SRAM; HT I/F; FPGA Port (1.6+1.6GB/s); Fabric Port (1.6+1.6GB/s); RAP NI; Host (HT); Net (IB)]

Example: Random Number Generator
• Monte Carlo app in need of good random numbers
• Mersenne twister
  • Implemented in FPGA
  • FPGA pushes to host memory
• 301 vs 101 Million Integers/s
  • ~1.2 GB/s

[Figure: block diagram with Host Memory, CPU, NI, and an FPGA containing the RNG]

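The slide does not say which Mersenne twister variant was placed in the FPGA. As a software reference point, the sketch below implements the widely used 32-bit MT19937 generator; the type and function names (mt19937_t, mt_seed, mt_next) are hypothetical, and the code is not derived from the FPGA design.

```c
#include <stdint.h>

#define MT_N 624   /* state size in 32-bit words */
#define MT_M 397   /* offset used by the twist recurrence */

typedef struct {
    uint32_t state[MT_N];
    int index;             /* next state word to temper and return */
} mt19937_t;

/* Fill the state from a 32-bit seed using the standard initializer. */
void mt_seed(mt19937_t *mt, uint32_t seed)
{
    mt->state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = mt->state[i - 1];
        mt->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    mt->index = MT_N;      /* force a twist on the first draw */
}

/* Return the next 32-bit output. */
uint32_t mt_next(mt19937_t *mt)
{
    if (mt->index >= MT_N) {
        /* Twist: regenerate all 624 state words in place. */
        for (int i = 0; i < MT_N; i++) {
            uint32_t y = (mt->state[i] & 0x80000000u)
                       | (mt->state[(i + 1) % MT_N] & 0x7FFFFFFFu);
            uint32_t next = mt->state[(i + MT_M) % MT_N] ^ (y >> 1);
            if (y & 1u)
                next ^= 0x9908B0DFu;
            mt->state[i] = next;
        }
        mt->index = 0;
    }

    /* Tempering improves the distribution of the raw state word. */
    uint32_t y = mt->state[mt->index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9D2C5680u;
    y ^= (y << 15) & 0xEFC60000u;
    y ^= y >> 18;
    return y;
}
```

Seeding with the conventional default 5489 and calling mt_next once yields 3499211612, which is a handy check against other MT19937 implementations. If the integers on the slide are 32 bits wide (an assumption), 301 million of them per second is about 1.2 GB/s, which matches the rate quoted in the last bullet.
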
General XD1 Comments
Columns: Good | Not-so-Good
• Reconfigurable computing
• FPGA in memory
• Fast local memory
• Other accelerators
• ClearSpeed
• Global address space
• Opteron limits (40b PA)
• Vendor lock-in
• Incompatible network
• All-in-one box?
• Current NI is a bottleneck
• Density vs. Reliability
• Value-added features

Friendly Users?
• We have a month left on evaluation
• Could use feedback from other users
http://cdulmer.ran.sandia.gov/xd1
cdulmer@sandia.gov