RAMP Blue: A Message Passing Multi-Processor System on the BEE2 • Andrew Schultz and Alex Krasnov • P.I. John Wawrzynek
Introduction • RAMP Blue is an initial design driver for the RAMP project, with the goal of building an extremely large (~1000 node) multiprocessor system on a cluster of BEE2 modules using soft-core processors • The goal of RAMP Blue is to experiment with and learn lessons about building large-scale computer architectures on FPGA platforms for emulation, not performance • The current system has 256 cores on an 8-module cluster running full parallel benchmarks, with an easy upgrade to 512 cores on 16 modules once boards are available
RDF: RAMP Design Framework • All designs implemented within RAMP are known as the target system and must follow restrictions defined by the RDF • All designs are composed of units with well-defined ports that communicate with each other over uni-directional, point-to-point channels • Units can be as simple as a single logic gate, but are more often larger blocks such as a CPU core or cache • Timing between units is completely decoupled by the channel (see the software sketch below)
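The sketch below is a minimal software model of the unit/channel idea, assuming a simple bounded FIFO; the real RDF channels are implemented in gateware, and every name and size here is illustrative rather than part of the RDF/RDL tooling.

```c
/* Minimal software model of an RDF-style channel: a bounded,
 * unidirectional FIFO between two units whose timing is decoupled.
 * Names and sizes are illustrative, not part of the real RDF/RDL tooling. */
#include <stdint.h>
#include <stdio.h>

#define CHAN_DEPTH 4

typedef struct {
    uint32_t buf[CHAN_DEPTH];
    unsigned head, tail, count;
} channel_t;

/* Sender side: returns 0 when the channel is full (back-pressure). */
static int chan_try_send(channel_t *c, uint32_t msg) {
    if (c->count == CHAN_DEPTH) return 0;
    c->buf[c->tail] = msg;
    c->tail = (c->tail + 1) % CHAN_DEPTH;
    c->count++;
    return 1;
}

/* Receiver side: returns 0 when the channel is empty. */
static int chan_try_recv(channel_t *c, uint32_t *msg) {
    if (c->count == 0) return 0;
    *msg = c->buf[c->head];
    c->head = (c->head + 1) % CHAN_DEPTH;
    c->count--;
    return 1;
}

int main(void) {
    channel_t ch = { .count = 0 };
    uint32_t next = 0, got;

    /* Step the two "units" independently: the producer stalls on a full
     * channel and the consumer on an empty one, so neither depends on
     * the other's cycle-level timing. */
    for (int cycle = 0; cycle < 20; cycle++) {
        if (chan_try_send(&ch, next)) next++;
        if (cycle % 3 == 0 && chan_try_recv(&ch, &got))   /* slower consumer */
            printf("cycle %2d: received %u\n", cycle, (unsigned)got);
    }
    return 0;
}
```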
RAMP Blue Goals • RAMP Blue is a sibling to the other RAMP design driver projects: 1) RAMP Red: Port of an existing transactional cache system to FPGA PowerPC cores 2) RAMP Blue: Message passing multiprocessor system built from an existing, FPGA-optimized soft core (MicroBlaze) 3) RAMP White: Cache coherent multiprocessor system with a full-featured soft core • Blue is also intended to run off-the-shelf, message-passing scientific codes and benchmarks (providing existing tests and a basis for comparison) • The main goal is to fit as many cores as possible in the system and to be able to reliably run code and change system parameters
RAMP Blue Requirements • Built from existing tools (RDL was not available at the time) but fits RDF guidelines for future integration • Required design and implementation of gateware and software to run MicroBlaze with uClinux on BEE2 modules • Sharing of the DDR2 memory system • Communication and bootstrapping from the user FPGAs • Debugging and control from the control FPGA • New on-chip network for MicroBlaze-to-MicroBlaze communication • Communication on-chip, FPGA to FPGA on a board, and board to board • Completely new double-precision floating point unit for scientific codes
MicroBlaze Characteristics • 3-stage, RISC-like architecture designed for implementation on FPGAs • Takes advantage of FPGA-specific features (e.g. fast carry chains) and works around FPGA shortcomings (e.g. lack of CAMs for the cache) • Maximum clock rate of 100 MHz (~0.5 MIPS/MHz) on Virtex-II Pro FPGAs • Split, direct-mapped I- and D-caches with configurable size • Fast hardware multiplier/divider, optional hardware barrel shifter • Configurable hardware debugging support (watchpoints/breakpoints) • Several peripheral interface bus options • GCC tool chain support and the ability to run uClinux
Memory System • Requires sharing the memory channel among a configurable number of MicroBlaze cores • No coherence: each DIMM is statically partitioned among its cores, and bank management keeps the cores from interfering with one another (see the sketch below)
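A minimal sketch of the static partitioning arithmetic, assuming a 1 GB DIMM shared by four cores; the base address, slice layout, and function name are assumptions for illustration, not the actual BEE2 memory map.

```c
/* Sketch of statically partitioning one DDR2 DIMM among the MicroBlaze
 * cores that share its memory channel.  The base address and core count
 * are illustrative assumptions, not the actual BEE2 memory map. */
#include <stdint.h>
#include <stdio.h>

#define DIMM_BASE   0x00000000u          /* assumed physical base        */
#define DIMM_SIZE   (1024u << 20)        /* 1 GB DIMM                    */
#define NUM_CORES   4u                   /* cores sharing this DIMM      */

/* Each core gets a private, contiguous slice; with no coherence, the
 * slices must not overlap. */
static uint32_t core_mem_base(uint32_t core_id) {
    uint32_t slice = DIMM_SIZE / NUM_CORES;      /* 256 MB per core */
    return DIMM_BASE + core_id * slice;
}

int main(void) {
    for (uint32_t id = 0; id < NUM_CORES; id++)
        printf("core %u: base 0x%08x, size %u MB\n",
               (unsigned)id, (unsigned)core_mem_base(id),
               (unsigned)((DIMM_SIZE / NUM_CORES) >> 20));
    return 0;
}
```

With 8 cores and two 1 GB DIMMs per user FPGA, four cores share each DIMM, which matches the 256 MB/node figure noted under Software.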
Control Communication • A communication channel from the control PowerPC to each individual MicroBlaze is required for bootstrapping and debugging • Gateware provides a general-purpose, low-speed network • Software provides character and Ethernet abstractions on top of the channel • The kernel is sent over the channel and file systems can be mounted • A console channel allows debugging messages and control
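A hedged sketch of how console, Ethernet, and boot traffic might be multiplexed over the low-speed control network; the frame layout, field sizes, and function names are assumptions for illustration, not the actual gateware/driver interface.

```c
/* Sketch of multiplexing console, Ethernet, and boot traffic over the
 * low-speed control network between the control PowerPC and the
 * MicroBlaze cores.  Layout and names are assumptions, not the real
 * interface. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum ctrl_type { CTRL_CONSOLE = 1, CTRL_ETHERNET = 2, CTRL_BOOT = 3 };

struct ctrl_frame {
    uint8_t  type;        /* console bytes, Ethernet frame, or boot data */
    uint8_t  dest_core;   /* which MicroBlaze on the user FPGA           */
    uint16_t len;         /* payload length in bytes                     */
    uint8_t  payload[256];
};

/* Build a console frame carrying a debug message for one core. */
static size_t make_console_frame(struct ctrl_frame *f, uint8_t core,
                                 const char *msg) {
    size_t n = strlen(msg);
    if (n > sizeof f->payload) n = sizeof f->payload;
    f->type = CTRL_CONSOLE;
    f->dest_core = core;
    f->len = (uint16_t)n;
    memcpy(f->payload, msg, n);
    return offsetof(struct ctrl_frame, payload) + n;
}

int main(void) {
    struct ctrl_frame f;
    size_t n = make_console_frame(&f, 3, "booting uClinux...\n");
    printf("frame: type=%u core=%u total=%zu bytes\n",
           (unsigned)f.type, (unsigned)f.dest_core, n);
    return 0;
}
```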
Double Precision FPU • Due to the size of the FPU, sharing is crucial to meeting the resource budget • The shared FPU works much like reservation stations in a microarchitecture, with each MicroBlaze issuing instructions to it (see the sketch below)
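A minimal software model of the reservation-station-style sharing, assuming requests are tagged with a core ID and results are routed back by that tag; the structure and names are illustrative, not the real gateware.

```c
/* Software model of shared-FPU arbitration: requests from several
 * MicroBlaze cores are tagged with a core ID, and results come back
 * carrying the same tag.  Names and structure are illustrative. */
#include <stdint.h>
#include <stdio.h>

enum fpu_op { FPU_ADD, FPU_MUL, FPU_DIV };

struct fpu_req {
    uint8_t core_id;   /* tag identifying the issuing core */
    enum fpu_op op;
    double a, b;
};

struct fpu_resp {
    uint8_t core_id;   /* result is routed back to the core by this tag */
    double  result;
};

/* One service step of the shared FPU: take a request, compute, and tag
 * the response with the requester's ID. */
static struct fpu_resp fpu_service(const struct fpu_req *r) {
    struct fpu_resp out = { .core_id = r->core_id };
    switch (r->op) {
    case FPU_ADD: out.result = r->a + r->b; break;
    case FPU_MUL: out.result = r->a * r->b; break;
    case FPU_DIV: out.result = r->a / r->b; break;
    }
    return out;
}

int main(void) {
    struct fpu_req reqs[] = {             /* two cores sharing one FPU */
        { .core_id = 0, .op = FPU_MUL, .a = 3.0, .b = 4.0 },
        { .core_id = 1, .op = FPU_ADD, .a = 1.5, .b = 2.5 },
    };
    for (unsigned i = 0; i < 2; i++) {
        struct fpu_resp r = fpu_service(&reqs[i]);
        printf("core %u gets %.1f\n", (unsigned)r.core_id, r.result);
    }
    return 0;
}
```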
Network Characteristics • The interconnect must fit within the RDF model • The network interface uses simple FSL channels, currently PIO but could be DMA • Source routing between nodes (non-adaptive, link-failure intolerant) • The only links that could physically fail are the board-to-board XAUI links • The interconnect topology is a full crossbar on chip with an all-to-all connection of board-to-board links • The longest path between nodes is four on-board links and one off-board link • Encapsulated Ethernet packets with source routing information prepended (see the sketch below) • Virtual cut-through flow control with virtual channels for deadlock avoidance
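A sketch of the packet format this implies, assuming a small source-route header of output-port hops prepended to the encapsulated Ethernet frame; the field widths and names are assumptions.

```c
/* Sketch of the packet format implied above: a source-route header (a
 * list of output-port hops) prepended to an encapsulated Ethernet
 * frame.  Field widths are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_HOPS 8   /* enough for four on-board links plus one XAUI link */

struct route_header {
    uint8_t num_hops;         /* hops remaining on the source route       */
    uint8_t hops[MAX_HOPS];   /* output port to take at each switch       */
    uint8_t vc;               /* virtual channel, for deadlock avoidance  */
};

struct net_packet {
    struct route_header route;   /* consumed hop by hop by the switches   */
    uint16_t eth_len;            /* length of the encapsulated frame      */
    uint8_t  eth_frame[1518];    /* standard Ethernet frame               */
};

/* Each switch pops the next hop and forwards the packet on that port. */
static uint8_t next_hop(struct net_packet *p) {
    uint8_t port = p->route.hops[0];
    memmove(p->route.hops, p->route.hops + 1, MAX_HOPS - 1);
    p->route.num_hops--;
    return port;
}

int main(void) {
    struct net_packet pkt = { .route = { .num_hops = 3,
                                         .hops = { 2, 0, 5 }, .vc = 1 } };
    while (pkt.route.num_hops)
        printf("forward on port %u\n", (unsigned)next_hop(&pkt));
    return 0;
}
```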
Multiple Cores • Scaling up to multiple cores per FPGA is primarily constrained by resources • The current evaluation cluster implements 8 cores/FPGA using roughly 85% of the slices (but only slightly more than half of the LUTs/FFs) • Sixteen cores fit on each FPGA without the infrastructure (switch, memory, etc.); 10-12 is the maximum with it, depending on options • Options include hardware accelerators, cache size, FPU timing, etc.
Test Cluster • Sixteen BEE2 modules, each with four user FPGAs, 8 cores per user FPGA, and two 1 GB DDR2 DIMMs per user FPGA • The overall cluster has 512 cores; by scaling up to 12 cores per FPGA and also utilizing the control FPGA, it is realistic to achieve 960 cores in the cluster
Software • Each node in the cluster boots its own copy of uClinux and mounts a file system from an external NFS server • The Unified Parallel C (UPC) parallel programming framework was ported to uClinux • The main porting effort with UPC is normally adapting its transport layer, GASNet, to the target network, but this was avoided by using the existing GASNet UDP transport • Floating point integration is achieved by modifying the GCC SoftFPU backend to emit code that interacts with the shared FPU (see the sketch below) • The UPC NAS Parallel Benchmarks run on the cluster • Only class "S" benchmarks can be run due to the limited memory (256 MB/node)
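The sketch below only illustrates the idea behind the SoftFPU backend change: routing a double-precision operation to the shared hardware FPU instead of computing it in software. The fsl_put/fsl_get names, the opcode, and the word ordering are hypothetical, and the FPU is stubbed in software so the example runs stand-alone; the real change lives in the GCC backend and the MicroBlaze FSL interface.

```c
/* Sketch of the SoftFPU-backend idea: a double-precision add is handed
 * to the shared hardware FPU instead of being computed in software.
 * fsl_put/fsl_get, the opcode, and the word ordering are hypothetical;
 * the FPU is stubbed with a small software queue so this runs alone. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HWFPU_OP_ADD 0u

/* Stand-ins for the MicroBlaze FSL put/get instructions. */
static uint32_t q[8];
static unsigned q_head, q_tail;
static void     fsl_put(uint32_t w) { q[q_tail++ % 8] = w; }
static uint32_t fsl_get(void)       { return q[q_head++ % 8]; }

/* Pretend FPU: consume the request words, compute, push the result. */
static void fake_fpu_step(void) {
    uint32_t op  = fsl_get();                /* only ADD handled here */
    uint32_t ahi = fsl_get();
    uint32_t alo = fsl_get();
    uint32_t bhi = fsl_get();
    uint32_t blo = fsl_get();
    uint64_t abits = ((uint64_t)ahi << 32) | alo;
    uint64_t bbits = ((uint64_t)bhi << 32) | blo;
    double a, b, r;
    uint64_t rbits;
    (void)op;
    memcpy(&a, &abits, sizeof a);
    memcpy(&b, &bbits, sizeof b);
    r = a + b;
    memcpy(&rbits, &r, sizeof rbits);
    fsl_put((uint32_t)(rbits >> 32));
    fsl_put((uint32_t)rbits);
}

/* Helper the compiler would call instead of the pure-software routine. */
double hwfpu_add_double(double a, double b) {
    uint64_t abits, bbits, rbits;
    memcpy(&abits, &a, sizeof abits);
    memcpy(&bbits, &b, sizeof bbits);

    fsl_put(HWFPU_OP_ADD);                   /* opcode              */
    fsl_put((uint32_t)(abits >> 32));        /* operand A, hi / lo  */
    fsl_put((uint32_t)abits);
    fsl_put((uint32_t)(bbits >> 32));        /* operand B, hi / lo  */
    fsl_put((uint32_t)bbits);

    fake_fpu_step();     /* in hardware, the shared FPU does this itself */

    uint32_t hi = fsl_get();
    uint32_t lo = fsl_get();
    rbits = ((uint64_t)hi << 32) | lo;

    double result;
    memcpy(&result, &rbits, sizeof result);
    return result;
}

int main(void) {
    printf("1.5 + 2.25 = %g\n", hwfpu_add_double(1.5, 2.25));
    return 0;
}
```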
Performance • Performance is not the key metric for success with RAMP Blue • While improving performance is a secondary goal, the ability to reliably implement a system with a wide range of parameters and meet timing closure within the desired resource budget is the primary goal • Analysis of performance points out bottlenecks for incremental improvement in future RAMP infrastructure designs • Analysis of the node-to-node network shows that software (i.e. the network interface) is the primary bottleneck; finer-grained analysis is forthcoming with the RDL port • For reference, NAS Parallel Benchmark performance numbers were also presented
Implementation Issues • Building such a large design exposed several insidious bugs in both hardware and gateware • MicroBlaze bugs in both the gateware and the GCC toolchain required a good deal of time to track down (race conditions, OS bugs, GCC backend bugs) • Memory errors with large designs on the BEE2 are still not completely understood, but probably stem from noise on the power plane increasing clock jitter • Lack of debugging insight and long recompile times greatly hindered progress • Building a large cluster exposed bugs caused by variation between BEE2 boards • Multiple layers of user control (FPGA, processor, I/O, software) all contribute to uncertainty in operation
Conclusion • RAMP Blue represents the first steps toward developing a robust library of RAMP infrastructure for building more complicated parallel systems • Much of the RAMP Blue gateware is directly applicable to future systems • Many important lessons were learned about required debugging/insight capabilities • New bugs and reliability issues were exposed in the BEE2 platform and gateware, helping to influence future RAMP hardware platforms and the characteristics needed for robust software/gateware infrastructure • RAMP Blue also represents the largest soft-core, FPGA-based computing system ever built and demonstrates the incredible research flexibility such systems allow • The ability to literally tweak the hardware interfaces and parameters provides a "research sandbox" for exciting new possibilities • E.g. add DMA and RDMA to the networking, arbitrarily tweak the network topology, and experiment with system-level paging and coherence, etc.