A Parameterizable FPGA Prototype of a Vector-Thread Processor
Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
SCALE Vector-Thread Processor

[Figure: block diagram with Control Proc, Vector Execution Unit (Lane 0 to Lane 3), Stride, four SEG units, and VRU (Throttle Logic, Refill Unit)]

Key Features
• 4 lanes, 4 clusters
• Cluster for indexed accesses
• 4 segment address generators
• 4 VLDQs
• VRU includes throttle logic, refill address generator
SCALE Cache

[Figure: block diagram with Cache Arbiter and Crossbar, four banks (each with Seg Buf, Tags, Data, MSHR), and Memory Port Arbiter and Crossbar]

Key Features
• Two-cycle hit latency
• Four 8 KB banks
• 32-way associative
• 32 B cache lines
• 16 B/cycle per bank
• Four 16 B segment buffers per bank
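The cache key features imply a few derived figures. The Python sketch below works them out from the listed parameters. The derived quantities (lines per bank, sets per bank, aggregate capacity and bandwidth) are computed here rather than stated on the slide, and all names are illustrative.

```python
# Cache parameters as listed in the SCALE Cache key features.
NUM_BANKS = 4
BANK_BYTES = 8 * 1024        # 8 KB per bank
WAYS = 32                    # 32-way associative
LINE_BYTES = 32              # 32 B cache lines
BANK_BYTES_PER_CYCLE = 16    # 16 B/cycle per bank


def cache_geometry():
    """Derive capacity, line/set counts, and peak bandwidth from the parameters."""
    lines_per_bank = BANK_BYTES // LINE_BYTES
    sets_per_bank = lines_per_bank // WAYS
    return {
        "total_bytes": NUM_BANKS * BANK_BYTES,
        "lines_per_bank": lines_per_bank,
        "sets_per_bank": sets_per_bank,
        "peak_bytes_per_cycle": NUM_BANKS * BANK_BYTES_PER_CYCLE,
    }


if __name__ == "__main__":
    for name, value in cache_geometry().items():
        print(f"{name}: {value}")
```

Under these parameters each bank holds 256 lines in 8 sets, and the four banks together give 32 KB of capacity and a peak of 64 B per cycle.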
SCALE Prototype Chip

[Figure: chip floorplan, 4 mm × 2.5 mm, showing the Control Processor (ctrl, CP0, PC, RF, ALU, shftr, byp, L/S, MD), Mult/Div, four 8 KB Cache Banks, Cache Tags, Memory Interface / Cache Control, Crossbar, Memory Unit, and four Lanes of Clusters (ctrl, IQC, RF, ALU, shftr, mux/latch, LDQ)]

• Prototype SCALE processor in development
• Control processor: MIPS, 1 instr/cycle
• VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
• Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
• DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s: 3.2 GB/s)
• Estimated 10 mm² in 0.18 μm, 400 MHz (25 FO4)
• Cycle-level execution-driven C++ microarchitectural simulator
• Detailed VTU and memory system model
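The DRAM bandwidth figure above follows from the bus width and the double data rate. The Python sketch below reproduces that arithmetic and, as an additional derived figure not stated on the slide, the peak cache bandwidth, assuming the cache delivers 4x128b every cycle at the estimated 400 MHz clock. The function names are illustrative.

```python
def ddr_bandwidth_bytes_per_s(bus_bits, clock_hz, transfers_per_clock=2):
    """Peak DDR bandwidth: two transfers per clock on every data pin."""
    return bus_bits * clock_hz * transfers_per_clock // 8


def cache_bandwidth_bytes_per_s(banks, bits_per_bank, clock_hz):
    """Peak cache bandwidth, assuming every bank delivers data each cycle."""
    return banks * bits_per_bank * clock_hz // 8


if __name__ == "__main__":
    # 64-bit DDR2 at 200 MHz, i.e. 400 Mb/s per pin.
    dram = ddr_bandwidth_bytes_per_s(bus_bits=64, clock_hz=200_000_000)
    # 4 banks x 128 bits per cycle; the 400 MHz cache clock is an assumption.
    cache = cache_bandwidth_bytes_per_s(banks=4, bits_per_bank=128,
                                        clock_hz=400_000_000)
    print(f"DRAM peak:  {dram / 1e9:.1f} GB/s")
    print(f"Cache peak: {cache / 1e9:.1f} GB/s")
```

Under that assumption the cache can supply eight times the peak DRAM bandwidth.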
SCALE Prototype Board
• Single Xilinx Virtex-II FPGA
• Configured via direct JTAG connection or SystemACE
• Multiple memory chips
  • Six Micron DDR2 SDRAMs
  • Two Micron Mobile SDRAMs
  • One Micron RLDRAM
  • One Samsung SRAM
• Two logic analyzer connections
• Multiple separate power islands
• Attached to custom test baseboard
• Sixteen independently measurable power supplies
• Byte-serial connection to a Linux PC
Module Placement
• Reduce the risk of the final custom chip implementation
• Allow early rapid prototyping of many of the system interactions
• Provide a parameterizable prototype for architectural experiments
Status
• Completed work
  • Single-issue, seven-stage pipeline MIPS processor core
    • Mapped to the board and passes our MIPS verification test suite
    • Will form the SCALE control processor
  • DDR2 memory controllers
    • Tested in isolation using simple memory traffic generators
• Work in progress
  • Cache subsystem
  • Vector-thread unit
Advantages of Using an FPGA
• Rapid full-system simulation of a large variety of designs
  • Allows extensive characterization of the design space
• Parameterization allows exploration of various tradeoffs
  • Cache parameters and replacement policies
  • Prefetch strategies
  • DRAM access scheduling policies and power-down modes
  • DRAM types (e.g., DDR2 vs. Mobile DRAM)
• Fast emulation system for SCALE software development
• Allows thorough debugging before going to silicon
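To illustrate what exploring such a parameter space can look like, the Python sketch below enumerates every combination of a few of the parameters listed above. It is only a sketch: the class, field names, and candidate values are hypothetical and do not come from the SCALE prototype.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PrototypeConfig:
    """One point in a hypothetical design space (all fields are illustrative)."""
    cache_ways: int
    replacement: str
    prefetch: str
    dram_type: str


def enumerate_configs(ways, replacements, prefetchers, dram_types):
    """Return every combination of the given parameter choices."""
    return [
        PrototypeConfig(*combo)
        for combo in product(ways, replacements, prefetchers, dram_types)
    ]


# Hypothetical candidate values, chosen only for illustration.
configs = enumerate_configs(
    ways=(16, 32),
    replacements=("lru", "random"),
    prefetchers=("none", "next_line"),
    dram_types=("ddr2", "mobile_sdram"),
)

if __name__ == "__main__":
    print(f"{len(configs)} configurations to evaluate")
    print(configs[0])
```

Each configuration would correspond to one FPGA build or run. The number of points grows multiplicatively with each added parameter, which is where fast emulation pays off.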