Reconfigurable Computing • Jongeun Lee • Fall 2013
Part I: FPGA
FPGA • Flexibility + parallelism • Spatial computing • FPGA mapping flow
Logic elements of an FPGA • Look-up table (LUT) • 3-LUT, 4-LUT, ... • Logic block (function block)
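A k-input look-up table can be pictured as a tiny memory with 2^k one-bit entries, addressed by the input signals. The following is a minimal software sketch of that idea, not a model of any vendor's logic element; `make_lut` and the choice of treating the first input as the most significant address bit are assumptions made for illustration.

```python
def make_lut(k, config_bits):
    """Build a k-input LUT. config_bits[i] is the output for input pattern i."""
    if len(config_bits) != 2 ** k:
        raise ValueError("a k-LUT needs exactly 2**k configuration bits")

    def lut(*inputs):
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs, got {len(inputs)}")
        index = 0
        for bit in inputs:  # assumption: first input is the most significant address bit
            index = (index << 1) | (bit & 1)
        return config_bits[index]

    return lut


# The same 3-LUT "hardware" implements different functions
# purely by changing its 8 configuration bits.
majority = make_lut(3, [0, 0, 0, 1, 0, 1, 1, 1])
xor3 = make_lut(3, [0, 1, 1, 0, 1, 0, 0, 1])

print(majority(1, 1, 0), xor3(1, 1, 0))
```

Reprogramming the table contents is all it takes to change the function, which is where the flexibility of LUT-based logic blocks comes from.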
FPGA architecture • 2D array of logic blocks
Interconnect • Nearest neighbor • More complex routing structure • Connection block • Switch box
More efficient interconnects • Longer-length wires • Hierarchical
Extended logic elements • Fast carry chain • E.g., a simple 4-bit adder built from full adders • Multiplier • RAM • Processor blocks
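To see why a dedicated carry chain matters, it helps to look at a 4-bit ripple-carry adder: each bit position is a full adder, and the carry has to travel through all four positions in sequence. The sketch below is a plain bit-level model in Python (the function names are illustrative, not from the slides); the `carry` variable in the loop is the signal that a fast carry chain routes on dedicated wires instead of the general interconnect.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add4(x, y, cin=0):
    """Add two 4-bit values bit by bit; returns (4-bit sum, carry_out)."""
    total = 0
    carry = cin
    for i in range(4):  # the carry ripples from bit 0 up to bit 3
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_add4(5, 6))   # (11, 0)
print(ripple_add4(15, 1))  # (0, 1): the sum wraps and the carry out is set
```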
Configuration • SRAM • Fast, unlimited reconfiguration • Leakage power, volatile, large cell; requires extra storage or hardware to program at boot-up • Flash memory • Nonvolatile, smaller cell, lower static power • Limited write-cycle lifetime, slower write speed, requires on-chip charge pumps • Antifuse • Very small, very low propagation delay, no static power, immune to soft errors • One-time programmable
Altera Stratix • Logic architecture • Logic Element (LE) • Logic Array Block (LAB) • 1 LAB = 10 LEs, carry chains, control signals, local interconnect
Altera Stratix • Interconnect • Hierarchical: local (within LAB) + neighboring blocks + general (horizontal/vertical channels) • RAM blocks • M512: 32x18-bit • M4K: 128x36-bit • M-RAM: 4Kx144-bit • Configurable as: 1-port, 2-port, shift register, FIFO, ROM table • May have parity bits, registered inputs/outputs • DSP blocks • One 36x36-bit, four 18x18-bit, or eight 9x9-bit multipliers (+ accumulators)
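The block names make more sense once the capacities are worked out. The short calculation below multiplies out the three organizations listed above; it assumes "4K" means 4096 words and that, when parity is used, one bit in every nine is a parity bit (both are assumptions for this sketch, not statements from the slides).

```python
# (words, width in bits) for each block type, taken from the list above
RAM_BLOCKS = {"M512": (32, 18), "M4K": (128, 36), "M-RAM": (4096, 144)}


def capacity_bits(words, width):
    """Total number of storage bits in the block."""
    return words * width


def data_bits_with_parity(words, width):
    """Data bits left if one of every 9 bits is used for parity (assumption)."""
    return capacity_bits(words, width) * 8 // 9


for name, (words, width) in RAM_BLOCKS.items():
    print(name, capacity_bits(words, width), data_bits_with_parity(words, width))
# M512 576 512
# M4K 4608 4096
# M-RAM 589824 524288
```

Under these assumptions the data capacities come out to 512 bits, 4 Kbit, and 512 Kbit, which lines up with the M512 and M4K names.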
Altera Stratix • Routing architecture • MultiTrack: R4, R8, R24, C4, C8, C16 • Direct connection
Xilinx Virtex-II Pro • Logic architecture • CLB = 4 slices + 2 tri-state buffers • Slice = two 4-LUTs, 2 registers, carry logic, wide-function muxes, gates • RAM: block SelectRAM+ • 18 Kb, true dual-port • Multipliers • 18x18-bit multipliers • CPU: PowerPC 405-D5 • 300 MHz
Xilinx Virtex-II Pro • Routing architecture • Segmented, hierarchical • 24 long lines that span the full height and width of the device • 120 hex lines that route to every third or sixth block away in all four directions • 40 double lines that route to every first or second block away in all four directions • 16 direct connect routes that route to all immediate neighbors • 8 fast-connect lines in each CLB that connect LUT inputs and outputs
Questions • What is the appropriate granularity for the reconfigurable fabric? • Should the reconfigurable fabric be instantiated as a separate coprocessor or integrated as a functional unit?
Terminology • RPF (reconfigurable processing fabric) • Static vs. dynamic • Kernels (= virtual instruction configurations, VICs) • Fine-grained vs. coarse-grained • Tight-coupling vs. loose-coupling
Garp’s nonsymmetrical RPF • Each row has • One control PE • Communicates with external resources (interrupts, memory) • 23 logic PEs • 2-bit granularity • Limited wire network • Configuration • 6,144 bytes (for 32 rows) • 384 words over a 128-bit bus • Partial and dynamic • Compiler
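The configuration figures above are consistent with each other, which a few lines of arithmetic confirm. This is only a back-of-the-envelope check; the per-PE figure additionally assumes that configuration bits are spread evenly over the 24 PEs in a row, which the slide does not state.

```python
ROWS = 32
CONFIG_BYTES = 6144          # full configuration for 32 rows
BUS_BITS = 128               # width of the configuration bus
PES_PER_ROW = 1 + 23         # one control PE plus 23 logic PEs

bus_bytes = BUS_BITS // 8                        # bytes moved per bus word
bus_words = CONFIG_BYTES // bus_bytes            # bus transfers for a full load
bytes_per_row = CONFIG_BYTES // ROWS             # configuration per row
bits_per_pe = bytes_per_row * 8 // PES_PER_ROW   # assumes an even split per PE

print(bus_words, bytes_per_row, bits_per_pe)     # 384 192 64
```

Because configuration is partial, a kernel that uses fewer rows needs proportionally fewer bus transfers (12 per row under these numbers).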
PipeRench • Configuration • Pipelined • Partial and dynamic • Virtual pipeline stages vs. physical pipeline stages • Architecture • Cyclic dependencies allowed within a row (= stage) • Full crossbar between stages • Well suited to stream processing • Programming • Dataflow Intermediate Language (DIL)
Virtual pipeline stages • virtual pipeline stages of an application • light gray blocks -- configuration of pipeline stage • dark gray blocks -- execution • mapping of virtual pipeline stages to physical pipeline stages • shown left: physical pipeline stages • labeled: virtual pipeline stage number
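The mapping of virtual onto physical pipeline stages can be sketched as a small schedule generator. This is a deliberately simplified model, not PipeRench's actual controller: it assumes that exactly one virtual stage is configured per cycle, that configuration takes one cycle, that stages are numbered from 0, and that physical stages are reused round-robin. The function name `virtualize` is made up for this example.

```python
def virtualize(num_virtual, num_physical, num_cycles):
    """Return one row per cycle listing what each physical stage holds.

    Each entry is (virtual_stage, "cfg" or "exe"), or None if the stage is empty.
    """
    held = [None] * num_physical
    timeline = []
    for cycle in range(num_cycles):
        # Anything loaded in an earlier cycle is executing now.
        held = [None if entry is None else (entry[0], "exe") for entry in held]
        # Configure one virtual stage per cycle, reusing physical stages round-robin.
        held[cycle % num_physical] = (cycle % num_virtual, "cfg")
        timeline.append(list(held))
    return timeline


# Five virtual stages time-multiplexed onto three physical stages.
schedule = virtualize(num_virtual=5, num_physical=3, num_cycles=7)
for cycle, row in enumerate(schedule):
    print(cycle, row)
```

In this model each virtual stage is configured for one cycle and then executes for `num_physical - 1` cycles before its physical stage is overwritten, so a deeper physical fabric spends a larger fraction of its time doing useful work.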
RPF integration • Many ways exist • No agreement on terminology • Tight vs. loose coupling is only relative • Figure labels: functional unit, tightly coupled?, loosely coupled?
Functional unit • RFU • just another FU • extends ISA • e.g. • Chimaera • PRISC
Coprocessor • RaPiD • independent or integrated • more loosely coupled • Chameleon’s RCP • PPF can access DMA, processor, programmable I/O
Hybrid type • ADRES
Implementation results • Figure: processor area breakdown • Figure: power breakdown in acceleration mode
How do they fare? • Lack of significant market success to date • Reconfigurable computing is still an area of significant ongoing research and commercial interest • For example, Rapport Inc.'s Kilocore design is a commercial derivative of the PipeRench architecture • As of 2007, Rapport was offering 256-PE components organized as 16 stripes of 16 8-bit PEs each, with plans to expand to components containing thousands of PEs • SRP, a derivative of ADRES, is included and used in Samsung’s application processors (APs)