Coarse and Fine Grain Programmable Overlay Architectures for FPGAs Alex Brant Advisor: Guy Lemieux University of British Columbia
Outline • Motivation • Contributions • Prior Work • ZUMA FPGA Overlay • CARBON-Razor Overlay • Summary
Motivation - 1 • FPGA Overlays • FPGA designs that can be further programmed by the user • What are the benefits? • Ease of use (simpler languages, tools, etc.) • Optimized for particular problem domains • Open access to architecture & CAD • User-configured logic added to fixed FPGA bitstream • Dynamic reconfiguration on any device • Portability between vendors and devices
Motivation - 2 Fine Grain Overlay – ZUMA • FPGA-like architecture • Compatible with VTR CAD tools • “Virtual” FPGA for portability of designs • Open source for research and applications • Implements fine grain part of MALIBU architecture • Generic implementation has high area overhead • Overcome by utilizing low level FPGA resources, implementing more efficient structures
Motivation - 3 Coarse Grain Overlay – CARBON • Array of time-multiplexed ALUs • Fast compile • High density • Efficient mapping of word oriented circuits • Implements coarse grain part of MALIBU • Time-multiplexing limits overall performance • Performance gained using overclocking with error tolerance (CARBON-Razor)
Contributions • Area efficient implementation of fine grain routing and logic with LUTRAMs • Area efficient 2-stage local routing network and configuration controller • Extension of Razor error tolerance from pipelined processors to 2D processing arrays • Design of an overclockable coarse grain FPGA overlay with in-circuit error correction
Publications • ZUMA: An Open FPGA Overlay Architecture, Alexander Brant and Guy G.F. Lemieux (FCCM 2012) • Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency using Intentional Clock Skew, Alexander Brant, Ameer Abdelhadi, Aaron Severance, Guy G.F. Lemieux (FPT 2012) • CARBON-Razor: An Error-Tolerant Coarse Grain FPGA (in preparation)
FPGA Architecture • Implements any logic function
MALIBU Architecture • Hybrid coarse/fine grain FPGA • Time-multiplexed ALU (CG) combined with FPGA cluster • CG passes data to neighbors through memories
MALIBU Hybrid FPGA • CGs are run on fast system clock (e.g. > 1GHz) • System clock / Schedule length = User clock rate • Advantages: • Greater density from time-multiplexing • Ability to trade-off between area and speed • Compiles up to 300x faster than normal FPGA • Better performance for word-oriented circuits
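The clock relation on this slide can be illustrated with a short calculation. This is only a sketch: the function name and the example numbers (a 1 GHz system clock and a schedule length of 10) are assumptions chosen for illustration, not figures from the talk.

```python
def user_clock_hz(system_clock_hz: float, schedule_length: int) -> float:
    """User clock rate = system clock / schedule length (relation from the slide)."""
    if schedule_length < 1:
        raise ValueError("schedule length must be at least 1")
    return system_clock_hz / schedule_length


# Hypothetical values for illustration only.
rate = user_clock_hz(1.0e9, 10)
print(f"{rate / 1e6:.0f} MHz")  # prints "100 MHz"
```

A longer schedule packs more operations into each time-multiplexed ALU (higher density) at the cost of a lower user clock, which is the area/speed trade-off listed above.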
Razor Timing Error Tolerance • Works with feed-forward pipeline circuits • Detects timing errors by capturing data a second time with a delayed clock • Tolerates errors by stalling pipeline one cycle
Razor Timing Error Example • Data captured in main FF • Fraction of cycle later, data captured by shadow latch • Main FF and Shadow latch are compared • If different, shadow data loaded to main FF, pipeline is stalled • If not, pipelining proceeds normally
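The capture-and-compare steps can be sketched as a small behavioural model. It is a simplification under stated assumptions: timing is reduced to a single data arrival time, and the names `razor_capture`, `arrival`, `period` and `shadow_delay` are hypothetical, not taken from the Razor design.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    value: int     # value held in the main FF after any correction
    stalled: bool  # True if the pipeline had to stall for one cycle


def razor_capture(new_value: int, old_value: int,
                  arrival: float, period: float,
                  shadow_delay: float) -> RazorResult:
    """Model one Razor-protected register for one clock cycle."""
    if arrival > period + shadow_delay:
        # Assumption of this model: data must at least reach the shadow latch.
        raise ValueError("data misses the shadow latch; error is not recoverable")
    # The main FF samples at the clock edge; late data leaves the stale value.
    main_ff = new_value if arrival <= period else old_value
    # The shadow latch samples a fraction of a cycle later, so it sees the new data.
    shadow = new_value
    if main_ff != shadow:
        # Mismatch: reload the main FF from the shadow latch and stall one cycle.
        return RazorResult(value=shadow, stalled=True)
    return RazorResult(value=main_ff, stalled=False)


print(razor_capture(5, 3, arrival=0.9, period=1.0, shadow_delay=0.3))
print(razor_capture(5, 3, arrival=1.2, period=1.0, shadow_delay=0.3))
```

In this model the first call meets timing and proceeds normally, while the second arrives after the clock edge but before the shadow latch closes, so the correct value is restored at the cost of a stall.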
ZUMA Overlay • Island style FPGA architecture, implemented on an FPGA • Initially implemented in generic Verilog • High area overhead, 125+ host LUTs for each ZUMA LUT (eLUT) • Area efficiency improvements: • Implementation of routing and logic with FPGA LUTRAMs • Design of efficient 2-stage local interconnect
ZUMA Layout One tile of ZUMA Architecture
Details - LUTRAM Reprogrammable LUTRAM in Xilinx and Altera Devices
Details – LUTRAM Multiplexer • LUTRAM can implement larger MUXs than a normal LUT, and needs no extra configuration memory • 6-LUT in RAM mode: configured as a 6-to-1 MUX • Normal 6-LUT: configured as a 4-to-1 MUX
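One way to see why a k-input LUTRAM can act as a k-to-1 MUX is that its own truth table can hold the selection: if the table is written so that the output always equals one chosen input, no separate select bits are needed. The sketch below is an illustration of that idea, not the ZUMA source; the function names and the address bit ordering (input i drives address bit i) are assumptions.

```python
from typing import List


def passthrough_lut_contents(k: int, selected: int) -> List[int]:
    """Truth table (2**k bits) that makes a k-input LUT forward input `selected`."""
    if not 0 <= selected < k:
        raise ValueError("selected input out of range")
    # Entry `addr` is simply the value of address bit `selected`.
    return [(addr >> selected) & 1 for addr in range(2 ** k)]


def lut_read(contents: List[int], inputs: List[int]) -> int:
    """Evaluate the LUT: the inputs form the read address into the table."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return contents[addr]


# A 6-LUT programmed to route its input 4 to the output.
table = passthrough_lut_contents(6, 4)
print(lut_read(table, [0, 1, 0, 1, 1, 0]))  # prints 1 (the value on input 4)
```

By contrast, a 6-LUT with fixed contents must spend two of its six inputs on select lines, which limits it to a 4-to-1 MUX and requires the select values to be stored elsewhere.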
Details – Local Routing Crossbar Two-Stage (I+N) x (k*N) crossbar used in ZUMA Logic Cluster
Results • Both Xilinx and Altera versions implemented • Our generic version is 125-150 LUTs per eLUT • Area overhead as low as 40 Host LUTs per eLUT with improvements • Compared to previous work (vFPGA) on 4-LUT host, overhead reduced 3x with same parameters
CARBON Overlay • FPGA implementation of MALIBU CG • Modifications to support FPGA block RAMs • Critical Path is Memory to ALU to Memory
CARBON-Razor • Razor is applied to the CARBON overlay • Error tolerance on memory to memory critical path • How to do it: • Shadow registers are applied to CARBON memories • CARBON schedule is extended by 1-3 extra timeslots for error recovery • Stall propagation is extended from 1D pipeline (Razor) to 2D array (CARBON)
CARBON-Razor Memory • Shadow register paired with RAM • Stratix memory mode allows read-back of previously written data
2D Error Propagation • Can’t propagate an error to the entire chip fast enough • We can propagate it one tile per cycle • Error propagation logic can then combine multiple errors into one stall region
2D Error Propagation Example • Error at tile at cycle 0 • Each cycle, stall propagates to nearest neighbors • Cycle at which each tile stalls (4x4 array, error tile marked 0):
4 3 2 3
3 2 1 2
2 1 0 1
3 2 1 2
Stall Propagation Logic • When an error is detected at a CG: • Instruction schedule stalls • Memories in CG load from shadow register • Any writes from neighbor captured in shadow register • Next cycle: • Schedule resumes • Neighbor’s write performed from shadow register • 4 neighbors stall, unless they stalled last cycle • Stall region continues in expanding diamond shaped wave
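The expanding diamond can be sketched as a breadth-first wave over the tile array. This is a behavioural model under one assumption: "unless they stalled last cycle" is read as each tile stalling exactly once per error, so the wave never travels back toward its origin. The function name `stall_cycles` and the grid representation are hypothetical.

```python
from typing import List, Optional, Tuple


def stall_cycles(rows: int, cols: int,
                 error_tile: Tuple[int, int]) -> List[List[Optional[int]]]:
    """Return a grid whose entry [r][c] is the cycle at which that tile stalls."""
    cycle_of: List[List[Optional[int]]] = [[None] * cols for _ in range(rows)]
    er, ec = error_tile
    cycle_of[er][ec] = 0          # the tile that detected the error stalls first
    frontier = [(er, ec)]
    cycle = 0
    while frontier:
        cycle += 1
        next_frontier = []
        for r, c in frontier:
            # Each stalled tile passes the stall to its 4 nearest neighbours.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                in_array = 0 <= nr < rows and 0 <= nc < cols
                # Tiles that already stalled for this error are skipped.
                if in_array and cycle_of[nr][nc] is None:
                    cycle_of[nr][nc] = cycle
                    next_frontier.append((nr, nc))
        frontier = next_frontier
    return cycle_of


for row in stall_cycles(4, 4, (2, 2)):
    print(" ".join(str(c) for c in row))
# 4 3 2 3
# 3 2 1 2
# 2 1 0 1
# 3 2 1 2
```

Each tile stalls at a cycle equal to its Manhattan distance from the error tile, which gives the diamond shape; for a 4x4 array with the error at row 2, column 2, the output agrees with the cycle labels in the propagation example.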
CARBON Schedule Extension • We add 1-3 cycles of slack to schedule • Allows margin of safety • Speedup determined by difference in FMAX and schedule length • If no hard deadline is needed (e.g. when used as compute accelerator), average extension of schedule can be used to find speedup • Speedup = (FMAX_Razor × SL_Base) / (FMAX_Base × SL_Razor)
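The speedup expression is the ratio of the two user clock rates (FMAX divided by schedule length for each design). A minimal sketch follows; the function name and the example numbers are assumptions for illustration and are not measurements from the talk.

```python
def razor_speedup(fmax_razor: float, fmax_base: float,
                  sl_razor: float, sl_base: float) -> float:
    """Speedup = (FMAX_Razor * SL_Base) / (FMAX_Base * SL_Razor)."""
    return (fmax_razor * sl_base) / (fmax_base * sl_razor)


# Hypothetical case: overclocking from 100 MHz to 120 MHz, with the
# schedule lengthened from 10 to 11 timeslots to leave room for recovery.
print(f"{razor_speedup(120.0, 100.0, 11, 10):.3f}")  # prints "1.091"
```

The schedule lengths are typed as floats so that, when no hard deadline applies, an average extended schedule length (for example 10.4 timeslots) can be passed in place of the worst-case value.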
Results • Performance compared between CARBON and CARBON-Razor for 4 benchmarks • Maximum performance found by pushing clock speed and shadow register delay • Average speedup increases to 14% with no hard deadline
Contributions • Area efficient implementation of FPGA routing and logic with LUTRAMs • Area efficient 2-stage local routing network and configuration controller • Extension of Razor error tolerance from pipelined processors to 2D processing arrays • Design of an overclockable coarse grain FPGA overlay with in-circuit error correction
Summary • Fine Grain Overlay – ZUMA • FPGA-like architecture, compatible with VTR CAD tools • High area overhead implementing fine grain structures • Overcome by utilizing FPGA resources, implementing alternate structures • Area reduced to 40 host LUTs per eLUT, 3x improvement • Coarse Grain Overlay – CARBON • Fast compile, efficient mapping of word oriented circuits • Time-multiplexing decreases overall performance • Performance gained using overclocking with error tolerance • Speedup of 13% on average compared to baseline design
CARBON Razor Timing • Shadow register latches correct data if delay is sufficient