Post Placement C-Slow Retiming for Xilinx Virtex FPGAs
UC Berkeley Reconfigurable Architectures, Systems, and Software (BRASS) Group
ACM Symposium on Field Programmable Gate Arrays (FPGA)
February 2x, 2003
http://www.cs.berkeley.edu/~nweaver/cslow.html
Outline
• “Automatically Double Your Throughput”
  • “You paid for those registers, here’s how to use them”
• Retiming and C-slow Retiming
  • The transformation
• C-slow Retiming and the Virtex FPGA
  • The target
• Retiming 3 Benchmarks
  • The tests
Automatic C-Slow Retiming for Virtex FPGAs
Retiming and Repipelining
• Retiming
  • Automatically moving registers to minimize the clock period
  • Benefits limited by the number of registers
  • Algorithm developed by Leiserson et al.
• Repipelining
  • Adding registers to the front or back
  • Let retiming then move them around
• But what about feedback loops?
  • Retiming and repipelining are of limited benefit when the design has feedback loops: neither can change the number of registers around a loop
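The feedback-loop limit can be seen on a toy graph. This is a minimal sketch, assuming the usual Leiserson-style formulation in which a retiming assigns an integer lag `r` to each node and an edge `u -> v` with `w` registers ends up with `w + r[v] - r[u]` registers. The three-node circuit and the helper name `retime` are made up for illustration and are not taken from the tool.

```python
# Hypothetical toy circuit: each edge is (source, sink, register_count).
loop = [("a", "b", 1), ("b", "c", 0), ("c", "a", 0)]

def retime(edges, r):
    """Apply lags r to a graph: an edge u->v gets w + r[v] - r[u] registers."""
    retimed = []
    for u, v, w in edges:
        w_new = w + r[v] - r[u]
        if w_new < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} would hold {w_new} registers")
        retimed.append((u, v, w_new))
    return retimed

# Move the single register from edge a->b to edge b->c.
moved = retime(loop, {"a": 0, "b": -1, "c": 0})
print(moved)                                                      # [('a', 'b', 0), ('b', 'c', 1), ('c', 'a', 0)]
print(sum(w for _, _, w in loop), sum(w for _, _, w in moved))    # 1 1
```

The register moves, but the count around the loop stays at one, so the loop's total delay still has to fit in a single clock period.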
C-Slow Retiming
• Replace every register with a sequence of C registers
  • With more registers, retiming can break the design into finer pieces
  • Again proposed by Leiserson et al., to meet systolic slowdown
• Semantics-altering transformation
  • But resulting semantics are predictable and useful
• Ideal: C-slow in synthesis, retime after placement
• Our prototype: C-slow and retime after placement
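A rough sketch of why the extra registers help: on a graph whose edges carry register counts, C-slowing multiplies every count by C, so a feedback loop that held one register now holds C of them for retiming to spread out. The circuit, the delays, and the helper names `c_slow` and `loop_period_bound` are hypothetical. The bound shown is only the per-loop lower limit (loop delay divided by loop registers), not what a real retimer is guaranteed to reach.

```python
# Hypothetical toy loop: each edge is (source, sink, register_count).
loop = [("a", "b", 1), ("b", "c", 0), ("c", "a", 0)]
# Hypothetical logic delay of each node, in nanoseconds.
delay_ns = {"a": 2.0, "b": 3.0, "c": 1.0}

def c_slow(edges, c):
    """Replace every register with a chain of c registers."""
    return [(u, v, w * c) for u, v, w in edges]

def loop_period_bound(edges, delay):
    """Lower bound on the clock period for one loop: total delay / registers in the loop."""
    total_delay = sum(delay[u] for u, _, _ in edges)
    registers = sum(w for _, _, w in edges)
    return total_delay / registers

print(loop_period_bound(loop, delay_ns))             # 6.0
print(loop_period_bound(c_slow(loop, 2), delay_ns))  # 3.0
```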
Design Semantics After C-Slowing
• Design operates on C independent data streams
  • Data streams are externally interleaved on a round-robin basis
• Semantics apply to designs with task-level parallelism
  • Encryption
    • Counter (CTR) mode works on independent blocks
  • Sequence matching
    • Compare sequence vs. database
• C-slowing improves throughput but adds latency and registers
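The changed semantics can be illustrated with a small software model. This is a sketch only, assuming a running-sum accumulator as the original design. Its single state register is replaced by a chain of C registers (modelled with a `deque`), and the inputs of C streams are interleaved round-robin. The function name and the data are invented for the example.

```python
from collections import deque

def c_slowed_accumulator(interleaved, c):
    """Running-sum circuit whose one state register became a chain of c registers."""
    state = deque([0] * c)          # the c registers in the feedback loop
    outputs = []
    for x in interleaved:           # one input per clock cycle
        s = state.popleft() + x     # value leaving the chain plus the new input
        state.append(s)             # result enters the chain again
        outputs.append(s)
    return outputs

streams = [[1, 2, 3], [10, 20, 30]]                              # two independent streams
interleaved = [x for group in zip(*streams) for x in group]      # round-robin interleave
out = c_slowed_accumulator(interleaved, c=2)
per_stream = [out[i::2] for i in range(2)]                       # de-interleave
print(per_stream)                                                # [[1, 3, 6], [10, 30, 60]]
```

Each stream sees its own running sum, as if it had a private copy of the original circuit, but a new result for a given stream appears only every C cycles.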
C-slowing, Retiming, and the Virtex FPGA
[Figure: 4-LUT with inputs F1–F4 and pins BX, X, XB]
• Every 4-LUT has an associated register
  • Register can, almost always, be used independently of the LUT
• LUTs can act as clocked shift registers (SRL16s)
  • Used in our AES hand-benchmark
  • Not used in our tool
• Many designs have low register utilization
  • Excess of registers available in unoptimized designs
• Retiming best performed with/after placement
  • Xilinx placement operates on mapped slices
  • Need net delay information for better results
Sketch of Tool’s Operation
[Figure: flow from Placer through .xdl to Router, with annotated example graphs]
• Convert .ncd to .xdl after placement
• Load design into graph representation
  • Replace register instances with edge annotations
• Replace every single register with C registers
• Compute costs based on delay model
• Retime
• Convert edge annotations back to instance registers
• Write out .xdl, convert to .ncd
• Route
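The "replace registers with edge annotations" step might look roughly like the following. This is a hedged sketch on an invented netlist format (a dict of instance kinds plus a list of driver/sink pairs), not the tool's actual data structures or the .xdl format. It assumes every register chain eventually reaches a LUT.

```python
def collapse_registers(kinds, nets):
    """Turn a netlist with explicit flip-flops into LUT-to-LUT edges annotated
    with the number of registers crossed.

    kinds: instance name -> "LUT" or "FF" (hypothetical format)
    nets:  list of (driver, sink) pairs
    Assumes there is no cycle made only of flip-flops.
    """
    succ = {}
    for driver, sink in nets:
        succ.setdefault(driver, []).append(sink)

    edges = []
    for u, kind in kinds.items():
        if kind != "LUT":
            continue
        frontier = [(s, 0) for s in succ.get(u, [])]
        while frontier:
            node, regs = frontier.pop()
            if kinds[node] == "LUT":
                edges.append((u, node, regs))      # reached the next LUT
            else:
                # Walk through the flip-flop, counting it on the edge.
                frontier.extend((s, regs + 1) for s in succ.get(node, []))
    return sorted(edges)

kinds = {"lut0": "LUT", "ff0": "FF", "lut1": "LUT"}
nets = [("lut0", "ff0"), ("ff0", "lut1"), ("lut1", "lut0")]
print(collapse_registers(kinds, nets))   # [('lut0', 'lut1', 1), ('lut1', 'lut0', 0)]
```

Once registers are just integers on edges, C-slowing is a multiplication, and converting back means instantiating that many flip-flops on each edge.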
Experiment 1: How Good is the Tool?
• Tool is a simple prototype
  • Manhattan distance delay estimate
  • No attempt to minimize flip-flops
  • Basic flip-flop allocation
• Two benchmarks: AES and Smith/Waterman
  • Hand mapped
  • (optionally) hand placed
  • (optionally) hand C-slowed and retimed
• Our best hand AES implementation
  • 1.3 Gb/s
  • <800 slices, 10 BlockRAMs
  • $10 part, Spartan II-100
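A Manhattan distance delay estimate of the kind mentioned above could be as simple as the sketch below. The function name, the linear form, and both constants are assumptions chosen for illustration. They are not the prototype's model and not measured Virtex timing numbers.

```python
def manhattan_delay_ns(src, dst, base_ns=0.5, per_step_ns=0.25):
    """Estimate a net's delay from the placed (x, y) positions of its endpoints.

    base_ns and per_step_ns are made-up constants for this example.
    """
    (x0, y0), (x1, y1) = src, dst
    return base_ns + per_step_ns * (abs(x1 - x0) + abs(y1 - y0))

print(manhattan_delay_ns((2, 3), (5, 7)))  # 2.25
```

An estimate like this only needs placement coordinates, which is one reason the slides argue for retiming after placement.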
Experiment 1: AES, Automatically Placed
• Just retiming is of no benefit
• Automatic C-slowing very effective
  • But could do even better
Experiment 1: Smith/Waterman, Automatically Placed
• Again, just retiming is of no benefit
• C-slowing highly effective
  • Within 7% of hand-built implementation
Experiment 1: Comments
• Just retiming is of no benefit
  • Both designs limited by single-cycle feedback loops
• C-slowing very effective
  • Able to automatically nearly double throughput
  • Hand implementations more than doubled throughput
  • Reasonable numbers of additional registers
• Limitations of prototype tool:
  • Flip-flop allocation routines could be better
  • Some AES hand benchmarks used SRL16 delay chains
• Simple is pretty good
  • Relatively simplistic implementation gets reasonably close to hand-mapped performance
Experiment 2: Retiming LEON
• Can we automatically C-slow a large, synthesized design?
• Leon 1: a synthesized, GPLed, SPARC-compatible microprocessor core [1]
  • 5-stage pipeline, integer only
• Modify register file to use BlockRAMs
  • BlockRAMs are used as negative-edge devices
• Remove caches, I/O, etc.
• Synthesize using Synplify with CEs disabled
• Edit EDIF to replace Sets/Resets
• Retime and C-slow with prototype tool
  • Prototype tool converts BlockRAMs to positive edge
• C-slow a microprocessor core...
  • Get an interleaved multithreaded architecture
[1] Leon 1, by Jiri Gaisler, http://www.gaisler.com/leonmain.html
Experiment 2: Results
• Retiming alone worked surprisingly well
• 2-slowing very effective
• 3-slowing hit diminishing returns
6132 LUTs for all designs
Experiment 2: Comments
• Retiming alone worked surprisingly well
  • Tool automatically converted BlockRAMs to positive-edge clocking and rebalanced the pipeline
• 2-slowing very effective
  • Effectively doubled the initial throughput
  • NO slowdown in latency over initial design because retiming was effective without C-slowing
  • Used many more registers, but fewer registers than LUTs
• 3-slowing hit diminishing returns
  • Too many registers required, combined with poor register allocation, gave poor performance
Conclusions
• C-slow retiming is very effective
  • "Automatically double your throughput"
  • Benefits: more throughput
  • Costs: more flip-flops, worse latency
• Post-placement retiming appropriate
  • Independent flip-flop usage critical
  • Have delay model for interconnect as well as logic
• Some room for improvement
  • Faster/better implementation
  • Minimize flip-flop usage as well as delay
  • Use SRL16s
  • Better placement of flip-flops
• Experience suggests more flip-flops per LUT would be useful
Backup Slide: Why Not Use (Current) Synthesis Tools?
• Many synthesis tools support retiming, but with caveats:
  • ONLY works for synthesized items
    • AES and Smith/Waterman didn't use synthesis
  • Can't automatically C-slow
  • Can't retime through memory blocks
  • Can't accurately estimate interconnect delay before placement
    • >½ of the delay is the interconnect
  • Can't effectively scavenge unused flip-flops before placement
    • Xilinx placement operates on slices, not LUTs
Backup Slide: Why the Limitations on Total Speedup?
• Absolute maximum
  • Interconnect + LUT + flip-flop delay bounds the clock period
• Practical maximums
  • Too many flip-flops to allocate
    • “Only” one flip-flop per LUT available
  • Flip-flop allocation poor
    • Quick and dirty greedy heuristic
    • Works well for mild C-slowing
    • Fails with highly aggressive C-slowing
  • Tool doesn’t minimize flip-flops
  • Critical path is defined by the single worst path
  • Tool uses “cheap and dirty” interconnect delay model
Backup Slide: Design Restrictions to Enable C-slowing
[Figure: memory block (Addr, Din, Dout, WE) with a Thread Counter]
• Resets and clock enables
  • Convert to explicit logic
• Memories
  • Increase by a factor of C
  • Add high bits of addr to provide round-robin access
  • Every stream sees an independent memory
• Global Set/Reset
  • Convert to individual resets
  • Still highly restrictive
• Interleave/deinterleave I/O
  • Requires external logic
• No asynchronous sets/resets
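The memory restriction can be modelled in a few lines. This is an illustrative sketch, not the tool's implementation: the class name, port names, and read-before-write behaviour are assumptions. The memory is made C times larger, and a round-robin thread counter supplies the high address bits so that each stream only ever touches its own region.

```python
class CSlowedMemory:
    """Software model of a memory enlarged by a factor of c for a c-slowed design."""

    def __init__(self, depth, c):
        self.depth = depth                # original depth (a power of two in hardware)
        self.c = c
        self.cells = [0] * (depth * c)    # memory increased by a factor of c
        self.thread = 0                   # round-robin thread counter

    def cycle(self, addr, din=0, we=False):
        """One clock cycle: read (and optionally write) the current thread's region."""
        physical = self.thread * self.depth + addr   # thread counter forms the high bits
        dout = self.cells[physical]                  # assumed read-before-write
        if we:
            self.cells[physical] = din
        self.thread = (self.thread + 1) % self.c
        return dout

mem = CSlowedMemory(depth=4, c=2)
mem.cycle(1, din=11, we=True)   # stream 0 writes address 1
mem.cycle(1, din=22, we=True)   # stream 1 writes the same logical address
print(mem.cycle(1), mem.cycle(1))   # 11 22
```

Both streams use logical address 1 yet read back their own data, which is the "every stream sees an independent memory" property.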