Explore methods to utilize regularity in datapath circuits to minimize area and increase logic density in FPGA designs, focusing on synthesis optimization.
Synthesizing Datapath Circuits for FPGAs With Emphasis on Area Minimization Andy Ye, David Lewis, Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto {yeandy, lewis, jayar}@eecg.utoronto.ca
Motivation: Datapath Regularity • Larger FPGAs • Larger applications on FPGAs • More datapath logic in larger applications • Datapath logic is highly regular • Utilize regularity to improve logic density
Utilizing Datapath Regularity • A new datapath-oriented FPGA • New CAD tools supporting the new FPGA • Synthesis • Packing • Placement • Routing • This talk focuses on synthesis
Background: Datapath-oriented FPGA • Architected to utilize datapath regularity • Architectural features • Capture regularity using special logic blocks • Increase logic density by coarse grain routing
Background: FPGA Overview [Figure: array of logic clusters (L) and switch boxes (S) joined by routing channels that carry coarse grain and fine grain routing tracks]
Background: Logic Cluster [Figure: a logic cluster made of Subcluster 1 to Subcluster 4 and multiplexers (M); a subcluster contains Basic Logic Elements (BLEs) and a local routing network; a BLE contains a LUT, a DFF, and a MUX]
Background: Coarse Grain Routing Tracks [Figure: a logic cluster with four subclusters and multiplexers (M), connected to a switch box by fine grain routing and coarse grain routing]
Datapath Synthesis • Synthesis • The first step in a fully automated CAD flow • Transforms high level descriptions into logic • Conventional synthesis (flat synthesis) • Minimizes area and delay metrics • Destroys datapath regularity • Datapath synthesis • Preserves datapath regularity • Supports downstream CAD tools
Datapath Representation • Datapath circuits are represented by netlists of datapath components (VHDL or Verilog) • Datapath component library • Multiplexers • Adders/subtracters • Shifters • Comparators • Registers • Each component consists of identical bit-slices
Hard Boundary Hierarchical Synthesis • Optimize within the boundaries of bit-slices • Keep identical bit-slices identical • Optimized 15 datapath circuits from Pico-java processor using Synopsys [sun] • Good regularity • Bad area - 38% area inflation • FPGA architecture – increase logic density • Need a better synthesis tool
Causes of Area Inflation • Examined circuits to determine the causes • Constraint of preserving bit-slice boundaries • Common sub-expressions exist across bit-slices • Harder to discover in datapath synthesis • Constraint of preserving datapath regularity • Identical bit-slices have different external connections • Some bit-slices have more optimization opportunities • Optimization opportunities are missed if all bit-slices have to be kept identical
Enhanced Module Compaction [Figure: synthesis flow] • Netlist of Datapath Components → Word-level Optimization (manual operation) → Module Compaction → Bit-slice I/O Optimization → Flat Synthesis & Optimization Within Bit-slice Boundaries → Netlist of Synthesized Bit-slices
Word-level Optimization • Currently done manually; to be automated • Optimizes across bit-slice boundaries • Uses the functionality of each datapath component to create optimization opportunities • Two optimizations are performed • Multiplexer tree collapsing • Operation reordering • More in the future
Multiplexer Tree Collapsing • Datapath circuits contain multiplexers in a tree topology • Collapses several multiplexers in a multiplexer tree into a single multiplexer • Collapsing operation creates common sub-expressions • Extracts common expressions out of multiple bit-slices to save area
An Example [Figure: multiplexer tree collapsing example; labels: mux1, mux2, A, S1, S2, R, FF, and rl (random logic)]
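The collapsing step can be illustrated with a minimal Python sketch. It assumes a two-level tree of 2:1 multiplexers with select signals `s1` and `s2` and data inputs `a` to `d` (all names are hypothetical, not taken from the slides), and it only checks that the collapsed 4:1 multiplexer computes the same function as the tree. The extraction of common sub-expressions across bit-slices is not modeled.

```python
from itertools import product

def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0

def mux_tree(s1, s2, a, b, c, d):
    """Two-level tree: two 2:1 muxes driven by s1 feed a third driven by s2."""
    return mux2(s2, mux2(s1, a, b), mux2(s1, c, d))

def mux4(s1, s2, a, b, c, d):
    """Collapsed form: one 4:1 mux indexed by the combined select (s2, s1)."""
    return (a, b, c, d)[(s2 << 1) | s1]

def equivalent():
    """Exhaustively compare the tree and the collapsed mux on 1-bit inputs."""
    return all(
        mux_tree(*bits) == mux4(*bits)
        for bits in product((0, 1), repeat=6)
    )
```

Because every bit-slice of the collapsed multiplexer shares the same select decoding, that decoding logic is a candidate common sub-expression that could be extracted once for all bit-slices.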
Operation Reordering • Transforms result selection into operand selection • Accepts the transformation only if it results in a smaller area
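An Example [Figure: operation reordering example shown at word level and for bit-slice 0; labels: operands a, b, c, d (a0, b0, c0, d0), select s (s0), output e (e0), adders with sum and carry signals (cin0a, cin0b, cout0a, cout0b before; cin0, cout0 after), and multiplexers]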
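A minimal Python sketch of the idea, assuming 4-bit operands and an addition as the reordered operation (the width, function names, and the area comparison helper are assumptions for illustration). Result selection computes both sums and then selects one; operand selection selects the operands first and then needs a single adder.

```python
from itertools import product

WIDTH = 4                      # assumed operand width for the illustration
MASK = (1 << WIDTH) - 1

def result_selection(s, a, b, c, d):
    """Two adders compute a+b and c+d; one mux selects the result."""
    return ((c + d) if s else (a + b)) & MASK

def operand_selection(s, a, b, c, d):
    """Two muxes select the operands; one adder computes the sum."""
    x = c if s else a
    y = d if s else b
    return (x + y) & MASK

def equivalent():
    """Exhaustively check that both structures compute the same output."""
    values = range(1 << WIDTH)
    return all(
        result_selection(s, a, b, c, d) == operand_selection(s, a, b, c, d)
        for s in (0, 1)
        for a, b, c, d in product(values, repeat=4)
    )

def accept_transformation(area_before, area_after):
    """Hypothetical acceptance rule: keep the rewrite only if area shrinks."""
    return area_after < area_before
```

In a real flow the two area figures would come from synthesizing both versions; here they are simply numbers passed in by the caller.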
Module Compaction • Merges bit-slices into larger bit-slices • Based on connectivity between datapath components • Larger bit-slices have more optimization opportunities for flat synthesis • Avoids merging based on carry chains • Similar to the algorithm proposed by Koch
An Example [Figure: module compaction example with bit-slices FA0 to FA4 and mux0 to mux3]
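The merging rule can be sketched in Python as a union-find pass over bit-slice connections that skips carry-chain edges. This is only an illustration under stated assumptions: the connectivity in `example_netlist` (each `mux<i>` feeding `FA<i>`, with carries linking consecutive `FA` slices) is hypothetical, and the sketch is not the algorithm proposed by Koch nor the authors' implementation.

```python
def compact(slices, connections):
    """Group bit-slices joined by non-carry connections.

    connections: iterable of (source, destination, is_carry) tuples.
    Returns a sorted list of sorted groups (merged bit-slices).
    """
    parent = {s: s for s in slices}

    def find(s):
        # Find the group representative, with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for src, dst, is_carry in connections:
        if is_carry:
            continue  # never merge along a carry chain
        parent[find(src)] = find(dst)

    groups = {}
    for s in slices:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())

def example_netlist():
    """Hypothetical connectivity: mux<i> drives FA<i>; carries chain the FAs."""
    slices = [f"FA{i}" for i in range(5)] + [f"mux{i}" for i in range(4)]
    connections = [(f"mux{i}", f"FA{i}", False) for i in range(4)]
    connections += [(f"FA{i}", f"FA{i + 1}", True) for i in range(4)]
    return slices, connections
```

With this assumed connectivity, each multiplexer slice merges with the adder slice it drives, while the adder slices stay separate from one another because only carry signals connect them.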
Bit-slice I/O Optimization • Granularity of bit-slice I/O optimization, m • Breaks datapath components into m-bit wide chunks • m bit-slices are kept identical to each other • Allows some bit-slices in a datapath component to be optimized more than others
Bit-slice I/O Optimization • Converts bit-slice I/O signals into internal signals if all m bit-slices meet an optimization criterion • More optimization opportunities for flat synthesis • Four types of I/O optimizations • Constant absorption • Feedback absorption • Duplicated input absorption • Unused output absorption
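The role of the granularity m can be sketched in Python for constant absorption. The sketch assumes one input port of a datapath component, described per bit-slice as `0` or `1` when tied to a constant and `None` otherwise, and it assumes that a chunk qualifies only when all of its bit-slices see the same constant (so the slices in the chunk stay identical). The function name and data layout are hypothetical.

```python
def absorbable_constants(port_values, m):
    """Decide, per m-bit chunk, whether a constant input can be absorbed.

    port_values: per-bit-slice value of one input port; 0 or 1 for a
                 constant, None for a non-constant signal.
    Returns one entry per chunk: the shared constant, or None if any
    bit-slice in the chunk fails the criterion.
    """
    result = []
    for start in range(0, len(port_values), m):
        chunk = port_values[start:start + m]
        first = chunk[0]
        same_constant = first is not None and all(v == first for v in chunk)
        result.append(first if same_constant else None)
    return result
```

For an 8-bit port whose six low bit-slices are tied to 0, `m = 4` lets only the first chunk absorb the constant, whereas `m = 2` lets three of the four chunks do so. Smaller m exposes more optimization at the cost of keeping fewer bit-slices identical.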
Experimental Results • Fifteen benchmark circuits • From the Pico-java processor • Synthesized into 4-LUTs and DFFs • Experiments • Area • Regularity • Area against m (the granularity of bit-slice I/O optimization)
Area • m (granularity of bit-slice I/O optimization) = 4 • Compare datapath synthesis with flat synthesis
Regularity • m (granularity of bit-slice I/O optimization) = 4 • Two terminal connections captured by • 4-bit wide buses • 4-bit wide control groups
Regularity [Figure: three groups of S1 to S4, illustrating a 4-bit wide bus and a 4-bit wide control group]
Regularity Results • 94% of LUTs remain in regular datapath components
Granularity (m) Vs. Area • Higher m (the granularity of bit-slice I/O optimization) • Keeps more bit-slices identical • Preserves more regularity • Higher area cost
Conclusion • Presented a datapath-oriented FPGA architecture • Presented an enhanced module compaction algorithm • Empirically demonstrated the area efficiency of the algorithm • 3%-8% area inflation • Good regularity • 48% of two-terminal connections are in 4-bit wide buses • 35% of two-terminal connections are in 4-bit wide control groups