ECE 697F Reconfigurable Computing Lecture 5 Technology Mapping: Packing Logic into LUTs
Overview • Logic synthesis • LUT Clustering • LUT capacity • Chortle – example technology mapper • Architecture-specific optimization
Boolean network • A Boolean network is the main representation of the logic functions for technology independent optimizations. • Each node can be represented as sum-of-products (or product-of-sums). • Provides multi-level structure, but functions in the network need not correspond to logic gates.
Boolean network example • Primary outputs: out1 = k2 + x2’, out2 = k3 + x1 • Intermediate nodes: k2 = x1’ x2 x4 + k1, k3 = k1 x4’, k1 = x2 + x3 • Primary inputs: x1, x2, x3, x4
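The example network can be evaluated directly. The following is a minimal sketch, assuming the usual reading of the notation (juxtaposition is AND, + is OR, ’ is NOT); the function name `eval_network` is illustrative only.

```python
from itertools import product

def eval_network(x1, x2, x3, x4):
    """Evaluate the example Boolean network; all values are Python bools."""
    # Intermediate nodes, in topological order.
    k1 = x2 or x3
    k2 = ((not x1) and x2 and x4) or k1
    k3 = k1 and (not x4)
    # Primary outputs.
    out1 = k2 or (not x2)
    out2 = k3 or x1
    return out1, out2

# Full truth table over the four primary inputs.
table = {bits: eval_network(*bits) for bits in product([False, True], repeat=4)}
```

Each intermediate node is an arbitrary sum-of-products rather than a single gate, which is the multi-level structure the previous slide describes.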
Terms • Support: set of variables used by a function. • Transitive fanout: all the primary outputs and intermediate variables that depend on a function. • Transitive fanin: all the primary inputs and intermediate variables used by a function. Transitive fanin determines a cone of logic. [Figure: cone of logic between the primary inputs and an output]
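These terms can be made concrete on the Boolean network example from the earlier slide. The sketch below assumes the network is stored as a dictionary from each node to its support; `SUPPORT`, `transitive_fanin`, and `transitive_fanout` are illustrative names.

```python
# Support of each node in the example network (signals its local function reads).
SUPPORT = {
    "out1": {"k2", "x2"},
    "out2": {"k3", "x1"},
    "k2": {"x1", "x2", "x4", "k1"},
    "k3": {"k1", "x4"},
    "k1": {"x2", "x3"},
}

def transitive_fanin(node, support=SUPPORT):
    """All primary inputs and intermediate variables reachable backwards from node."""
    seen = set()
    stack = [node]
    while stack:
        for signal in support.get(stack.pop(), ()):
            if signal not in seen:
                seen.add(signal)
                stack.append(signal)
    return seen

def transitive_fanout(signal, support=SUPPORT):
    """All nodes whose cone of logic contains signal."""
    return {n for n in support if signal in transitive_fanin(n, support)}
```

For instance, the cone of `k3` is `{k1, x4, x2, x3}`, and `k1` reaches both primary outputs.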
Partially-specified function [Figure: function of x1, x2, x3 whose entries are 0, 1, or don’t care]
Optimizations • Simplification: changing the way a function is represented. • Network restructuring: adding and removing nodes. • Delay restructuring: optimizations that reduce the height of critical paths.
Partial collapsing [Figure: network of nodes f1 to f4 before collapsing, and the network with collapsed node F after]
Technology mapping • Cover the function:
FPGA tech mapping • Cost (number of inputs) doesn’t always increase with added functions:
FPGAs vs. custom logic • Cost metric for static gates is the literal count: • ax + bx’ has four literals, requires 8 transistors. • Cost metric for FPGAs is the logic element: • All functions that fit in an LE have the same cost.
LUT-based logic synthesis • Find the largest logic cone that will fit into the LUT: r = q + s’, s = d’, q = g’ + h, d = a + b
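Collapsing this cone shows why it fits: r depends only on the four signals a, b, g, h, so the whole cone can be stored as one 16-entry table. A minimal sketch, assuming a 4-input LUT and a row order with the first input as the most significant bit (`cone_r` and `lut_contents` are illustrative names):

```python
from itertools import product

INPUTS = ("a", "b", "g", "h")  # support of r after collapsing d, s, and q

def cone_r(a, b, g, h):
    """Evaluate the cone rooted at r using the node equations from the slide."""
    d = a or b
    s = not d
    q = (not g) or h
    return q or (not s)

def lut_contents(func, n_inputs):
    """Truth table as a list of 0/1; row index reads the inputs as a binary number."""
    return [int(func(*bits)) for bits in product([False, True], repeat=n_inputs)]

lut = lut_contents(cone_r, len(INPUTS))
```

The only row that stores 0 is a = 0, b = 0, g = 1, h = 0, since r simplifies to g’ + h + a + b.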
How much fits in a LUT? • One 2-input NAND gate frequently used for comparison. • Approximately 12 ~ 15 gates per four-input LUT. • 2^16 functions -> 80 after IO swapping -> 14 after IO inversion • 4-input determined to be optimal [Rose 1990] [Figure: LUTs with inputs A, B, C, D]
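The function counts can be checked by brute force for small LUTs. The sketch below is an illustration under assumptions: "IO swapping" is read as permuting inputs, and "IO inversion" as additionally complementing inputs and the output; `count_classes` is a hypothetical helper. A K-input LUT stores 2^(2^K) functions, so enumeration is practical only for K up to 3.

```python
from itertools import permutations

def count_classes(k):
    """Return (functions, classes under input permutation, classes under
    permutation plus input/output inversion) for k-input functions."""
    rows = 1 << k

    def move_row(row, perm, neg):
        row ^= neg                      # complement the selected inputs
        out = 0
        for i, p in enumerate(perm):    # then route input i to position p
            if (row >> i) & 1:
                out |= 1 << p
        return out

    def transform(tt, perm, neg, invert_output):
        new = 0
        for row in range(rows):
            if (tt >> row) & 1:
                new |= 1 << move_row(row, perm, neg)
        return new ^ ((1 << rows) - 1) if invert_output else new

    perms = list(permutations(range(k)))
    swap_classes, inv_classes = set(), set()
    for tt in range(1 << rows):         # every truth table, encoded as an integer
        swap_classes.add(min(transform(tt, p, 0, False) for p in perms))
        inv_classes.add(min(transform(tt, p, n, o)
                            for p in perms for n in range(rows) for o in (False, True)))
    return 1 << rows, len(swap_classes), len(inv_classes)

counts_3 = count_classes(3)
```

Under this reading, a 3-input LUT gives 256 functions, 80 classes after input swapping, and 14 once inversion is also allowed, which are the 80 and 14 quoted on the slide. A 4-input LUT has 2^16 = 65536 functions.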
Technology-Independent Logic Optimization • Improve circuit based on cost • Keep same functionality • Boolean Evaluation/decomposition • Simple factoring -> minimizing literals • Before: f = ac + ad + bc + bd, g = a + b + c • After: e = a + b, g = e + c, f = e(c + d)
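The factoring example can be checked exhaustively: both forms must agree on every input, and the literal count should drop. A small sketch (the helper names `flat`, `factored`, and `literal_count` are illustrative):

```python
from itertools import product

def flat(a, b, c, d):
    """Original two-level forms of f and g."""
    f = (a and c) or (a and d) or (b and c) or (b and d)
    g = a or b or c
    return f, g

def factored(a, b, c, d):
    """Forms after extracting the common factor e = a + b."""
    e = a or b
    g = e or c
    f = e and (c or d)
    return f, g

# Same functionality on all 16 input combinations.
equivalent = all(flat(*v) == factored(*v) for v in product([False, True], repeat=4))

def literal_count(expressions):
    """Count literals by counting variable letters in each expression string."""
    return sum(ch.isalpha() for expr in expressions for ch in expr)

before = literal_count(["ac + ad + bc + bd", "a + b + c"])
after = literal_count(["a + b", "e + c", "e(c + d)"])
```

Here the literal count goes from 11 to 7 while the functions stay the same.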
Factorization • Based on division: • formulate candidate divisor; • test how it divides into the function; • if g = f/c, we can use c as an intermediate function for f. • Algebraic division: does not take Boolean simplification into account. Less expensive than Boolean division.
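Algebraic division can be sketched with cubes stored as sets of literals. This is a minimal illustration of weak division, assuming single-letter positive literals; `cube` and `algebraic_divide` are illustrative names, not part of any particular tool.

```python
def cube(text):
    """'ac' -> frozenset({'a', 'c'}); single-letter positive literals only."""
    return frozenset(text)

def algebraic_divide(f, divisor):
    """Return (quotient, remainder) with f = quotient * divisor + remainder,
    treating literals as plain symbols (no Boolean simplification)."""
    quotient = None
    for dc in divisor:
        # Cubes of f that contain this divisor cube, with it factored out.
        partial = {c - dc for c in f if dc <= c}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    covered = {q | dc for q in quotient for dc in divisor}
    remainder = set(f) - covered
    return quotient, remainder

f = {cube("ac"), cube("ad"), cube("bc"), cube("bd")}
quotient, remainder = algebraic_divide(f, {cube("a"), cube("b")})
```

Dividing ac + ad + bc + bd by a + b yields the quotient c + d with an empty remainder, so a + b is a usable intermediate function.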
Library-based Technology Mapping – MIS II • Three steps: decomposition, matching, covering • Circuit first decomposed into NAND representations • Different collections of NANDs can be implemented differently in VLSI • Example library: Inv, cost 2; NAND2, cost 3; AOI-21, cost 4
MIS II • Decompose into NAND-2 using Boolean techniques • Use dynamic programming to match subtrees with libraries • Choose lowest cost implementation that covers all primitives.
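The matching and covering steps can be sketched as dynamic programming over a tree of inverters and 2-input NANDs. This is an illustration, not the MIS II implementation: the tuple encoding, the `LIBRARY` patterns, and the function names are assumptions, and the AOI-21 pattern is listed in one child order only. The costs are the ones from the library slide (Inv 2, NAND2 3, AOI-21 4).

```python
from functools import lru_cache

# Pattern trees; "*" matches any subtree, which becomes an input of the gate.
LIBRARY = [
    ("INV", 2, ("inv", "*")),
    ("NAND2", 3, ("nand", "*", "*")),
    # AOI-21 computes (ab + c)' = inv(nand(nand(a, b), inv(c))).
    ("AOI21", 4, ("inv", ("nand", ("nand", "*", "*"), ("inv", "*")))),
]

def match(pattern, node):
    """Return the subtrees bound to '*' if pattern matches at node, else None."""
    if pattern == "*":
        return [node]
    if node[0] != pattern[0] or len(node) != len(pattern):
        return None
    leaves = []
    for sub_pattern, child in zip(pattern[1:], node[1:]):
        found = match(sub_pattern, child)
        if found is None:
            return None
        leaves += found
    return leaves

@lru_cache(maxsize=None)
def best_cover(node):
    """Return (minimum cost, gates used) for the subtree rooted at node."""
    if node[0] == "in":
        return 0, ()
    best = None
    for name, cost, pattern in LIBRARY:
        leaves = match(pattern, node)
        if leaves is None:
            continue
        total, gates = cost, (name,)
        for leaf in leaves:
            leaf_cost, leaf_gates = best_cover(leaf)
            total += leaf_cost
            gates += leaf_gates
        if best is None or total < best[0]:
            best = (total, gates)
    return best

a, b, c = ("in", "a"), ("in", "b"), ("in", "c")
tree = ("inv", ("nand", ("nand", a, b), ("inv", c)))
cost, gates = best_cover(tree)
```

For this tree, covering with individual inverters and NANDs would cost 2 + 3 + 3 + 2 = 10, while the single AOI-21 match costs 4, so the dynamic program selects it.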
Tech Mapping for LUTs • Minimize total number of LUTs • Minimize the number of levels of LUTs • Many different approaches • Partitioning -> Flowmap • BDDs -> XMAP • Covering -> Chortle • Basic Xilinx tech mapping follows Chortle with modifications to handle registers.
Chortle-crf • Dynamic programming approach • Primary goal: minimize # LUTs • Secondary goal: minimize # inputs the circuit root uses • Operates on AND-OR circuits. Locate boundaries [Figure: example network with nodes A to M and signals w, x, y, z]
Chortle-crf • Major innovation is bin packing • Simultaneously addresses decomposition and matching • Goal: Find decomposition of every node in the network that minimizes # LUTs in final circuit [Figure: with decomposition, 2-LUTs; without decomposition, 4-LUTs]
Mapping Each Tree • Dynamically visit each node in the graph • Fanin nodes drive the node under evaluation • Boxes -> fanin LUTs, cost is number of inputs • Bins -> N-input LUT (in this case 5)
First Fit Decreasing
/* construct 2-level decomp */
box list <- fanin LUTs sorted by size
bin list <- 0
while (box list is not 0) {
    box <- largest LUT remaining in box list
    find first bin that will contain box
    if bin doesn’t exist
        bin <- new bin holding box /* create new bin */
    else
        pack box into bin /* pack in existing */
}
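The First Fit Decreasing step can be sketched as ordinary bin packing. This is a minimal version under a simplifying assumption: each box is represented only by its input count, so inputs shared between fanin LUTs get no credit. The name `first_fit_decreasing` is illustrative.

```python
def first_fit_decreasing(box_sizes, k):
    """Pack boxes (fanin LUT input counts) into bins of capacity k (LUT inputs).
    Returns a list of bins, each a list of the box sizes packed into it."""
    bins = []
    for size in sorted(box_sizes, reverse=True):   # largest box first
        if size > k:
            raise ValueError("box does not fit in a single LUT")
        for contents in bins:                       # first bin with enough room
            if sum(contents) + size <= k:
                contents.append(size)
                break
        else:                                       # no bin fits: open a new one
            bins.append([size])
    return bins

packed = first_fit_decreasing([3, 2, 2, 2, 1], k=5)
```

With 5-input LUTs, the five fanin LUTs in this example pack into two second-level LUTs, both completely filled.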
Multi-Level Decomposition • Chain LUTs together • Output of largest second level LUT connected to LUT with unused input • May need to add a new LUT • Leads to min LUTs and fanout LUT with smallest # input • This fanout LUT used as input to next stage
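One reading of these bullets can be sketched as follows. The details are assumptions made for illustration: second-level LUTs are tracked only by how many inputs they use, the most filled LUT is always the next one to be connected, and it is connected to the first remaining LUT with a free input. `chain_luts` is a hypothetical name.

```python
def chain_luts(fills, k):
    """fills: used-input counts of the second-level LUTs; k: LUT input count.
    Returns (inputs used by the final root LUT, total number of LUTs)."""
    if not fills:
        raise ValueError("need at least one second-level LUT")
    luts = sorted(fills, reverse=True)
    total = len(luts)
    while len(luts) > 1:
        luts.pop(0)                       # most filled LUT; its output needs a home
        target = next((i for i, used in enumerate(luts) if used < k), None)
        if target is None:                # no unused input anywhere: add a new LUT
            luts.append(0)
            total += 1
            target = len(luts) - 1
        luts[target] += 1                 # the output occupies one input
        luts.sort(reverse=True)
    return luts[0], total

root_inputs, lut_count = chain_luts([5, 3], k=5)
```

With fills of 5 and 3, the full LUT feeds the spare input of the other, giving two LUTs and a root that uses four inputs. With two full LUTs, a third LUT has to be added to combine them.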
Examples [Figure: a) Fanin LUTs b) Two-level Decomposition c) Multi-level Decomposition, drawn over fanin LUTs u, v, w, x, y and added LUTs z.1, z.2]
Optimality • For LUTs with fewer than 6 inputs Chortle will create an optimal result for subtree • Combination of sub-trees is not optimized. • Local optimizations needed to ensure global optimality. • Reconvergent paths -> net drives multiple gates. • Replicating logic -> creating additional fanout
Translating a Design to an FPGA • Improve 2-level decomposition to take fanout into account • Replace FFD with an exhaustive search that repeatedly invokes FFD. • Try both with and without reconvergent path and select best mapping (forced merging) • Inputs must reconverge at node being decomposed.
Reconvergent Paths • Frequently, more than one pair of fan-in LUTs share inputs • For each combination of pairs that share inputs, perform FFD. • Two-level decomp with fewest bins and smallest least filled bin retained
Reconverge
pair list <- all pairs of fanin LUTs with shared inputs
best LUTs <- 0
for all possible combinations of pairs from pair list {
    merged LUTs <- copy of fanin LUTs with forced merge
    FFD(merged LUTs)
    keep result in best LUTs if better /* best combo */
}
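The exhaustive search can be sketched with fanin LUTs stored as sets of input names. This is an illustration under assumptions: a forced merge replaces a pair by the union of its inputs, each LUT takes part in at most one merge per trial, and FFD itself packs by size only. The names `ffd_fills` and `reconverge` are hypothetical.

```python
from itertools import combinations

def ffd_fills(sizes, k):
    """First fit decreasing on box sizes; returns the fill of each bin."""
    bins = []
    for size in sorted(sizes, reverse=True):
        for i, fill in enumerate(bins):
            if fill + size <= k:
                bins[i] += size
                break
        else:
            bins.append(size)
    return bins

def reconverge(boxes, k):
    """Try every combination of forced merges and keep the best decomposition."""
    pairs = [(i, j) for i, j in combinations(range(len(boxes)), 2)
             if boxes[i] & boxes[j] and len(boxes[i] | boxes[j]) <= k]

    def score(bins):
        # Fewest bins first, then the smallest least-filled bin.
        return (len(bins), min(bins, default=0))

    best = ffd_fills([len(b) for b in boxes], k)     # no forced merge
    for r in range(1, len(pairs) + 1):
        for chosen in combinations(pairs, r):
            used = [i for pair in chosen for i in pair]
            if len(used) != len(set(used)):
                continue                             # a box is merged at most once
            merged = [boxes[i] | boxes[j] for i, j in chosen]
            rest = [b for n, b in enumerate(boxes) if n not in used]
            candidate = ffd_fills([len(b) for b in merged + rest], k)
            if score(candidate) < score(best):
                best = candidate
    return best

best = reconverge([{"a", "b", "c"}, {"b", "c", "d"}], k=4)
```

Without merging, the two 3-input boxes need two 4-input LUTs. Merging them exploits the shared inputs b and c, and the union of four inputs fits in one LUT. The number of combinations grows quickly with the number of sharing pairs.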
Maximum Share Decreasing • Exhaustive search prohibitive • Select box using the following criteria: 1. Greatest # inputs 2. Shares greatest # inputs with any existing bin 3. Shares greatest # of inputs with existing (remaining) boxes • Reduces to FFD for no input sharing • Points 2 and 3 optimize network sharing
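The selection criteria can be sketched as a lexicographic priority. The slide only states how the next box is chosen; placing it in the fitting bin with which it shares the most inputs is an assumption of this sketch, as is the name `maximum_share_decreasing`.

```python
def maximum_share_decreasing(boxes, k):
    """boxes: fanin LUTs as sets of input names; k: LUT input count.
    Returns the bins as sets of distinct inputs."""
    remaining = [frozenset(b) for b in boxes]
    bins = []
    while remaining:
        def priority(box):
            share_bin = max((len(box & b) for b in bins), default=0)
            share_box = max((len(box & other) for other in remaining
                             if other is not box), default=0)
            # Criteria in order: size, sharing with a bin, sharing with other boxes.
            return (len(box), share_bin, share_box)

        box = max(remaining, key=priority)
        remaining.remove(box)
        # Shared inputs are counted once, so the union size decides the fit.
        fitting = [i for i, b in enumerate(bins) if len(b | box) <= k]
        if fitting:
            i = max(fitting, key=lambda n: len(bins[n] & box))
            bins[i] = bins[i] | box
        else:
            bins.append(set(box))
    return bins

bins = maximum_share_decreasing(
    [{"a", "b", "c"}, {"c", "d", "e"}, {"a", "b", "f"}], k=5)
```

In this example the first and third boxes share a and b, so they end up in one 5-input LUT using four distinct inputs. Packing purely by size would need three LUTs for three 3-input boxes.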
Node Replication • Apply replication to fanout nodes • Map without replication first • Locally decompose fanout nodes to determine savings • Ordering important [Figure: mapping without replication and with replication]
Results – Chortle-crf • 20 netlists mapped to 5-input LUTs • Reconvergence reduced LUTs by 2.7% • Replication reduced LUTs by 3.7% • Combined 14% reduction achieved • Replication exposes reconvergent paths creating additional opportunities for optimization.
Chortle-d • Minimize delay through circuit • Generally increases hardware required • Reduced logic levels by 38% • Increased # LUTs by 79% • Note: most delay in an FPGA is in the interconnect
Other Approaches • MIS-PGA • Groups inputs into LUTs • Decompose into 4-LUTs (Roth-Karp) • 47 times slower than Chortle • 14% fewer LUTs • XMAP • Represent circuit as BDDs • Effective for multiplexer based devices. • Also, BDS-PGA
Flowmap 1. Use network flow to partition circuit. 2. Determine point where minimum flow achieved for minimum cut 3. Cut until LUTs of size N achieved.
Taking Flip-flops into Account • FPGA devices contain fixed resources – FFs • Technology mapping should take these into account • Consider fanout nodes. [Figure: LUT network with two FFs]
LUT Packing - VPACK • Seed BLE – choose BLE with most inputs. • Select next BLE -> BLE which shares most inputs and outputs with cluster • Continue until cluster is full or adding any BLE would exceed I, the limit on # cluster inputs • Hill Climbing – exceed I limit temporarily to find better minimum.
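The greedy part of this packing can be sketched as below. The data layout is an assumption: each BLE is a dictionary with an `inputs` set and one `output` net, a cluster holds at most `n` BLEs, and nets produced inside a cluster do not count toward its input limit. Hill climbing is left out, and `vpack` here is an illustrative function, not the actual tool.

```python
def cluster_inputs(members, bles):
    """Nets a cluster must receive from outside."""
    used = set().union(*(bles[m]["inputs"] for m in members))
    produced = {bles[m]["output"] for m in members}
    return used - produced

def vpack(bles, n, i_limit):
    """Greedily pack BLEs into clusters of up to n BLEs and i_limit inputs."""
    unclustered = set(bles)
    clusters = []
    while unclustered:
        # Seed: the unclustered BLE with the most inputs (name order breaks ties).
        seed = max(sorted(unclustered), key=lambda b: len(bles[b]["inputs"]))
        cluster = [seed]
        unclustered.remove(seed)
        while len(cluster) < n:
            nets = set().union(*(bles[m]["inputs"] | {bles[m]["output"]}
                                 for m in cluster))
            legal = [b for b in sorted(unclustered)
                     if len(cluster_inputs(cluster + [b], bles)) <= i_limit]
            if not legal:
                break                     # any addition would exceed the limit
            # Next BLE: shares the most input and output nets with the cluster.
            nxt = max(legal, key=lambda b: len(
                (bles[b]["inputs"] | {bles[b]["output"]}) & nets))
            cluster.append(nxt)
            unclustered.remove(nxt)
        clusters.append(cluster)
    return clusters

example = {
    "A": {"inputs": {"x1", "x2", "x3"}, "output": "n1"},
    "B": {"inputs": {"n1", "x3"}, "output": "n2"},
    "C": {"inputs": {"y1", "y2"}, "output": "n3"},
    "D": {"inputs": {"n3", "y1"}, "output": "n4"},
}
clusters = vpack(example, n=2, i_limit=4)
```

BLE B joins A because it reads A's output n1 and shares x3, so the pair needs only three external inputs. C and D form the second cluster for the same reason.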
Summary • Many tech mapping algorithms exist to minimize delay/area • Chortle uses a dynamic programming heuristic to perform mapping • Largely a solved problem • More sophisticated techniques evaluated recently