Beyond the Arithmetic Constraint: Depth-Optimal Mapping of Logic Chains in LUT-based FPGAs
FPGA 2008
Michael T. Frederick and Arun K. Somani
michael.t.frederick@gmail.com, arun@iastate.edu
Iowa State University, Ames, IA, USA
Motivation
(Diagram labels: Artificial Constraint, High performance, Squeaky wheel, Overspecialized)
• HDL macros preserve chains through synthesis, technology mapping, clustering, placement, and routing
• Circuit delay is usually dominated by programmable interconnect; carry chains are one attempt to address this
• 0 ps wire delay for adjacent-cell interconnect in Stratix
• Carry logic delay is 58 ps
• LUT delay is 366 ps
• Programmable routing delay is typically in the range 300 ps to 2.0 ns
• About 70% of circuit delay is due to routing, 30% due to LUTs
• Next to no delay is due to carry chain connections
Logic Chains: The Generalization of Arithmetic
• (K-1)-LUT operating mode
  • Carry-select addition
  • Separate (K-1)-LUT functions (e.g. cout and sum)
  • Altera Stratix & Cyclone
• K-LUT operating mode
  • Single K-LUT function
  • Altera Stratix LUT Chain
  • Carry chain reuse cell (Frederick et al. 2007)
Technology Mapping for Chains
• FlowMap (Cong et al. 1994)
  • First polynomial time, optimal logic depth solution
  • Assumption is that LUTs are connected solely with routing
• Arbitrary net delay FlowMap (Cong et al. 1994)
  • Static estimated net delay
  • Not adapted to chain characteristics
• Carry chain reuse (Frederick et al. 2007)
  • Post-technology map heuristic methods for assigning carry chain nets
Redefining depth
• Minimize routing depth instead of logic depth
  • Chain net: negligible delay (~0 ps)
  • Routing net: non-zero delay (>0 ps)
• Chains are a series of depth increasing nodes
  • A depth increasing node is a Boolean node that increases logic depth, but not routing depth
• Exclusivity: a chain net is an exclusive connection between adjacent LEs
• ChainMap is inspired by, and incorporates, FlowMap concepts
• Labeling
  • Ascertain the optimal routing depth of each node
  • Identify depth increasing nodes
  • Compute the minimum routing and logic height cut for each node
• Mapping: form LUTs defined by minimum height cuts
• Duplication: comply with the exclusivity constraint
• Relaxation (optional): reduce the number of node duplications
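To make the two depth measures concrete, here is a minimal Python sketch. It is not taken from the slides: the `depths` function, the `fanin` dictionary layout, and the "chain"/"routing" tags are assumptions chosen for illustration. Every net adds one level of logic depth, while only routing nets add routing depth.

```python
from graphlib import TopologicalSorter

def depths(fanin):
    """fanin maps node -> list of (predecessor, net_kind); net_kind is
    "chain" or "routing". Primary inputs have no entry or an empty list.
    Returns (routing_depth, logic_depth) as two dicts."""
    graph = {n: [p for p, _ in ps] for n, ps in fanin.items()}
    routing, logic = {}, {}
    for n in TopologicalSorter(graph).static_order():
        ps = fanin.get(n, [])
        # Every net adds one level of logic depth.
        logic[n] = max((logic[p] + 1 for p, _ in ps), default=0)
        # Only routing nets add routing depth; chain nets are treated as free.
        routing[n] = max(
            (routing[p] + (kind == "routing") for p, kind in ps), default=0
        )
    return routing, logic

# Hypothetical three-node chain u1 -> u2 -> u3 fed by primary inputs a and b.
fanin = {
    "u1": [("a", "routing"), ("b", "routing")],
    "u2": [("u1", "chain"), ("b", "routing")],
    "u3": [("u2", "chain"), ("a", "routing")],
}
routing, logic = depths(fanin)
print(routing["u3"], logic["u3"])
```

In this toy network u1, u2, and u3 all end up at routing depth 1 while their logic depth grows from 1 to 3, which is the behaviour the slide attributes to depth increasing nodes.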
Preliminaries
• Cut: a partition of nodes in a DAG
• LUT(t): K-feasible set of nodes determined by a minimum routing height cut that implements t
• Routing height of node t is determined by cut height and depth increasing node
• Logic height of node t is determined by cut height
• Objective 1: minimize routing depth
• Objective 2: minimize logic depth
Labeling Step 1: Find a depth increasing node
• Construct the cone N_t of t in DAG N
  • Exclude all non-predecessors of t
  • Join all PIs with global source s
• Create P, the predecessors of t where g(u)=p
  • t is included in P
  • d can be any node in P
  • Special case: when t=d then LUT(t)=P
• Partition P using DFS(d,P)
  • P_d consists of chain nodes
  • ~P_d consists of nodes that join LUT(t)
  • Special case: when t=d then P_d=P, ~P_d=Ø
• If no edge exists between P_d-{d} and ~P_d then d is a valid depth increasing node
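The partition and validity test in Step 1 can be sketched in Python as follows. This is one reading of the slide, not the authors' code: it assumes DFS(d,P) walks fan-in edges from d while staying inside P, and the names `partition_at`, `is_valid_depth_increasing`, and `fanin` are hypothetical.

```python
def partition_at(d, P, fanin):
    """Split P into (P_d, rest). P_d is d plus every node of P reachable
    from d by following fan-in edges that stay inside P (assumed direction)."""
    P_d, stack = {d}, [d]
    while stack:
        n = stack.pop()
        for u in fanin.get(n, ()):
            if u in P and u not in P_d:
                P_d.add(u)
                stack.append(u)
    return P_d, set(P) - P_d

def is_valid_depth_increasing(d, P, fanin):
    """d is valid if no node of P_d other than d drives a node of ~P_d."""
    P_d, rest = partition_at(d, P, fanin)
    return not any(u in P_d - {d} for v in rest for u in fanin.get(v, ()))

# Hypothetical cone: c -> d -> t and x -> t; a and b have a lower label.
fanin = {"t": ["d", "x"], "d": ["c"], "c": ["a"], "x": ["b"]}
P = {"t", "d", "c", "x"}
print(partition_at("d", P, fanin))
print(is_valid_depth_increasing("d", P, fanin))
```

Choosing d = t reproduces the special case on the slide: every node of P is reachable from t, so P_d = P and ~P_d is empty.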
Labeling Step 2: Isolate depth increasing node
• Collapse all of P_d into d' and all of ~P_d into t'
  • Special case t=d results in collapse of all P_d into d'=t'
  • Precludes a cut that bisects the nodes in P_d or ~P_d
• If N_t' can be formed, then there is a valid depth increasing node
Labeling Step 3: Find min-height K-feasible cut
• Construct the flow residual graph N_t''
  • Split all nodes except {s, t}
  • Assign bridging edges a capacity of 1, all others ∞ capacity
• Apply Max-flow, Min-cut
  • If more than K augmenting paths are found, no K-feasible cut exists and LUT(t)=t
  • If K or fewer augmenting paths are found, DFS can be used to find the cut set
• Any node in P can be d
  • l(u) is used to select the cut of minimum logic height from among the K-feasible minimum routing height cuts
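The K-feasibility test in Step 3 follows the standard node-splitting max-flow construction, which can be sketched in Python as below. This covers only the cut search on an already collapsed network. The function name `k_feasible_cut` and the edge-list input are assumptions, BFS is used for both the augmenting paths and the final reachability search where the slide mentions DFS, and the tie-break on l(u) is left out.

```python
from collections import defaultdict, deque

INF = float("inf")

def k_feasible_cut(edges, s, t, K):
    """Return a minimum node cut between s and t with at most K nodes,
    or None if more than K augmenting paths exist."""
    nodes = {n for e in edges for n in e}
    cap = defaultdict(lambda: defaultdict(int))  # residual capacities

    def head(n):  # where edges enter a (possibly split) node
        return n if n in (s, t) else (n, "in")

    def tail(n):  # where edges leave a (possibly split) node
        return n if n in (s, t) else (n, "out")

    for n in nodes - {s, t}:
        cap[head(n)][tail(n)] = 1  # bridging edge, capacity 1
    for u, v in edges:
        cap[tail(u)][head(v)] = INF  # original edges, unbounded

    def reachable():
        parent, queue = {s: None}, deque([s])
        while queue:
            x = queue.popleft()
            for y, c in list(cap[x].items()):
                if c > 0 and y not in parent:
                    parent[y] = x
                    queue.append(y)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if t not in parent:
            break
        flow += 1
        if flow > K:
            return None  # no K-feasible cut
        y = t
        while parent[y] is not None:  # push one unit along the path
            x = parent[y]
            cap[x][y] -= 1
            cap[y][x] += 1
            y = x
    seen = reachable()
    # Cut nodes are those whose saturated bridging edge leaves the reachable set.
    return {n for n in nodes - {s, t} if head(n) in seen and tail(n) not in seen}

edges = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "x"), ("b", "x"),
         ("b", "y"), ("c", "y"), ("x", "t"), ("y", "t")]
print(k_feasible_cut(edges, "s", "t", 2))
print(k_feasible_cut(edges, "s", "t", 1))
```

With K=2 the sketch returns the two-node cut {x, y}. With K=1 a second augmenting path is found, so it returns None, which corresponds to the LUT(t)=t case on the slide.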
Mapping
• Using minimum height cuts, generate a mapping solution
• Exactly as in FlowMap
  • Create set T, containing all POs
  • For each t ∈ T, use its K-feasible cut to create LUT(t)=t'
  • Update T to be (T-{t}) ∪ input(t')
  • Repeat process until all nodes in T are PIs
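The mapping loop above is a simple worklist procedure, sketched here in Python. It assumes labeling has already produced, for every node, the set of nodes feeding its LUT; the names `map_network` and `cut_inputs` are hypothetical.

```python
def map_network(primary_outputs, primary_inputs, cut_inputs):
    """cut_inputs[t] is the set of nodes feeding LUT(t), as chosen by labeling.
    Returns a dict mapping each LUT root to its input set."""
    luts = {}
    work = [t for t in primary_outputs if t not in primary_inputs]  # the set T
    while work:
        t = work.pop()
        if t in luts:
            continue  # already mapped through another path
        luts[t] = set(cut_inputs[t])
        # T := (T - {t}) ∪ input(t'), skipping PIs and nodes already mapped
        work.extend(
            u for u in luts[t] if u not in primary_inputs and u not in luts
        )
    return luts

# Hypothetical cuts: node y is never reached, so it is absorbed or unused.
cut_inputs = {"t": {"x", "a"}, "x": {"a", "b"}, "y": {"b", "c"}}
print(map_network(["t"], {"a", "b", "c"}, cut_inputs))
```

Only nodes that appear as an input of some generated LUT become LUT roots themselves, which is why y produces no LUT in this example.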
Duplication: The Exclusivity Constraint
• If either u or v has K inputs, u and v must be equivalent
• If |input(u) ∪ input(v)| < K, u and v can compute different functions
• u must generate a chain output and v a routing output, or vice versa
• input(v) does not contain u and input(u) does not contain v
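One way to read these four conditions is as a predicate that decides whether two LUT functions u and v may share a single LE. The Python sketch below encodes that reading. The `Lut` record, its `function` field (a stand-in for some canonical form of the Boolean function), and the treatment of "equivalent" as "same function over the same inputs" are all assumptions, not details from the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lut:
    name: str
    inputs: frozenset
    output: str    # "chain" or "routing"
    function: str  # hypothetical canonical form of the Boolean function

def can_share_le(u, v, K):
    """Check the slide's four conditions for packing u and v into one LE."""
    # One output must be a chain net and the other a routing net.
    if {u.output, v.output} != {"chain", "routing"}:
        return False
    # Neither function may be an input of the other.
    if u.name in v.inputs or v.name in u.inputs:
        return False
    # A full K-input function leaves no room for a different one.
    if len(u.inputs) == K or len(v.inputs) == K:
        return u.function == v.function and u.inputs == v.inputs
    # Otherwise the combined support must fit the (K-1)-LUT mode.
    return len(u.inputs | v.inputs) < K

abc = frozenset("abc")
u = Lut("u", abc, "chain", "a&b|c")
v = Lut("v", abc, "routing", "a^b^c")
print(can_share_le(u, v, K=4))
```

Here u and v compute different functions over the same three inputs, so with K=4 their combined support is below K and they can share an LE under this reading.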
Duplication
• Traverse N in reverse topological order
• If t violates exclusivity, it is duplicated as t'
• Remap LE {u,v} ∈ output(t)
• Area bounded by O(n^2)
  • Worst case expansion occurs when each chain from s to its descendant POs is duplicated
  • For a chain with n nodes, O(n) × O(n) possible duplications
Relaxation
• Process nodes in reverse topological order (PO to PI)
• For each node t, identify LEs {u,v} for all u,v ∈ output(t)
  • If only 1 LE, leave alone
  • If more than 1 LE, relax all of the shortest paths to POs
• Designed to target long logic chains, typified by arithmetic
Experimental Flow
• Quartus II: HDL elaboration to VQM format (Malhotra et al. 2004)
• SIS: synthesis, technology decomposition, and chain extraction of VQM netlist (Sentovich et al. 1992)
• DMIG, FlowPack: K-bounding, logic network reduction (Chen et al. 1992)
• Four different experimental flows
  • forget: don't use chains after elaboration
  • before: insert chains after synthesis and before ChainMap
  • after: insert chains after ChainMap
  • normal: insert chains after FlowMap
Optimal Results
• Speedup increases as routing becomes more costly
• Area often not feasible
• cfft, K=5, before, 2.07x
• before, forget outperform after
• before, forget closely mirror each other (<5%)
• Growth rate with G:L of K=[4,5] nearly the same, but K=6 slightly lower due to fewer nets
Relaxed Results
• before, forget closely mirror each other (<5%)
• before, forget trend differently for K=[4,6]
  • As G:L and K increase, ability to mask relaxed nets decreases
  • Vulnerability of relaxation technique for higher G:L
• after is always increasing
  • Few opportunities for tool-generated chains, yielding fewer relaxations
  • HDL chains more resilient against increased G:L
A closer look…
• Optimal routing depth is always as good as or better than traditional
• Ubiquitous speedup for optimal solutions
• In some cases, relaxation technique hinders performance
• LUT consumption often reduced
Conclusions & Future Work
• Formally, a generic logic chain is a subnetwork of adjacent nodes with equal routing depth and increasing logic depth
  • L ⊂ N s.t. g(u_j) = g(u_i), l(u_j) = l(u_i) + 1, for all u_i, u_j ∈ L
• Polynomial O(n^3) time identification of generic logic chains
• Eliminates the need to preserve HDL
• Finish full design flow experiments, including clustering, placement, and routing
• Develop more creative & effective relaxation techniques
• Explore architectures that more effectively support generic chains
• Develop CAD tools that no longer depend on HDL preservation
Questions? Thanks!
References
• Altera. Stratix Series User Guides. www.altera.com.
• J. Cong and Y. Ding. FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(1):1-12, 1994.
• A. Farrahi and M. Sarrafzadeh. Complexity of the lookup-table minimization problem for FPGA technology mapping. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(11):1319-1332, 1994.
• L. R. Ford and D. R. Fulkerson. Flows in Networks. Princeton Univ. Press, Princeton, NJ, 1962.
• M. Frederick and A. Somani. Non-arithmetic carry chains for reconfigurable fabrics. In Proceedings of the 15th International Conference on Computer Design, pages 137-143, October 2007.
• S. Malhotra, T. Borer, D. Singh, and S. Brown. The Quartus University Interface Program: enabling advanced FPGA research. In Proceedings of the 2004 IEEE Int'l Conference on Field-Programmable Technology, pages 225-230, Dec. 2004.
• OpenCores. www.opencores.org.
• E. Sentovich, K. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. Stephan, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. SIS: A system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, EECS Department, University of California, Berkeley, 1992.
• S. Singh, J. Rose, P. Chow, and D. Lewis. The effect of logic block architecture on FPGA performance. Journal of Solid-State Circuits, 27:281-287, March 1992.
Abstract
Look-up table based FPGAs have migrated from a niche technology for design prototyping to a valuable end-product component and, in some cases, a replacement for general purpose processors and ASICs alike. One way architects have bridged the performance gap between FPGAs and ASICs is through the inclusion of specialized components such as multipliers, RAM modules, and microcontrollers. Another dedicated structure that has become standard in reconfigurable fabrics is the arithmetic carry chain. Currently, it is only used to map arithmetic operations as identified by HDL macros. For non-arithmetic operations, it is an idle but potentially powerful resource. This work presents ChainMap, a polynomial-time delay-optimal technology mapping algorithm for the creation of generic logic chains in LUT-based FPGAs. ChainMap requires no HDL macros be preserved through the design flow. It creates logic chains, both arithmetic and non-arithmetic, in an arbitrary Boolean network whenever depth increasing nodes are encountered. Use of the chain is not reserved for arithmetic, but rather any set of gates exhibiting similar characteristics. By using the carry chain as a generic, near zero-delay adjacent cell interconnection structure an average optimal speedup of 1.4x is revealed, and an average relaxed speedup of 1.25x can be realized simultaneously with a 0.95x LUT utilization decrease.
Proof of increasing routing depth
(Slide outline: Given, Show, Case 1, Case 2, Case 3; the equations are not preserved in this transcript.)