Bitwidth-Aware Scheduling and Binding in High-Level Synthesis
X. Cheng+, J. Cong, Y. Fan, G. Han, J. Lin, J. Xu+, Z. Zhang
Computer Science Department, UCLA
+Microprocessor Development and Research Center, PKU
Outline
• Motivation
• Bitwidth-aware synthesis flow
• Scheduling and binding to minimize total bits of functional units (FU)
• Minimum weighted-interval-graph coloring problem for register allocation and binding
• Experimental results
• Conclusion
Motivation
• High-level languages
  • Big gap between design productivity and complexity
  • Alleviate the design complexity
  • Need to produce high-quality products
• Need to consider multi-bitwidth
  • Recent research shows there are 40% redundant bits in programs of high-level languages [Stephenson et al, SIGPLAN’00]
  • Hardware resource cost will be reduced with consideration of multi-bitwidth
    • Area is proportional to input bitwidth for adders and registers, and is proportional to the square of input bitwidth for multipliers
    • Wire-length is reduced accordingly
  • Conventional high-level synthesis only focuses on resources with uniform bitwidth
Motivational Example: Impact of Bitwidth
[Figure: two schedulings and bindings of the same DFG, with additions of bitwidths 5, 16, 18, and 26 and multiplications of sizes 16*4, 18*6, 24*16, and 32*16. A multiplication takes 3 clock cycles and an addition takes 1 clock cycle; the execution time is 8 clock cycles in both cases.]
• Adders: bitwidths 26 and 5 instead of 26 and 18 (30% saving)
• Multipliers: 32x16 and 18x6 instead of 32x16 and 24x16 (31% saving)
Related Works
• High-level synthesis with consideration of bitwidth
  • ILP formulation [Constantinides et al, IEEE Electronics Letters’00]
  • Heuristic solution [Kum et al ’01] [Constantinides et al, DATE’01]
  • Split adders into 1-bit units [Molina et al, DAC’02]
  • Partially guarded computation [Choi et al, ISLPED’00]
• Limitations
  • No consideration of interconnect delay in scheduling and binding, although interconnect delays dominate the timing in deep-submicron (DSM) technologies
  • No optimality evaluation of the proposed solutions for register allocation and binding
Bitwidth-Aware Synthesis Flow
• Multiple-bitwidth scheduling and binding problem
  • Given: (1) a data-flow graph (DFG) annotated with bitwidths, (2) a time constraint, (3) placement information of functional units, and (4) a resource IP library, where each resource type has arbitrary bitwidth configurations, each of which is associated with an area cost
  • Objective: schedule and bind the DFG onto the library, taking into account the interconnect delay derived from placement and without violating the time constraint, such that the final area of the required resources is minimized
RDR+MCAS
[Figure: RDR micro-architecture. The chip is an array of islands (width Wi, height Hi); each island is a cluster with an area constraint that contains a Local Computational Cluster (LCC) with units such as MUL, ALU, and MUX, a register file, and an FSM. Global interconnect between islands takes 1 cycle, 2 cycles, ..., K cycles.]
• One solution for multi-cycle on-chip communication
• Regular Distributed Register (RDR) micro-architecture [Cong et al, ISPD’03] [Cong et al, ICCAD’03]
  • The whole chip is divided into an array of islands
  • Choose the island size such that local computation and communication in each island can be done in a single cycle
• MCAS: Architectural Synthesis for Multi-cycle Communication
  • Efficiently maps behavioral descriptions to the RDR micro-architecture
  • Integrates architectural synthesis with physical planning
  • Provides placement information of functional units
Scheduling and Binding
• Lower bound estimation of FU bitwidth for a DFG
  • Prior works focus on the number of FUs
• Lower-bound-based simultaneous scheduling and binding
  • Time constrained
  • Consider the interconnect delay obtained from placement information given by MCAS
Lower Bound Estimation
• Extend the interval-based technique of [Sharma et al, 93] to support multi-bitwidth FUs
• Main idea
  • Compute the minimum resource requirement R(p, q) for each time interval [p, q] ⊆ [1, T]
  • The maximum of R(p, q) over all intervals is the final bitwidth lower bound
Example of Lower-bound Estimation
[Figure: ASAP and ALAP schedules of the example DFG over control steps 1 to 8, with additions of bitwidths 5, 16, 18, and 26 and multiplications a (18*6), b (24*16), c (32*16), and d (16*4).]
• The minimum overlap between the multiplications a, b, c, and d and interval [4,7]
  • O(a, 4, 7) = 1 (a is 18*6)
  • O(b, 4, 7) = 2 (b is 24*16)
  • O(c, 4, 7) = 1 (c is 32*16)
  • O(d, 4, 7) = 1 (d is 16*4)
• The minimum bitwidth requirement for multipliers in interval [4,7]
  • The operation bitwidths that must be executed in [4,7] are {18, 24, 24, 32, 16}
  • Sorted: {32, 24, 24, 18, 16}
  • The minimum bitwidth requirement for multipliers in [4,7] will be R(4, 7) = {32, 16}
• Theorem: For any feasible scheduling, the minimum overlap between operation o and interval [p,q] is
  O(o, p, q) = min{ |Lifetime_ASAP ∩ [p, q]|, |Lifetime_ALAP ∩ [p, q]| }
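The interval [4,7] example can be traced with a short Python sketch. It assumes each multiplication takes 3 clock cycles, as stated in the motivational example. The ASAP and ALAP start steps are hypothetical values chosen only to reproduce the overlaps listed on the slide, since the figure itself is not recoverable. The chunking rule in `interval_requirement` is one reading of how R(4, 7) = {32, 16} follows from the sorted list: one FU can serve at most q - p + 1 operation-cycles inside the interval, so every group of that many entries needs its own FU, as wide as the group's widest entry.

```python
def overlap(start, latency, p, q):
    """Number of control steps in [p, q] occupied by an op starting at `start`."""
    end = start + latency - 1
    return max(0, min(end, q) - max(start, p) + 1)

def min_overlap(asap, alap, latency, p, q):
    """O(o, p, q): the smaller of the ASAP and ALAP overlaps with [p, q]."""
    return min(overlap(asap, latency, p, q), overlap(alap, latency, p, q))

def interval_requirement(ops, latency, p, q):
    """R(p, q): FU bitwidths needed in [p, q] (assumed chunking rule)."""
    slots = []
    for width, asap, alap in ops:
        # One entry per control step the op must spend inside [p, q].
        slots += [width] * min_overlap(asap, alap, latency, p, q)
    slots.sort(reverse=True)
    length = q - p + 1
    # The widest entry of each group of `length` slots sizes one FU.
    return slots[::length]

# name -> (first-operand bitwidth, ASAP start, ALAP start); starts are hypothetical
MULS = {"a": (18, 2, 3), "b": (24, 5, 6), "c": (32, 2, 6), "d": (16, 2, 6)}
overlaps = {n: min_overlap(s, l, 3, 4, 7) for n, (_, s, l) in MULS.items()}
requirement = interval_requirement(MULS.values(), 3, 4, 7)
print(overlaps)      # {'a': 1, 'b': 2, 'c': 1, 'd': 1}
print(requirement)   # [32, 16]
```

As on the slide, only the first operand's bitwidth (18 for 18*6) is used to rank the multiplications.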
Area Cost
• The weighted-area lower bound of an unscheduled DFG is defined as A = A_add + α · A_mul, where A_add is the area for adders, A_mul is the area for multipliers, and α is a ratio weight of multiplier area over adder area
• For a partially scheduled DFG, the scheduling status S records the control steps of scheduled operations and the feasible control steps of unscheduled operations
• A is calculated the same way, denoted as A(S)
Scheduling and Binding Algorithm-1
• Goal: Minimize the area cost of required FUs
  • Consider interconnect delay
• Basic idea
  • In each step, schedule an operation at a control step such that the resulting weighted-area lower bound A(S) is kept as small as possible
• How to choose an operation and one of its feasible control steps
[Figure: example with one add-16 whose feasible control steps are [1,2] and two add-32 whose feasible control steps are [2,3]. The lower bounds shown for the candidate assignments are A(16,1) = 48, A(16,2) = 48, A(32,2) = 64, and A(32,3) = 48.]
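The basic idea can be sketched in Python under strong simplifying assumptions: only single-cycle adders, no data dependencies, no interconnect delay, and area measured as total adder bits. The function names and the tie-breaking order are illustrative choices, not taken from the slides, and the sketch does not attempt to reproduce the A values in the figure, which belong to a larger DFG. `lower_bound` plays the role of A(S): an operation whose feasible range lies entirely inside an interval must execute there, and the forced operations are grouped as in the interval-based estimation.

```python
def lower_bound(ranges, widths, total_steps):
    """A(S) for single-cycle adders: max over intervals of the adder bits needed."""
    best = 0
    for p in range(1, total_steps + 1):
        for q in range(p, total_steps + 1):
            # Ops whose whole feasible range lies in [p, q] must run inside it.
            forced = sorted((widths[o] for o, (lo, hi) in ranges.items()
                             if p <= lo and hi <= q), reverse=True)
            # One adder serves at most (q - p + 1) of them; size it by the widest.
            best = max(best, sum(forced[::q - p + 1]))
    return best

def greedy_schedule(ranges, widths, total_steps):
    """Fix one (operation, step) pair at a time, keeping A(S) as small as possible."""
    ranges = dict(ranges)
    pending = sorted(ranges)
    while pending:
        best = None
        for op in pending:
            lo, hi = ranges[op]
            for step in range(lo, hi + 1):
                trial = {**ranges, op: (step, step)}
                cost = lower_bound(trial, widths, total_steps)
                if best is None or cost < best[0]:
                    best = (cost, op, step)
        _, op, step = best
        ranges[op] = (step, step)
        pending.remove(op)
    return {op: lo for op, (lo, _) in ranges.items()}

# Hypothetical instance: one 16-bit and two 32-bit additions, 3 control steps.
WIDTHS = {"add16": 16, "add32a": 32, "add32b": 32}
RANGES = {"add16": (1, 2), "add32a": (2, 3), "add32b": (2, 3)}
schedule = greedy_schedule(RANGES, WIDTHS, 3)
final_bits = lower_bound({o: (s, s) for o, s in schedule.items()}, WIDTHS, 3)
print(schedule, final_bits)
```

On this instance the greedy choice spreads the three additions over three steps, so a single 32-bit adder suffices. Placing both 32-bit additions in step 2 would raise the bound to 64 bits.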
Scheduling and Binding Algorithm-2
• Simultaneous scheduling and binding with consideration of interconnect delay
• After an operation o and a control step c are chosen, FU binding is performed to decide whether o can finally be scheduled at step c
  • There is an available FU usable by o at step c
  • Data dependence between o and its scheduled and bound predecessors and successors is maintained
[Figure: a multiplication and additions scheduled over steps 1 to 3 and bound to MUL and ADD units in different islands, with a 1-clock-cycle interconnect delay between the islands.]
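The two binding conditions can be written as a small feasibility check. This is a minimal sketch, assuming single-cycle operations, checking predecessors only, and using the Manhattan distance between island coordinates as the interconnect delay in cycles. The function names, the data layout, and the delay model are hypothetical and are not taken from MCAS.

```python
def island_delay(src, dst):
    """Assumed delay model: cycles equal the Manhattan distance between islands."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def can_bind(step, fu, fu_island, busy, preds):
    """Return True if an operation can start at `step` on functional unit `fu`.

    busy:  set of (fu, step) pairs that are already occupied.
    preds: list of (finish_step, island) for scheduled and bound predecessors.
    """
    # Condition 1: the FU must be free at this control step.
    if (fu, step) in busy:
        return False
    # Condition 2: each operand must have arrived, including transfer cycles.
    return all(finish + island_delay(src, fu_island) < step
               for finish, src in preds)

# A multiplication finishes in step 1 on island (0, 0).
preds = [(1, (0, 0))]
same_island = can_bind(2, "ADD0", (0, 0), set(), preds)   # no transfer needed
too_early = can_bind(2, "ADD1", (0, 1), set(), preds)     # 1-cycle transfer
on_time = can_bind(3, "ADD1", (0, 1), set(), preds)
occupied = can_bind(3, "ADD1", (0, 1), {("ADD1", 3)}, preds)
print(same_island, too_early, on_time, occupied)
```

An adder in the neighbouring island can therefore consume the product one step later than an adder in the same island.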
Register Allocation and Binding
• Problem formulation
  • Given: A scheduled DFG annotated with bitwidth
  • Objective: Perform register allocation and binding to minimize the total bitwidth of registers
• Register allocation
  • Decide the minimum required registers
• Register binding
  • Explicitly map variables to register instances
Preliminaries
[Figure: a scheduled DFG, the lifetimes of its variables (bitwidths 5, 18, 24, and 16), and the corresponding weighted interval graph. The coloring shown has weight 24 + 16 + 18 = 58.]
• Lifetime of a variable
  • s(o): the control step where variable o is produced
  • e(o): the last control step where variable o is consumed
• A proper coloring of the weighted interval graph G corresponds to a register allocation and binding scheme
• Weight of a coloring scheme
  • The weight of color c is W(c) = max{ w(v) | v is colored with c }
  • The weight of the coloring scheme P is defined as W(G, P) = Σ_c W(c), summed over all colors c used by P
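The weight definitions translate directly into code. The sketch below uses four variables with the bitwidths from the figure. Their lifetimes are hypothetical, since the figure is not recoverable, and are treated as half-open intervals [s, e), which is an assumption about how adjacent lifetimes may share a register.

```python
# name -> (bitwidth, start, end); lifetimes are hypothetical, half-open [start, end)
VARS = {"v24": (24, 1, 3), "v18": (18, 3, 5), "v16": (16, 2, 4), "v5": (5, 2, 3)}

def is_proper(variables, coloring):
    """True if no two variables with overlapping lifetimes share a color."""
    names = list(variables)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            _, su, eu = variables[u]
            _, sv, ev = variables[v]
            if coloring[u] == coloring[v] and su < ev and sv < eu:
                return False
    return True

def coloring_weight(variables, coloring):
    """W(G, P): sum over colors of the widest variable given that color."""
    widest = {}
    for name, color in coloring.items():
        widest[color] = max(widest.get(color, 0), variables[name][0])
    return sum(widest.values())

# Three registers: r0 holds v24, r1 holds v16, r2 is shared by v18 and v5.
coloring = {"v24": "r0", "v16": "r1", "v18": "r2", "v5": "r2"}
proper = is_proper(VARS, coloring)
weight = coloring_weight(VARS, coloring)
print(proper, weight)   # True 58
```

Here register r2 must be 18 bits wide even though it also stores the 5-bit variable, which is why the choice of which variables share a register affects the total.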
Coloring Problem
• Weighted-interval-graph coloring problem
  • Given: A weighted interval graph G(V, E)
  • Objective: Find a coloring scheme P of G such that the weight of the coloring scheme, W(G, P), is minimized
• Uniform weights
  • Can be solved in polynomial time (left-edge algorithm)
• Various weights
  • The complexity remains unknown
  • We propose a lower-bound estimation and an efficient algorithm
Lower-Bound Estimation
[Figure: a scheduled DFG and the lifetimes of its variables, with bitwidths 5, 18, 24, and 16.]
• |C24| = 1, |C18| = 1, |C16| = 2, |C5| = 3
• Bitwidth lower bound: 24*1 + 16*1 + 5*1 = 45
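One reading of this slide, stated here as an assumption, is that C_w is the largest set of simultaneously live variables whose bitwidth is at least w. Any binding needs at least |C_w| registers of width w or more, so each increase in |C_w| as w decreases adds one register of width w to the bound. The Python sketch below implements that reading. The lifetimes are hypothetical half-open intervals chosen so that the clique sizes match the slide.

```python
def max_density(intervals):
    """Largest number of half-open intervals [s, e) that are live at once."""
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    live = best = 0
    for _, delta in events:   # at equal times, ends (-1) sort before starts (+1)
        live += delta
        best = max(best, live)
    return best

def register_lower_bound(variables):
    """Return (bitwidth lower bound, {w: |C_w|}) for (width, start, end) tuples."""
    bound, previous, cliques = 0, 0, {}
    for w in sorted({width for width, _, _ in variables}, reverse=True):
        size = max_density([(s, e) for width, s, e in variables if width >= w])
        cliques[w] = size
        bound += w * (size - previous)   # extra registers needed at width w
        previous = size
    return bound, cliques

# (bitwidth, start, end); lifetimes are hypothetical
VARIABLES = [(24, 1, 3), (18, 3, 5), (16, 2, 4), (5, 2, 3)]
bound, cliques = register_lower_bound(VARIABLES)
print(cliques)   # {24: 1, 18: 1, 16: 2, 5: 3}
print(bound)     # 45
```

The 18-bit variable contributes nothing because it never overlaps the 24-bit one, so |C18| does not exceed |C24|.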
Coloring Algorithm
[Figure: the scheduled and bound DFG and the lifetimes of its variables, with bitwidths 5, 18, 24, and 16.]
• Weight of coloring: 24*1 + 16*1 + 5*1 = 45
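The slide shows only the result of the coloring algorithm, not its steps. As a stand-in, and not as the authors' algorithm, the Python sketch below uses a simple heuristic: visit variables from widest to narrowest and place each one in the first register whose current contents do not overlap it. The lifetimes are hypothetical half-open intervals, and the function name is illustrative.

```python
def bind_registers(variables):
    """Greedy first-fit binding, widest variable first.

    variables: dict name -> (bitwidth, start, end), lifetimes half-open [start, end).
    Returns (binding, total_bits), where binding maps each name to a register index.
    """
    registers = []   # each entry: {"width": int, "vars": [names]}
    binding = {}
    order = sorted(variables, key=lambda n: variables[n][0], reverse=True)
    for name in order:
        width, start, end = variables[name]
        for index, reg in enumerate(registers):
            # Compatible if the lifetime is disjoint from every variable in the register.
            if all(end <= variables[o][1] or variables[o][2] <= start
                   for o in reg["vars"]):
                reg["vars"].append(name)
                binding[name] = index
                break
        else:
            # The first variable in a register is its widest, so it sets the width.
            registers.append({"width": width, "vars": [name]})
            binding[name] = len(registers) - 1
    return binding, sum(reg["width"] for reg in registers)

# name -> (bitwidth, start, end); lifetimes are hypothetical
VARS = {"v5": (5, 2, 3), "v18": (18, 3, 5), "v24": (24, 1, 3), "v16": (16, 2, 4)}
binding, total_bits = bind_registers(VARS)
print(binding, total_bits)
```

On this instance the heuristic lets the 18-bit variable share the 24-bit register, giving 24 + 16 + 5 = 45 bits. A greedy order like this is not guaranteed to be optimal in general.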
Experimental Results: Three Synthesis Flows
• Flow1 (MCAS)
  • MCAS generates the scheduling and binding results and placement information. All operations and variables have uniform bitwidth (32 bits).
• Flow2 (MCAS+MB-PP)
  • Perform a bitwidth post-processing step after Flow1 is done: set the bitwidth of each FU to the maximum bitwidth of all operations executed on it, and set the bitwidth of each register to the maximum bitwidth of all variables stored in it.
• Flow3 (MCAS-MB)
  • After MCAS generates the scheduling and binding results and placement, the lower-bound-based scheduling and binding and the bitwidth-aware register allocation and binding are performed.
• All three flows share the same backend to generate the datapath and controllers
• Altera’s Quartus II version 2.2 0 is used to synthesize the resulting RTL VHDL onto the FPGA device Stratix™ EP1S80F1508C6
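The Flow2 post-processing step amounts to a maximum over each resource's bound items. A minimal Python sketch follows. The binding, the bitwidths, and the function name are hypothetical illustrations and are not results from the experiments.

```python
def postprocess_bitwidths(binding, widths):
    """Size each resource by the widest item (operation or variable) bound to it."""
    sized = {}
    for item, resource in binding.items():
        sized[resource] = max(sized.get(resource, 0), widths[item])
    return sized

# Hypothetical Flow1 result: four additions bound to two adders.
binding = {"op1": "ADD0", "op2": "ADD0", "op3": "ADD1", "op4": "ADD1"}
widths = {"op1": 26, "op2": 5, "op3": 18, "op4": 16}

sized = postprocess_bitwidths(binding, widths)
uniform_bits = 32 * len(sized)          # Flow1: every adder is 32 bits wide
trimmed_bits = sum(sized.values())      # Flow2: adders trimmed after binding
print(sized, uniform_bits, trimmed_bits)
```

Because Flow2 trims widths only after binding, a narrow operation that shares an FU with a wide one saves nothing. Flow3 addresses this by taking bitwidth into account during scheduling and binding.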
Experimental Results: Comparison of the Three Synthesis Flows
• LE: Area of the datapath and control logic, in logic elements
• WL: Wire-length
Conclusions
• We presented a complete bitwidth-aware high-level synthesis flow based on the MCAS synthesis system
• Experimental results
  • Our bitwidth-aware synthesis flow achieves significant reductions in area and wire-length
References
• J. Choi, J. Jeon, and K. Choi, “Power Minimization of Functional Units by Partially Guarded Computation,” Proc. of ISLPED, 2000.
• J. Cong, Y. Fan, X. Yang, and Z. Zhang, “Architecture and Synthesis for Multi-Cycle Communication,” Proc. of International Symposium on Physical Design, 2003.
• J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and Synthesis for On-Chip Multicycle Communication,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2004.
• G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Optimal Datapath Allocation for Multiple-Wordlength Systems,” IEEE Electronics Letters, 2000.
• G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Heuristic Datapath Allocation for Multiple Wordlength Systems,” Proc. of Design, Automation and Test in Europe (DATE), 2001.
• K. Kum and W. Sung, “Combined Word-Length Optimization and High-Level Synthesis of Digital Signal Processing Systems,” IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, 2001.
• M. C. Molina, J. M. Mendias, and R. Hermida, “High-Level Synthesis of Multiple-Precision Circuits Independent of Data-Objects Length,” Proc. of the 39th Design Automation Conference, 2002.
• A. Sharma and R. Jain, “Estimating Architectural Resources and Performance for High-Level Synthesis Applications,” IEEE Trans. on VLSI Systems, 1993.
• M. Stephenson, J. Babb, and S. Amarasinghe, “Bitwidth Analysis with Application to Silicon Compilation,” Proc. of the ACM SIGPLAN'2000 Conference on Programming Language Design and Implementation, 2000.