Bitwidth-Aware High-Level Synthesis for Designing Low-Power DSP Applications

Bitwidth-Aware High-Level Synthesis for Designing Low-Power DSP Applications G. Lhairech-Lebreton, P. Coussy, D. Heller, E. Martin University Bretagne-Sud / Lab-STICC ghizlane.lhairech@univ-ubs.fr December 14, 2010

Outline • Introduction • The proposed approach • Experimental results • Conclusion & perspectives

Context Golden specification (floating point) Signal to Noise Ratio MATLAB, C, … Refinement methodologies Refinement Finite precision specification C, SystemC, Algorithmic C, … Automatic design Manual design High-Level Synthesis VHDL, VERILOG, … Uniform or bit-accurate datapath

Allocation Scheduling Operator binding Register binding High-Level Synthesis Finite precision specification • Translates the specification into an internal format Compilation Internal format • Defines the number of operators Constraints • Defines the execution start time for each operation • Chooses the operator to execute the operation operator library • Defines the number of registers to store the data Uniform RTL Architecture

Bitwidth-Aware High-Level Synthesis Finite precision specification • Keep the bitwidth information in the internal format Compilation Bit-accurate Internal format Internal format • Group the operation into clusters (define new types) Clustering Allocation Constraints Scheduling Operator binding operator library Register binding • Resize the datapath Resizing Bit-accurate Architecture

Bitwidth-Aware High-Level Synthesis Finite precision specification Compilation • Related works propose HLS flows that either • consider bitwidth only in particular synthesis steps to design area optimized architectures or • consider uniform bit-width specification to design low-power architecture Bit-accurate Internal format Internal format Clustering Allocation Constraints Scheduling Operator binding operator library Register binding Resizing Bit-accurate Architecture

The proposed approach • A high-level synthesis flow that inputs bit-accurate specification and generates a bit-accurate architecture. • Our approach optimizes both area and power consumption of hardware architectures • By taking into account the bitwidth during all the HLS steps • By reducing the number of multiplexers

The proposed design flow C/C++ Specification Compilation Compilation Bit accurate DFG DFG Bit-width and range propagation Library characterization Library characterization Constraints • Clock • Interval Iteration Bit-accurate operator library operator library Allocation Scheduling Operator binding Register binding VHDL RTL Architecture Bitwidth-aware

The proposed design flow C/C++ Specification Compilation Compilation Bit accurate DFG DFG Bit-width and range propagation Library characterization Constraints • Clock • Interval Iteration Clustering Bit-accurate operator library operator library Allocation Scheduling Scheduling Operator binding Operator binding Register binding Register binding Resizing VHDL RTL Architecture Bitwidth-aware

Clustering Allocation Scheduling Operator binding op1 op0 op3 op4 op7 op6 op9 Register binding 5 2 3 2 2 5 3 6 18 16 3 5 5 16 * * * + * + * Resizing 2 5 2 3 5 3 6 18 ADD_1 op9 op3 * op0 op6 * + * MUL_2 2 5 5 16 16 3 MUL_3 16 op4 op1 op7 * * + ADD_2 16 16 16 Bitwidth-aware clustering • A cluster is a group of operations implemented by the same type of operator • The operations are grouped according to their propagation time Propagation time (N° cycles)

Operations are scheduled using a modified list scheduling Priority is the operation mobility combined with the Bitwidth Clustering Allocation Scheduling Operator binding Register binding Resizing Bitwidth-aware scheduling

Operations are scheduled using a modified list scheduling Priority is the operation mobility only Clustering Allocation Scheduling Mobility Operator binding * op3 op6 op9 op1 op4 Register binding + op0 op7 Resizing Bitwidth-aware scheduling Total number of bits for the multipliers 52 2 5 2 5 3 6 18 3 16 5 3 3 5 2 6 18 5 3 2 16 + * * * op9 op0 op3 op6 * + * * op0 op3 op6 Step 1 16 2 5 16 5 16 3 Op7 op1 op9 op4 Step 2 op1 op4 op7 * * + 16 16 16

Operations are scheduled using a modified list scheduling Priority is the operation mobility combined with bitwidth Clustering Allocation Scheduling Mobility Operator binding * op3 op6 op9 op1 op4 Register binding + op0 op7 Resizing Bitwidth-aware scheduling Gain of 14 bits Total number of bits for the multipliers 38 2 5 2 5 3 6 18 3 16 3 3 5 2 5 6 18 3 2 + * * * op9 op0 op3 op6 * + * * op0 op3 op6 Step 1 16 2 5 16 5 16 3 Op7 op4 op1 op9 Step 2 op1 op4 op7 * * + 16 16 16

Operations are bound on available resources by using a Maximum Weighted Bipartite Matching Weights combines bitwidth and number of multiplexers Clustering Allocation Scheduling Operator binding MUL1 + op2 op9 Register binding * Step 0 Resizing Step 1 op0 op3 op6 + * * MUL2 op1 * Step 2 op9 op1 op4 op7 * * * + op4 * Step 3 op8 + Bitwidth-aware operator binding ADD1 MUL3

Operations are bound on available resources by using a Maximum Weighted Bipartite Matching Weights combines bitwidth and number of multiplexers Clustering Allocation Scheduling Operator binding MUL1 + op2 op9 Register binding * Step 0 Resizing Step 1 op0 op3 op6 + * * MUL2 op1 * 2 Step 2 op9 op1 op4 op7 * * * + op4 * Step 3 op8 + Bitwidth-aware operator binding ADD1 * MUL3

Operations are bound on available resources by using a Maximum Weighted Bipartite Matching Weights combines bitwidth and number of multiplexers Clustering Allocation Scheduling Operator binding MUL1 + op2 op9 Register binding * Step 0 Resizing Step 1 op0 op3 op6 + * * MUL2 op1 * 2 Step 2 op9 op1 op4 op7 * * * + op4 * 1,5 Step 3 op8 + Bitwidth-aware operator binding ADD1 * MUL3 +

Operations are bound on available resources by using a Maximum Weighted Bipartite Matching Weights combines bitwidth and number of multiplexers Clustering Allocation Scheduling Operator binding MUL1 + op2 op9 Register binding * Step 0 Resizing Step 1 op0 op3 op6 + * * MUL2 op1 * 2 Step 2 op9 op1 op4 op7 * * * + op4 * 1,5 Step 3 op8 + Bitwidth-aware operator binding ADD1 0 0 0 2 0 0 0 MUL3 0

Operations are bound on available resources by using a Maximum Weighted Bipartite Matching Weights combines bitwidth and number of multiplexers Clustering Allocation Scheduling Operator binding + op2 op9 Register binding * Step 0 Resizing Step 1 op0 op3 op6 + * * Step 2 op9 op1 op4 op7 * * * + op4 * 1,5 Step 3 op8 + Bitwidth-aware operator binding ADD1 0 0 MUL2 MUL3 0

Operations are bound on available resources by using a Maximum Weighted Bipartite Matching Weights combines bitwidth and number of multiplexers Clustering Allocation Scheduling Operator binding + op2 op9 Register binding * Step 0 Resizing Step 1 op0 op3 op6 + * * Step 2 op9 op1 op4 op7 * * * + Step 3 op8 + Bitwidth-aware operator binding ADD1 0 MUL3

Variables are bound on allocated register by using a Maximum Weighted Bipartite Matching Weights combines bitwidth and number of multiplexers Clustering Allocation Scheduling Operator binding Register binding Resizing Bitwidth-aware register binding

This step computes the number of bits needed for each operator input The maximum number of bits for each operand Clustering Allocation Scheduling Operator binding Register binding 8 4 4 8 * * Resizing op1 op1 9 9 3 3 * * op2 op2 8 4 9 9 e1 e1 e2 e2 MUL MUL Operator size determination By using commutability MAX BEST Gain : 4 bits

Experiments (3) Resizing + MBWM ops+reg (1) Only Resizing (2) Resizing + LEA Our approach Full bit-aware Clustering Allocation Scheduling Operator binding Register binding Resizing

Experimental setup • Environment • High Level Synthesis : Our tool (GAUT) • Logic synthesis : ISE 10.1 • Target device : FPGA Xilinx Virtex5 xv5110T • Applications • Low-pass video line filter • 1st order, non-linear, Volterra • 16-tap Finite Impulse Response Filter • 16-tap Least Mean Square • 32-points Fast Fourier Transform • 32*32 Discrete Cosine Transform For each application several bitwidths (from 8 to 32) have been considered for input data

Experimental results : Power

Experimental results : Area

Experimental results : trade-off

Conclusion & perspectives • Our approach is based on bit-width aware clustering, scheduling, binding and sizing steps • Both bitwidth and multiplexers are optimized • Both area and power consumption are reduced (resp. 30% and 25% in average • Up to 64% power reduction for a same bit-width • Our approach generates optimized accelerators which will be used in a complete low-power framework

Bitwidth-Aware High-Level Synthesis for Designing Low-Power DSP Applications G. Lhairech-Lebreton, P. Coussy, D. Heller, E. Martin University Bretagne-Sud / Lab-STICC ghizlane.lhairech@univ-ubs.fr December 14, 2010

Bitwidth-Aware High-Level Synthesis for Designing Low-Power DSP Applications

Bitwidth-Aware High-Level Synthesis for Designing Low-Power DSP Applications

Presentation Transcript

Designing Applications Using DSP Modules

Binding, Allocation and Floorplanning in Low Power High-Level Synthesis

High Level Synthesis

Bitwidth-Aware Scheduling and Binding in High-Level Synthesis

IL2200 - High Level Synthesis

Program Synthesis for Low-Power Accelerators

High-Level Synthesis

L10 : Lower Power High Level Synthesis(1)

Compilers for DSP Processors and Low-Power

Validating High-Level Synthesis

Power Aware Synthesis

L9 : Low Power DSP

L12 : Lower Power High Level Synthesis(3)

Lower Power High Level Synthesis

L11: Lower Power High Level Synthesis(2)

High-Level Synthesis-II

L13 :Lower Power High Level Synthesis(3)

Data Aware Low Power High Throughput Modified Radix Fft Processor for Wpan Applications

Bitwidth-Aware Scheduling and Binding in High-Level Synthesis

High-level synthesis

High-Level Synthesis

High-level Synthesis Transformations