ECE 565 High-Level Synthesis: An Introduction
Shantanu Dutt, ECE Dept., UIC
HLS Flow
• Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects)
• Classically, these 3 stages were performed sequentially; they are now performed together, which leads to better optimization
HLS Flow (contd)
• Taken into consideration during register allocation (post scheduling).
• Taken into consideration during scheduling.
• (Binding) Allocation: simple counting of FUs after the above 2 stages
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ an X delay of 2 cc's and a + delay of 1 cc
(a) Scheduling
i) Non-overlapped pipelined scheduling: schedule an operation when its i/p data and an FU are available (may need to break ties between competing operations)
[Figure: schedule over cc's 1-6, with X row c1(1), c1(2) and + row c2(1), c3(1), c2(2), c3(2)]
(b) Arch. Synthesis: binding & FU, reg, mux/demux allocation and interconnection
[Figure: datapath with registers a, b, c, d, x, y, z and load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; muxes mux1 and mux2 (inputs I0, I1); one X and one +; demuxes (outputs O0, O1)]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset; states cc 3i, cc 3i+1, cc 3i+2). Control signals and the register transfers they perform:
• mux1=0, mux2=0, demux=0, ldy=1 [y ← c+d] (c2)
• ldx=1 [x ← a × b] (c1)
• lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted.
Note: Unspecified control signals (cs) take their inactive value or, if no such concept exists for the cs, the don't-care value.
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont'd)
(a) Scheduling
ii) Overlapped pipelined scheduling
[Figure: schedule over cc's 1-6, with X row c1(1), c1(2) and + row c2(1), c3(1), c2(2), c3(2)]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z and their load signals, muxes mux1 and mux2, one X, one +, and demuxes]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset; state labels in the figure: cc 3i, cc 3i+1). Control signals and the register transfers they perform:
• lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1 [y ← c+d, x ← a × b] (c1, c2)
• ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 [z ← x+y] (c3)
Comparison of the two schedules:
• For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.
• Overlapped sched.: time for n iterations = 2n+1; throughput = n/(2n+1) ≈ 0.5 outputs/cc
• Non-overlapped sched.: time for n iterations = 3n; throughput = n/3n ≈ 0.33 outputs/cc
• The overlapped schedule therefore gives ≈ 50% higher throughput (0.5 vs. 0.33 outputs/cc), i.e., ≈ 33% fewer cc's per output for large n.
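The cycle counts quoted for the two schedules can be checked with a short calculation. The sketch below only encodes the two formulas from the slide (3n and 2n+1); the function names are illustrative and not part of the lecture material.

```python
from fractions import Fraction

def nonoverlapped_cycles(n: int) -> int:
    """cc's for n iterations when each iteration occupies 3 cc's."""
    return 3 * n

def overlapped_cycles(n: int) -> int:
    """cc's for n iterations under the overlapped schedule (2n+1)."""
    return 2 * n + 1

def throughput(n: int, cycles: int) -> Fraction:
    """Outputs per cc, kept exact as a fraction."""
    return Fraction(n, cycles)

def speedup(n: int) -> Fraction:
    """Overlapped throughput divided by non-overlapped throughput."""
    return throughput(n, overlapped_cycles(n)) / throughput(n, nonoverlapped_cycles(n))

if __name__ == "__main__":
    for n in (1, 4, 1000):
        print(n, nonoverlapped_cycles(n), overlapped_cycles(n), float(speedup(n)))
```

The ratio is 3n/(2n+1), which equals 1 for a single iteration and approaches 1.5 as n grows, so the benefit of overlapping shows up only when many iterations are run back to back.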
Simple HLS Examples (contd)
• Some DFG control operation nodes:
  Selector: data inputs in1 and in2 (one for T, one for F), a Condition (T/F) input, and a single output out
  Distributor: a single data input in, a Condition (T/F) input, and outputs out1 and out2 (one for T, one for F)
• Conditional code: If (a > b) then c ← a-b; Else c ← b-a;
• Possible DFGs corresponding to the above conditional code: [Figure: two alternative DFGs]
• Note that the 2 subs in the left DFG do not mean that an HLS algorithm will use 2 subtractors/adders. A good one will use 1, shared in a mutually exclusive way between the two subs, which lie in different branches of the if-then-else.
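The sharing of one subtractor between the two mutually exclusive branches can be sketched in software. This is a behavioral illustration only, assuming the comparison result drives two selector (mux) nodes that choose the operand order; the function name is hypothetical.

```python
def abs_diff_shared_sub(a: int, b: int) -> int:
    """Compute c for: if (a > b) c = a-b else c = b-a, using one subtraction."""
    cond = a > b                      # Condition (T/F) from the comparator
    # Two selector nodes pick which operand goes to each subtractor input.
    minuend = a if cond else b
    subtrahend = b if cond else a
    # The single shared subtractor; only one branch is ever active.
    return minuend - subtrahend

if __name__ == "__main__":
    print(abs_diff_shared_sub(7, 3), abs_diff_shared_sub(3, 7))
```

The cost of sharing is the pair of selectors in front of the subtractor, which is usually cheaper than a second subtractor.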
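Simple HLS Examples (contd)
• Iterative code: while (a > b) a ← a-b;
[Figure: DFG with inputs a and b, a > node, a - node, a selector (sel) with a condition initialized to F, and a distributor (dist)]
(a) Scheduling & binding (using only 1 adder/sub): [Figure: c1 and c2 scheduled on the single + over successive cc's]
(b) Arch. Synthesis: [Figure: datapath with registers a (lda), b (ldb), and r1 (ldr1); a mux selecting b'; one adder with cin = 1, so that b'+1 is the 2's-complement representation of -b; the signal s xor ovfl (= 1: -ve, = 0: +ve) sent to the FSM; ldr1 and (s xor ovfl); a demux with ldfina producing final a]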
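The loop can be simulated with a single fixed-width adder that performs both the comparison (c1) and the subtraction (c2), using b' with cin = 1 to subtract and s xor ovfl to detect a negative result. This is a sketch under assumptions not in the slide: an 8-bit width, b > 0 (otherwise the loop need not terminate), and a comparison done as b - a rather than whatever arrangement the slide's datapath uses.

```python
WIDTH = 8                      # assumed word width (not given in the slide)
MASK = (1 << WIDTH) - 1

def adder(x: int, y: int, cin: int):
    """WIDTH-bit adder returning (sum, sign bit s, signed overflow ovfl)."""
    x &= MASK
    y &= MASK
    result = (x + y + cin) & MASK
    s = result >> (WIDTH - 1)
    sx, sy = x >> (WIDTH - 1), y >> (WIDTH - 1)
    # Signed overflow: both operands share a sign that the result lacks.
    ovfl = int(sx == sy and s != sx)
    return result, s, ovfl

def subtract(x: int, y: int):
    """x - y on the same adder: add the bitwise complement y' with cin = 1."""
    return adder(x, ~y & MASK, 1)

def to_signed(v: int) -> int:
    """Interpret a WIDTH-bit pattern as a 2's-complement integer."""
    return v - (1 << WIDTH) if v >> (WIDTH - 1) else v

def iterate(a: int, b: int) -> int:
    """while (a > b) a = a - b, with every operation bound to one adder."""
    a &= MASK
    b &= MASK
    while True:
        _, s, ovfl = subtract(b, a)    # c1: b - a is negative iff a > b
        if (s ^ ovfl) == 0:            # non-negative, so a <= b: exit loop
            break
        a, _, _ = subtract(a, b)       # c2: a = a - b
    return to_signed(a)

if __name__ == "__main__":
    print(iterate(17, 5), iterate(10, 5), iterate(3, 5))
```

Using s xor ovfl instead of the sign bit alone keeps the comparison correct even when the subtraction overflows the word width.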
Delay Nodes in DFGs A delay node is generally implemented as a register (or a series of registers if clock period < T0); a delay node thus becomes a state variable.
Delay Nodes in DFGs (contd)
• Transformation in the DFG: the delay node is mapped to a register in the architecture.
• The register decouples input and output: the register i/p is the o/p of the combinational part, and the register o/p is the i/p of the combinational part. The two can be treated as independent of each other because they become available in different time steps (e.g., clock cycles).
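A small simulation can make the decoupling concrete. The sketch assumes a hypothetical DFG, y(n) = x(n) + y(n-1), which is not taken from the slides; its single delay node becomes the register `reg`, and each loop pass stands for one clock cycle.

```python
def run_accumulator(inputs, initial_state=0):
    """Simulate y(n) = x(n) + y(n-1) with the delay node held in a register."""
    reg = initial_state          # register o/p: feeds the combinational part
    outputs = []
    for x in inputs:             # one pass per clock cycle
        nxt = x + reg            # combinational part computes the register i/p
        outputs.append(nxt)
        reg = nxt                # clock edge: register i/p becomes its o/p
    return outputs

if __name__ == "__main__":
    print(run_accumulator([1, 2, 3, 4]))
```

Within a cycle the adder reads only the old register value, which is why the register input and output can be scheduled as if they were unrelated signals.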
Detailed HLS Example (contd)
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to the o/p is the highest. Break ties in favor of those opers u whose "sibling" o/ps (o/ps feeding the same children) are available, or will become available, closest to u's earliest finish (i.e., the asap time of the child is earliest); otherwise the FU(s) will be idle unnecessarily, leading to a larger latency. This tie-break also reduces the lifetimes of sibling o/ps.
[Figure: DFG showing the different i/p → o/p paths]
(a) Scheduling w/ one X (2 cc's) & one + (1 cc); goal: minimize latency
(b) Reg. alloc. for o/ps of operations. WAR constraint: the o/p can't be stored in d1, as would be natural, because d1's current data is yet to be consumed by c6, which has not been scheduled yet.
(c) Arch. synthesis: [Figure: the synthesized architecture]
Note: The above register allocation was done w/ separate regs for multiplier and adder o/ps. It is sub-optimal (4 non-primary i/p regs. needed).
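The core of this heuristic, choosing the ready operation with the longest delay to the output, can be sketched as a small list scheduler. The sketch makes several assumptions: FUs are non-pipelined, ties are broken by dictionary order rather than by the sibling rule described above, cycles are numbered from 0, and the DFG used is the earlier three-operation example (x ← a × b, y ← c+d, z ← x+y), not the detailed example's DFG, which is not recoverable here. All names are illustrative.

```python
DELAY = {"mul": 2, "add": 1}     # cc's per operation type, as in the slides

def path_to_output(ops):
    """Longest delay from the start of each op to a DFG output."""
    children = {name: [] for name in ops}
    for name, (_, parents) in ops.items():
        for p in parents:
            children[p].append(name)
    memo = {}
    def longest(name):
        if name not in memo:
            tail = max((longest(c) for c in children[name]), default=0)
            memo[name] = DELAY[ops[name][0]] + tail
        return memo[name]
    return {name: longest(name) for name in ops}

def list_schedule(ops):
    """Schedule ops on one multiplier and one adder; return (start times, latency)."""
    priority = path_to_output(ops)
    start, finish = {}, {}
    fu_free = {"mul": 0, "add": 0}           # cc at which each FU is next free
    cc = 0
    while len(start) < len(ops):
        for kind in fu_free:
            if fu_free[kind] > cc:
                continue                      # FU still busy
            ready = [n for n, (k, parents) in ops.items()
                     if k == kind and n not in start
                     and all(p in finish and finish[p] <= cc for p in parents)]
            if ready:
                chosen = max(ready, key=lambda n: priority[n])
                start[chosen] = cc
                finish[chosen] = cc + DELAY[kind]
                fu_free[kind] = finish[chosen]
        cc += 1
    return start, max(finish.values())

# op name -> (FU type, list of parent ops)
EXAMPLE = {"c1": ("mul", []), "c2": ("add", []), "c3": ("add", ["c1", "c2"])}

if __name__ == "__main__":
    print(list_schedule(EXAMPLE))
```

On this small DFG the scheduler starts c1 and c2 together and c3 once the multiply finishes, giving a latency of 3 cc's, which matches the 3 cc's per iteration of the non-overlapped schedule.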
Detailed HLS Example: Register Allocation (contd)
Scheduling heuristic: As stated earlier
[Figure: register allocation using d0; 3 non-primary i/p regs. needed]
• In the conflict graph, there is an edge between 2 variable nodes if their lifetimes overlap, indicating that different registers must be allocated to them. There is one conflict graph per FU (as here) if regs are grouped by FU, one per FU type if regs are shared within each FU type, or a single global one if regs are shared across all FUs.
• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general.
• The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
• Min. graph coloring can be solved optimally for interval graphs, in linear time once the lifetimes are sorted by start time, using the left-edge algorithm that we will see later for channel routing.
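Register allocation on an interval graph can be sketched with a first-fit variant of the left-edge idea: process lifetimes in order of their left edge (start time) and reuse the first register whose previous occupant has already ended. The lifetimes below are hypothetical, not those of the detailed example, and are treated as half-open intervals [start, end).

```python
def left_edge(lifetimes):
    """Assign registers to variables; lifetimes maps var -> (start, end)."""
    registers = []        # registers[r] = end time of the last variable in r
    assignment = {}
    # Visit variables by increasing start time (their "left edge").
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, last_end in enumerate(registers):
            if last_end <= start:          # register r is free again
                registers[r] = end
                assignment[var] = r
                break
        else:                              # every register is still live
            registers.append(end)
            assignment[var] = len(registers) - 1
    return assignment

# Hypothetical lifetimes in cc's, for illustration only.
EXAMPLE_LIFETIMES = {"A": (0, 2), "B": (1, 3), "C": (2, 4), "D": (3, 5)}

if __name__ == "__main__":
    print(left_edge(EXAMPLE_LIFETIMES))
```

A new register is opened only when every existing one holds a variable that is still live at that start time, so the number of registers equals the largest number of simultaneously live variables, which is the minimum possible.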
Detailed HLS Example: Register Allocation (contd)
[Figure: register allocation using d0; 3 non-primary i/p regs. needed]
Scheduling heuristic: Among the available opers, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties arbitrarily. B's lifetime increases, but the lifetime of D (a dependent of B) decreases similarly, so the heuristic should be based on more global information.