Architecture-Level Synthesis for Automatic Interconnect Pipelining

Jason Cong, Yiping Fan, Zhiru Zhang
VLSI CAD Lab, Computer Science Department, University of California, Los Angeles
{cong, fanyp, zhiruz}@cs.ucla.edu
Funded by GSRC, NSF, and Altera Corp.

Outline
• Motivation
• Our contributions
• RDR-Pipe micro-architecture
  • Regular Distributed Register micro-architecture with interconnect pipelining
• Synthesis flow and algorithms
  • MCAS-Pipe: automatic interconnect pipelining and sharing
• Experimental results
• Conclusions

Interconnect Bottleneck in Nanometer Designs
• Challenge: single-cycle full-chip communication will no longer be possible
  • Not supported by the current CAD toolset
• ITRS’01 0.07 um technology
  • 5.63 GHz across-chip clock
  • 800 mm2 die (28.3 mm x 28.3 mm)
• IPEM BIWS estimations
  • Buffer size: 100x
  • Driver/receiver size: 100x
  • Semi-global layer (Tier 3)
  • A signal can travel up to 11.4 mm in one cycle
  • 5 clock cycles are needed from corner to corner
[Figure: 28.3 mm x 28.3 mm die divided into regions reachable in 1, 2, 3, 4, and 5 cycles; axis marks at 0, 11.4, 22.8, and 28.3 mm]
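The 5-cycle figure can be checked with a few lines of arithmetic. The sketch below assumes Manhattan (horizontal plus vertical) routing, which the slide does not state but which reproduces its number; `cycles_needed` and its parameter names are illustrative only.

```python
import math

def cycles_needed(dx_mm, dy_mm, reach_mm_per_cycle=11.4):
    """Clock cycles to cover a Manhattan distance, given the per-cycle reach.

    Assumption: wires are routed horizontally and vertically, so the
    distance is |dx| + |dy|. The 11.4 mm default is the slide's estimate.
    """
    distance_mm = abs(dx_mm) + abs(dy_mm)
    # Even a very short transfer occupies at least one cycle.
    return max(1, math.ceil(distance_mm / reach_mm_per_cycle))

# Corner to corner on the 28.3 mm x 28.3 mm die: 56.6 mm of wire.
print(cycles_needed(28.3, 28.3))
```

Under a straight-line (diagonal) assumption the distance is about 40 mm and the same calculation gives 4 cycles, so the routing model matters for the last cycle.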

Related Work
• Retiming with placement or floorplanning
  • Retiming + multilevel partitioning [Cong et al., ICCAD’00] and coarse placement [Cong et al., DAC’03]
  • Retiming + floorplanning [Chong & Brayton, IWLS’01]
  • Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
• Global wire pipelining in the Itanium™ processor
  • [McInerney et al., ISPD’00]
• Buffer and flip-flop insertion in RTL
  • [Lu et al., DATE’02]
  • [Cocchini, ICCAD’02]

Limitation during Logic/Physical Level to Explore Multicycle Communication
• The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94]
• Example loop: 4 logic cells, 2 registers
  • Cell delay = 1 ns
  • Interconnect delay = 1 ns
  • DR ratio = (Dlogic + Dint) / #Registers = (4 + 4) / 2 = 4 ns
  • Clock period ≥ 4 ns
• Interconnect pipelining by flip-flop insertion?
  • Requires a considerable amount of manual rework on the original RTL descriptions
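The DR-ratio bound can be written out as a small calculation. This is a minimal sketch: the function names and the second loop are made up for illustration, and only the first loop (4 ns logic, 4 ns interconnect, 2 registers) comes from the slide.

```python
def dr_ratio(logic_delay_ns, interconnect_delay_ns, num_registers):
    """Delay-to-register ratio of one loop, in ns per register."""
    return (logic_delay_ns + interconnect_delay_ns) / num_registers

def clock_period_lower_bound(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops.

    loops: iterable of (logic_delay_ns, interconnect_delay_ns, num_registers).
    """
    return max(dr_ratio(*loop) for loop in loops)

loops = [
    (4.0, 4.0, 2),  # the slide's loop: 4 cells x 1 ns, 4 ns of wire, 2 registers
    (3.0, 1.0, 2),  # hypothetical second loop, for illustration only
]
print(clock_period_lower_bound(loops))  # 4.0
```

Retiming can move the two registers around the loop but cannot change their count, which is why the bound holds no matter how the logic is optimized.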

Our Approach
• Consideration of multicycle communication during architectural (or behavioral) synthesis
  • [Cong et al., ISPD’03] [Cong et al., ICCAD’03]
• Regular Distributed Register (RDR) micro-architecture
  • Highly regular
  • Direct support of multicycle on-chip communication
• MCAS: Architectural Synthesis for Multi-cycle Communication
  • Efficiently maps behavioral descriptions to the RDR micro-architecture
  • Integrates architectural synthesis (e.g., resource binding, scheduling) with physical planning
• This work
  • Extension of RDR and MCAS for interconnect pipelining

Regular Distributed Register Micro-Architecture
• Distribute registers to each “island”
• Choose the island size such that local computation and communication in each island can be done in a single cycle
• Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication
[Figure: grid of islands; labels: Island, Local Computational Cluster (LCC), FSM, Reg. file (1 cycle, 2 cycles, …, k cycles), ALU, MUL, MUX, Global Interconnect, island dimensions Wi and Hi]

Wiring Overhead in RDR Designs
• Data transfers r1→r3 and r2→r4 are overlapped
• Two dedicated global wires are needed
[Figure: schedule over cycles 1 to 5 with ALU1 (+) and MUL1 (*); sender registers r1 and r2, receiver registers r3 and r4, interconnects with a delay of 2 cycles]

Architectural Solution: RDR-Pipe
• Keep the intra-island structures
• Inter-island pipeline register station (PRS) for global communications
• PRS performs autonomous store-and-forward
  • Synchronous design
  • No global control signal needed for PRS
[Figure: RDR-Pipe layout; labels: islands 1 to 6 with LCC, FSM, and Reg. File; H channel; V channel; Pipeline Register Station (PRS)]

Reducing Wiring Overhead in RDR-Pipe
• Data transfers are pipelined
• One wire with a pipeline register is enough
[Figure: schedule over cycles 1 to 5 with ALU1 (+) and MUL1 (*); sender registers, a pipeline register, and receiver registers; 2-cycle communication]
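The difference between the two wiring schemes comes down to how long each transfer occupies a wire. The sketch below assumes that a non-pipelined wire is busy for the full interconnect delay while a pipelined wire is busy only in the issue cycle; the issue cycles (1 and 2) and the name `wires_needed` are illustrative assumptions, since the exact cycles are not recoverable from the figure.

```python
def wires_needed(issue_cycles, occupancy_cycles):
    """Peak number of transfers occupying a wire at the same time.

    issue_cycles: cycle in which each transfer starts.
    occupancy_cycles: how long one transfer blocks its wire
        (the full delay without pipelining, 1 cycle with pipelining).
    """
    events = []
    for start in issue_cycles:
        events.append((start, +1))                     # wire becomes busy
        events.append((start + occupancy_cycles, -1))  # wire is released
    # Tuples sort releases (-1) before acquisitions (+1) in the same cycle.
    events.sort()
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak

issues = [1, 2]  # assumed issue cycles of the two overlapped transfers
print(wires_needed(issues, occupancy_cycles=2))  # RDR, 2-cycle wires: 2
print(wires_needed(issues, occupancy_cycles=1))  # RDR-Pipe, pipelined: 1
```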

Synthesis Flow: MCAS-Pipe System
• Global interconnect sharing
  • After scheduling and functional unit binding
  • Before register and port binding
  • Enables multiple data communications to share a physical link (a wire with pipeline registers)
• Advantages over MCAS
  • Expected to reduce global wiring demand
  • No multicycle path constraint needed
[Figure: MCAS-Pipe flow: C / VHDL → CDFG generation → CDFG → Resource allocation & functional unit binding → ICG → Scheduling-driven placement → Locations → Placement-driven rescheduling & rebinding → Global interconnect sharing → Register and port binding → Datapath & FSM generation → RTL VHDL & floorplan constraints]

Global Interconnect Sharing
• Conflicted data transfers
  • Two physical links are needed to support the concurrent data transfers
• Compatible data transfers
  • Now, two producer registers can be merged, since their lifetimes become compatible
  • Only one physical link is required to support the scheduled data transfers
[Figure: islands A and B with D = 2; producers pe, pg and consumers ce, cg shown over cycles 1 to 7 with sender, pipeline, and receiver registers, for the conflicted case and the compatible case]

Global Pipelined Interconnect Minimization
• Definitions
  • Data links: pipelined global interconnects
  • Channel: set of data links between two islands
  • Width of a channel: number of its data links
  • Data transfer: movement of data from a producer to a consumer
• Architectural assumption
  • Channels cannot share interconnects
• Theorem
  • Global pipelined interconnects are minimized if and only if the width of every channel is minimized

Transfer Scheduling for a Single Channel
• A decision problem formulation
  • Given:
    • A channel (A, B) containing m data links
    • A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and unit effective occupancy time
  • Fact: for every time slot, at most one transfer can be issued on a data link
  • Objective: to find a feasible transfer schedule on these data links
• Transfer scheduling is polynomial-time solvable
  • A special real-time scheduling problem [J. Blazewicz, 1979]
  • Binary search for the minimum feasible channel width m
  • For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
  • Overall time complexity: O(n log² n)
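The binary search over channel widths with an EDF feasibility check might look like the sketch below. It is not the authors' implementation: the function names are made up, time slots are assumed to be non-negative integers, and the deadline is treated as the last slot in which a transfer may still be issued.

```python
import heapq

def edf_feasible(transfers, width):
    """Can all unit-time transfers be issued on `width` data links?

    transfers: list of (arrival, deadline) integer time slots; a transfer
    may be issued in any slot t with arrival <= t <= deadline (assumption).
    """
    if width <= 0:
        return not transfers
    pending = sorted(transfers)  # ordered by arrival time
    ready = []                   # min-heap of deadlines of arrived transfers
    i, n, t = 0, len(pending), 0
    while i < n or ready:
        if not ready:
            t = max(t, pending[i][0])  # idle: jump to the next arrival
        while i < n and pending[i][0] <= t:
            heapq.heappush(ready, pending[i][1])
            i += 1
        # Issue up to `width` transfers in slot t, earliest deadline first.
        for _ in range(min(width, len(ready))):
            if heapq.heappop(ready) < t:
                return False  # this transfer has already missed its deadline
        t += 1
    return True

def min_channel_width(transfers):
    """Smallest feasible width, or None if some transfer can never be issued."""
    if not transfers:
        return 0
    if any(deadline < arrival for arrival, deadline in transfers):
        return None
    lo, hi = 1, len(transfers)  # one link per transfer always suffices
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(transfers, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(min_channel_width([(1, 1), (1, 1), (2, 3)]))  # 2
```

Each feasibility check is dominated by the sort and heap operations, O(n log n), and the binary search runs it O(log n) times, which matches the O(n log² n) bound on the slide. The search is valid because adding a link can never make a feasible schedule infeasible.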

EDF-Based Transfer Scheduling Example
• Ordered by left edge
  • Failed for 2 data links!
• Ordered by Earliest-Deadline-First
  • Successfully scheduled onto 2 data links
[Figure: six data transfers (1 to 6) placed on Data Link 1 and Data Link 2 over time slots 1 to 6 under each ordering]
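The effect of the ordering can be reproduced on a smaller case. The three transfers and the single link below are hypothetical (the slide's six-transfer example is not recoverable from the figure), and `greedy_schedule` is an illustrative helper that issues ready transfers slot by slot in a given priority order.

```python
def greedy_schedule(transfers, width, priority):
    """Issue ready transfers slot by slot, best priority first.

    transfers: dict name -> (arrival, deadline), deadline inclusive.
    priority:  function mapping (arrival, deadline) to a sort key.
    Returns dict name -> issue slot, or None if a deadline is missed.
    """
    if not transfers:
        return {}
    remaining = dict(transfers)
    slots = {}
    t = min(arrival for arrival, _ in remaining.values())
    while remaining:
        ready = [name for name, (arrival, _) in remaining.items() if arrival <= t]
        ready.sort(key=lambda name: priority(remaining[name]))
        for name in ready[:width]:
            slots[name] = t
            del remaining[name]
        # Anything left with deadline <= t can no longer be issued in time.
        if any(deadline <= t for _, deadline in remaining.values()):
            return None
        t += 1
    return slots

transfers = {"a": (1, 3), "b": (1, 3), "c": (2, 2)}  # hypothetical data
by_left_edge = lambda ad: ad[0]  # earliest arrival first
by_deadline = lambda ad: ad[1]   # Earliest-Deadline-First

print(greedy_schedule(transfers, 1, by_left_edge))  # None: "c" misses slot 2
print(greedy_schedule(transfers, 1, by_deadline))   # {'a': 1, 'c': 2, 'b': 3}
```

Left-edge ordering spends slot 2 on "b", which could have waited, while EDF gives that slot to the urgent transfer "c".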

Experiment Settings
• Common front end: C / VHDL → CDFG generation → Functional unit allocation & binding (with micro-architecture specification and target clock period)
• Conventional flow: conventional scheduling
• MCAS flow: scheduling-driven placement → placement-driven rebinding & rescheduling
• MCAS-Pipe flow: MCAS flow followed by global interconnect sharing
• Common back end: register and port binding → datapath & control generation
• Outputs: RTL VHDL files (for all flows); floorplan constraints (for MCAS and MCAS-Pipe); multicycle path constraints (for MCAS only)
• Implementation: Altera QuartusII + Stratix

Experimental Results: Register and LE Usage
• Design environment: Altera QuartusII, Stratix EP1S40
• MCAS vs. conventional flow:
  • Uses more registers and logic elements (LEs)
• MCAS-Pipe vs. MCAS:
  • Slightly more registers, and comparable logic element cost

Experimental Results: Performance
• Design environment: Altera QuartusII, Stratix EP1S40
• MCAS vs. conventional flow:
  • 36% reduction in clock period and 30% reduction in total latency
• MCAS-Pipe vs. MCAS:
  • Comparable design performance (4% better)
[Figure: charts of total latency and clock period]

Interconnect Structure of Altera’s Stratix
• Local: LL, LO
• Global: H4, H8, H24
• Global: V4, V8, V16

Experimental Results: Wirelength
• Wire types
  • LL, LO: local wires
  • H4, V4, H8, V8: short global wires
  • V16, H24: long global wires
• MCAS-Pipe vs. MCAS:
  • 28.8% reduction in long global wires, 19.3% reduction in total wirelength

Conclusions
• High-level automatic on-chip interconnect pipelining
• RDR-Pipe: extension of the RDR micro-architecture
  • Micro-architecture supporting interconnect pipelining
• MCAS-Pipe: enhancement of the MCAS synthesis system
  • Adds a novel global interconnect sharing algorithm to effectively reduce global wiring
• Experimental results
  • Matches or exceeds the RDR-based approach in performance
  • Reduces wiring demand: 28.8% fewer long global wires and 19.3% less total wirelength than MCAS

Thank you
