xPilot: A Platform-Based Behavioral Synthesis System

xPilot A Platform-Based Behavioral Synthesis System Prof. Jason Cong Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang August, 2005 Supported by NSF, GSRC, Altera, Xilinx.

Outline
• Motivation
• xPilot system framework
• Overview of the synthesis engine
  • Scheduling
  • Resource binding
• Experimental results

Motivation (1)
• Design complexity is outgrowing the traditional RTL method
  • It is feasible to build SoC devices with 500M transistors; billion-transistor chips are on the horizon
• Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction
• Reasons for previous failures
  • Lack of a compelling reason: design complexity was still manageable a decade ago
  • Lack of a solid RTL foundation
  • Lack of consideration of physical reality

Motivation (2)
• Behavioral synthesis provides combined advantages
  • Better complexity management
    • Code size: RTL design ~300KL → behavioral design 40KL [NEC, ASPDAC04]
  • Shorter verification/simulation cycle
    • Simulation speed 100X faster than RTL-based method
  • Rapid system exploration
    • Quick evaluation of different hardware/software boundaries
    • Fast exploration of multiple micro-architecture alternatives
  • Higher quality of results
    • Full consideration of physical reality

xPilot: Platform-Based Behavioral to RTL Synthesis Flow
• Flow: behavioral spec. in C/SystemC + platform description → front-end compiler → SSDM → RTL → FPGAs/ASICs
• Presynthesis optimizations
  • Loop unrolling/shifting
  • Strength reduction / tree height reduction
  • Bitwidth analysis
  • Memory analysis
  • …
• Core synthesis optimizations
  • Scheduling
  • Resource binding, e.g., functional unit binding, register/port binding
• Arch-generation & RTL/constraints generation
  • Verilog/VHDL/SystemC
  • FPGAs: Altera, Xilinx
  • ASICs: Magma, Synopsys, …

System-level Synthesis Data Model
• SSDM (System-level Synthesis Data Model)
  • Hierarchical netlist of concurrent processes and communication channels
  • Each leaf process contains a sequential program, which is represented by an extended LLVM IR with hardware-specific semantics
    • Port / IO interfaces, bit-vector manipulations, cycle-level notations
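The hierarchy described above can be pictured with a small data-structure sketch. This is only an illustration, not SSDM's actual schema: the class names `Process` and `Channel`, their fields, and the textual stand-in for the extended LLVM IR are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Channel:
    """A communication channel between processes (hypothetical fields)."""
    name: str
    bitwidth: int


@dataclass
class Process:
    """A node in the hierarchical netlist of concurrent processes."""
    name: str
    children: List["Process"] = field(default_factory=list)
    channels: List[Channel] = field(default_factory=list)
    # Leaf processes only: the sequential program, kept here as plain
    # text lines standing in for the extended IR.
    program: Optional[List[str]] = None

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> Iterator["Process"]:
        """Yield every leaf process below (or equal to) this node."""
        if self.is_leaf():
            yield self
        for child in self.children:
            yield from child.leaves()


# A two-process design: a producer feeding a GCD block over one channel.
top = Process(
    name="top",
    children=[
        Process("producer", program=["write data"]),
        Process("gcd", program=["x = read data", "..."]),
    ],
    channels=[Channel("data", 32)],
)
leaf_names = [p.name for p in top.leaves()]
```

Walking `top.leaves()` visits the sequential programs that a scheduler would process one by one.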

Platform Modeling & Characterization
• Target platform specification
  • High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
    • Functional units: adders, ALUs, multipliers, comparators, etc.
    • Connectors: mux, demux, etc.
    • Memories: registers, synchronous memories, etc.
  • Chip layout description
    • On-chip resource distributions
    • On-chip interconnect delay/power estimation

Scheduling: Goals
• A highly versatile scheduling engine
• Applicable to a wide range of application domains
  • Computation-intensive, data/memory-intensive, control-intensive, etc.
  • Mixed behavioral & RTL
• Amenable to a rich set of scheduling constraints
  • Data dependency constraints
  • Resource constraints: I/O port constraints, memory port constraints, functional unit constraints, etc.
  • Timing constraints: frequency constraint, latency constraints, etc.
  • Relative I/O timing constraints: cycle-fixed mode, superstate-fixed mode, free-floating mode, etc.
• Retargetable to a variety of design objectives
  • High performance, small area, low power, etc.

Scheduling: Optimization Capabilities
• Offers a variety of optimization techniques in a unified framework
  • Combinational / sequential non-pipelined / pipelined multi-cycle operations
  • Unconditional/conditional operation chaining
  • Relative scheduling
  • Consideration of branching probabilities and repetitions
  • Multi-cycle communication (under development)
  • Code motion & speculation (under development)
  • Functional / loop pipelining (under development)
  • Physical layout integration (to be supported)

Scheduling: Current Status
• Design objective
  • Focus on high-performance designs
• Overall approach
  • Use a system of pairwise difference constraints to express all kinds of scheduling constraints
  • Represent the design objective as a linear function
  • The system is immediately solvable by any linear programming solver, with integral solutions
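In symbols, the approach above amounts to a linear program of the following shape. The notation is an assumption made here for illustration: $s(v)$ is the control step assigned to operation $v$, $c_v$ is a per-operation weight in the linear objective, and $d_{uv}$ is the integer bound of one pairwise constraint. For example, a data dependency of $v$ on $u$ would use $d_{uv} = 0$.

```latex
\begin{aligned}
\min_{s} \quad & \sum_{v \in V} c_v \, s(v) \\
\text{s.t.} \quad & s(u) - s(v) \le d_{uv} \quad \text{for each constrained pair } (u, v), \\
& s(v) \in \mathbb{Z} \quad \text{for all } v \in V .
\end{aligned}
```

Because every constraint involves exactly two variables with coefficients $+1$ and $-1$, the integrality requirement does not need an integer programming solver.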

Scheduling: Design Framework
• Inputs
  • CDFG
  • Target platform modeling (resource library & chip layout)
  • User-specified design constraints & assignments
• xPilot scheduler
  • Constraint equations generation: relative timing constraints, dependency constraints, frequency constraints, resource constraints, …
  • Objective function generation
  • System of pairwise difference constraints → linear programming solver
  • LP solution interpretation
• Output: STG (State Transition Graph)

Example: Greatest Common Divisor
• GCD C description
    x = inport1;
    y = inport2;
    while (x != y) {
      if (x > y) x = x - y;
      else y = y - x;
    }
    *outport = x;
• Control/data flow graph, by basic block (edges marked T are taken when the block's condition is true)
    BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;

Constraints Generation
• Data dependency constraint
  • If operation v is data dependent on operation u, i.e., $(u, v) \in E$, then $s(v) - s(u) \ge 0$, where the schedule variable $s(v)$ represents the relative schedule of node v
  • Example: u: x_1 = φ(x_0, x_1, x_2); v: cond2 = (x_1 > y_1);
• Other constraints can be represented in a similar way …
• The constraint equations form a system of pairwise difference constraints
  • The constraint matrix A is totally unimodular
  • The feasibility check can be formulated as a single-source shortest path problem
  • Optimization can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem
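The shortest-path feasibility check can be sketched with the textbook Bellman-Ford method for difference constraints. This is a generic illustration, not xPilot's code: the function name, the `(u, v, d)` tuple encoding of $s(u) - s(v) \le d$, and the three-operation example are assumptions. It only finds some feasible schedule or reports infeasibility; it does not optimize an objective.

```python
def solve_difference_constraints(nodes, constraints):
    """Find integers s with s[u] - s[v] <= d for every (u, v, d).

    Returns a dict of non-negative values, or None if the system is
    infeasible (the constraint graph has a negative cycle).
    """
    if not nodes:
        return {}
    # Starting every node at 0 is equivalent to adding a virtual source
    # with zero-weight edges to all nodes.
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, d in constraints:
            # Constraint s[u] - s[v] <= d is an edge v -> u of weight d.
            if dist[v] + d < dist[u]:
                dist[u] = dist[v] + d
                changed = True
        if not changed:
            break
    else:
        # Still relaxing after len(nodes) passes: negative cycle.
        return None
    lowest = min(dist.values())
    return {n: dist[n] - lowest for n in nodes}


# Hypothetical example: v depends on u (chaining allowed), w must start
# at least one step after v, and w must be within two steps of u.
nodes = ["u", "v", "w"]
constraints = [("u", "v", 0), ("v", "w", -1), ("w", "u", 2)]
schedule = solve_difference_constraints(nodes, constraints)

# Contradictory requirements: u before v and v before u.
infeasible = solve_difference_constraints(
    ["u", "v"], [("u", "v", -1), ("v", "u", -1)]
)
```

Shifting the distances so that the smallest is zero keeps all differences intact, so the result can be read directly as control step numbers.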

Solution by LP Solver
• Scheduling is performed across basic block boundaries
[Figure: the GCD control/data flow graph (BB1 to BB5) annotated with control steps 0 and 1]

Schedule Interpretation
• Operations in the schedule
    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    x_3 = φ(x_0, x_1, x_2); *outport = x_3;
• The same operations with guard conditions
    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    if (cond1) {
      x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
      if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); }
      else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); }
    }
    if (!cond1 || (!cond3 && !cond4)) { x_3 = φ(x_0, x_1, x_2); *outport = x_3; }

Deriving State Transition Graph
• Guarded schedule
    x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    if (cond1) {
      x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
      if (cond2) { x_2 = x_1 - y_1; cond3 = (x_2 != y_1); }
      else { y_2 = y_1 - x_1; cond4 = (x_1 != y_2); }
    }
    if (!cond1 || (!cond3 && !cond4)) { x_3 = φ(x_0, x_1, x_2); *outport = x_3; }
• Final STG for GCD
[Figure: final STG for GCD, with a transition labeled cond3 || cond4]
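One way to read the resulting state machine is as a small cycle-by-cycle simulation. The sketch below is an interpretation under stated assumptions, not the exact STG from the slide: it assumes two states (S0 runs the BB1 operations, S1 runs one loop iteration and repeats while cond3 || cond4 holds), that a condition not evaluated in a cycle reads as false, and that inputs are positive integers. The function name and the cycle counter are hypothetical.

```python
def gcd_stg(inport1, inport2):
    """Simulate an assumed two-state STG for GCD.

    Returns (outport value, number of state visits). Inputs must be
    positive integers, as in the subtraction-based GCD algorithm.
    """
    # State S0: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0)
    x, y = inport1, inport2
    cond1 = x != y
    cycles = 1
    state = "S1" if cond1 else "DONE"

    while state == "S1":
        cycles += 1
        # Assumption: conditions not computed this cycle read as false.
        cond3 = cond4 = False
        if x > y:              # cond2
            x = x - y          # x_2 = x_1 - y_1
            cond3 = x != y
        else:
            y = y - x          # y_2 = y_1 - x_1
            cond4 = x != y
        # Self-loop on cond3 || cond4, otherwise write the output.
        state = "S1" if (cond3 or cond4) else "DONE"

    return x, cycles


result, visits = gcd_stg(12, 18)
```

Each loop iteration of the C description takes one visit to S1 in this reading, which is what scheduling across basic block boundaries makes possible.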

Unified Resource Binding
• Provides a unified resource sharing framework to optimize for various design objectives
  • Simultaneous functional unit binding, register binding, and port binding
  • Equipped with advanced techniques to optimize the interconnect and steering logic networks
  • Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
  • Extendable to exploit physical layout information

An FU/Register Binding Example
[Figure: two binding examples, (a) and (b), each comparing Case 1 and Case 2 datapaths built from registers (R1 to R5), functional units (F1, F2), and multiplexers (MUX)]
• Observations:
  • Binding has a large impact on the resulting performance and cost
  • Functional unit binding and register binding are highly correlated
Note: Assume all operations and variables are compatible for sharing
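The correlation between the two bindings can be made concrete by counting multiplexers for a given pair of bindings. The sketch below is a simplified illustration and does not reproduce the figure: the operation and variable names, the port naming, and the rule "a port needs a MUX when it has more than one distinct source" are assumptions, and operand commutativity is ignored.

```python
from collections import defaultdict


def port_fanin(ops, fu_of, reg_of):
    """Count distinct sources feeding each FU input port and register.

    ops maps an operation to (src_a, src_b, dst) variable names;
    fu_of maps operations to functional units; reg_of maps variables
    to registers.
    """
    sources = defaultdict(set)
    for op, (a, b, dst) in ops.items():
        fu = fu_of[op]
        sources[(fu, "in0")].add(reg_of[a])
        sources[(fu, "in1")].add(reg_of[b])
        sources[(reg_of[dst], "d")].add(fu)
    return {port: len(srcs) for port, srcs in sources.items()}


def mux_count(fanin):
    """A port with more than one distinct source needs a multiplexer."""
    return sum(1 for n in fanin.values() if n > 1)


# Two hypothetical operations in different control steps.
ops = {"op1": ("a", "b", "e"), "op2": ("c", "d", "f")}
own_reg = {v: "R_" + v for v in "abcdef"}

# Separate FUs, separate registers: no steering logic.
no_sharing = mux_count(port_fanin(ops, {"op1": "F1", "op2": "F2"}, own_reg))

# Shared FU, separate registers: a MUX on each FU input.
shared_fu = mux_count(port_fanin(ops, {"op1": "F1", "op2": "F1"}, own_reg))

# Shared FU, and a/c also share a register: one input MUX disappears.
merged = dict(own_reg, a="R1", c="R1")
shared_both = mux_count(port_fanin(ops, {"op1": "F1", "op2": "F1"}, merged))
```

The same FU binding costs two multiplexers or one depending on the register binding, which is why the two decisions are better made together.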

Drawbacks of Previous Work
• Many existing algorithms focus on minimizing the "number" of functional units or registers
  • Technology advances: the interconnect effect is increasing
    • Interconnect accounts for 51% of the total dynamic power of a microprocessor in 0.13um technology
    • Up to 80% of the dynamic power in future technologies
  • May generate a larger number of multiplexers and interconnects
  • Unfavorable performance and cost results
• Optimization for unrealistic goals
  • Minimize the "number" of FUs, registers, or multiplexers
  • Should have detailed datapath models to guide the optimization
• No technology-specific consideration
  • Should have platform-specific characterizations

Resource Binding in xPilot
• Inputs
  • STG (State Transition Graph)
  • User-specified design constraints
  • Target platform (resource library & chip layout)
• xPilot architecture exploration
  • Baseline register binding
  • Iteration: FU allocation/binding, then register allocation/binding, guided by a datapath model for performance-cost estimation
  • Improved? Yes: iterate again. No: stop.
• Output: STG + best datapath models

Design Space Exploration
[Figure: a State Transition Graph (STG) with multiplications 1* to 5*, addition 6+, and comparison nodes; compatibility graphs; the datapath for solution (1, 2, 4) (3), using two MULs; and a datapath model curve (power vs. delay) for design space pruning]
• Exploration phases:
  • Exploring Node 2:
    • (1) (2): two mul
    • (1, 2): one mul
  • Exploring Node 3:
    • (1) (2) (3): three mul
    • (1, 2) (3): two mul
    • (1, 3) (2): two mul
  • Exploring Node 4:
    • (1) (2) (3) (4)
    • (1, 2, 4) (3)
    • (1, 2) (3, 4)
    • (1, 2) (3) (4)
    • (1, 3, 4) (2)
    • (1, 3) (2, 4)
    • (1, 3) (2) (4)
    • ….
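The exploration phases can be mimicked by a node-by-node enumeration of groupings. This is a sketch of the enumeration only, with no pruning by the datapath model curve, and it is not xPilot's algorithm: the function name is hypothetical, and the conflict set assumes that multiplications 2 and 3 cannot share a unit because they appear in the same state.

```python
def enumerate_bindings(ops, conflicts):
    """List every grouping of ops onto shared units.

    ops are added one at a time; each op either joins an existing
    group with which it has no conflict or opens a new group.
    conflicts is a set of frozenset pairs that must not share a unit.
    """
    partial = [[]]  # one empty binding to start from
    for op in ops:
        extended = []
        for groups in partial:
            for i, group in enumerate(groups):
                if all(frozenset((op, other)) not in conflicts
                       for other in group):
                    extended.append(
                        groups[:i] + [group + [op]] + groups[i + 1:]
                    )
            extended.append(groups + [[op]])
        partial = extended
    return partial


# Assumption: operations 2 and 3 are scheduled in the same state.
conflicts = {frozenset((2, 3))}

after_node_2 = enumerate_bindings([1, 2], conflicts)
after_node_3 = enumerate_bindings([1, 2, 3], conflicts)
after_node_4 = enumerate_bindings([1, 2, 3, 4], conflicts)
```

Under that assumption the enumeration yields two candidates after node 2 and three after node 3, matching the counts listed above; the number of candidates grows quickly from there, which is the motivation for pruning.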

Experimental Results: Benchmark Suite
• Benchmark suite
  • PR, MCM
    • DSP kernels: pure additions/subtractions and multiplications
  • CACHE
    • Cache controller: control-intensive design with cycle-accurate I/O operations
  • MOTION
    • Motion compensation algorithm for MPEG-1 decoder: control-intensive with a modest amount of computation
  • IDCT
    • JPEG inverse discrete cosine transform: computation-intensive
  • DWT
    • JPEG2000 discrete wavelet transform: computation-intensive with modest control flow
  • EDGELOOP
    • Extracted from an H.264 decoder: a very complex design featuring a mix of computation, control, and memory accesses

Experimental Results: Code Size Reduction

Experimental Results: Comparison with SPARK on Scheduling
• SPARK [UCI/UCSD, 2004], a state-of-the-art academic high-level synthesis tool

Experimental Results: Comparison with SPARK on Binding
• On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency than SPARK

Synthesis Results for DWT (JPEG2000)
• Settings
  • Target platform: Altera Stratix
  • RTL synthesis & place-and-route: Altera Quartus II v5.0
  • Simulation: Mentor ModelSim SE 6.0
• Design alternatives

Experimental Results: ASIC Flow
• Magma RTL to GDSII flow
• Technology library: Cadence Generic Standard Cell Library 0.18um
• Tradeoff study:
  • 1st column: delay constraint enforced in xPilot
  • 2nd column: control step count of xPilot-generated RTL
  • 3rd to 5th columns: data reported after mapping by the Magma tool

Experimental Results: ASIC Flow (cont.)
