xPilot: A Platform-Based System-Level Synthesis for Reconfigurable SOCs Prof. Jason Cong cong@cs.ucla.edu UCLA Computer Science Department
Motivation • Design complexity is outgrowing the traditional RTL method even in current CMOS technologies • Nanotechnology will enable a 10-100x increase in device density and degree of integration • Need to enable a higher level of design abstraction • Start from behavioral descriptions (e.g., C or SystemC) • Use and/or reuse more complex functional units (e.g., processor cores instead of standard cells)
xPilot: Platform-Based Synthesis System • Uniqueness of xPilot • Platform-based synthesis and optimization • Communication-centric synthesis with interconnect optimization • Figure (xPilot flow): inputs are SystemC/C and the platform description & constraints; the xPilot front end builds the SSDM (System-Level Synthesis Data Model), on which profiling, analysis, and mapping operate; processor & architecture synthesis, interface synthesis, and behavioral synthesis produce processor cores + executables, drivers + glue logic, and custom logic for the FPSoC
xPilot: Behavioral-to-RTL Synthesis Flow • Presynthesis optimizations • Loop unrolling/shifting • Strength reduction / tree height reduction • Bitwidth analysis • Memory analysis … • Core synthesis optimizations • Scheduling • Resource binding, e.g., functional unit binding, register/port binding • Arch-generation & RTL/constraints generation • Verilog/VHDL/SystemC • FPGAs: Altera, Xilinx • ASICs: Magma, Synopsys, … • Figure (flow): behavioral spec. in C/SystemC and platform description → front-end compiler → SSDM → RTL + constraints → FPGAs/ASICs
System-Level Exploration Using xPilot for Heterogeneous MPSoC Platforms • Heterogeneous MPSoC exploration • Processors • Heterogeneous vs. homogeneous • General-purpose vs. application-specific • On-chip communication architecture (OCA) • Bus (e.g., AMBA, CoreConnect), packet-switching network (e.g., Alpha 21364) • Memory hierarchy • Figure: a heterogeneous MPSoC in which μPs (running tasks over OS and driver layers), an IP block, an FPGA, and a DSP each attach through a network interface to a communication network
Outline • xPilot Overview • Behavior-level synthesis in xPilot • System-level synthesis in xPilot • Recent Progress in xPilot • Interface synthesis • Resource binding based on distributed register architecture • Conclusions
Advantages of Behavioral Synthesis • Shorter verification/simulation cycle • Better complexity management, faster time to market • Rapid system exploration • Quick evaluation of different hardware/software boundaries • Fast exploration of multiple micro-architecture alternatives • Higher quality of results • Platform-based synthesis & optimization • Full consideration of physical reality
Example: Better Complexity Management • Shorter verification/simulation cycle • Simulation speed 100X faster than the RTL-based method [NEC, ASPDAC04] • Significant code size reduction • RTL design: ~300K lines; behavioral design: 40K lines [NEC, ASPDAC04] • VHDL code generated by UCLA xPilot targeting the Altera Stratix platform • Over 10x code size reduction can be achieved
Unique Features of xPilot (1): Platform-based Synthesis & Optimization • Platform-based synthesis & optimization • The quality of an RTL design is platform-dependent • Designers often lack complete and detailed knowledge of the target platform • Platform: Altera Stratix • RTL synthesis & place-and-route: Altera Quartus II v5.0 • Figure: 3X3 delay matrix, with coordinates labeled (0,0) and (95,61)
Unique Features of xPilot (2): Communication-Centric Synthesis & Optimization • System performance & power are dominated by interconnect • It is difficult for designers to consider physical layout at the RT level • Layout-aware performance optimization: overlap computation with communication • Layout-aware power optimization • Figure: a CDFG with multiplications 2* to 6*, a comparison with true (T) and false (F) branches, and adders add1 and add2, shown with two bindings • Binding solution 1: mul1 (2,4,5), mul2 (3,6); both multipliers stay active • Binding solution 2: mul1 (2,5,6), mul2 (3,4); mul2 can be powered off when the false branch is taken
Unique Features of xPilot (3): Highly Scalable and Optimized Synthesis Algorithms • Use of highly scalable and optimized synthesis algorithms for best quality of results • Interface synthesis: simultaneous data and communication scheduling for latency minimization • Scheduling: a unified framework for multi-constraint, multi-objective scheduling based on the system of difference constraints (SDC) • Resource binding: use of distributed register architectures for interconnect/communication optimization • Power optimization: optimal functional module and voltage binding • …
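The SDC formulation named on this slide can be illustrated with a generic sketch. A system of difference constraints has the form x[u] - x[v] <= c, and a feasible integer solution can be found with Bellman-Ford shortest paths. The function name `solve_sdc`, the three-operation example, and its latencies are assumptions made for illustration; this is a textbook construction, not xPilot's scheduler.

```python
def solve_sdc(num_vars, constraints):
    """Solve difference constraints x[u] - x[v] <= c with Bellman-Ford.

    constraints is a list of (u, v, c) triples. Returns a list of
    non-negative integers satisfying every constraint, or None if the
    system is infeasible (the constraint graph has a negative cycle).
    """
    # Starting every distance at 0 is equivalent to adding a virtual
    # source with a zero-weight edge to each variable.
    dist = [0] * num_vars
    for _ in range(num_vars):
        changed = False
        for u, v, c in constraints:
            if dist[v] + c < dist[u]:
                dist[u] = dist[v] + c
                changed = True
        if not changed:
            shift = -min(dist)  # shift so the earliest step is 0
            return [d + shift for d in dist]
    return None  # still relaxing after num_vars passes: negative cycle


# Hypothetical example: operations 0 and 1 feed operation 2.
# Operation 0 takes 1 cycle and operation 1 takes 2 cycles, so
#   s2 - s0 >= 1  is written as  s0 - s2 <= -1
#   s2 - s1 >= 2  is written as  s1 - s2 <= -2
schedule = solve_sdc(3, [(0, 2, -1), (1, 2, -2)])
print(schedule)
```

Dependency, latency, and relative timing constraints all fit this one form, which is what lets a single solver handle several constraint types at once.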
Behavior and Communication Co-Optimization for Systems with SCM • SCM: Sequential Communication Media • FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.) • Data must be read and written in the same order • Order may have a dramatic impact on performance • The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission • DCT example (C): for (int i=0; i <8; i++) { S1: data[i] = …;} int s07 = data[0] + data[7]; int s16 = data[1] + data[6]; … • Figure (DCT example): processes P1 and P2, mapped to PE1 (custom logic 1) and PE2 (custom logic 2), connected by a FIFO that carries data[8]
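The effect of transmission order in the DCT example can be seen with a small latency model. The sketch below assumes one datum arrives per cycle, the consumer has a single adder, and each addition takes 2 cycles; these parameters and the function name `finish_time` are hypothetical and chosen only to make the numbers easy to follow.

```python
def finish_time(order, pairs, add_latency=2):
    """Cycle at which the last pairwise sum completes.

    order: sequence of data indices in the order they enter the FIFO.
    pairs: (a, b) index pairs that the consumer adds together.
    Assumes one arrival per cycle and one non-pipelined adder.
    """
    arrival = {idx: pos + 1 for pos, idx in enumerate(order)}
    # A sum can start only once both of its operands have arrived.
    ready = sorted(max(arrival[a], arrival[b]) for a, b in pairs)
    t = 0
    for r in ready:  # serve sums in earliest-ready order
        t = max(t, r) + add_latency
    return t


pairs = [(0, 7), (1, 6), (2, 5), (3, 4)]  # s07, s16, s25, s34
in_order = finish_time([0, 1, 2, 3, 4, 5, 6, 7], pairs)
paired = finish_time([0, 7, 1, 6, 2, 5, 3, 4], pairs)
print(in_order, paired)
```

Under these assumptions, sending the data in index order leaves the adder idle until the fifth datum arrives, while sending operands pair by pair lets additions overlap with the remaining transfers.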
SCM Co-Optimization Problem Formulation • Given: • A set of processes P connected by a set of channels C • A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj • Goal: • Find the optimal transmission order for each process so that the overall latency of the process network is minimized, subject to the given design constraints and platform specifications • At the same time, automatically generate the drivers and glue logic for each process
Proposed SCM Co-Optimization Design Flow • Inputs: process network; platform description & constraints • Front end builds the System-Level Synthesis Data Model • SCOOP (SCM Co-Optimization) • Communication order detection • Code transformation and interface generation • Indices compression for loop reordering • Outputs: drivers + glue logic; process behavior
Communication Order Detection • Step 1. Construct a global CDFG by merging the individual CDFGs of each process • Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG • Figure: Process 1 and Process 2, built from multiplications and additions, exchange data T1, T2, T3 through a FIFO; two transmission orders are shown, with latencies of 7 cycles and 5 cycles
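The two steps can be sketched with a plain list scheduler run on a merged graph in which each FIFO transfer is an operation competing for a shared "fifo" resource. Everything in the example is an assumption: the nine-operation network, unit latencies, one unit per resource, and the names `list_schedule` and `longest_path_to_sink`. It is not the graph from the slide's figure, and list scheduling is a common heuristic rather than the method the slides prescribe.

```python
def longest_path_to_sink(ops):
    """Priority: number of operations on the longest path to a sink."""
    succs = {n: [] for n in ops}
    for n, (_, preds) in ops.items():
        for p in preds:
            succs[p].append(n)
    memo = {}

    def depth(n):
        if n not in memo:
            memo[n] = 1 + max((depth(s) for s in succs[n]), default=0)
        return memo[n]

    return {n: depth(n) for n in ops}


def list_schedule(ops, capacity, priority):
    """Resource-constrained list scheduling with unit-latency operations.

    ops: name -> (resource, [predecessors]); capacity: resource -> units.
    Returns (start cycle per operation, total latency in cycles).
    """
    start, cycle = {}, 0
    while len(start) < len(ops):
        used = {}
        ready = [n for n, (_, preds) in ops.items()
                 if n not in start
                 and all(p in start and start[p] < cycle for p in preds)]
        for n in sorted(ready, key=lambda name: -priority[name]):
            res = ops[n][0]
            if used.get(res, 0) < capacity[res]:
                start[n] = cycle
                used[res] = used.get(res, 0) + 1
        cycle += 1
    return start, max(start.values()) + 1


# Merged graph: producer multiplies m1..m3, FIFO transfers t1..t3,
# consumer adds a1..a3 forming a chain (all hypothetical).
ops = {
    "m1": ("mul", []), "m2": ("mul", []), "m3": ("mul", []),
    "t1": ("fifo", ["m1"]), "t2": ("fifo", ["m2"]), "t3": ("fifo", ["m3"]),
    "a1": ("add", ["t1"]), "a2": ("add", ["a1", "t2"]),
    "a3": ("add", ["a2", "t3"]),
}
capacity = {"mul": 1, "fifo": 1, "add": 1}

# A poor order that produces and sends the third datum first.
naive = {"m3": 3, "m2": 2, "m1": 1, "t3": 3, "t2": 2, "t1": 1,
         "a1": 0, "a2": 0, "a3": 0}
_, naive_latency = list_schedule(ops, capacity, naive)
_, tuned_latency = list_schedule(ops, capacity, longest_path_to_sink(ops))
print(naive_latency, tuned_latency)
```

The transmission order falls out of the schedule: the cycle assigned to each transfer operation gives its position in the FIFO.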
Loop Indices Compression • Given the optimal order, we try to generate restructured loops for code compression • i.e., given the original iteration and the reordered iteration, find the minimum number of linear intervals to represent the new iteration space • Original order: (0,0), (0,1), (1,0), (1,1) • After reordering: (0,0), (1,0), (0,1), (1,1) • Need to solve a linear system for the index mapping • Solution: i' = j, j' = i
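For the 2x2 example on this slide, the index mapping can be recovered by searching for an integer matrix that sends each original iteration to its reordered counterpart. The brute-force search over entries in {-1, 0, 1} and the name `find_linear_map` are assumptions for illustration; the sketch covers only the case where one linear interval suffices.

```python
from itertools import product


def find_linear_map(original, reordered, values=(-1, 0, 1)):
    """Find rows (a, b), (c, d) with i' = a*i + b*j and j' = c*i + d*j
    mapping original[k] to reordered[k] for every k, or None."""
    for a, b, c, d in product(values, repeat=4):
        if all((a * i + b * j, c * i + d * j) == target
               for (i, j), target in zip(original, reordered)):
            return (a, b), (c, d)
    return None


original = [(0, 0), (0, 1), (1, 0), (1, 1)]
reordered = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(find_linear_map(original, reordered))
```

The result has first row (0, 1) and second row (1, 0), which is the interchange i' = j, j' = i stated on the slide.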
Preliminary Experimental Results • Experimental setting • Target communication model: two-process producer-consumer model • Behavioral synthesizer: UCLA xPilot • RTL simulator: Mentor ModelSim • Result: an average 26% improvement in total latency can be achieved
Advantage of Register-File Microarchitectures • (a) A scheduled DFG with register binding indicated on each variable • (b) Binding using discrete registers • (c) Binding using a register file
Distributed Register-File Microarchitecture • Figure: an FP-SoC partitioned into islands (A, B, C); each island contains a local register file, data-routing logic, input buffers, a MUX, and a functional unit pool (FUP) with units such as MUL, ALU, and ALU' • Local register files map to on-chip memory blocks, i.e., the on-chip RAM resources of Virtex II and Stratix
Resource Binding for DRFM • Facts under simplified assumptions • Operations bound onto an island form a chain in the given scheduled DFG • Inter-chain data transfers may share a physical inter-island connection • The number of inter-island connections is crucial to the QoR of a DRFM instance • Inter-island connections • (A,B)=(A,D)=1 • (A,C)=1, two data transfers share one connection • (C,D)=2 • Figure: a scheduled DFG with operations v1 to v10 in control steps 1 to 4, bound to islands A, B, C, D
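One way to read the connection counts above is that transfers between the same pair of islands can share a connection when they occur in different control steps, so the number of connections a pair needs is its peak number of simultaneous transfers. That reading, the function name `inter_island_connections`, and the transfer list are assumptions; the list is made up to reproduce the four counts on the slide and is not taken from its figure.

```python
from collections import Counter, defaultdict


def inter_island_connections(transfers):
    """Connections needed per (source island, destination island).

    transfers: list of (src_island, dst_island, control_step).
    Transfers in different control steps are assumed to share a link.
    """
    per_step = Counter()
    for src, dst, step in transfers:
        if src != dst:  # transfers inside an island need no link
            per_step[(src, dst, step)] += 1
    needed = defaultdict(int)
    for (src, dst, _), count in per_step.items():
        needed[(src, dst)] = max(needed[(src, dst)], count)
    return dict(needed)


# Hypothetical transfers for a four-island binding.
transfers = [
    ("A", "B", 1), ("A", "D", 2),
    ("A", "C", 1), ("A", "C", 3),   # different steps: one shared link
    ("C", "D", 2), ("C", "D", 2),   # same step: two links
]
links = inter_island_connections(transfers)
print(links, sum(links.values()))
```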
Resource Binding Problem for DRFM • General DRFM binding problem • Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the quality of B is optimized • Hard to characterize the quality of a binding solution B • The problem is too ad hoc • Relaxed problem: DRFM Binding for Minimizing Inter-Island Connections • Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the total number of inter-island connections of B is minimized • Solution: binding one control step at a time with min-cost bipartite matching
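The step-by-step idea can be sketched as follows: in each control step, the operations of that step are assigned to distinct islands so that the number of new inter-island connections is as small as possible. The cost model (one unit per predecessor whose island is not yet connected to the candidate island), the name `bind_by_steps`, and the five-operation example are all assumptions. The matching is found by enumerating assignments, which stands in for a real min-cost bipartite matching solver and is practical only for a handful of islands.

```python
from itertools import permutations


def bind_by_steps(steps, preds, islands):
    """Bind operations to islands one control step at a time.

    steps: list of lists of operations, one list per control step.
    preds: operation -> list of operations whose results it consumes.
    Each step must have no more operations than there are islands.
    Returns (operation -> island, set of (src, dst) connections).
    """
    binding, connections = {}, set()
    for ops in steps:
        def cost(op, island):
            # New connections this choice would require.
            return sum(1 for p in preds.get(op, [])
                       if binding[p] != island
                       and (binding[p], island) not in connections)

        # Exhaustive min-cost assignment of this step's operations.
        best = min(permutations(islands, len(ops)),
                   key=lambda perm: sum(cost(o, i)
                                        for o, i in zip(ops, perm)))
        for op, island in zip(ops, best):
            for p in preds.get(op, []):
                if binding[p] != island:
                    connections.add((binding[p], island))
            binding[op] = island
    return binding, connections


# Hypothetical scheduled DFG: two chains that join in step 3.
steps = [["v1", "v2"], ["v3", "v4"], ["v5"]]
preds = {"v3": ["v1"], "v4": ["v2"], "v5": ["v3", "v4"]}
binding, connections = bind_by_steps(steps, preds, ["A", "B"])
print(binding, connections)
```

In this example each chain stays on its own island, and only the final join needs a connection.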
Three Experimental Flows for Comparison • Common flow in the xPilot behavioral synthesis system: xPilot front end → SSDM/CDFG → scheduling algorithms → scheduled CDFG (STG) → binding → RTL generation → Xilinx Virtex II • Binding alternatives compared: • 1) Binding on Discrete-Register Microarchitecture • 2) Baseline (Random) DRFM Binding • 3) DRFM Binding for Minimizing Inter-Island Connections
Experimental Results • Xilinx ISE 7.1; Virtex II; target clock period: 8ns • The baseline DRFM binding achieves a 46.70% slice reduction over the discrete-register approach • Optimized DRFM binding reduces slices by a further 12.21% • Overall, more than 2X logic slice reduction with a 7.8% better clock period • Charts: area (slices; DRF solutions use on-chip RAM blocks) and clock period (ns)
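The "more than 2X" figure can be checked from the two percentages. The slide does not say whether the further 12.21% is relative to the baseline DRFM result or to the discrete-register result, so the sketch below evaluates both readings; the function name `overall_ratio` is hypothetical.

```python
def overall_ratio(first, further, compound=True):
    """Slice-count ratio of the discrete-register design to the
    optimized DRFM design, given two fractional reductions."""
    if compound:
        # Further reduction applied to the already reduced design.
        remaining = (1 - first) * (1 - further)
    else:
        # Both reductions measured against the original design.
        remaining = 1 - first - further
    return 1 / remaining


compound = overall_ratio(0.4670, 0.1221, compound=True)
additive = overall_ratio(0.4670, 0.1221, compound=False)
print(round(compound, 2), round(additive, 2))
```

Both readings give a ratio above 2, consistent with the slide's summary.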
Conclusions • xPilot can automatically synthesize behavior-level C or SystemC representations into RTL code with the necessary design constraints • Platform-based synthesis with physical planning provides • Shorter verification/simulation cycle • Better complexity management, faster time to market • Rapid system exploration • Higher quality of results • xPilot can help to explore the efficient use of (multiple) on-chip processors • xPilot can efficiently optimize the software for reconfigurable processors • We are interested in engaging with selected industrial partners to further validate and enhance the technology
Acknowledgements • We would like to thank the following for their support • National Science Foundation (NSF) • Gigascale Systems Research Center (GSRC) • Semiconductor Research Corporation (SRC) • Industrial sponsors under the California MICRO program (Altera, Xilinx) • Team members: Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang