A Polyhedral-based SystemC Modeling and Generation Framework for Effective Low-power Design Space Exploration. Wei Zuo 1 , Warren Kemmerer 1 , Jong Bin Lim 1 , Louis-Noel Pochet 2 , Andrey Ayupov 3 , Taemin Kim 3 , Kyuntae Han 3 , Deming Chen 1
A Polyhedral-based SystemC Modeling and Generation Framework for Effective Low-power Design Space Exploration • Wei Zuo 1, Warren Kemmerer 1, Jong Bin Lim 1, Louis-Noel Pochet 2, Andrey Ayupov 3, Taemin Kim 3, Kyuntae Han 3, Deming Chen 1 • 1 University of Illinois at Urbana-Champaign • 2 Ohio State University • 3 Strategic CAD Labs of Intel Corporation
System-on-Chip (SoC) Proliferation • SoC has a great impact on design metrics • Deal with the fast-growing design complexity • Virtual-platform-based hardware/software co-design • SystemC for high-level modeling • Accuracy of component modeling is the precondition • Figure: ITRS 2007 SoC Consumer Portable Design Complexity Trends
Accelerator Design in SoC • Accelerator design is critical in SoC design • Figure: Out-of-Core Accelerators. Source: Shao and Brook @ Harvard [http://vlsiarch.eecs.harvard.edu/accelerators/die-photo-analysis]
Challenges of the Accelerator Design/Modeling in SoC • How to accurately model performance and power early on? • Essential to enable rapid prototyping of SoC devices • Difficult with no physical information available • How to dramatically improve design productivity? • Traditional design flow is too slow: it relies mainly on manual processes • How to explore the large design space? • Micro-architecture choices for performance and cost trade-offs • Identifying the optimal design is crucial but difficult • System-level automation is the key
Overall Framework (1) • Automated C-to-SystemC transformation engine • Tile based characterization flow for latency and power estimation • Two versions of SystemC code generation • For system-level modeling and high-level synthesis
SystemC Generation Framework • Tile the loops using polyhedral transformations • What is polyhedral transformation and why is this critical?
Polyhedral Transformation • Fine-granularity optimization of affine programs for HLS • Exposes parallelism and locality for parallelization and data reuse • Produces atomic tiles, data reuse buffers, and data transfer operations • Regularity is key for accurate power and latency characterization
(a) Original code:
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    s1: A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];
for (k = 0; k < N; k++)
  for (l = 0; l < N; l++)
    s2: x[k] += A[l][k] * y[l];
(b) Domain of s1 • (c) Dependence for edge s1→s2 • (d) Scheduling functions for s1 and s2 [figures not reproduced]
(e) Transformed code based on the scheduling functions in (d):
for (c1 = 0; c1 < N; c1++)
  for (c2 = 0; c2 < N; c2++) {
    A[c2][c1] += u1[c2] * v1[c1] + u2[c2] * v2[c1];
    x[c1] += A[c2][c1] * y[c2];
  }
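The claim that the transformed loop nest in (e) computes the same result as the original in (a) can be checked with a small plain C++ sketch. The function names (`run_original`, `run_fused`, `schedules_agree`) and the integer test data are hypothetical and chosen only for illustration; the loop bodies follow the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;
using Vec = std::vector<int>;

// (a) Original schedule: s1 updates all of A, then s2 reads A column by column.
Vec run_original(Matrix A, const Vec& u1, const Vec& v1,
                 const Vec& u2, const Vec& v2, const Vec& y) {
    const std::size_t N = A.size();
    Vec x(N, 0);
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            A[i][j] += u1[i] * v1[j] + u2[i] * v2[j];   // s1
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t l = 0; l < N; ++l)
            x[k] += A[l][k] * y[l];                     // s2
    return x;
}

// (e) Fused schedule: each A[c2][c1] is consumed right after it is produced.
Vec run_fused(Matrix A, const Vec& u1, const Vec& v1,
              const Vec& u2, const Vec& v2, const Vec& y) {
    const std::size_t N = A.size();
    Vec x(N, 0);
    for (std::size_t c1 = 0; c1 < N; ++c1)
        for (std::size_t c2 = 0; c2 < N; ++c2) {
            A[c2][c1] += u1[c2] * v1[c1] + u2[c2] * v2[c1];
            x[c1] += A[c2][c1] * y[c2];
        }
    return x;
}

// Builds arbitrary (hypothetical) integer inputs of size n and compares both schedules.
bool schedules_agree(int n) {
    assert(n > 0);
    Matrix A(n, Vec(n));
    Vec u1(n), v1(n), u2(n), v2(n), y(n);
    for (int i = 0; i < n; ++i) {
        u1[i] = i + 1;  v1[i] = 2 * i - 3;
        u2[i] = i % 3;  v2[i] = 4 - i;
        y[i] = i - 1;
        for (int j = 0; j < n; ++j) A[i][j] = i + 2 * j;
    }
    return run_original(A, u1, v1, u2, v2, y) == run_fused(A, u1, v1, u2, v2, y);
}
```

Calling `schedules_agree(n)` for a few sizes exercises both schedules on the same inputs. The fused version keeps the producer and consumer of each `A` element adjacent, which is the locality that tiling and data reuse buffers rely on.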
SystemC Generation Framework • Dilemma: • Latency and power accuracy rely on physical information • Generating the RTL for the entire design of different instances is not scalable • Tile-based characterization: • Extract tiles and separate them into components: • Computation blocks, communication channels, and memory blocks • Separately characterize the power and latency of these parts • Information extracted from gate-level simulation • Build a power model for each part considering different input switching activities
Characterization Flow • General architecture of generated accelerator [diagram labels: Main Mem; Hardware; local mem blocks, each with SA; acc_tile1, acc_tile2, … acc_tileN, each with SA] • Computation modules (acc_tile) • Read data from memories and compute • Local memories (local mem) • Store data for the accelerator • Switching activity (SA) calculation function • Computes the input switching activity to guide the selection of power and latency
Characterization Flow • Computation blocks [diagram labels: SystemC (generated from stage 3) → High Level Synthesis → RTL Code → Logic Synthesis → Netlist → Gate Level Power Analysis and Simulator, driven by Testbench → Switching Activities (SW), Latency, Power]
Characterization Flow • Memory blocks [diagram labels: Memory Library → RTL memory wrapper → Logic Synthesis → Netlist → Gate Level Power Analysis and Simulator, driven by Testbench → Switching Activities (SW), Latency, Power]
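The slides do not define how the switching activity (SA) function computes its value. A minimal sketch, assuming SA is the fraction of input bits that toggle between consecutive input vectors and that inputs are 32 bits wide (both are assumptions), could look like this:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed definition: average fraction of the 32 input bits that flip
// between each pair of consecutive input vectors (0.0 = static, 1.0 = all flip).
double switching_activity(const std::vector<std::uint32_t>& vectors) {
    if (vectors.size() < 2) return 0.0;  // no transitions to measure
    std::uint64_t toggles = 0;
    for (std::size_t i = 1; i < vectors.size(); ++i) {
        // XOR marks the bits that changed; count() is the Hamming distance.
        toggles += std::bitset<32>(vectors[i - 1] ^ vectors[i]).count();
    }
    const double transitions = static_cast<double>(vectors.size() - 1);
    return static_cast<double>(toggles) / (32.0 * transitions);
}
```

Under this definition, a stream that alternates between all zeros and all ones has activity 1.0, and a constant stream has activity 0.0.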
The Look-up Table for Power Modeling • The look-up table indexed by input-switching activity • The power consumption is NOT directly proportional to the input switching activity • The irregularity of switching activity propagation within the accelerator is captured by the characterization data
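A look-up table indexed by input switching activity can be sketched as below. The table values are made up for illustration, and the linear interpolation between characterized points is an assumption; the slides only state that the table is indexed by switching activity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One characterized point: switching activity and the measured power (mW).
struct PowerEntry {
    double activity;
    double power_mw;
};

// Hypothetical characterization data, sorted by activity. Note that power
// does not scale proportionally with activity.
std::vector<PowerEntry> example_table() {
    return {{0.1, 2.0}, {0.5, 5.0}, {0.9, 6.0}};
}

// Returns the power for a given activity, clamping outside the table range
// and interpolating linearly between neighbouring entries (an assumption).
double lookup_power(const std::vector<PowerEntry>& table, double activity) {
    assert(!table.empty());
    if (activity <= table.front().activity) return table.front().power_mw;
    if (activity >= table.back().activity) return table.back().power_mw;
    for (std::size_t i = 1; i < table.size(); ++i) {
        if (activity <= table[i].activity) {
            const PowerEntry& lo = table[i - 1];
            const PowerEntry& hi = table[i];
            const double t = (activity - lo.activity) / (hi.activity - lo.activity);
            return lo.power_mw + t * (hi.power_mw - lo.power_mw);
        }
    }
    return table.back().power_mw;
}
```

In the example table, going from activity 0.1 to 0.5 adds 3 mW while going from 0.5 to 0.9 adds only 1 mW, which is the kind of non-proportional behaviour a table can capture and a single scaling factor cannot.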
SystemC Generation Framework • SystemC generation for modeling • Use polyhedral analysis to generate a SystemC model for the tiled kernels • Back-annotate the power and latency information to the SystemC model • SystemC simulation to compute the values for the entire design • SystemC generation for HLS • Cycle-accurate interface • Insert “wait()” statements for scheduling
Code Generation: An Example
Unroll at this level: assume N/T = 4
// Top-level module
class FM_module : public sc_module {
  void FM() {
    while (1) {
      update_power(PW_MODE_ON, PW_PHASE_IDLE);
      for (it = 0; it < N/T; it++) {
        …
        // start the tile threads
        scgen_tile_start[it] = true;
      }
      wait();
      update_power(PW_MODE_ON, PW_PHASE_COMPUTE);
      // wait until all threads are finished
      while (!(scgen_tile_done[0] && … && scgen_tile_done[N/T-1]))
        wait();
      sc_stop();
    }
  }
};
// One tile
class FM_module_tile0 : public sc_module {
  for (jt = 0; jt < N/T; jt++) {
    // communication blocks
    /* read “size” elements from mem1, starting at address “sa”, into local memory, with read delay “delay1” */
    copy_to_local(int sa, int size, int *local_A, sc_time &delay1);
    …
    /* computation kernel of the accelerator, with power and latency counters */
    acc();
    /* write “size” elements from local_X to x, with write delay “delay2” */
    copy_to_mem(int sa, int size, int *local_X, sc_time &delay2);
  }
};
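How back-annotated per-tile numbers might roll up into design-level totals can be sketched in plain C++ without the SystemC kernel. Everything here is an assumption for illustration: the struct and function names, the three phases per iteration (read, compute, write), and the rule that unrolled tile threads run fully in parallel, so total latency is the slowest thread and total energy is the sum.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical back-annotated values for one tile thread, per tile iteration.
struct TileAnnotation {
    double read_ns, compute_ns, write_ns;   // latency of each phase
    double read_mw, compute_mw, write_mw;   // average power of each phase
};

struct Totals {
    double latency_ns;
    double energy_pj;   // 1 mW * 1 ns = 1 pJ
};

// Each thread executes `iterations` tile iterations back to back.
// Threads are assumed to run in parallel with no contention.
Totals accumulate(const std::vector<TileAnnotation>& threads, int iterations) {
    assert(iterations >= 0);
    Totals total{0.0, 0.0};
    for (const TileAnnotation& t : threads) {
        const double iter_ns = t.read_ns + t.compute_ns + t.write_ns;
        const double iter_pj = t.read_ns * t.read_mw
                             + t.compute_ns * t.compute_mw
                             + t.write_ns * t.write_mw;
        total.latency_ns = std::max(total.latency_ns, iter_ns * iterations);
        total.energy_pj += iter_pj * iterations;
    }
    return total;
}
```

In the generated SystemC model these totals would instead emerge from simulation, with `wait()` calls advancing time and power counters updated per phase; the sketch only shows the arithmetic being accumulated.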
Overall Framework (2) • Automated C-to-SystemC transformation engine • Tile based characterization flow for latency and power estimation • Two versions of SystemC code generation • For system-level modeling and high-level synthesis • Analytical power and latency models • Design space is big • impossible to traverse the entire space even with SystemC simulation • Use hyper-surface based sampling method • Evaluate the power and latency of all the points in the design space
Analytical Modeling for Power and Latency • Inputs: loop structures of the application, latency constraints • Flow: complete design space → sampling on tile size and unrolling factors → SystemC model generation for sampled points → SystemC simulation → surface fitting → modeled design space
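The surface-fitting step can be illustrated with an ordinary least-squares fit. The slides do not give the form of the fitted surface, so this sketch assumes the simplest case, a plane `value = k0 + k1 * tile_size + k2 * unroll`, and all names are hypothetical. Sampled points would come from SystemC simulation; the fitted model then predicts the remaining points.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// One simulated sample: design parameters and the measured power or latency.
struct Sample {
    double tile_size;
    double unroll;
    double value;
};

// Least-squares fit of value = k[0] + k[1]*tile_size + k[2]*unroll,
// solved through the 3x3 normal equations with Gauss-Jordan elimination.
std::array<double, 3> fit_plane(const std::vector<Sample>& samples) {
    assert(samples.size() >= 3);
    double m[3][4] = {};  // augmented matrix [X^T X | X^T z]
    for (const Sample& s : samples) {
        const double row[3] = {1.0, s.tile_size, s.unroll};
        for (int r = 0; r < 3; ++r) {
            for (int c = 0; c < 3; ++c) m[r][c] += row[r] * row[c];
            m[r][3] += row[r] * s.value;
        }
    }
    for (int col = 0; col < 3; ++col) {
        int pivot = col;  // partial pivoting for numerical stability
        for (int r = col + 1; r < 3; ++r)
            if (std::fabs(m[r][col]) > std::fabs(m[pivot][col])) pivot = r;
        assert(std::fabs(m[pivot][col]) > 1e-12);  // samples must not be degenerate
        for (int c = 0; c < 4; ++c) std::swap(m[col][c], m[pivot][c]);
        for (int r = 0; r < 3; ++r) {
            if (r == col) continue;
            const double f = m[r][col] / m[col][col];
            for (int c = col; c < 4; ++c) m[r][c] -= f * m[col][c];
        }
    }
    return {m[0][3] / m[0][0], m[1][3] / m[1][1], m[2][3] / m[2][2]};
}

// Evaluates the fitted model at an unsampled design point.
double predict(const std::array<double, 3>& k, double tile_size, double unroll) {
    return k[0] + k[1] * tile_size + k[2] * unroll;
}
```

A real power or latency surface is unlikely to be planar; the same normal-equation approach extends to more basis terms (for example products or reciprocals of the parameters) by widening `row`.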
Overall Framework (3) • Automated C-to-SystemC transformation engine • Tile based characterization flow for latency and power estimation • Two versions of SystemC code generation • For system-level modeling and high-level synthesis • Analytical power and latency models • Use hyper-surface based sampling method • Evaluate the power and latency of all the points in the design space • Design space is big • impossible to traverse the entire space even with SystemC simulation • Fast design space exploration • Design space pruning • Generate power and latency Pareto curve
Design Space Exploration • Inputs: C/C++ source, user-defined power & latency constraints, power / latency models • Flow: generating modeled design space → design space pruning → Pareto-optimal candidates → power & latency annotated SystemC → SystemC simulation → Pareto-optimal points • Outputs: power and latency info, Pareto-optimal points, SystemC model
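Extracting Pareto-optimal candidates from a modeled design space, where both power and latency are minimized, can be sketched with a sort-and-sweep pass. The `DesignPoint` type and function name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// A candidate design with its modeled power and latency (lower is better for both).
struct DesignPoint {
    double power;
    double latency;
};

// Returns the points not dominated by any other point, ordered by increasing power.
std::vector<DesignPoint> pareto_front(std::vector<DesignPoint> pts) {
    std::sort(pts.begin(), pts.end(),
              [](const DesignPoint& a, const DesignPoint& b) {
                  if (a.power != b.power) return a.power < b.power;
                  return a.latency < b.latency;
              });
    std::vector<DesignPoint> front;
    double best_latency = std::numeric_limits<double>::infinity();
    for (const DesignPoint& p : pts) {
        // Every earlier point has power <= p.power, so p survives only if it
        // improves on the best latency seen so far.
        if (p.latency < best_latency) {
            front.push_back(p);
            best_latency = p.latency;
        }
    }
    return front;
}
```

For the points (1, 10), (2, 8), (3, 9), (4, 4), (5, 5) as (power, latency), the sweep keeps (1, 10), (2, 8), and (4, 4); the other two are each beaten in both metrics by a kept point.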
An Example • Blue dots form the design space • Red dots form the Pareto frontier • Error rate: power 4.1%, latency 3.28%
Experiment (1): Accuracy of the SystemC Model against Gate-level Simulation • Setup: • 8 benchmarks • 45-nm standard cell library for computation blocks • 45-nm memory compiler for the memory blocks • All experiments target a frequency of 1 GHz • Golden model: • The design of the accelerator generated by HLS • Experiment: • Verify results in two settings
Accuracy of the Model for Different Switching Activities • Generate twenty input vector sets • Different switching activities from 0.1 to 0.95 • Each set includes 10000 vectors • As input to the SystemC as well as the golden model • Compare with the golden model
Experiment (2): Design Space Exploration • Generate Pareto point set • Evaluate accuracy • Analytical model vs. SystemC simulation
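The slides report error rates without stating the formula. One common choice, assumed here, is the mean absolute percentage error of the analytical model against the SystemC simulation results over the same set of design points:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute percentage error of `model` against `reference`, in percent.
// Assumes both vectors describe the same design points in the same order
// and that no reference value is zero.
double mean_abs_percent_error(const std::vector<double>& model,
                              const std::vector<double>& reference) {
    assert(!model.empty());
    assert(model.size() == reference.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < model.size(); ++i) {
        assert(reference[i] != 0.0);
        sum += std::fabs(model[i] - reference[i]) / std::fabs(reference[i]);
    }
    return 100.0 * sum / static_cast<double>(model.size());
}
```

For example, model values of 110 and 90 against reference values of 100 and 100 give an error of about 10%.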
Experiment (2): Design Space Exploration • Generate Pareto point set • Simulation speed-up • Number of points in design space / Simulation Points
Experiment (2): Design Space Exploration • Generate Pareto point set • Benefit of the thickness addition • Number of true Pareto points that would be pruned away without the thickness
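The slides do not define the thickness. One plausible reading, assumed here, is a relative margin that protects near-frontier points from being pruned because of model error: a point is discarded only if some other point beats it by more than the margin in both power and latency. The names and the margin rule are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct DesignPoint {
    double power;
    double latency;
};

// Keeps every point that is not beaten by more than `thickness` (a relative
// margin, e.g. 0.05 for 5%) in BOTH metrics by some other point.
// With thickness = 0 this removes only strictly dominated points.
std::vector<DesignPoint> prune_with_thickness(const std::vector<DesignPoint>& pts,
                                              double thickness) {
    assert(thickness >= 0.0 && thickness < 1.0);
    std::vector<DesignPoint> kept;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        bool pruned = false;
        for (std::size_t j = 0; j < pts.size() && !pruned; ++j) {
            if (i == j) continue;
            pruned = pts[j].power < pts[i].power * (1.0 - thickness) &&
                     pts[j].latency < pts[i].latency * (1.0 - thickness);
        }
        if (!pruned) kept.push_back(pts[i]);
    }
    return kept;
}
```

With points (10, 100), (10.2, 101), (20, 50), and (30, 120), a thickness of 0 keeps two points, while a thickness of 0.05 also keeps (10.2, 101), which sits within the margin of (10, 100) and could be the true Pareto point if the model is off by a few percent.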
Analysis of the DSE Results • Trade-off between power and latency • Different graph shapes • Communication-bound applications (Gemver) • Computation-bound applications (Correlation) Gemver Correlation
Analysis of the DSE Results • Communication-dominated design (Gemver) • Increasing parallel computation yields only a trivial latency decrease • Optimization opportunity: • P1 vs. P2: 1.7x less power, 4% longer latency • Computation-dominated design (Correlation) • Effective power-latency trade-off Gemver Correlation