
Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke, University of Michigan


Presentation Transcript


  1. Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
     Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke
     University of Michigan

  2. Automated C to Gates Solution
     • SoC design
       - 10-100 Gops, 200 mW power budget
       - Low level tools ineffective
     • Automated accelerator synthesis for whole application
       - Correct by construction
       - Increase designer productivity
       - Faster time to market
     [Figure: app.c compiled into a pipeline of loop accelerators (LA)]

  3. Streaming Applications
     • Data “streaming” through kernels
     • Kernels are tight loops
       - FIR, Viterbi, DCT
     • Coarse grain dataflow between kernels
       - Sub-blocks of images, network packets
     [Figure: H.264 Encoder. Blocks: Image, Transform, Quantizer, Coder, Coded Image, Inverse Quantizer, Inverse Transform, Motion Estimator, Motion Predictor]
     [Figure: W-CDMA Transmitter. Blocks: Data in, CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter, Data out]

  4. System Schema Overview
     [Figure: Kernels 1 to 5 mapped onto loop accelerators LA 1, LA 2, and LA 3; a timeline (axis: time) shows successive tasks overlapping across the accelerators, with the task throughput marked]

  5. Input Specification
     • System specification
       - Function with main input/output
       - Local arrays to pass data
       - Sequence of calls to kernels
       - Sequential C program
     • Kernel specification
       - Perfectly nested FOR loop
       - Wrapped inside C function
       - All data access made explicit

     System specification (dataflow: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out):

         dct(char inp[8][8], char out[8][8]) {
             char tmp1[8][8], tmp2[8][8];
             row_trans(inp, tmp1);
             col_trans(tmp1, tmp2);
             zigzag_trans(tmp2, out);
         }

     Kernel specification:

         row_trans(char inp[8][8], char out[8][8]) {
             for(i=0; i<8; i++) {
                 for(j=0; j<8; j++) {
                     . . . = inp[i][j];
                     out[i][j] = . . . ;
                 }
             }
         }
         col_trans(char inp[8][8], char out[8][8]);
         zigzag_trans(char inp[8][8], char out[8][8]);

  6. System Level Decisions
     • Throughput of each LA (initiation interval)
     • Grouping of loops into a multifunction LA
     • More loops in a single LA → LA occupied for longer time in current task
     [Figure: kernels K1 to K4, each with TC=100, mapped onto LA 1, LA 2, and LA 3 along a timeline marked at 100, 200, 300, and 400 cycles. Annotations: "LA 1 occupied for 200 cycles", "Throughput = 1 task / 200 cycles"]
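
The relationship between loop grouping and task throughput on the slide above can be sketched in C. This is a simplified model, not the Streamroller cost model: it assumes each loop keeps its accelerator busy for roughly TC × II cycles (pipeline fill and drain are ignored) and that the steady-state interval between tasks is set by the busiest accelerator. The `Loop` struct and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One kernel loop. All field names are illustrative assumptions. */
typedef struct {
    unsigned trip_count; /* TC: iterations per task */
    unsigned ii;         /* initiation interval in cycles */
    unsigned la;         /* index of the loop accelerator the loop is mapped to */
} Loop;

/* Cycles for which accelerator `la` is busy within one task:
 * the sum of TC * II over all loops mapped to it. */
unsigned la_occupancy(const Loop *loops, size_t n, unsigned la)
{
    unsigned busy = 0;
    assert(loops != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        if (loops[k].la == la) {
            busy += loops[k].trip_count * loops[k].ii;
        }
    }
    return busy;
}

/* Steady-state cycles between consecutive task starts:
 * the occupancy of the busiest accelerator. */
unsigned task_interval(const Loop *loops, size_t n, unsigned num_las)
{
    unsigned worst = 0;
    for (unsigned la = 0; la < num_las; la++) {
        unsigned busy = la_occupancy(loops, n, la);
        if (busy > worst) {
            worst = busy;
        }
    }
    return worst;
}
```

Under this model, four loops with TC=100 and an assumed II of 1, with the first two sharing one accelerator, give a task interval of 200 cycles, which is consistent with the "1 task / 200 cycles" annotation on the slide.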

  7. System Level Decisions (Contd.)
     • Cost of SRAM buffers for intermediate arrays
     • More buffers → more task overlap → high performance
     [Figure: LA 1, LA 2, and LA 3 run K1, K2, and K3 respectively (each II=1, TC=100), connected through buffers tmp1 and tmp2, along a timeline marked at 100, 200, and 300 cycles. Annotations: "tmp1 buffer in use by LA2", "Adjacent tasks use different buffers"]
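
The buffering trade-off on the slide above can be illustrated with a small C sketch. It assumes a round-robin assignment of buffer copies to tasks and a two-stage producer/consumer model; neither is confirmed by the slides, and all function names are hypothetical.

```c
#include <assert.h>

/* Buffer copy used by task number `task` when `copies` copies of an
 * intermediate array (for example tmp1) exist. Round-robin assignment
 * is an assumption made for this sketch. */
unsigned buffer_for_task(unsigned task, unsigned copies)
{
    assert(copies >= 1);
    return task % copies;
}

/* Returns 1 if the producer may write task `task` while the consumer is
 * still reading task `task - 1`, i.e. the two tasks use different copies. */
int can_overlap(unsigned task, unsigned copies)
{
    if (task == 0) {
        return 0; /* no earlier task to overlap with */
    }
    return buffer_for_task(task, copies) != buffer_for_task(task - 1, copies);
}

/* Steady-state cycles between tasks for one producer/consumer pair.
 * With a single copy the two stages take turns on the buffer; with two
 * or more copies they run concurrently and the slower stage dominates. */
unsigned stage_pair_interval(unsigned producer_cycles,
                             unsigned consumer_cycles,
                             unsigned copies)
{
    assert(copies >= 1);
    if (copies == 1) {
        return producer_cycles + consumer_cycles;
    }
    return producer_cycles > consumer_cycles ? producer_cycles
                                             : consumer_cycles;
}
```

With two 100-cycle stages, this model gives an interval of 200 cycles for a single buffer and 100 cycles once a second copy is added, at the cost of one more SRAM buffer.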

  8. Case Study: “Simple” benchmark
     [Figure: loop graph (TC=256) and alternative mappings of its loops onto LA 1 through LA 4; cycle counts shown: 512, 1792, 2048, and 1536 cycles]

  9. Prescribed Throughput Accelerators
     • Traditional behavioral synthesis
       - Directly translate C operators into gates
     • Our approach: Application-centric Architectures
       - Achieve fixed throughput
       - Maximize hardware sharing
     [Figure labels: Operation graph, Datapath, Application, Architecture]

  10. Loop Accelerator Template
     • Hardware realization of modulo scheduled loop
     • Parameterized execution resources, storage, connectivity

  11. Loop Accelerator Design Flow
     [Figure: design flow from C code (.c) and a performance (throughput) target to a synthesized loop accelerator. Stages shown: FU Alloc, Modulo Schedule, Build Datapath, Instantiate Arch, Synthesize. Intermediate artifacts: Abstract Arch (FUs, RF), Scheduled Ops (Op1, Op2, Op3, … over time), Concrete Arch, Loop Accelerator Verilog and control signals (.v)]
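
The FU allocation step in the flow above starts from a throughput target. A minimal C sketch of the standard resource bounds used in modulo scheduling follows; it assumes fully pipelined function units (one new operation per unit per cycle) and is not taken from the Streamroller implementation. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum number of units of one FU type needed to issue `ops`
 * operations of that type per iteration at initiation interval `ii`:
 * ceil(ops / ii). Assumes fully pipelined units. */
unsigned min_fus(unsigned ops, unsigned ii)
{
    assert(ii > 0);
    return (ops + ii - 1) / ii;
}

/* Resource-constrained lower bound on II (ResMII) for a fixed datapath:
 * the maximum over FU types of ceil(ops[k] / units[k]). */
unsigned res_mii(const unsigned *ops, const unsigned *units, size_t num_types)
{
    unsigned bound = 1; /* II is at least one cycle */
    for (size_t k = 0; k < num_types; k++) {
        if (ops[k] == 0) {
            continue; /* type unused by this loop */
        }
        assert(units[k] > 0); /* otherwise the loop cannot be scheduled */
        unsigned need = (ops[k] + units[k] - 1) / units[k];
        if (need > bound) {
            bound = need;
        }
    }
    return bound;
}
```

For example, a loop body with 6 additions and a prescribed II of 4 needs at least 2 adders under this bound. Recurrence constraints, which can also raise the achievable II, are not modeled here.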

  12. Multifunction Accelerator
     • Map multiple loops to single accelerator
     • Improve hardware efficiency via reuse
     • Opportunities for sharing
       - Disjoint stages (loops 2, 3)
       - Pipeline slack (loops 4, 5)
     [Figure: application flow with Loop 1, a "Frame Type?" branch, Loop 2, Loop 3, Loop 4, Block 5, …; an accelerator pipeline of single loop accelerators LA1 to LA5 is compared with a pipeline in which some stages are replaced by multifunction loop accelerators]

  13. Datapath Union
     • 43% average savings over sum of accelerators
     • Smart union within 3% of joint scheduling solution
     [Figure: Loop 1 and Loop 2 each pass through a Cost Sensitive Modulo Scheduler to produce a set of FUs; the two FU sets are combined by the union step into a single datapath]
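
To make the comparison between a union and the "sum of accelerators" concrete, here is a naive C sketch. It takes the per-type maximum of FU counts, which is only the simplest possible union and is not the "smart union" the slide refers to. The `Datapath` struct, the three FU types, and the cost values used in the example are all hypothetical.

```c
#include <assert.h>

#define NUM_FU_TYPES 3 /* e.g. adder, multiplier, memory port (assumed) */

/* Number of function units of each type in one datapath. */
typedef struct {
    unsigned count[NUM_FU_TYPES];
} Datapath;

/* Naive union: for each FU type keep the larger of the two counts, so
 * either loop can run on the shared datapath (one loop at a time). */
Datapath datapath_union(const Datapath *a, const Datapath *b)
{
    Datapath u;
    assert(a != 0 && b != 0);
    for (int t = 0; t < NUM_FU_TYPES; t++) {
        u.count[t] = a->count[t] > b->count[t] ? a->count[t] : b->count[t];
    }
    return u;
}

/* Total FU cost of a datapath under a per-type cost table.
 * Interconnect and register costs are ignored in this sketch. */
unsigned datapath_cost(const Datapath *d, const unsigned cost[NUM_FU_TYPES])
{
    unsigned total = 0;
    for (int t = 0; t < NUM_FU_TYPES; t++) {
        total += d->count[t] * cost[t];
    }
    return total;
}
```

With illustrative counts {2, 1, 0} and {1, 1, 1} and made-up unit costs {10, 30, 5}, the two separate accelerators cost 50 + 45 = 95 while the naive union costs 55. These numbers only illustrate the mechanism and are unrelated to the 43% figure reported on the slide.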

  14. Challenges: Throughput Enabling Transformations
     • Algorithm-level pipeline retiming
     • Splitting loops based on tiling
     • Co-scheduling adjacent loops
     [Figure: original pipeline (Loop 1, Loop 2, Loop 3, Loop 4) next to the transformed pipeline (Loop 1, Loop 2a, Loop 2b, Loop 3,4), with the critical loop marked in each]

  15. Challenges: Programmable Loop Accelerator
     • Support bug fixes, evolving standards
     • Accelerate loops not known at design time
     • Minimize additional control overhead
     [Figure labels: FUs, MEM, Local Mem, Interconnect, Control (II), Control signals]

  16. Challenges: Timing Aware Synthesis
     • Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance
     • Strategies to eliminate long wires
       - Preemptive: predict & prevent long wires
       - Reactive: use feedback from floorplanner
         - Insert flip flop on long path
         - Reschedule with added latency
     [Figure labels: FU1, FU2, FU3]
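
The reactive strategy above (insert a flip-flop on a long path, then reschedule with the added latency) can be sketched as a small C calculation. It assumes the floorplanner reports a wire delay per FU-to-FU connection and that inserted flip-flops split a wire into equal segments; both are assumptions of this sketch, and the function names are hypothetical.

```c
#include <assert.h>

/* Number of flip-flops to insert on a wire so that every segment fits
 * within one clock period: ceil(delay / period) - 1, or 0 if the wire
 * already meets timing. Delays are in picoseconds. */
unsigned extra_flip_flops(unsigned wire_delay_ps, unsigned clock_period_ps)
{
    assert(clock_period_ps > 0);
    if (wire_delay_ps <= clock_period_ps) {
        return 0;
    }
    return (wire_delay_ps + clock_period_ps - 1) / clock_period_ps - 1;
}

/* Latency (in cycles) that the scheduler should assume for an edge
 * between two operations after the extra flip-flops are inserted. */
unsigned edge_latency_after_fix(unsigned original_latency,
                                unsigned wire_delay_ps,
                                unsigned clock_period_ps)
{
    return original_latency
         + extra_flip_flops(wire_delay_ps, clock_period_ps);
}
```

Each inserted flip-flop adds one cycle to the producer-to-consumer latency, which is why the loop has to be rescheduled afterwards.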

  17. Challenges: Adaptable Voltage/Frequency Levels
     • Allow voltage scaling beyond margins
     • Using shadow latches in loop accelerator
       - Localized error detection
       - Control is predefined: simple error recovery
     [Figure: a flip-flop (D, Q, CLK) paired with a shadow latch on a delayed clock, producing an error signal; FUs augmented with shadow latches and extra queue entries]

  18. For More Information
     • Visit http://cccp.eecs.umich.edu
