Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke
University of Michigan

Automated C to Gates Solution
[Figure: app.c mapped onto a set of loop accelerators (LAs)]
• SoC design
  • 10-100 Gops, 200 mW power budget
  • Low level tools ineffective
• Automated accelerator synthesis for whole application
  • Correct by construction
  • Increase designer productivity
  • Faster time to market

Streaming Applications
[Figure: H.264 Encoder block diagram; blocks: Transform, Quantizer, Coder, Inverse Quantizer, Inverse Transform, Motion Estimator, Motion Predictor; input Image, output Coded Image]
[Figure: W-CDMA Transmitter block diagram; blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter; Data in, Data out]
• Data “streaming” through kernels
• Kernels are tight loops
  • FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
  • Sub-blocks of images, network packets

System Schema Overview
[Figure: timeline of Kernels 1-5 running on LA 1, LA 2, and LA 3, with successive tasks overlapped; labels: Task throughput, time]

Input Specification
[Figure: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]
• System specification
  • Function with main input/output
  • Local arrays to pass data
  • Sequence of calls to kernels
• Sequential C program
• Kernel specification
  • Perfectly nested FOR loop
  • Wrapped inside C function
  • All data access made explicit

System specification:

    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

Kernel specification:

    row_trans(char inp[8][8], char out[8][8]) {
        for(i=0; i<8; i++) {
            for(j=0; j<8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

System Level Decisions
• Throughput of each LA
  – Initiation Interval
• Grouping of loops into a multifunction LA
  • More loops in a single LA → LA occupied for longer time in current task
[Figure: loops K1-K4 (TC=100 each) scheduled on LA 1, LA 2, and LA 3 over a timeline marked 100-400 cycles; LA 1 occupied for 200 cycles; throughput = 1 task / 200 cycles]
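The grouping trade-off above can be sketched in a few lines of C. This is a simplified model, not the Streamroller cost model: it assumes each loop occupies its LA for TC × II cycles per task (ignoring pipeline fill and drain), and that the interval between task starts equals the occupancy of the busiest LA. The `Loop` struct, `MAX_LAS`, and `task_interval` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one kernel loop (illustrative names). */
typedef struct {
    unsigned trip_count; /* TC: iterations per task */
    unsigned ii;         /* initiation interval, in cycles */
    unsigned la;         /* index of the LA the loop is mapped to */
} Loop;

#define MAX_LAS 16 /* arbitrary limit for this sketch */

/* Cycles between successive task starts under the simplified model:
 * each LA is busy for the sum of TC * II over its loops, and the
 * busiest LA limits the task throughput. */
unsigned task_interval(const Loop *loops, size_t n_loops, unsigned n_las)
{
    unsigned busy[MAX_LAS] = {0};
    unsigned worst = 0;

    assert(n_las <= MAX_LAS);
    for (size_t i = 0; i < n_loops; i++) {
        assert(loops[i].la < n_las);
        busy[loops[i].la] += loops[i].trip_count * loops[i].ii;
    }
    for (unsigned a = 0; a < n_las; a++)
        if (busy[a] > worst)
            worst = busy[a];
    return worst;
}
```

With four loops of TC=100 and II=1, placing two of them on one LA (an assumed mapping) makes that LA busy for 200 cycles per task, which matches the 1 task / 200 cycles figure on the slide.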

System Decisions (Contd.)
• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance
[Figure: K1, K2, K3 (II=1, TC=100 each) on LA 1, LA 2, LA 3, connected through buffers tmp1 and tmp2; annotations: "tmp1 buffer in use by LA2", "Adjacent tasks use different buffers"]
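The effect of buffer count on task overlap can be illustrated with a small simulation. This is a sketch under stated assumptions, not the tool's scheduler: it models one producer LA and one consumer LA sharing `n_bufs` copies of an intermediate array, where task t uses buffer t mod `n_bufs`, and the producer may not overwrite a buffer until the consumer has finished the task that last used it. The function name and parameters are hypothetical.

```c
#include <assert.h>

#define MAX_TASKS 64 /* arbitrary limit for this sketch */

/* Cycle at which the consumer finishes the last of n_tasks tasks.
 * prod_cycles / cons_cycles: per-task occupancy of each LA. */
unsigned pipeline_finish(unsigned n_tasks, unsigned n_bufs,
                         unsigned prod_cycles, unsigned cons_cycles)
{
    unsigned prod_done[MAX_TASKS], cons_done[MAX_TASKS];

    assert(n_tasks >= 1 && n_tasks <= MAX_TASKS && n_bufs >= 1);
    for (unsigned t = 0; t < n_tasks; t++) {
        /* Producer waits for its own previous task... */
        unsigned start = (t > 0) ? prod_done[t - 1] : 0;
        /* ...and for the consumer to release the buffer it will reuse. */
        if (t >= n_bufs && cons_done[t - n_bufs] > start)
            start = cons_done[t - n_bufs];
        prod_done[t] = start + prod_cycles;

        /* Consumer waits for the data and for its own previous task. */
        unsigned cstart = prod_done[t];
        if (t > 0 && cons_done[t - 1] > cstart)
            cstart = cons_done[t - 1];
        cons_done[t] = cstart + cons_cycles;
    }
    return cons_done[n_tasks - 1];
}
```

For four tasks with 100 cycles per stage, one buffer serializes the two LAs (800 cycles in this model), while two buffers let adjacent tasks overlap (500 cycles), at the cost of a second SRAM buffer.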

Case Study: “Simple” benchmark
[Figure: loop graph (TC=256) and alternative mappings of its loops onto LA 1-LA 4, annotated with 512, 1536, 1792, and 2048 cycles]

Prescribed Throughput Accelerators
• Traditional behavioral synthesis
  • Directly translate C operators into gates
• Our approach: Application-centric Architectures
  • Achieve fixed throughput
  • Maximize hardware sharing
[Figure labels: Application, Operation graph, Datapath, Architecture]

Loop Accelerator Template
• Hardware realization of modulo scheduled loop
• Parameterized execution resources, storage, connectivity

Loop Accelerator Design Flow
[Figure: design flow from .c (C code, performance (throughput)) to .v (loop accelerator Verilog, control signals); stages labeled FU Alloc, Modulo Schedule, Build Datapath, Instantiate Arch, Synthesize; intermediate results labeled Abstract Arch, Scheduled Ops, Concrete Arch]
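The link between a prescribed throughput and FU allocation can be illustrated with the standard resource bound from modulo scheduling. This is a textbook lower bound, not necessarily the allocation used in the flow above: it assumes fully pipelined FUs that each start one operation per cycle, and it ignores recurrence constraints. The function names are hypothetical.

```c
#include <assert.h>

/* Lower bound on the number of FUs of one type needed to issue
 * n_ops operations of that type every ii cycles: ceil(n_ops / ii). */
unsigned min_fus(unsigned n_ops, unsigned ii)
{
    assert(ii > 0);
    return (n_ops + ii - 1) / ii;
}

/* The converse bound: the smallest II that n_fus units of one type
 * can sustain for n_ops operations: ceil(n_ops / n_fus). */
unsigned res_mii(unsigned n_ops, unsigned n_fus)
{
    assert(n_fus > 0);
    return (n_ops + n_fus - 1) / n_fus;
}
```

For example, a loop body with 10 additions and a target II of 4 needs at least 3 adders under these assumptions; with 3 adders the loop cannot be scheduled at an II below 4.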

Multifunction Accelerator
• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  • Disjoint stages (loops 2, 3)
  • Pipeline slack (loops 4, 5)
[Figure: application flow (Loop 1, "Frame Type?" branch to Loop 2 or Loop 3, Loop 4, Block 5, …) mapped to an accelerator pipeline of loop accelerators LA1-LA5, and to an accelerator pipeline that uses multifunction loop accelerators]

Datapath Union
[Figure: Loop 1 and Loop 2 each pass through a Cost Sensitive Modulo Scheduler to produce a datapath of FUs; the two datapaths are combined by Union]
• 43% average savings over sum of accelerators
• Smart union within 3% of joint scheduling solution
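Why a union costs less than the sum of separate accelerators can be shown with a deliberately simple model. This is not the "smart union" algorithm from the slide: it assumes a datapath is just a count of FUs per type, that the loops never run at the same time, and that the union therefore needs only the per-type maximum. FU types, unit costs, and function names are hypothetical; bit widths, registers, and interconnect are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical FU types for this sketch. */
enum { FU_ADD, FU_MUL, FU_MEM, FU_KINDS };

typedef struct {
    unsigned count[FU_KINDS]; /* number of FUs of each type */
} Datapath;

/* Naive union: loops run at different times, so each FU type needs
 * only as many units as the most demanding loop. */
Datapath datapath_union(const Datapath *dps, size_t n)
{
    Datapath u = {{0}};

    assert(dps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < FU_KINDS; k++)
            if (dps[i].count[k] > u.count[k])
                u.count[k] = dps[i].count[k];
    return u;
}

/* Cost of a datapath given an assumed cost per FU type. */
unsigned datapath_cost(const Datapath *dp, const unsigned unit_cost[FU_KINDS])
{
    unsigned total = 0;

    for (int k = 0; k < FU_KINDS; k++)
        total += dp->count[k] * unit_cost[k];
    return total;
}
```

With made-up datapaths {4 add, 1 mul, 2 mem} and {2 add, 3 mul, 2 mem} and unit costs {1, 4, 2}, the separate accelerators cost 12 + 18 = 30 while the union costs 20. These numbers are illustrative only and unrelated to the 43% figure reported on the slide.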

Challenges: Throughput Enabling Transformations
• Algorithm-level pipeline retiming
• Splitting loops based on tiling
• Co-scheduling adjacent loops
[Figure: loop graph before (Loop 1, Loop 2, Loop 3, Loop 4) and after transformation (Loop 1, Loop 2a, Loop 2b, Loop 3,4), with the critical loop marked in each]

Challenges: Programmable Loop Accelerator
• Support bug fixes, evolving standards
• Accelerate loops not known at design time
• Minimize additional control overhead
[Figure: loop accelerator with FUs, MEM, Local Mem, and Interconnect, driven by a Control block with II entries that produces the control signals]

Challenges: Timing Aware Synthesis
• Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance
• Strategies to eliminate long wires
  • Preemptive: predict & prevent long wires
  • Reactive: use feedback from floorplanner
    - Insert flip flop on long path
    - Reschedule with added latency
[Figure: FU1, FU2, FU3]

Challenges: Adaptable Voltage/Frequency Levels
• Allow voltage scaling beyond margins
• Using shadow latches in loop accelerator
• Localized error detection
• Control is predefined: simple error recovery
[Figure: flip-flop (D, Q, CLK) paired with a delayed shadow latch that raises an error signal; FUs with shadow latches and extra queue entries]

For More Information
• Visit http://cccp.eecs.umich.edu
