Compiler-directed Synthesis of Programmable Loop Accelerators. Kevin Fan, Hyunchul Park, Scott Mahlke. September 25, 2004, EDCEP Workshop.
Loop Accelerators
• Hardware implementation of a critical loop nest
  • Hardwired state machine
  • Digital camera application – 1000x vs. Pentium III
  • Multiple accelerators hooked up in a pipeline
• Loop accelerator vs. customized processor
  • 1 block of code vs. multiple blocks
  • Trivial control flow vs. handling generic branches
  • Traditionally state machine vs. instruction driven
Programmable Loop Accelerators
• Goals
  • Multifunction accelerators – accelerator hardware can handle multiple loops (re-use)
  • Post-programmable – to a degree, allow changes to the application
  • Use compiler as architecture synthesis tool
• But …
  • Don’t build a customized processor
  • Maintain ASIC-level efficiency
PICO Frontend
• Goals
  • Exploit loop-level parallelism
  • Map loop to abstract hardware
  • Manage global memory BW
• Steps
  • Tiling
  • Load/store elimination
  • Iteration mapping
  • Iteration scheduling
  • Virtual processor clustering

Example loop nest:

    for i = 1 to ni
      for j = 1 to nj
        y[i] += w[j] * x[i+j]

Transformed loop nest:

    for jt = 1 to 100 step 10
      for t = 0 to 502
        for p = 0 to 1
          (i,j) = function of (t,p)
          if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
          if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
          Y[t][p] += W[t][p] * X[t][p]
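As a rough illustration of the tiling step only, the sketch below splits the `j` loop of the example nest into tiles and keeps the arithmetic unchanged. It assumes 0-based arrays and an arbitrary tile size; the names `conv_ref` and `conv_tiled` are hypothetical and do not come from PICO. Load/store elimination, iteration mapping, and scheduling are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop nest: y[i] += w[j] * x[i + j], using 0-based indices.
 * x must hold at least ni + nj - 1 elements. */
void conv_ref(int *y, const int *w, const int *x, size_t ni, size_t nj)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t j = 0; j < nj; j++)
            y[i] += w[j] * x[i + j];
}

/* Same computation with the j loop split into tiles of `tile` iterations.
 * The outer jt loop walks over tiles; the inner loops process one tile. */
void conv_tiled(int *y, const int *w, const int *x,
                size_t ni, size_t nj, size_t tile)
{
    assert(tile > 0);
    for (size_t jt = 0; jt < nj; jt += tile) {
        /* Clamp the last tile when nj is not a multiple of tile. */
        size_t jend = (jt + tile < nj) ? jt + tile : nj;
        for (size_t i = 0; i < ni; i++)
            for (size_t j = jt; j < jend; j++)
                y[i] += w[j] * x[i + j];
    }
}
```

Both functions accumulate into `y`, so callers should zero the output array first. Each tile touches only `tile` coefficients of `w`, which is the property a frontend can exploit to bound on-chip storage.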
PICO Backend
• Resource allocation (II, operation graph)
• Synthesize machine description for “fake” fully connected processor with allocated resources
PICO Methodology – Why Does It Work?
• Systematic design methodology
  1. Parameterized meta-architecture – all NPAs have the same general organization
  2. Performance/throughput is an input
  3. Abstract architecture – we know how to build compilers for this
  4. Mapping mechanism – determine architecture specifics from the schedule for the abstract architecture
Direct Generalization of PICO?
• Programmability would require full interconnect between elements
• Back to the meta-architecture!
  • Generalize connectivity to enable post-programmability
  • But stylize it
Programmable Loop Accelerator – Design Strategy
• Compile for partially defined architecture
  • Build long distance communication into schedule
  • Limit global communication bandwidth
• Proposed meta-architecture
  • Multi-cluster VLIW
  • Explicit inter-cluster transfers (varying latency/BW)
  • Intra-cluster communication is complete
  • Hardware partially defined – expensive units
Programmable Loop Accelerator Schema
[Figure: accelerator datapath. Labels: DRAM, SRAM, stream units, stream buffer, control unit, shift register, II, FUs, MEM, intra-cluster communication, inter-cluster register file. Caption: pipeline of tiled or clustered accelerators.]
Flow Diagram
[Figure: design flow. Input: assembly code, II. Stages: FU Alloc, Partition, Modulo Schedule, producing the Loop Accelerator. Labeled outputs: # clusters, # expensive FUs, # cheap FUs, FUs assigned to clusters, shift register depth, width, porting, intercluster bandwidth.]
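Sobel Kernel

    /* x is the input image with a 2-pixel margin: (N1+2) x (N2+2).
       edge is the N1 x N2 output; threshold is set by the caller. */
    for (i = 0; i < N1; i++) {
      for (j = 0; j < N2; j++) {
        int t00, t01, t02, t10, t12, t20, t21, t22;
        int e1, e2, e12, e22, e, tmp;

        /* 3x3 neighborhood; the center pixel is not used */
        t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
        t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
        t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];

        /* gradient across rows (e1) and across columns (e2) */
        e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
        e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));

        /* squared gradient magnitude, compared against the threshold */
        e12 = e1*e1;
        e22 = e2*e2;
        e = e12 + e22;
        if (e > threshold) tmp = 1; else tmp = 0;
        edge[i][j] = tmp;
      }
    }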
FU Allocation
• Determine number of clusters
• Determine number of expensive FUs
  • MPY, DIV, memory
• Sobel with II=4
  • 41 ops → 3 clusters
  • 2 MPY ops → 1 multiplier
  • 9 memory ops → 3 memory units
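The Sobel numbers on this slide are consistent with a simple ceiling-division rule, sketched below. This is an assumption, not a formula stated in the slides: each FU is taken to serve one operation per cycle (so II operations per loop iteration), and each cluster is taken to hold 4 FUs, as in the test machines later in the deck. The function names are hypothetical.

```c
#include <assert.h>

/* Ceiling division for a non-negative numerator and positive divisor. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Assumed rule: one FU issues one operation per cycle, so over an
 * initiation interval of `ii` cycles it can serve `ii` operations. */
int units_needed(int num_ops, int ii)
{
    return ceil_div(num_ops, ii);
}

/* Assumed rule: a cluster with `fus_per_cluster` FUs offers
 * fus_per_cluster * ii operation slots per loop iteration. */
int clusters_needed(int num_ops, int ii, int fus_per_cluster)
{
    return ceil_div(num_ops, fus_per_cluster * ii);
}
```

With II=4 and 4 FUs per cluster, `clusters_needed(41, 4, 4)` gives 3, `units_needed(2, 4)` gives 1 multiplier, and `units_needed(9, 4)` gives 3 memory units, matching the slide.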
Partitioning
• Multi-level approach consists of two phases
  • Coarsening
  • Refinement
• Minimize inter-cluster communication
• Load balance
  • Max of 4 × II operations per cluster
• Take FU allocation into account
  • Restricted # of expensive units
  • # of cheap units (ADD, logic) determined from partition
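The two partitioning objectives can be expressed as simple checks on a candidate assignment of operations to clusters. The sketch below is illustrative only: it assumes the load limit is FUs per cluster times II, treats every dataflow edge as one unit of communication, and uses hypothetical names (`struct edge`, `count_cut_edges`, `partition_fits`).

```c
#include <assert.h>
#include <stddef.h>

/* One dataflow edge between two operations (indices into the op list). */
struct edge { int src; int dst; };

/* Count edges whose endpoints sit in different clusters. Each such edge
 * needs an inter-cluster transfer. cluster_of[op] is the cluster index. */
int count_cut_edges(const struct edge *edges, size_t num_edges,
                    const int *cluster_of)
{
    int cut = 0;
    for (size_t e = 0; e < num_edges; e++)
        if (cluster_of[edges[e].src] != cluster_of[edges[e].dst])
            cut++;
    return cut;
}

/* Load-balance check: no cluster may hold more than
 * fus_per_cluster * ii operations. Returns 1 if the partition fits. */
int partition_fits(const int *cluster_of, size_t num_ops,
                   int num_clusters, int fus_per_cluster, int ii)
{
    assert(num_clusters > 0 && fus_per_cluster > 0 && ii > 0);
    for (int c = 0; c < num_clusters; c++) {
        int load = 0;
        for (size_t op = 0; op < num_ops; op++)
            if (cluster_of[op] == c)
                load++;
        if (load > fus_per_cluster * ii)
            return 0;
    }
    return 1;
}
```

A partitioner would look for an assignment that keeps `partition_fits` true while driving `count_cut_edges` down. Limits on expensive units and bitwidth grouping are left out of this sketch.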
Coarsening
• Group highly related operations together
• Pair operations together at each step
• Forces partitioner to consider several operations as a single unit
• Coarsening Sobel subgraph into 2 groups:
[Figure: Sobel subgraph of load (L) and add (+) operations being merged step by step]
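One coarsening step can be sketched as a greedy pairing over weighted edges. The slides do not say how "highly related" is measured, so this example assumes an integer edge weight and uses heavy-edge matching, a common choice in multilevel partitioners. The names `struct wedge` and `coarsen_step` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Undirected edge between operations a and b with an affinity weight. */
struct wedge { int a; int b; int weight; };

/* One coarsening step: repeatedly take the heaviest edge whose two
 * endpoints are both still unpaired and merge them into one group.
 * group_of[op] receives the group id; leftover operations become
 * singleton groups. Returns the number of groups produced. */
int coarsen_step(const struct wedge *edges, size_t num_edges,
                 int num_ops, int *group_of)
{
    int next_group = 0;
    assert(num_ops >= 0);
    for (int op = 0; op < num_ops; op++)
        group_of[op] = -1;              /* -1 marks "not yet grouped" */

    for (;;) {
        int best = -1;
        for (size_t e = 0; e < num_edges; e++) {
            if (edges[e].a == edges[e].b)
                continue;               /* ignore self-loops */
            if (group_of[edges[e].a] != -1 || group_of[edges[e].b] != -1)
                continue;               /* an endpoint is already paired */
            if (best < 0 || edges[e].weight > edges[best].weight)
                best = (int)e;
        }
        if (best < 0)
            break;                      /* no pairable edge remains */
        group_of[edges[best].a] = next_group;
        group_of[edges[best].b] = next_group;
        next_group++;
    }

    for (int op = 0; op < num_ops; op++)
        if (group_of[op] == -1)
            group_of[op] = next_group++;
    return next_group;
}
```

Applying the step again to the graph of groups would shrink it further, which is how repeated pairing can reduce a subgraph to a small number of groups.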
Refinement
• Move operations between clusters
• Good moves:
  • Reduce inter-cluster communication
  • Improve load balance
  • Reduce hardware cost
    • Reduce number of expensive units to meet limit
    • Collect similar bitwidth operations together
[Figure: candidate move of a load (L) or add (+) operation between clusters]
Partitioning Example
• From sobel, II=4
• Place MPYs together
• Place each tree of ADD-LOAD-ADDs together
• Cuts 6 edges
Modulo Scheduling
• Determines shift register width, depth, and number of read ports
• Sobel II=4
[Figure: modulo schedule table; columns FU0 to FU3, rows cycle 0 to 3, entries ADD and LD]
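The schedule table has II rows because an operation issued at cycle t reuses its FU every II cycles, so it occupies row t mod II. The sketch below shows that bookkeeping as a modulo reservation table, the usual data structure in modulo schedulers. It is not taken from the slides; the type and function names and the `MAX_II` and `MAX_FUS` limits are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_II  16
#define MAX_FUS 16

/* Modulo reservation table: busy[row][fu] is 1 when FU `fu` is taken
 * in modulo slot `row`, where row = issue cycle mod II. */
struct mrt {
    int ii;
    int num_fus;
    int busy[MAX_II][MAX_FUS];
};

void mrt_init(struct mrt *m, int ii, int num_fus)
{
    assert(ii > 0 && ii <= MAX_II);
    assert(num_fus > 0 && num_fus <= MAX_FUS);
    m->ii = ii;
    m->num_fus = num_fus;
    memset(m->busy, 0, sizeof m->busy);
}

/* Try to reserve FU `fu` for an operation issued at absolute cycle `time`.
 * Returns 1 on success, 0 if that modulo slot is already taken. */
int mrt_reserve(struct mrt *m, int fu, int time)
{
    int row;
    assert(fu >= 0 && fu < m->num_fus && time >= 0);
    row = time % m->ii;
    if (m->busy[row][fu])
        return 0;
    m->busy[row][fu] = 1;
    return 1;
}
```

With II=4, an operation placed on FU0 at cycle 0 blocks FU0 at cycles 4, 8, and so on, while cycle 5 on the same FU remains free.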
Test Cases
• Sobel and fsed kernels, II=4 designs
• Each machine has 4 clusters with 4 FUs per cluster
[Figure: FU mix per cluster for the sobel machine and the fsed machine; FU labels: M, B, + -, <<, &, *]
Cross Compile Results
• Computation is localized
  • sobel: 1.5 moves/cycle
  • fsed: 1 move/cycle
• Cross compile
  • Can still achieve II=4
  • More inter-cluster communication
  • May require more units
  • sobel on fsed machine: ~2 moves/cycle
  • fsed on sobel machine: ~3 moves/cycle
Concluding Remarks
• Programmable loop accelerator design strategy
  • Meta-architecture with stylized interconnect
  • Systematic compiler-directed design flow
• Costs of programmability:
  • Interconnect, inter-cluster communication
  • Control – “micro-instructions” are necessary
• Just scratching the surface of this work
• For more, see the CCCP group webpage
  • http://cccp.eecs.umich.edu