
Compiler-directed Synthesis of Programmable Loop Accelerators

Kevin Fan, Hyunchul Park, Scott Mahlke. September 25, 2004, EDCEP Workshop.


Presentation Transcript


  1. Compiler-directed Synthesis of Programmable Loop Accelerators Kevin Fan, Hyunchul Park, Scott Mahlke September 25, 2004 EDCEP Workshop

  2. Loop Accelerators • Hardware implementation of a critical loop nest • Hardwired state machine • Digital camera application: 1000x vs. Pentium III • Multiple accelerators hooked up in a pipeline • Loop accelerator vs. customized processor • 1 block of code vs. multiple blocks • Trivial control flow vs. handling generic branches • Traditionally state machine vs. instruction driven

  3. Programmable Loop Accelerators • Goals • Multifunction accelerators – Accelerator hardware can handle multiple loops (re-use) • Post-programmable – To a degree, allow changes to the application • Use compiler as architecture synthesis tool • But … • Don’t build a customized processor • Maintain ASIC-level efficiency

  4. NPA (Nonprogrammable Accelerator) Synthesis in PICO

  5. PICO Frontend • Goals • Exploit loop-level parallelism • Map loop to abstract hardware • Manage global memory BW • Steps • Tiling • Load/store elimination • Iteration mapping • Iteration scheduling • Virtual processor clustering
     Input loop nest:
       for i = 1 to ni
         for j = 1 to nj
           y[i] += w[j] * x[i+j]
     Transformed loop nest:
       for jt = 1 to 100 step 10
         for t = 0 to 502
           for p = 0 to 1
             (i,j) = function of (t,p)
             if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
             if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
             Y[t][p] += W[t][p] * X[t][p]
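The tiling step can be illustrated with a minimal C sketch. It assumes 0-based indexing and integer data, and tiles only the j loop. The function names `fir_reference` and `fir_tiled` and the `tile` parameter are hypothetical; this is not PICO output and it does not model load/store elimination or iteration mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop nest from the slide, rewritten with 0-based indices:
   y[i] += w[j] * x[i + j]. x must hold at least ni + nj - 1 elements. */
void fir_reference(int *y, const int *w, const int *x, size_t ni, size_t nj)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t j = 0; j < nj; j++)
            y[i] += w[j] * x[i + j];
}

/* Same computation with the j loop split into tiles of `tile` iterations.
   `tile` plays the role of "step 10" on the slide (hypothetical helper). */
void fir_tiled(int *y, const int *w, const int *x,
               size_t ni, size_t nj, size_t tile)
{
    assert(tile > 0);
    for (size_t jt = 0; jt < nj; jt += tile) {
        /* Last tile may be shorter when tile does not divide nj. */
        size_t jend = (jt + tile < nj) ? jt + tile : nj;
        for (size_t i = 0; i < ni; i++)
            for (size_t j = jt; j < jend; j++)
                y[i] += w[j] * x[i + j];
    }
}
```

Both functions accumulate the same products into y[i], only in a different order, so with integer data (and no overflow) they produce identical results. Each tile touches only `tile` entries of w, which is what makes the working set small enough to keep on chip.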

  6. PICO Backend • Resource allocation (II, operation graph) • Synthesize machine description for “fake” fully connected processor with allocated resources

  7. Reduced VLIW Processor after Modulo Scheduling

  8. Data/control-path Synthesis → NPA

  9. PICO Methodology – Why it Works? • Systematic design methodology • 1. Parameterized meta-architecture – all NPAs have same general organization • 2. Performance/throughput is input • 3. Abstract architecture – We know how to build compilers for this • 4. Mapping mechanism – Determine architecture specifics from schedule for abstract architecture

  10. Direct Generalization of PICO? • Programmability would require full interconnect between elements • Back to the meta architecture! • Generalize connectivity to enable post-programmability • But stylize it

  11. Programmable Loop Accelerator – Design Strategy • Compile for partially defined architecture • Build long distance communication into schedule • Limit global communication bandwidth • Proposed meta-architecture • Multi-cluster VLIW • Explicit inter-cluster transfers (varying latency/BW) • Intra-cluster communication is complete • Hardware partially defined – expensive units

  12. Programmable Loop Accelerator Schema [Diagram: a pipeline of tiled or clustered accelerators. Labeled components: DRAM, SRAM, stream units, stream buffer, control unit, II, FU and MEM units with intra-cluster communication, shift registers, inter-cluster register file, accelerator datapath.]

  13. Flow Diagram [Diagram: Assembly code, II → FU Alloc → Partition → Modulo Schedule → Loop Accelerator. Other labels on the diagram: # clusters, # expensive FUs, FUs assigned to clusters, # cheap FUs, intercluster bandwidth, shift register depth, width, porting.]

  14. Sobel Kernel
      for (i = 0; i < N1; i++) {
        for (j = 0; j < N2; j++) {
          int t00, t01, t02, t10, t12, t20, t21, t22;
          int e1, e2, e12, e22, e, tmp;
          /* 3x3 window starting at x[i][j]; the centre pixel is not used */
          t00 = x[i  ][j]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
          t10 = x[i+1][j];                    t12 = x[i+1][j+2];
          t20 = x[i+2][j]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];
          /* e1: top row minus bottom row; e2: left column minus right column */
          e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
          e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));
          /* squared gradient magnitude */
          e12 = e1*e1;
          e22 = e2*e2;
          e = e12 + e22;
          /* threshold to a binary edge map */
          if (e > threshold) tmp = 1; else tmp = 0;
          edge[i][j] = tmp;
        }
      }

  15. FU Allocation • Determine number of clusters • Determine number of expensive FUs • MPY, DIV, memory • Example: Sobel with II=4 • 41 ops → 3 clusters • 2 MPY ops → 1 multiplier • 9 memory ops → 3 memory units
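The Sobel numbers on this slide are consistent with a simple ceiling-division rule, sketched below in C. The rule itself is an assumption inferred from the figures in the deck (4 FUs per cluster, at most 4 × II operations per cluster, one fully pipelined unit issuing one operation per cycle); the deck does not state the formula, and the function names are hypothetical.

```c
#include <assert.h>

/* Ceiling division for a non-negative numerator and positive divisor. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Assumed rule: a cluster with fus_per_cluster FUs can issue at most
   fus_per_cluster * ii operations per loop iteration. */
int clusters_needed(int num_ops, int ii, int fus_per_cluster)
{
    return ceil_div(num_ops, fus_per_cluster * ii);
}

/* Assumed rule: one fully pipelined expensive unit (MPY, DIV, memory)
   can start at most ii operations of its kind per loop iteration. */
int units_needed(int ops_of_kind, int ii)
{
    return ceil_div(ops_of_kind, ii);
}
```

For Sobel at II=4, `clusters_needed(41, 4, 4)` gives 3, `units_needed(2, 4)` gives 1 multiplier, and `units_needed(9, 4)` gives 3 memory units, matching the slide.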

  16. Partitioning • Multi-level approach consists of two phases • Coarsening • Refinement • Minimize inter-cluster communication • Load balance • Max of 4 × II operations per cluster • Take FU allocation into account • Restricted # of expensive units • # of cheap units (ADD, logic) determined from partition

  17. Coarsening • Group highly related operations together • Pair operations together at each step • Forces partitioner to consider several operations as a single unit • Coarsening Sobel subgraph into 2 groups: [Figure: Sobel subgraph of load (L) and add (+) operations, merged pairwise into 2 groups]
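One pairing step can be sketched in C as a greedy heaviest-edge matching. This particular matching rule is an assumption borrowed from common multilevel graph partitioners; the deck only says that operations are paired at each step. `MAX_OPS`, `coarsen_once`, and the weight-matrix representation are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 64

/* One coarsening step: pair each unmatched op with the unmatched
   neighbour it shares the heaviest edge with; ops with no free
   neighbour form a group on their own.
   w[i][j] is a symmetric edge weight, 0 meaning "no edge".
   On return group[i] holds the group id of op i.
   Returns the number of groups formed. */
int coarsen_once(int n, int w[MAX_OPS][MAX_OPS], int group[MAX_OPS])
{
    assert(n >= 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++)
        group[i] = -1;

    int ngroups = 0;
    for (int i = 0; i < n; i++) {
        if (group[i] != -1)
            continue;               /* already paired in this step */
        int best = -1;
        for (int j = 0; j < n; j++) {
            if (j != i && group[j] == -1 && w[i][j] > 0 &&
                (best == -1 || w[i][j] > w[i][best]))
                best = j;
        }
        group[i] = ngroups;
        if (best != -1)
            group[best] = ngroups;
        ngroups++;
    }
    return ngroups;
}
```

Repeating the step on the merged graph (summing the weights of edges between groups) shrinks the graph further, until the number of groups is small enough to assign directly to clusters.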

  18. Refinement • Move operations between clusters • Good moves: • Reduce inter-cluster communication • Improve load balance • Reduce hardware cost • Reduce number of expensive units to meet limit • Collect similar bitwidth operations together [Figure: load (L) and add (+) operations near a cluster boundary, with one candidate move marked "?"]
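A minimal C sketch of evaluating one refinement move follows. It scores a move only by the number of cut edges and rejects it if the target cluster would exceed its operation limit; hardware cost, expensive-unit limits, and bitwidth grouping from the list above are left out. The `Edge` type and both function names are hypothetical.

```c
#include <assert.h>

typedef struct { int src, dst; } Edge;

/* Number of dataflow edges whose endpoints sit in different clusters,
   used here as a proxy for inter-cluster communication. */
int cut_size(const Edge *edges, int num_edges, const int *cluster_of)
{
    int cut = 0;
    for (int e = 0; e < num_edges; e++)
        if (cluster_of[edges[e].src] != cluster_of[edges[e].dst])
            cut++;
    return cut;
}

/* Tentatively move `op` to cluster `to`. The move is kept only if the
   target cluster stays within max_ops operations (4 * II in the deck)
   and the cut strictly shrinks. Returns 1 if kept, 0 if rejected. */
int try_move(const Edge *edges, int num_edges, int *cluster_of,
             int num_ops, int op, int to, int max_ops)
{
    assert(op >= 0 && op < num_ops);
    int from = cluster_of[op];
    if (from == to)
        return 0;

    int load = 0;                      /* current size of target cluster */
    for (int i = 0; i < num_ops; i++)
        if (cluster_of[i] == to)
            load++;
    if (load + 1 > max_ops)
        return 0;

    int before = cut_size(edges, num_edges, cluster_of);
    cluster_of[op] = to;
    if (cut_size(edges, num_edges, cluster_of) < before)
        return 1;
    cluster_of[op] = from;             /* undo: no improvement */
    return 0;
}
```

A full refinement pass would try such moves for every operation on a cluster boundary and could also accept moves that keep the cut unchanged but improve load balance.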

  19. Partitioning Example • From sobel, II=4 • Place MPYs together • Place each tree of ADD-LOAD-ADDs together • Cuts 6 edges

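  20. Modulo Scheduling • Determines shift register width, depth, and number of read ports • Sobel II=4 [Table: schedule fragment with columns FU0 to FU3 and rows for cycles 0 to 3. Operations per cycle: cycle 0: ADD; cycle 1: LD; cycle 2: ADD; cycle 3: LD, ADD, ADD.]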
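The resource constraint behind a modulo schedule can be sketched in C with a modulo reservation table: an operation issued at cycle t on a given FU occupies slot t mod II of that FU, so two operations collide if they use the same FU at cycles that are congruent modulo II. This is the standard textbook formulation, not code from the deck; the `Mrt` type, its size limits, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_II  16
#define MAX_FUS 16

/* Modulo reservation table: busy[slot][fu] is 1 if FU `fu` is already
   taken in cycle `slot` of the steady-state kernel. */
typedef struct {
    int ii;
    int num_fus;
    unsigned char busy[MAX_II][MAX_FUS];
} Mrt;

void mrt_init(Mrt *m, int ii, int num_fus)
{
    assert(ii > 0 && ii <= MAX_II);
    assert(num_fus > 0 && num_fus <= MAX_FUS);
    m->ii = ii;
    m->num_fus = num_fus;
    memset(m->busy, 0, sizeof m->busy);
}

/* Try to place an operation on FU `fu` at schedule time `cycle`.
   Returns 1 and marks the slot on success, 0 on a resource conflict. */
int mrt_reserve(Mrt *m, int cycle, int fu)
{
    assert(cycle >= 0 && fu >= 0 && fu < m->num_fus);
    int slot = cycle % m->ii;
    if (m->busy[slot][fu])
        return 0;
    m->busy[slot][fu] = 1;
    return 1;
}
```

With II=4, an operation placed on FU0 at cycle 0 blocks FU0 at cycles 4, 8, and so on, but leaves FU0 free at cycle 5 and FU1 free at cycle 4.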

  21. Test Cases • Sobel and fsed kernels, II=4 designs • Each machine has 4 clusters with 4 FUs per cluster [Figure: per-cluster FU mix of the sobel machine and the fsed machine; unit labels shown are M, B, + -, + &, <<, &, and *]

  22. Cross Compile Results • Computation is localized • sobel: 1.5 moves/cycle • fsed: 1 move/cycle • Cross compile • Can still achieve II=4 • More inter-cluster communication • May require more units • sobel on fsed machine: ~2 moves/cycle • fsed on sobel machine: ~3 moves/cycle

  23. Concluding Remarks • Programmable loop accelerator design strategy • Meta-architecture with stylized interconnect • Systematic compiler-directed design flow • Costs of programmability: • Interconnect, inter-cluster communication • Control – “micro-instructions” are necessary • Just scratching the surface of this work • For more, see the CCCP group webpage • http://cccp.eecs.umich.edu
