Compiler-directed Synthesis of Multifunction Loop Accelerators

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan.

Presentation Transcript


  1. Compiler-directed Synthesis of Multifunction Loop Accelerators. Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan.

  2. Accelerating Streaming Applications
     • Streaming applications: discrete transformations operating on a data stream
     • High performance
     • Map application to a pipeline of accelerators
     • Multifunction accelerators reuse hardware
     • Improve hardware efficiency
     [Figure: an application (Loop 1, Frame Type?, Loop 2, Loop 3, Loop 4, Block 5, …) mapped to an accelerator pipeline (Loop Accelerator LA1, Multifunction Loop Accelerators LA2 and LA3, …) with DRAM.]

  3. Loop Accelerator Schema
     • Hardwired state machine for one or more critical loops
     • Order-of-magnitude power and performance improvements over more general designs

  4. Single Function Accelerator Design
     • Use the compiler as an architecture synthesis tool
     • Parameterized meta-architecture: all loop accelerators have the same general organization
     • Performance/throughput is an input
     • Compiler analysis to understand computation and communication requirements
     • Hardware-sensitive optimization to reduce cost

  5. Flow Diagram
     [Figure: flow from "Application Loop, Desired II" to "Loop Accelerator" through the stages Allocate FUs, Modulo Schedule, Build Datapath, Instantiate Arch, and Synthesize, with the intermediate products Abstract Arch, Scheduled Ops, Concrete Arch, and "Verilog, Control Signals".]

  6. FU Allocation
     • Given the operations in a loop and the cost of the hardware cells implementing those operations
     • Minimize total FU cost while supporting all operations
     [Figure: example with II = 2 and a loop containing 3 ADD, 1 SUB, and 2 LOAD operations; the allocated FUs are +, +, -, and MEM.]
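The slide's example can be checked with a minimal sketch, assuming every FU is fully pipelined and dedicated to one operation class, so that one FU can serve at most II operations of that class per iteration. The names `min_fu_counts`, `allocation_cost`, and the cell costs are hypothetical and do not come from the presentation, which minimizes cost over a library of hardware cells rather than using this simple bound.

```python
from math import ceil


def min_fu_counts(op_counts, ii):
    """Minimum FUs per operation class for a given initiation interval.

    Assumption: one FU issues one operation per cycle, so over `ii`
    cycles it can execute at most `ii` operations of the loop body.
    """
    if ii < 1:
        raise ValueError("II must be at least 1")
    return {op: ceil(n / ii) for op, n in op_counts.items()}


def allocation_cost(fu_counts, cell_cost):
    """Total cost of an allocation under a per-class cell cost table."""
    return sum(cell_cost[op] * n for op, n in fu_counts.items())


# Loop from the slide: 3 ADD, 1 SUB, 2 LOAD at II = 2.
counts = min_fu_counts({"ADD": 3, "SUB": 1, "LOAD": 2}, ii=2)

# Hypothetical relative cell costs, for illustration only.
cost = allocation_cost(counts, {"ADD": 1, "SUB": 1, "LOAD": 3})
print(counts, cost)
```

Under these assumptions the result is two adders, one subtractor, and one memory unit, which matches the FUs drawn on the slide.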

  7. Modulo Scheduling and Datapath Derivation
     • Schedule to the abstract architecture (FUs)
     • Determine register and interconnect requirements from the schedule
     [Figure: source code "r1 = Mem[r2]; r3 = r1 + 12", its schedule on FU1 and FU2 (LOAD at time 1, ADD at time 4), and the derived datapath with a MEM unit, an adder, and the constant 12.]
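As a rough illustration of scheduling onto a fixed set of FUs, the sketch below places each operation at the earliest cycle that satisfies its dependences and whose slot (cycle mod II) is free on some FU of the right class. It is a simplified, non-iterative scheduler: it has no backtracking and does not handle loop-carried dependences. The function name, the tuple layout, and the 3-cycle load latency are assumptions made for this example.

```python
def modulo_schedule(ops, ii, fu_count):
    """Greedy modulo scheduling sketch.

    ops: list of (name, fu_class, latency, predecessor_names), listed in
         dependence order (predecessors first).
    fu_count: number of available FUs per class, e.g. {"MEM": 1}.
    Returns {name: (start_cycle, fu_class, fu_index)}.
    """
    # Modulo reservation table: occupied slots (cycle mod ii) per FU.
    mrt = {(c, i): set() for c, n in fu_count.items() for i in range(n)}
    placed, finish = {}, {}
    for name, fu_class, latency, preds in ops:
        earliest = max((finish[p] for p in preds), default=0)
        # Slots repeat every ii cycles, so ii candidates are enough.
        for t in range(earliest, earliest + ii):
            free = [i for i in range(fu_count[fu_class])
                    if t % ii not in mrt[(fu_class, i)]]
            if free:
                mrt[(fu_class, free[0])].add(t % ii)
                placed[name] = (t, fu_class, free[0])
                finish[name] = t + latency
                break
        else:
            raise RuntimeError(f"no free {fu_class} slot for {name} at II={ii}")
    return placed


# r1 = Mem[r2]; r3 = r1 + 12, with an assumed 3-cycle load latency.
loop = [("load", "MEM", 3, []), ("add", "ADD", 1, ["load"])]
sched = modulo_schedule(loop, ii=2, fu_count={"MEM": 1, "ADD": 1})
print(sched)
```

With the assumed latency, the add starts three cycles after the load, the same spacing as in the slide's schedule. The register and interconnect requirements would then be read off the start times and FU bindings in `sched`.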

  8. Multifunction Accelerator
     • Single hardware accelerator to run multiple loops
     • Could place single function accelerators side by side
     • Want to exploit potential hardware sharing between loops: function units, registers, interconnect

  9. Multifunction Design Strategies
     1. Union Method
     2. Phase Ordered Method
     [Figure: FU diagrams illustrating the two methods.]

  10. Union Method
     • Goal: combine FUs and register files to improve hardware sharing
     [Figure: Accel 1 (+, -, M, M) and Accel 2 (+, +, *, M) combined into a multifunction accelerator by positional union and by smart union. The two combined designs shown are (+, +/-, M/*, M) and (+, */-, M/+, M), with storage costs of 15 and 11.]

  11. Union Method
     • Smart union is formulated as an ILP problem that minimizes FU and register cost
     • Benefit: looks at the whole design at once
     • Limitation: schedules are fixed prior to the union phase
     • Fast runtime
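The difference between a positional union and a cost-minimizing union can be sketched as a pairing problem between the FUs of two accelerators. The example below is only an illustration: it replaces the ILP formulation with an exhaustive search over pairings, ignores register cost, assumes both accelerators have the same number of FUs, and uses a hypothetical cost model (`OP_COST`, `fu_cost`) in which a merged FU costs as much as its most expensive operation plus one unit per extra operation kind.

```python
from itertools import permutations

# Hypothetical relative costs per operation kind (not from the slides).
OP_COST = {"ADD": 1, "SUB": 1, "MUL": 5, "MEM": 3}


def fu_cost(ops):
    """Assumed cost of one FU that supports every operation in `ops`."""
    return max(OP_COST[o] for o in ops) + (len(ops) - 1)


def union_cost(accel_a, accel_b, order):
    """Cost when FU k of accel_a is merged with FU order[k] of accel_b."""
    return sum(fu_cost({a, accel_b[j]}) for a, j in zip(accel_a, order))


def positional_union(accel_a, accel_b):
    """Merge FUs by position: first with first, second with second, ..."""
    return tuple(range(len(accel_b)))


def best_union(accel_a, accel_b):
    """Exhaustive search for the cheapest pairing (stand-in for the ILP)."""
    return min(permutations(range(len(accel_b))),
               key=lambda order: union_cost(accel_a, accel_b, order))


a, b = ["ADD", "MEM"], ["MEM", "ADD"]
pos_cost = union_cost(a, b, positional_union(a, b))
opt_cost = union_cost(a, b, best_union(a, b))
print(pos_cost, opt_cost)
```

In this small case the positional pairing forces both FUs to support ADD and MEM, while the searched pairing keeps one pure adder and one pure memory unit. Exhaustive search grows factorially with the FU count, which is one reason a real tool would use a solver instead.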

  12. Cost of Union of Accelerators
     [Chart: benchmark groups Image Processing, MPEG4, and Signal Processing.]
     • Worst union: 25% average savings
     • Positional union: 29% average savings
     • Best union: 33% average savings

  13. Phase Ordered Method
     • Schedule loops in order
     • During scheduling, account for hardware from the previous loop
     • Cost sensitive scheduler attempts to minimize the hardware cost increase
     [Figure: Loop 1 produces Accel 1; Loop 2 is then scheduled to produce Accel 1+2.]

  14. Cost Sensitive Scheduling
     • Different valid scheduling alternatives are not equal
     [Figure: alternative placements of the operations +1, +2, LD1, and LD2 on FU1, FU2, and FU3 over time steps 0 to 2.]

  15. Greedy Cost Sensitive Scheduler
     • Select the scheduling alternative with minimum cost
     • Account for the estimated cost of unscheduled ops
     [Figure: for Loop 1 and then Loop 2, the modulo scheduler sends each alternative (Alt i) to a hardware cost modeler and receives Cost i. The cost modeler draws on Loop 1 hardware, partial hardware for scheduled ops, an estimate for unscheduled ops, and a HW cost library.]
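The selection step can be sketched as follows, assuming the hardware left by earlier loops is described as a list of FUs, each with the set of operations it already supports. This is a deliberately reduced model, not the scheduler from the presentation: it ignores data dependences and the estimate for unscheduled ops, and the cost figures (`OP_COST`, `FU_OVERHEAD`) and function names are hypothetical.

```python
# Hypothetical costs: adding an operation to an FU, and opening a new FU.
OP_COST = {"ADD": 1, "MUL": 5, "LD": 3}
FU_OVERHEAD = 1


def added_cost(fu_ops, op):
    """Extra hardware cost of running `op` on an FU that supports `fu_ops`."""
    return 0 if op in fu_ops else OP_COST[op]


def greedy_schedule(ops, ii, hardware):
    """Bind each op to the (FU, slot) alternative with the lowest added cost.

    ops: operation kinds of the new loop, in scheduling order.
    hardware: list of sets, the operations each existing FU supports.
    Returns (placements, resulting_hardware, total_added_cost).
    """
    hardware = [set(f) for f in hardware]   # do not mutate the caller's data
    busy = [set() for _ in hardware]        # occupied slots (mod ii) per FU
    placements, total = [], 0
    for op in ops:
        alternatives = [(added_cost(hardware[f], op), f, slot)
                        for f in range(len(hardware))
                        for slot in range(ii) if slot not in busy[f]]
        # Opening a brand-new FU is always a valid alternative.
        alternatives.append((OP_COST[op] + FU_OVERHEAD, len(hardware), 0))
        cost, f, slot = min(alternatives)
        if f == len(hardware):
            hardware.append(set())
            busy.append(set())
        hardware[f].add(op)
        busy[f].add(slot)
        placements.append((op, f, slot))
        total += cost
    return placements, hardware, total


# Loop 1 left an adder and a load unit; Loop 2 is scheduled at II = 2.
placements, hw, total = greedy_schedule(
    ["ADD", "ADD", "LD", "MUL"], ii=2, hardware=[{"ADD"}, {"LD"}])
print(placements, hw, total)
```

In this run the two adds and the load reuse existing FUs at no added cost, and only the multiply adds hardware. Because each choice is made locally and never revisited, the result depends on the order in which operations are considered.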

  16. Phase Ordered Method
     • Extend a conventional iterative modulo scheduler with a hardware cost model
     • Benefits: the scheduler is aware of hardware for all previously scheduled loops, and can adjust the schedule to improve cost savings
     • Limitation: the process is localized and greedy; schedules of previous loops are fixed
     • Fast runtime

  17. Cost Sensitive Scheduling Comparison
     [Chart: benchmark groups Image Processing, MPEG4, and Signal Processing.]
     • Greedy scheduling: 41% average savings
     • ILP scheduling: 51% average savings

  18. Union vs. Phase Ordered Methods
     [Chart: benchmark groups Image Processing, MPEG4, and Signal Processing.]
     • Union method: 45% average savings
     • Phase ordered method: 41% average savings

  19. Conclusion
     • Compiler-directed design system
     • Multifunction accelerator for hardware reuse
     • Two multifunction design methods:
       • Smart union of single-function accelerators: 45% average cost savings
       • Phase ordered scheduling: 41% average cost savings
     • Overall, 20% to 61% hardware savings from sharing

  20. Questions?
