Compiler-directed Synthesis of Multifunction Loop Accelerators

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan

Accelerating Streaming Applications
• Streaming applications:
• Discrete transformations operating on a data stream
• High performance
• Map the application to a pipeline of accelerators
• Multifunction accelerators reuse hardware
• Improve hardware efficiency
[Figure: an application (Loop 1, a "Frame Type?" branch, Loop 2, Loop 3, Loop 4, Block 5, …) mapped to an accelerator pipeline (Loop Accelerator LA1, Multifunction Loop Accelerators LA2 and LA3) connected to DRAM.]

Loop Accelerator Schema
• Hardwired state machine for one or more critical loops
• Order-of-magnitude power and performance improvements over more general designs

Single Function Accelerator Design
• Use the compiler as an architecture synthesis tool
• Parameterized meta-architecture: all loop accelerators have the same general organization
• Performance/throughput is an input
• Compiler analysis to understand computation and communication requirements
• Hardware-sensitive optimization to reduce cost

Flow Diagram
[Figure: Application loop and desired II → Allocate FUs → abstract architecture → Modulo Schedule → scheduled ops → Build Datapath → concrete architecture → Instantiate Arch → Verilog and control signals → Synthesize → loop accelerator.]

FU Allocation
• Given the operations in a loop and the cost of the hardware cells implementing those operations
• Minimize total FU cost while supporting all operations
• Example (II = 2): 3 × ADD, 1 × SUB, 2 × LOAD are covered by the FUs +, +, -, MEM
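The FU counts in the example above can be reproduced with a simple resource bound. This is a minimal sketch, not the cost-minimizing allocation the slide describes: it assumes every FU accepts one operation per cycle and that each operation type maps to exactly one FU type. The names `allocate_fus` and `fu_for_op` are hypothetical.

```python
from math import ceil

def allocate_fus(op_counts, ii, fu_for_op):
    """Return the minimum number of FUs of each type for initiation interval ii.

    Assumption: one FU can start one operation per cycle, so it can serve
    at most ii operations of a loop scheduled at that initiation interval.
    """
    demand = {}
    for op, count in op_counts.items():
        fu = fu_for_op[op]
        demand[fu] = demand.get(fu, 0) + count
    return {fu: ceil(n / ii) for fu, n in demand.items()}

# Operation mix from the slide example, II = 2.
ops = {"ADD": 3, "SUB": 1, "LOAD": 2}
fu_for_op = {"ADD": "adder", "SUB": "subtractor", "LOAD": "mem"}  # hypothetical mapping
allocation = allocate_fus(ops, 2, fu_for_op)
print(allocation)
```

With these inputs the bound gives two adders, one subtractor, and one memory unit, matching the "+, +, -, MEM" allocation on the slide. A cost-aware allocator could instead choose multi-function cells when they are cheaper.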

Modulo Scheduling and Datapath Derivation
• Schedule to the abstract architecture (FUs)
• Determine register and interconnect requirements from the schedule
• Source code: r1 = Mem[r2]; r3 = r1 + 12
[Figure: the source code, its schedule placing LOAD and ADD on FU1 and FU2 over time, and the derived datapath with a MEM unit, an adder, and the constant 12.]
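One way to picture how a schedule determines the datapath is a modulo reservation table plus a set of FU-to-FU connections. The sketch below is illustrative only: the schedule times, the II of 2, and the helper names `build_mrt` and `interconnect` are assumptions, and register sizing is left out.

```python
def build_mrt(schedule, ii):
    """Map (fu, time mod ii) to the operation occupying that slot.

    schedule is a list of (op_name, fu_name, issue_time) tuples.
    Two operations on the same FU may not share a slot modulo ii.
    """
    mrt = {}
    for op, fu, time in schedule:
        slot = (fu, time % ii)
        if slot in mrt:
            raise ValueError(f"{op} conflicts with {mrt[slot]} on {fu}")
        mrt[slot] = op
    return mrt

def interconnect(schedule, edges):
    """Return the FU-to-FU wires needed by the dataflow edges (producer, consumer)."""
    fu_of = {op: fu for op, fu, _ in schedule}
    return {(fu_of[src], fu_of[dst]) for src, dst in edges}

# r1 = Mem[r2]; r3 = r1 + 12  (issue times are assumed for illustration)
schedule = [("LOAD", "MEM", 1), ("ADD", "ADDER", 4)]
edges = [("LOAD", "ADD")]  # r1 flows from the load to the add

mrt = build_mrt(schedule, ii=2)
wires = interconnect(schedule, edges)
print(mrt)
print(wires)
```

Here the load result must travel from the memory unit to the adder, so the derived datapath needs a MEM-to-ADDER connection.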

Multifunction Accelerator
• Single hardware accelerator to run multiple loops
• Could place single-function accelerators side by side
• Want to exploit potential hardware sharing between loops:
• Function units
• Registers
• Interconnect

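Multifunction Design Strategies
1. Union Method
2. Phase Ordered Method
[Figure: the union method merges the FUs of separately designed accelerators; the phase ordered method extends one accelerator's FUs while scheduling the next loop.]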

Union Method
• Goal: combine FUs and register files to improve hardware sharing.
• Accel 1 FUs: +, -, M, M
• Accel 2 FUs: +, +, *, M
• Positional union: multifunction accelerator with FUs +, +/-, M/*, M (storage cost: 15)
• Smart union: multifunction accelerator with FUs +, */-, M/+, M (storage cost: 11)

Union Method
• Smart union is formulated as an ILP problem that minimizes FU and register cost
• Benefit: looks at the whole design at once
• Limitation: schedules are fixed prior to the union phase
• Fast runtime
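The pairing choice behind the smart union can be sketched as an assignment problem. The slides formulate it as an ILP; the sketch below swaps in an exhaustive search over pairings, which is only practical for a handful of FUs. The cost table, the register counts, and the function names are all hypothetical, and both accelerators are assumed to have the same number of FUs.

```python
from itertools import permutations

OP_COST = {"+": 1, "-": 1, "*": 4, "M": 3}  # hypothetical FU costs

def merged_cost(a, b):
    """Cost of one merged FU: shared logic if the ops match, plus the larger register file."""
    (op_a, regs_a), (op_b, regs_b) = a, b
    fu = OP_COST[op_a] if op_a == op_b else OP_COST[op_a] + OP_COST[op_b]
    return fu + max(regs_a, regs_b)

def union_cost(accel1, accel2, order):
    """Total cost when FU i of accel1 is merged with FU order[i] of accel2."""
    return sum(merged_cost(accel1[i], accel2[j]) for i, j in enumerate(order))

def smart_union(accel1, accel2):
    """Exhaustively pick the pairing with minimum total cost (stand-in for the ILP)."""
    best = min(permutations(range(len(accel2))),
               key=lambda order: union_cost(accel1, accel2, order))
    return best, union_cost(accel1, accel2, best)

# Each FU is (operation, registers); register counts are made up for illustration.
accel1 = [("+", 4), ("-", 1), ("M", 2), ("M", 6)]
accel2 = [("+", 1), ("+", 4), ("*", 6), ("M", 2)]

positional = union_cost(accel1, accel2, range(len(accel2)))
best_order, best = smart_union(accel1, accel2)
print(positional, best, best_order)
```

Under these made-up numbers the positional pairing costs 33 and the best pairing costs 26, because the search also aligns register files of similar size rather than only matching FU types.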

Cost of Union of Accelerators
[Chart: cost of accelerator unions for Image Processing, MPEG4, and Signal Processing benchmarks.]
• Worst union: 25% average savings
• Positional union: 29% average savings
• Best union: 33% average savings

Phase Ordered Method
• Schedule loops in order
• During scheduling, account for hardware from the previous loop
• Cost-sensitive scheduler attempts to minimize the hardware cost increase
[Figure: Loop 1 produces Accel 1; scheduling Loop 2 against it produces Accel 1+2.]

Cost Sensitive Scheduling
• Different valid scheduling alternatives are not equal
[Figure: alternative placements of the operations +1, +2, LD1, and LD2 on FU1, FU2, and FU3 over time.]

Greedy Cost Sensitive Scheduler
• Select the scheduling alternative with minimum cost
• Account for the estimated cost of unscheduled ops
[Figure: the modulo scheduler passes each alternative (Alt i) for the operations of Loop 1 and Loop 2 to a hardware cost modeler, which combines the Loop 1 hardware, the partial hardware for scheduled ops, an estimate for unscheduled ops, and a hardware cost library to return Cost i.]
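The selection step can be sketched as a loop that scores every free (FU, slot) alternative by the hardware it would add and keeps the cheapest. This is a simplified illustration, not the scheduler from the talk: it ignores dependences, latencies, and the estimate for unscheduled ops, and the cost table and names are hypothetical.

```python
OP_COST = {"+": 1, "*": 4, "LD": 3}  # hypothetical cost of adding an op to an FU

def greedy_schedule(ops, existing_fus, ii):
    """Place each op on the (FU, slot) alternative that adds the least hardware.

    existing_fus lists the op types each FU already supports (from earlier loops).
    Returns the placements as (op, fu_index, slot) and the total added cost.
    """
    fus = [set(f) for f in existing_fus]
    busy = set()       # occupied (fu_index, slot) pairs for the current loop
    placements = []
    added = 0
    for op in ops:
        alternatives = []
        for f in range(len(fus) + 1):        # index len(fus) means "build a new FU"
            for slot in range(ii):
                if (f, slot) in busy:
                    continue
                reuse = f < len(fus) and op in fus[f]
                alternatives.append((0 if reuse else OP_COST[op], f, slot))
        cost, f, slot = min(alternatives)    # ties go to the lowest FU index and slot
        if f == len(fus):
            fus.append(set())
        fus[f].add(op)
        busy.add((f, slot))
        placements.append((op, f, slot))
        added += cost
    return placements, added

# Hardware left by Loop 1: one adder and one load unit (assumed), II = 2.
placements, added = greedy_schedule(["+", "+", "+", "LD"], [{"+"}, {"LD"}], ii=2)
print(placements, added)
```

In this run the first two additions reuse the existing adder for free, the third has to pay for new adder capability, and the load reuses the existing load unit. Because each choice is made locally, an earlier cheap placement can force a more expensive one later.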

Phase Ordered Method
• Extend a conventional iterative modulo scheduler with a hardware cost model
• Benefits:
• Scheduler is aware of hardware for all previously scheduled loops
• Can adjust the schedule to improve cost savings
• Limitation: the process is localized and greedy; schedules of previous loops are fixed
• Fast runtime

Cost Sensitive Scheduling Comparison
[Chart: cost savings for Image Processing, MPEG4, and Signal Processing benchmarks.]
• Greedy scheduling: 41% average savings
• ILP scheduling: 51% average savings

Union vs. Phase Ordered Methods
[Chart: cost savings for Image Processing, MPEG4, and Signal Processing benchmarks.]
• Union method: 45% average savings
• Phase ordered method: 41% average savings

Conclusion
• Compiler-directed design system
• Multifunction accelerator for hardware reuse
• Two multifunction design methods:
• Smart union of single-function accelerators: 45% average cost savings
• Phase ordered scheduling: 41% average cost savings
• Overall, 20% to 61% hardware savings from sharing

Questions?
