Compiler-directed Synthesis of Multifunction Loop Accelerators
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
Accelerating Streaming Applications
• Streaming applications:
  • Discrete transformations operating on data stream
  • High performance
• Map application to pipeline of accelerators
• Multifunction accelerators reuse hardware
• Improve hardware efficiency
[Figure: an application (Loop 1, a "Frame Type?" branch selecting Loop 2 or Loop 3, Loop 4, Block 5, …) mapped onto an accelerator pipeline (Loop Accelerator LA1, Multifunction Loop Accelerators LA2 and LA3, …) attached to DRAM]
Loop Accelerator Schema
• Hard-wired state machine for one or more critical loops
• Order-of-magnitude power and performance improvements over more general designs
Single Function Accelerator Design
• Use compiler as architecture synthesis tool
• Parameterized meta-architecture – all loop accelerators have same general organization
• Performance/throughput is input
• Compiler analysis to understand computation and communication requirements
• Hardware-sensitive optimization to reduce cost
Flow Diagram
[Figure: design flow. Application loop and desired II → Allocate FUs → abstract architecture (FUs) → Modulo Schedule → scheduled ops → Build Datapath → concrete architecture (FUs, RF) → Instantiate Arch → Verilog and control signals → Synthesize → loop accelerator]
FU Allocation
• Given operations in a loop and cost of hardware cells implementing those operations
• Minimize total FU cost while supporting all operations
[Figure: example with II = 2 and a loop containing 3 ADD, 1 SUB, and 2 LOAD operations; allocated FUs: +, +, -, MEM]
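The slide's example can be reproduced with a simple resource bound. The sketch below is not the authors' allocator: it assumes every FU is fully pipelined (one new operation per cycle) and considers only single-function FUs, so it ignores cell costs and multi-function cells. The function name `allocate_fus` is hypothetical.

```python
from math import ceil

def allocate_fus(op_counts, ii):
    """Minimum number of single-function FUs per operation class.

    Assumption: each FU can start one operation per cycle, so within an
    initiation interval of `ii` cycles one FU serves at most `ii` operations.
    """
    return {op: ceil(count / ii) for op, count in op_counts.items()}

# Operation mix from the slide's example: II = 2, 3 ADD, 1 SUB, 2 LOAD.
loop_ops = {"ADD": 3, "SUB": 1, "LOAD": 2}
allocation = allocate_fus(loop_ops, ii=2)
print(allocation)  # {'ADD': 2, 'SUB': 1, 'LOAD': 1}
```

Under these assumptions the result is two adders, one subtractor, and one memory unit, matching the FUs shown on the slide. A cost-driven allocator would additionally weigh multi-function cells against separate ones.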
Modulo Scheduling and Datapath Derivation
• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule
[Figure: source code "r1 = Mem[r2]; r3 = r1 + 12" scheduled with LOAD on FU1 (MEM) at time 1 and ADD on FU2 (+) at time 4, and the datapath derived from that schedule]
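As a rough illustration of scheduling onto an abstract architecture, the sketch below places operations using a modulo reservation table: an operation issued at time t occupies its FU in slot t mod II. It is a simplified stand-in, not the scheduler from the talk. The latencies (3 cycles for the load, 1 for the add), the II of 2, the 0-based time axis, and the name `modulo_schedule` are all assumptions.

```python
def modulo_schedule(ops, fus, ii):
    """Place ops on FUs using a modulo reservation table.

    ops: list of (name, fu_type, latency, predecessors) in dependence order.
    fus: dict mapping FU name -> FU type.
    Returns a dict mapping op name -> (start time, FU name).
    """
    mrt = {}      # (FU name, time mod II) -> op occupying that slot
    placed = {}   # op name -> (start time, FU name)
    latency = {name: lat for name, _, lat, _ in ops}
    for name, fu_type, _, preds in ops:
        # An op may start only after all of its predecessors have finished.
        earliest = max((placed[p][0] + latency[p] for p in preds), default=0)
        # Trying II consecutive start times covers every modulo slot once.
        for t in range(earliest, earliest + ii):
            free = [f for f, kind in fus.items()
                    if kind == fu_type and (f, t % ii) not in mrt]
            if free:
                mrt[(free[0], t % ii)] = name
                placed[name] = (t, free[0])
                break
        else:
            raise ValueError(f"no free slot for {name} at II={ii}")
    return placed

# r1 = Mem[r2]; r3 = r1 + 12  (latencies below are hypothetical)
ops = [("load", "MEM", 3, []), ("add", "ADD", 1, ["load"])]
fus = {"FU1": "MEM", "FU2": "ADD"}
schedule = modulo_schedule(ops, fus, ii=2)
print(schedule)  # {'load': (0, 'FU1'), 'add': (3, 'FU2')}
```

The three-cycle gap between the load and the add is what determines how long `r1` must be held, which is the kind of information a datapath derivation step would read off the schedule.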
Multifunction Accelerator
• Single hardware accelerator to run multiple loops
• Could place single function accelerators side by side
• Want to exploit potential hardware sharing between loops
  • Function units
  • Registers
  • Interconnect
Multifunction Design Strategies
1. Union Method
2. Phase Ordered Method
[Figure: FU diagrams illustrating the two methods]
Union Method
• Goal: combine FUs and register files to improve hardware sharing
[Figure: Accel 1 (+, -, M, M) and Accel 2 (+, +, *, M) merged into one multifunction accelerator. Positional union: (+, +/-, M/*, M), storage cost 15. Smart union: (+, */-, M/+, M), storage cost 11]
Union Method
• Smart union formulated as ILP problem which minimizes FU and register cost
• Benefit: Look at whole design at once
• Limitation: Schedules are fixed prior to union phase
• Fast runtime
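To make the difference between positional and smart union concrete, the sketch below pairs the FUs of two small accelerators and compares costs. It swaps the ILP formulation for a brute-force search over pairings, which is only practical for a handful of FUs. The cost model (each distinct operation counted once, plus the larger of the two register counts) and every number in it are hypothetical.

```python
from itertools import permutations

OP_COST = {"+": 1, "-": 1, "*": 4, "M": 3}  # hypothetical FU costs

# Each FU is (operation, register count); all numbers are made up.
accel1 = [("+", 4), ("-", 1), ("M", 6), ("M", 2)]
accel2 = [("+", 3), ("+", 6), ("*", 1), ("M", 2)]

def merged_cost(fu_a, fu_b):
    """Cost of one shared FU: each distinct op once, plus the larger register file."""
    ops = {fu_a[0], fu_b[0]}
    return sum(OP_COST[op] for op in ops) + max(fu_a[1], fu_b[1])

def union_cost(pairing):
    """pairing[i] is the index of the accel2 FU merged with accel1 FU i."""
    return sum(merged_cost(accel1[i], accel2[j]) for i, j in enumerate(pairing))

# Positional union merges FU i of one accelerator with FU i of the other.
positional = tuple(range(len(accel1)))
# "Smart" union here: exhaustively pick the cheapest pairing (stand-in for the ILP).
smart = min(permutations(range(len(accel2))), key=union_cost)

print(union_cost(positional))    # 31
print(smart, union_cost(smart))  # (0, 2, 1, 3) 26
```

With these made-up numbers the cheapest pairing merges - with * and M with +, because lining up register files of similar size saves more than the positional pairing does.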
Cost of Union of Accelerators
[Figure: cost of union for Image Processing, MPEG4, and Signal Processing benchmarks]
• Worst union: 25% average savings
• Positional union: 29% average savings
• Best union: 33% average savings
Phase Ordered Method
• Schedule loops in order
• During scheduling, account for hardware from previous loop
• Cost sensitive scheduler attempts to minimize hardware cost increase
[Figure: Loop 1 produces Accel 1; Loop 2 is then scheduled onto it, producing Accel 1+2]
Cost Sensitive Scheduling
• Different valid scheduling alternatives are not equal
[Figure: alternative schedules of operations +1, +2, LD1, and LD2 on FU1, FU2, and FU3 over time steps 0 to 2]
Greedy Cost Sensitive Scheduler
• Select scheduling alternative with minimum cost
• Account for estimated cost of unscheduled ops
[Figure: for Loop 1 and then Loop 2, the modulo scheduler sends each alternative Alt_i to the hardware cost modeler, which returns Cost_i based on the HW cost library, Loop 1 hardware, partial hardware for scheduled ops, and an estimate for unscheduled ops]
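The greedy selection step can be sketched as follows. This is a deliberately reduced model, not the scheduler from the talk: it ignores data dependences and the estimate for unscheduled ops, treats each (FU, slot) pair as an alternative, and charges a hypothetical cost whenever an FU must gain a new capability or a new FU must be added. `OP_COST` and `greedy_schedule` are illustrative names.

```python
OP_COST = {"ADD": 1.0, "LD": 3.0}  # hypothetical cost of adding an op to an FU

def greedy_schedule(ops, prior_hardware, ii):
    """Greedily place each op on the cheapest free (FU, slot) alternative.

    ops: list of (name, kind). prior_hardware: list of sets of op kinds,
    one set per FU already built for earlier loops.
    Returns (placement, added_cost), where placement maps name -> (FU, slot).
    """
    hardware = [set(fu) for fu in prior_hardware]  # copy; caller's data untouched
    busy = set()                                   # occupied (FU index, slot) pairs
    placement = {}
    added_cost = 0.0
    for name, kind in ops:
        # Reusing an FU that already supports the op is free; otherwise pay for it.
        alternatives = [
            (0.0 if kind in caps else OP_COST[kind], fu, slot)
            for fu, caps in enumerate(hardware)
            for slot in range(ii)
            if (fu, slot) not in busy
        ]
        # Allocating a brand-new FU is always a valid, if costly, alternative.
        alternatives.append((OP_COST[kind], len(hardware), 0))
        cost, fu, slot = min(alternatives)  # ties go to the lowest FU index, then slot
        if fu == len(hardware):
            hardware.append(set())
        hardware[fu].add(kind)
        busy.add((fu, slot))
        placement[name] = (fu, slot)
        added_cost += cost
    return placement, added_cost

# Loop 1 left behind one adder and one load unit; Loop 2 is scheduled at II = 2.
loop2 = [("add1", "ADD"), ("add2", "ADD"), ("ld1", "LD"), ("ld2", "LD"), ("ld3", "LD")]
placement, added_cost = greedy_schedule(loop2, [{"ADD"}, {"LD"}], ii=2)
print(placement)
print(added_cost)  # 3.0
```

In this run the first four operations reuse Loop 1's hardware at no cost, and only the third load forces a new FU.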
Phase Ordered Method
• Extend conventional iterative modulo scheduler with hardware cost model
• Benefits:
  • Scheduler is aware of hardware for all previously scheduled loops
  • Can adjust schedule to improve cost savings
• Limitation: process is localized and greedy; schedules of previous loops are fixed
• Fast runtime
Cost Sensitive Scheduling Comparison
[Figure: cost savings for Image Processing, MPEG4, and Signal Processing benchmarks]
• Greedy scheduling: 41% average savings
• ILP scheduling: 51% average savings
Union vs. Phase Ordered Methods
[Figure: cost savings for Image Processing, MPEG4, and Signal Processing benchmarks]
• Union method: 45% average savings
• Phase ordered method: 41% average savings
Conclusion
• Compiler-directed design system
• Multifunction accelerator for hardware reuse
• Two multifunction design methods
  • Smart union of single-function accelerators: 45% average cost savings
  • Phase ordered scheduling: 41% average cost savings
• Overall, 20–61% hardware savings from sharing
Questions?