Increasing Hardware Efficiency with Multifunction Loop Accelerators
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
October 25, 2006

Introduction
• Emerging applications have high performance, cost, and energy demands
  • H.264, wireless, software radio, signal processing
  • 10-100 Gops required
  • 200 mW power budget
• Applications are dominated by tight loops processing large amounts of streaming data
[Figure: CPU with attached accelerators]

Loop Accelerators
• Order-of-magnitude performance and efficiency wins
  • Viterbi: 100x speedup vs. ARM9
• Automated C-to-gates solution
  • Correct by construction
  • Close designer productivity gap
  • Achieve short time-to-market
[Figure: C source (.c) compiled into a loop accelerator]

Prescribed Throughput Accelerators
• Traditional behavioral synthesis
  • Directly translate C operators into gates
• Our approach: application-centric architectures
  • Achieve fixed throughput
  • Maximize hardware sharing
[Figure: traditional flow maps an operation graph to a datapath; the application-centric flow maps an application to an architecture]

Outline
• Loop accelerator schema and design flow
• Cost sensitive scheduling
• Designing multifunction accelerators
  • Naïve
  • Joint scheduling
  • Datapath union
• Synthesis results

Loop Accelerator Template
• Hardware realization of a modulo scheduled loop
• Parameterized execution resources, storage, connectivity

Loop Accelerator Design Flow
[Figure: design flow. C code and performance (throughput) target (.c) → FU allocation → abstract architecture (FUs) → modulo schedule → scheduled ops → build datapath → concrete architecture (FUs, RF) → instantiate architecture → Verilog and control signals (.v) → synthesize → loop accelerator]

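
The FU allocation step at the head of this flow can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' tool: it assumes the throughput target is expressed as an initiation interval `ii` (cycles between successive loop iterations, the usual modulo scheduling parameter), that every operation occupies its FU for one cycle, and that `min_fus` and the operation class names are hypothetical.

```python
from collections import Counter
from math import ceil

def min_fus(ops, ii):
    """Lower bound on FUs per operation class for a given initiation interval.

    Assumes each op occupies its FU for exactly one cycle, so one FU can
    execute at most `ii` ops of its class per loop iteration.
    """
    counts = Counter(ops)
    return {cls: ceil(n / ii) for cls, n in counts.items()}

# Hypothetical loop body: three adds, one multiply, two loads, II = 2.
alloc = min_fus(["add", "add", "add", "mul", "load", "load"], ii=2)
print(alloc)
```

A tighter throughput target (smaller `ii`) raises the FU count, which is the cost/performance trade-off the flow starts from.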
Datapath Derived from Schedule
• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule
Source code:
  r1 = Mem[r2]
  r3 = r1 + 12
[Figure: schedule with LOAD on FU1 at time 1 and ADD on FU2 at time 4; derived datapath with a MEM unit feeding an adder whose other input is the constant 12]

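
As a rough sketch of how register and interconnect requirements fall out of a schedule, the load/add example from this slide can be processed as below. The tuple layout and the `derive_requirements` helper are hypothetical, and lifetimes are measured from issue time to issue time, ignoring operation latency (an assumption; the slide does not state latencies).

```python
# Schedule from the slide: (opcode, destination, register sources, FU, issue time).
# The constant 12 is an immediate, so it is not listed as a register source.
schedule = [
    ("LOAD", "r1", ["r2"], "FU1", 1),
    ("ADD",  "r3", ["r1"], "FU2", 4),
]

def derive_requirements(schedule):
    """Return (wires, lifetimes) implied by a schedule.

    wires:     set of (producer FU, consumer FU) connections needed
    lifetimes: cycles each value must be held between producer and last consumer
    """
    producer = {dest: (fu, t) for _, dest, _, fu, t in schedule}
    wires, lifetimes = set(), {}
    for _, _, sources, fu, t in schedule:
        for src in sources:
            if src in producer:  # live-in values such as r2 have no producer here
                prod_fu, prod_t = producer[src]
                wires.add((prod_fu, fu))
                lifetimes[src] = max(lifetimes.get(src, 0), t - prod_t)
    return wires, lifetimes

wires, lifetimes = derive_requirements(schedule)
print(wires, lifetimes)
```

Here `r1` must be held for three cycles and routed from FU1 to FU2, so moving the ADD earlier in the schedule would directly shrink the storage requirement.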
Cost Sensitive Scheduling
• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost
• 27% cost reduction with same performance [MICRO ’05]
[Figure: two alternative schedules of the operations +1, +2, LD1, LD2 on FU1-FU3 over cycles 0-2, each shown with its resulting datapath]

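
A toy comparison shows why the FU assignment chosen by the scheduler changes hardware cost even when schedule length is unchanged. The relative costs in `UNIT_COST` and the `datapath_cost` helper are made up for illustration, and the model (an FU pays for every distinct operation kind assigned to it) is a simplification rather than the cost function used in the cited work.

```python
# Hypothetical relative gate costs per operation kind.
UNIT_COST = {"add": 1, "load": 3}

def datapath_cost(assignment):
    """Sum, over FUs, the cost of every distinct op kind the FU must support."""
    return sum(
        sum(UNIT_COST[kind] for kind in set(ops))
        for ops in assignment.values()
    )

# Both assignments run two ops per FU, i.e. the same two-cycle schedule length.
unaware = {"FU1": ["add", "load"], "FU2": ["load", "add"]}  # every FU needs both units
aware   = {"FU1": ["add", "add"],  "FU2": ["load", "load"]}  # each FU is specialized

print(datapath_cost(unaware), datapath_cost(aware))
```

The hardware-aware assignment halves the cost in this toy model; the numbers are not meant to reproduce the 27% figure on the slide.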
Multifunction Accelerator
• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  • Disjoint stages (loops 2, 3)
  • Pipeline slack (loops 4, 5)
[Figure: application flow of Loop 1, a "Frame Type?" branch to Loop 2 or Loop 3, then Loop 4, Block 5, …; an accelerator pipeline of single-function loop accelerators LA1-LA5 compared with a pipeline in which multifunction loop accelerators replace several of them]

Design Strategies
• Naïve method: design single-function accelerators and place them side by side
• Misses potential hardware sharing of FUs, storage, interconnect
[Figure: Loop 1 and Loop 2 each pass through a cost sensitive modulo scheduler; the two resulting sets of FUs are placed side by side to form the multifunction datapath]

Joint Scheduling
• Loops are independent: the number of possible schedules is exponential in the number of loops!
• Infeasible for modest problems
[Figure: Loop 1 and Loop 2 enter a single joint cost sensitive modulo scheduler, which produces one schedule per loop (Op1, Op2, Op3, … over time) on a shared set of FUs]

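
The exponential growth can be made concrete with a small count. The sketch assumes, purely for illustration, that each loop has the same number of candidate schedules and that a joint scheduler must consider every combination; `joint_candidates` and the schedule labels are hypothetical.

```python
from itertools import product

def joint_candidates(per_loop_schedules):
    """Enumerate every combination of one schedule per loop."""
    return list(product(*per_loop_schedules))

# Three loops, four hypothetical candidate schedules each.
loops = [[f"L{i}_S{j}" for j in range(4)] for i in range(3)]
combos = joint_candidates(loops)
print(len(combos))

# With s candidates per loop, n loops give s**n joint candidates.
growth = [4 ** n for n in range(1, 6)]
print(growth)
```

Real loops have far more than four legal schedules each, so the joint space grows much faster than this toy count suggests.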
Multifunction Gate Costs
• 43% average savings over sum of accelerators
[Figure: gate cost chart with categories A through J]

Datapath Union
[Figure: Loop 1 and Loop 2 are each scheduled by their own cost sensitive modulo scheduler; a datapath union step merges the two resulting sets of FUs into one datapath]

Datapath Union
• Combine similar components → better hardware sharing → lower cost
• Trade off FU and register cost
  • Combining dissimilar FUs can enable register cost savings
• ILP formulation minimizes FU and register cost
[Figure: FUs of Accel 1 and Accel 2 (+, -, *, and memory units M) merged into a multifunction accelerator with combined FUs such as +/-, */-, M/+, M/*]

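
The idea of pairing up FUs from two accelerators can be sketched with an exhaustive search. This is a stand-in for the ILP named on the slide, not a reproduction of it: it tries every pairing, ignores register and interconnect cost, and charges a merged FU the sum of its distinct operation costs. The costs in `OP_COST` and the names `fu_cost` and `union_datapaths` are hypothetical.

```python
from itertools import permutations

# Hypothetical relative costs; "M" stands for a memory unit.
OP_COST = {"+": 1, "-": 1, "*": 4, "M": 3}

def fu_cost(ops):
    """Cost of one FU supporting the given set of operations."""
    return sum(OP_COST[op] for op in ops)

def union_datapaths(accel1, accel2):
    """Pair each FU of accel1 with one FU of accel2, minimizing total FU cost.

    Brute force over all pairings; only practical for a handful of FUs.
    """
    n = max(len(accel1), len(accel2))
    a = accel1 + [frozenset()] * (n - len(accel1))  # pad the shorter list
    b = accel2 + [frozenset()] * (n - len(accel2))
    best_cost, best_merged = None, None
    for perm in permutations(b):
        merged = [x | y for x, y in zip(a, perm)]
        cost = sum(fu_cost(m) for m in merged)
        if best_cost is None or cost < best_cost:
            best_cost, best_merged = cost, merged
    return best_cost, best_merged

accel1 = [frozenset("+"), frozenset("*"), frozenset("M")]
accel2 = [frozenset("+"), frozenset("-"), frozenset("M")]
naive_cost = sum(fu_cost(f) for f in accel1 + accel2)  # side-by-side placement
union_cost, merged = union_datapaths(accel1, accel2)
print(naive_cost, union_cost, merged)
```

In this toy case the search pairs the two adders and the two memory units and folds the subtractor into the multiplier FU. Adding a register term to the cost would let dissimilar pairings win when they save storage, which is the trade-off the slide describes.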
Multifunction Gate Costs
• Smart union within 3% of joint scheduling solution
[Figure: gate cost chart with categories A through J]

Conclusion
• Multifunction accelerators are highly effective in exploiting coarse-grained hardware sharing
• Joint scheduling achieves 43% average cost savings, but is impractical
• Smart union of independent accelerators achieves 40% average savings
  • Compile times of 5 minutes to 1 hour

Questions?