Dynamic Voltage/Frequency Scaling in Loop Accelerators using BLADES. Ganesh Dasika¹, Shidhartha Das², Kevin Fan¹, Scott Mahlke¹, David Bull². ¹University of Michigan, Advanced Computer Architecture Laboratory, Ann Arbor, MI. ²ARM Ltd., Cambridge, United Kingdom. 1
Introduction (figure from [Austin, IEEE Computer, March 2004]) 2
Razor • Allows for voltage/frequency scaling beyond first-failure point • Exploits difference between design-time conditions (“slow”) and actual conditions (“typical”) [Das, JSSC 2006] 3
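A minimal behavioral sketch in Python (illustrative only; Razor itself is a circuit-level flip-flop augmented with a shadow latch) of the detection idea: the main flip-flop samples at the clock edge, the shadow latch samples the same signal slightly later, and a mismatch flags a recoverable timing error.

```python
# Behavioral sketch of Razor-style timing-error detection (illustrative only;
# the real mechanism is a flip-flop augmented with a shadow latch).

def razor_sample(path_delay_ps, clk_period_ps, shadow_window_ps):
    """Model one combinational path in one cycle.

    path_delay_ps    : how long the path actually took to settle this cycle
    clk_period_ps    : the (aggressively scaled) clock period
    shadow_window_ps : extra time the shadow latch waits after the clock edge
    """
    main_ok = path_delay_ps <= clk_period_ps                       # main FF got the settled value
    shadow_ok = path_delay_ps <= clk_period_ps + shadow_window_ps  # shadow latch got it
    error = shadow_ok and not main_ok  # recoverable timing error: restore from shadow, roll back
    return main_ok, error

# Scaling the clock past the first-failure point of a 900 ps path:
print(razor_sample(path_delay_ps=900, clk_period_ps=850, shadow_window_ps=100))
# (False, True): the main flip-flop missed, the shadow latch recovered, error flagged
```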
Razor in General Purpose Processors • Requires detailed analysis of microarchitectural impact • Analyze what state should be stored • Lengthening pipeline for stabilization increases complexity of forwarding logic • Unpredictable control and data flow • Difficult to determine worst-case vectors 4
BLADES • Better-than-worst-case Loop Accelerator Design • Incorporate DVFS into ASICs using Razor • Reduce some of the high non-recurring engineering (NRE) cost using high-level synthesis (HLS) • Develop a generic methodology for any application • A Razor solution for a templated architecture • Create an ASIC design flow that is aware of Razor-ization costs 5
Loop Accelerator Template • Hardware realization of modulo-scheduled loop • Parameterized execution resources, storage, connectivity • Control is statically determined, simple and not timing-critical • Opportunity to make application-specific optimizations 6
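For context, a small sketch of the resource bound that a modulo schedule must respect; the op and FU counts are hypothetical, and this is not the actual HLS flow used in the work.

```python
import math

# Resource-constrained minimum initiation interval (ResMII) for a modulo-
# scheduled loop: with one op per FU per cycle, the II can be no smaller
# than ceil(ops / FUs) for any operation class.
def res_mii(op_counts, fu_counts):
    return max(math.ceil(op_counts[k] / fu_counts[k]) for k in op_counts)

# Hypothetical loop body: 6 adds on 2 adders, 3 multiplies on 1 multiplier.
print(res_mii({"add": 6, "mul": 3}, {"add": 2, "mul": 1}))  # -> 3
```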
Razorized Loop Accelerator (datapath figure): “roll-back” muxes, added interconnect, and register queues extended by R entries; R is the number of extra entries required, a function of the maximum pipeline depth and the error-detection delay. 7
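The slides only state that R is a function of the maximum pipeline depth and the error-detection delay; the sketch below assumes the simplest additive relationship, which may not match the exact formula used.

```python
# Assumed relationship, for illustration only: the extended register queue
# must hold every value produced between an error occurring in the deepest
# pipeline stage and the roll-back taking effect.
def extra_queue_entries(max_pipeline_depth, error_detect_cycles):
    return max_pipeline_depth + error_detect_cycles

print(extra_queue_entries(max_pipeline_depth=4, error_detect_cycles=2))  # -> 6 extra entries
```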
Error “Life-Cycle” (figure): error stabilization, error OR-tree, error processing in the control logic, error reset, and roll-back pipelining. 8
Issues with Razor • Area overhead, added hold-time fixing (timing diagram: D, CLK, t_spec speculation window) 9
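To make the hold-fixing cost concrete: any short path feeding a Razor flip-flop must be padded until it exceeds the speculation window, so it cannot corrupt the shadow latch. A rough sketch with assumed delay numbers follows.

```python
# Hold-fixing sketch: every short path into a Razor flip-flop must exceed the
# speculation window (t_spec) plus the hold time, otherwise delay buffers are
# inserted. Those buffers are the area / hold-fixing overhead on this slide.
def buffers_needed(min_path_delay_ps, t_spec_ps, t_hold_ps, buffer_delay_ps=20):
    shortfall = (t_spec_ps + t_hold_ps) - min_path_delay_ps
    return 0 if shortfall <= 0 else -(-shortfall // buffer_delay_ps)  # ceil division

print(buffers_needed(min_path_delay_ps=60, t_spec_ps=100, t_hold_ps=10))  # -> 3 buffers
```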
Opcode-chaining (scheduling figure): 50% FU utilization removes the need for hold fixing, but requires halving performance or doubling area; a hybrid scheme instead executes >2 ops per FU by chaining them (e.g., Add-Or issued as one custom operation). 10
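A rough cost model of the three options, with made-up op counts and unit costs, only to show the shape of the trade-off.

```python
import math

# Rough model of the options above for a loop body of n_ops single-cycle ops.
# chain_len is the number of dependent ops fused into one custom-FU issue.
def schedule(n_ops, n_fus, every_other_cycle=False, chain_len=1):
    issue_period = 2 if every_other_cycle else 1     # 50% utilization -> issue every 2 cycles
    issues = math.ceil(n_ops / (n_fus * chain_len))  # issue slots needed per FU
    return {"fus": n_fus, "cycles": issues * issue_period}

print(schedule(8, 2))                                        # baseline
print(schedule(8, 2, every_other_cycle=True))                # half rate: 2x cycles
print(schedule(8, 4, every_other_cycle=True))                # half rate: 2x FUs
print(schedule(8, 2, every_other_cycle=True, chain_len=2))   # hybrid: chain 2 ops per FU
```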
Identifying Opcode Chains • Compiler identifies subgraphs with 3-4 inputs and 1 output • All arithmetic ops supported • Greedy selection algorithm (dataflow-graph figure: chains of loads, shifts, adds, and stores) 11
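A minimal sketch of what such a greedy selection pass might look like; the data structures and the linear-chain simplification are assumptions, since the slides do not give the actual algorithm.

```python
# Greedy sketch: grow a linear chain from each unassigned op through its sole
# consumer while the merged subgraph keeps <= 4 external inputs and 1 output.
def find_chains(ops, max_inputs=4):
    op_inputs = dict(ops)                     # op name -> list of input names
    consumers = {}
    for name, ins in ops:
        for i in ins:
            consumers.setdefault(i, set()).add(name)

    assigned, chains = set(), []
    for name, _ in ops:                       # assumes ops are in topological order
        if name in assigned:
            continue
        chain, cur = [name], name
        while len(consumers.get(cur, set())) == 1:   # single consumer -> single output
            nxt = next(iter(consumers[cur]))
            if nxt in assigned or nxt not in op_inputs:
                break
            cand = chain + [nxt]
            ext = {i for m in cand for i in op_inputs[m] if i not in cand}
            if len(ext) > max_inputs:         # would exceed the 3-4 input limit
                break
            chain, cur = cand, nxt
        assigned.update(chain)
        chains.append(chain)
    return chains

# Tiny example: a shift feeding an add feeding another add (live-ins x, y, z).
ops = [("sll1", ["x"]), ("add1", ["sll1", "y"]), ("add2", ["add1", "z"])]
print(find_chains(ops))  # -> [['sll1', 'add1', 'add2']]
```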
Custom FUs (figure): the identified chains are mapped onto custom function units with a Razor DFF on the output, enabled every 2 cycles. 12
Results • idct, sharp, and systolic_dct had multiple CFUs and an overall lower number of FUs • Viterbi and dequant had significant control flow that restricted opportunities for creating custom ops • 22% reduction in hold-fixing overhead in sobel 13
Conclusion • Application-specific optimizations definitely help to mitigate Razor costs • 24% reduction in overhead • 33% energy savings overall • Can optimize Razor-ization with further input from the compiler • Critical-instruction analysis • Error impact analysis 14
Thank you! http://cccp.eecs.umich.edu 15
Future Work • Errors in different FUs affect the system differently • Error “impact-analysis” • Data computation not necessarily error-sensitive • Address, branch target/direction critical to functionality • Razor-ization of arbitrary Verilog 16
Motivation • Using Razor has significant design overhead • Error-recovery system • Added “backup” state • Additional hold-time fixing • Required modifications differ across microarchitectures • Workload information cannot be exploited, since a general-purpose design must preserve generality 17