Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems. Andrew B. Kahng and Siddhartha Nath VLSI CAD LABORATORY, UC San Diego. Outline. Motivation Previous Work Our Work Problem Formulation Optimal (Discretized) Solution Flow Results Conclusions.
Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems Andrew B. Kahng and Siddhartha Nath VLSI CAD LABORATORY, UC San Diego
Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions
Reliability in Multicore Systems • Modern multicore processors operate in multiple modes • E.g., nominal, supply voltage scaling, turbo, etc. • Reliability is a key processor design consideration at leading-edge technology nodes to guarantee a prescribed system lifetime • Task scheduling affects how cores are used • A subset of cores can fail before others
Scheduling in Multicore Systems • Scheduler packs tasks using some or all of the available processing cores • [Figure: number of active cores (#Cores) versus time for Application A and Application B]
Core Wearout • Mean time to failure (MTTF) is a measure of the lifetime of a core • Reliability mechanisms degrade the MTTF of a core • E.g., electromigration (EM), stress migration, hot carrier injection, bias temperature instability, etc. • When not all cores are simultaneously active • Adjust task scheduling on a subset of active cores for balanced wearout
Impact of Overdrive Frequency • Overdrive frequency: the frequency at which cores are overclocked to meet performance and throughput requirements • Overdrive frequencies cause faster MTTF degradation • Two challenges • Can violate “acceptable throughput” for tasks • Cores fail before all assigned tasks are completed • Can violate minimum “acceptable performance” for tasks • Cores operate at lower frequencies
Terminology • Power-on-hours • Effective number of lifetime hours consumed • Measure of a core’s lifetime degradation due to operating conditions, e.g., temperature, frequency • Nominal temperature • Temperature at which MTTF degradation is the same as the number of hours a core is active • Acceleration factor (AF)
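The terms above can be tied together with a small accounting sketch. It assumes (this is not stated on the slide) that each hour of activity consumes AF effective lifetime hours, with AF = 1.0 at the nominal temperature. The names `Interval`, `power_on_hours`, and `remaining_lifetime` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    hours: float  # wall-clock hours the core is active
    af: float     # acceleration factor over those hours (assumed 1.0 at nominal temperature)


def power_on_hours(intervals):
    """Effective lifetime hours consumed: AF-weighted sum of active hours (assumed model)."""
    return sum(iv.hours * iv.af for iv in intervals)


def remaining_lifetime(initial_mttf_hours, intervals):
    """Lifetime budget left after subtracting the consumed power-on-hours."""
    return initial_mttf_hours - power_on_hours(intervals)


# Hypothetical usage: 1000 h at nominal conditions, then 100 h at an AF of 9.77.
usage = [Interval(hours=1000.0, af=1.0), Interval(hours=100.0, af=9.77)]
poh = power_on_hours(usage)
left = remaining_lifetime(61320.0, usage)
```

Under this model, an hour spent in overdrive costs several nominal hours of lifetime, which is why the overdrive frequency has to be chosen against the lifetime budget.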
Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions
Classification of Existing Works (N)RC – (Non-) Reliability Constrained (N)LG – (No) Lifetime Guarantee (N)PG – (No) Performance Guarantee
Counterexample to NRC Policies • Task schedule • Max frequency = 3GHz • Min acceptable frequency = 1.8GHz • Initial lifetime = 7 years (61320h) • All cores always operate at 3GHz • From HotSpot simulations, AF = 9.77 • Lifetime after nominal tasks requiring m = 3 is 24947.5h • Tasks requiring m = 3 cannot complete overdrive execution • Tasks requiring m = 4 cannot complete at all Cannot guarantee “acceptable throughput” !!!
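To give a sense of scale for AF = 9.77, here is a worked bound under the assumption that every active hour at 3GHz consumes AF lifetime hours. The symbol t_3GHz (wall-clock hours sustainable at 3GHz) is introduced here for illustration only.

```latex
\[
t_{\mathrm{3GHz}} \;=\; \frac{\text{initial lifetime}}{\mathrm{AF}}
\;=\; \frac{61320\ \text{h}}{9.77} \;\approx\; 6276\ \text{h}
\]
```

Under that assumption, a core held at 3GHz uses up its 7-year budget in roughly 0.72 years of wall-clock activity.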
Counterexample to RC-LG Policies • Task schedule • Max frequency = 3GHz • Min acceptable frequency = 1.8GHz • Initial lifetime = 61320h • All cores operate initially at 3GHz, and then at 1.6GHz • From HotSpot simulations, AF = 9.77 • All tasks are completed but • Tasks requiring m = 3, 4 operate at 1.6GHz < 1.8GHz (acceptable performance) !!! Cannot guarantee “acceptable performance” !!!
Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions
What Do We Do Differently? • We formulate a new (offline) optimization problem, Maximum-Value Reliability-Constrained Overdrive Frequencies (MVRCOF) • Important because • Overdrive frequencies are our optimization variables • User experience is the value • We guarantee prescribed levels of “acceptable performance” and “acceptable throughput”
Comparison of Ours vs. Existing Works (N)RC – (Non-) Reliability Constrained (N)LG – (No) Lifetime Guarantee (N)PG – (No) Performance Guarantee
What is the Optimal Solution? • Task schedule • Max frequency = 3GHz • Min acceptable frequency = 1.8GHz • Initial lifetime = 61320h • Optimal (discretized) solution from exhaustive search We guarantee both “acceptable performance” and “acceptable throughput” if a solution exists!!!
Our Key Contributions • We develop a new MVRCOF formulation to maximize the value of operating multiple cores at overdrive frequencies • Our solutions provide guarantees for prescribed lower bounds on “acceptable performance” and “acceptable throughput” • We propose an optimal (discretized) solution using exhaustive search as well as an approximate heuristic flow • Our solutions determine optimal overdrive frequencies as well as execution times for each active core • We empirically determine that our optimal solutions improve the objective function value by up to 17.4% versus existing works
Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions
Formulation In English • The value of operating at overdrive frequencies (fOD,m), described by weights (wl,m) and the duration • The value of operating at nominal frequencies (fnom,m), described by weights (wl,m) and the duration
Formulation In English • Guarantees minimum “acceptable performance”, with the frequency upper bounded by the maximum achievable frequency (fmax) • Guarantees “acceptable throughput”, i.e., all tasks complete within lifetime and cores wear out in a balanced manner • Upper bound on instantaneous power dissipated by any core • Upper bound on instantaneous temperature of all active cores
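A sketch of how the English description could be written mathematically. The product form of the objective, the durations t_{OD,m} and t_{nom,m}, the lower bound f_{min}, and the per-core quantities POH_i, P_i, and T_i are all assumptions made for illustration; the symbols f_{OD,m}, f_{nom,m}, w, f_{max}, P_{max}, T_{max}, and MTTF follow the input and output descriptions.

```latex
\[
\begin{aligned}
\max_{f_{OD,m},\; t_{OD,m},\; t_{nom,m}} \quad
  & \sum_{m=1}^{N} \left( w_{OD,m}\, f_{OD,m}\, t_{OD,m}
      + w_{nom,m}\, f_{nom,m}\, t_{nom,m} \right) \\
\text{s.t.} \quad
  & f_{min} \le f_{OD,m} \le f_{max} && \forall m
      && \text{(acceptable performance)} \\
  & \mathrm{POH}_i \le \mathrm{MTTF} && \forall i
      && \text{(acceptable throughput)} \\
  & P_i \le P_{max} && \forall i
      && \text{(instantaneous power)} \\
  & T_i \le T_{max} && \forall i
      && \text{(instantaneous temperature)}
\end{aligned}
\]
```

The balanced-wearout requirement is not captured by these inequalities; it would need an additional condition relating the POH_i of different cores.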
MVRCOF Inputs: Task Description • Provided by the scheduler for App 1, App 2, ..., App X • El,m: execution times in nominal and overdrive modes with different numbers of active cores • wl,m: weights in nominal and overdrive modes with different numbers of active cores • fnom,m: nominal frequencies at different numbers of active cores
MVRCOF Inputs: System Description • Provided by the SoC designer • N: number of available symmetric cores • Pmax: maximum power of any core • fmax: maximum frequency of any core • Tmax: maximum die temperature • Tnom: nominal temperature • MTTF: initial MTTF of any core
MVRCOF Outputs • Produced by the MVRCOF solver • fOD,m: optimal overdrive frequencies for each set of active cores • vj,m,l: % of execution time in each combination of the active cores • E.g., in a system with three available cores, two cores can be active in C(3,2) = 3 ways • ui,l: % of lifetime each core operates in nominal and overdrive modes
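The combinations that index vj,m,l can be enumerated directly. This is a minimal sketch with hypothetical core names A, B, and C; `combos_with` is a helper introduced here, not part of the solver.

```python
from itertools import combinations
from math import comb

cores = ["A", "B", "C"]  # hypothetical names for N = 3 symmetric cores

# All ways to pick m active cores, for every m from 1 to N.
ways = {m: list(combinations(cores, m)) for m in range(1, len(cores) + 1)}


def combos_with(core, m):
    """Combinations of m active cores that include the given core."""
    return [c for c in ways[m] if core in c]


for m, combos in ways.items():
    # The enumeration matches the binomial coefficient C(N, m).
    assert len(combos) == comb(len(cores), m)
    print(m, combos, "of which", len(combos_with("A", m)), "include core A")
```

The number of combinations grows as C(N, m), so the enumeration stays cheap for small N but is one reason the optimal flow is described as discretized and offline.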
MVRCOF Inputs and Outputs • Task description (from the scheduler, for App 1, App 2, ..., App X): El,m, wl,m, fnom,m • System description (from the SoC designer): N, Pmax, fmax, Tmax, Tnom, MTTF • MVRCOF solver outputs: fOD,m, vj,m,l, ui,l
Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions
Optimal (Discretized) Solution Flow • For each core • For each combination in which the core is active • Choose discrete values of overdrive frequencies within a range • Perform power and temperature simulations • Create a one-time LUT • Example: • If a system has 3 cores (Core A, B, C), the number of active cores can be 1, 2 or 3 • Core A is active • One (out of three) combination when m = 1; two (out of three) combinations when m = 2; one (out of one) combination when m = 3 • Perform exhaustive search using the LUT for optimal overdrive frequencies that maximize the value of the objective function
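The search step can be sketched as follows. Everything numeric here is hypothetical: the three-point frequency grid, the AF values in the LUT (which in the flow above would come from power and thermal simulations), the weights, the hours, and the lifetime budget. The objective and the single lifetime constraint are simplified assumptions, and the sketch omits the power and temperature bounds and the LP mentioned elsewhere in the deck.

```python
from itertools import product

FREQS_GHZ = [1.8, 2.4, 3.0]  # hypothetical discretized overdrive frequencies

# Hypothetical one-time LUT: (m, frequency) -> acceleration factor.
LUT_AF = {
    (1, 1.8): 1.0, (1, 2.4): 2.2, (1, 3.0): 3.4,
    (2, 1.8): 1.0, (2, 2.4): 3.4, (2, 3.0): 5.8,
    (3, 1.8): 1.0, (3, 2.4): 4.6, (3, 3.0): 8.2,
}
WEIGHTS = {1: 1.0, 2: 2.0, 3: 3.0}   # hypothetical weight per m
HOURS = {1: 1000, 2: 1000, 3: 1000}  # hypothetical overdrive hours per m
BUDGET_POH = 10000                   # hypothetical lifetime budget


def consumed(choice):
    """Lifetime hours consumed by a frequency assignment {m: f} (assumed model)."""
    return sum(LUT_AF[(m, f)] * HOURS[m] for m, f in choice.items())


def value(choice):
    """Assumed objective: weight x frequency x duration, summed over m."""
    return sum(WEIGHTS[m] * f * HOURS[m] for m, f in choice.items())


def exhaustive_search():
    """Try every assignment on the grid; return (value, choice) or None if infeasible."""
    best = None
    ms = sorted(WEIGHTS)
    for freqs in product(FREQS_GHZ, repeat=len(ms)):
        choice = dict(zip(ms, freqs))
        if consumed(choice) > BUDGET_POH:
            continue  # violates the lifetime constraint
        if best is None or value(choice) > best[0]:
            best = (value(choice), choice)
    return best


best = exhaustive_search()
```

Returning `None` when no assignment fits the budget mirrors the point that a guarantee is only available if a feasible solution exists. The cost is exponential in the number of distinct m values, which is acceptable for a one-time offline run on a small grid.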
Heuristic Flow • We maximize the overdrive frequencies one set of active cores at a time, in decreasing order of the product of weights and execution times • Example: • If a system has 3 cores, the number of active cores can be 1, 2 or 3 • If , we maximize and • This achieves large improvements in the value of the objective function
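One way to read the ordering rule above is as a greedy pass. This sketch assumes a starting point at the lowest grid frequency and a single lifetime constraint. The LUT values, weights, hours, and budget are hypothetical, and `heuristic` is a name introduced here.

```python
FREQS_GHZ = [1.8, 2.4, 3.0]  # hypothetical discretized overdrive frequencies

# Hypothetical LUT: (m, frequency) -> acceleration factor.
LUT_AF = {
    (1, 1.8): 1.0, (1, 2.4): 2.2, (1, 3.0): 3.4,
    (2, 1.8): 1.0, (2, 2.4): 3.4, (2, 3.0): 5.8,
    (3, 1.8): 1.0, (3, 2.4): 4.6, (3, 3.0): 8.2,
}
WEIGHTS = {1: 1.0, 2: 2.0, 3: 3.0}   # hypothetical weight per m
HOURS = {1: 1000, 2: 1000, 3: 1000}  # hypothetical execution hours per m
BUDGET_POH = 10000                   # hypothetical lifetime budget


def consumed(choice):
    """Lifetime hours consumed by a frequency assignment {m: f} (assumed model)."""
    return sum(LUT_AF[(m, f)] * HOURS[m] for m, f in choice.items())


def heuristic():
    """Greedy pass: visit m in decreasing weight x time, raise f as far as the budget allows."""
    choice = {m: min(FREQS_GHZ) for m in WEIGHTS}
    if consumed(choice) > BUDGET_POH:
        return None  # infeasible even at the lowest frequency
    order = sorted(WEIGHTS, key=lambda m: WEIGHTS[m] * HOURS[m], reverse=True)
    for m in order:
        for f in sorted(FREQS_GHZ, reverse=True):
            trial = {**choice, m: f}
            if consumed(trial) <= BUDGET_POH:
                choice = trial  # highest feasible frequency for this m
                break
    return choice


result = heuristic()
```

The pass visits each m once and tries at most every grid frequency, so its cost is linear in the grid size rather than exponential, at the price of possibly missing the optimum.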
Outline • Motivation • Previous Work • Our Work • Problem Statement • Optimal (Discretized) Solution Flow • Results • Conclusions
Experimental Setup • Each core is simulated with 72 copies of jpeg_encoder from OpenCores • SP&R implementation with commercial tools and foundry 45nm libraries • Power simulation using Synopsys PrimeTime-PX • Increase voltage from 0.8V to 1.2V in steps of 10mV • Increase frequency from 1.5GHz to 3GHz in steps of 50MHz • Thermal simulation using HotSpot • LP solver is lp_solve • Baseline policy is RC-LG from existing works
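The sweep ranges above translate into the following grid. Integer millivolts and megahertz are used to avoid floating-point step errors. Pairing every voltage with every frequency is an assumption; the setup does not say which pairs were actually simulated.

```python
# Voltage sweep: 0.8V to 1.2V in 10mV steps, expressed in millivolts.
voltages_mv = list(range(800, 1201, 10))

# Frequency sweep: 1.5GHz to 3GHz in 50MHz steps, expressed in megahertz.
freqs_mhz = list(range(1500, 3001, 50))

# Assumed full cross product of operating points to simulate.
grid = [(v, f) for v in voltages_mv for f in freqs_mhz]

print(len(voltages_mv), "voltages x", len(freqs_mhz), "frequencies =", len(grid), "points")
```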
Testcases • Testcases are described by • Eight testcases in total • Format is -Testcase# • Seven have optimal solutions • One does not have a feasible solution • Example
Optimal, Heuristic vs. RC-LG • [Figure: comparison chart of the optimal and heuristic flows against RC-LG; chart annotations: -12%, -9%]
Runtime Comparison • [Figure: runtime comparison chart; data labels: 2.5, 2.3, 2.5, 10]
Outline • Motivation • Previous Work • Our Work • Problem Statement • Optimal (Discretized) Solution Flow • Results • Conclusions
Conclusions • We formulate and solve a new MVRCOF problem under lifetime reliability constraints • We develop an MVRCOF solver that implements our optimal (discretized) and heuristic flows • Our optimal solutions guarantee both “acceptable performance” and “acceptable throughput” • We empirically demonstrate that our optimal solutions achieve up to 17.4% greater value of the objective function than existing works • Our future work includes • Applying our methods to traces from actual server workloads • Expanding our methods to handle other objectives • Achieving solutions that are temperature history-aware
Notation • m: number of simultaneously active cores • N: number of symmetric cores in a system • i: index for a core • fOD,m, fnom,m: overdrive and nominal frequencies when m cores are active • wl,m: weights for overdrive and nominal frequencies • El,m: execution time in overdrive and nominal frequencies • fmax: maximum achievable frequency of any core • Pmax: maximum power consumption of any core • Tmax: maximum die temperature
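Optimal Solution Flow • For each core i, fOD,m, and combination j of m • Power simulation gives Power(fOD,m), which feeds the thermal simulation • Results are stored in a LUT of (fOD,m, temp, AF) with columns fOD,m, Core (m, j), AF, Temp • Exhaustive search over the LUT, together with an LP, yields the optimal objective function value, fOD,m, and tj,m,l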