
Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems

Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems. Andrew B. Kahng and Siddhartha Nath, VLSI CAD Laboratory, UC San Diego. Outline: Motivation, Previous Work, Our Work, Problem Formulation, Optimal (Discretized) Solution Flow, Results, Conclusions.


Presentation Transcript


  1. Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems Andrew B. Kahng and Siddhartha Nath VLSI CAD LABORATORY, UC San Diego

  2. Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions

  3. Reliability in Multicore Systems • Modern multicore processors support multiple operating modes • E.g., nominal, supply voltage scaling, turbo, etc. • Reliability is a key processor design consideration at leading-edge technology nodes to guarantee a prescribed system lifetime • Task scheduling affects how cores are used • A subset of cores can fail before others

  4. Scheduling in Multicore Systems • Scheduler packs tasks using some or all of the available processing cores • [Figure: number of active cores (#Cores) versus time for Application A and Application B]

  5. Core Wearout • Mean time to failure (MTTF) is a measure of the lifetime of a core • Reliability mechanisms degrade the MTTF of a core • E.g., electromigration (EM), stress migration, hot carrier injection, bias temperature instability, etc. • When not all cores are simultaneously active • Adjust task scheduling on the subset of active cores for balanced wearout

  6. Impact of Overdrive Frequency • Overdrive frequency: frequency obtained by overclocking the cores to meet performance and throughput requirements • Overdrive frequencies cause faster MTTF degradation • Two challenges • Can violate “acceptable throughput” for tasks • Cores fail before all assigned tasks are completed • Can violate minimum “acceptable performance” for tasks • Cores operate at lower frequencies

  7. Terminology • Power-on-hours () • Effective number of lifetime hours consumed • Measure of a core’s lifetime degradation due to operating conditions, e.g., temperature, frequency • Nominal temperature • Temperature at which MTTF degradation is the same as the number of hours a core is active • Acceleration factor (AF)

  8. Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions

  9. Classification of Existing Works • (N)RC: (Non-) Reliability Constrained • (N)LG: (No) Lifetime Guarantee • (N)PG: (No) Performance Guarantee

  10. Counterexample to NRC Policies • Task schedule • Max frequency = 3GHz • Min acceptable frequency = 1.8GHz • Initial lifetime = 7 years (61320h) • All cores always operate at 3GHz • From HotSpot simulations, AF = 9.77 • Lifetime after nominal tasks requiring m = 3 is 24947.5h • Tasks requiring m = 3 cannot complete overdrive execution • Tasks requiring m = 4 cannot complete at all • Cannot guarantee “acceptable throughput”!
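The lifetime numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes a simple linear model that the slides only imply: lifetime consumed equals AF times wall-clock hours active. The function names are hypothetical and exist only for illustration.

```python
# Hypothetical lifetime-budget arithmetic for the numbers on the slide.
# ASSUMPTION (not stated explicitly in the slides): lifetime hours consumed
# = acceleration factor (AF) * wall-clock hours the core is active.

HOURS_PER_YEAR = 365 * 24  # 8760 h, ignoring leap years


def lifetime_hours(years):
    """Convert a lifetime in years to hours."""
    return years * HOURS_PER_YEAR


def max_active_hours(lifetime_budget_h, af):
    """Wall-clock hours a core can run at a given AF before its budget is used up."""
    return lifetime_budget_h / af


initial_lifetime_h = lifetime_hours(7)  # 61320 h, matching the slide
overdrive_af = 9.77                     # AF at 3GHz, from the slide

print(initial_lifetime_h)
print(round(max_active_hours(initial_lifetime_h, overdrive_af), 1))
```

Under this assumed model, a core that always runs at 3GHz exhausts a 7-year budget in roughly 6276 wall-clock hours, which is why always-overdrive policies can run out of lifetime before all tasks complete.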

  11. Counterexample to RC-LG Policies • Task schedule • Max frequency = 3GHz • Min acceptable frequency = 1.8GHz • Initial lifetime = 61320h • All cores operate initially at 3GHz, and then at 1.6GHz • From HotSpot simulations, AF = 9.77 • All tasks are completed, but tasks requiring m = 3, 4 operate at 1.6GHz < 1.8GHz (acceptable performance) • Cannot guarantee “acceptable performance”!

  12. Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions

  13. What Do We Do Differently? • We formulate a new offline optimization problem: Maximum-Value Reliability-Constrained Overdrive Frequencies (MVRCOF) • Important because • Overdrive frequencies are our optimization variables • User experience is the value • We guarantee prescribed levels of “acceptable performance” and “acceptable throughput”

  14. Comparison of Ours vs. Existing Works • (N)RC: (Non-) Reliability Constrained • (N)LG: (No) Lifetime Guarantee • (N)PG: (No) Performance Guarantee

  15. What is the Optimal Solution? • Task schedule • Max frequency = 3GHz • Min acceptable frequency = 1.8GHz • Initial lifetime = 61320h • Optimal (discretized) solution from exhaustive search • We guarantee both “acceptable performance” and “acceptable throughput” if a solution exists!

  16. Our Key Contributions • We develop a new MVRCOF formulation to maximize the value of operating multiple cores at overdrive frequencies • Our solutions provide guarantees for prescribed lower bounds on “acceptable performance” and “acceptable throughput” • We propose an optimal (discretized) solution using exhaustive search, as well as an approximate heuristic flow • Our solutions determine optimal overdrive frequencies as well as execution times for each active core • We empirically determine that our optimal solutions improve the objective function value by up to 17.4% versus existing works

  17. Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions

  18. Formulation

  19. Formulation in English • Objective, first term: the value of operating at overdrive frequencies, described by weights and the duration • Objective, second term: the value of operating at nominal frequencies, described by weights and the duration

  20. Formulation in English • Guarantees minimum “acceptable performance”, with the overdrive frequency upper bounded by the maximum achievable frequency • Guarantees “acceptable throughput”, i.e., all tasks complete within the lifetime and cores wear out in a balanced manner • Upper bound on instantaneous power dissipated by any core • Upper bound on instantaneous temperature of all active cores
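The transcript does not preserve the equations from the Formulation slide, so the following is only a sketch of one form consistent with the English description above, not the authors' exact formulation. The symbols t, f_min, P_i, T_i and POH_i are assumptions introduced here; f_{OD,m}, f_{nom,m}, f_max, P_max, T_max, N and MTTF appear elsewhere in the slides. Whether frequency enters the objective as a multiplicative factor is also an assumption. The block requires the amsmath package.

```latex
% Illustrative sketch only; term structure is assumed, not taken from the slides.
\begin{align*}
\max_{f_{OD,m}} \quad
  & \sum_{m=1}^{N} \left( w_{OD,m}\, f_{OD,m}\, t_{OD,m}
      + w_{nom,m}\, f_{nom,m}\, t_{nom,m} \right) \\
\text{subject to} \quad
  & f_{min} \le f_{OD,m} \le f_{max}
      && \text{(acceptable performance)} \\
  & \mathrm{POH}_i \le \mathrm{MTTF} \quad \forall i
      && \text{(acceptable throughput, balanced wearout)} \\
  & P_i \le P_{max} \quad \forall i
      && \text{(instantaneous power)} \\
  & T_i \le T_{max} \quad \forall i
      && \text{(instantaneous temperature)}
\end{align*}
```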

  21. MVRCOF Inputs: Task Description • Provided by the scheduler, which serves App 1, App 2, App X • El,m: execution times in nominal and overdrive modes with different numbers of active cores • wl,m: weights in nominal and overdrive modes with different numbers of active cores • fnom,m: nominal frequencies at different numbers of active cores

  22. MVRCOF Inputs: System Description • Provided by the SoC designer • N: number of available symmetric cores • Pmax: maximum power of any core • fmax: maximum frequency of any core • Tmax: maximum die temperature • Tnom: nominal temperature • MTTF: initial MTTF of any core

  23. MVRCOF Outputs • Produced by the MVRCOF solver • fOD,m: optimal overdrive frequencies for each set of active cores • vj,m,l: % execution time in each combination of the active cores • E.g., in a system with three available cores, two cores can be active in three ways (choosing 2 of 3) • ui,l: % lifetime each core operates at nominal and overdrive modes

  24. MVRCOF Inputs and Outputs • Task description (from the scheduler serving App 1, App 2, App X): El,m, wl,m, fnom,m • System description (from the SoC designer): N, Pmax, fmax, Tmax, Tnom, MTTF • MVRCOF solver outputs: fOD,m, vj,m,l, ui,l

  25. Outline • Motivation • Previous Work • Our Work • Problem Formulation • Optimal (Discretized) Solution Flow • Results • Conclusions

  26. Optimal (Discretized) Solution Flow • For each core • For each combination in which the core is active • Choose discrete values of overdrive frequencies within a range • Perform power and temperature simulations • Create a one-time LUT • Example: • If a system has 3 cores (Core A, B, C), the number of active cores m can be 1, 2 or 3 • Core A is active in • One (out of three) combination when m = 1; two (out of three) combinations when m = 2; one (out of one) combination when m = 3 • Perform exhaustive search using the LUT for optimal overdrive frequencies that maximize the value of the objective function
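The combination counts in the example, and the shape of an exhaustive search over a precomputed LUT, can be sketched as follows. This is a simplified illustration, not the authors' solver: the LUT values (apart from AF = 9.77 at 3GHz, quoted earlier in the slides), the weights, the durations, and the single shared lifetime budget are all made-up assumptions, and power and temperature constraints are omitted.

```python
from itertools import combinations, product


def combos_containing(core, cores, m):
    """All m-core combinations of `cores` in which `core` is active."""
    return [c for c in combinations(cores, m) if core in c]


cores = ["A", "B", "C"]
for m in (1, 2, 3):
    total = len(list(combinations(cores, m)))
    with_a = len(combos_containing("A", cores, m))
    print(f"m={m}: Core A is active in {with_a} of {total} combinations")


def exhaustive_search(freq_lut, weights, hours, budget_h):
    """Try every frequency assignment and keep the best feasible one.

    freq_lut: frequency (GHz) -> acceleration factor (toy one-time LUT)
    weights:  m -> weight of running with m active cores (hypothetical)
    hours:    m -> wall-clock hours spent with m active cores (hypothetical)
    budget_h: lifetime budget in hours (assumed shared, for simplicity)
    """
    ms = sorted(weights)
    best_value, best_freqs = None, None
    for freqs in product(sorted(freq_lut), repeat=len(ms)):
        # Assumed wear model: AF * hours, summed over all m.
        wear = sum(freq_lut[f] * hours[m] for f, m in zip(freqs, ms))
        if wear > budget_h:
            continue  # violates the lifetime constraint
        value = sum(weights[m] * f * hours[m] for f, m in zip(freqs, ms))
        if best_value is None or value > best_value:
            best_value, best_freqs = value, dict(zip(ms, freqs))
    return best_value, best_freqs


# Toy data: only the 3.0GHz AF comes from the slides; the rest is invented.
toy_lut = {1.8: 2.0, 2.4: 5.0, 3.0: 9.77}
toy_weights = {1: 1.0, 2: 2.0, 3: 3.0}
toy_hours = {1: 1000, 2: 1000, 3: 1000}

value, freqs = exhaustive_search(toy_lut, toy_weights, toy_hours, budget_h=17000)
print(freqs)
```

The search cost grows as (number of discrete frequencies) raised to the number of active-core counts, which is why a one-time LUT of simulation results matters: each candidate is then evaluated by lookup rather than by rerunning power and thermal simulations.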

  27. Heuristic Flow • We maximize the overdrive frequencies of the sets of active cores in decreasing order of the product of their weights and execution times • Example: • If a system has 3 cores, the number of active cores can be 1, 2 or 3 • If , we maximize and • This achieves large improvements in the value of the objective function
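A greedy ordering of this kind can be sketched as below. This is one possible reading of the heuristic, not the authors' implementation: the wear model (AF times hours against a single shared budget), the toy LUT, and all function names are assumptions for illustration, and the sketch assumes that running everything at the minimum frequency is already feasible.

```python
def heuristic_order(weights, exec_times):
    """Active-core counts m, sorted by decreasing weight * execution time."""
    return sorted(weights, key=lambda m: weights[m] * exec_times[m], reverse=True)


def greedy_overdrive(weights, exec_times, freq_lut, f_min, budget_h):
    """Raise frequencies in priority order to the highest feasible LUT value.

    freq_lut maps frequency (GHz) -> acceleration factor (toy values).
    Assumed wear model: sum over m of AF(f_m) * exec_times[m] <= budget_h.
    """
    freqs = {m: f_min for m in weights}  # start at minimum acceptable frequency

    def wear(assignment):
        return sum(freq_lut[assignment[m]] * exec_times[m] for m in assignment)

    for m in heuristic_order(weights, exec_times):
        for f in sorted(freq_lut, reverse=True):  # try the highest frequency first
            trial = {**freqs, m: f}
            if f >= f_min and wear(trial) <= budget_h:
                freqs = trial
                break
    return freqs


# Toy data: only AF = 9.77 at 3.0GHz is taken from the slides.
toy_lut = {1.8: 2.0, 2.4: 5.0, 3.0: 9.77}
toy_weights = {1: 1.0, 2: 2.0, 3: 3.0}
toy_times = {1: 1000, 2: 1000, 3: 1000}

print(heuristic_order(toy_weights, toy_times))
print(greedy_overdrive(toy_weights, toy_times, toy_lut, f_min=1.8, budget_h=17000))
```

Unlike exhaustive search, this makes one pass over the active-core counts, so it evaluates at most (number of counts) times (number of frequencies) candidates, at the cost of no optimality guarantee.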

  28. Outline • Motivation • Previous Work • Our Work • Problem Statement • Optimal (Discretized) Solution Flow • Results • Conclusions

  29. Experimental Setup • Each core is simulated with 72 copies of jpeg_encoder from OpenCores • SP&R implementation with commercial tools and foundry 45nm libraries • Power simulation using Synopsys PrimeTime-PX • Increase voltage from 0.8V to 1.2V in steps of 10mV • Increase frequency from 1.5GHz to 3GHz in steps of 50MHz • Thermal simulation using HotSpot • LP solver is lp_solve • Baseline policy is RC-LG from existing works
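The sweep ranges above translate into the following discrete grids. This is a small sketch using integer units (MHz, mV) to avoid floating-point drift; the slides do not say whether voltage and frequency are swept jointly or as matched pairs, so the joint count shown is only an upper bound under that assumption.

```python
# Discrete sweep grids from the experimental setup, in integer units.
freqs_mhz = list(range(1500, 3000 + 1, 50))   # 1.5GHz to 3GHz, 50MHz steps
volts_mv = list(range(800, 1200 + 1, 10))     # 0.8V to 1.2V, 10mV steps

print(len(freqs_mhz))   # 31 frequency points
print(len(volts_mv))    # 41 voltage points

# ASSUMPTION: if every (voltage, frequency) pair were simulated independently.
print(len(freqs_mhz) * len(volts_mv))  # 1271 pairs
```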

  30. Testcases • Testcases are described by • Eight testcases in total • Format is -Testcase# • Seven have optimal solutions • One does not have feasible solution • Example

  31. Optimal, Heuristic vs. RC-LG • [Figure; annotations: -12%, -9%]

  32. Runtime Comparison • [Figure; data labels: 2.5, 2.3, 2.5, 10]

  33. Outline • Motivation • Previous Work • Our Work • Problem Statement • Optimal (Discretized) Solution Flow • Results • Conclusions

  34. Conclusions • We formulate and solve a new MVRCOF problem under lifetime reliability constraints • We develop an MVRCOF solver that implements our optimal (discretized) and heuristic flows • Our optimal solutions guarantee both “acceptable performance” and “acceptable throughput” • We empirically demonstrate that our optimal solutions achieve up to 17.4% greater value of the objective function than existing works • Our future work includes • Applying our methods to traces from actual server workloads • Expanding our methods to handle other objectives • Achieving solutions that are temperature-history-aware

  35. Thank You!

  36. Back up

  37. Notation • m: number of simultaneously active cores • N: number of symmetric cores in a system • i: index for a core • fOD,m, fnom,m: overdrive and nominal frequencies when m cores are active • wl,m: weights for overdrive and nominal frequencies • El,m: execution time at overdrive and nominal frequencies • fmax: maximum achievable frequency of any core • Pmax: maximum power consumption of any core • Tmax: maximum die temperature

  38. Optimal Solution Flow • For each core i, fOD,m and combination j of m: power simulation gives Power(fOD,m), which feeds thermal simulation • Results are stored in a LUT of (fOD,m, temp, AF) entries, indexed by core and (m, j) • Exhaustive search over the LUT, with an LP, gives the optimal objective function value, fOD,m and tj,m,l
