Advanced Computer Architectures CSE 8383. March 6th, 2008. Presentation 2. By: Dina El-Sakaan. RELIABILITY-AWARE POWER MANAGEMENT OF MULTI-CORE PROCESSORS. Jan Haase, Markus Damm, Dennis Hauser, Klaus Waldschmidt, J. W. Goethe-Universität Frankfurt/Main, Technical Computer Sc. Dep.
Agenda • Introduction • Dynamic Power Management • Self Distributing Virtual Machine (SDVM) • Reliability and Temperature • Reliability Aware Power Management • Results • Conclusion
Introduction • Smaller feature sizes and increasing power densities lead to a higher vulnerability to wear-out-based failure mechanisms such as electromigration or stress migration • Increasing power consumption has a negative influence on the lifespan • Reliability can be influenced significantly by Dynamic Power Management (DPM)
Temperature-Dependent Failure Types • Electromigration • Corrosion • Time-dependent dielectric breakdown (TDDB) • Hot carrier injection (HCI) • Surface inversion • Stress migration
Dynamic Power Management • While DPM tends to lower a processor's temperature, which is beneficial, it also leads to the unfavorable effect of temperature cycling • Temperature cycling: frequent heating up and cooling down.
Self Distributing Virtual Machine (SDVM) 1/3 • A dataflow-driven parallel computing middleware • Can be used to simulate a multi-core processor on a computer cluster or to run a real multi-core processor • Implemented as a daemon that runs on each participating processor, creating a site at each • Follows a dataflow principle • Microframe: data structure carrying data and code for each site
Self Distributing Virtual Machine (SDVM) 2/3 • Features include: • undisturbed parallel computation while resizing the cluster • distributed dynamic scheduling and thereby automatic load balancing • participating computing resources may have different architectures and processing speeds • support for any connection network topology
Self Distributing Virtual Machine (SDVM) 3/3 • It is divided into three layers: • Execution layer: contains microframes, processing manager, scheduler, I/O unit • Network layer: sending messages, encryption, and energy manager • Maintenance layer: organization of cluster and local site (performance data, physical IP addresses of other sites, currently running applications and where to find their microthreads)
Integration of power management • Energy manager: • Controls the energy state of the local site. • The master regularly defines the new power configuration of each core based on the temperature and the mean workload of each core. • This information is distributed through the cluster by the SDVM's message mechanism. • The master may even decide to shut down its own site or to quit being the master. • An election is then started again among the remaining sites. • The main task of the energy managers in slave mode is to listen to the master core and to implement its orders, setting the local site to the desired PM-state.
Reliability and Temperature 1/3 • The effect of the temperature can be modeled by the Arrhenius equation, which describes the influence of the temperature on the rate of chemical reactions. The MTTF (Mean-Time-To-Failure) can be estimated by the following formula: $\mathrm{MTTF} \propto e^{E_a/(kT)}$ • where $T$ is the operating temperature in Kelvin • $k$ is Boltzmann's constant • $E_a$ is the activation energy in electron volts of the precise failure mechanism considered
Reliability and Temperature 2/3 • The effect of thermal cycling on the reliability of a chip can be modeled by the Coffin-Manson relation, which computes the number of cycles to failure, $N_f$: $N_f = C_0 \, (\Delta T)^{-q}$ • where $\Delta T$ is the magnitude of thermal cycling • $C_0$ is a material-dependent constant • $q$ is the empirically determined Coffin-Manson exponent
Reliability and Temperature 3/3 • Those two equations are used to get an acceleration factor/ratio for each PM-strategy.
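As a rough illustration of how such acceleration ratios could be computed, the sketch below assumes the standard forms of the two models (MTTF proportional to exp(Ea / (k T)), and Nf = C0 * dT^(-q)). The function names and the sample values for Ea and q are hypothetical and are not taken from the presentation.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def arrhenius_acceleration(t_ref_k: float, t_op_k: float, ea_ev: float) -> float:
    """Ratio MTTF(t_ref) / MTTF(t_op), assuming MTTF ~ exp(Ea / (k T)).

    A value above 1 means the operating temperature shortens the lifetime
    relative to the reference temperature.
    """
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref_k - 1.0 / t_op_k))


def coffin_manson_acceleration(dt_ref: float, dt_op: float, q: float) -> float:
    """Ratio Nf(dt_ref) / Nf(dt_op), assuming Nf = C0 * dT ** (-q).

    The material constant C0 cancels out of the ratio.
    """
    return (dt_op / dt_ref) ** q


if __name__ == "__main__":
    # Hypothetical inputs: Ea = 0.7 eV, q = 2.0, temperatures in Kelvin.
    print(arrhenius_acceleration(t_ref_k=330.0, t_op_k=350.0, ea_ev=0.7))
    print(coffin_manson_acceleration(dt_ref=10.0, dt_op=20.0, q=2.0))
```

Because both functions return ratios, the unknown proportionality constants drop out, which is what makes it possible to compare PM strategies without knowing the absolute lifetime.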
Reliability Aware Power Management 1/2 • In this simulation, two reliability-aware dynamic power management (RADPM) strategies are considered: • The low-temperature policy, which tries to keep the temperature as low as possible and limits the temperature of a core to a given maximum tmax • The smooth-temperature policy, whose goal is to restrict thermal cycling • These policies are compared to the (reliability-unaware) fast-upgrade policy, which tries to optimize performance
Reliability Aware Power Management 2/2 • The simulated computing environment is a homogeneous multi-core processor with four cores • Each core has four different power-management states
Low-temperature Policy (flowchart) • Decisions: Cores in HF-mode with temperature > TEMPmax present? / Average workload < MIN? / Average workload > MAX? / Cores in HF-mode present? / More than one core in LF-mode present? / Cores in sleep-mode present? / Cores in sleep-mode or OFF-mode present? / Average workload > MAX2 for more than T sec and cores in LF-mode with temperature < TEMPmax present? • Actions: Among those, choose the core with the highest temperature for transition to LF-mode (overheated HF cores) / Among those, choose the core with the highest temperature for transition to LF-mode, resp. sleep-mode, resp. OFF-mode / Among those, choose the core with the lowest temperature for transition to HF-mode / Keep current configuration
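The following is a minimal sketch of one possible reading of the low-temperature policy, not the authors' implementation. The branch order, the `Core` class, the parameter names, and in particular the wake-up step (moving a sleeping or switched-off core to LF-mode under high load) are assumptions; the flowchart text does not state them unambiguously.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

HF, LF, SLEEP, OFF = "HF", "LF", "sleep", "OFF"


@dataclass
class Core:
    name: str
    state: str
    temperature: float


def low_temperature_step(
    cores: List[Core],
    avg_workload: float,
    sustained_high: bool,  # True if workload exceeded MAX2 for more than T sec
    *,
    temp_max: float,
    wl_min: float,
    wl_max: float,
) -> Optional[Tuple[str, str]]:
    """Return (core name, new state), or None to keep the configuration."""

    def in_state(state: str) -> List[Core]:
        return [c for c in cores if c.state == state]

    def hottest(cs: List[Core]) -> Core:
        return max(cs, key=lambda c: c.temperature)

    def coolest(cs: List[Core]) -> Core:
        return min(cs, key=lambda c: c.temperature)

    # Overheated HF cores are throttled first.
    hot_hf = [c for c in in_state(HF) if c.temperature > temp_max]
    if hot_hf:
        return hottest(hot_hf).name, LF

    if avg_workload < wl_min:
        # Low load: step the hottest candidate one state down.
        if in_state(HF):
            return hottest(in_state(HF)).name, LF
        if len(in_state(LF)) > 1:
            return hottest(in_state(LF)).name, SLEEP
        if in_state(SLEEP):
            return hottest(in_state(SLEEP)).name, OFF
        return None

    if avg_workload > wl_max:
        # ASSUMPTION: the wake-up target state is not named in the source.
        idle = in_state(SLEEP) + in_state(OFF)
        if idle:
            return coolest(idle).name, LF
        cool_lf = [c for c in in_state(LF) if c.temperature < temp_max]
        if sustained_high and cool_lf:
            return coolest(cool_lf).name, HF

    return None
```

Returning a single transition per call mirrors the idea of a master that periodically re-evaluates the configuration; a real energy manager might apply several transitions per step.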
Implementation • The hypothetical temperature $T_J$ of a core is derived from its power consumption by the formula $T_J = T_A + R_\theta \cdot P$ • where $T_A$ is the ambient temperature, $R_\theta$ is the thermal resistance of the core and its cooling system, and $P$ is the power consumption.
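A small sketch of this temperature model, assuming the usual steady-state form (ambient temperature plus thermal resistance times power). The function name, the units, and the sample numbers are illustrative assumptions, not values from the presentation.

```python
def junction_temperature(t_ambient_c: float,
                         thermal_resistance_k_per_w: float,
                         power_w: float) -> float:
    """Steady-state core temperature in degrees Celsius.

    Assumes T_J = T_A + R_theta * P, with R_theta in K/W and P in W.
    """
    return t_ambient_c + thermal_resistance_k_per_w * power_w


if __name__ == "__main__":
    # Hypothetical values: 25 degC ambient, 0.5 K/W, 40 W core power.
    print(junction_temperature(25.0, 0.5, 40.0))
```

Since each PM-state implies a different power draw, evaluating this function per state gives the steady-state temperature a core would settle at under that state.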
Conclusion • This presentation described reliability-aware dynamic power management (RADPM), which targets lifespan-controlling goals. • The PM-policies presented are not final solutions for RADPM, but rather serve as a proof of concept that the long-term reliability of a multi-core chip can actually be improved • Project idea: implement one or more of these policies, test them, and research the possibility of improving one of the RADPM policies • SDVM is the leading candidate simulator for my team to use in the class project