Thermal-aware Issues in Computers. IMPACT Lab. Part A: Overview of Thermal-related Technologies.
Thermal-aware Issues in Computers IMPACT Lab
Importance of thermal management
• Cooling cost is very high:
  • Providing cool air: the cost equals the power consumed in computation
  • Bringing the cool medium (air/liquid) to the circuitry: new densities require $2/Watt of material/equipment for ICs of 40+ Watts
• Excessive heat accelerates material degradation
• Power density is only going to increase in the future
Thermal management at various levels
• Physical dimension
  • At IC level
  • At chassis/case level
  • At room level
• Software dimension
  • Firmware level
  • Operating system level
  • Middleware level
  • Application level
Figure sources: Intel, Apple, Berkeley Lab
At integrated circuit level
• Issues
  • Higher temperature → increased power leakage
  • Increased power leakage → higher temperature
  • Heat density: hot spots
• Applied solutions
  • Dynamic Voltage Scaling
  • Dynamic Frequency Scaling
  • Clock gating (“pause” mode)
• Research solutions
  • Redundant circuitry
    • Redundant “cores” [Chapparro 2004]
    • Redundant pipelines [Lim 2002]
  • Switch from one circuit to the other, either regularly or when the temperature exceeds set levels
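To see why voltage and frequency scaling are the applied solutions of choice, a minimal sketch using the textbook CMOS switching-power approximation P = C·V²·f may help. The formula is not given on the slide, and the capacitance, voltage and frequency values below are made-up illustrative numbers, not measurements.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Textbook CMOS switching-power approximation: P = C * V^2 * f.

    Ignores leakage power, which the slide notes grows with temperature.
    """
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Illustrative (assumed) operating points for a hypothetical chip.
base = dynamic_power(1e-9, 1.2, 2.0e9)       # full speed
freq_only = dynamic_power(1e-9, 1.2, 1.0e9)  # frequency halved
both = dynamic_power(1e-9, 0.9, 1.0e9)       # frequency halved, voltage lowered

print(f"frequency scaling only: {freq_only / base:.2%} of base power")
print(f"voltage + frequency:    {both / base:.2%} of base power")
```

Under this approximation, halving the frequency halves the switching power, while lowering the voltage as well reduces it further because voltage enters squared.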
At chassis/case level
• Issues
  • Fan capacity at low RPMs is not enough for the generated heat
  • Fan noise level at high RPMs is too high
• Solutions
  • Dynamic fan speed
  • CPU load balancing
  • Activity adjustments
    • Dynamic memory bandwidth scaling [Apple TN2156]
    • Dynamic FSB frequency scaling
• Layout forces air to flow in a linear fashion
Figure sources: Apple, Intel
Terms: inlets, outlets
At room level
• Solutions:
  • Pause execution of tasks
  • Turn machines off
• Performance impacts
  • Degraded performance
Figure sources: www.cix.ie; Elibo, Hong Kong
Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC
A typical data center
Figure source: Siemens
Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC
CRAC & thermal maps: knowing where the hot spots are
• Purpose
  • Know the air temperature at any 3-D point
  • Adjust CRAC operation
  • Adjust computer operation
• Obtained by
  • Strategically placed sensors
  • On-board sensors
• Predicted by
  • Thorough testing
  • CFD simulations
Thermal issues in dense computer rooms (data centers, computer clusters, data warehouses)
• Heat recirculation
  • Hot air from the equipment outlets is fed back to the equipment inlets
• Hot spots
  • An effect of heat recirculation
  • Areas in the data center with alarmingly high temperature
• Impact
  • Cooling has to be set very low to keep all inlet temperatures in the safe operating range
Courtesy: Intel Labs
Terms: heat recirculation, hot spots, inlet temperatures, outlet temperatures, redline temperature, peak temperature
Thermal management solutions
[Figure: solutions placed along two axes. Software dimension: firmware, O/S, application. Physical dimension: IC, case/chassis, room. Solutions shown: circuitry redundancy, dynamic voltage scaling, dynamic frequency scaling, fan speed scaling, CPU load balancing, thermal-aware JVM, data center job scheduling (middleware).]
Reducing heat recirculation (1)
• Heat recirculation is the only reason for increased inlet temperatures
• Without recirculation, the inlet temperatures would equal the supplied air temperature
• The peak inlet temperature defines the CRAC operational temperature
[Figure: inlet temperature distribution without cooling and with cooling; 25 °C mark]
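A minimal sketch of why the peak inlet temperature dictates the CRAC setting. It assumes that each inlet's rise above the supply temperature is caused by recirculation and stays the same when the supply temperature changes (a simplification), and it takes the 25 °C mark from the figure as the redline. The function name and the sample readings are hypothetical.

```python
def required_supply_temp(inlet_temps_c, current_supply_c, redline_c):
    """Highest CRAC supply temperature that keeps every inlet at or below redline.

    Assumption: the rise of each inlet above the supply temperature is due to
    recirculation and does not change when the supply temperature is adjusted.
    """
    rises = [t - current_supply_c for t in inlet_temps_c]
    return redline_c - max(rises)


# Hypothetical inlet readings with the CRAC supplying 15 C air.
inlets = [18.0, 21.0, 27.0, 23.0]
print(required_supply_temp(inlets, current_supply_c=15.0, redline_c=25.0))
```

Only the hottest inlet matters here: the other inlets end up colder than they need to be, which is the wasted cooling the following slides try to reduce.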
Reducing heat recirculation (2)
• First things first
  • Find its causes
  • Find ways to predict it
• What is causing it
  1. The air flow from the CRAC is not adequate to feed all inlets
  2. Imperfect layout
• Usually 1. and 2. are not adjustable once the equipment is bought and in place
• Find other ways to reduce it
Reducing heat recirculation (3)
• Other ways to reduce it
  • Find which units contribute the most to heat recirculation
  • Mitigate heat recirculation by throttling activity at its main contributors
    (contributor = equipment unit that is generating heat)
    (throttling activity = changing the jobs or how they are executed)
• How do we know how much heat each unit contributes to recirculation?
• But: how do we know how much heat each unit generates? (i.e. its power profile)
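One way to picture "finding the main contributors" is sketched below. It assumes we already know each unit's power draw and a hypothetical table `recirc`, where `recirc[a][b]` is the fraction of unit a's outlet heat that ends up at unit b's inlet. Neither the table nor the numbers come from the slides; they only illustrate the ranking step.

```python
def rank_contributors(power_w, recirc):
    """Rank units by the amount of heat (in watts) they feed into other inlets.

    power_w: {unit: power draw in watts}
    recirc:  {unit: {other_unit: fraction of this unit's heat reaching other_unit's inlet}}
    Both structures are hypothetical inputs for illustration.
    """
    contributed = {
        name: power_w[name] * sum(recirc[name].values())
        for name in power_w
    }
    return sorted(contributed.items(), key=lambda item: item[1], reverse=True)


power = {"N1": 200.0, "N2": 300.0, "N3": 250.0}
recirc = {
    "N1": {"N2": 0.05, "N3": 0.01},
    "N2": {"N1": 0.10, "N3": 0.10},
    "N3": {"N1": 0.02, "N2": 0.02},
}
for name, watts in rank_contributors(power, recirc):
    print(f"{name}: about {watts:.1f} W recirculated")
```

The units at the top of the ranking are the candidates for throttling, since moving work away from them removes the most recirculated heat.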
Reducing heat recirculation (general plan of action)
1. Assess the effect of a task on the equipment (CPU, memory, I/O)
2. Assess the heat generated by the equipment from the task
3. Assess how much of that heat is recirculated
4. Assess the inlet temperatures given the heat recirculation
If we had a mechanism like this, we could predict the effects of a running (or potentially running) job and decide its fate according to those effects.
Terms: task profile, power profile, thermal map prediction
Task profiling (1)
• Task profiling
  • Assess how much CPU utilization, memory activity, disk I/O, network traffic, etc. the application generates
• Task profiling can be done
  • Offline, by code analyzers, or
  • Online, by test runs
• Dirty (and convenient) fact about HPC (high-performance computing):
  • Incoming jobs have highly predictable profiles
Power profiling
• Power profiling
  • Assess how much heat is generated by each component (i.e. CPU, memory, disk I/O, network, etc.)
  • Assess how much power is consumed by each component (i.e. CPU, memory, disk I/O, network, etc.)
• Power profiling is usually performed offline
Example results of power profiling
• Power consumption is mainly affected by the CPU utilization
• Power consumption is linear in the CPU utilization: P = aU + b
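A minimal sketch of how the coefficients a and b in P = aU + b could be obtained from offline measurements, using an ordinary least-squares line fit. The utilization and power samples are invented for illustration; they are not the profiling results the slide refers to.

```python
def fit_linear(utilization, power_w):
    """Least-squares fit of P = a * U + b; returns (a, b)."""
    n = len(utilization)
    mean_u = sum(utilization) / n
    mean_p = sum(power_w) / n
    cov = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilization, power_w))
    var = sum((u - mean_u) ** 2 for u in utilization)
    a = cov / var
    b = mean_p - a * mean_u
    return a, b


# Made-up measurements: utilization as a fraction (0..1), power in watts.
samples_u = [0.0, 0.25, 0.5, 0.75, 1.0]
samples_p = [150.0, 175.0, 200.0, 225.0, 250.0]
a, b = fit_linear(samples_u, samples_p)
print(f"P = {a:.1f} * U + {b:.1f}")
```

Here b is the power drawn at zero utilization (idle power) and a is the additional power at full utilization.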
A simple thermal model
[Figure: a machine takes in air from the A/C and from other machines, consumes power, and sends air to the A/C and to other machines]
Effect of CPU utilization on outlet temperature
• Task profiling
  • Assess how much CPU utilization the application generates
• Outlet temperature is a function of utilization plus the inlet temperature: T_outlet = f(U) + T_inlet
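The slide does not say what f(U) looks like. One plausible form, sketched here as an assumption, combines the linear power model P = aU + b with a steady-state energy balance in which all consumed power leaves as heat in the air stream: rise = P / (ρ · c_p · airflow). The coefficients, the airflow value and the air properties are illustrative.

```python
AIR_DENSITY = 1.19          # kg/m^3, approximate value for room-temperature air
AIR_SPECIFIC_HEAT = 1005.0  # J/(kg*K), approximate value for air


def outlet_temp(utilization, inlet_c, a_w=100.0, b_w=150.0, airflow_m3s=0.05):
    """T_outlet = f(U) + T_inlet with f(U) = (a*U + b) / (rho * c_p * airflow).

    Assumptions: steady state, all consumed power is carried away by the air,
    and a_w, b_w, airflow_m3s are hypothetical values for one machine.
    """
    power_w = a_w * utilization + b_w
    rise = power_w / (AIR_DENSITY * AIR_SPECIFIC_HEAT * airflow_m3s)
    return inlet_c + rise


for u in (0.0, 0.5, 1.0):
    print(f"U = {u:.1f}: outlet is about {outlet_temp(u, inlet_c=20.0):.1f} C")
```

With this form, a hotter inlet shifts the outlet temperature up by the same amount, which is why recirculated air compounds the problem.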
Assessing recirculation for the given computational tasks
• Assessing recirculation
  • Obtain the thermal map for the given task assignment
  • Compare with offline measurements
• But we don’t need to know the temperature at every point in the air
  • Only at the inlets and the outlets
[Figure: nodes N1 to N5. Courtesy: Intel Labs]
Recirculation coefficients
• Purpose
  • Know the air temperature at any 3-D point
  • Adjust CRAC operation
  • Adjust computer operation
• Obtained by
  • Strategically placed sensors
  • On-board sensors
• Predicted by
  • Thorough testing
  • CFD simulations
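The slides do not define the recirculation coefficients, so the following is only a sketch of how such coefficients could be used once obtained (by sensors, testing or CFD). It assumes a simple linear model in which `rise_per_watt[i][j]` is the temperature rise at inlet j per watt consumed at unit i; the matrix and the power values are hypothetical.

```python
def predict_inlet_temps(supply_c, power_w, rise_per_watt):
    """Predict each inlet temperature as supply temperature plus recirculated rise.

    Assumed linear model: T_in[j] = T_supply + sum_i rise_per_watt[i][j] * P[i].
    rise_per_watt is a hypothetical coefficient matrix in K/W.
    """
    n = len(power_w)
    return [
        supply_c + sum(rise_per_watt[i][j] * power_w[i] for i in range(n))
        for j in range(n)
    ]


power = [200.0, 300.0]       # watts drawn by units 0 and 1
coefficients = [
    [0.0, 0.01],             # heat from unit 0 warms inlet 1
    [0.02, 0.0],             # heat from unit 1 warms inlet 0
]
print(predict_inlet_temps(15.0, power, coefficients))
```

With all coefficients at zero the prediction collapses to the supply temperature at every inlet, matching the earlier statement that recirculation is the only cause of raised inlet temperatures.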
How scheduling impacts cooling cost
Different demands for cooling capacity
[Figure: inlet temperature distributions without cooling and with cooling for Scheduling 1 and Scheduling 2; 25 °C mark on each]
Functional model of scheduling
• Tasks arrive at the data center
• Scheduler figures out the best placement
  • The placement that has minimal impact on peak inlet temperatures
• Assigns tasks accordingly
[Figure: Tasks → Scheduler → Task, Task]
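The placement step can be sketched as a greedy search: try the incoming task on each free node, predict the resulting peak inlet temperature, and keep the node with the lowest peak. The linear recirculation model, the coefficient matrix and all numbers are assumptions for illustration, not the scheduler the slides describe.

```python
def peak_inlet(supply_c, power_w, rise_per_watt):
    """Peak inlet temperature under an assumed linear recirculation model."""
    n = len(power_w)
    return max(
        supply_c + sum(rise_per_watt[i][j] * power_w[i] for i in range(n))
        for j in range(n)
    )


def place_task(task_power_w, node_power_w, free_nodes, supply_c, rise_per_watt):
    """Return (node, predicted peak) for the free node with the lowest peak inlet."""
    best_node, best_peak = None, float("inf")
    for node in free_nodes:
        trial = list(node_power_w)
        trial[node] += task_power_w  # hypothetically run the task on this node
        peak = peak_inlet(supply_c, trial, rise_per_watt)
        if peak < best_peak:
            best_node, best_peak = node, peak
    return best_node, best_peak


idle_power = [150.0, 150.0, 150.0]
coefficients = [                 # hypothetical K/W values
    [0.0, 0.02, 0.01],
    [0.01, 0.0, 0.02],
    [0.001, 0.001, 0.0],         # node 2 recirculates very little heat
]
node, peak = place_task(100.0, idle_power, [0, 1, 2], 15.0, coefficients)
print(f"place task on node {node}, predicted peak inlet {peak:.2f} C")
```

A lower predicted peak lets the CRAC supply warmer air, which is where the cooling-cost saving comes from.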
Architectural View
[Figure: scheduler (SLURM)]
Scheduling Algorithms
• Current work assumes incoming jobs that
  • Are identical (same profile)
  • Are long-running
• Enhance the scheduling algorithm to work with
  • Heterogeneous data centers
  • Asynchronous job arrival
  • Jobs with non-identical execution times
Scheduler Programming
• Enhance existing job management software (Moab, SLURM, etc.) to support
  • Gathering thermal data
  • Assigning jobs according to policy