Thermal-aware Issues in Computers


  1. Thermal-aware Issues in Computers IMPACT Lab

  2. Part A: Overview of Thermal-related Technologies

  3. Importance of thermal management • Cooling cost is very high: • For providing cool air: the cost equals the power consumed in computation • For bringing the cooling medium (air/liquid) to the circuitry: new densities require $2/Watt of material/equipment for ICs of 40+ Watts • Excessive heat accelerates material degradation • Power density will only increase in the future

  4. Thermal management at various levels • Physical dimension • At IC level • At chassis/case level • At room level • Software dimension • Firmware level • Operating system level • Middleware level • Application level Source: Intel Source: Apple Source: Berkeley Lab

  5. At integrated circuit level • Issues • Higher temperature → increased power leakage • Increased power leakage → higher temperature • Heat density: hot spots • Applied solutions • Dynamic Voltage Scaling • Dynamic Frequency Scaling • Clock gating (“pause” mode) • Research solutions • Redundant circuitry • Redundant “cores” [Chapparro 2004] • Redundant pipelines [Lim 2002] • Switch from one circuitry to the other, either regularly or when the temperature exceeds set levels

  6. At chassis/case level • Issues • Fan capacity at low RPMs is not enough for the generated heat • Fan noise level at high RPMs is too high • Solutions • Dynamic fan speed • CPU load balancing • Activity adjustments • Dynamic memory bandwidth scaling [Apple TN2156] • Dynamic FSB frequency scaling • Layout forces the flow of air in a linear fashion Source: Apple Source: Intel Terms: inlets, outlets

  7. At room level • Solutions: • Pause execution of tasks • Turn machines off • Performance impacts • Degraded performance Source: www.cix.ie Source: Elibo, Hong Kong Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC

  8. A typical data center Source: Siemens Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC

  9. CRAC & thermal maps: knowing where the hot spots are • Purpose • Know the air temperature at any 3-D point • Adjust CRAC operation • Adjust computer operation • Obtained by • Strategically placed sensors • On-board sensors • Predicted by • Thorough testing • CFD simulations

  10. Thermal issues in dense computer rooms (data centers, computer clusters, data warehouses) • Heat recirculation • Hot air from the equipment outlets is fed back to the equipment inlets • Hot spots • Effect of heat recirculation • Areas in the data center with alarmingly high temperature • Impact • Cooling has to be set very low to keep all inlet temperatures in the safe operating range Courtesy: Intel Labs Terms: heat recirculation, hot spots, inlet temperatures, outlet temperatures, redline temperature, peak temperature

  11. Thermal management solutions • Software dimension (axis): application, middleware, O/S, firmware • Physical dimension (axis): IC, case/chassis, room • Solutions shown: data center job scheduling (middleware), thermal-aware JVM, CPU load balancing, dynamic voltage scaling, fan speed scaling, dynamic frequency scaling, circuitry redundancy

  12. Part B: Reducing Heat Recirculation (at room level)

  13. Reducing heat recirculation (1) • Heat recirculation is the only reason for increased inlet temperatures • Without recirculation, the inlet temperatures would equal the supplied air temperature • The peak inlet temperature defines the CRAC operational temperature Figure: inlet temperature distribution without cooling and with cooling (25 °C marked)
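The last bullet of slide 13 can be illustrated with a small calculation. This is only a sketch under a stated assumption: each inlet's rise above the supply temperature stays the same when the supply temperature is shifted. The function name, the redline value, and the sample readings are hypothetical and do not come from the slides.

```python
def max_safe_supply_temperature(inlet_temps_c, t_supply_c, redline_c):
    """Highest CRAC supply temperature that keeps every inlet at or below redline.

    Assumption (not from the slides): shifting the supply temperature shifts
    every inlet temperature by the same amount, so only the peak inlet matters.
    """
    if not inlet_temps_c:
        raise ValueError("need at least one inlet reading")
    peak = max(inlet_temps_c)
    # Positive headroom lets the supply run warmer; negative forces it colder.
    headroom = redline_c - peak
    return t_supply_c + headroom


# Hypothetical readings: the hottest inlet (27 C) is 2 C over a 25 C redline,
# so the supply must drop from 15 C to 13 C.
readings = [18.0, 22.0, 27.0]
new_supply = max_safe_supply_temperature(readings, t_supply_c=15.0, redline_c=25.0)
```

Lowering the peak inlet temperature by even a degree therefore lets the whole room be supplied with warmer, cheaper air.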

  14. Reducing heat recirculation (2) • First things first • Find the causes of it • Find ways to predict it • What is causing it • 1. The air flow from the CRAC is not adequate to feed all inlets • 2. Imperfect layout • Usually causes 1 and 2 cannot be adjusted once the equipment is bought and in place • Find other ways to reduce it

  15. Reducing heat recirculation (3) • Other ways to reduce it • Find which units contribute the most to heat recirculation • Mitigate heat recirculation by throttling activity at the main contributors (contributor = equipment unit that is generating heat) (throttling activity = changing the jobs or how they are executed) • How do we know how much heat each unit contributes to recirculation? • But first: how do we know how much heat each unit generates? (i.e. its power profile)

  16. Reducing heat recirculation (general plan of action) • Assess the effect of a task on the equipment (CPU, memory, I/O) • Assess the heat generated by the equipment from the task • Assess how much of that heat is recirculated • Assess the inlet temperatures given the heat recirculation • If we had a mechanism like this, we could predict the effects of a running (or potentially running) job and decide its fate according to those effects Terms: task profile, power profile, thermal map prediction

  17. Task profiling (1) • Task profiling • Assess how much CPU utilization, memory activity, disk I/O, network traffic, etc. the application generates • Task profiling can be done • Offline, by code analyzers, or • Online, by test runs • Dirty (and convenient) fact about HPC (high-performance computing): • Incoming jobs have highly predictable profiles

  18. Power profiling • Power profiling • Assess how much heat is generated by each component (e.g. CPU, memory, disk I/O, network) • Assess how much power is consumed by each component (e.g. CPU, memory, disk I/O, network) • Power profiling is usually performed offline

  19. Example results of power profiling • Power consumption is mainly affected by the CPU utilization • Power consumption is linear in the CPU utilization: P = aU + b
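The coefficients a and b in P = aU + b can be estimated from offline measurements. The sketch below uses an ordinary least-squares fit, which is one common choice and not necessarily the method behind these slides; the function names and sample values are hypothetical.

```python
def fit_power_model(utilizations, powers_w):
    """Least-squares fit of P = a*U + b from (utilization, power) samples."""
    n = len(utilizations)
    if n < 2 or n != len(powers_w):
        raise ValueError("need at least two matching samples")
    mean_u = sum(utilizations) / n
    mean_p = sum(powers_w) / n
    var_u = sum((u - mean_u) ** 2 for u in utilizations)
    if var_u == 0:
        raise ValueError("utilization samples must not all be equal")
    cov = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilizations, powers_w))
    a = cov / var_u          # watts per unit of utilization
    b = mean_p - a * mean_u  # idle power in watts
    return a, b


def predict_power(a, b, utilization):
    """Evaluate the linear power model at a given utilization (0.0 to 1.0)."""
    return a * utilization + b


# Hypothetical offline measurements: utilization fraction vs. watts.
a, b = fit_power_model([0.0, 0.5, 1.0], [100.0, 150.0, 200.0])
```

Here b is the idle power and a + b is the power at full utilization, so two well-chosen measurements already pin down the model.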

  20. A simple thermal model • Diagram labels: from other machines, to other machines, from A/C, to A/C, power consumed

  21. Effect of CPU utilization on outlet temperature • Task profiling • Assess how much CPU utilization the application generates • Outlet temperature is a function of utilization plus the inlet temperature: T_outlet = f(U) + T_inlet
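The slides do not spell out f(U). One plausible form, offered here purely as an assumption, is a steady-state heat balance: all consumed power P = aU + b leaves as heat in the air stream, so the temperature rise is P divided by (air density × specific heat × volumetric airflow). The constants are approximate values for room-temperature air, and the function name and parameters are hypothetical.

```python
AIR_DENSITY = 1.19          # kg/m^3, approximate for room-temperature air
AIR_SPECIFIC_HEAT = 1005.0  # J/(kg*K), approximate for dry air


def outlet_temperature(t_inlet_c, utilization, a, b, airflow_m3_s):
    """T_outlet = f(U) + T_inlet with f(U) = (a*U + b) / (rho * c_p * airflow).

    Assumption (not from the slides): every watt consumed is carried away
    as heat by the air flowing through the chassis.
    """
    if airflow_m3_s <= 0:
        raise ValueError("airflow must be positive")
    power_w = a * utilization + b
    rise_c = power_w / (AIR_DENSITY * AIR_SPECIFIC_HEAT * airflow_m3_s)
    return t_inlet_c + rise_c


# Hypothetical node: 100 W idle, 200 W at full load, 0.1 m^3/s of airflow.
t_half = outlet_temperature(20.0, 0.5, a=100.0, b=100.0, airflow_m3_s=0.1)
t_full = outlet_temperature(20.0, 1.0, a=100.0, b=100.0, airflow_m3_s=0.1)
```

Under this form the outlet temperature grows linearly with utilization and shrinks as the fans move more air.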

  22. Assessing recirculation for the given computational tasks • Assessing recirculation • Obtain the thermal map for the given task assignment • Compare with offline measurements • But we don’t need to know the temperature at every point in the air • Only at the inlets and the outlets Figure (courtesy: Intel Labs): nodes N1 to N5

  23. Recirculation coefficients • Purpose • Know the air temperature at any 3-D point • Adjust CRAC operation • Adjust computer operation • Obtained by • Strategically placed sensors • On-board sensors • Predicted by • Thorough testing • CFD simulations
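The slides do not define the recirculation coefficients. A minimal sketch of how such coefficients could be used, assuming a linear model in which each inlet's rise above the supply temperature is a weighted sum of the power drawn by every node, follows. The matrix layout, units, and names are all hypothetical.

```python
def inlet_temperatures(t_supply_c, coeffs, powers_w):
    """Predict inlet temperatures from a matrix of recirculation coefficients.

    coeffs[i][j] is the assumed temperature rise (deg C per watt) at the
    inlet of node i caused by heat from node j. This linear form is an
    illustrative assumption, not a definition taken from the slides.
    """
    if any(len(row) != len(powers_w) for row in coeffs):
        raise ValueError("each coefficient row needs one entry per node")
    return [
        t_supply_c + sum(d * p for d, p in zip(row, powers_w))
        for row in coeffs
    ]


# Hypothetical two-node room: node 0's exhaust reaches node 1's inlet
# three times as strongly as the other way around.
coeffs = [[0.0, 0.01],
          [0.03, 0.0]]
temps = inlet_temperatures(15.0, coeffs, [100.0, 200.0])
```

With all coefficients set to zero the function returns the supply temperature for every inlet, which matches the statement that inlet temperatures equal the supplied air temperature when there is no recirculation.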

  24. Different demands for cooling capacity • How scheduling impacts cooling cost Figure: inlet temperature distribution without cooling and with cooling, for Scheduling 1 and Scheduling 2 (25 °C marked on each)

  25. Part C: Integrated Thermal-aware Management

  26. Functional model of scheduling • Tasks arrive at the data center • Scheduler figures out the best placement • Placement that has minimal impact on peak inlet temperatures • Assigns tasks accordingly Diagram: Tasks → Scheduler → Task, Task
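One way to read "placement that has minimal impact on peak inlet temperatures" is a greedy heuristic: try each free node for the incoming task, predict the resulting peak inlet temperature, and keep the best node. The sketch below is a hypothetical illustration, not the scheduler described in these slides. It assumes one task per node and a linear recirculation model in which `coeffs[i][j]` gives the rise in deg C per watt at inlet i due to node j.

```python
def peak_inlet(t_supply_c, coeffs, powers_w):
    """Peak inlet temperature under an assumed linear recirculation model."""
    return max(
        t_supply_c + sum(d * p for d, p in zip(row, powers_w))
        for row in coeffs
    )


def place_tasks(task_powers_w, idle_powers_w, coeffs, t_supply_c):
    """Greedy placement: each task goes to the free node that yields the
    lowest predicted peak inlet temperature. Returns one node index per task."""
    powers = list(idle_powers_w)
    free_nodes = list(range(len(powers)))
    placement = []
    for extra_w in task_powers_w:
        if not free_nodes:
            raise ValueError("more tasks than free nodes")

        def peak_if_placed(node):
            trial = list(powers)
            trial[node] += extra_w  # power added by running the task here
            return peak_inlet(t_supply_c, coeffs, trial)

        best = min(free_nodes, key=peak_if_placed)
        powers[best] += extra_w
        free_nodes.remove(best)
        placement.append(best)
    return placement


# Hypothetical room: node 0's exhaust strongly heats node 1's inlet,
# so the first task is steered to node 1 instead.
coeffs = [[0.0, 0.001],
          [0.02, 0.0]]
plan = place_tasks([100.0, 100.0], [100.0, 100.0], coeffs, t_supply_c=15.0)
```

A greedy pass like this is simple but not guaranteed to find the best overall assignment; it only shows how predicted inlet temperatures can drive the placement decision.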

  27. Architectural View • Scheduler (SLURM)

  28. Part D: Potential Term Projects

  29. Scheduling Algorithms • Current work assumes incoming jobs that • Are identical (same profile) • Are long-running • Enhance the scheduling algorithm to work with • Heterogeneous data centers • Asynchronous job arrival • Jobs with non-identical execution times

  30. Scheduler Programming • Enhance existing job management software (Moab, SLURM, etc.) to support • Gathering thermal data • Assigning jobs according to policy
