ISCA 2004 Tutorial. Thermal Issues for Temperature-Aware Computer Systems. Saturday, June 19th, 8:00am - 5:00pm. Presenters: Kevin Skadron (skadron@cs.virginia.edu), CS Department, University of Virginia; David Brooks (dbrooks@eecs.harvard.edu), CS Department, Harvard University
ISCA 2004 Tutorial Thermal Issues for Temperature-Aware Computer Systems Saturday, June 19th 8:00am - 5:00pm
Presenters: • Kevin Skadron (skadron@cs.virginia.edu), CS Department, University of Virginia • David Brooks (dbrooks@eecs.harvard.edu), CS Department, Harvard University • Antonio Gonzalez (antonio@ac.upc.es), UPC-Barcelona and Intel Barcelona Research Center • Lev Finkelstein (lev.finkelstein@intel.com), Intel Haifa • Mircea Stan (mircea@virginia.edu), ECE Department, University of Virginia
Overview • Motivation (Kevin) and Thermal issues (Kevin): 1.5 hrs • Power modeling (David) and Thermal management (David): 1.5 hrs • Optimal DTM (Lev): 0.5 hrs • Clustering (Antonio): 1 hr • Power distribution (David): 15 min • What current chips do (Lev): 45 min • HotSpot and sensors (Kevin): 1 hr
Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)
Motivation • Power consumption: first-order design constraint • unconstrained power is a theoretical max • peak (instantaneous) power limits power delivery • sustained power limits thermal design/packaging • max sustained power: thermal “virus” • same as thermal design power • average active power and idle power limit mobile battery life, etc. • Common fallacy: equating instantaneous power with temperature • Power density is increasing exponentially • Unfortunate corollary of Moore’s Law • thermal effects become more problematic • Need power/temperature-aware computing!
Power Dissipation Source: Microprocessor Report
Effects of Technology Scaling on Power Dissipation • Feature size is scaling down: 30% • Frequency is increasing: ~2x • Area increases due to microarchitecture improvements: 25% (ideal scaling: decreases by 50%) • Active capacitance increases: at least 30% (ideal scaling: decreases by 30%) • Vdd is not scaled down at the same rate as feature size: 0-10% (ideal scaling: 30%) • Ideal scaling: $P \propto C V^2 f$ → $0.7^2 \approx 0.5$ (reduction) • Observed scaling → 2 – 2.5x increase • Power density becomes a problem! • Especially since the power density is non-uniform
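The scaling arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original tutorial: the function name `power_scale` is hypothetical, and the ideal frequency gain of 1/0.7 per generation is an assumption (the slide only states the observed ~2x).

```python
# Per-generation scaling of dynamic power, P proportional to C * V^2 * f.
# ASSUMPTION: under ideal scaling, frequency rises by 1/0.7 per generation.

def power_scale(c_scale, v_scale, f_scale):
    """Return the factor by which P = C * V^2 * f changes."""
    return c_scale * v_scale ** 2 * f_scale

# Ideal scaling: capacitance -30%, Vdd -30%, frequency +43% (assumed).
ideal = power_scale(0.7, 0.7, 1 / 0.7)

# Observed scaling from the slide: capacitance +30%, Vdd down 0-10%, frequency ~2x.
observed_low = power_scale(1.3, 0.9, 2.0)
observed_high = power_scale(1.3, 1.0, 2.0)

print(f"ideal: {ideal:.2f}x")
print(f"observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

Under these assumptions the ideal case roughly halves power per generation, while the observed parameters land a little above a 2x increase, in line with the range quoted on the slide.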
Trends in Power Density [Figure: power density (Watts/cm²) vs. process generation, 1.5 μm down to 0.07 μm, for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface] Source: “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
ITRS Projections (ITRS 2001) • These are targets • Power-density problem is still getting worse • Intel papers suggest that in the 45-75W range, cooling costs $1/W; beyond that the rate of increase goes up: $2/W, $3/W, probably more! (Borkar, IEEE Micro ’99; Gunther et al., ITJ ’01)
Leakage Power • The fraction of leakage power is increasing exponentially with each generation • Also exponentially dependent on temperature [Figure: increasing ratio across generations] Source: Sankaranarayanan et al., University of Virginia
Power-aware figures of merit • Power (P), 1/W: battery time (mobile), packaging (high-performance) • Energy (PD), MIPS/W: battery life (mobile), fundamental limits (kT) • Energy-delay (PD²), MIPS²/W: performance and low power • Energy-delay² (PD³), MIPS³/W: independent of Vdd, emphasis on performance • Power-aware ≠ low power • Similar to “old” VLSI complexity (A, AD, AD²) • None of these are appropriate for thermal • This is a problem Refs: R. Gonzalez et al., “Supply and threshold voltage scaling for low power CMOS”, JSSC, Aug. 1997; A. Martin et al., “Design of an Asynchronous MIPS R3000”, ARVLSI ’97; J. Ullman, “Computational aspects of VLSI”, CS Press, 1984
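To see how these figures of merit can rank the same designs differently, here is a small sketch. The two design points (power and delay values) are hypothetical numbers chosen for illustration, and `figures_of_merit` is not a function from the tutorial. Lower is better for every metric below, since MIPS^n/W is the inverse of P·D^n.

```python
# Compare two hypothetical designs under P, PD, PD^2 and PD^3 (lower is better).

def figures_of_merit(power_w, delay_s):
    """Return power, energy, energy-delay and energy-delay^2 for one design."""
    energy_j = power_w * delay_s
    return {
        "P": power_w,
        "PD": energy_j,
        "PD^2": energy_j * delay_s,
        "PD^3": energy_j * delay_s ** 2,
    }

# ASSUMPTION: illustrative design points, not measurements.
design_a = figures_of_merit(power_w=10.0, delay_s=1.0)  # slow, low power
design_b = figures_of_merit(power_w=20.0, delay_s=0.6)  # fast, high power

# For each metric, record which design has the lower (better) value.
winners = {m: ("A" if design_a[m] < design_b[m] else "B") for m in design_a}
print(winners)
```

With these numbers, the power and energy metrics favor the slower design while the delay-weighted metrics favor the faster one. None of the four says anything about where or when the heat is produced.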
Cooking-aware computing • Some chips rated for 100°C+
Power and temperature are BAD • and can be EVIL Source: Tom’s Hardware Guide, http://www6.tomshardware.com/cpu/01q3/010917/heatvideo-01.html
Other Costs of High Heat Flux • Some chips may already be underclocked due to thermal constraints! • (especially mobile and sealed systems)
Temporal, Spatial Variations • Temperature variation of SPEC applu over time • Hot spots increase cooling costs: must cool for the hot spot
Application Variations • Wide variation across applications • Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT) • Leakage is an especially severe problem: exponentially dependent on temperature!
Heat vs. Temperature • Different time scales • Heat: no notion of spatial locality • Does architecture have a role? Temperature-aware computing: Optimize performance subject to a temperature constraint
Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot and sensors (Kevin)
Thermal issues Temperature affects: • Circuit performance • Circuit power (leakage) • IC reliability • IC and system packaging cost • Environment
Performance and leakage Temperature affects: • Transistor threshold and mobility • Subthreshold leakage, gate leakage • Ion, Ioff, Igate, delay • ITRS: 85°C for high-performance, 110°C for embedded! [Figure: Ion and Ioff of an NMOS transistor]
Temperature-aware circuits • Robustness constraint: sets Ion/Ioff ratio • Robustness and reliability: Ion/Igate ratio • Idea: keep ratios constant with T: trade leakage for performance! Refs: Ghoshal et al., “Refrigeration Technologies…”, ISSCC 2000; Garrett et al., “T3…”, ISCAS 2001
Resulting performance • 25% - 30% extra performance (110°C to 0°C) [Figure legend: regular vs. TAC]
Reliability The Arrhenius Equation: $\mathrm{MTF} = A \cdot \exp\left(\frac{E_a}{k T}\right)$ • MTF: mean time to failure at T • A: empirical constant • Ea: activation energy • k: Boltzmann’s constant • T: absolute temperature Failure mechanisms: • Die metallization (corrosion, electromigration, contact spiking) • Oxide (charge trapping, gate oxide breakdown, hot electrons) • Device (ionic contamination, second breakdown, surface charge) • Die attach (fracture, thermal breakdown, adhesion fatigue) • Interconnect (wirebond failure, flip-chip joint failure) • Package (cracking, whisker and dendritic growth, lid seal failure) Most of the above increase with T (Arrhenius) • Notable exception: hot electrons are worse at low temperatures • More on this later
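The temperature sensitivity implied by the Arrhenius equation is easiest to see as a ratio of two lifetimes, because the empirical constant A cancels. The sketch below is illustrative only: the function name `mtf_ratio` is hypothetical, and the activation energy of 0.7 eV is an assumed value (real activation energies depend on the failure mechanism).

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K

def mtf_ratio(t_cool_c, t_hot_c, ea_ev):
    """Return MTF(t_cool) / MTF(t_hot) under the Arrhenius model.

    Temperatures are given in degrees Celsius and converted to kelvin,
    since the equation requires absolute temperature.
    """
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_cool_k - 1.0 / t_hot_k))

# ASSUMPTION: Ea = 0.7 eV, chosen only for illustration.
ratio = mtf_ratio(t_cool_c=85.0, t_hot_c=110.0, ea_ev=0.7)
print(f"Lifetime at 85 C is about {ratio:.1f}x the lifetime at 110 C")
```

Under that assumed activation energy, a 25°C reduction in operating temperature lengthens the modeled lifetime several times over.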
Packaging cost From Cray (local power generator and refrigeration)… Source: Gordon Bell, “A Seymour Cray perspective” http://www.research.microsoft.com/users/gbell/craytalk/
Packaging cost To today… • Grid computing: power plants co-located near compute farms • IBM S/390: refrigeration Source: R. R. Schmidt, B. D. Notohardjono “High-end server low temperature cooling” IBM Journal of R&D
IBM S/390 refrigeration • Complex and expensive Source: R. R. Schmidt, B. D. Notohardjono “High-end server low temperature cooling” IBM Journal of R&D
IBM S/390 processor packaging Processor subassembly: complex! C4: Controlled Collapse Chip Connection (flip-chip) Source: R. R. Schmidt, B. D. Notohardjono “High-end server low temperature cooling” IBM Journal of R&D
Intel Itanium packaging Complex and expensive (note heatpipe) Source: H. Xie et al. “Packaging the Itanium Microprocessor” Electronic Components and Technology Conference 2002
Intel Pentium 4 packaging • Simpler, but still… Source: Intel web site
Graphics Cards • Nvidia GeForce 5900 card Source: Tech-Report.com
Under/Overclocking • Some chips need to be underclocked • Especially true in constrained form factors • Try fitting this in a laptop or Gameboy! Ultra model of Gigabyte's 3D Cooler Series Source: Tom’s Hardware Guide
Apple G5 – liquid cooling • Don’t know details • Lots of people in thermal engineering community think liquid is inevitable, especially for server rooms • But others say no: • This introduces a whole new kind of leakage problem • Water and electronics don’t mix!
Environment • Environmental Protection Agency (EPA): computers account for 10% of commercial electricity consumption • This includes peripherals, possibly also manufacturing • A DOE report suggested this percentage is much lower • No consensus, but it’s still a lot • Equivalent power (with only 30% efficiency) for air conditioning • CFCs used for refrigeration • Lap burn • Fan noise
Heat mechanisms • Conduction • Convection • Radiation • Phase change • Heat storage
Conduction • Similar to electrical conduction (e.g. metals are good conductors) • Heat flow from high energy to low energy • Microscopic (vibration, adjacent molecules, electron transport) • No major displacement of molecules • Need a material: typically in solids (fluids: distance between molecules) • Typical example: thermal “slug”, spreader, heatsink Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
Conduction • Not a strong function of temperature • But for the high temperature variations on high-performance chips (30+°), it matters • Note especially Si vs. Al, Cu Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
Convection • Macroscopic (bulk transport, mix of hot and cold, energy storage) • Need material (typically in fluids, liquid, gas) • Natural vs. forced (gas or liquid) • Typical example: heatsink (fan), liquid cooling • Note that convection is profoundly affected by board layout Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
Radiation • Electromagnetic waves (can occur in vacuum) • Negligible in typical applications • Sometimes the only mechanism (e.g. in space) Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
Carnot Efficiency • Note that in all cases, heat transfer is proportional to ΔT • This is also one of the reasons energy “harvesting” in computers is probably not cost-effective • ΔT w.r.t. ambient is << 100° • For example, with a 25W processor, thermoelectric effect yields only ~50mW • Solbrekken et al, ITHERM’04 • This is also why Peltier coolers are not energy efficient • 10% eff., vs. 30% for a refrigerator
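The Carnot limit named in this slide's title gives an upper bound on how much of the waste heat could be converted back to work. The sketch below assumes an 85°C die and a 45°C ambient, values picked only for illustration; `carnot_efficiency` is a hypothetical helper name.

```python
def carnot_efficiency(t_hot_c, t_cold_c):
    """Return the Carnot efficiency 1 - Tc/Th, with temperatures in kelvin."""
    t_hot_k = t_hot_c + 273.15
    t_cold_k = t_cold_c + 273.15
    return 1.0 - t_cold_k / t_hot_k

# ASSUMPTION: 85 C heat source and 45 C ambient, for illustration only.
efficiency = carnot_efficiency(t_hot_c=85.0, t_cold_c=45.0)

# Upper bound on recoverable power from the slide's 25 W processor.
recoverable_bound_w = 25.0 * efficiency
print(f"Carnot bound: {efficiency:.1%}, at most {recoverable_bound_w:.2f} W of 25 W")
```

This is only the theoretical ceiling for the assumed temperatures. The ~50 mW thermoelectric yield quoted on the slide is far below it, which is consistent with harvesting being unattractive in practice.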
Surface-to-surface contacts • Not negligible, heat crowding • Thermal greases/epoxy (can “pump-out”) • Phase Change Films (undergo a transition from solid to semi-solid with the application of heat) Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
Phase-change Thermal solutions evolution: • Natural air cooling • Forced-air cooling • Liquid cooling • Phase change (e.g. heat pipe) • Refrigeration Phase change: a. Solid changing to a liquid: fusion, or melting; b. Liquid changing to a vapor: evaporation, also boiling; c. Vapor changing to a liquid: condensation; d. Liquid changing to a solid: crystallization, or freezing; e. Solid changing to a vapor: sublimation; f. Vapor changing to a solid: deposition.
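Thermal resistance • $\Theta = \rho \, t / A = t / (k A)$, where ρ is the thermal resistivity, k = 1/ρ is the thermal conductivity, t is the thickness, and A is the cross-sectional area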
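As a worked instance of the thermal resistance formula, the following sketch computes the resistance of a flat slab from its thickness, conductivity, and area. The slab dimensions and the conductivity of 400 W/(m·K), roughly that of copper, are assumptions for illustration, and `thermal_resistance` is a hypothetical name.

```python
def thermal_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """Return the thermal resistance t / (k * A) of a uniform slab, in K/W."""
    return thickness_m / (conductivity_w_per_mk * area_m2)

# ASSUMPTION: a 1 mm thick, 30 mm x 30 mm spreader with k = 400 W/(m*K).
theta = thermal_resistance(
    thickness_m=0.001,
    conductivity_w_per_mk=400.0,
    area_m2=0.030 * 0.030,
)
print(f"theta = {theta:.5f} K/W")

# Halving the area doubles the resistance, which is why power density matters.
theta_half_area = thermal_resistance(0.001, 400.0, 0.030 * 0.030 / 2)
```

The same function applies to each layer of a package (die, interface material, spreader, heatsink base) when one-dimensional heat flow is a reasonable approximation.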
Thermal capacitance • $C_{th} = V \cdot C_p \cdot \rho$ • ρ(Aluminum) = 2,710 kg/m³ • Cp(Aluminum) = 875 J/(kg·°C) • V = t · A = 0.000025 m³ • Cbulk = V · Cp · ρ = 59.28 J/°C
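The aluminum example on this slide can be reproduced directly. The sketch below uses only the numbers given on the slide; the function name `thermal_capacitance` is hypothetical.

```python
def thermal_capacitance(volume_m3, specific_heat_j_per_kg_c, density_kg_per_m3):
    """Return the bulk thermal capacitance V * Cp * rho, in J/degC."""
    return volume_m3 * specific_heat_j_per_kg_c * density_kg_per_m3

# Values taken from the slide (aluminum block).
c_bulk = thermal_capacitance(
    volume_m3=0.000025,
    specific_heat_j_per_kg_c=875.0,
    density_kg_per_m3=2710.0,
)
print(f"C_bulk = {c_bulk:.2f} J/degC")
```

The result agrees with the slide's 59.28 J/°C after rounding to two decimals.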
Refrigeration • “Conventional” vs. thermo-electric (TEC) • Can get T < T_amb (“negative” Rth!) • TEC: Peltier effect (can use for local cooling)
Simplistic steady-state model • All thermal transfer: R = k/A (thermal resistance is inversely proportional to area A) • Power density matters! • Ohm’s law for thermals (steady-state): V = I · R → ΔT = P · R • T_hot = P · Rth + T_amb • Ways to reduce T_hot: • reduce P (power-aware) • reduce Rth (packaging) • reduce T_amb (Alaska?) • maybe also take advantage of transients (Cth) [Figure labels: T_hot, T_amb]
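The steady-state relation T_hot = P · Rth + T_amb translates into a one-line calculation. In this sketch the power, thermal resistance, and ambient temperature are assumed values for illustration, and `steady_state_temp` is a hypothetical name.

```python
def steady_state_temp(power_w, rth_k_per_w, t_amb_c):
    """Return T_hot = P * Rth + T_amb, in degrees Celsius."""
    return power_w * rth_k_per_w + t_amb_c

# ASSUMPTION: 50 W chip, 0.8 K/W junction-to-ambient resistance, 45 C ambient.
baseline = steady_state_temp(power_w=50.0, rth_k_per_w=0.8, t_amb_c=45.0)

# The three levers listed on the slide, each applied on its own.
lower_power = steady_state_temp(40.0, 0.8, 45.0)      # reduce P
better_package = steady_state_temp(50.0, 0.6, 45.0)   # reduce Rth
cooler_ambient = steady_state_temp(50.0, 0.8, 25.0)   # reduce T_amb

print(baseline, lower_power, better_package, cooler_ambient)
```

Because this model has no thermal capacitance, it says nothing about how quickly the chip heats up, which is the fourth lever on the slide.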
Simplistic dynamic thermal model • Electrical-thermal duality: V ↔ temperature (T), I ↔ power (P), R ↔ thermal resistance (Rth), C ↔ thermal capacitance (Cth), RC ↔ time constant • KCL differential eq.: $I = C \cdot dV/dt + V/R$ • Difference eq.: $\Delta V = (I/C) \cdot \Delta t - (V/(RC)) \cdot \Delta t$ • Thermal domain: $\Delta T = (P/C_{th}) \cdot \Delta t - (T/(R_{th} C_{th})) \cdot \Delta t$, where $T = T_{hot} - T_{amb}$ • One can compute stepwise changes in temperature at any granularity at which one can get P, T, R, C [Figure labels: T_hot, T_amb]
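The stepwise computation described on this slide amounts to forward-Euler integration of a single-node RC circuit. The sketch below is one possible rendering, not the tutorial's own code: the names `simulate_rc` and `power_trace`, the component values, and the time step are all assumptions. The update subtracts the T/(Rth·Cth) term, which follows from solving the KCL equation for dT/dt.

```python
def simulate_rc(power_trace_w, rth_k_per_w, cth_j_per_k, dt_s, t_amb_c):
    """Step a lumped thermal RC model and return the temperature after each step.

    The state is the temperature rise over ambient, T = T_hot - T_amb,
    which starts at zero (chip initially at ambient temperature).
    """
    t_rise = 0.0
    temps_c = []
    for power_w in power_trace_w:
        # delta T = (P/C) * dt - (T/(R*C)) * dt
        t_rise += (power_w / cth_j_per_k - t_rise / (rth_k_per_w * cth_j_per_k)) * dt_s
        temps_c.append(t_amb_c + t_rise)
    return temps_c

# ASSUMPTION: constant 50 W, Rth = 0.8 K/W, Cth = 60 J/K (RC = 48 s), 0.1 s steps.
power_trace = [50.0] * 10000  # 1000 s of constant power
temps = simulate_rc(power_trace, rth_k_per_w=0.8, cth_j_per_k=60.0, dt_s=0.1, t_amb_c=45.0)
print(f"after {len(temps) * 0.1:.0f} s: {temps[-1]:.2f} C")
```

With constant power the temperature rises toward the steady-state value P · Rth + T_amb. The time step should stay well below the RC time constant, otherwise the forward-Euler update becomes inaccurate or unstable.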
Combined package model (steady-state) • Tj: junction temperature • Tc: case temperature • Ts: heatsink temperature • Ta: ambient temperature • What exactly is Ta? • Note: Θja is meaningless! • Θjc is better but still sketchy [Figure label: guts of the component] Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
Reliability as f(T) • Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions • But actual behavior is often not worst case • So aging occurs more slowly • This means the DTM design is over-engineered! • We can exploit this, e.g. for DTM or frequency [Figure labels: Spend, Bank]