C-Oracle: Predictive Thermal Management for Data Centers
Luiz Ramos, Ricardo Bianchini
HPCA 2008
Motivation (1/4)
• Server clusters in data centers
• Higher power densities → higher temperatures
• Expensive cooling
• Thermal emergencies
• Failed fans or air conditioners
• Poor cooling or air distribution
• Hot spots
• Brownouts
• Component reliability decreases
• Unpredictable behaviors or failures
• Can impact system performance and availability
Motivation (2/4)
• Hardware-level thermal management (TM)
• Disregards high-level information
• E.g., CPU shutdown mechanism
• Unnecessary performance loss
• Software TM policies
• More sophisticated reactions to emergencies
• E.g., reduce load on “hot server” in a data center
• Example: Freon for Internet services [ASPLOS’06]
Motivation (3/4)
• Freon
• Move load away from “hot server”
• Feedback control and admission control
[Figure: Freon architecture. Web requests reach a front-end node running the load-balancing software and admd, which distributes load (W1, W2) to Server 1 and Server 2; each server runs the server software and tempd. Annotations: Tcpu and Tdisk per server, one marked HOT!; enable reaction (reduce load); disable reaction (restore load).]
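The enable/disable annotations in the Freon figure suggest a threshold rule with hysteresis. The following is a minimal sketch of that idea only, not Freon's actual controller: the class name, the `t_high`/`t_low` parameters, and the temperature values are all hypothetical, and the feedback and admission control parts are left out.

```python
from dataclasses import dataclass


@dataclass
class HysteresisReaction:
    """Enable a reaction at or above t_high; disable it at or below t_low."""
    t_high: float          # hypothetical enable threshold, degrees Celsius
    t_low: float           # hypothetical disable threshold, degrees Celsius
    active: bool = False

    def update(self, temperature: float) -> bool:
        """Feed one temperature reading and return whether the reaction is on."""
        if not self.active and temperature >= self.t_high:
            self.active = True     # enable reaction: reduce load
        elif self.active and temperature <= self.t_low:
            self.active = False    # disable reaction: restore load
        return self.active


# Illustrative readings for one component (values are made up).
cpu = HysteresisReaction(t_high=67.0, t_low=64.0)
states = [cpu.update(t) for t in (60.0, 68.0, 66.0, 63.0)]
```

Using two thresholds instead of one keeps the reaction from toggling on every small temperature fluctuation around a single cut-off.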
Motivation (4/4)
• Need for prediction
• Single pre-defined reaction and set of parameters
• Severe: performance loss, new emergencies
• Mild: may take too long or not be effective
• Our approach
• Predict behavior of potential TM reactions online
• Select the best reaction
• C-Oracle (Celsius-Oracle)
Outline
• Motivation
• C-Oracle
• Overview
• Design
• Predictive Policies
• Experimental Results
• Related Work
• Conclusions
C-Oracle Overview
[Figure: Server 1 (HOT!) runs the server software and a TM system. The TM system asks C-Oracle: in the next 5 minutes, (a) decrease load by 25%? (b) decrease CPU frequency by 50%? C-Oracle evaluates the 25% load reduction and the 50% CPU frequency reduction and returns a summary of each prediction. Annotations: 1) prediction kept until expiration, 2) regularly checked; 1) decision algorithm, 2) check prediction status.]
C-Oracle Architecture
[Figure: C-Oracle architecture. Oracle components: Proxy (models of TM policies), Solver (machine thermal models), Oracle driver, Request handler. Labeled data flows: predicted utilizations, predicted temperatures, prediction requests, predicted temperatures and performance, real utilizations and temperatures. Each of Server 1 … Server N runs the server software, a Monitor, and Reactions + Decision Alg.]
C-Oracle Details
• Oracle’s thermal model (in Solver)
• Conservation of energy
• Newton’s law of cooling
• Power(utilization)
• Energy equivalent
• Heat capacity
• Model of TM policies (in Proxy)
• Model based on actual policy code
• Policy designer uses primitives from Proxy’s library
• E.g., read temperature, change load distribution
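The thermal-model bullets above (conservation of energy, Newton's law of cooling, power as a function of utilization, heat capacity) can be combined into a one-component simulation step. This is only a sketch under stated assumptions, not the Solver's actual model: the linear power model, the function names, and every constant (`p_idle`, `p_max`, `heat_capacity`, `k`) are hypothetical.

```python
def power_watts(utilization, p_idle=10.0, p_max=50.0):
    """Assumed linear Power(utilization) model; the wattages are made up."""
    return p_idle + (p_max - p_idle) * utilization


def step_temperature(temp, air_temp, utilization, dt=1.0,
                     heat_capacity=100.0, k=1.0):
    """Advance the component temperature by one step of dt seconds.

    Conservation of energy: the change in stored heat equals the heat
    generated minus the heat transferred to the surrounding air.
    Newton's law of cooling: the transfer rate is proportional to the
    temperature difference (k is a hypothetical coefficient in W/K).
    """
    heat_in = power_watts(utilization) * dt       # joules generated
    heat_out = k * (temp - air_temp) * dt         # joules lost to the air
    return temp + (heat_in - heat_out) / heat_capacity


def predict_temperature(temp, air_temp, utilization, seconds, dt=1.0):
    """Repeat the step to predict the temperature `seconds` into the future."""
    for _ in range(int(seconds / dt)):
        temp = step_temperature(temp, air_temp, utilization, dt)
    return temp


# Compare two hypothetical utilizations over a 5-minute horizon.
hot = predict_temperature(40.0, 25.0, utilization=1.0, seconds=300)
cool = predict_temperature(40.0, 25.0, utilization=0.75, seconds=300)
```

With these made-up constants the temperature settles at `air_temp + power / k`, so lowering utilization lowers both the trajectory and the steady state.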
Reactions and Decision Algorithm
• Base TM policies for Internet services
• Freon
• LiquidN2 (new)
• LiquidN2
• Problem with Freon: assumes control over distribution
• Services with session state (e.g., shopping cart)
• Slow down hot server (DVFS)
• Least-connections
• Feedback control and admission control
Reactions and Decision Algorithm
• Reactions = policies + parameters
• CFreon: Freon using C-Oracle
• Weak: moves load away from hot servers
• Strong: 6x weak reaction
• CLiquidN2: LiquidN2 using C-Oracle
• Weak: reduce CPU frequency of hot servers
• Strong: 4x weak reaction
• Decision algorithm
• No server shutdown
• Lowest temperature after 5 min
• No request drop
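The slide lists three decision criteria without saying how they combine. The sketch below assumes one possible reading: avoiding a server shutdown comes first, avoiding request drops second, and the lowest predicted temperature after 5 minutes breaks the remaining ties. The `Prediction` record, its field names, and the sample values are hypothetical and are not taken from C-Oracle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical summary of one predicted reaction."""
    reaction: str
    shuts_down_server: bool
    drops_requests: bool
    temp_after_5min: float   # degrees Celsius


def choose_reaction(predictions):
    """Pick a reaction using the assumed priority order.

    Tuples compare element by element and False sorts before True, so
    min() prefers no shutdown, then no drops, then the lowest temperature.
    """
    if not predictions:
        raise ValueError("no predictions to choose from")
    best = min(
        predictions,
        key=lambda p: (p.shuts_down_server, p.drops_requests, p.temp_after_5min),
    )
    return best.reaction


# Illustrative values only: the strong reaction cools more but drops requests.
candidates = [
    Prediction("weak", False, False, 66.5),
    Prediction("strong", False, True, 63.0),
]
chosen = choose_reaction(candidates)
```

Under a different reading of the slide (for example, temperature taking priority over request drops) only the order of the tuple in `key` would change.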
CFreon Results (1/3)
• Experimental setup
• Single-tier Web service (no session state)
• Workload is a sequence of peaks and valleys
• 1 front-end (LVS) and 4 servers (Apache)
• Mercury emulates temperatures [ASPLOS’06]
CFreon Results (2/3)
[Figure: plot with labels: strong reaction, weak reaction, 1st emergency, 2nd emergency, Thigh]
More effective TM and good accuracy
CFreon Results (3/3)
Accurate predictions of average utilization; avoids request drops
CLiquidN2 Results (1/3)
• Experimental setup
• Three-tier auction service
• Session state (auctions of interest) stored in the 2nd tier
• Apache (2), Tomcat (2), MySQL (1) tiers
• 1 LVS node: distributes load between tiers
• CPU has 8 DVFS steps: 2.8 GHz to 350 MHz
CLiquidN2 Results (2/3)
[Figure: plot with labels: strong reaction, weak reaction, Thigh]
CPU frequency set to 75%
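As a worked example of the DVFS numbers in the setup (8 steps, 2.8 GHz down to 350 MHz) and the 75% setting noted above, the sketch below assumes the steps are evenly spaced, which the slides do not state. The function names are hypothetical.

```python
F_MAX_MHZ = 2800   # 2.8 GHz, from the experimental setup
F_MIN_MHZ = 350    # 350 MHz, from the experimental setup
NUM_STEPS = 8      # number of DVFS steps, from the experimental setup


def dvfs_steps():
    """Return the step frequencies in MHz, assuming even spacing."""
    spacing = (F_MAX_MHZ - F_MIN_MHZ) // (NUM_STEPS - 1)   # 2450 // 7 = 350
    return [F_MIN_MHZ + i * spacing for i in range(NUM_STEPS)]


def frequency_for_fraction(fraction):
    """Return the step closest to `fraction` of the maximum frequency."""
    target = fraction * F_MAX_MHZ
    return min(dvfs_steps(), key=lambda f: abs(f - target))


steps = dvfs_steps()
# → [350, 700, 1050, 1400, 1750, 2100, 2450, 2800]
at_75_percent = frequency_for_fraction(0.75)
# → 2100
```

Under the even-spacing assumption each step is 350 MHz (12.5% of the maximum), so 75% corresponds to 2.1 GHz, the sixth step.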
Related Work
• TM policies for data centers
• For thermal emergencies
• For normal operation
• TM predictions
• Offline evaluation
• For batch systems, reducing cooling costs
• Our work
• For thermal emergencies
• Online predictions
• LiquidN2: DVFS + request distribution + admission control
Conclusions
• LiquidN2 useful for services with session state
• C-Oracle allows the prediction of TM reactions
• Predictive policies make best available decisions