ConSil is a system designed to analyze data center thermals, manage heat proactively, and promote an even temperature distribution through temperature-aware workload placement.
ConSil Jeff Chase Duke University
Collaborators • Justin Moore • received PhD in April, en route to Google. • Did this research. • Wrote this paper. • Named the system. • Something to do with “Get Smart” (?) • Did not send me slides… • Partha Ranganathan (HP) has led this work.
Context: Dynamic Thermal Management for Data Centers [Figure: data center thermal map showing CRAC units, racks, and heat build-ups; temperature scale in C]
Goals • ConSil is part of a larger system to analyze data center thermals and manage heat proactively. • Temperature-aware workload placement • “Smart cooling” • Preliminary conclusion: it is practical to reduce total energy by about 15% under “typical” conditions. • Your mileage may vary. • Other goals: • Reduce capital cost with “common case” cooling system. • Allow cluster to “burst”, but stop short of meltdown. • Improve long-term reliability and availability • Better data center design
“Green” Workload Placement Place workload intelligently to promote an even temperature distribution, given the “thermal topology” of the data center. Making Scheduling "Cool": Temperature-Aware Resource Assignment in Data Centers by Justin Moore, J. Chase, P. Ranganathan, and R. Sharma. In the 2005 USENIX Annual Technical Conference, April 2005
The Subproblem that ConSil Solves • How hot is point (x, y, z) in your data center? • Placement policies need a thermal map • Option 1: install new instrumentation • Tradeoff: $$$ vs. granularity • Option 2: use built-in sensors • But: how to derive the inlet temperatures? • If we can do that, then we can obtain a precise and accurate thermal map with low instrumentation cost.
Thermal Instrumentation • Heat sources: Qworkload • Inlet heat: Qinlet • Temperature sensors: Qobserved • Observed: Qobserved = f(Qworkload, Qinlet) • Learn: Qinlet = g(Qworkload, Qobserved)
ConSil in Context Workload measures
Learning a Model • Learn statistical model for Y from m samples of
(X1 ... Xn are attributes; each row is one sample)
Sample | X1  | X2  | ... | Xn  | Y
s1     | s11 | s12 | ... | s1n | Y1
s2     | s21 | s22 | ... | s2n | Y2
...    | ... | ... | ... | ... | ...
sm     | sm1 | sm2 | ... | smn | Ym
First Cut: Neural Nets • Infer ambient temperature from an input sample: • Last N workload measure samples (epoch E) • Internal temperature sensor readings • Use off-the-shelf FANN library • Some static (SWAG) structural choices: • Four layers of neurons • Input → hidden → hidden → output • Neurons use FANN sigmoid transform function • Train the net using FANN back-propagation to set input weights on each neuron.
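The four-layer shape on this slide can be sketched in a few lines of plain Python. This is an illustrative stand-in, not FANN's API: the helper names (`make_net`, `forward`, `to_celsius`), the random initial weights, the hidden width of twice the input width, and the temperature range used for rescaling are all assumptions, and back-propagation training is left out.

```python
import math
import random

def sigmoid(x):
    """Logistic transform; squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    # One row per neuron: n_in input weights followed by one bias term.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def make_net(n_inputs, rng):
    # Input -> hidden -> hidden -> output.
    # Hidden width of 2 * n_inputs is an assumption for this sketch.
    hidden = 2 * n_inputs
    sizes = [n_inputs, hidden, hidden, 1]
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(net, inputs):
    """Propagate one input sample through every layer; returns a value in (0, 1)."""
    activations = list(inputs)
    for layer in net:
        # zip stops at the last input weight; neuron[-1] is the bias.
        activations = [
            sigmoid(sum(w * a for w, a in zip(neuron, activations)) + neuron[-1])
            for neuron in layer
        ]
    return activations[0]

def to_celsius(unit_output, t_min=10.0, t_max=40.0):
    # Hypothetical rescaling of the sigmoid output to a temperature range.
    return t_min + unit_output * (t_max - t_min)

# Example: 4 workload samples plus 2 internal sensor readings, scaled to [0, 1].
net = make_net(6, random.Random(0))
sample = [0.2, 0.3, 0.9, 0.8, 0.55, 0.6]
estimate = to_celsius(forward(net, sample))
```

With untrained weights the estimate is meaningless; the sketch only shows how one input sample flows through the four layers to a single inferred temperature.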
Experiments with ConSil • Collected data for 12 servers in a data center. • Pick servers whose inlet temperatures are known • i.e., they have a sensor near them • 45 hours of data collected under active/varying load • Two server models (HP DL360 G3, Dell 1425) • CPU data: 1-second granularity • Temperature data: 5- or 30-second granularity • CPU utilization only • CPU uses 80% of power (225/275 watts peak) • 266 lines of FANN code
Methodology • Five-fold cross-validation (FFCV) • Divide observations into fifths • Train on one fifth, test on four • Do it for each fifth • Compute sum of squared errors (SSE) • Output: CDFs of errors • Sensitivity study • Training time • Accuracy
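The cross-validation loop on this slide might look roughly like the following. It follows the slide's wording (train on one fifth, test on the other four). The contiguous, unshuffled folds, the function names, and the trivial `fit_mean` baseline model are assumptions made to keep the sketch self-contained.

```python
def five_fold_splits(n_samples, k=5):
    """Yield (train_idx, test_idx): train on one fold, test on the other k - 1."""
    bounds = [(i * n_samples) // k for i in range(k + 1)]
    for i in range(k):
        lo, hi = bounds[i], bounds[i + 1]
        train = list(range(lo, hi))
        test = list(range(0, lo)) + list(range(hi, n_samples))
        yield train, test

def sse(predicted, actual):
    """Sum of squared errors between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def cross_validate(xs, ys, fit, k=5):
    """Run every sub-experiment; return one SSE value per fold."""
    scores = []
    for train, test in five_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        predictions = [model(xs[i]) for i in test]
        scores.append(sse(predictions, [ys[i] for i in test]))
    return scores

def fit_mean(train_x, train_y):
    # Hypothetical baseline: always predict the mean training temperature.
    mean = sum(train_y) / len(train_y)
    return lambda x: mean

# Made-up data: ten samples whose target is constant.
xs = [[float(i)] for i in range(10)]
ys = [21.0] * 10
scores = cross_validate(xs, ys, fit_mean)
```

Passing any other `fit` function, such as one that trains a neural net, reuses the same splitting and scoring code.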
ConSil: Accuracy • Accurate inference using workload and onboard data • 75% of inferred values are within 1C of actual value
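A figure such as "within 1C of actual value" is one point on the empirical CDF of absolute errors. The sketch below shows how that summary could be computed; the function names and all temperature values are made up for illustration and are not the experiment's data.

```python
def abs_errors(predicted, actual):
    """Absolute difference between each inferred and measured temperature."""
    return [abs(p - a) for p, a in zip(predicted, actual)]

def fraction_within(errors, threshold):
    """Share of samples whose absolute error is at most `threshold`."""
    return sum(1 for e in errors if e <= threshold) / len(errors)

def empirical_cdf(errors):
    """Return (error, cumulative fraction) pairs, sorted by error."""
    n = len(errors)
    return [(e, (i + 1) / n) for i, e in enumerate(sorted(errors))]

# Illustrative values only (degrees C).
predicted = [20.5, 22.0, 25.0, 19.0, 21.0]
actual = [20.0, 21.5, 23.0, 19.2, 21.0]

errors = abs_errors(predicted, actual)
within_1c = fraction_within(errors, 1.0)
cdf = empirical_cdf(errors)
```

Plotting the `cdf` pairs gives the error curve, and `fraction_within(errors, 1.0)` reads off its height at 1 C.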
Sensitivity • Time-to-train • Most significant: FFCV sub-experiment • Training time is highly data-dependent • Epoch length • Number of sensor/workload epochs • Accuracy (SSE) • Most significant: FFCV sub-experiment • Indicates not enough variation in behavior • Coarse granularity (more history) improves accuracy
ConSil in Context Workload measures
Predicting Thermal Effects • Model relationship using machine learning • Inputs: Workload data, AC settings, fan speeds • Output: Predicted thermal map • Learns from observations during normal operation • FANN neural net library • Active “burn in” may speed learning Weatherman: Automated, Online, and Predictive Thermal Mapping and Management for Data Centers by Justin Moore, J. Chase, and P. Ranganathan. Third IEEE International Conference on Autonomic Computing, June 2006.
Weatherman: Accuracy • Accurate inferences using workload and AC data • Data from validated Flovent CFD models • 92% of predicted values are within 1.0C of actual value
Summary/Conclusion • Machine learning is a useful tool for “autonomic” self-optimization. • Sense and respond • Optimizing control loops based on learned models • Neural nets don’t always suck. • Initial results suggest they work well here. • Maybe we can do better. • Need good baseline datasets for training/validation. • Variance • History
Why “ConSil”? • Cone of Silence • “Mask out” unwanted signals
The maximum number of training iterations was set to $10^5$. Each neural net contained one input layer, one output layer, and two hidden layers. Each hidden layer contained twice as many neurons as the input layer. To vary the number of recent epochs used as input, we vary the number of workload epochs (parameter $B$) and internal sensor epochs (parameter $C$) independently. Using general full factorial design analysis, we can identify which parameters have a significant effect when changed, and for which parameters we can simply select a ``reasonable'' value.
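A full factorial design runs every combination of parameter levels. The sketch below enumerates such a grid for $B$ and $C$ and derives the layer widths from the sizing rule described here. The level values, the function names, and the assumption of one workload measure and one internal sensor per epoch are hypothetical, not taken from the experiments.

```python
from itertools import product

MAX_TRAINING_ITERATIONS = 10 ** 5  # training cap stated in the text

def input_width(b, c, n_workload_measures=1, n_sensors=1):
    """Input neurons: B workload epochs plus C internal sensor epochs."""
    return b * n_workload_measures + c * n_sensors

def layer_sizes(b, c):
    """Input, two hidden layers at twice the input width, and one output."""
    n_in = input_width(b, c)
    return [n_in, 2 * n_in, 2 * n_in, 1]

def full_factorial(levels):
    """Every combination of the given levels, one dict per configuration."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[name] for name in names))]

# Hypothetical levels for the two parameters.
levels = {"B": [1, 2, 4], "C": [1, 2, 4]}
design = full_factorial(levels)
configs = [(cfg, layer_sizes(cfg["B"], cfg["C"])) for cfg in design]
```

Each entry in `configs` pairs one $(B, C)$ setting with the network shape to train. Comparing results across the grid is what lets the analysis separate parameters that matter from those where a fixed value suffices.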