Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips. Caitie McCaffrey, Yemi Adesanya August 2006.
“The SLAC Computing Services Group is dedicated to providing leadership and support in computing and communications to the laboratory as a whole, and to physics research, in particular” Major Concerns • Power consumption • Cooling • Monitoring
What Is My Computer Doing??? • I/O rate • CPU usage • Memory usage • Temperature • Fan speed • Load Monitoring Software -low overhead -scalable -low impact on individual machines
“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids” • Scalable, overhead increases by number of clusters not nodes • Works on multiple operating systems • Round Robin Database • Measures metrics like CPU usage, load, I/O rate, and memory usage GMOND, GMETAD, GMETRIC
Ganglia Architecture http://www.slac.stanford.edu/comp/unix/ganglia/index.html [Diagram: a node that updates the RRD and polls the clusters periodically. Cluster One (machines A, B, C): all machines know the state of the entire cluster. Cluster Two (machines 1, 2, 3, 4): machines 1 and 3 know the state of the entire cluster.]
GMETRIC Allows users to monitor additional metrics beyond the core set collected by the gmond daemon. Each metric has a: • Name • Value • Type • Units gmetric --conf=/var/ganglia/gmond.conf -nCPUTemp1 -v75 -tuint8 -uCelsius Good because it lets us be more machine specific: we can monitor temperature and fan speed
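As a rough illustration of the four fields, the sketch below builds a gmetric command line in Python using the long-form options (`--name`, `--value`, `--type`, `--units`). The helper name `build_gmetric_command` is hypothetical, and the default config path is simply the one shown on the slide; nothing here is taken from the authors' script.

```python
def build_gmetric_command(name, value, metric_type, units,
                          conf="/var/ganglia/gmond.conf"):
    """Return the argument list for a single gmetric call.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "gmetric",
        f"--conf={conf}",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
        f"--units={units}",
    ]


cmd = build_gmetric_command("CPUTemp1", 75, "uint8", "Celsius")
print(" ".join(cmd))
```

On a machine with Ganglia installed, the list could be passed to `subprocess.run(cmd, check=True)` to publish the metric.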
A little bit on hardware Noma - batch machines • Tyan Thunder LE-T motherboard • Winbond w83782d (lm_sensors compatible) • 2 Pentium III processors • Why is temperature important? • Chip specifications give a temperature range • Behavior is unpredictable outside the temperature range • Clues to weird machine behavior • Pentiums have a max temp of 77°-82° C [Photo: Tyan Thunder LE-T]
What’s a Noma? NOMA • Horse from Noma County Japan • Smallest native Japanese pony 10.1 -10.3 hands • Super rare 27 pure blood nomas left (1988) Some more machines DON COB TORI ORLOV MORAB
caitiem@noma0449 $ sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1: +1.48 V (min = +4.08 V, max = +4.08 V)
VCore 2: +1.26 V (min = +4.08 V, max = +4.08 V)
+3.3V: +3.37 V (min = +2.97 V, max = +3.63 V)
+5V: +4.97 V (min = +4.50 V, max = +5.48 V)
+12V: +12.08 V (min = +10.79 V, max = +13.11 V)
-12V: -1.03 V (min = -13.21 V, max = -10.90 V)
-5V: +2.84 V (min = -5.51 V, max = -4.51 V)
V5SB: +5.12 V (min = +4.50 V, max = +5.48 V)
VBat: +3.34 V (min = +2.70 V, max = +3.29 V)
fan1: 8231 RPM (min = 3000 RPM, div = 2)
fan2: 8333 RPM (min = 3000 RPM, div = 2)
fan3: 0 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor ALARM
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor ALARM
temp3: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor ALARM
vid: +1.450 V
alarms: Chassis intrusion detection ALARM
beep_enable: Sound alarm disabled
Perl Fills the gap between low-level languages like C and C++ and high-level languages like shell. -mostly fast -basically unlimited -good for working with text -portable Regular Expressions /^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/ matches the lines below, capturing the sensor number and the reading: temp1: +77°C (limit = +60°C) sensor = thermistor temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
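To show what the regular expression captures, here is a small sketch that applies the same pattern to a few `sensors` lines. It is written in Python rather than Perl purely for illustration; the function name `parse_temperatures` is an assumption, not part of the original script.

```python
import re

# Same pattern as on the slide: group 1 is the sensor number,
# group 2 is the reading (digits with an optional decimal part).
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)")


def parse_temperatures(sensors_output):
    """Map sensor number -> temperature for every tempN line found."""
    temps = {}
    for line in sensors_output.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


sample = """\
fan1: 8231 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
"""
print(parse_temperatures(sample))  # {1: 77.0, 2: 65.0}
```

The fan line is skipped because it does not start with `temp`, so only the two temperature readings are returned.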
Sample Time - Decreasing • Time interval = 12.15 minutes • Fri Aug 11 03:04:05 PDT 2006 • FanSpeed1 8035 • FanSpeed2 7941 • Temp 1: 77 • Change: 0 • Temp 2: 64.0 • Change: 0 • Temp 3: 64.0 • Change: 1 • Time interval = 9.8415 minutes • Fri Aug 11 03:16:15 PDT 2006 Want sample time to decrease faster when temperatures are changing faster NewTime = OldTime * Decrement ^ (Change / Trigger) *If NewTime < MinTime then NewTime = MinTime • Parameters • Trigger = 0.5 degrees • Decrement = 0.9 • MaxTime = 15 minutes • MinTime = 1 minute NewTime = 12.15 * 0.9 ^ (1 / 0.5) = 9.8415
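The decrease rule can be sketched as a short function. The function and argument names are illustrative assumptions; the default parameter values are the ones listed on the slide.

```python
def decreased_interval(old_time, change, trigger=0.5, decrement=0.9, min_time=1.0):
    """Shrink the sample interval (minutes) in proportion to the temperature change.

    new = old * decrement ** (change / trigger), never going below min_time.
    """
    new_time = old_time * decrement ** (change / trigger)
    return max(new_time, min_time)


# Slide example: a 1 degree change with a 0.5 degree trigger applies
# the 0.9 factor twice, so 12.15 minutes becomes roughly 9.8415 minutes.
print(round(decreased_interval(12.15, 1), 4))
```

Because the exponent is `change / trigger`, a larger change between samples shrinks the interval faster, and the `max` call enforces the MinTime floor.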
Sample Time – Increasing • Time interval = 12.15 minutes • Fri Aug 11 08:25:18 PDT 2006 • Found FanSpeed1 8035 • Found FanSpeed2 7941 • Temp 1: 77 • Change: 0 • Temp 2: 64.0 • Change: 0 • Temp 3: 64.0 • Change: 0 • Time interval = 13.5 minutes • Fri Aug 11 08:37:28 PDT 2006 Want sample time to increase when temperature is changing slowly or not at all *If we increase by large amounts we could miss valuable data NewTime = OldTime / Decrement • Parameters • Trigger = 0.5 degrees • Decrement = 0.9 • MaxTime = 15 minutes • MinTime = 1 minute NewTime = 12.15 / 0.9 = 13.5
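A matching sketch for the increase rule follows. The slide gives the formula and a MaxTime parameter but does not say how the cap is applied, so clamping the result to MaxTime is an assumption here, as are the function and argument names.

```python
def increased_interval(old_time, decrement=0.9, max_time=15.0):
    """Grow the sample interval (minutes) by one gentle step.

    new = old / decrement, capped at max_time (the cap is an assumption).
    """
    return min(old_time / decrement, max_time)


# Slide example: 12.15 / 0.9 is roughly 13.5 minutes.
print(round(increased_interval(12.15), 4))

# Repeated quiet samples creep up toward the 15 minute ceiling.
interval = 12.15
for _ in range(3):
    interval = increased_interval(interval)
print(interval)
```

Dividing by 0.9 lengthens the interval by about 11% per quiet sample, which keeps the step small enough that a sudden temperature change is not missed for long.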
[Figure panels: noma0450, noma0449]
Up and running on two Nomas currently • Noma0449 • Noma0450 Will be installed on all Nomas. Can be used on any Ganglia-monitored machine with a compatible Winbond chip. Acknowledgements Many thanks to the DOE, the SCCS systems group, and especially Yemi Adesanya, John Goebel, & Karl Amrhein for all their help throughout the summer.
Smartmontools for SCSI devices • Command: smartctl -l error /dev/sda
Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
Non-medium error count: 0
http://smartmontools.sourceforge.net/smartmontools_scsi.html
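One way to pull the numbers out of this log is sketched below. The column labels in `COLUMNS` are shorthand invented for this example (they follow the order of the seven columns in the output above), and `parse_error_counter_log` is a hypothetical helper, not part of smartmontools.

```python
# Shorthand labels (assumed names) for the seven numeric columns, left to right.
COLUMNS = (
    "minor_delay_corrected",
    "major_delay_corrected",
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)


def parse_error_counter_log(text):
    """Return {'read': {...}, 'write': {...}} from the error counter log text."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(COLUMNS) + 1 and fields[0] in ("read:", "write:"):
            values = [float(v) if "." in v else int(v) for v in fields[1:]]
            rows[fields[0].rstrip(":")] = dict(zip(COLUMNS, values))
    return rows


sample = """\
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
"""
rows = parse_error_counter_log(sample)
print(rows["read"]["minor_delay_corrected"], rows["write"]["gigabytes_processed"])
```

Header lines are ignored because they do not begin with `read:` or `write:`, so the same function can be fed the full `smartctl -l error` output.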
Corrected Errors • Minor/ Fast • Correction algorithm works successfully • No delay to reading later sectors • These are ok • Major / Slow • Correction algorithm works successfully • Delay in reading later sectors • Not so good • Uncorrected Errors • Correction algorithm fails • Very Bad
Other Information • Total [rereads/rewrites] – errors corrected by applying retries • Total errors corrected – number of all correctable errors • Correction Algorithm Invocation – number of times algorithm is used • Gigabytes Processed – number of bytes successfully and unsuccessfully read or written
This indicates there might be a problem. This should be a flag as well. This is OK: it's correcting the errors and not losing any time doing so.
errorsWatch Monitors • Read Uncorrected Errors • Read Delayed Errors • Read No Delay Errors • Write Uncorrected Errors • Write Delayed Errors • Write No Delay Errors • Total Uncorrected Errors • Total Delayed Errors -Noma -Don -Tori -Cob -Morab -Orlov Collects Data Once a Day
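The eight monitored values can plausibly be derived from the read and write rows of the error counter log. The sketch below assumes that "No Delay" corresponds to the minor/fast column and "Delayed" to the major/slow column, and that the totals are the sum of the read and write counts; the slides do not spell this mapping out, and the function and key names are hypothetical.

```python
def errors_watch_metrics(read_row, write_row):
    """Build the eight errorsWatch-style metrics from two parsed log rows.

    Each row is a dict with 'minor_delay', 'major_delay' and 'uncorrected'
    counts (assumed key names).
    """
    metrics = {}
    for label, row in (("Read", read_row), ("Write", write_row)):
        metrics[f"{label} Uncorrected Errors"] = row["uncorrected"]
        metrics[f"{label} Delayed Errors"] = row["major_delay"]
        metrics[f"{label} No Delay Errors"] = row["minor_delay"]
    metrics["Total Uncorrected Errors"] = read_row["uncorrected"] + write_row["uncorrected"]
    metrics["Total Delayed Errors"] = read_row["major_delay"] + write_row["major_delay"]
    return metrics


# Values from the smartctl example earlier in the deck.
read_row = {"minor_delay": 234237, "major_delay": 0, "uncorrected": 0}
write_row = {"minor_delay": 0, "major_delay": 0, "uncorrected": 0}
for name, value in errors_watch_metrics(read_row, write_row).items():
    print(f"{name}: {value}")
```

Each name/value pair could then be published once a day as its own metric.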