Improving Performance, Power, and Thermal Efficiency in High-End Systems
Kirk W. Cameron
Scalable Performance Laboratory, Department of Computer Science and Engineering, Virginia Tech
cameron@cs.vt.edu

[Figure: Normalized energy and delay with CPU MISER for FT.C.8. Normalized delay and normalized energy (0.00 to 1.20) for the default cpuspeed daemon (auto), fixed frequency settings of 600, 800, 1000, 1200, and 1400, and CPU MISER.]

Introduction

Problem Statement
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of components in close proximity to one another will result in: 1) an inability to sustain performance improvements, and 2) exorbitant infrastructure and operational costs for power and cooling.

Performance, Power, and Thermal Facts
• The gap between peak and achieved performance is growing.
• A 5-megawatt supercomputer can consume $4M in energy annually.
• In just 2 hours, the Earth Simulator can produce enough heat to warm a midwestern home all winter long.

Projections
• Commodity components fail at an annual rate of 2-3%; a petaflop system of ~12,000 nodes (CPU, NIC, DRAM, disk) will sustain a hardware failure once every 24 hours.
• The life expectancy of an electronic component decreases 50% for every 10°C (18°F) temperature increase (see the half-life example below).

Relevant approaches to the problem
• Improving performance efficiencies: a myriad of tools and modeling techniques exist to analyze and optimize the performance of parallel scientific applications. In our work we focus on fast analytical modeling techniques to optimize emergent architectures such as the IBM Cell Broadband Engine.
• Improving power efficiencies: exploit application "slack times" to operate various components in lower power modes (e.g., dynamic voltage and frequency scaling, DVFS) to conserve power and energy. Prior to our work, no framework existed for profiling the performance and power of parallel systems and applications.
• Improving thermal efficiencies: exploit application "slack times" to operate various components in lower power (and thermal) modes to reduce the heat emitted by the system. Prior to our work, no framework existed for profiling the performance and thermals of parallel systems and applications.

There are no comprehensive, holistic studies of performance, power, and thermals on distributed scientific systems and workloads. Without innovation, future HEC systems will waste performance potential, waste energy, and require extravagant cooling.

Power Efficiency

PowerPack II Software
• Power profiling API library – synchronized power profiling of parallel applications.
• Power control API library – synchronized DVS control within a parallel application.
• Multimeter middleware – coordinates data from multiple meter sources.
• Power analyzer middleware – sorts, sifts, analyzes, and correlates profiling data.
• Performance profiler – uses common utilities to poll system performance status.

Distributed Power Profiles
NAS codes exhibit regularity (e.g., FT on 4 nodes) that reflects algorithm behavior. Intensive use of memory corresponds to decreases in CPU power and increases in memory power use. Power consumption can vary by node for a single application, with the number of nodes under a fixed workload, and with varied workloads under a fixed number of nodes. Results often correlate with the communication/computation ratio.

Data Collection System
[Figure: Power measurement setup. Multimeters and sense resistors are attached to a node under test in the 8-node Dori cluster and connected via RS232/GBIC and Ethernet, through an Ethernet switch, to the data collection system.]
Component power is computed from the voltage drop across a series sense resistor: P_component = (V_S - V_R) · V_R / R, where V_S is the voltage on the supply side of the resistor, V_R is the voltage at the component, and R is the sense resistance (a small numerical sketch follows this panel).
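As a minimal numerical sketch of the sense-resistor arithmetic above (the readings and resistance are placeholders, not measurements from this work, and PowerPack's actual meter interface is not shown):

```python
# Component power from a series sense resistor, per the formula above:
# the current through the resistor is (V_S - V_R) / R, the component sees V_R,
# so P_component = (V_S - V_R) * V_R / R.
def component_power(v_supply, v_component, r_sense):
    """Power (watts) drawn by the component behind the sense resistor."""
    current = (v_supply - v_component) / r_sense   # amps through the resistor
    return v_component * current                   # watts consumed by the component

# Placeholder readings for one sample on one power rail (not measured data):
print(component_power(v_supply=12.05, v_component=11.95, r_sense=0.01))  # ~119.5 W
```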
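The 50%-per-10°C rule quoted under Projections can be written as a simple half-life relation. This is a back-of-the-envelope formalization of that rule of thumb, not an equation taken from the poster:

```latex
L(T) \approx L_0 \cdot 2^{-(T - T_0)/10^{\circ}\mathrm{C}}
```

Here L_0 is the expected lifetime at a reference temperature T_0; a component operated 10°C above T_0 has its expected lifetime halved, which is why the temperature reductions reported later in this poster matter for reliability.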
Our Approach
Observations:
• Predictive models and techniques are needed to maximize performance of emergent systems.
• Additional below-peak performance may provide adequate "slack times" for improved power and thermal efficiencies.
Constraint: performance is the critical constraint; reduce power and thermals ONLY if doing so does not reduce performance significantly.

Our Contributions
• A portable framework to profile, analyze, and optimize distributed applications for performance, power, and thermals with minimal performance impact.
• Performance-power-thermal tradeoff studies and optimizations of scientific workloads on various architectures.

Reducing Energy Consumption
CPU MISER uses dynamic voltage and frequency scaling (DVFS) to lower average processor power consumption. With the default cpuspeed daemon (auto) or any fixed lower frequency, performance loss is common; CPU MISER reduces energy consumption without reducing performance significantly (see the FT.C.8 figure above). Memory MISER uses power-scalable DRAM to lower average memory power consumption by turning off memory DIMMs based on memory use and allocation; in its profile plot, the top curve shows the amount of online memory and the bottom curve shows actual demand. CPU MISER and Memory MISER are both capable of 30% total-system energy savings with less than 1% performance loss.

Performance Efficiency

Optimizing Heterogeneous Multicore Systems
We use a variation of the lognP performance model to predict the cost of various process and data placement configurations at runtime. Using the performance model, we can schedule process and data placement optimally for a heterogeneous multicore architecture. Results on the IBM Cell Broadband Engine show that dynamic multicore scheduling driven by analytical modeling is a viable, accurate technique for improving performance efficiency. Portions of this work were accomplished in collaboration with the Pearl Laboratory led by Prof. D. Nikolopoulos.

The model (MMGP in the validation results below) separates time on the host processing unit (HPU) from time offloaded to accelerator processing units (APUs); a small evaluation sketch appears after the Thermal Efficiency panel below:
• Single APU: T_APU = T_APUp + C_APU, where T_APUp is the APU part that can be parallelized and C_APU is the APU sequential part.
• Multiple APUs: T_APU(1,p) = T_APU(1,1)/p + C_APU, where p is the number of APUs, T_APU(1,1) is the offloaded time for 1 APU, and T_APU(1,p) is the offloaded time for p APUs; hence T = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p·g.
• HPU time for one iteration: T_HPU(m,1) = a_m · T_HPU(1,1) + T_CSW + O_col.
• Time for a single iteration: T_i = T_HPU + T_APU + O_offload, with off-loaded overhead O_offload = O_r + O_s.
• Overall: T(m,p) = T_HPU(m,p) + T_APU(m,p) + O_offload + p·g, and total time T = Σ_i (T_HPU,i + T_APU,i + O_offload,i).

Thermal Efficiency

Thermal-performance tradeoffs are studied using Tempest, with DVFS strategies applied to reduce temperature in parallel scientific applications. Tempest's profiling techniques are automatic, accurate, and portable.
[Figures: detailed thermal profile of FT (Class C, NP=4); thermal regulation of FT (Class C, NP=4); thermal regulation of IS (Class C, NP=4); Tempest software architecture.]

Distributed Thermal Profiles
A thermal profile of FT reveals thermal patterns corresponding to code phases: floating-point-intensive phases run hot, while memory-bound phases run cooler. Significant temperature drops also occur over very short periods of time. The thermal behavior of BT (not pictured) shows that temperatures synchronize with workload behavior across nodes, and that some nodes trend hotter than others. All of this data was obtained using Tempest.

Thermal Regulation
The Tempest controller constrains temperature to within a threshold. Since the controller is heuristic, the temperature can exceed the threshold; however, temperature is typically controlled well using DVFS within a node. The weighted importance of thermals, performance, and energy can determine the "best" operating point over a number of nodes.
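The sketch below is a minimal illustration of this kind of threshold-based DVFS regulation, not Tempest's actual implementation. It assumes root access, the Linux "userspace" cpufreq governor, and the sysfs files named in the code, all of which vary by kernel and cpufreq driver; CPU MISER drives the same frequency-scaling mechanism, but keyed to predicted workload slack rather than temperature.

```python
# Threshold-based DVFS thermal regulation sketch (illustrative only; Tempest's
# controller is not reproduced here). Assumes the 'userspace' cpufreq governor
# and acpi-cpufreq-style sysfs files, which differ across kernels and drivers.
import time

CPUFREQ = "/sys/devices/system/cpu/cpu0/cpufreq"
THERMAL = "/sys/class/thermal/thermal_zone0/temp"   # millidegrees Celsius

THRESHOLD_C = 65.0   # keep the CPU near or below this temperature
HYSTERESIS_C = 3.0   # margin before stepping the frequency back up

def read_temp_c():
    with open(THERMAL) as f:
        return int(f.read()) / 1000.0

def available_freqs_khz():
    with open(f"{CPUFREQ}/scaling_available_frequencies") as f:
        return sorted(int(x) for x in f.read().split())

def set_freq_khz(khz):
    with open(f"{CPUFREQ}/scaling_setspeed", "w") as f:
        f.write(str(khz))

def regulate(period_s=1.0):
    freqs = available_freqs_khz()
    level = len(freqs) - 1                 # start at the highest frequency
    while True:
        t = read_temp_c()
        if t > THRESHOLD_C and level > 0:
            level -= 1                     # too hot: step frequency (and voltage) down
        elif t < THRESHOLD_C - HYSTERESIS_C and level < len(freqs) - 1:
            level += 1                     # cool enough: recover performance
        set_freq_khz(freqs[level])
        time.sleep(period_s)
```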
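Why stepping the frequency down in a loop like the one above reduces both power and heat, and why CPU MISER can save energy with little delay, follows from the standard CMOS dynamic-power relation (general background, not a formula from the poster):

```latex
P_{\mathrm{dyn}} \approx C V^2 f, \qquad E = \int P\,dt
```

Because supply voltage can be lowered along with frequency, power falls faster than linearly in f; during memory- or communication-bound slack the runtime barely grows, so the energy integral shrinks, which is what the FT.C.8 results above show for CPU MISER.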
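Returning to the model equations under Performance Efficiency above, the sketch below simply evaluates the predicted per-iteration time and picks the APU count with the smallest prediction. The parameter values are placeholders, not fitted values from the poster; in practice they are determined from a short sampling run, as described for PBPI below.

```python
# Evaluate an MMGP-style runtime prediction from the terms listed above.
# a_m captures the dependence on the number of HPU processes m; all values
# below are placeholders, not parameters reported in the poster.
def t_apu(t_apu_11, c_apu, p):
    """Offloaded (APU) time on p accelerators: the parallel part scales, C_APU does not."""
    return t_apu_11 / p + c_apu

def t_hpu(t_hpu_11, a_m, t_csw, o_col):
    """Host (HPU) time for one iteration: a_m-scaled base time plus the T_CSW and O_col terms."""
    return a_m * t_hpu_11 + t_csw + o_col

def predict(p, params):
    """Predicted time per iteration: HPU time + APU time + offload overhead + p*g."""
    return (t_hpu(params["t_hpu_11"], params["a_m"], params["t_csw"], params["o_col"])
            + t_apu(params["t_apu_11"], params["c_apu"], p)
            + params["o_offload"] + p * params["g"])

# Example: choose the number of APUs (e.g., Cell SPEs) that minimizes predicted time.
params = dict(t_hpu_11=4.0, a_m=1.1, t_csw=0.05, o_col=0.10,
              t_apu_11=12.0, c_apu=0.5, o_offload=0.2, g=0.01)
best_p = min(range(1, 9), key=lambda p: predict(p, params))
print(best_p, predict(best_p, params))
```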
[Figures: performance analysis of the NAS parallel benchmarks; average CPU temperature for various NAS PB codes.]

Thermal-aware Performance Impact
The performance impact of our thermal-aware DVFS controller is less than 10% for all the NAS PB codes measured. Nonetheless, we commonly reduce operating temperature by nearly 10°C (18°F), which translates to a 50% reliability improvement in some cases. On average, we reduce operating temperature by 5-7°C.

CPU Impact on Thermals
For floating-point-intensive codes (e.g., SP, FT, and EP from NAS), the CPU is a large consumer of power under load and dissipates significant heat. Energy optimizations that substantially reduce CPU heat should therefore significantly lower total system temperature.

Temperature-Performance Tradeoffs
Thermal optimizations are achieved with minimal performance impact.

MMGP Validation: Parallel Bayesian Phylogenetic Inference (PBPI)
• Application: Parallel Bayesian Phylogenetic Inference. Dataset: 107 sequences, each 10,000 nucleotides; 20,000 generations.
• MMGP mean error 3.2%, standard deviation 2.6, maximum error 10%.
• PBPI executes a sampling phase at the beginning of execution; MMGP parameters are determined during the sampling phase, and execution is restarted after the sampling phase with the MMGP-selected configuration.
• PBPI with the sampling phase outperforms other configurations by 1% to 4x; the sampling-phase overhead is 2.5%.

Download Tempest
Tempest is available for download from http://sourceforge.net . Related papers can be found at http://scape.cs.vt.edu .

This work was sponsored in part by the Department of Energy Office of Science Early Career Principal Investigator (ECPI) Program under grant number DOE DE-FG02-04ER25608.