Using Kernel Instrumentation to Improve CPU Scaling Strategy Tyler Bletsch CSC890 11 May 2006
Introduction • Demand for speed dominates computer development • However, energy efficiency is important for: • Mobile computing: Limited battery life • High-performance computing: Limited power supply or cooling capacity • Eric Schmidt, CEO of Google, says: • “what matters most to...Google is not speed but power — low power, because data centers can consume as much electricity as a city.” [5]
How to limit power utilization? • We can adjust the CPU frequency and voltage together in real time • The power draw of a CPU is given by [1]: P = A·C·f·V² • P: Power (W) • A: A measure of CPU activity • C: Capacitance (constant for a given CPU) • f: Operating frequency • V: Operating voltage • Therefore, if we scale down f and V together: • Power use falls cubically • Execution time increases linearly
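The cubic claim can be checked with a few lines of arithmetic. The sketch below assumes that voltage is scaled in the same proportion as frequency (real processors only approximate this) and that A and C stay fixed; the function names are illustrative, not from the slides.

```python
def relative_power(freq_ratio, volt_ratio=None):
    """Power relative to full speed, from P = A*C*f*V^2 with A and C held fixed."""
    if volt_ratio is None:
        # Assumption: voltage is scaled in proportion to frequency.
        volt_ratio = freq_ratio
    return freq_ratio * volt_ratio ** 2


def relative_time_cpu_bound(freq_ratio):
    """Execution time relative to full speed for a purely CPU bound task."""
    return 1.0 / freq_ratio


# Halving f and V together: one eighth of the power, twice the time.
print(relative_power(0.5), relative_time_cpu_bound(0.5))
```

Under these assumptions, a purely CPU bound task at half speed draws 1/8 of the CPU power for twice as long, so CPU energy falls to 1/4. The rest of the system keeps drawing power for the longer run, which is why the cost/benefit analysis matters.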
Cost/benefit analysis of scaling • Benefit: Scaling reduces power utilization • Cost: Scaling increases execution time • This increase is not necessarily linear... • The CPU is one source of power draw among many • Prudent to scale only if Benefit > Cost • [Figure: Full speed vs. Reduced speed]
Bottlenecks and scaling • Performance is usually limited by one factor, e.g. a task may be: • CPU bound: performance depends on the functional units of the processor • Memory bound: performance depends on getting data into and out of RAM • IO bound: performance limited by speed of disk reads/writes
Cost/benefit analysis revisited • How does time increase when we scale? • CPU bound tasks will slow proportionally • Other tasks will slow less than proportionally • Therefore: • CPU bound tasks rarely have Benefit > Cost • Other tasks often do • Intuition: Don’t narrow the bottleneck • [Figure: Scaling an IO bound task; Scaling a CPU bound task]
System Configuration • AMD Athlon64 machine scalable from 800 to 2000 MHz • External WattsUp Pro power meter measures wall power, reports to node • Daemon integrates power over time, provides a counter of energy used since boot • To test a run, we: • Find time T0 and energy E0 • Run the task • Find time T1 and energy E1 • Calculate T = T1 − T0; E = E1 − E0
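The measurement procedure amounts to sampling a clock and the energy counter before and after the task. A minimal sketch follows; `read_energy_joules` is a hypothetical callable standing in for whatever interface the daemon exposes, which the slides do not describe.

```python
import time


def measure_run(task, read_energy_joules, clock=time.monotonic):
    """Return (T, E) for one run: elapsed seconds and joules consumed.

    read_energy_joules is a hypothetical stand-in for the daemon's
    energy-since-boot counter.
    """
    t0, e0 = clock(), read_energy_joules()
    task()
    t1, e1 = clock(), read_energy_joules()
    return t1 - t0, e1 - e0
```

Passing the clock and the counter in as arguments keeps the function testable without the power meter attached.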
The time/energy tradeoff • Energy savings vs. time penalty • Is saving 10 J worth waiting 60 s? • Optimize the Energy-Delay Product [1]: EDP = Energy·Time • A task is scalable if the EDP at top gear is greater than the EDP at at least one lower gear • Goal: Predict if a task is scalable before or during execution
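The scalability test can be written directly from this definition. In the sketch below, gear 0 is taken to be the top gear (as in the categorization used later in the talk), and the energy and time figures are made up purely for illustration.

```python
def edp(energy_joules, time_seconds):
    """Energy-Delay Product: EDP = Energy * Time."""
    return energy_joules * time_seconds


def is_scalable(runs):
    """runs[g] is (energy, time) measured at gear g, with gear 0 the top gear.

    The task is scalable if any lower gear beats the top gear's EDP.
    """
    top = edp(*runs[0])
    return any(edp(e, t) < top for e, t in runs[1:])


# Illustrative numbers only (not measurements from the talk).
io_bound_like = [(1000.0, 10.0), (800.0, 11.0)]   # EDP 10000 -> 8800
cpu_bound_like = [(1000.0, 10.0), (900.0, 20.0)]  # EDP 10000 -> 18000
print(is_scalable(io_bound_like), is_scalable(cpu_bound_like))
```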
How to predict scalability? • Determine nature of system workload from • application traces • E.g., slack time in unbalanced parallel systems [2] • hardware performance counters • E.g., CPU operations per cache miss [3] • kernel statistics • Idea: Look at kernel statistics to identify IO bound tasks
Examining the resource space • Even if a task is CPU bound, it will still use other resources • The utilization of resources can be represented by a vector in resource space • We want a task set that explores the resource space • Existing benchmarks not sufficient • Cannot compare resource usage a priori • Cannot adjust resource utilization with fine granularity • Therefore, we developed flexbench
Flexbench • Allows the user to iterate microtasks (functions that use a single resource) • Each microtask does a number of work units • A “pure” run does a single microtask • A “hybrid” run does multiple microtasks • Stresses two or more resources • Example hybrid task: for (1..1000) { for (1..50) CPU(); for (1..10) IOWriteRandom(); }
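The hybrid example can be sketched in Python as follows. This is not flexbench itself: the two microtask bodies are guesses at what a CPU work unit and a random-write work unit might look like, and every name below is hypothetical.

```python
import os
import random
import tempfile


def cpu_unit(n=1000):
    """One CPU work unit: a small arithmetic loop (assumed workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_write_random_unit(fd, file_size, block=b"x" * 4096):
    """One IO work unit: write a block at a random offset and flush it to disk."""
    offset = random.randrange(0, file_size - len(block) + 1)
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, block)
    os.fsync(fd)


def run_hybrid(iterations, cpu_units, io_units, file_size=1 << 20):
    """Interleave CPU and IO microtasks; returns the work units done of each kind."""
    with tempfile.TemporaryFile() as f:
        f.truncate(file_size)
        fd = f.fileno()
        for _ in range(iterations):
            for _ in range(cpu_units):
                cpu_unit()
            for _ in range(io_units):
                io_write_random_unit(fd, file_size)
    return iterations * cpu_units, iterations * io_units
```

The slide's example corresponds to `run_hybrid(1000, 50, 10)`; changing the two inner counts moves the task through the resource space in fine steps.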
Methodology • Goal: Determine if kernel IO statistics can help us determine CPU scalability • Phase 1: Identify promising IO statistics • Phase 2: Test their predictive power
Phase 1: Identifying useful stats • Test “pure” (single-resource) flexbench suite • Run each test at each gear, measuring: • Time and energy • All available kernel IO data (/proc/diskstats) • Previously devised metrics (CPU Ops per Miss, CPU utilization) • Categorize each run as • “scalable” (gear > 0 has minimum EDP), or • “not scalable” (gear 0 has minimum EDP) • Informally find promising IO statistics • Develop hypothesis decision tree • Each branch is a comparison of a different variable
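One way to turn /proc/diskstats samples into an IO utilization figure is sketched below. The slides do not say which field was used; this sketch assumes the per-device "time spent doing I/Os (ms)" counter, which is the tenth statistic after the device name in the Linux format, and the function names are illustrative.

```python
def parse_io_ticks_ms(diskstats_text, device):
    """Return the 'time spent doing I/Os' counter (ms) for one device.

    Each line is: major minor name stat1 ... statN. The tenth statistic
    is assumed to be the one of interest, i.e. index 12 after splitting.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def io_utilization(before_text, after_text, device, elapsed_seconds):
    """Fraction of wall-clock time the device was busy between two samples."""
    busy_ms = parse_io_ticks_ms(after_text, device) - parse_io_ticks_ms(before_text, device)
    return busy_ms / (elapsed_seconds * 1000.0)
```

In use, the two text arguments would be the contents of /proc/diskstats read immediately before and after a run, with `elapsed_seconds` taken from the same timing used for the energy measurement.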
Phase 1 results • Statistics chosen: • IO Util: IO Utilization ratio • Ratio of time spent performing IO to real time • lgOPM: log2(CPU ops per cache miss) • Developed in [4], acts as a measure of “memory pressure” • We take the log2 for convenience: it was shown to be linearly correlated to scaling slowdown • CPU Util: CPU Utilization ratio • Ratio of time spent executing on the CPU to real time • Hypothesis decision tree had 100% accuracy • Equivalent expression: (IO Util > X) ∨ (lgOPM < Y) ∨ (CPU Util < Z)
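The equivalent expression is a three-way disjunction and can be written as a one-line predicate. The thresholds X, Y and Z are not given in this transcript, so they are left as parameters here; the values in the usage lines are placeholders, not the ones found in the study.

```python
import math


def lg_opm(cpu_ops, cache_misses):
    """lgOPM = log2(CPU ops per cache miss)."""
    return math.log2(cpu_ops / cache_misses)


def predict_scalable(io_util, lgopm, cpu_util, x, y, z):
    """(IO Util > X) or (lgOPM < Y) or (CPU Util < Z)."""
    return io_util > x or lgopm < y or cpu_util < z


# Placeholder thresholds for illustration only.
print(predict_scalable(0.6, lg_opm(1024, 1), 0.99, x=0.5, y=5.0, z=0.9))
print(predict_scalable(0.1, lg_opm(1024, 1), 0.99, x=0.5, y=5.0, z=0.9))
```

Each clause flags a task that is not CPU bound: heavy IO, heavy memory pressure, or an idle CPU.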
Phase 2: Testing predictive power • Test “hybrid” (multi-resource) suite • Collect data and categorize as before • Build machine learner to find the decision tree that best? predicts scalability • Optimizes tree structure and constant values • Run the learner on the test data... • while including IO Util, and • while ignoring IO Util • Compare the accuracy? of each tree
Best: Comparing decision trees • Each tree is assigned a score based on: • The probability P+ that we scale when we should • The probability P− of scaling when we should not • To maximize scaling opportunities and penalize inopportune scaling: Score = P+ − α·P− • α is the penalty for inopportune scaling • Such scaling costs time and often energy, so we use α > 100 to disqualify solutions with P− > 0
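A small sketch of the scoring rule follows. The formulas for the two probabilities did not survive in this transcript, so they are assumed here to be simple frequencies: P+ is the fraction of should-scale cases in which the tree scales, and P− is the fraction of should-not-scale cases in which it scales anyway. The default α = 101 is just one value satisfying α > 100.

```python
def scaling_rates(predicted, actual):
    """Return (P+, P-) from parallel lists of booleans.

    Assumed definitions: P+ = P(predict scale | scalable),
    P- = P(predict scale | not scalable).
    """
    should = [p for p, a in zip(predicted, actual) if a]
    should_not = [p for p, a in zip(predicted, actual) if not a]
    p_plus = sum(should) / len(should) if should else 0.0
    p_minus = sum(should_not) / len(should_not) if should_not else 0.0
    return p_plus, p_minus


def score(p_plus, p_minus, alpha=101.0):
    """Score = P+ - alpha * P-."""
    return p_plus - alpha * p_minus
```

Because P+ never exceeds 1, a large α pushes any tree with noticeable inopportune scaling below every tree with P− = 0.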
Accuracy: Validating trees • Scoring highly on the training data doesn’t prove the solution works in general • We randomly split data into two sets: • Training set: the machine learner’s input data • Validation set: tests the resulting solution • Applying the tree to the validation set gives validation accuracy (P+ and P– values)
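The split-and-validate step might look like the sketch below. The split fraction and seed are assumptions (the slides say only that the split is random), each run is represented as a hypothetical (features, scalable) pair, and P+ and P− are again taken to be simple frequencies.

```python
import random


def split_runs(runs, train_fraction=0.5, seed=0):
    """Randomly split runs into (training set, validation set)."""
    shuffled = list(runs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def validation_rates(tree, validation_set):
    """Apply a tree (a callable: features -> bool) and return (P+, P-)."""
    should = [tree(x) for x, scalable in validation_set if scalable]
    should_not = [tree(x) for x, scalable in validation_set if not scalable]
    p_plus = sum(should) / len(should) if should else 0.0
    p_minus = sum(should_not) / len(should_not) if should_not else 0.0
    return p_plus, p_minus
```

The learner sees only the training set; the rates reported for a tree come from applying it to the validation set.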
Phase 2 results • Learner’s decision tree matches hypothesis tree: (IO Util > X) ∨ (lgOPM < Y) ∨ (CPU Util < Z) • Effect of adding IO to learner: • Probability of missing a scaling opportunity (1 − P+) falls by a factor of 33.1% • Inopportune scaling (P−) reduced by a factor of 4.3 • [Table: Validation accuracy of the two trees] • [Table: Threshold values chosen by learner]
Conclusion • Kernel statistics such as the IO Utilization ratio show promise for predicting scalability • Using this statistic boosts the number of scaling cases detected while reducing inopportune scaling • Open questions: • How volatile are such statistics over time? • How can we find the optimal gear? • How relevant is the specific hardware platform? • Future work: • Examine effectiveness for other storage devices • E.g., solid-state, network file systems, etc. • Investigate other bottleneck subsystems • E.g. network bandwidth or latency, PCI bus, etc.
References
[1] Mark Horowitz, Thomas Indermaur, and Ricardo Gonzalez. Low-power digital design. In Symposium on Low Power Electronics, pages 8-11, October 1994.
[2] Nandini Kappiah, Vincent W. Freeh, David K. Lowenthal, and Feng Pan. Exploiting Slack Time in Power-Aware, High-Performance Programs. IEEE/ACM Supercomputing 2005 (SC|05), Seattle, WA, November 2005.
[3] Vincent W. Freeh, David K. Lowenthal, Feng Pan, and Nandini Kappiah. Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster. Principles and Practices of Parallel Programming (PPOPP), pages 164-173, June 2005.
[4] Vincent W. Freeh, Feng Pan, David K. Lowenthal, Nandini Kappiah, Rob Springer, Barry Rountree, and Mark E. Femal. Analyzing the energy-time tradeoff in high-performance computing applications. Submitted to Transactions on Parallel and Distributed Systems, 2005.
[5] John Markoff and Steve Lohr. Intel's huge bet turns iffy. New York Times Technology Section, September 29, 2002. Section 3, Page 1, Column 2.