Performance and Power Aware CMP Thread Allocation
Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny
Department of Electrical Engineering, Technion – Israel Institute of Technology
Thread Allocation
[Figure: a pool of threads is allocated to CMP cores, each with a private L1 cache, all sharing an L2 cache.]
Performance Power Trade-Off
• Performance maximization: use all the cores → high power consumption.
• Power minimization: use a single core → low performance.
[Figure: CMP layout showing cores, routers, and the shared cache ($).]
Performance Power Metric (PPM)
• PPM = Performance^α / Power — captures the preferred tradeoff between performance and power.
• Less Power ↔ More Performance: a smaller α favors power savings, a larger α favors performance.
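To make the metric concrete, here is a minimal Python sketch of the PPM as it is defined in the problem statement later in the deck (performance raised to α, divided by power); the function name and arguments are illustrative.

```python
def ppm(avg_thread_performance, power, alpha):
    """Performance Power Metric: performance^alpha / power.

    A larger alpha weighs performance more heavily; a smaller alpha
    (closer to 1) puts relatively more weight on power.
    """
    return avg_thread_performance ** alpha / power
```

For example, with α = 2 the metric has MIPS²-per-watt-like units, matching the axis label on the MU example plot later in the deck.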
Outline • Performance and Power Model • Thread Allocation • Numerical Results
Simplified Performance Model — Single coarse-grain multi-threaded core
• The model is an extension of Agarwal's model* for asymmetric threads.
• For simplicity we assume:
  • No sharing effect.
  • The miss rate doesn't depend on the number of threads — holds for a small number of threads and a large private cache.
  • The miss rate and the total number of memory accesses don't vary over time.
  • No context-switch overhead.
* "Performance tradeoffs in multithreaded processors", A. Agarwal, IEEE Transactions on Parallel and Distributed Systems, 1992.
Terminology — Single coarse-grain multi-threaded core
• Thread i runs δi clocks until it suffers an L1 cache miss.
• T — clocks to fetch from the shared cache: T = h·t + T_L2$, where h is the number of hops, t is the hop latency, and T_L2$ is the shared-cache access time.
[Figure: execution timeline of two threads interleaving on one core, with idle time between each cache miss and its response.]
Memory Bound Case
• Core utilization < 1: the core has idle time between a cache miss and the next ready thread.
• Each thread i gets executed every δi + T clocks, so its performance is proportional to δi / (δi + T).
[Figure: timeline of two threads with idle time between each miss and its response.]
CPU Bound Case
• Core utilization → 1: saturation.
• Each thread i is executed every Σj δj clocks, so its performance is proportional to δi / Σj δj.
[Figure: timeline with the threads running back-to-back, no idle time.]
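The two cases can be captured in a few lines. This is a sketch under the stated simplifications, restricted to symmetric threads (all threads share the same δ) and assuming the core retires one instruction per busy cycle; the function names are illustrative, not from the slides.

```python
def per_thread_throughput(n, delta, T):
    """Fraction of core cycles each of n symmetric threads executes.

    delta: clocks a thread runs before an L1 miss.
    T:     clocks to fetch from the shared cache (T = h * t + T_L2).
    Memory-bound (n*delta < delta + T): each thread runs again every
    delta + T clocks, so it gets delta / (delta + T) of the cycles and
    the core has idle time.
    CPU-bound (n*delta >= delta + T): the core is saturated; each
    thread runs every n*delta clocks and gets 1/n of the cycles.
    """
    if n * delta < delta + T:
        return delta / (delta + T)
    return 1.0 / n


def core_utilization(n, delta, T):
    """Fraction of cycles the core is busy."""
    return min(1.0, n * delta / (delta + T))


def saturation_threshold(delta, T):
    """Number of threads at which the core becomes exactly saturated."""
    return 1.0 + T / delta
```

The saturation threshold grows with T, which is why cores farther from the shared cache (more hops) saturate later, as the next plots illustrate.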
Performance Per Thread
[Figure, built over three slides: performance per thread vs. number of threads for cores 1 hop, 2 hops, and more hops away from the shared cache, with the saturation threshold and saturation region marked for each curve.]
Power Model
• Core power consumption:
  • P_active — power consumption of a fully utilized core.
  • P_idle — idle core power consumption.
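The slide names only the two end-point parameters; a common way to complete the model, assumed here for illustration, is to interpolate core power linearly in utilization and to sum over the operating cores, with unused cores contributing nothing.

```python
def core_power(utilization, p_active, p_idle):
    """Power of one operating core, interpolated linearly between the
    idle and fully-utilized end points (an assumed model)."""
    return utilization * p_active + (1.0 - utilization) * p_idle


def cmp_power(core_utilizations, p_active, p_idle):
    """Total power of the CMP; cores with zero utilization are treated
    as switched off (an assumption, not stated on the slide)."""
    return sum(core_power(u, p_active, p_idle)
               for u in core_utilizations if u > 0.0)
```

Together with ppm() and per_thread_throughput() from the earlier sketches, this is enough to score a candidate allocation.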
Outline • Performance and Power Models • Thread Allocation • Numerical Results
The Thread Allocation Problem
• Given:
  • A CMP topology composed of M identical cores.
  • P applications, each with Ti symmetric threads (1 ≤ i ≤ P).
  • α — the preferred tradeoff between performance and power.
• Find the thread allocation ni(c) (threads of application i on core c) which maximizes
  PPM = (Average Thread Performance)^α / Power,  α ≥ 1.
• For simplicity:
  1) We assume that ni(c) is continuous.
  2) The result is discretized afterwards.
Minimum Utilization (MU)
• Activating a core increases the power consumption by at least Pidle.
• To justify operating a core, a corresponding increase in performance is required.
• MU is the minimum utilization which justifies operating a core.
Minimum Utilization (MU) Calculation
• Compare the PPM values of two cases:
  • 1st case: the threads are executed by m cores in exactly threshold saturation, and the (m+1)th core's utilization equals MU.
  • 2nd case: the threads are executed by m over-saturated cores.
• MU is the utilization at which PPM(1st case) = PPM(2nd case).
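Under the earlier sketches (one instruction per busy cycle, linear core power, each core at threshold saturation contributing throughput 1.0), the equality can be solved numerically. The thread count is the same in both cases, so it cancels out of the average-performance term; the code below finds the crossing utilization by bisection. This is an illustrative reconstruction of the condition, not the authors' closed-form derivation.

```python
def minimum_utilization(m, alpha, p_active, p_idle, tol=1e-6):
    """Utilization of the (m+1)-th core at which operating it yields the
    same PPM as over-saturating the m already-operating cores."""
    def ppm_open(u):
        # m saturated cores plus one extra core at utilization u
        perf = m + u
        power = m * p_active + u * p_active + (1.0 - u) * p_idle
        return perf ** alpha / power

    ppm_oversaturated = m ** alpha / (m * p_active)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ppm_open(mid) < ppm_oversaturated:
            lo = mid   # not yet worth paying for the extra core
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, minimum_utilization(m=1, alpha=2, p_active=1.0, p_idle=0.3) returns 0.2, i.e., under these made-up power numbers the second core is worth operating only if it would be at least 20% utilized.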
Minimum Utilization (MU) Calculation — Example (m = 1)
• 1st case: all threads are executed by a single over-saturated core.
• 2nd case: the first core is in threshold saturation and the remaining threads are executed by the second core; power increases by Pidle.
• The MU of the second core is the utilization at which PPM(1st case) = PPM(2nd case).
[Figure: PPM (MIPS²/Power) vs. number of threads for the two cases.]
Minimum Utilization (MU) — Approximated Value and α Dependency
• When power is more important (smaller α): operate a core only if it's highly utilized.
• When performance is more important (larger α): operate a core even if its utilization is low.
[Figure: Minimum Utilization (%) vs. α; Less Power ↔ More Performance.]
The Thread Allocation Algorithm (ITA) — Highlights
• Iterative.
• In each iteration, the threads of the application with the highest cache miss rate are allocated to the core closest to the shared cache, up to (at most) threshold saturation.
• A core is operated only if the MU threshold is achieved.
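A compact sketch of the ITA flow described above (and detailed in the flow-chart backup slide). The data shapes, the mu and sat_threads helpers, and the round-robin spreading of leftover threads are illustrative assumptions, not taken from the slides.

```python
def ita_allocate(apps, cores, mu, sat_threads):
    """apps:  list of (app_id, num_threads, miss_rate)
    cores: core ids ordered by hop distance to the shared cache (closest first)
    mu(core):          minimum-utilization threshold for operating `core`
    sat_threads(core): thread count that puts `core` at threshold saturation
    Returns {core_id: {app_id: allocated_threads}} (possibly fractional,
    since n_i(c) is treated as continuous before discretization)."""
    allocation = {}
    # Applications are served in order of decreasing cache miss rate.
    pending = sorted(apps, key=lambda a: a[2], reverse=True)
    core_iter = iter(cores)
    core = next(core_iter)
    free = sat_threads(core)

    while pending:
        app_id, threads, miss = pending[0]
        take = min(threads, free)
        allocation.setdefault(core, {})
        allocation[core][app_id] = allocation[core].get(app_id, 0) + take
        threads -= take
        free -= take
        if threads <= 0:
            pending.pop(0)                       # application fully allocated
        else:
            pending[0] = (app_id, threads, miss)
        if free <= 0 and pending:                # core reached threshold saturation
            remaining = sum(t for _, t, _ in pending)
            nxt = next(core_iter, None)
            # Operate another core only if the leftover threads give it at least MU.
            if nxt is None or remaining < mu(nxt) * sat_threads(nxt):
                break
            core, free = nxt, sat_threads(nxt)

    # Spread whatever is left over the already-operating cores (over-saturation).
    operating = list(allocation) or [cores[0]]
    for i, (app_id, threads, _) in enumerate(pending):
        tgt = operating[i % len(operating)]
        allocation.setdefault(tgt, {})
        allocation[tgt][app_id] = allocation[tgt].get(app_id, 0) + threads
    return allocation
```

Spreading leftover threads round-robin is one of several reasonable tie-breaks; the slides only state that the remaining threads go to the already operating cores.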
Outline • Performance and Power Model • Thread Allocation Problem • Numerical Results
How Do We Evaluate the PPM of ITA?
• Compare average PPM values of:
  • ITA
  • Equal Utilization
  • Optimization algorithms
• Scenarios:
  • 2–8 cores and 2–8 applications.
  • Using the following distributions:
Equal Utilization Comparison
• Average PPM improvement of 47% for ITA over Equal Utilization.
[Figure: average PPM improvement across the (applications, cores) scenarios, annotated with the number of operated cores under each scheme, e.g., 4.7 vs. 3.6 cores at the (2,5) point and 7.2 vs. 7.9 cores at the (5,8) point.]
Comparison with Optimization Methods
• Compare against the best PPM of:
  • Constrained Nonlinear Optimization
  • Pattern Search Algorithm
  • Genetic Algorithm
• These methods were run for 10,000× longer than ITA.
Optimization Methods Comparison
• ITA achieves an average PPM improvement of 9% over the best of the optimization methods.
[Figure: PPM improvement vs. applications and cores, annotated with the number of operated cores, e.g., ITA 3.6 vs. 4.6 for the optimization methods at (2,5), ITA 7.9 vs. 8 at (5,8), and ITA 4.7 vs. 7.1 and 7.1 vs. 7.9 at two further points.]
Summary
• A tunable Performance Power Metric.
• The Minimum Utilization concept.
• An approach for low-computational-effort thread allocation on a CMP.
• Future work:
  • Extension to distributed caches: thread and data co-allocation.
  • Consideration of the sharing effect.
  • Heterogeneous CMPs.
Performance Power Metric
• Follows definitions used in logic circuit design.
• If E is the energy and t is the delay, Penzes & Martin introduced the metric E·t^α, where α becomes larger as performance becomes more important.
* "Energy-delay efficiency of VLSI computations", P.I. Penzes and A.J. Martin, 12th ACM Great Lakes Symposium on VLSI, 2002.
Minimum Utilization (MU) Calculation, Cont.
• The MU value depends on how many cores are already operating.
• For a large enough value of m, the MU value is constant.
• An approximate constant value is therefore reasonable (keep it simple).
[Figure: MU of the (m+1)th core vs. m, for α = 1, 1.5, 2, 2.5, 3.]
MU vs. Pidle/Pactive
[Figure: Minimum Utilization (%) vs. Pidle/Pactive, for α = 1, 1.2, 1.4, 1.6, 1.8, 2.]
Previous Work Neglecting the Sharing Effect
• Fedorova et al., "Chip multithreading systems need a new operating system scheduler":
  • Its goal is to highly utilize the cores.
  • Tries to pair high-IPC tasks with low-IPC tasks in order to reduce pipeline resource contention.
  • Neglects the sharing effect among threads (similar to my research).
  • Doesn't take into account the varying distances of cores from the L2 shared cache.
  • Doesn't consider the power consumption.
Discretization
• There are many discretization methods.
• We use the Histogram Specification method (from image processing).
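The deck points to histogram specification from image processing; as a simpler illustration of what result discretization has to accomplish, here is a largest-remainder rounding sketch (not the authors' method) that turns the continuous per-core thread counts of one application into integers while preserving its total thread count.

```python
def discretize_counts(continuous_counts, total_threads):
    """Round continuous per-core thread counts to integers summing to
    `total_threads` using largest-remainder rounding (a stand-in for the
    histogram-specification method mentioned on the slide)."""
    floors = [int(x) for x in continuous_counts]
    remainders = [x - f for x, f in zip(continuous_counts, floors)]
    missing = total_threads - sum(floors)
    # Hand the leftover threads to the cores with the largest fractional parts.
    by_remainder = sorted(range(len(continuous_counts)),
                          key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:missing]:
        floors[i] += 1
    return floors
```

For instance, discretize_counts([2.7, 1.6, 0.7], 5) returns [3, 1, 1].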
Results Discretization — Example
• On average, results discretization reduces the PPM value by 5%.
[Figure: number of threads per core vs. core hop distance, comparing the continuous (C) and discretized (D) allocations.]
Flow Chart
1) Initialize: current core = the core closest to the shared cache; current application = the application with the highest miss rate.
2) Allocate threads of the current application onto the current core, up to (at most) threshold saturation.
3) If all the threads of the current application were allocated:
   • Last application → finish.
   • Otherwise → current application = the unallocated application with the highest cache miss rate.
4) If the current core is at the saturation threshold:
   • Last core, or the unallocated threads would not achieve MU on the next available closer core → allocate all remaining threads over the already operating cores (over-saturation) and finish.
   • Otherwise → current core = the next available core closest to the shared cache.
5) Repeat from step 2.
Time Complexity Comparison — Optimization Methods vs. ITA
• Ratio: ITA operations / minimum of the optimization methods' operations.
• ITA consumes on average 0.01%, and at most 2.5%, of the minimum computational effort required by the optimization methods, while outperforming the best of them by 9%.
[Figure: ratio vs. applications and cores.]
• Ratio of memory-access instructions out of the total instruction mix of thread i.
• Cache miss rate of thread i.