Performance and Power Aware CMP Thread Allocation
Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny
Department of Electrical Engineering, Technion – Israel Institute of Technology
Thread Allocation
[Figure: a pool of threads is allocated to CMP cores, each with a private L1 cache, all sharing an L2 cache.]
Performance Power Trade-Off
• Performance maximization: use all the cores → high power consumption.
• Power minimization: use a single core → low performance.
[Figure: CMP layout showing cores, routers, and the shared cache ($).]
Performance Power Metric (PPM)
• PPM = Performance^α / Power — captures the preferred tradeoff between performance and power.
• Less Power ↔ More Performance: a smaller α favors power savings, a larger α favors performance.
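To make the metric concrete, here is a minimal Python sketch of the PPM as it is defined in the problem statement later in the deck (performance raised to α, divided by power); the function name and arguments are illustrative.

```python
def ppm(avg_thread_performance, power, alpha):
    """Performance Power Metric: performance^alpha / power.

    A larger alpha weighs performance more heavily; a smaller alpha
    (closer to 1) puts relatively more weight on power.
    """
    return avg_thread_performance ** alpha / power
```

For example, with α = 2 the metric has MIPS²-per-watt-like units, matching the axis label on the MU example plot later in the deck.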
Outline • Performance and Power Model • Thread Allocation • Numerical Results
Simplified Performance Model — Single coarse-grain multi-threaded core
• The model is an extension of Agarwal's model* for asymmetric threads.
• For simplicity we assume:
  • No sharing effect.
  • The miss rate doesn't depend on the number of threads — holds for a small number of threads and a large private cache.
  • The miss rate and the total number of memory accesses don't vary over time.
  • No context-switch overhead.
* "Performance tradeoffs in multithreaded processors", A. Agarwal, IEEE Transactions on Parallel and Distributed Systems, 1992.
Terminology — Single coarse-grain multi-threaded core
• Thread i runs δi clocks until it suffers an L1 cache miss.
• T — clocks to fetch from the shared cache: T = h·t + T_L2$, where h is the number of hops, t is the hop latency, and T_L2$ is the shared-cache access time.
[Figure: execution timeline of two threads interleaving on one core, with idle time between each cache miss and its response.]
Memory Bound Case
• Core utilization < 1: the core has idle time between a cache miss and the next ready thread.
• Each thread i gets executed every δi + T clocks, so its performance is proportional to δi / (δi + T).
[Figure: timeline of two threads with idle time between each miss and its response.]
CPU Bound Case
• Core utilization → 1: saturation.
• Each thread i is executed every Σj δj clocks, so its performance is proportional to δi / Σj δj.
[Figure: timeline with the threads running back-to-back, no idle time.]
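The two cases can be captured in a few lines. This is a sketch under the stated simplifications, restricted to symmetric threads (all threads share the same δ) and assuming the core retires one instruction per busy cycle; the function names are illustrative, not from the slides.

```python
def per_thread_throughput(n, delta, T):
    """Fraction of core cycles each of n symmetric threads executes.

    delta: clocks a thread runs before an L1 miss.
    T:     clocks to fetch from the shared cache (T = h * t + T_L2).
    Memory-bound (n*delta < delta + T): each thread runs again every
    delta + T clocks, so it gets delta / (delta + T) of the cycles and
    the core has idle time.
    CPU-bound (n*delta >= delta + T): the core is saturated; each
    thread runs every n*delta clocks and gets 1/n of the cycles.
    """
    if n * delta < delta + T:
        return delta / (delta + T)
    return 1.0 / n


def core_utilization(n, delta, T):
    """Fraction of cycles the core is busy."""
    return min(1.0, n * delta / (delta + T))


def saturation_threshold(delta, T):
    """Number of threads at which the core becomes exactly saturated."""
    return 1.0 + T / delta
```

The saturation threshold grows with T, which is why cores farther from the shared cache (more hops) saturate later, as the next plots illustrate.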
Performance Per Thread
[Figure, built over three slides: performance per thread vs. number of threads for cores 1 hop, 2 hops, and more hops away from the shared cache, with the saturation threshold and saturation region marked for each curve.]
Power Model
• Core power consumption:
  • P_active — power consumption of a fully utilized core.
  • P_idle — idle core power consumption.
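The slide names only the two end-point parameters; a common way to complete the model, assumed here for illustration, is to interpolate core power linearly in utilization and to sum over the operating cores, with unused cores contributing nothing.

```python
def core_power(utilization, p_active, p_idle):
    """Power of one operating core, interpolated linearly between the
    idle and fully-utilized end points (an assumed model)."""
    return utilization * p_active + (1.0 - utilization) * p_idle


def cmp_power(core_utilizations, p_active, p_idle):
    """Total power of the CMP; cores with zero utilization are treated
    as switched off (an assumption, not stated on the slide)."""
    return sum(core_power(u, p_active, p_idle)
               for u in core_utilizations if u > 0.0)
```

Together with ppm() and per_thread_throughput() from the earlier sketches, this is enough to score a candidate allocation.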
Outline • Performance and Power Models • Thread Allocation • Numerical Results
The Thread Allocation Problem
• Given:
  • A CMP topology composed of M identical cores.
  • P applications, each with Ti symmetric threads (1 ≤ i ≤ P).
  • α — the preferred tradeoff between performance and power.
• Find the thread allocation ni(c) (threads of application i on core c) which maximizes
  PPM = (Average Thread Performance)^α / Power,  α ≥ 1.
• For simplicity:
  1) We assume that ni(c) is continuous.
  2) The result is discretized afterwards.
Minimum Utilization (MU)
• Activating a core increases the power consumption by at least Pidle.
• To justify operating a core, a corresponding increase in performance is required.
• MU is the minimum utilization which justifies operating a core.
Minimum Utilization (MU) Calculation
• Compare the PPM values of two cases:
  • 1st case: the threads are executed by m cores in exactly threshold saturation, and the (m+1)th core's utilization equals MU.
  • 2nd case: the threads are executed by m over-saturated cores.
• MU is the utilization at which PPM(1st case) = PPM(2nd case).
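Under the earlier sketches (one instruction per busy cycle, linear core power, each core at threshold saturation contributing throughput 1.0), the equality can be solved numerically. The thread count is the same in both cases, so it cancels out of the average-performance term; the code below finds the crossing utilization by bisection. This is an illustrative reconstruction of the condition, not the authors' closed-form derivation.

```python
def minimum_utilization(m, alpha, p_active, p_idle, tol=1e-6):
    """Utilization of the (m+1)-th core at which operating it yields the
    same PPM as over-saturating the m already-operating cores."""
    def ppm_open(u):
        # m saturated cores plus one extra core at utilization u
        perf = m + u
        power = m * p_active + u * p_active + (1.0 - u) * p_idle
        return perf ** alpha / power

    ppm_oversaturated = m ** alpha / (m * p_active)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ppm_open(mid) < ppm_oversaturated:
            lo = mid   # not yet worth paying for the extra core
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, minimum_utilization(m=1, alpha=2, p_active=1.0, p_idle=0.3) returns 0.2, i.e., under these made-up power numbers the second core is worth operating only if it would be at least 20% utilized.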
Minimum Utilization (MU) Calculation — Example (m = 1)
• 1st case: all threads are executed by a single over-saturated core.
• 2nd case: the first core is in threshold saturation and the remaining threads are executed by the second core; power increases by Pidle.
• The MU of the second core is the utilization at which PPM(1st case) = PPM(2nd case).
[Figure: PPM (MIPS²/Power) vs. number of threads for the two cases.]
Minimum Utilization (MU) — Approximated Value and α Dependency
• When power is more important (smaller α): operate a core only if it's highly utilized.
• When performance is more important (larger α): operate a core even if its utilization is low.
[Figure: Minimum Utilization (%) vs. α; Less Power ↔ More Performance.]
The Thread Allocation Algorithm (ITA) — Highlights
• Iterative.
• In each iteration, the threads of the application with the highest cache miss rate are allocated to the core closest to the shared cache, up to (at most) threshold saturation.
• A core is operated only if the MU threshold is achieved.
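A compact sketch of the ITA flow described above (and detailed in the flow-chart backup slide). The data shapes, the mu and sat_threads helpers, and the round-robin spreading of leftover threads are illustrative assumptions, not taken from the slides.

```python
def ita_allocate(apps, cores, mu, sat_threads):
    """apps:  list of (app_id, num_threads, miss_rate)
    cores: core ids ordered by hop distance to the shared cache (closest first)
    mu(core):          minimum-utilization threshold for operating `core`
    sat_threads(core): thread count that puts `core` at threshold saturation
    Returns {core_id: {app_id: allocated_threads}} (possibly fractional,
    since n_i(c) is treated as continuous before discretization)."""
    allocation = {}
    # Applications are served in order of decreasing cache miss rate.
    pending = sorted(apps, key=lambda a: a[2], reverse=True)
    core_iter = iter(cores)
    core = next(core_iter)
    free = sat_threads(core)

    while pending:
        app_id, threads, miss = pending[0]
        take = min(threads, free)
        allocation.setdefault(core, {})
        allocation[core][app_id] = allocation[core].get(app_id, 0) + take
        threads -= take
        free -= take
        if threads <= 0:
            pending.pop(0)                       # application fully allocated
        else:
            pending[0] = (app_id, threads, miss)
        if free <= 0 and pending:                # core reached threshold saturation
            remaining = sum(t for _, t, _ in pending)
            nxt = next(core_iter, None)
            # Operate another core only if the leftover threads give it at least MU.
            if nxt is None or remaining < mu(nxt) * sat_threads(nxt):
                break
            core, free = nxt, sat_threads(nxt)

    # Spread whatever is left over the already-operating cores (over-saturation).
    operating = list(allocation) or [cores[0]]
    for i, (app_id, threads, _) in enumerate(pending):
        tgt = operating[i % len(operating)]
        allocation.setdefault(tgt, {})
        allocation[tgt][app_id] = allocation[tgt].get(app_id, 0) + threads
    return allocation
```

Spreading leftover threads round-robin is one of several reasonable tie-breaks; the slides only state that the remaining threads go to the already operating cores.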
Outline • Performance and Power Model • Thread Allocation Problem • Numerical Results
How Do We Evaluate the PPM of ITA?
• Compare average PPM values of:
  • ITA
  • Equal Utilization
  • Optimization algorithms
• Scenarios:
  • 2–8 cores and 2–8 applications.
  • Using the following distributions:
Equal Utilization Comparison
• Average PPM improvement of 47% for ITA over Equal Utilization.
[Figure: average PPM improvement across the (applications, cores) scenarios, annotated with the number of operated cores under each scheme, e.g., 4.7 vs. 3.6 cores at the (2,5) point and 7.2 vs. 7.9 cores at the (5,8) point.]
Comparison with Optimization Methods
• Compare against the best PPM of:
  • Constrained Nonlinear Optimization
  • Pattern Search Algorithm
  • Genetic Algorithm
• These methods were run for 10,000× longer than ITA.
Optimization Methods Comparison
• ITA achieves an average PPM improvement of 9% over the best of the optimization methods.
[Figure: PPM improvement vs. applications and cores, annotated with the number of operated cores, e.g., ITA 3.6 vs. 4.6 for the optimization methods at (2,5), ITA 7.9 vs. 8 at (5,8), and ITA 4.7 vs. 7.1 and 7.1 vs. 7.9 at two further points.]
Summary
• A tunable Performance Power Metric.
• The Minimum Utilization concept.
• An approach for low-computational-effort thread allocation on a CMP.
• Future work:
  • Extension to distributed caches: thread and data co-allocation.
  • Consideration of the sharing effect.
  • Heterogeneous CMPs.
Performance Power Metric
• Follows definitions used in logic circuit design.
• If E is the energy and t is the delay, Penzes & Martin introduced the metric E·t^α, where α becomes larger as performance becomes more important.
* "Energy-delay efficiency of VLSI computations", P.I. Penzes and A.J. Martin, 12th ACM Great Lakes Symposium on VLSI, 2002.
Minimum Utilization (MU) Calculation, Cont.
• The MU value depends on how many cores are already operating.
• For a large enough value of m, the MU value is constant.
• An approximate constant value is therefore reasonable (keep it simple).
[Figure: MU of the (m+1)th core vs. m, for α = 1, 1.5, 2, 2.5, 3.]
MU vs. Pidle/Pactive
[Figure: Minimum Utilization (%) vs. Pidle/Pactive, for α = 1, 1.2, 1.4, 1.6, 1.8, 2.]
Previous Work Neglecting the Sharing Effect
• Fedorova et al., "Chip multithreading systems need a new operating system scheduler":
  • Its goal is to highly utilize the cores.
  • Tries to pair high-IPC tasks with low-IPC tasks in order to reduce pipeline resource contention.
  • Neglects the sharing effect among threads (similar to my research).
  • Doesn't take into account the varying distances of cores from the L2 shared cache.
  • Doesn't consider the power consumption.
Discretization
• There are many discretization methods.
• We use the Histogram Specification method (from image processing).
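The deck points to histogram specification from image processing; as a simpler illustration of what result discretization has to accomplish, here is a largest-remainder rounding sketch (not the authors' method) that turns the continuous per-core thread counts of one application into integers while preserving its total thread count.

```python
def discretize_counts(continuous_counts, total_threads):
    """Round continuous per-core thread counts to integers summing to
    `total_threads` using largest-remainder rounding (a stand-in for the
    histogram-specification method mentioned on the slide)."""
    floors = [int(x) for x in continuous_counts]
    remainders = [x - f for x, f in zip(continuous_counts, floors)]
    missing = total_threads - sum(floors)
    # Hand the leftover threads to the cores with the largest fractional parts.
    by_remainder = sorted(range(len(continuous_counts)),
                          key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:missing]:
        floors[i] += 1
    return floors
```

For instance, discretize_counts([2.7, 1.6, 0.7], 5) returns [3, 1, 1].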
Results Discretization — Example
• On average, results discretization reduces the PPM value by 5%.
[Figure: number of threads per core vs. core hop distance, comparing the continuous (C) and discretized (D) allocations.]
Flow Chart
1) Initialize: current core = the core closest to the shared cache; current application = the application with the highest miss rate.
2) Allocate threads of the current application onto the current core, up to (at most) threshold saturation.
3) If all the threads of the current application were allocated:
   • Last application → finish.
   • Otherwise → current application = the unallocated application with the highest cache miss rate.
4) If the current core is at the saturation threshold:
   • Last core, or the unallocated threads would not achieve MU on the next available closer core → allocate all remaining threads over the already operating cores (over-saturation) and finish.
   • Otherwise → current core = the next available core closest to the shared cache.
5) Repeat from step 2.
Time Complexity Comparison — Optimization Methods vs. ITA
• Ratio: ITA operations / minimum of the optimization methods' operations.
• ITA consumes on average 0.01%, and at most 2.5%, of the minimum computational effort required by the optimization methods, while outperforming the best of them by 9%.
[Figure: ratio vs. applications and cores.]
• Ratio of memory-access instructions out of the total instruction mix of thread i.
• Cache miss rate of thread i.