Fair and High Throughput Cache Partitioning Scheme for CMPs
Shibdas Bandyopadhyay, Dept. of CISE, University of Florida
Outline
- Motivation and Problem Statement
- Existing Solutions
- What we propose to achieve
- Feasibility Study
- Potential Impact
- Conclusion
Motivation
- More cores are being integrated on a die (e.g. Intel Tera-Scale computing)
- Multitasking becomes more common: multiple applications run simultaneously
- Virtualized workloads become mainstream; multiple VMs are consolidated onto the same platform
- Problems in platform resource management:
  - Loss of efficiency due to the disparate behavior of simultaneously running applications
  - No fairness or determinism guarantees
  - No effective prioritization
Motivation – An Example
- We have a shared L2 cache
- The operating system wants to prioritize a front-end application that is directly visible to the user; it usually does so by increasing that application's time slice
- But another process with poor temporal locality is also running; it thrashes the L2 cache, so more of its blocks end up occupying the shared L2
- The higher-priority process then spends most of its time evicting the other process's blocks
Problem Statement
- We need to modify existing cache management to include a notion of fairness alongside throughput
- Throughput has dominated cache management policies, since the goal has been to reduce the number of misses
- But increasing throughput should not raise the hit rate of a resource-hogging process to the point where the other processes suffer excessive cache thrashing
Problem Statement
- Not a new problem; it already existed in traditional multitasking environments
- The degree of multitasking has now increased with the number of cores, and part of the cache hierarchy is shared among them
- A shared L2 cache is most useful when processes share data; without sharing, we would end up with many copies of the shared data in private L2 caches
Existing Solutions
- Profiling-based approach
- Non-uniform cache architecture
- Partially shared cache hierarchy
- Marginal-gain-based approach
- Fairness-based approach
- Resource QoS-based approach
Profiling-based approach
- Profile each application's cache accesses when it runs alone
- Determine the optimum cache size for each application given the total cache size available (a toy version of this search is sketched below)
- Current schemes only maximize the sum of cache hits across all processes
- We need to incorporate a fairness criterion and optimize it along with throughput
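A minimal sketch of the profiling idea, assuming each process's profile is available as a table hits[p][s] (hits observed when process p runs alone with s units of cache); the function and data names are illustrative, not part of any published scheme:

    # Pick per-process partition sizes that maximize total hits, given
    # standalone profiles. Exhaustive search over all size assignments.
    from itertools import product

    def best_static_partition(hits, total_size):
        n = len(hits)
        best_hits, best_alloc = -1, None
        for alloc in product(range(total_size + 1), repeat=n):
            if sum(alloc) != total_size:
                continue
            total = sum(hits[p][alloc[p]] for p in range(n))
            if total > best_hits:
                best_hits, best_alloc = total, alloc
        return best_alloc

    # Example: two processes sharing 4 units of cache.
    hits = [[0, 50, 80, 95, 100], [0, 30, 55, 70, 80]]
    print(best_static_partition(hits, 4))  # -> (2, 2)

As the slide notes, this objective is throughput-only; a fairness term would still have to be added.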
Non-Uniform Cache Architecture
- Arises because a large, wire-delay-dominated L2 cache has non-uniform access times: different parts of the cache sit at different distances from the processor
- Locally accessed blocks should be placed near their processor, while shared blocks should be placed optimally with respect to the processors sharing them
- Divides the cache banks among the processors; dynamically controlling the granularity of this division provides performance improvements
Partially Shared Cache
- Aims to combine the best of both worlds: private L2 caches and a shared L2 cache
- A fraction of the L2 cache is kept private and the rest shared; this essentially boils down to a private L2 and a shared L3 with equivalent access times
- Coherence protocols need to include a state for "shared" blocks
- Blocks evicted from one processor's private L2 are placed in another processor's private L2, so that next time they can be fetched from that processor instead of from memory
- Ideally, evicted blocks would be placed on the processors most likely to share them in the future
Marginal Gain based approach
- Based on the concept of reuse distance
- Normally, a stack-based profile can be computed after running the applications
- If the stack-based profile curve is convex, a resource-minimization procedure finds the optimal cache size for each process (note: the sum of all cache misses is minimized here)
- Can be implemented on the fly by computing the marginal gain (the increase in cache hits if the cache size grows by 1) from an LRU stack and counters, and partitioning accordingly (sketched below)
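A sketch of the on-the-fly version, assuming mg[p][s] holds process p's marginal gain at size s (extra hits from one more cache unit), as derived from per-LRU-stack-position hit counters; the greedy rule below is optimal when the hit curves are concave, matching the convexity assumption above:

    def greedy_partition(mg, total_size):
        # Repeatedly grant the next cache unit to the process whose
        # marginal gain at its current allocation is largest.
        n = len(mg)
        alloc = [0] * n
        for _ in range(total_size):
            gains = [mg[p][alloc[p]] if alloc[p] < len(mg[p]) else 0
                     for p in range(n)]
            winner = max(range(n), key=lambda p: gains[p])
            alloc[winner] += 1
        return alloc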
Fairness based approach
- Define fairness metrics based on a process's cache access pattern when running alongside other processes versus when running alone
- At each context switch, compute every process's value of the fairness metric and increase the partition sizes of the processes treated least fairly (one possible rule is sketched below)
- A model is still needed to show that the metric indeed leads to fairness for access streams with given mathematical properties
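One illustrative instantiation (an assumption, not the specific metric from the literature): take each process's ratio of misses when sharing the cache to misses when running alone, and at a context switch shift capacity toward the process with the worst ratio:

    def rebalance(alloc, miss_shared, miss_alone, step=1):
        # Slowdown ratio per process; equal ratios across processes
        # would mean the cache is being shared fairly.
        ratio = [s / a for s, a in zip(miss_shared, miss_alone)]
        worst = max(range(len(ratio)), key=lambda p: ratio[p])
        best = min(range(len(ratio)), key=lambda p: ratio[p])
        if worst != best and alloc[best] >= step:
            alloc[best] -= step    # least-penalized process gives up space
            alloc[worst] += step   # most-penalized process gains space
        return alloc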
Resource QoS based approach
- Incorporates a notion of priority from the application level down to every resource in the system
What we propose to achieve
- Propose a strategy that maximizes both fairness and the total number of cache hits
- Should not depend on the OS to provide priority data, since the scheme is implemented by the hardware cache controller; this means we cannot enforce per-process priorities, but we aim to guarantee fairness
- Should use counters and an LRU stack, and should not require running the application a priori
- One interesting aspect is profiling the cache coherence protocol to understand the behavior of processes running on different cores; this was not possible in single-core multitasking systems, which have no coherence protocol
What we propose to achieve
- Mathematically, we have a reuse-distance curve for every process
What we propose to achieve
- To maximize throughput, we maximize the sum of the areas under these curves, given that the partition sizes sum to a constant (formulated below)
- Fairness must also be taken into account
- Conceptually, the total cache space allocated to a process consists of exclusive space for that process plus shared space used by all processes
- Fairness is violated when the shared cache space is used unevenly by different processes
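In symbols, a sketch of the formulation the slide describes, with h_i the reuse-distance curve of process i, s_i its partition size, and S the total cache size:

    \max_{s_1,\dots,s_n \ge 0} \; \sum_{i=1}^{n} \int_0^{s_i} h_i(x)\,dx
    \qquad \text{subject to} \qquad \sum_{i=1}^{n} s_i = S

The fairness requirement then acts as an additional constraint on how the shared portion of each s_i may be consumed.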
What we propose to achieve
- We start with a shared-space assignment and vary the shared space depending on the shared blocks present in the cache
- The degree of "sharedness" of the cache can be determined from the states of the cache blocks
- We maintain a counter that is updated on every block state change made by the cache coherence protocol (see the sketch below)
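A sketch of how such a counter could piggyback on a MESI-style protocol; the controller hook and state names are assumptions made for illustration:

    # Hypothetical hook invoked by the cache controller on every
    # coherence state transition of a block; shared_blocks then tracks
    # the degree of "sharedness" of the cache at any instant.
    shared_blocks = 0

    def on_state_change(old_state, new_state):
        global shared_blocks
        if new_state == "S" and old_state != "S":    # block became shared
            shared_blocks += 1
        elif old_state == "S" and new_state != "S":  # block left Shared
            shared_blocks -= 1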
What we propose to achieve
- Maximizing throughput requires the reuse-distance curve, which would imply running the application once to produce it; this is not really required, since marginal gains approximate the curve
- We start with, say, an equal partition, and then after every block state change (or, more practically, after every N changes) we recompute the "sharedness" and "throughput" criteria and repartition (a candidate rule is sketched below)
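Putting the two criteria together, a sketch of one possible repartitioning rule; the scoring and the weight are assumptions, and how to combine the criteria is an open design question:

    def repartition(alloc, marginal_gain, shared_use, weight=0.5):
        # marginal_gain[p]: extra hits per extra cache unit (throughput).
        # shared_use[p]: process p's fraction of the shared space, from
        # the sharedness counters (fairness). High marginal gain argues
        # for more space; heavy shared-space use argues for less.
        n = len(alloc)
        score = [weight * marginal_gain[p] - (1 - weight) * shared_use[p]
                 for p in range(n)]
        lo = min(range(n), key=lambda p: score[p])
        hi = max(range(n), key=lambda p: score[p])
        if lo != hi and alloc[lo] > 0:
            alloc[lo] -= 1   # take one unit from the lowest-scoring process
            alloc[hi] += 1   # give it to the highest-scoring process
        return alloc

Invoked every N block state changes, this would gradually shift capacity toward the processes that benefit most while reining in those that monopolize the shared space.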