Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Jason Bosko March 5th, 2008 Based on “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation” by Sangyeun Cho and Lei Jin, appearing in IEEE/ACM Int'l Symposium on Microarchitecture (MICRO), December 2006.
Outline • Background and Motivation • Page Allocation • Specifics of Page Allocation • Evaluation of Page Allocation • Conclusion
Motivation • With multicore processors, on-chip memory design and management become crucial • Increasing L2 cache sizes result in non-uniform cache access latencies, which complicates the management of these caches
Private Caches • A cache slice is associated with a specific processor core • Data must be replicated across cache slices as different processors access it • Advantages? • Data is always close to the processor, reducing hit latency • Disadvantages? • Replication limits the overall cache space, resulting in more capacity misses (Figure: blocks in memory)
Shared Caches • Each memory block maps to one (and only one) cache slice, S = A mod N, which all processors access • Advantages? • Increases effective L2 cache size • Easier to implement coherence protocols (data only exists in one place) • Disadvantages? • Requested data is not always close, so hit latency increases • Increases network traffic due to movement of data that is far from the requesting processor (Figure: blocks in memory)
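For concreteness, here is a minimal sketch (in C, with an assumed 64-byte block size and 16 cache slices) of this block-interleaved mapping; the hardware simply applies S = A mod N to the block number:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64   /* assumed L2 block size in bytes */
#define N_SLICES   16   /* assumed number of cache slices (one per tile) */

/* Block-interleaved shared mapping: the memory block number, taken modulo the
 * number of slices, selects the single home slice for that block. */
static unsigned home_slice_for_block(uint64_t paddr)
{
    uint64_t block = paddr / BLOCK_SIZE;   /* A: memory block number */
    return (unsigned)(block % N_SLICES);   /* S = A mod N */
}

int main(void)
{
    uint64_t addr = 0x12345678;
    printf("address 0x%llx -> slice %u\n",
           (unsigned long long)addr, home_slice_for_block(addr));
    return 0;
}
```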
Page Allocation • Add another level of indirection: pages! • Built on top of a shared cache architecture • Use the physical page number (PPN) to map physical pages to cache slices: S = PPN mod N • The OS controls the mapping of virtual pages to physical pages; since the OS knows which slice a physical page maps to, it can place a virtual page on whichever cache slice it desires! (Figure: pages in memory, pages in VM)
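A sketch of the page-granularity mapping, again with assumed parameters (4 KB pages, 16 slices). The hardware side computes the home slice from the PPN; the OS side shows how, to place a virtual page on a desired slice, it can pick a free physical page from the right congruence class (pick_ppn_for_slice is a hypothetical helper, not from the paper):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* assumed 4 KB pages */
#define N_SLICES   16   /* assumed 4x4 tiled CMP */

/* Hardware side: the home slice is determined by the physical page number. */
static unsigned home_slice_for_page(uint64_t paddr)
{
    uint64_t ppn = paddr >> PAGE_SHIFT;    /* physical page number */
    return (unsigned)(ppn % N_SLICES);     /* S = PPN mod N */
}

/* OS side (hypothetical helper): to place a virtual page on slice `target`,
 * choose a free physical page whose PPN is congruent to `target` mod N. */
static uint64_t pick_ppn_for_slice(unsigned target, uint64_t first_free_ppn)
{
    /* round first_free_ppn up to the next PPN in the desired congruence class */
    uint64_t delta = (target + N_SLICES - first_free_ppn % N_SLICES) % N_SLICES;
    return first_free_ppn + delta;
}

int main(void)
{
    unsigned target = 5;
    uint64_t ppn = pick_ppn_for_slice(target, 1000);
    printf("PPN %llu maps to slice %u (wanted %u)\n",
           (unsigned long long)ppn,
           home_slice_for_page(ppn << PAGE_SHIFT), target);
    return 0;
}
```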
How does Page Allocation work? • A congruence group CGi is the set of physical pages that map to cache slice i • Each congruence group maintains a “free list” of available pages • To implement private caching: when a page is requested by processor i, allocate a free page from CGi • To implement shared caching: when any page is requested, allocate a page from any CG • To implement hybrid caching: split the CGs into K clusters, keeping track of which CG belongs to which cluster; when a page is requested, allocate a page from any CG in the requester's cluster • All of this is controlled by the OS without any additional hardware support!
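The sketch below (illustrative C with made-up data structures; the paper prescribes no particular implementation) keeps one free list per congruence group and shows how the three policies differ only in which CG the OS draws from:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N_SLICES 16   /* assumed number of tiles, one congruence group per tile */

/* Each congruence group CG_i keeps its own free list of physical pages whose
 * PPN mod N equals i.  Structures and policy names here are illustrative. */
struct free_page { uint64_t ppn; struct free_page *next; };
struct cgroup    { struct free_page *free_list; unsigned n_free; };

static struct cgroup cg[N_SLICES];

static void give_page(unsigned i, uint64_t ppn)
{
    struct free_page *p = malloc(sizeof *p);
    p->ppn = ppn;
    p->next = cg[i].free_list;
    cg[i].free_list = p;
    cg[i].n_free++;
}

static uint64_t take_page(unsigned i)
{
    struct free_page *p = cg[i].free_list;
    if (!p) return UINT64_MAX;            /* CG_i is exhausted */
    cg[i].free_list = p->next;
    cg[i].n_free--;
    uint64_t ppn = p->ppn;
    free(p);
    return ppn;
}

/* Private-like policy: always allocate from the requesting core's own CG. */
static uint64_t alloc_private(unsigned core) { return take_page(core); }

/* Shared-like policy: allocate from any CG, here simply round-robin. */
static uint64_t alloc_shared(void)
{
    static unsigned next;
    uint64_t ppn = take_page(next);
    next = (next + 1) % N_SLICES;
    return ppn;
}

/* Hybrid policy: CGs are partitioned into K clusters; allocate from any CG in
 * the requesting core's cluster. */
static uint64_t alloc_hybrid(unsigned core, unsigned K)
{
    static unsigned rr;
    unsigned cluster_size = N_SLICES / K;
    unsigned base = (core / cluster_size) * cluster_size;
    return take_page(base + (rr++ % cluster_size));
}

int main(void)
{
    for (uint64_t ppn = 0; ppn < 64; ppn++)      /* seed every CG with pages */
        give_page((unsigned)(ppn % N_SLICES), ppn);

    printf("core 3, private policy -> PPN %llu\n", (unsigned long long)alloc_private(3));
    printf("core 3, shared policy  -> PPN %llu\n", (unsigned long long)alloc_shared());
    printf("core 3, hybrid (K=4)   -> PPN %llu\n", (unsigned long long)alloc_hybrid(3, 4));
    return 0;
}
```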
Page Spreading & Page Spilling • If the OS always allocates pages from the CG corresponding to the requesting processor, then it acts like a private cache. • The OS can choose to direct allocations to cache slices in other cores in order to increase the effective cache size. This is page spreading. • When available pages in a CG drop below some threshold, the OS may be forced to allocate pages from another group. This is page spilling. • Each tile is on a specific tier that corresponds to how close it is to the target tile. Tier-1 tiles
Cache Pressure • Add hardware support for counting “unique” page accesses in each cache slice • But weren't we trying to avoid hardware support? A simple counter still doesn't hurt! • When the measured cache pressure is high, new pages are allocated to other tiles on the same tier, or to tiles on the next tier
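One way such a counter could be organised is sketched below: a per-slice hashed bitmap approximates the number of distinct pages touched in an interval, and the OS compares it against the slice capacity. The filter size, hash, and threshold are assumptions for illustration, not the paper's hardware design:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define FILTER_BITS 1024   /* assumed size of the per-slice access filter */

/* Approximates "how many distinct pages were touched recently": one hashed
 * bit per page plus a running counter of first touches. */
struct pressure {
    uint8_t  seen[FILTER_BITS / 8];
    unsigned unique_pages;
};

static void record_access(struct pressure *p, uint64_t ppn)
{
    unsigned bit = (unsigned)(ppn * 2654435761u % FILTER_BITS);  /* simple hash */
    if (!(p->seen[bit / 8] & (1u << (bit % 8)))) {
        p->seen[bit / 8] |= (1u << (bit % 8));
        p->unique_pages++;               /* first (approximate) touch of this page */
    }
}

/* At the end of an interval the OS reads the counter, compares it with the
 * slice capacity in pages, and resets the filter. */
static bool pressure_is_high(struct pressure *p, unsigned slice_capacity_pages)
{
    bool high = p->unique_pages > slice_capacity_pages;
    memset(p->seen, 0, sizeof p->seen);
    p->unique_pages = 0;
    return high;
}

int main(void)
{
    struct pressure pr = {0};
    for (uint64_t ppn = 0; ppn < 200; ppn++)   /* 200 distinct pages touched */
        record_access(&pr, ppn);
    printf("high pressure? %s\n",
           pressure_is_high(&pr, 128) ? "yes" : "no");
    return 0;
}
```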
Home Allocation Policy • The profitability of choosing a particular home cache slice depends on several factors: • Recent miss rates of the L2 cache slices • Recent network contention levels • The current page allocation • QoS requirements • Processor configuration (number of processors, etc.) • The OS can easily pick the cache slice with the highest profitability
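A toy cost function illustrating the idea; the factors mirror the list above, but the weights and the exact cost model are invented for this sketch, not taken from the paper:

```c
#include <stdio.h>

#define N_SLICES 16

/* Illustrative per-slice statistics the OS might consult. */
struct slice_stats {
    double miss_rate;       /* recent L2 miss rate of the slice */
    double link_contention; /* recent contention on links toward the slice */
    unsigned hops;          /* distance from the requesting core to the slice */
    unsigned pages_mapped;  /* pages already allocated to the slice */
};

/* Lower cost = more profitable home for the next page (weights are made up). */
static double cost(const struct slice_stats *s)
{
    return 4.0 * s->miss_rate
         + 2.0 * s->link_contention
         + 1.0 * s->hops
         + 0.01 * s->pages_mapped;
}

static unsigned most_profitable_slice(const struct slice_stats st[N_SLICES])
{
    unsigned best = 0;
    for (unsigned i = 1; i < N_SLICES; i++)
        if (cost(&st[i]) < cost(&st[best]))
            best = i;
    return best;
}

int main(void)
{
    struct slice_stats st[N_SLICES] = {0};
    for (unsigned i = 0; i < N_SLICES; i++) {
        st[i].hops = i % 4 + i / 4;       /* distance from core 0 on a 4x4 mesh */
        st[i].miss_rate = 0.1;
    }
    st[0].miss_rate = 0.9;                /* the local slice is thrashing */
    printf("best home slice: %u\n", most_profitable_slice(st));
    return 0;
}
```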
Virtual Multicore (VM) • For parallel applications, the OS should coordinate page allocation to minimize latency and traffic: schedule a parallel application onto a set of cores in close proximity • When cache pressure increases, pages can still be allocated outside of the VM
Hardware Support • The best feature of OS-level page allocation is that it can be built on a simple shared cache organization with no hardware support • But additional hardware support can still be leveraged! • Data replication • Data migration • Bloom filter
Evaluation • Used the SimpleScalar tool set to model a 4×4 mesh multicore processor chip • Demand paging: every memory access is checked against the allocated pages; the first access to an unallocated page triggers allocation of a physical page under the desired policy • No page spilling was observed in any experiment • Used single-threaded, multiprogrammed, and parallel workloads • Single-threaded: a variety of SPEC2k integer and floating-point benchmarks • Multiprogrammed: one core (core 5 in the experiments) runs a target benchmark while the other cores run a synthetic benchmark that continuously generates memory accesses • Parallel: SPLASH-2 benchmarks
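For clarity, a toy version of the demand-paging step used in the simulation: the first access to an unmapped virtual page triggers allocation under whatever policy is being evaluated (here a trivial identity mapping stands in for the real policy; names and sizes are assumptions):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define MAX_VPAGES 4096   /* assumed size of the simulated virtual page table */

/* Simulated page table: vpn -> ppn, or UINT64_MAX if not yet allocated. */
static uint64_t page_table[MAX_VPAGES];

/* Placeholder for whatever allocation policy is being evaluated
 * (private-like, shared-like, SP-RR, SP-80, ...). */
static uint64_t allocate_physical_page(unsigned core, uint64_t vpn)
{
    (void)core;
    return vpn;   /* trivially identity-map for this sketch */
}

/* Demand paging in the simulator: every memory access is checked against the
 * allocated pages; the first access to an unmapped virtual page triggers an
 * allocation under the policy being studied. */
static uint64_t translate(unsigned core, uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    if (page_table[vpn] == UINT64_MAX)
        page_table[vpn] = allocate_physical_page(core, vpn);
    return (page_table[vpn] << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void)
{
    for (unsigned i = 0; i < MAX_VPAGES; i++) page_table[i] = UINT64_MAX;
    printf("vaddr 0x2345 -> paddr 0x%llx\n",
           (unsigned long long)translate(0, 0x2345));
    return 0;
}
```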
Performance on single-threaded workloads • PRV: private caches • PRV8: private with an 8 MB cache size (instead of 512 KB) • SL: shared • SP: OS-based page allocation • SP-RR: round-robin allocation • SP-80: 80% of pages allocated locally, 20% spread across tier-1 cores
Performance on single-threaded workloads • Decreased sharing = higher miss rate • Decreased sharing = less on-chip traffic
Performance on multiprogrammed workloads • SP40-CS: uses controlled spreading to limit spreading of unrelated pages onto cores that hold the target application's data • The synthetic benchmarks produce low, mid, or high traffic • SP40 usually performs better under high traffic, but its performance is similar to SL under low traffic • Not shown here, but SP40 reduces on-chip network traffic by 50% (compared to SL)
Performance on parallel workloads • VM: virtual multicore with round-robin page allocation on the participating cores • lu and ocean have higher L1 miss rates, so the L2 cache policy had a greater effect on their performance • For the other benchmarks there is no real difference; on lu and ocean, VM outperforms the rest!
Related Issues • Remember NUMA? NUMA systems used a page scanner that maintained reference counters and generated page faults to let the OS take some control • In CC-NUMA, hardware-based counters affected OS decisions • Big difference: NUMA deals with main memory, while the OS-level page allocation presented here deals with distributed L2 caches
Conclusion • Page allocation allows for a very simple shared cache architecture, but how can we use advances in architecture for our benefit? • Architecture can provide more detailed information about current state of the cores • CMP-NuRAPID, victim replication, cooperative caching • Can we apply OS-level modifications also? • Page coloring and page recoloring • We are trading hardware complexity for software complexity – where is the right balance?