
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Jason Bosko, March 5th, 2008.


Presentation Transcript


  1. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Jason Bosko March 5th, 2008 Based on “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation” by Sangyeun Cho and Lei Jin, which appeared in the IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2006.

  2. Outline • Background and Motivation • Page Allocation • Specifics of Page Allocation • Evaluation of Page Allocation • Conclusion

  3. Motivation • With multicore processors, on-chip memory design and management become crucial • Growing L2 cache sizes lead to non-uniform cache access latencies, which complicate the management of these caches

  4. Private Caches • A cache slice is associated with a specific processor core • Data must be replicated across private slices as different cores access it • Advantages? • Data is always close to the processor, reducing hit latency • Disadvantages? • Each core is limited to its own slice, reducing overall effective cache space and causing more capacity misses (Figure: blocks in memory mapping to per-core cache slices)

  5. Shared Caches S = A mod N • Each memory block maps to one (and only one) cache slice, which all processors access • Advantages? • Increases effective L2 cache size • Coherence protocols are easier to implement (each block exists in only one place) • Disadvantages? • Requested data is not always close, so hit latency increases • Network traffic increases, since data must travel from remote slices to the requesting processor (Figure: blocks in memory interleaved across cache slices)
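
To make the static mapping concrete, below is a minimal C sketch of block-interleaved slice selection; the 16-slice count and the home_slice helper are illustrative assumptions, not details from the paper.

```c
/* Minimal sketch of the shared-cache mapping S = A mod N.
 * NUM_SLICES and the helper name are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define NUM_SLICES 16   /* e.g., a 4x4 tiled chip */

/* Cache slice that homes the memory block with address 'block_addr'. */
unsigned home_slice(uint64_t block_addr) {
    return (unsigned)(block_addr % NUM_SLICES);
}

int main(void) {
    /* Consecutive blocks are interleaved across all slices. */
    for (uint64_t a = 0; a < 4; a++)
        printf("block %llu -> slice %u\n",
               (unsigned long long)a, home_slice(a));
    return 0;
}
```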

  6. Page Allocation S = PPN mod N • Add another level of indirection – pages! • Built on top of a shared cache architecture • Use the physical page number (PPN) to choose the cache slice that homes each physical page • The OS controls the mapping of virtual pages to physical pages – since the OS knows which cache slice each physical page maps to, it can back a virtual page with a physical page in whichever slice it desires! (Figure: pages in memory and pages in virtual memory mapping to cache slices)
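
A rough sketch of the page-granularity variant, assuming 4 KB pages and 16 slices (both illustrative): the slice now depends only on the physical page number, so the OS chooses the slice for a virtual page simply by choosing which physical page backs it.

```c
/* Sketch of the page-interleaved mapping S = PPN mod N (constants assumed). */
#include <stdint.h>

#define NUM_SLICES 16
#define PAGE_SHIFT 12                 /* 4 KB pages assumed */

/* Home slice of a physical address: every block within a page maps to the
 * same slice, determined entirely by the physical page number (PPN). */
unsigned home_slice(uint64_t paddr) {
    uint64_t ppn = paddr >> PAGE_SHIFT;
    return (unsigned)(ppn % NUM_SLICES);
}

/* To place a virtual page in slice s, the OS backs it with any free
 * physical page whose PPN satisfies PPN % NUM_SLICES == s. */
int ppn_lands_in_slice(uint64_t ppn, unsigned slice) {
    return (unsigned)(ppn % NUM_SLICES) == slice;
}
```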

  7. How does Page Allocation work? • A congruence group (CGi) is the set of physical pages that map to cache slice i of processor core i • Each congruence group maintains a “free list” of available pages • To implement private caching, when a page is requested by processor i, allocate a free page from CGi • To implement shared caching, when any page is requested, allocate a page from any CG • To implement hybrid caching, split the CGs into K groups and track which CG belongs to which group – when a page is requested, allocate a page from any CG in the correct group • All of this is controlled by the OS without any additional hardware support! (A minimal sketch of this bookkeeping follows.)
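
A hypothetical sketch of the OS-side bookkeeping described above: one free list per congruence group, with the private and shared policies differing only in which list a request is served from. The data structure and function names are assumptions for illustration, not from the paper or any real kernel.

```c
/* Per-congruence-group free lists and two allocation policies (illustrative). */
#include <stddef.h>
#include <stdint.h>

#define NUM_SLICES 16

struct page_frame {
    uint64_t ppn;                        /* ppn % NUM_SLICES gives the CG */
    struct page_frame *next;
};

/* One free list per congruence group CG_i: pages homed in cache slice i. */
static struct page_frame *free_list[NUM_SLICES];

static struct page_frame *pop_free(unsigned cg) {
    struct page_frame *p = free_list[cg];
    if (p)
        free_list[cg] = p->next;
    return p;
}

/* Private-style policy: a request from core i is served from CG_i. */
struct page_frame *alloc_private(unsigned requesting_core) {
    return pop_free(requesting_core);
}

/* Shared-style policy: serve the request from any CG (round-robin here). */
struct page_frame *alloc_shared(void) {
    static unsigned next_cg = 0;
    for (unsigned i = 0; i < NUM_SLICES; i++) {
        unsigned cg = (next_cg + i) % NUM_SLICES;
        struct page_frame *p = pop_free(cg);
        if (p) {
            next_cg = (cg + 1) % NUM_SLICES;
            return p;
        }
    }
    return NULL;    /* no free pages left in any group */
}
```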

  8. Page Spreading & Page Spilling • If the OS always allocates pages from the CG corresponding to the requesting processor, the scheme behaves like a private cache • The OS can instead direct some allocations to cache slices of other cores to increase the effective cache size – this is page spreading • When the number of available pages in a CG drops below some threshold, the OS may be forced to allocate pages from another group – this is page spilling • Each tile belongs to a tier that reflects how close it is to the target tile (Figure: tier-1 tiles surrounding the target tile)
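
A hedged sketch of how spreading and spilling might be folded into one allocation decision, assuming a 4x4 mesh, Manhattan-distance tiers, and a made-up free-page threshold; none of these constants come from the paper.

```c
/* Choosing a congruence group with page spreading and spilling (illustrative). */
#include <stdbool.h>
#include <stddef.h>

#define MESH_DIM 4
#define NUM_SLICES (MESH_DIM * MESH_DIM)
#define SPILL_THRESHOLD 32              /* assumed low-water mark on free pages */

static size_t free_pages[NUM_SLICES];   /* length of each CG's free list */

/* Tier of tile 'other' relative to 'home': Manhattan distance on the mesh. */
static unsigned tier_of(unsigned home, unsigned other) {
    int dx = (int)(home % MESH_DIM) - (int)(other % MESH_DIM);
    int dy = (int)(home / MESH_DIM) - (int)(other / MESH_DIM);
    return (unsigned)((dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy));
}

/* Choose the congruence group to allocate from for 'core':
 *  - normally allocate locally (private-like behaviour);
 *  - under high cache pressure, spread to the nearest tier with free pages;
 *  - if the local free list runs low, the same outward search acts as
 *    page spilling into neighbouring groups. */
int choose_cg(unsigned core, bool high_pressure) {
    if (!high_pressure && free_pages[core] > SPILL_THRESHOLD)
        return (int)core;
    for (unsigned tier = 1; tier <= 2 * (MESH_DIM - 1); tier++)
        for (unsigned cg = 0; cg < NUM_SLICES; cg++)
            if (tier_of(core, cg) == tier && free_pages[cg] > 0)
                return (int)cg;
    return free_pages[core] > 0 ? (int)core : -1;   /* fall back to local */
}
```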

  9. Cache Pressure • Add hardware support for counting “unique” page accesses in a cache • But weren't we supposed to avoid extra hardware support? A simple counter still doesn't hurt! • When measured cache pressure is high, new pages are allocated to other tiles on the same tier, or to tiles on the next tier out
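
For intuition only, here is what such a counter could look like if emulated in software; the paper proposes hardware support, and the table size and hash below are invented. The counter marks each page the first time it is touched in an interval and so approximates the number of distinct pages accessed.

```c
/* Approximate count of distinct pages touched per interval (illustrative). */
#include <stdint.h>
#include <string.h>

#define PRESSURE_BITS 1024                 /* assumed table size per slice */

struct pressure_counter {
    uint8_t seen[PRESSURE_BITS / 8];       /* one bit per hashed page */
    unsigned unique_pages;                 /* approximate distinct-page count */
};

/* The count only grows the first time a page's bit is set, so it
 * approximates the number of distinct pages touched this interval. */
void record_access(struct pressure_counter *pc, uint64_t ppn) {
    unsigned idx = (unsigned)((ppn * 0x9E3779B97F4A7C15ULL) >> 54) % PRESSURE_BITS;
    if (!(pc->seen[idx / 8] & (1u << (idx % 8)))) {
        pc->seen[idx / 8] |= (uint8_t)(1u << (idx % 8));
        pc->unique_pages++;
    }
}

/* The OS compares unique_pages against the slice's capacity to decide
 * whether pressure is high, then clears the table for the next interval. */
void reset_interval(struct pressure_counter *pc) {
    memset(pc->seen, 0, sizeof pc->seen);
    pc->unique_pages = 0;
}
```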

  10. Home allocation policy • The profitability of choosing a particular home cache slice depends on several factors: • Recent miss rates of the L2 caches • Recent network contention levels • Current page allocation • QoS requirements • Processor configuration (# of processors, etc.) • The OS can then pick the cache slice with the highest profitability
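
For illustration, the profitability of each candidate slice could be folded into a single cost score; the weights and inputs below are invented, since the slide only lists the factors involved.

```c
/* Picking the most profitable home slice via a weighted cost (illustrative). */
#define NUM_SLICES 16

struct slice_stats {
    double recent_miss_rate;     /* recent L2 miss rate of this slice */
    double network_contention;   /* recent contention on the path to it */
    double distance_hops;        /* hops from the requesting core */
    double occupancy;            /* fraction of the slice already allocated */
};

/* Lower cost = more profitable home. The weights are made-up tuning knobs. */
static double home_cost(const struct slice_stats *s) {
    return 1.0 * s->recent_miss_rate
         + 0.5 * s->network_contention
         + 0.3 * s->distance_hops
         + 0.2 * s->occupancy;
}

/* The OS scans all slices and picks the one with the lowest cost. */
unsigned best_home(const struct slice_stats stats[NUM_SLICES]) {
    unsigned best = 0;
    for (unsigned i = 1; i < NUM_SLICES; i++)
        if (home_cost(&stats[i]) < home_cost(&stats[best]))
            best = i;
    return best;
}
```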

  11. Virtual Multicore (VM) • For parallel applications, the OS should coordinate page allocation to minimize latency and traffic – schedule a parallel application onto a set of cores in close proximity • When cache pressure increases, pages can still be allocated outside of the VM

  12. Hardware Support • The best feature of OS-level page allocation is that it can be built on a simple shared cache organization with no hardware support • But additional hardware support can still be leveraged! • Data replication • Data migration • Bloom filter

  13. Evaluation • The SimpleScalar tool set is used to model a 4x4 mesh multicore processor chip • Demand paging – every memory access is checked against allocated pages; the first access to an unallocated page triggers allocation of a physical page according to the desired policy • No page spilling was observed in any experiment • Single-threaded, multiprogrammed, and parallel workloads were used • Single-threaded = a variety of SPEC2K integer and floating-point benchmarks • Multiprogrammed = one core (core 5 in the experiments) runs a target benchmark while the other cores run a synthetic benchmark that continuously generates memory accesses • Parallel = SPLASH-2 benchmarks

  14. Performance on single-threaded workloads • PRV: private • PRV8: private with an 8MB cache (instead of 512KB) • SL: shared • SP: OS-based page allocation • SP-RR: round-robin allocation • SP-80: 80% of pages allocated locally, 20% spread across tier-1 cores

  15. Performance on single-threaded workloads • Decreased sharing = higher miss rate • Decreased sharing = less on-chip traffic

  16. Performance on multiprogrammed workloads • SP40-CS: SP40 with controlled spreading, which keeps unrelated pages from being spread onto cores that hold the target application's data • The synthetic benchmarks generate low, mid, or high traffic • SP40 usually performs better under high traffic, but its performance is similar to SL under low traffic • Not shown here, but SP40 reduces on-chip network traffic by 50% (compared to SL)

  17. Performance on parallel workloads • VM: virtual multicore with round-robin page allocation across participating cores • For most benchmarks there is no real difference between the policies • lu and ocean have higher L1 miss rates, so the L2 cache policy has a greater effect on their performance – here VM outperforms the rest

  18. Related Issues • Remember NUMA? Those systems used a page scanner that maintained reference counters and generated page faults to let the OS take some control • In CC-NUMA, hardware-based counters informed OS decisions • Big difference: NUMA deals with main memory, while the OS-level page allocation presented here deals with distributed L2 caches

  19. Conclusion • Page allocation allows for a very simple shared cache architecture, but how can we use advances in architecture to our benefit? • Architecture can provide more detailed information about the current state of the cores • CMP-NuRAPID, victim replication, cooperative caching • Can we also apply other OS-level techniques? • Page coloring and page recoloring • We are trading hardware complexity for software complexity – where is the right balance?
