SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors Lei Jin and Sangyeun Cho Dept. of Computer Science University of Pittsburgh
Chip Multiprocessor Development • The end of performance scaling in uniprocessors has turned researchers toward chip multiprocessor architectures • The number of cores is increasing at a fast pace (Source: Wikipedia)
The CMP Cache • A CMP = N cores + one (coherent) cache system • How can one cache system sustain the growth of N cores? • Non-Uniform Cache Architecture (NUCA) • Shared cache scheme vs. private cache scheme [Figure: tiled CMP organization; each tile contains a core, L1 I/D cache, an L2 cache slice, a directory, and a router]
Hybrid Cache Schemes • Victim Replication [Zhang and Asanovic ISCA '05] • Adaptive Selective Replication [Beckmann et al. MICRO '06] • CMP-NuRAPID [Chishti et al. ISCA '05] • Cooperative Caching [Chang and Sohi ISCA '06] • R-NUCA [Hardavellas et al. ISCA '09] • Problems with hardware-based schemes: • Hardware complexity • Limited scalability
The Challenge • CMPs make the core count scalable • A cache system with scalable performance is critical in CMPs • Existing hardware-based schemes fall short of this goal • We propose a Software-Oriented Shared (SOS) cache management approach: • Minimal hardware support • Good scalability
Our Contributions • We studied access patterns in multithreaded workloads and found they can be exploited to improve locality • We proposed the SOS scheme, which offloads cache management work from hardware to offline software analysis • We evaluated our scheme and showed that it is a promising approach
Outline • Motivation • Observation in access patterns • SOS scheme • Evaluation results • Conclusions
Observation • L2 cache access distribution of Cholesky [Figure: cumulative percentage of accesses vs. sharer count; one curve counts accesses to blocks shared by a given number of threads or fewer during the whole execution, the other counts accesses to blocks shared by that many threads or fewer simultaneously]
Observation • L2 cache accesses are skewed toward the two extremes [Figure: cumulative percentage of accesses vs. sharer count; roughly 50% of accesses go to highly shared data and roughly 30% to private data]
Access Patterns • Static data vs. dynamic data • Static data: location and size are known prior to execution (e.g. global data) • Dynamic data: location and size vary among executions, but patterns may persist (e.g. data allocated by malloc(), stack data) • Dynamic data is more important than static data • Common access patterns for dynamic data are: • Even partition • Scattered • Dominant owner • Shared
Even Partition Pattern • A contiguous memory area is partitioned evenly among threads • Main thread: Array = malloc(sizeof(int) * NumProc * N); • Thread[ProcNo]: for(i = 0; i < N; i++) Array[ProcNo * N + i] = x; [Figure: the array split into contiguous regions owned by T0, T1, T2, T3]
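A minimal runnable sketch of the even-partition pattern, written here with pthreads; the constant names (NUM_PROC, N) and the worker function are illustrative and not taken from the SOS paper.

#include <pthread.h>
#include <stdlib.h>

#define NUM_PROC 4
#define N 1024

static int *Array;  /* one contiguous area, partitioned evenly among threads */

static void *worker(void *arg) {
    long proc_no = (long)arg;
    /* Thread proc_no touches only its own contiguous slice of the array. */
    for (int i = 0; i < N; i++)
        Array[proc_no * N + i] = (int)proc_no;
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_PROC];
    Array = malloc(sizeof(int) * NUM_PROC * N);  /* allocated once by the main thread */
    for (long p = 0; p < NUM_PROC; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < NUM_PROC; p++)
        pthread_join(tid[p], NULL);
    free(Array);
    return 0;
}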
Scattered Pattern • Memory areas are not contiguous, but each is owned by one thread • Main thread: ArrayPtr = malloc(sizeof(int*) * NumProc); for(i = 0; i < NumProc; i++) ArrayPtr[i] = malloc(sizeof(int) * Size[i]); • Thread[ProcNo]: for(i = 0; i < Size[ProcNo]; i++) ArrayPtr[ProcNo][i] = i; [Figure: per-thread areas T0–T3 separated by gaps in the address space]
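A runnable sketch of the scattered pattern under the same assumptions as above (pthreads; NUM_PROC and the per-thread sizes are illustrative): the main thread performs several separate allocations, so the areas are not contiguous, yet each area is accessed by exactly one thread.

#include <pthread.h>
#include <stdlib.h>

#define NUM_PROC 4

static int **ArrayPtr;                              /* one separately allocated area per thread */
static int Size[NUM_PROC] = {256, 512, 128, 1024};  /* illustrative per-thread sizes */

static void *worker(void *arg) {
    long proc_no = (long)arg;
    /* Each thread writes only the area it owns. */
    for (int i = 0; i < Size[proc_no]; i++)
        ArrayPtr[proc_no][i] = i;
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_PROC];
    /* All allocations happen in the main thread, so the areas are scattered. */
    ArrayPtr = malloc(sizeof(int *) * NUM_PROC);
    for (int p = 0; p < NUM_PROC; p++)
        ArrayPtr[p] = malloc(sizeof(int) * Size[p]);
    for (long p = 0; p < NUM_PROC; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < NUM_PROC; p++) {
        pthread_join(tid[p], NULL);
        free(ArrayPtr[p]);
    }
    free(ArrayPtr);
    return 0;
}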
Other Patterns • Dominant owner: data are accessed by multiple threads, but one thread contributes significantly more accesses than the others • Shared: data are widely shared among threads
Outline • Motivation • Observation in access patterns • SOS scheme • Evaluation results • Conclusions
SOS Scheme • The SOS scheme consists of 3 components: [Figure: workflow; L2 cache access profiling and page clustering & pattern recognition form a one-time offline analysis, while page coloring and replication are applied at run time]
Page Clustering • We take a machine-learning-based approach: [Figure: per-thread L2 cache access traces (T0–T3) are converted into per-page histograms; K-means clustering with centroids C0 (1, 0, 0, 0), C1 (0, 1, 0, 0), C2 (0, 0, 1, 0), C3 (0, 0, 0, 1), C4 (1, 1, 1, 1) groups the pages of each dynamic area; pattern recognition then emits a hint, e.g. Even Partition for the allocation at main.c:123]
Pattern Recognition • Assume a dynamic area consists of 8 pages (P0–P7) • K-means clustering starts from the initial centroids C0 (1, 0, 0, 0), C1 (0, 1, 0, 0), C2 (0, 0, 1, 0), C3 (0, 0, 0, 1), C4 (1, 1, 1, 1): pages accessed mostly by one thread gather around C0–C3, and highly shared pages around C4 • The clustering result is compared against the ideal even partition (P0–P1, P2–P3, P4–P5, P6–P7) to recognize the pattern [Figure: example clustering of the 8 pages and the comparison with the ideal partition]
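The core of the clustering step is assigning each page's per-thread access histogram to the nearest centroid. The sketch below shows only that assignment step, not the authors' implementation: it assumes histograms are scaled so the busiest thread maps to 1.0 (which makes the fixed centroids above directly comparable), and it omits the K-means iterations and the comparison against ideal partitions.

#include <math.h>
#include <stdio.h>

#define NUM_THREADS   4
#define NUM_CENTROIDS 5   /* C0..C3: owned by one thread, C4: shared by all */

/* Initial centroids from the slides. */
static const double centroid[NUM_CENTROIDS][NUM_THREADS] = {
    {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}, {1, 1, 1, 1}
};

/* Assign a page to the nearest centroid, given its per-thread L2 access counts.
 * Counts are scaled so the busiest thread maps to 1.0 (an assumption made here
 * so the fixed centroids are meaningful regardless of absolute access volume). */
static int classify_page(const unsigned long hist[NUM_THREADS])
{
    double max = 0.0, norm[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        if ((double)hist[t] > max) max = (double)hist[t];
    for (int t = 0; t < NUM_THREADS; t++)
        norm[t] = max > 0.0 ? hist[t] / max : 0.0;

    int best = 0;
    double best_dist = INFINITY;
    for (int c = 0; c < NUM_CENTROIDS; c++) {
        double dist = 0.0;
        for (int t = 0; t < NUM_THREADS; t++) {
            double d = norm[t] - centroid[c][t];
            dist += d * d;
        }
        if (dist < best_dist) { best_dist = dist; best = c; }
    }
    return best;   /* 0..3: page belongs to that thread's tile, 4: highly shared */
}

int main(void)
{
    unsigned long private_page[NUM_THREADS] = {950, 20, 10, 20};
    unsigned long shared_page[NUM_THREADS]  = {400, 380, 410, 390};
    printf("private-looking page -> C%d\n", classify_page(private_page));
    printf("shared-looking page  -> C%d\n", classify_page(shared_page));
    return 0;
}

With this scaling, a page touched almost exclusively by thread 0 lands in C0, while a page accessed roughly equally by all four threads lands in C4.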
Hint Representation & Utilization • For dynamic data, a pattern type is associated with every dynamic allocation call site: [FileName, Line#, Pattern Type] • For static data, the page location is given explicitly: [Virtual Page Num, Tile ID] • SOS data management policy: • The pattern type is translated into an actual partition once the OS knows the location and size of the dynamic area • A page's location is assigned on demand if the partition information (hint) is available • Data without a corresponding hint are treated as highly shared and distributed at block level • Data replication is enabled for shared data
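As an illustration of how the two hint forms above might be encoded, here is a hypothetical C representation; the struct and field names are our own, not from the SOS implementation.

/* Hypothetical encodings of the two hint forms; names are illustrative only. */

enum pattern_type { EVEN_PARTITION, SCATTERED, DOMINANT_OWNER, SHARED };

/* Dynamic data: the hint is keyed by the allocation call site and carries only
 * a pattern type; the OS turns it into an actual partition once it learns the
 * location and size of the allocated area. */
struct dynamic_hint {
    const char       *file_name;   /* e.g. "main.c" */
    int               line_no;     /* e.g. 123      */
    enum pattern_type pattern;
};

/* Static data: the page-to-tile mapping is known offline and given explicitly. */
struct static_hint {
    unsigned long virtual_page_num;
    int           tile_id;
};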
Architectural Support • To allow flexible data placement in the L2 cache, we add two fields, TID and BIN, to each page table entry and TLB entry [Jin and Cho CMP-MSI '07, Cho and Jin MICRO '06] • The OS is responsible for providing TID and BIN • Main memory is accessed as before, using the translated physical page address • The L2 cache addressing mode depends on the values of TID and BIN [Figure: a TLB entry holding P, Virtual Page Number, Physical Page Number (used to form the physical address for main memory access), and TID/BIN (used to locate the page in the L2 cache)]
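A rough sketch of what the extended TLB entry and the resulting L2 tile selection could look like; the field widths, and the reading of BIN as a flag that selects block-level interleaving for hint-less shared data, are assumptions on our part rather than details given in the slides.

#include <stdint.h>

/* Sketch of a TLB entry extended with TID and BIN (field widths are illustrative). */
struct tlb_entry {
    uint64_t present           : 1;   /* P bit */
    uint64_t tid               : 4;   /* home tile in a 16-tile CMP */
    uint64_t bin               : 1;   /* assumed: 1 = spread page across tiles at block granularity */
    uint64_t virtual_page_num  : 52;
    uint64_t physical_page_num : 52;  /* used unchanged to form the main-memory address */
};

/* Choose the L2 tile for an access. Pages with a placement hint live on their
 * home tile (tid); hint-less, highly shared pages are interleaved across all
 * tiles at cache-block granularity (the assumed meaning of bin). */
static inline int l2_home_tile(const struct tlb_entry *e,
                               uint64_t block_addr, int num_tiles)
{
    return e->bin ? (int)(block_addr % (uint64_t)num_tiles) : (int)e->tid;
}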
Outline • Motivation • Observations in access patterns • SOS scheme • Evaluation results • Conclusions
Experiment Setup • We use a Simics-based memory simulator modeling a 16-tile CMP with a 4x4 2D mesh on-chip network • Each core has a 2-issue in-order pipeline with private L1 I/D caches and an L2 cache slice • Programs from the SPLASH-2 and PARSEC suites are used as benchmarks with 3 different input sizes • The small input set is used to profile and generate hints, while the medium and large input sets are used to evaluate SOS performance • For brevity, we present results for 4 representative programs (barnes, lu, cholesky, swaptions) and the overall average of 14 programs
Hint Accuracy • Accuracy is measured as the percentage of pages placed in the tile that issues the most accesses to them [Figure: hint accuracy with small and medium inputs]
Breakdown of L2 Cache Accesses • Patterns vary across programs • A large percentage of L2 accesses can be handled by page placement • Shared data are evenly distributed across tiles and handled by replication
Remote Access Comparison • Hint-guided data placement significantly reduces the number of remote cache accesses • Our SOS scheme removes nearly 87% of remote accesses!
Execution Time • Hint-guided data placement tracks private cache performance closely • SOS performs nearly 20% better than the shared cache scheme
Related Work • Lu et al. PACT '09 • Analyzes array accesses and performs data layout transformations to improve data affinity • Marathe and Mueller PPoPP '06 • Profiles a truncated run of the program before every execution • Derives optimal page locations from the sampled access trace • Optimizes data locality for cc-NUMA machines • Hardavellas et al. ISCA '09 • Dynamically identifies private and shared pages • Uses private mapping for private pages and fine-grained broadcast mapping for shared pages • Focuses on server workloads
Conclusions • We propose a software-oriented approach to shared cache management that controls data placement and replication • This is the first work on a software-managed distributed shared cache scheme for CMPs • We show that multithreaded programs exhibit data access patterns that can be exploited to improve data affinity • Our experiments demonstrate that software-oriented shared cache management is a promising approach • 19% performance improvement over the shared cache scheme
Future Work • Further study of more complex access patterns can expose additional benefits of our software-oriented cache management scheme • Extend the current scheme to server workloads, whose cache behavior differs substantially from that of scientific workloads
Hint Coverage • Hint coverage measures the percentage of L2 cache accesses that go to pages guided by SOS [Figure: hint coverage with small and medium inputs]