SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors Lei Jin and Sangyeun Cho Dept. of Computer Science University of Pittsburgh
Chip Multiprocessor Development • The end of performance scaling in uniprocessors has turned researchers toward chip multiprocessor architectures • The number of cores is increasing at a fast pace (Source: Wikipedia)
The CMP Cache • A CMP = N cores + one (coherent) cache system • How can one cache system sustain the growth of N cores? • Non-Uniform Cache Architecture (NUCA) • Shared cache scheme vs. private cache scheme [Figure: tiled CMP organization; each tile contains a core, L1 I/D cache, an L2 cache slice, a directory, and a router]
Hybrid Cache Schemes • Victim Replication [Zhang and Asanovic ISCA '05] • Adaptive Selective Replication [Beckmann et al. MICRO '06] • CMP-NuRAPID [Chishti et al. ISCA '05] • Cooperative Caching [Chang and Sohi ISCA '06] • R-NUCA [Hardavellas et al. ISCA '09] • Problems with hardware-based schemes: • Hardware complexity • Limited scalability
The Challenge • CMPs make the core count scalable • A cache system with scalable performance is critical in CMPs • Existing hardware-based schemes fall short of this goal • We propose a Software-Oriented Shared (SOS) cache management approach: • Minimal hardware support • Good scalability
Our Contributions • We studied access patterns in multithreaded workloads and found they can be exploited to improve locality • We proposed the SOS scheme, which offloads cache management work from hardware to offline software analysis • We evaluated our scheme and showed that it is a promising approach
Outline • Motivation • Observation in access patterns • SOS scheme • Evaluation results • Conclusions
Observation • L2 cache access distribution of Cholesky [Figure: cumulative percentage of accesses vs. sharer count; one curve counts accesses to blocks shared by a given number of threads or fewer during the whole execution, the other counts accesses to blocks shared by that many threads or fewer simultaneously]
Observation • L2 cache accesses are skewed toward the two extremes [Figure: cumulative percentage of accesses vs. sharer count; roughly 50% of accesses go to highly shared data and roughly 30% to private data]
Access Patterns • Static data vs. dynamic data • Static data: location and size are known prior to execution (e.g. global data) • Dynamic data: location and size vary among executions, but patterns may persist (e.g. data allocated by malloc(), stack data) • Dynamic data is more important than static data • Common access patterns for dynamic data are: • Even partition • Scattered • Dominant owner • Shared
Even Partition Pattern • A contiguous memory area is partitioned evenly among threads • Main thread: Array = malloc(sizeof(int) * NumProc * N); • Thread[ProcNo]: for(i = 0; i < N; i++) Array[ProcNo * N + i] = x; [Figure: the array split into contiguous regions owned by T0, T1, T2, T3]
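A minimal runnable sketch of the even-partition pattern, written here with pthreads; the constant names (NUM_PROC, N) and the worker function are illustrative and not taken from the SOS paper.

#include <pthread.h>
#include <stdlib.h>

#define NUM_PROC 4
#define N 1024

static int *Array;  /* one contiguous area, partitioned evenly among threads */

static void *worker(void *arg) {
    long proc_no = (long)arg;
    /* Thread proc_no touches only its own contiguous slice of the array. */
    for (int i = 0; i < N; i++)
        Array[proc_no * N + i] = (int)proc_no;
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_PROC];
    Array = malloc(sizeof(int) * NUM_PROC * N);  /* allocated once by the main thread */
    for (long p = 0; p < NUM_PROC; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < NUM_PROC; p++)
        pthread_join(tid[p], NULL);
    free(Array);
    return 0;
}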
Scattered Pattern • Memory areas are not contiguous, but each is owned by one thread • Main thread: ArrayPtr = malloc(sizeof(int*) * NumProc); for(i = 0; i < NumProc; i++) ArrayPtr[i] = malloc(sizeof(int) * Size[i]); • Thread[ProcNo]: for(i = 0; i < Size[ProcNo]; i++) ArrayPtr[ProcNo][i] = i; [Figure: per-thread areas T0–T3 separated by gaps in the address space]
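A runnable sketch of the scattered pattern under the same assumptions as above (pthreads; NUM_PROC and the per-thread sizes are illustrative): the main thread performs several separate allocations, so the areas are not contiguous, yet each area is accessed by exactly one thread.

#include <pthread.h>
#include <stdlib.h>

#define NUM_PROC 4

static int **ArrayPtr;                              /* one separately allocated area per thread */
static int Size[NUM_PROC] = {256, 512, 128, 1024};  /* illustrative per-thread sizes */

static void *worker(void *arg) {
    long proc_no = (long)arg;
    /* Each thread writes only the area it owns. */
    for (int i = 0; i < Size[proc_no]; i++)
        ArrayPtr[proc_no][i] = i;
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_PROC];
    /* All allocations happen in the main thread, so the areas are scattered. */
    ArrayPtr = malloc(sizeof(int *) * NUM_PROC);
    for (int p = 0; p < NUM_PROC; p++)
        ArrayPtr[p] = malloc(sizeof(int) * Size[p]);
    for (long p = 0; p < NUM_PROC; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < NUM_PROC; p++) {
        pthread_join(tid[p], NULL);
        free(ArrayPtr[p]);
    }
    free(ArrayPtr);
    return 0;
}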
Other Patterns • Dominant owner: data are accessed by multiple threads, but one thread contributes significantly more accesses than the others • Shared: data are widely shared among threads
Outline • Motivation • Observation in access patterns • SOS scheme • Evaluation results • Conclusions
SOS Scheme • The SOS scheme consists of 3 components: [Figure: workflow; L2 cache access profiling and page clustering & pattern recognition form a one-time offline analysis, while page coloring and replication are applied at run time]
Page Clustering • We take a machine-learning-based approach: [Figure: per-thread L2 cache access traces (T0–T3) are converted into per-page histograms; K-means clustering with centroids C0 (1, 0, 0, 0), C1 (0, 1, 0, 0), C2 (0, 0, 1, 0), C3 (0, 0, 0, 1), C4 (1, 1, 1, 1) groups the pages of each dynamic area; pattern recognition then emits a hint, e.g. Even Partition for the allocation at main.c:123]
Pattern Recognition • Assume a dynamic area consists of 8 pages (P0–P7) • K-means clustering starts from the initial centroids C0 (1, 0, 0, 0), C1 (0, 1, 0, 0), C2 (0, 0, 1, 0), C3 (0, 0, 0, 1), C4 (1, 1, 1, 1): pages accessed mostly by one thread gather around C0–C3, and highly shared pages around C4 • The clustering result is compared against the ideal even partition (P0–P1, P2–P3, P4–P5, P6–P7) to recognize the pattern [Figure: example clustering of the 8 pages and the comparison with the ideal partition]
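The core of the clustering step is assigning each page's per-thread access histogram to the nearest centroid. The sketch below shows only that assignment step, not the authors' implementation: it assumes histograms are scaled so the busiest thread maps to 1.0 (which makes the fixed centroids above directly comparable), and it omits the K-means iterations and the comparison against ideal partitions.

#include <math.h>
#include <stdio.h>

#define NUM_THREADS   4
#define NUM_CENTROIDS 5   /* C0..C3: owned by one thread, C4: shared by all */

/* Initial centroids from the slides. */
static const double centroid[NUM_CENTROIDS][NUM_THREADS] = {
    {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}, {1, 1, 1, 1}
};

/* Assign a page to the nearest centroid, given its per-thread L2 access counts.
 * Counts are scaled so the busiest thread maps to 1.0 (an assumption made here
 * so the fixed centroids are meaningful regardless of absolute access volume). */
static int classify_page(const unsigned long hist[NUM_THREADS])
{
    double max = 0.0, norm[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        if ((double)hist[t] > max) max = (double)hist[t];
    for (int t = 0; t < NUM_THREADS; t++)
        norm[t] = max > 0.0 ? hist[t] / max : 0.0;

    int best = 0;
    double best_dist = INFINITY;
    for (int c = 0; c < NUM_CENTROIDS; c++) {
        double dist = 0.0;
        for (int t = 0; t < NUM_THREADS; t++) {
            double d = norm[t] - centroid[c][t];
            dist += d * d;
        }
        if (dist < best_dist) { best_dist = dist; best = c; }
    }
    return best;   /* 0..3: page belongs to that thread's tile, 4: highly shared */
}

int main(void)
{
    unsigned long private_page[NUM_THREADS] = {950, 20, 10, 20};
    unsigned long shared_page[NUM_THREADS]  = {400, 380, 410, 390};
    printf("private-looking page -> C%d\n", classify_page(private_page));
    printf("shared-looking page  -> C%d\n", classify_page(shared_page));
    return 0;
}

With this scaling, a page touched almost exclusively by thread 0 lands in C0, while a page accessed roughly equally by all four threads lands in C4.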
Hint Representation & Utilization • For dynamic data, a pattern type is associated with every dynamic allocation call site: [FileName, Line#, Pattern Type] • For static data, the page location is given explicitly: [Virtual Page Num, Tile ID] • SOS data management policy: • The pattern type is translated into an actual partition once the OS knows the location and size of the dynamic area • A page's location is assigned on demand if the partition information (hint) is available • Data without a corresponding hint are treated as highly shared and distributed at block level • Data replication is enabled for shared data
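As an illustration of how the two hint forms above might be encoded, here is a hypothetical C representation; the struct and field names are our own, not from the SOS implementation.

/* Hypothetical encodings of the two hint forms; names are illustrative only. */

enum pattern_type { EVEN_PARTITION, SCATTERED, DOMINANT_OWNER, SHARED };

/* Dynamic data: the hint is keyed by the allocation call site and carries only
 * a pattern type; the OS turns it into an actual partition once it learns the
 * location and size of the allocated area. */
struct dynamic_hint {
    const char       *file_name;   /* e.g. "main.c" */
    int               line_no;     /* e.g. 123      */
    enum pattern_type pattern;
};

/* Static data: the page-to-tile mapping is known offline and given explicitly. */
struct static_hint {
    unsigned long virtual_page_num;
    int           tile_id;
};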
Architectural Support • To allow flexible data placement in the L2 cache, we add two fields, TID and BIN, to each page table entry and TLB entry [Jin and Cho CMP-MSI '07, Cho and Jin MICRO '06] • The OS is responsible for providing TID and BIN • Main memory is accessed as before, using the translated physical page address • The L2 cache addressing mode depends on the values of TID and BIN [Figure: a TLB entry holding P, Virtual Page Number, Physical Page Number (used to form the physical address for main memory access), and TID/BIN (used to locate the page in the L2 cache)]
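A rough sketch of what the extended TLB entry and the resulting L2 tile selection could look like; the field widths, and the reading of BIN as a flag that selects block-level interleaving for hint-less shared data, are assumptions on our part rather than details given in the slides.

#include <stdint.h>

/* Sketch of a TLB entry extended with TID and BIN (field widths are illustrative). */
struct tlb_entry {
    uint64_t present           : 1;   /* P bit */
    uint64_t tid               : 4;   /* home tile in a 16-tile CMP */
    uint64_t bin               : 1;   /* assumed: 1 = spread page across tiles at block granularity */
    uint64_t virtual_page_num  : 52;
    uint64_t physical_page_num : 52;  /* used unchanged to form the main-memory address */
};

/* Choose the L2 tile for an access. Pages with a placement hint live on their
 * home tile (tid); hint-less, highly shared pages are interleaved across all
 * tiles at cache-block granularity (the assumed meaning of bin). */
static inline int l2_home_tile(const struct tlb_entry *e,
                               uint64_t block_addr, int num_tiles)
{
    return e->bin ? (int)(block_addr % (uint64_t)num_tiles) : (int)e->tid;
}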
Outline • Motivation • Observations in access patterns • SOS scheme • Evaluation results • Conclusions
Experiment Setup • We use a Simics-based memory simulator modeling a 16-tile CMP with a 4x4 2D mesh on-chip network • Each core has a 2-issue in-order pipeline with private L1 I/D caches and an L2 cache slice • Programs from the SPLASH-2 and PARSEC suites are used as benchmarks with 3 different input sizes • The small input set is used to profile and generate hints, while the medium and large input sets are used to evaluate SOS performance • For brevity, we present results for 4 representative programs (barnes, lu, cholesky, swaptions) and the overall average of 14 programs
Hint Accuracy • Accuracy is measured as the percentage of pages placed in the tile that issues the most accesses to them [Figure: hint accuracy with small and medium inputs]
Breakdown of L2 Cache Accesses • Patterns vary across programs • A large percentage of L2 accesses can be handled by page placement • Shared data are evenly distributed across tiles and handled by replication
Remote Access Comparison • Hint-guided data placement significantly reduces the number of remote cache accesses • Our SOS scheme removes nearly 87% of remote accesses!
Execution Time • Hint-guided data placement tracks private cache performance closely • SOS performs nearly 20% better than the shared cache scheme
Related Work • Lu et al. PACT '09 • Analyzes array accesses and performs data layout transformations to improve data affinity • Marathe and Mueller PPoPP '06 • Profiles a truncated run of the program before every execution • Derives optimal page locations from the sampled access trace • Optimizes data locality for cc-NUMA machines • Hardavellas et al. ISCA '09 • Dynamically identifies private and shared pages • Uses private mapping for private pages and fine-grained broadcast mapping for shared pages • Focuses on server workloads
Conclusions • We propose a software-oriented approach to shared cache management that controls data placement and replication • This is the first work on a software-managed distributed shared cache scheme for CMPs • We show that multithreaded programs exhibit data access patterns that can be exploited to improve data affinity • Our experiments demonstrate that software-oriented shared cache management is a promising approach • 19% performance improvement over the shared cache scheme
Future Work • Further study of more complex access patterns can expose additional benefits of our software-oriented cache management scheme • Extend the current scheme to server workloads, whose cache behavior differs substantially from that of scientific workloads
Hint Coverage • Hint coverage measures the percentage of L2 cache accesses that go to pages guided by SOS [Figure: hint coverage with small and medium inputs]