Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh
Multicore distributed L2 caches
[Figure: a tile containing a processor core, a router, and a local L2 cache slice]
• L2 caches are typically sub-banked and distributed
  • IBM Power4/5: 3 banks
  • Sun Microsystems T1: 4 banks
  • Intel Itanium2 (L3): many “sub-arrays”
• (Distributed L2 caches + switched NoC) → NUCA
• Hardware-based management schemes
  • Private caching
  • Shared caching
  • Hybrid caching
Private and shared caching
• Private caching:
  • short hit latency (always local)
  • high on-chip miss rate
  • long miss resolution time
  • complex coherence enforcement
• Shared caching:
  • low on-chip miss rate
  • straightforward data location
  • simple coherence (no replication)
  • long average hit latency
Other approaches
• Hybrid/flexible schemes
  • “Core clustering” [Speight et al., ISCA 2005]
  • “Flexible CMP cache sharing” [Huh et al., ICS 2004]
  • “Flexible bank mapping” [Liu et al., HPCA 2004]
• Improving shared caching
  • “Victim replication” [Zhang and Asanovic, ISCA 2005]
• Improving private caching
  • “Cooperative caching” [Chang and Sohi, ISCA 2006]
  • “CMP-NuRAPID” [Chishti et al., ISCA 2005]
Motivation
[Figure: the hit latency vs. miss rate trade-off between private and shared caching]
What is the optimal balance between miss rate and hit latency?
Talk roadmap
• Data mapping, a key property [Cho and Jin, MICRO 2006]
• Two-dimensional (2D) page coloring algorithm
• Evaluation and results
• Conclusion and future work
Data mapping
• Data mapping: memory data location in the L2 cache
• Private caching
  • Data mapping determined by program location
  • Mapping created at miss time
  • No explicit control
• Shared caching
  • Data mapping determined by address: slice number = (block address) % N_slice
  • Mapping is static
  • No explicit control
Change mapping granularity
• Block granularity: slice number = (block address) % N_slice
• Page granularity: slice number = (page address) % N_slice
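As a concrete illustration of the two granularities, a minimal Python sketch follows. The 64-byte block and 4KB page sizes and the function names are assumptions; only the 16-slice count matches the CMP evaluated later in the talk.

```python
# Sketch (not from the slides): choosing the home L2 slice at block vs. page
# granularity. Block interleaving spreads a page across slices; page
# interleaving keeps a whole page in one slice, which the OS can then steer.

N_SLICE = 16       # number of L2 slices (16-core tile-based CMP)
BLOCK_BITS = 6     # 64-byte cache blocks (assumed)
PAGE_BITS = 12     # 4KB pages (assumed)

def slice_at_block_granularity(phys_addr: int) -> int:
    """slice number = (block address) % N_slice"""
    return (phys_addr >> BLOCK_BITS) % N_SLICE

def slice_at_page_granularity(phys_addr: int) -> int:
    """slice number = (page address) % N_slice; every block of the page
    lands in the same slice, so the physical page number (color) decides
    placement."""
    return (phys_addr >> PAGE_BITS) % N_SLICE
```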
OS-controlled page mapping
[Figure: the OS page allocator maps each program’s virtual pages to physical memory pages, and thereby to L2 cache slices]
2D page coloring: the problem
[Figure: a page can be colored to any tile around the accessing core P]
• Network latency = 3 cycles per hop; memory latency = 300 cycles
• Cost(color) = (#accesses × #hops × 3 cycles) + (#misses × 300 cycles)
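A small sketch of this cost model, using the slide’s 3-cycle hop and 300-cycle memory latencies; the per-color access, hop, and miss counts are assumed to come from a profiling pass and the helper name is hypothetical.

```python
# Sketch of the per-color placement cost for one page.
HOP_DELAY = 3        # cycles per network hop (slide example)
MEM_LATENCY = 300    # cycles per off-chip miss (slide example)

def placement_cost(n_accesses: int, n_hops: int, n_misses: int) -> int:
    """Cost(color) = (#accesses * #hops * 3 cycles) + (#misses * 300 cycles)."""
    return n_accesses * n_hops * HOP_DELAY + n_misses * MEM_LATENCY

# Example trade-off for a page accessed 1000 times: a nearby color with more
# conflicts vs. a distant color with fewer conflicts.
near = placement_cost(n_accesses=1000, n_hops=1, n_misses=50)   # 18,000 cycles
far  = placement_cost(n_accesses=1000, n_hops=4, n_misses=10)   # 15,000 cycles
# Here the distant color wins despite the longer hit latency.
```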
2D coloring algorithm
• Collect an L2 reference trace (example: references to pages A, B, B, C)
• Derive conflict information [Sherwood et al., ICS 1999]
2D coloring algorithm (cont’d)
• Derive conflict information by walking the trace (A, B, B, C) with two per-page structures
  • Reference matrix: entry [x][y] = 1 means page y has been referenced since the last reference to page x
  • Conflict matrix: entry [x][y] counts how many times y was referenced between consecutive references to x
• Reference 1 (page A): row A is empty, so no conflicts; A is then marked in rows B and C
• Reference 2 (page B): row B contains A, so Conflict[B][A] += 1; row B is cleared and B is marked in rows A and C
• Reference 3 (page B): row B is empty, so no new conflicts
• Reference 4 (page C): row C contains A and B, so Conflict[C][A] += 1 and Conflict[C][B] += 1
• Final state for the example trace:
  • Reference matrix: A = [0 1 1], B = [0 0 1], C = [0 0 0]
  • Conflict matrix: A = [0 0 0], B = [1 0 0], C = [1 1 0]
  • Access counter: A = 1, B = 2, C = 1
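The Python sketch below reproduces this derivation for the toy trace. The function and variable names are hypothetical, and a real profiler would additionally restrict conflicts to pages competing for the same cache sets.

```python
from collections import defaultdict

def derive_conflicts(trace):
    """Walk an L2 page-reference trace and count, for each page, how often
    other pages were referenced between consecutive references to it."""
    conflict = defaultdict(lambda: defaultdict(int))  # conflict[x][y]
    accesses = defaultdict(int)
    touched_since = {}   # touched_since[x]: pages referenced since x's last reference
    seen = []            # distinct pages in order of first reference
    for page in trace:
        if page not in touched_since:
            # First reference: everything referenced so far conflicts with us once.
            touched_since[page] = set(seen)
            seen.append(page)
        accesses[page] += 1
        for other in touched_since[page]:
            conflict[page][other] += 1          # fold reference-matrix row into conflicts
        touched_since[page].clear()             # ...and clear the row
        for other in touched_since:
            if other != page:
                touched_since[other].add(page)  # mark this page in the other rows
    return conflict, accesses

conflict, accesses = derive_conflicts(["A", "B", "B", "C"])
# conflict["B"]["A"] == 1; conflict["C"]["A"] == 1; conflict["C"]["B"] == 1
# accesses == {"A": 1, "B": 2, "C": 1}  -- matching the final state above
```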
2D coloring algorithm (cont’d)
• 2D page coloring combines the conflict matrix (#Conflict per color) and the access counter (#Access per page):
  • Cost(color, page#) = α × #Conflict(color) × mem_latency + (1 − α) × #Access × #hop(color) × hop_delay
  • Optimal color(page#) = {C | Cost(C, page#) = MIN[Cost(color, page#)] over all colors}
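A sketch of this selection step, assuming the per-color conflict counts, the page’s access count, and a hop-distance table are available from the profiling pass; those inputs and the function names are hypothetical.

```python
MEM_LATENCY = 300   # cycles (slide example)
HOP_DELAY = 3       # cycles per network hop (slide example)

def color_cost(alpha, n_conflicts, n_accesses, n_hops):
    """Cost(color, page) = alpha * #Conflict(color) * mem_latency
                         + (1 - alpha) * #Access * #hop(color) * hop_delay"""
    return (alpha * n_conflicts * MEM_LATENCY
            + (1 - alpha) * n_accesses * n_hops * HOP_DELAY)

def optimal_color(page_accesses, conflicts_by_color, hops_by_color, alpha):
    """Pick the color (slice) with the minimum cost for this page."""
    return min(conflicts_by_color,
               key=lambda c: color_cost(alpha,
                                        conflicts_by_color[c],
                                        page_accesses,
                                        hops_by_color[c]))

# Example with 4 candidate colors for one page accessed 200 times:
best = optimal_color(page_accesses=200,
                     conflicts_by_color={0: 8, 1: 2, 2: 0, 3: 5},
                     hops_by_color={0: 1, 1: 2, 2: 4, 3: 3},
                     alpha=1/64)
# best == 0 here: at this small alpha the nearby color wins despite more conflicts.
```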
Experimental setup
[Flow: trace profiling → 2D coloring (tuning α) → page mapping → timing simulation]
• Experiments were carried out using a simulator derived from the SimpleScalar toolset
• The simulator models a 16-core tile-based CMP
• Each core has private 32KB I/D L1 caches and a 256KB slice of the globally shared L2 (4MB total)
Optimal page mapping
[Figure: distribution of optimal page placements (# of pages vs. tile x/y position) for gcc, shown for α = 1/64 and α = 1/256]
Access distribution
[Figure: access distribution for α ranging from 1/32 to 1/2048]
Conclusions
• With careful data placement, there is large room for performance improvement
• Dynamic mapping schemes, assisted by hardware-provided information, could achieve similar performance improvements
• The method can also be applied to other optimization targets
Current and future work
• Dynamic mapping schemes
  • Performance
  • Power
• Multiprogrammed and parallel workloads
Private caching
• On an L1 miss, the local L2 slice is accessed
  • Hit: data is returned locally
  • Miss: the directory is accessed; if a copy is on chip it is fetched from the remote slice, otherwise it is a global miss
• short hit latency (always local)
• high on-chip miss rate
• long miss resolution time
• complex coherence enforcement
Shared caching
• On an L1 miss, the home L2 slice (determined by the address) is accessed
  • Hit: data is returned, possibly across the network
  • Miss: the request goes to memory
• low on-chip miss rate
• straightforward data location
• simple coherence (no replication)
• long average hit latency
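For clarity, a simplified and hypothetical sketch contrasting the two lookup flows from these two slides; real protocols track coherence states and races, and the directory and memory objects here are stand-ins.

```python
def private_lookup(addr, my_core, local_l2, directory, memory):
    """Private caching: always probe the local slice first (short hit latency),
    then the directory for an on-chip copy, otherwise go off chip."""
    if addr in local_l2[my_core]:
        return "local hit"
    owner = directory.get(addr)                 # is a copy cached on chip?
    if owner is not None:
        return f"remote hit in slice {owner}"   # extra directory + network latency
    memory.fetch(addr)                          # global miss: long miss resolution time
    return "global miss"

def shared_lookup(addr, n_slices, slices, memory):
    """Shared caching: the block address picks a single home slice
    (no replication), which may be several hops away."""
    home = (addr >> 6) % n_slices               # 64-byte blocks assumed
    if addr in slices[home]:
        return f"hit in home slice {home}"
    memory.fetch(addr)
    return "global miss"
```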
Performance
[Figure: performance improvement over shared caching; labeled values up to 141% (y-axis to 150%)]