
Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring

Lei Jin and Sangyeun Cho, Dept. of Computer Science, University of Pittsburgh.





Presentation Transcript


  1. Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Lei Jin and Sangyeun Cho Dept. of Computer Science University of Pittsburgh

  2. Multicore distributed L2 caches (figure: per-tile processor core, router, and local L2 cache) • L2 caches typically sub-banked and distributed • IBM Power4/5: 3 banks • Sun Microsystems T1: 4 banks • Intel Itanium 2 (L3): many “sub-arrays” • Distributed L2 caches + switched NoC → NUCA • Hardware-based management schemes • Private caching • Shared caching • Hybrid caching

  3. Private and shared caching • Private caching: • (+) short hit latency (always local) • (−) high on-chip miss rate • (−) long miss resolution time • (−) complex coherence enforcement • Shared caching: • (+) low on-chip miss rate • (+) straightforward data location • (+) simple coherence (no replication) • (−) long average hit latency

  4. Other approaches • Hybrid/flexible schemes • “Core clustering” [Speight et al., ISCA 2005] • “Flexible CMP cache sharing” [Huh et al., ICS 2004] • “Flexible bank mapping” [Liu et al., HPCA 2004] • Improving shared caching • “Victim replication” [Zhang and Asanovic, ISCA 2005] • Improving private caching • “Cooperative caching” [Chang and Sohi, ISCA 2006] • “CMP-NuRAPID” [Chishti et al., ISCA 2005]

  5. Motivation (figure: hit latency vs. miss rate trade-off) What is the optimal balance between miss rate and hit latency?

  6. Talk roadmap • Data mapping, a key property [Cho and Jin, MICRO 2006] • Two-dimensional (2D) page coloring algorithm • Evaluation and results • Conclusion and future work

  7. Data mapping • Data mapping: memory data → location in L2 cache • Private caching • Data mapping determined by program location • Mapping created at miss time • No explicit control • Shared caching • Data mapping determined by address • slice number = (block address) mod N_slice • Mapping is static • No explicit control

  8. Change mapping granularity • Block granularity: slice number = (block address) mod N_slice • Page granularity: slice number = (page address) mod N_slice
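The two granularities above can be sketched as follows; the block size, page size, and slice count are illustrative assumptions, not values from the talk:

```python
# Hypothetical parameters: 64B cache blocks, 4KB pages, 16 L2 slices.
BLOCK_SIZE = 64
PAGE_SIZE = 4096
N_SLICE = 16

def slice_block_granularity(addr):
    # Shared caching: consecutive blocks interleave across slices.
    return (addr // BLOCK_SIZE) % N_SLICE

def slice_page_granularity(addr):
    # Page granularity: every block of a page maps to the same slice,
    # so the OS can steer a page by choosing its physical page number.
    return (addr // PAGE_SIZE) % N_SLICE
```

With page granularity, two addresses 64 bytes apart land in the same slice, whereas block granularity would spread them over adjacent slices.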

  9. OS-controlled page mapping (figure: the virtual address spaces of Program 1 and Program 2 are mapped to memory pages in the physical address space by OS page allocation)

  10. 2D page coloring: the problem (figure: page P and candidate pages across the tile grid) • Network latency per hop = 3 cycles • Memory latency = 300 cycles • Cost(color) = (#access × #hops × 3 cycles) + (#miss × 300 cycles)
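As a rough illustration of this cost model, using the slide's latencies but made-up access, miss, and distance counts:

```python
HOP_DELAY = 3      # cycles per network hop (slide 10)
MEM_LATENCY = 300  # cycles per off-chip miss (slide 10)

def page_cost(n_access, n_miss, hops):
    # Network cost of reaching the chosen slice, plus off-chip miss cost.
    return n_access * hops * HOP_DELAY + n_miss * MEM_LATENCY

# e.g. 100 accesses to a slice two hops away, with 5 misses:
# 100 * 2 * 3 + 5 * 300 = 2100 cycles
```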

  11. 2D coloring algorithm • Collect L2 reference trace • Derive conflict information [Sherwood et al., ICS 1999] • Example trace: Reference 1 → Page A, Reference 2 → Page B, Reference 3 → Page B, Reference 4 → Page C

  12. 2D coloring algorithm (cont’d) • Derive conflict information • Processing Reference 1 (Page A); initial state: Reference Matrix (rows/columns A, B, C) all zeros; Conflict Matrix all zeros

  13. 2D coloring algorithm (cont’d) • Derive conflict information • After Reference 1 (Page A), column A is set in rows B and C: Reference Matrix A: 0 0 0, B: 1 0 0, C: 1 0 0; Conflict Matrix all zeros

  14. 2D coloring algorithm (cont’d) • Derive conflict information • Processing Reference 2 (Page B); matrices so far: Reference Matrix A: 0 0 0, B: 1 0 0, C: 1 0 0; Conflict Matrix all zeros

  15. 2D coloring algorithm (cont’d) • Derive conflict information • After Reference 2 (Page B), column B is set (+1) in rows A and C: Reference Matrix A: 0 1 0, B: 1 0 0, C: 1 1 0; Conflict Matrix all zeros

  16. 2D coloring algorithm (cont’d) • Derive conflict information • Reference 3 re-references Page B: row B is flushed into the Conflict Matrix and cleared: Reference Matrix A: 0 1 0, B: 0 0 0, C: 1 1 0; Conflict Matrix B-A: 1

  17. 2D coloring algorithm (cont’d) • Derive conflict information • Processing Reference 4 (Page C); matrices so far: Reference Matrix A: 0 1 0, B: 0 0 0, C: 1 1 0; Conflict Matrix B-A: 1

  18. 2D coloring algorithm (cont’d) • Derive conflict information • After Reference 4 (Page C), column C is set (+1) in rows A and B: Reference Matrix A: 0 1 1, B: 0 0 1, C: 1 1 0; Conflict Matrix B-A: 1

  19. 2D coloring algorithm (cont’d) • 2D page coloring • Profiling result for the trace A, B, B, C: Reference Matrix A: 0 1 1, B: 0 0 1, C: 0 0 0; Conflict Matrix B-A: 1, C-A: 1, C-B: 1; Access Counter A: 1, B: 2, C: 1
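The per-reference bookkeeping illustrated on slides 11-19 can be sketched as below. The update rule is inferred from the slide animation (a page's pending row is flushed into the conflict matrix when the page is re-referenced) and may differ in detail from the authors' actual implementation:

```python
def derive_conflicts(trace):
    """Offline pass over an L2 reference trace of page names."""
    pages = sorted(set(trace))
    # ref[x][y] = 1 means: y was referenced since the last reference to x
    ref = {x: {y: 0 for y in pages} for x in pages}
    conflict = {x: {y: 0 for y in pages} for x in pages}
    access = {x: 0 for x in pages}

    for p in trace:
        if access[p] > 0:
            # Re-reference: every page touched since the last reference
            # to p conflicts with p; flush p's row and clear it.
            for q in pages:
                conflict[p][q] += ref[p][q]
                ref[p][q] = 0
        access[p] += 1
        for x in pages:
            if x != p:
                ref[x][p] = 1  # p referenced since x's last reference
    return conflict, access

conflict, access = derive_conflicts(["A", "B", "B", "C"])
# conflict["B"]["A"] == 1; access == {"A": 1, "B": 2, "C": 1}
```

This reproduces the B-A conflict counted at Reference 3 and the access counts on slide 19; the additional C-A/C-B entries on slide 19 suggest an end-of-profiling flush step that the slides do not spell out.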

  20. 2D coloring algorithm (cont’d) • 2D page coloring • Inputs: Conflict Matrix (B-A: 1, C-A: 1, C-B: 1) and Access Counter (A: 1, B: 2, C: 1) • Cost(color, page#) = α × (#Conflict(color) / #Access) × mem latency + (1 − α) × #hop(color) × hop delay • Optimal color(page#) = {C | Cost(C) = MIN[Cost(color, page#)] over all colors}
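A sketch of this color-selection step; ALPHA and the per-color conflict and hop numbers below are illustrative assumptions, while the latencies follow slide 10:

```python
ALPHA = 1 / 64      # miss-cost weight (one value swept in the talk)
MEM_LATENCY = 300   # cycles per off-chip miss
HOP_DELAY = 3       # cycles per network hop

def cost(n_conflict, n_access, hops):
    # Alpha-weighted miss cost plus (1 - alpha)-weighted network cost.
    miss_term = ALPHA * (n_conflict / n_access) * MEM_LATENCY
    net_term = (1 - ALPHA) * hops * HOP_DELAY
    return miss_term + net_term

def optimal_color(conflicts, n_access, hops):
    """conflicts[c]: #Conflict if the page gets color c; hops[c]: distance
    from the accessing core to slice c. Returns the cheapest color."""
    return min(conflicts, key=lambda c: cost(conflicts[c], n_access, hops[c]))
```

With a small alpha the network term dominates, pulling pages toward the local slice; a larger alpha trades locality for fewer conflicts, which is exactly the balance the talk tunes.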

  21. Experimental setup (flow: trace profiling → 2D coloring → page mapping → timing simulation, with tuning of α) • Experiments were carried out using a simulator derived from the SimpleScalar toolset • The simulator models a 16-core tile-based CMP • Each core has private 32KB I/D L1 caches and a 256KB slice of the globally shared L2 cache (4MB total)

  22. Optimal page mapping (figures: number of pages mapped to each (x, y) tile for gcc, with α = 1/64 and α = 1/256)

  23. Access distribution (figure: α from 1/32 to 1/2048)

  24. Relative performance (figure)

  25. Value of α (figure)

  26. Conclusions • With careful data placement, there is large room for performance improvement • Dynamic mapping schemes assisted by hardware-provided information could achieve similar performance improvement • This method can also be applied to other optimization targets

  27. Current and future work • Dynamic mapping schemes • Performance • Power • Multiprogrammed and parallel workloads

  28. Thank you & Questions?

  29. Private caching (flow: L1 miss → local L2 access → hit, or miss → access directory → copy on chip or global miss) • (+) short hit latency (always local) • (−) high on-chip miss rate • (−) long miss resolution time • (−) complex coherence enforcement

  30. Shared caching (flow: L1 miss → L2 access → hit or miss) • (+) low on-chip miss rate • (+) straightforward data location • (+) simple coherence (no replication) • (−) long average hit latency

  31. Performance (figure: performance improvement over shared caching; peaks of 141% and 150%)
