Shared Last-Level TLBs for Chip Multiprocessors Abhishek Bhattacharjee Daniel Lustig Margaret Martonosi HPCA 2011 Presented by: Apostolos Kotsiolis CS 7123 – Research Seminar
Contribution • SLL TLB design explored for the first time • Analyze SLL TLB benefits for parallel programs • Analyze multiprogrammed workloads consisting of sequential applications
Previous and Related Work • Private Multilevel TLB Hierarchies • Intel i7, AMD K7/K8/K10, SPARC64-III • No sharing between cores, so capacity is wasted • Inter-Core Cooperative (ICC) Prefetching • Two types of predictable misses: • Inter-Core Shared (ICS), targeted by Leader-Follower Prefetching • Inter-Core Predictable Stride (ICPS), targeted by Distance-Based Cross-Core Prefetching
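The leader-follower idea can be made concrete with a small sketch: when any core (the "leader") misses and fills a translation, that translation is pushed into a small prefetch buffer on every other core, betting on ICS-style sharing. The buffer size, core count, and function names below are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch of leader-follower ICC prefetching. A leader core's
# TLB fill is broadcast into per-core prefetch buffers so that followers
# touching the same page soon after can skip the page-table walk.
from collections import deque

NUM_CORES = 4   # illustrative core count
PB_SIZE = 16    # prefetch-buffer entries per core (illustrative)

prefetch_buffers = [deque(maxlen=PB_SIZE) for _ in range(NUM_CORES)]

def on_tlb_miss(leader, vpage, ppage):
    """After the leader walks the page table, push the new translation
    into every follower's prefetch buffer (not the leader's own)."""
    for core in range(NUM_CORES):
        if core != leader:
            prefetch_buffers[core].append((vpage, ppage))

def probe_prefetch_buffer(core, vpage):
    """On a follower's miss, check its prefetch buffer before walking."""
    for v, p in prefetch_buffers[core]:
        if v == vpage:
            return p
    return None
```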
Shared Last-Level TLBs • Exploit inter-core sharing in parallel programs • Flexible regarding where entries can be placed • Benefit both parallel and sequential workloads • Higher hit rates • Improved CPU performance
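The key structural point is that a single set-associative L2 TLB serves all cores, so an entry filled by one core can satisfy a later miss from any other core. A minimal sketch, assuming LRU replacement and a 4 KB page size (the sizes here are illustrative, not the paper's evaluated configuration):

```python
# Minimal model of a set-associative shared last-level TLB with LRU
# replacement. One structure is shared by all cores, so translations
# are not duplicated per core and any core can hit on any fill.
from collections import OrderedDict

class SharedL2TLB:
    def __init__(self, entries=512, ways=4, page_shift=12):
        self.ways = ways
        self.page_shift = page_shift  # 12 -> 4 KB pages (assumption)
        self.num_sets = entries // ways
        # each set maps vpage -> ppage, ordered oldest-first for LRU
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _set_for(self, vaddr):
        vpage = vaddr >> self.page_shift
        return vpage, self.sets[vpage % self.num_sets]

    def lookup(self, core, vaddr):
        vpage, s = self._set_for(vaddr)
        if vpage in s:
            s.move_to_end(vpage)   # refresh LRU position
            return s[vpage]        # hit, regardless of which core filled it
        return None                # miss -> page walk, then fill()

    def fill(self, vaddr, ppage):
        vpage, s = self._set_for(vaddr)
        if len(s) >= self.ways:
            s.popitem(last=False)  # evict the LRU victim
        s[vpage] = ppage
```

The `core` argument to `lookup` is deliberately unused: unlike private L2 TLBs, placement and hits are independent of which core issues the access.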
Methodology • Parallel applications • A different sequential application pinned to each core • Two distinct evaluation sets
Methodology • Benchmarks
SLL TLBs: Parallel Workload Results • SLL TLBs versus Private L2 TLBs
SLL TLBs: Parallel Workload Results • SLL TLBs versus ICC Prefetching
SLL TLBs: Parallel Workload Results • SLL TLBs versus ICC Prefetching
SLL TLBs: Parallel Workload Results • SLL TLBs with Simple Stride Prefetching
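The stride augmentation referred to above can be sketched simply: on an SLL TLB miss for virtual page v, the demanded translation is filled and translations for a few strided neighbours (v+1, v+2, ...) are prefetched as well, betting on sequential page access. The stride set and the `translate` stand-in below are assumptions for illustration, not the paper's exact parameters.

```python
# Hedged sketch of simple stride prefetching into a shared L2 TLB:
# a miss triggers the demand fill plus fills at fixed page strides.
STRIDES = (1, 2)  # prefetch distances in pages (illustrative)

def translate(vpage):
    """Stand-in for a page-table walk; arbitrary mapping for the demo."""
    return vpage + 0x100

def on_sll_miss(tlb_fill, vpage):
    """Fill the demanded page, then prefetch strided neighbours."""
    tlb_fill(vpage, translate(vpage))
    for d in STRIDES:
        tlb_fill(vpage + d, translate(vpage + d))
```

Here `tlb_fill` would be the shared TLB's fill routine; passing it in keeps the prefetch policy separate from the TLB structure itself.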
SLL TLBs: Parallel Workload Results • SLL TLBs at Higher Core Counts
SLL TLBs: Parallel Workload Results • Performance Analysis
SLL TLBs: Multiprogrammed Workload Results • Multiprogrammed Workloads with One Application Pinned per Core
SLL TLBs: Multiprogrammed Workload Results • Performance Analysis
Conclusion: Benefits • On parallel workloads: • Eliminate 7-79% of L1 TLB misses by exploiting parallel-program inter-core sharing • Outperform conventional per-core private L2 TLBs by an average of 27% • Improve CPI by up to 0.25 • On multiprogrammed sequential workloads: • Improve over private L2 TLBs by an average of 21% • Improve CPI by up to 0.4