
Increasing TLB Reach by Exploiting Clustering in Page Translations



Presentation Transcript


  1. Increasing TLB Reach by Exploiting Clustering in Page Translations. Binh Pham§, Abhishek Bhattacharjee§, Yasuko Eckertǂ, Gabriel H. Lohǂ (§Rutgers University, ǂAMD Research). Binh Pham - Rutgers University

  2. Address Translation Overview. [Figure: address generation produces a VA for the TLB; on x86 the page table walker reads four PTEs from the four-level page tables in memory, which adds to address translation time, and the resulting PA is used for the cache access.] VA: virtual address; PA: physical address; PTE: page table entry.
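
The four-level walk on this slide can be made concrete with a small sketch. It assumes the standard x86-64 layout with 4 KB pages (a 12-bit page offset and four 9-bit table indices); the function and constant names are illustrative and do not come from the presentation.

```python
# Sketch only: split a 48-bit x86-64 virtual address into the four
# page-table indices a walker would use, plus the page offset.
# Assumes 4 KB pages; all names here are hypothetical.
PAGE_SHIFT = 12   # 4 KB pages -> 12 offset bits
INDEX_BITS = 9    # 512 entries per page-table level
LEVELS = 4        # four-level page tables


def split_virtual_address(va):
    """Return ([top-level index, ..., leaf index], page offset)."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = va >> PAGE_SHIFT
    mask = (1 << INDEX_BITS) - 1
    indices = []
    # Walk from the top level down to the leaf level.
    for level in range(LEVELS - 1, -1, -1):
        indices.append((vpn >> (level * INDEX_BITS)) & mask)
    return indices, offset


example_va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
example_indices, example_offset = split_virtual_address(example_va)
```

Each index selects one PTE, so a TLB miss can cost up to four memory references before the data access itself can proceed.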

  3. Address Translation Performance Impact
     • Address translation performance overhead: 10-15%
       • Clark & Emer [Trans. on Comp. Sys. 1985]
       • Talluri & Hill [ASPLOS 1994]
       • Barr, Cox & Rixner [ISCA 2011]
     • Emerging software trends
       • Virtualization: up to 89% overhead [Bhargava et al., ASPLOS 2008]
       • Big-memory workloads: up to 50% overhead [Basu et al., ISCA 2013]
     • Emerging hardware trends
       • LLC capacity to TLB capacity ratios are increasing
       • Manycore/hyperthreading increases TLB and LLC PTE stress

  4. TLB Miss Elimination Approaches
     • Increasing TLB size?
       • Latency
       • Power
     • Increasing TLB reach
       • Using large pages
       • Using “CoLT: Coalesced Large-Reach TLBs” (Pham et al., MICRO 2012)

  5. Contiguous Locality in Page Tables. [Figure: a page table alongside the CoLT TLB; labels: Sequential Groups, Holes, Singletons, OoO.]

  6. Clustered Locality in Page Tables. [Figure: page table clustering; labels: Holes, Clustered Groups, OoO.]
     • Clustered locality can deal with “holes” between PTEs
     • Clustered locality does NOT care about the order of the PTEs

  7. Spatial Locality Characterization. [Figure: distribution of page table PTEs by group length; axis labels: "% PTEs in the same group", "group len".]
     • Clustered locality is abundant and surpasses contiguous locality
     • Clustered locality increases with the size of the clustered spatial region

  8. Outline
     • How do we exploit clustered locality in hardware?
     • How much can our design improve performance?
     • Conclusion

  9. Clustered TLB: Miss and Fill. [Figure: on a miss, the page table walker fetches a 64B cacheline of 8B PTEs from the page table; coalescing logic turns it into a clustered TLB entry with sub-entries.]
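
The fill path on this slide can be sketched as follows. This is a minimal illustration, not the authors' hardware: it assumes a cluster of 8 pages (one 64B cacheline of 8B PTEs), and it assumes that a PTE joins the cluster when its PPN shares every bit above the low three with the PPN of the missing page, which is the reading suggested by the 3-bit sub-entries on the lookup slide. The function name, the dictionary layout, and the use of `None` for an invalid sub-entry are all hypothetical.

```python
# Sketch only: build a clustered TLB entry from one cacheline of PTEs.
# Assumptions (not from the slides): cluster factor 8, None = invalid slot.
CLUSTER_BITS = 3
CLUSTER_SIZE = 1 << CLUSTER_BITS


def coalesce(miss_vpn, line_ppns):
    """line_ppns[i] is the PPN mapped by VPN (base_vpn << 3) | i, or None."""
    if len(line_ppns) != CLUSTER_SIZE:
        raise ValueError("expected one PTE slot per page in the cluster")
    miss_ppn = line_ppns[miss_vpn & (CLUSTER_SIZE - 1)]
    if miss_ppn is None:
        raise ValueError("the missing page itself is not mapped")
    base_ppn = miss_ppn >> CLUSTER_BITS
    sub_entries = [
        # Keep only the low PPN bits of pages that fall in the same
        # physical cluster; order and holes do not matter.
        (ppn & (CLUSTER_SIZE - 1))
        if ppn is not None and (ppn >> CLUSTER_BITS) == base_ppn
        else None
        for ppn in line_ppns
    ]
    return {
        "base_vpn": miss_vpn >> CLUSTER_BITS,
        "base_ppn": base_ppn,
        "sub_entries": sub_entries,
    }


# Hypothetical cacheline: PPN 40 lies outside the cluster, two slots unmapped.
entry = coalesce(0b011, [8, 9, 10, 12, None, 13, 40, None])
```

Because each sub-entry stores only a few low bits instead of a full PPN, one entry can cover up to eight translations at a fraction of the storage of eight conventional entries.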

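  10. Clustered TLB Look Up. [Figure: lookup example. The VPN is split into a base VPN (0) and VPN(2:0) (011). The base VPN is compared (=?) with the entry's Base V, and VPN(2:0) selects a sub-entry holding the PPN lower bits (100); unused sub-entries are marked X. On a sub-entry hit, Base P (01) is concatenated with those bits, giving PPN = 12 (01100).]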
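
The lookup on this slide can be reproduced with a short sketch. It assumes an entry is a dictionary holding a base VPN, a base PPN, and eight sub-entries of low PPN bits, with `None` standing in for an invalid (X) slot; these names and the data layout are illustrative. Only the values for the selected slot (index 011, low bits 100, Base P 01) are taken from the slide, and the other sub-entry values are placeholders.

```python
# Sketch only: look up a VPN in a single clustered TLB entry.
# Assumptions (not from the slides): dict layout, None = invalid slot.
CLUSTER_BITS = 3
CLUSTER_MASK = (1 << CLUSTER_BITS) - 1


def lookup(entry, vpn):
    """Return the PPN on a hit, or None on a miss."""
    if entry["base_vpn"] != (vpn >> CLUSTER_BITS):
        return None                      # base VPN mismatch
    low_bits = entry["sub_entries"][vpn & CLUSTER_MASK]
    if low_bits is None:
        return None                      # sub-entry miss (a hole)
    # Concatenate the base PPN with the stored low PPN bits.
    return (entry["base_ppn"] << CLUSTER_BITS) | low_bits


slide_entry = {
    "base_vpn": 0,
    "base_ppn": 0b01,
    "sub_entries": [0b000, 0b001, 0b010, 0b100, 0b100, 0b101, None, None],
}
ppn = lookup(slide_entry, 0b011)
```

With VPN(2:0) = 011, the selected sub-entry holds 100, so the concatenation 01|100 yields the PPN of 12 shown on the slide.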

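  11. Multi Granular TLB Design. [Figure: behind the L1-TLB, the L2-TLB comprises a Clustered TLB and a C0 TLB, both probed with the VPN; a clustered hit or a C0 hit is a TLB hit and returns the PPN. On a fill, coalescing logic processes the fetched cacheline: if len >= Θ (Y), a Clustered-TLB entry (base VPN, base PPN) is inserted; otherwise (N), the 8B PTE goes to the C0 TLB.]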
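
The insertion decision in this design (len >= Θ) can be sketched in a few lines. The sketch assumes that "len" means the number of valid sub-entries the coalescing logic found, and it uses Θ = 2 as the default because the results slide reports that value as the best performer; the function name and return strings are hypothetical.

```python
# Sketch only: choose which L2 structure receives a new translation.
# Assumption (not from the slides): len = count of valid sub-entries,
# with None marking an invalid slot.
def choose_tlb(sub_entries, theta=2):
    """Return "clustered" if enough PTEs coalesced, otherwise "c0"."""
    length = sum(1 for low_bits in sub_entries if low_bits is not None)
    return "clustered" if length >= theta else "c0"


dense = [0, 1, 2, 4, None, 5, None, None]
lonely = [None, 3, None, None, None, None, None, None]
dense_choice = choose_tlb(dense)
lonely_choice = choose_tlb(lonely)
```

Routing poorly clustered translations to the C0 TLB keeps them from occupying a wide clustered entry whose sub-entries would otherwise sit mostly empty.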

  12. Methodology
     • Workloads: SPEC CPU2006, Cloudsuite, Server
     • Full-system simulation:
       • Baseline: 64-entry L1 ITLB, 64-entry L1 DTLB, 512-entry L2 TLB
       • Roughly equal hardware for baseline, CoLT, and MG-TLB
     [Figure: three configurations, each with an L1-TLB: Baseline (L2-TLB), CoLT (CoLT-TLB), and MG-TLB (MG-TLB).]

  13. Miss Elimination. [Figure: miss elimination chart; one value is annotated -123%.] The best design gives a 7% performance improvement on average.

  14. Insert to Clustered TLB or C0 TLB? Θ = 2 gives the best performance. [Figure: chart with values annotated -51% and -218%; entry labels: C0 Entry, C0 Entry, Cluster3 Entry.]

  15. Prefetching versus Capacity Benefit. [Figure: chart with values annotated -32% and -21%.] MG-TLB combines prefetching and capacity benefits to get the best performance.

  16. Conclusion
     • We observe a more generic type of locality (clustered locality) in page translations
     • Multi-granular TLB
       • Eliminates nearly half of TLB misses
     • Our approach requires no OS modification and provides robust performance gains

  17. Thanks for listening! Questions?
