Increasing TLB Reach by Exploiting Clustering in Page Translations
Binh Pham§, Abhishek Bhattacharjee§, Yasuko Eckertǂ, Gabriel H. Lohǂ
§Rutgers University ǂAMD Research
Binh Pham - Rutgers University
Address Translation Overview
[Figure: address generation produces a VA, which goes to the TLB and the page table walker. The walker traverses the x86 four-level page tables in memory, reading four PTEs in sequence, to produce the PA used for the cache access. A timeline marks the address translation time.]
VA: Virtual Address; PA: Physical Address; PTE: Page Table Entry
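To make the four-level walk concrete, here is a minimal sketch of how a virtual address splits into the four table indices. It assumes standard x86-64 paging with 4 KiB pages and 48-bit virtual addresses (12 offset bits, 9 index bits per level); the function name `split_virtual_address` is hypothetical and not taken from the slides.

```python
# Assumption: x86-64 four-level paging with 4 KiB pages and 48-bit VAs.
PAGE_OFFSET_BITS = 12  # 4 KiB page
INDEX_BITS = 9         # 512 entries per table level
LEVELS = 4


def split_virtual_address(va):
    """Return ([top-level index, ..., leaf index], page offset) for a VA."""
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = va >> PAGE_OFFSET_BITS
    index_mask = (1 << INDEX_BITS) - 1
    indices = []
    # Walk from the top-level table down to the leaf table.
    for level in reversed(range(LEVELS)):
        indices.append((vpn >> (level * INDEX_BITS)) & index_mask)
    return indices, offset
```

Each index selects one PTE, so a full walk on a TLB miss costs up to four dependent memory reads, which is why the slide shows four PTE accesses before the PA is available.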
Address Translation Performance Impact
• Address translation performance overhead: 10-15%
  • Clark & Emer [Trans. on Comp. Sys. 1985]
  • Talluri & Hill [ASPLOS 1994]
  • Barr, Cox & Rixner [ISCA 2011]
• Emerging software trends
  • Virtualization: up to 89% overhead [Bhargava et al., ASPLOS 2008]
  • Big Memory workloads: up to 50% overhead [Basu et al., ISCA 2013]
• Emerging hardware trends
  • LLC capacity to TLB capacity ratios increasing
  • Manycore/hyperthreading increases TLB and LLC PTE stress
TLB Miss Elimination Approaches
• Increasing TLB size?
  • Latency
  • Power
• Increasing TLB reach
  • Using large pages
  • Using “CoLT: Coalesced Large-Reach TLBs” (Pham et al., MICRO 2012)
Contiguous Locality in Page Tables
[Figure: a page table next to a CoLT TLB. The PTEs are labeled as sequential groups, holes, singletons, and out-of-order (OoO) entries.]
Clustered Locality in Page Tables
[Figure: a page table under clustering, with holes, clustered groups, and OoO entries labeled.]
• Clustered locality can deal with “holes” between PTEs
• Clustered locality does NOT care about PTEs’ order
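The difference between the two kinds of locality can be illustrated with a small sketch. It assumes a cluster is an aligned group of 8 virtual pages whose physical pages fall in the same aligned group of 8 physical pages (consistent with the VPN(2:0) indexing on the look-up slide), and it treats contiguity as VPN+1 mapping to PPN+1. The function names and the example mappings are hypothetical.

```python
from collections import defaultdict


def clustered_groups(page_table, cluster_bits=3):
    """Group VPNs whose upper VPN bits AND upper PPN bits both match."""
    groups = defaultdict(list)
    for vpn, ppn in sorted(page_table.items()):
        groups[(vpn >> cluster_bits, ppn >> cluster_bits)].append(vpn)
    return dict(groups)


def contiguous_runs(page_table):
    """Group VPNs into runs where consecutive VPNs map to consecutive PPNs."""
    runs = []
    for vpn in sorted(page_table):
        ppn = page_table[vpn]
        if runs and runs[-1][-1] == vpn - 1 and page_table[vpn - 1] == ppn - 1:
            runs[-1].append(vpn)
        else:
            runs.append([vpn])
    return runs


# Hypothetical mappings: VPN 2 is unmapped (a hole) and the PPNs are out of order.
example = {0: 13, 1: 8, 3: 12, 4: 9}

print(clustered_groups(example))  # {(0, 1): [0, 1, 3, 4]}
print(contiguous_runs(example))   # [[0], [1], [3], [4]]
```

In this example the four translations form one clustered group despite the hole and the shuffled order, while a contiguity-based scheme sees only singletons.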
Spatial Locality Characterization
[Figure: PTE distribution for a page table, showing the % of PTEs in the same group against group length.]
• Clustered locality is abundant and surpasses contiguous locality
• Clustered locality increases with clustered spatial region size
Outline
• How do we exploit clustered locality in hardware?
• How much can our design improve performance?
• Conclusion
Clustered TLB: Miss and Fill
[Figure: on a miss, the page table walker fetches a 64B cache line of 8B PTEs from the page table. Coalescing logic combines the PTEs into a clustered TLB entry made up of sub-entries.]
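The fill path might look roughly like the sketch below. It assumes the 64B cache line is modeled as a list of 8 PPNs (`None` for an invalid PTE; real PTEs also carry permission and status bits), and that a PTE is coalesced when its upper PPN bits match those of the PTE that missed. The `coalesce` function and the dictionary layout are hypothetical, not the authors' hardware design.

```python
CLUSTER_BITS = 3
PTES_PER_LINE = 8  # 64 B cache line / 8 B per PTE


def coalesce(line_ppns, miss_vpn):
    """Build a clustered entry from the cache line holding the missing PTE.

    line_ppns: 8 PPNs (or None for invalid PTEs) for an aligned group of 8 VPNs.
    """
    assert len(line_ppns) == PTES_PER_LINE
    low_mask = PTES_PER_LINE - 1
    requested_ppn = line_ppns[miss_vpn & low_mask]
    if requested_ppn is None:
        raise KeyError("no valid translation for the requested VPN")
    base_ppn = requested_ppn >> CLUSTER_BITS
    # Keep only the low PPN bits of PTEs that share the requested PTE's base PPN.
    sub_entries = [
        (ppn & low_mask) if ppn is not None and (ppn >> CLUSTER_BITS) == base_ppn else None
        for ppn in line_ppns
    ]
    return {
        "base_vpn": miss_vpn >> CLUSTER_BITS,
        "base_ppn": base_ppn,
        "sub_entries": sub_entries,
    }
```

Because the neighboring PTEs arrive in the same cache line as the requested one, coalescing them needs no extra memory accesses.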
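Clustered TLB Look Up
[Figure: the VPN is split into a base VPN and the index bits VPN(2:0). The base VPN is compared (=?) against the base VPN stored in a clustered TLB entry, and VPN(2:0) selects one of the sub-entries, which hold PPN lower bits; X marks an invalid sub-entry. On a sub-entry hit, the base PPN and the PPN lower bits are concatenated. Example: base VPN 0 with VPN(2:0) = 011 matches an entry with base VPN 0 and base PPN 01; the selected sub-entry holds 100, so PPN = 01100 = 12.]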
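The look-up can be sketched in a few lines. This is an illustrative software model, not the hardware design: entries are modeled as dictionaries, the search is a linear scan rather than a set-associative probe, and `clustered_lookup` is a hypothetical name. Only sub-entry 3 of the example entry comes from the slide; the remaining sub-entries are left invalid here.

```python
CLUSTER_BITS = 3


def clustered_lookup(entries, vpn):
    """Return the PPN for vpn, or None on a miss."""
    base_vpn = vpn >> CLUSTER_BITS
    index = vpn & ((1 << CLUSTER_BITS) - 1)  # VPN(2:0)
    for entry in entries:
        if entry["base_vpn"] != base_vpn:
            continue  # base VPN compare (=?) failed
        low_bits = entry["sub_entries"][index]
        if low_bits is None:
            continue  # sub-entry miss (an X on the slide)
        # Concatenate the base PPN with the PPN lower bits.
        return (entry["base_ppn"] << CLUSTER_BITS) | low_bits
    return None


# Slide example: base VPN 0, base PPN 0b01, sub-entry 3 holds 0b100.
tlb = [{
    "base_vpn": 0,
    "base_ppn": 0b01,
    "sub_entries": [None, None, None, 0b100, None, None, None, None],
}]
print(clustered_lookup(tlb, 0b0011))  # 12
```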
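Multi Granular TLB Design
[Figure: a VPN is looked up in the L1-TLB and then in the L2-TLB, which consists of a clustered TLB and a C0 TLB; a clustered hit or a C0 hit yields the PPN. On a fill, coalescing logic processes the cache line of 8B PTEs, and the check len >= Θ (Y/N) selects between a clustered-TLB entry (base VPN, base PPN) and a C0 TLB entry.]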
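The insertion decision can be sketched as follows. This reads the "Y" branch of the len >= Θ check as inserting into the clustered TLB and the "N" branch as inserting a single translation into the C0 TLB, which is an interpretation of the figure rather than something the slides state outright. The default Θ = 2 is the value a later slide reports as performing best. `fill_mg_tlb`, the list and dict containers, and the absence of capacity limits and replacement are all simplifying assumptions.

```python
CLUSTER_BITS = 3
THETA = 2  # threshold reported as best on the "Insert to Clustered TLB or C0 TLB?" slide


def fill_mg_tlb(clustered_tlb, c0_tlb, entry, miss_vpn, theta=THETA):
    """Insert a coalesced entry into the clustered TLB or the C0 TLB.

    entry: dict with "base_vpn", "base_ppn", and 8 "sub_entries" (None = invalid).
    Returns the name of the structure that was filled.
    """
    length = sum(low is not None for low in entry["sub_entries"])
    if length >= theta:
        clustered_tlb.append(entry)  # enough translations to justify a clustered entry
        return "clustered"
    # Otherwise store only the requested translation as a conventional entry.
    index = miss_vpn & ((1 << CLUSTER_BITS) - 1)
    c0_tlb[miss_vpn] = (entry["base_ppn"] << CLUSTER_BITS) | entry["sub_entries"][index]
    return "c0"
```

Under this reading, the threshold keeps clustered entries from being spent on translations with no nearby neighbors.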
Methodology
• Workloads: SPEC CPU2006, Cloudsuite, Server
• Full System Simulation:
  • Baseline: 64-entry L1 ITLB, 64-entry L1 DTLB, 512-entry L2 TLB
  • Roughly equal hardware for baseline, CoLT, and MG-TLB
[Figure: three configurations. Baseline: L1-TLB with L2-TLB. CoLT: L1-TLB with CoLT-TLB. MG-TLB: L1-TLB with MG-TLB.]
Miss Elimination
[Figure: miss elimination results; chart annotation: -123%.]
Best design gives 7% performance improvement on average
Insert to Clustered TLB or C0 TLB?
[Figure: results for different values of Θ; chart annotations: -51%, -218%. Labels: C0 Entry, C0 Entry, Cluster3 Entry.]
Θ = 2 gives best performance
Prefetching versus Capacity Benefit
[Figure: chart annotations: -32%, -21%.]
MG-TLB combines prefetching and capacity to get the best performance
Conclusion
• We observe a more general type of locality (clustered locality) in page translations
• Multi-granular TLB
  • Eliminates nearly half of TLB misses
• Our approach requires no OS modification and provides robust performance gains
Thanks for listening! Questions?