Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
ISCA 2019
Debashis Ganguly, Ziyu Zhang, Jun Yang, Rami Melhem

Why do we need Hardware Prefetchers?
[Timeline figure. Legend: Kernel Execution, Far fault, Data Migration. Rows: Stream 0, Stream 1.]
User Directed Prefetch
• What and when to prefetch?
• How do I synchronize between streams?
Hardware Prefetch
• Takes away the programming effort
• Follows spatio-temporal locality of past accesses
• Overlaps kernel execution and data migration

Different Hardware Prefetchers
Random Prefetcher (Rp)
• Randomly prefetches a 4KB page from within the 2MB large page to which the current faulting page belongs
Sequential-local 64KB Prefetcher (SLp) [Variation of Sequential and Locality-aware]
• Prefetches the 64KB basic block to which the current faulting page belongs
[Figures: a row of 2MB large pages for Rp; a row of 2MB large pages divided into 64KB basic blocks for SLp.]

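Both policies on this slide come down to address arithmetic on the faulting address. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the faulting page itself is excluded from the prefetch set because it is migrated by the fault, and page validity is ignored (all three are assumptions).

```python
import random

PAGE = 4 * 1024                # 4KB page
BASIC_BLOCK = 64 * 1024        # 64KB basic block
LARGE_PAGE = 2 * 1024 * 1024   # 2MB large page


def align_down(addr, size):
    """Round addr down to a multiple of size."""
    return addr - (addr % size)


def slp_prefetch(fault_addr):
    """SLp sketch: the other 4KB pages of the 64KB basic block holding the fault."""
    base = align_down(fault_addr, BASIC_BLOCK)
    fault_page = align_down(fault_addr, PAGE)
    return [p for p in range(base, base + BASIC_BLOCK, PAGE) if p != fault_page]


def rp_prefetch(fault_addr, rng=random):
    """Rp sketch: one random 4KB page from the 2MB large page holding the fault.

    Assumption: the faulting page is not a candidate; validity is not checked.
    """
    base = align_down(fault_addr, LARGE_PAGE)
    fault_page = align_down(fault_addr, PAGE)
    candidates = [p for p in range(base, base + LARGE_PAGE, PAGE) if p != fault_page]
    return rng.choice(candidates)
```

For a fault at 0x12345, `slp_prefetch` returns the 15 remaining pages of the block starting at 0x10000, which corresponds to the 4K + 60K split drawn on the TBNp slides.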
Tree-based Neighborhood Prefetcher (TBNp)
[Figure, seven animation frames: a binary tree over eight 64KB basic blocks. Legend: Invalid, Page Access, Far fault, Prefetch. Every node carries a percentage, 0% in the first frame. Each far fault on a 4KB page is drawn together with the other 60KB of its basic block, and numbered labels 1 to 5 mark the order of events. The root reads 12.5%, 25%, 37.5%, 50%, and 62.5% in successive frames, and 100% in the last frame once the remaining blocks have been prefetched.]

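The percentages in these frames are consistent with a simple rule: after a far fault fills a 64KB block, any ancestor node whose valid fraction rises above 50% has its remaining blocks prefetched. The sketch below replays that reading on eight blocks. The function name, the flat list standing in for the tree, and the strict "above 50%" threshold are assumptions for illustration, not the authors' code.

```python
def tbn_prefetch(valid, block, threshold=0.5):
    """Handle a far fault in `block` and return the extra blocks to prefetch.

    `valid` is a list of booleans, one per 64KB basic block; its length is
    assumed to be a power of two so the blocks form a full binary tree.
    """
    valid[block] = True  # the fault migrates the whole 64KB block (4K + 60K)
    prefetched = []
    size = 2  # number of leaves under the current ancestor
    while size <= len(valid):
        lo = (block // size) * size
        hi = lo + size
        # Assumed rule: prefetch the rest of a subtree once it is more than
        # `threshold` valid.
        if sum(valid[lo:hi]) / size > threshold:
            for b in range(lo, hi):
                if not valid[b]:
                    valid[b] = True
                    prefetched.append(b)
        size *= 2
    return prefetched


# Replay of the frames: faults in blocks 1, 3, 0, then 5.
blocks = [False] * 8
trace = [tbn_prefetch(blocks, b) for b in (1, 3, 0, 5)]
# trace == [[], [], [2], [4, 6, 7]]
```

The third fault lifts the left half to 75%, which pulls in block 2. The fourth fault lifts the root to 62.5%, which pulls in the three remaining blocks, matching the final frame.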
When working set fits in device memory
• The larger the transfer size, the higher the bandwidth
• Reduced number of far faults
TBNp delivers a 1-2 orders of magnitude performance improvement over no prefetching

What happens under device memory oversubscription?
• Disable hardware prefetchers
  • To avoid displacement of heavily referenced pages
• Pre-eviction to maintain free-page buffer
  • To avoid write-back latency
Early disabling of prefetcher by pre-eviction
~100x performance degradation with just 110% oversubscription

Interplay between Prefetcher and Naïve Eviction Policies
[Figure, animated: 2MB large pages divided into 64KB basic blocks, evicted under LRU at 4KB granularity and under LRU at 2MB granularity.]
LRU 4KB
• No contiguous free space to prefetch
• Renders prefetcher ineffective
LRU 2MB
• Displaces heavily referenced pages
• Causes large thrashing

Prefetcher Inspired Eviction Policies
Random Eviction (Re)
• Randomly evicts a 4KB page from the entire virtual address space
Sequential-local 64KB Pre-eviction (SLe)
• Pre-evicts the 64KB basic block corresponding to the 4KB LRU candidate
[Figures: a row of 2MB large pages for Re; a row of 2MB large pages divided into 64KB basic blocks for SLe.]

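The two eviction policies on this slide can be sketched against an LRU list of resident 4KB pages. This is a minimal illustration under stated assumptions: pages are identified by page number, the LRU list is an `OrderedDict` with the least recently used page first, Re samples only from resident pages, and the function names are hypothetical.

```python
import random
from collections import OrderedDict

PAGES_PER_BLOCK = 16  # 64KB basic block / 4KB page


def sle_evict(lru):
    """SLe sketch: evict the LRU page and pre-evict its whole 64KB block.

    `lru` maps page number -> None, least recently used first.
    Returns the evicted page numbers in LRU order.
    """
    victim = next(iter(lru))  # 4KB LRU candidate
    block = victim // PAGES_PER_BLOCK
    evicted = [p for p in lru if p // PAGES_PER_BLOCK == block]
    for p in evicted:
        del lru[p]
    return evicted


def re_evict(lru, rng=random):
    """Re sketch: evict one resident 4KB page chosen uniformly at random."""
    victim = rng.choice(list(lru))
    del lru[victim]
    return victim


# Pages 16, 17, 18 share one 64KB block; pages 3 and 40 live in other blocks.
resident = OrderedDict.fromkeys([17, 3, 18, 40, 16])
freed = sle_evict(resident)
# freed == [17, 18, 16]; pages 3 and 40 stay resident
```

Evicting the whole block frees a contiguous 64KB region, the same unit SLp prefetches.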
Tree-based Neighborhood Pre-eviction (TBNe)
[Figure, seven animation frames: a binary tree over eight 64KB basic blocks. Legend: Valid, LRU Candidate, LRU Eviction, Pre-eviction. Every node carries a percentage, 100% in the first frame. Each LRU eviction of a 4KB page is drawn together with the other 60KB of its basic block, and numbered labels 1 to 6 mark the order of events. The root reads 87.5%, 75%, 62.5%, 50%, and 37.5% in successive frames, and 0% in the last frame once the remaining blocks have been pre-evicted.]

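The frames read as the mirror image of TBNp: once an eviction empties a 64KB block, any ancestor node whose valid fraction falls below 50% has its remaining blocks pre-evicted. The sketch below replays that reading. The function name, the flat list standing in for the tree, and the strict "below 50%" threshold are assumptions for illustration, not the authors' code.

```python
def tbn_pre_evict(valid, block, threshold=0.5):
    """Evict `block` (the LRU candidate's block) and return blocks to pre-evict.

    `valid` is a list of booleans, one per 64KB basic block; its length is
    assumed to be a power of two so the blocks form a full binary tree.
    """
    valid[block] = False  # the eviction removes the whole 64KB block (4K + 60K)
    pre_evicted = []
    size = 2  # number of leaves under the current ancestor
    while size <= len(valid):
        lo = (block // size) * size
        hi = lo + size
        # Assumed rule: evict the rest of a subtree once it is less than
        # `threshold` valid.
        if sum(valid[lo:hi]) / size < threshold:
            for b in range(lo, hi):
                if valid[b]:
                    valid[b] = False
                    pre_evicted.append(b)
        size *= 2
    return pre_evicted


# Replay of the frames: LRU candidates fall in blocks 1, 3, 5, then 0.
blocks = [True] * 8
trace = [tbn_pre_evict(blocks, b) for b in (1, 3, 5, 0)]
# trace == [[], [], [], [2, 4, 6, 7]]
```

The fourth eviction drops the left half to 25%, so block 2 goes first (step 5 in the figure). That leaves the root at 37.5%, so the remaining three blocks follow (step 6). The sketch returns both steps from a single call.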
Combining Pre-evictions (4KB Granularity) and Prefetchers
• No additional coordination required
• Respecting each other pays off
An order of magnitude performance improvement with the TBNp and TBNe combination

Combining Pre-evictions (2MB Granularity) and Prefetchers
• Dynamic eviction granularity
• Reduced thrashing
Average 18.5% performance improvement with TBNe

Conclusion
• Leverages the framework for the hardware prefetcher
  • No additional implementation or performance overhead
• Builds on generic concepts
  • Vendor agnostic
• Opportunistically decides on dynamic eviction granularity
  • Navigates between two extremes: 4KB and 2MB
  • Overcomes the limitations of static granularity
• Micro-benchmarks, UVM benchmarks, and simulator
  • Public for future collaboration
  • https://github.com/DebashisGanguly/gpgpu-sim_UVMSmart

Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
Debashis Ganguly, Ph.D. Student
• debashis@cs.pitt.edu
• https://people.cs.pitt.edu/~debashis/