Many-Thread Aware Prefetching Mechanisms for GPGPU Applications Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc In the proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010 Paper presentation by Sankalp Shivaprakash
Motivation • Hide memory latency through many-thread-aware prefetching schemes: per-warp training and stride promotion, inter-thread prefetching, and adaptive throttling • Propose software and hardware prefetching mechanisms for a GPGPU architecture • Scalable to a large number of threads • Robust: feedback and throttling mechanisms avoid degraded performance
Memory Latency Hiding Techniques • Multithreading • Thread-level and warp-level context switching • Utilization of the cache memory hierarchy • Using the L1 and L2 caches rather than going to global (DRAM) memory on every access • Prefetching • Helps when thread-level parallelism is insufficient • Memory request merging [Figure: interleaved execution of Thread1, Thread2, Thread3 hides memory latency]
Prefetching – Parallel Architectures • Why prefetch: consider Warp1 and Warp2, each with three instructions (Add, Sub, Load) • Without prefetching: after Load1 for Warp1 issues, the core sits idle waiting for Load2's data for Warp2 • With prefetching: Prefetch1 fetches the data for Load2 and Prefetch2 the data for Load3, so other warps keep executing while loads are in flight [Figure: execution timelines without and with prefetching]
Prefetching (Contd.) • Software Prefetching • Prefetching into registers • Prefetching into the cache • Causes cache congestion if prefetches are not controlled and accurate • The cache can be polluted, evicting useful data
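Prefetching into registers can be sketched as a simple software pattern: the load for the next iteration is issued before the computation on the current value, so the memory latency overlaps with useful work. A minimal sketch (the function name and the summation workload are invented for illustration, not taken from the paper):

```python
def sum_with_register_prefetch(a):
    """Software prefetching into a 'register': fetch element i+1
    before computing on element i, so on real hardware the memory
    latency would overlap with the add. Workload is illustrative."""
    if not a:
        return 0
    total = 0
    nxt = a[0]                  # prefetch the first element
    for i in range(len(a)):
        cur = nxt               # this value was fetched early
        if i + 1 < len(a):
            nxt = a[i + 1]      # issue the load for the next iteration
        total += cur            # compute overlaps the outstanding load
    return total
```

In Python the overlap is only notional; on a GPU the same pattern is written by hoisting a global-memory load into a register above the compute that consumes the previously fetched value.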
Prefetching (Contd.) • Hardware Prefetching • Stream prefetcher • Monitors the direction of accesses in a memory region • Once a constant access direction is detected, launches prefetches in that direction • Stride prefetcher • Tracks the address difference (delta) between two successive accesses • Launches prefetch requests using the delta once a constant difference is detected (e.g., accesses at 0, 1000, 2000 give δ = 1000) • GHB (Global History Buffer) prefetcher • Stores miss addresses in an n-entry FIFO table (the GHB) • Each entry links to an earlier miss, which allows detecting stream, stride, and irregular repeating address patterns • Aggressiveness characterizes how far ahead of the demand stream a prefetcher runs
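A stride prefetcher like the one described above can be modeled in a few lines: track the delta between consecutive addresses and, once it repeats often enough, issue prefetches ahead of the stream. The `train` and `degree` parameters are illustrative assumptions, not the paper's configuration:

```python
def stride_prefetches(addresses, train=2, degree=1):
    """Return the prefetch addresses a simple stride prefetcher would
    issue for the given demand-access stream: after the same delta is
    observed `train` times in a row, prefetch `degree` strides ahead."""
    prefetches = []
    last = delta = None
    confidence = 0
    for addr in addresses:
        if last is not None:
            d = addr - last
            confidence = confidence + 1 if d == delta else 0
            delta = d
            if confidence >= train:
                prefetches += [addr + delta * k for k in range(1, degree + 1)]
        last = addr
    return prefetches
```

For the slide's example stream 0, 1000, 2000, 3000 the detected delta is 1000 and the first prefetch goes to address 4000.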
Many-Thread Aware Prefetching: MT-SWP • Conventional stride prefetching • Inter-thread prefetching (IP)
Many-Thread Aware Prefetching: MT-HWP • Scalable versions of the traditional training policies for PC-based stride prefetchers • Per-warp training • Strong stride behavior exists within a warp • Stride information trained per warp is stored in a Per-Warp Stride (PWS) table
Many-Thread Aware Prefetching: MT-HWP • Stride promotion • Since the stride pattern is often the same across all warps for a given PC, each PWS entry is monitored for three accesses • If the same stride is found, the entry is promoted to the Global Stride (GS) table; otherwise it is retained in the PWS • Inter-thread prefetching • Monitors the stride pattern across threads at the same PC over three memory accesses • If the stride is constant, the stride information is stored in the IP table
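The PWS, GS, and IP structures described above might be modeled roughly as below. The table layout, the three-access training threshold, and the method names are simplified assumptions for illustration, not the paper's hardware design:

```python
class MTHWPrefetcher:
    """Sketch of MT-HWP training state: a Per-Warp Stride (PWS) table,
    a Global Stride (GS) table fed by promotion, and an Inter-thread
    Prefetching (IP) table trained on strides across threads at the
    same PC. Thresholds and lookup order are simplified assumptions."""
    TRAIN = 3   # consistent strides required before promotion / IP use

    def __init__(self):
        self.pws = {}   # (pc, warp) -> [last_addr, stride, count]
        self.gs = {}    # pc -> stride promoted from the PWS
        self.ip = {}    # pc -> stride across consecutive threads

    def train_warp(self, pc, warp, addr):
        e = self.pws.get((pc, warp))
        if e is None:
            self.pws[(pc, warp)] = [addr, None, 0]
            return
        d = addr - e[0]
        e[2] = e[2] + 1 if d == e[1] else 1
        e[0], e[1] = addr, d
        if e[2] >= self.TRAIN:
            self.gs[pc] = d          # stride promotion to the GS table

    def train_inter_thread(self, pc, thread_addrs):
        # thread_addrs: addresses issued by consecutive threads at this PC
        deltas = {b - a for a, b in zip(thread_addrs, thread_addrs[1:])}
        if len(deltas) == 1 and len(thread_addrs) > self.TRAIN:
            self.ip[pc] = deltas.pop()

    def predict(self, pc, warp, addr):
        if pc in self.gs:            # GS hit preferred over IP
            return addr + self.gs[pc]
        if pc in self.ip:
            return addr + self.ip[pc]
        e = self.pws.get((pc, warp))
        return addr + e[1] if e and e[2] >= 1 else None
```

A promoted GS entry serves every warp at that PC, which is what makes the scheme scale past a per-thread table.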
Many-Thread Aware Prefetching: MT-HWP • Implementation • When there are hits in both the GS and IP tables, GS is given preference because • Strides within a warp are more common than strides across warps • The GS entry has been trained over a longer period
Useful vs. Harmful Prefetching • MTAML – Minimum Tolerable Average Memory Latency • The minimum average number of cycles per memory request that does not lead to stalls • MTAML_pref – the corresponding bound when prefetch requests are included in the memory request stream
Useful vs. Harmful Prefetching • Compare MTAML with the measured average memory latency (AVG latency) • Case 1: AVG latency < MTAML and AVG latency (PREF) < MTAML_pref – latency is tolerable with or without prefetching • Case 2: AVG latency > MTAML – prefetching is beneficial provided AVG latency (PREF) is less than MTAML_pref • Case 3: prefetching may turn out either useful or harmful • The measured AVG latency (PREF) ignores successively prefetched memory operations • Contention, and hence delay, grows as the number of warps increases
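The three-way comparison above can be written as a small decision sketch; the case labels and this simplified rule are assumptions for illustration, and the paper's feedback mechanism is richer than a single check:

```python
def classify_prefetching(avg_lat, avg_lat_pref, mtaml, mtaml_pref):
    """Classify the effect of prefetching by comparing measured average
    memory latencies against the tolerable bounds MTAML / MTAML_pref."""
    if avg_lat < mtaml and avg_lat_pref < mtaml_pref:
        return "neutral"              # case 1: latency tolerable either way
    if avg_lat > mtaml and avg_lat_pref < mtaml_pref:
        return "beneficial"           # case 2: prefetching hides the excess
    return "potentially harmful"      # case 3: needs throttling feedback
```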
Useful vs. Harmful Prefetching • Harmful prefetch requests can be caused by: • Queuing delays • DRAM row-buffer conflicts • Wasted off-chip bandwidth due to early eviction • Wasted off-chip bandwidth due to inaccurate prefetches
Metrics for Adaptive Prefetch Throttling • Early eviction rate • Merge ratio • Throttling avoids: • Consuming system bandwidth • Delaying demand requests • Occupying the cache with unnecessary prefetches • Prefetch requests that get merged may arrive late, but this is compensated by context switching across warps
Metrics for Adaptive Prefetch Throttling • Monitoring of Early Eviction and Merge Ratio
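Combining the two metrics, an adaptive throttling check might look like the sketch below. The 20% early-eviction threshold and the simple keep/throttle-down policy are illustrative assumptions; the paper maps the metrics onto a range of prefetcher aggressiveness levels:

```python
def throttle(early_evicted, merged, prefetches_issued, evict_hi=0.2):
    """Compute the two feedback metrics and decide whether to reduce
    prefetcher aggressiveness. A high early-eviction rate means
    prefetched lines are evicted before use (cache pollution); a high
    merge ratio means prefetches arrive late but are still useful,
    since context switching across warps hides the remaining latency."""
    eer = early_evicted / max(prefetches_issued, 1)   # early eviction rate
    mr = merged / max(prefetches_issued, 1)           # merge ratio
    decision = "throttle-down" if eer > evict_hi else "keep"
    return {"early_eviction_rate": eer, "merge_ratio": mr,
            "decision": decision}
```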
Methodology • The baseline processor modeled is NVIDIA's GeForce 8800 GT • Applications for the simulator are generated using GPUOcelot, a binary-translation framework for PTX
Conclusion • The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than completely curbing it • The metrics chosen go beyond accuracy alone: they avoid cache pollution from early eviction and exploit memory-request merging • Scalability and robustness were given importance • The study does not consider complex cache memory hierarchies • The overhead of prefetching is not clearly substantiated