Many-Thread Aware Prefetching Mechanisms for GPGPU Applications


Presentation Transcript


  1. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

  2. Introduction
     Many-Thread Aware Prefetching Mechanisms (MICRO-43)
     [Figure: GPGPU block diagram with labels Core, SIMD Execution, Shared Memory, Memory Request Buffer, DRAM]
     • General-purpose GPUs (GPGPUs) are becoming popular
     • High-performance capability (NVIDIA GeForce GTX 580: 1.5 TFLOPS)
     • GPGPUs have SIMD execution, many cores, and large-scale multi-threading
     • Warp: the basic unit of execution in a core (SIMD unit)

  3. Memory Latency Problem
     [Figure: execution timelines of computation (C), memory requests (M), and instructions dependent on memory (D). With 4 active threads (T0 to T3), the core switches to another thread on each memory request and does not stall.]
     • Tolerating memory latency is critical in CPUs
     • Many techniques have been proposed: caches, prefetching, multi-threading, etc.
     • Is this critical in GPGPUs as well?
     • GPGPUs have employed multi-threading

  4. Memory Problems in GPGPUs
     [Figure: execution timeline with 2 active threads (T0, T1). After switching from T0 to T1, both threads wait on memory and the core stalls for part of the memory latency.]
     • What if there are not enough threads in GPGPUs?
     • Limited thread-level parallelism
       • Application behavior: algorithmically, a lack of parallelism
       • Limited by resource constraints: number of registers per thread, number of threads per block, shared memory usage per block
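
The contrast between 4 active threads (no stall) and 2 active threads (stall) can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions: the function name, the strict compute-then-memory pattern, and the zero-cost thread switch are illustrative choices, not something the slides specify.

```cpp
#include <algorithm>

// Idealized model (an assumption, not taken from the slides): every thread
// alternates compute_cycles of computation with one memory request that
// takes memory_latency cycles, and switching threads costs nothing.
// While one thread waits on memory, the other active threads can run, so
// (active_threads - 1) * compute_cycles of the latency is hidden.
long stall_cycles_per_request(long active_threads, long compute_cycles,
                              long memory_latency) {
    long hidden = (active_threads - 1) * compute_cycles;
    return std::max(0L, memory_latency - hidden);
}
```

With 100 compute cycles per thread and a 300-cycle memory latency, this model gives no stall for 4 active threads and 200 stall cycles per request for 2 active threads. Those stall cycles are the gap that prefetching tries to close when more threads are not available.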

  5. Prefetching in GPGPUs
     • Problem: when multi-threading is not enough, how can we hide memory latency?
     • Other solutions
       • Caching (NVIDIA Fermi)
       • Prefetching (in this talk)
     • Many prefetching mechanisms have been proposed for CPUs: stride, stream, Markov, CDP, GHB, helper thread, etc.
     • Question: will the existing mechanisms work in GPGPUs?

  6. Characteristic #1. Many Threads
     [Figure: prefetching in a CPU (one prefetcher serving 1 or 2 threads) versus prefetching in a GPU (one prefetcher serving many threads)]
     • Problem #1. Training of the prefetcher (scalability)
       • Accesses from many threads are interleaved
     • Problem #2. Amplified negative effects (SIMT)
       • One useless prefetch request per thread adds up to many useless prefetches

  7. Characteristic #1. Many Threads
     [Figure: prefetched blocks ("pref") filling a cache. At prefetch degree 1 the blocks fit in the cache (total < cache size); at prefetch degree 2 they far exceed it (total >> cache size) and cause capacity misses.]
     • Problem #3. Single-Configuration Many-Threads (SCMT)
       • Too many threads are controlled together
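
The SCMT effect is simple arithmetic: because one prefetch degree applies to every thread, a one-step change multiplies across all of them. The helper names, the 64-byte block size, and the thread count used in the note below are assumptions for illustration; only the 16 KB per-core cache size appears later in the deck.

```cpp
#include <cstddef>

// Total bytes prefetched when every thread issues `degree` prefetches of
// one cache block each. All threads share the same configuration (SCMT),
// so the degree cannot be tuned per thread.
std::size_t prefetch_footprint_bytes(std::size_t threads, std::size_t degree,
                                     std::size_t block_bytes) {
    return threads * degree * block_bytes;
}

// True when the prefetched data fits in the cache without capacity misses.
bool fits_in_cache(std::size_t footprint_bytes, std::size_t cache_bytes) {
    return footprint_bytes <= cache_bytes;
}
```

For a hypothetical 192 threads and 64-byte blocks, degree 1 prefetches 12 KB, which fits in a 16 KB cache, while degree 2 prefetches 24 KB, which does not. With more threads the overshoot at degree 2 grows accordingly.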

  8. Characteristic #2. Data-Level Parallelism
     [Figure: a thread in a sequential program issues a prefetch well ahead of the demand access, so the memory latency is hidden (useful). A thread in a parallel program does not have enough opportunity: (1) the thread terminates, or (2) the prefetch is too close to the demand access.]
     • Problem #4. Short thread lifetime
       • A thread in a parallel program is shorter than a thread in a sequential program because parallelization divides the work among threads

  9. Goal
     • Design hardware/software prefetching mechanisms for GPGPU applications
     • Step 1. Prefetcher for many-thread architectures
       • Many-thread aware prefetching mechanisms (Problems #1 and #4)
     • Step 2. Feedback mechanism to reduce negative effects
       • Prefetch throttling (Problems #2 and #3)

  11. Many-Thread Aware Hardware Prefetcher
      [Figure: block diagram with a promotion table (input: PC, ADDR), an IP prefetcher (input: PC, ADDR, TID), a stride prefetcher (input: PC, ADDR, TID), stride promotion, and decision logic that outputs the prefetch address]
      • (Conventional) stride prefetcher
      • Promotion table for the stride prefetcher (Problem #1)
      • Inter-thread prefetcher (Problem #4)
      • Decision logic

  12. Solving Scalability Problem
      [Figure: redundant per-thread table entries replaced by a single entry through promotion]
      • Problem #1. Training of the prefetcher (scalability)
      • Stride promotion
        • Threads have similar (or even identical) access patterns
        • Without promotion, the table is occupied by redundant entries
        • With promotion, storage is managed effectively
        • Training time is reduced by using earlier threads' information
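
A software sketch may make the promotion idea concrete. It is not the hardware design from the talk: the table organization, the `StridePrefetcher` and `access` names, and the promotion threshold `kPromoteAfter` are all hypothetical. The sketch keeps a conventional per-(PC, TID) stride table and promotes a stride to a per-PC entry once enough threads agree on it.

```cpp
#include <stdint.h>
#include <limits>
#include <map>
#include <utility>

// Hypothetical threshold: promote once this many threads agree on a stride.
const int kPromoteAfter = 3;

struct StridePrefetcher {
    struct Entry { uint64_t last_addr; int64_t stride; bool has_stride; };
    typedef std::pair<uint64_t, int> Key;                // (pc, tid)
    std::map<Key, Entry> per_thread;                     // conventional stride table
    std::map<uint64_t, std::pair<int64_t, int> > votes;  // pc -> (stride, agreeing threads)
    std::map<uint64_t, int64_t> promoted;                // promotion table: pc -> stride

    // Returns true and sets *prefetch_addr when a prefetch should be issued.
    bool access(uint64_t pc, int tid, uint64_t addr, uint64_t* prefetch_addr) {
        std::map<uint64_t, int64_t>::iterator hit = promoted.find(pc);
        if (hit != promoted.end()) {  // already trained by earlier threads
            *prefetch_addr = addr + static_cast<uint64_t>(hit->second);
            return true;
        }
        Key key(pc, tid);
        std::map<Key, Entry>::iterator it = per_thread.find(key);
        if (it == per_thread.end()) {  // first access by this thread at this PC
            Entry fresh = {addr, 0, false};
            per_thread[key] = fresh;
            return false;
        }
        int64_t stride = static_cast<int64_t>(addr - it->second.last_addr);
        bool first_stride = !it->second.has_stride;
        bool trained = it->second.has_stride && it->second.stride == stride;
        Entry updated = {addr, stride, true};
        it->second = updated;
        if (first_stride) {  // this thread votes once, on its first observed stride
            std::pair<int64_t, int>& v = votes[pc];
            if (v.second > 0 && v.first == stride) ++v.second;
            else v = std::make_pair(stride, 1);
            if (v.second >= kPromoteAfter) {
                promoted[pc] = stride;
                // Drop the now-redundant per-thread entries for this PC.
                per_thread.erase(
                    per_thread.lower_bound(Key(pc, std::numeric_limits<int>::min())),
                    per_thread.upper_bound(Key(pc, std::numeric_limits<int>::max())));
                trained = true;
            }
        }
        if (trained) *prefetch_addr = addr + static_cast<uint64_t>(stride);
        return trained;
    }
};
```

Once three threads have each shown a stride of 100 at the same PC, the three per-thread entries collapse into one promoted entry, and a fourth thread gets a prefetch on its very first access instead of training from scratch.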

  13. Solving Short Thread Lifetime Problem
      [Figure: timeline showing prefetch, memory latency, and demand access]
      Sequential code (loop!):
          for (ii = 0; ii < 100; ++ii) {
              prefetch(A[ii+1]);   // prefetch for the next iteration
              prefetch(B[ii+1]);
              C[ii] = A[ii] + B[ii];
          }
      Parallel code (no loop: 2 memory accesses, 1 computation):
          // there are 100 threads
          __global__ void KernelFunction(…) {
              int tid = blockDim.x * blockIdx.x + threadIdx.x;
              int varA = aa[tid];
              int varB = bb[tid];
              C[tid] = varA + varB;
          }
      • Problem #4 (short thread lifetime): highly parallelized code often eliminates prefetching opportunities

  14. Inter-Thread Prefetching
          // there are 100 threads
          __global__ void KernelFunction(…) {
              int tid = blockDim.x * blockIdx.x + threadIdx.x;
              int next_tid = tid + 32;
              prefetch(aa[next_tid]);   // prefetch on behalf of thread next_tid
              prefetch(bb[next_tid]);
              int varA = aa[tid];
              int varB = bb[tid];
              C[tid] = varA + varB;
          }
      [Figure: threads T0, T1, T2, … prefetch the data that threads T32, T33, … will access, and those threads in turn prefetch for T64, …]
      • Instead, we can prefetch for other threads
      • Inter-Thread Prefetching (IP)
      • In CUDA, memory addresses = func(thread id) [SIMT]

  15. IP Pattern Detection in Hardware
      Detecting strides across threads (all requests come from PC 0x1a):
        Req 1: Addr 200, TID 1
        Req 2: Addr 1100, TID 10
        Req 3: Addr 400, TID 3
      Address delta divided by TID delta for each pair of requests:
        Delta (Req1-Req2) = 100
        Delta (Req3-Req1) = 100
        Delta (Req3-Req2) = 100
      All three deltas are the same, so we found a pattern.
      Once trained, a new request launches a prefetch request:
        Req 4: PC 0x1a, Addr 2100, TID 1
        Prefetch (addr + stride), with Addr 2100 and Stride 100
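
The delta check on this slide can be sketched in a few lines. This is an illustrative reading of the slide, not the actual hardware logic: the `Request` struct, the function names, and the `distance` parameter (how many threads ahead to prefetch) are assumptions.

```cpp
#include <stdint.h>

struct Request { uint64_t pc; int64_t addr; int tid; };

// Address delta per unit of thread-ID delta between two requests.
// Returns false when both requests come from the same thread or the
// address delta is not a whole multiple of the thread-ID delta.
bool per_thread_stride(const Request& a, const Request& b, int64_t* stride) {
    int64_t tid_delta = static_cast<int64_t>(b.tid) - a.tid;
    int64_t addr_delta = b.addr - a.addr;
    if (tid_delta == 0 || addr_delta % tid_delta != 0) return false;
    *stride = addr_delta / tid_delta;
    return true;
}

// A pattern is found when three requests from the same PC yield the same
// per-thread stride for every pair.
bool detect_ip_stride(const Request& r1, const Request& r2, const Request& r3,
                      int64_t* stride) {
    if (r1.pc != r2.pc || r1.pc != r3.pc) return false;
    int64_t s12 = 0, s13 = 0, s23 = 0;
    if (!per_thread_stride(r1, r2, &s12) ||
        !per_thread_stride(r1, r3, &s13) ||
        !per_thread_stride(r2, r3, &s23)) return false;
    if (s12 != s13 || s12 != s23) return false;
    *stride = s12;
    return true;
}

// Prefetch target once trained. distance = 1 gives the slide's
// "addr + stride"; other distances are a hypothetical generalization.
int64_t ip_prefetch_addr(int64_t addr, int64_t stride, int distance) {
    return addr + stride * distance;
}
```

Feeding in the slide's three requests (addresses 200, 1100, 400 from threads 1, 10, 3) yields a stride of 100 for every pair, and a trained request at address 2100 then prefetches address 2200.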

  16. MT-aware Hardware Prefetcher
      [Figure: the prefetcher block diagram (promotion table, IP prefetcher, stride prefetcher, stride promotion, decision logic, prefetch address) annotated with the access order Cycle 1, Cycle 2, Cycle 3]

  17. Goal
      • Design a hardware/software prefetcher for GPGPU applications
      • Step 1. Prefetcher for many-thread architectures
        • Many-thread aware prefetching mechanisms
      • Step 2. Feedback mechanism to reduce negative effects
        • Prefetch throttling

  18. Outline
      • Motivation
      • Step 1. Many-Thread Aware Prefetching
      • Step 2. Prefetch Throttling
      • Evaluation
      • Conclusion

  19. Prefetch Throttling
      • Recall the problems in GPGPU prefetching
        • Problem #2. Amplified negative effects
        • Problem #3. Single-Configuration Many-Threads
      • Metrics to identify whether prefetching is effective
        • Usefulness: accurate and timely prefetches
        • Harmfulness: inaccurate or too-early prefetches
        • Some late prefetches can be tolerated
      • Similar to Srinath [HPCA 2007]

  20. Throttling Metrics
      • Merged memory requests
        • A new request to the same address as an existing entry
        • Inside a core (in the MSHR)
        • Comparable to late prefetches in CPUs
        • Indicate accuracy (due to massive multi-threading)
        • Less correlated with timeliness
      • Early block eviction from the prefetch cache
        • Due to capacity misses, regardless of accuracy
        • Indicates timeliness and accuracy
      • Periodic updates
        • To cope with runtime behavior
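
One way to picture the two metrics is as per-period counters turned into ratios. The slide names the metrics but does not define them, so the counter names, the choice of denominators, and the `reset` step at the end of each period are assumptions made for this sketch.

```cpp
#include <stdint.h>

// Counters collected during one sampling period. The ratio definitions
// are illustrative assumptions; the slide only names the metrics.
struct ThrottleCounters {
    uint64_t total_requests;     // memory requests issued by the core
    uint64_t merged_requests;    // requests merged into an existing MSHR entry
    uint64_t useful_prefetches;  // prefetched blocks later hit by a demand access
    uint64_t early_evictions;    // prefetched blocks evicted before first use

    double merge_ratio() const {
        return total_requests ? double(merged_requests) / total_requests : 0.0;
    }
    double early_eviction_rate() const {
        return useful_prefetches ? double(early_evictions) / useful_prefetches : 0.0;
    }
    // Called at the end of each period so the metrics track runtime behavior.
    void reset() {
        total_requests = merged_requests = useful_prefetches = early_evictions = 0;
    }
};
```

For example, a period with 1000 requests of which 250 merged gives a merge ratio of 0.25, and 5 early evictions against 100 useful prefetches gives an early eviction rate of 0.05.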

  21. Heuristic for Prefetch Throttling
      * The ideal case (accurate prefetches with perfect timing) has a low early eviction rate and a low merge ratio.
      • Throttle degree
        • Varies from 0 (prefetch all) to 5 (no prefetch)
        • Default: 2
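
A feedback step over the throttle degree might look like the sketch below. Only the 0 to 5 range and the default of 2 come from the slide. The decision rules and all three thresholds are hypothetical, chosen to show the shape of such a heuristic rather than to reproduce the talk's policy.

```cpp
#include <algorithm>

// From the slide: degree 0 prefetches everything, degree 5 prefetches nothing.
const int kPrefetchAll = 0;
const int kNoPrefetch = 5;
const int kDefaultDegree = 2;

// Hypothetical thresholds, chosen only for illustration.
const double kEarlyEvictionHigh = 0.5;
const double kEarlyEvictionLow = 0.1;
const double kMergeRatioHigh = 0.15;

// Illustrative policy (an assumption, not the talk's heuristic):
// back off when prefetched blocks are evicted early, and prefetch more
// when evictions are rare and many requests merge with prefetches.
int next_throttle_degree(int degree, double early_eviction_rate,
                         double merge_ratio) {
    if (early_eviction_rate >= kEarlyEvictionHigh) return kNoPrefetch;
    if (early_eviction_rate >= kEarlyEvictionLow)
        return std::min(degree + 1, kNoPrefetch);
    if (merge_ratio >= kMergeRatioHigh)
        return std::max(degree - 1, kPrefetchAll);
    return degree;  // nothing to correct this period
}
```

Starting from the default degree of 2, a period with an early eviction rate of 0.2 moves the degree to 3 under these assumed thresholds, while a period with almost no early evictions and a merge ratio of 0.3 moves it to 1.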

  22. Evaluation Methodology
      • MacSim simulator
        • A cycle-accurate, in-house simulator
        • A trace-driven simulator (traces from GPUOcelot [Diamos])
      • Baseline
        • 14 cores (8-wide SIMD) at 900 MHz, 16 banks / 8 channels, 1.2 GHz memory frequency, 900 MHz bus, FR-FCFS
        • NVIDIA G80 architecture
      • 14 memory-intensive benchmarks
        • CUDA SDK, Merge, Rodinia, and Parboil
        • Types: stride, mp (massively parallel), uncoal (uncoalesced)
      • Non-memory-intensive benchmarks (in the paper)

  23. Evaluation Methodology (cont'd)
      • Prefetch
        • Stream, Stride, and GHB prefetchers evaluated
        • 16 KB cache per core (results for other sizes are in the paper)
        • Prefetch distance: 1, degree: 1 (the optimal configuration)
      • Results
        • Hardware prefetcher
        • Software prefetcher (in the paper)

  24. MT Hardware Prefetcher Results
      [Figure: performance results, annotated "15% over Stride"]
      • GHB/Stride do not work for mp- and uncoal-type benchmarks
      • IP (Inter-Thread Prefetching) can be effective
      • Stride Promotion improves the performance of a few benchmarks

  25. MT-HWP with Throttling Results
      [Figure: performance results, annotated "15% over Stride + Throttling"]
      • Feedback-driven hardware prefetchers can be effective
      • Throttling eliminates the negative effect (stream)
      * There are more negative cases in the software prefetching mechanism

  26. Outline
      • Motivation
      • Step 1. Many-Thread Aware Prefetching
      • Step 2. Prefetch Throttling
      • Evaluation
      • Conclusion

  27. Conclusion
      • Memory is also an important problem in GPGPUs
      • GPGPU prefetching has four problems: scalability, amplified negative effects, SCMT, and short thread lifetime
      • Goal: design hardware/software prefetchers
        • Step 1. Many-thread aware prefetcher (promotion, IP)
        • Step 2. Prefetch throttling
      • The MT-aware hardware prefetcher shows a 15% performance improvement, and prefetch throttling removes all the negative effects
      • Future work
        • Study other many-thread architectures
        • Other programming models, architectures with caches

  28. THANK YOU!

  29. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

  30. NVIDIA Fermi Result

  31. Different Prefetch Cache Size

  32. Software MT Prefetcher Results
