Optimizing GPGPU Workloads with the CAWA Approach

This presentation covers the CAWA approach for accelerating GPGPU workloads by identifying critical (lagging) warps and coordinating warp scheduling and cache prioritization around them. The goal is to reduce the warp execution-time disparity caused by branch divergence, memory-system interference, and workload imbalance, thereby improving throughput and resource utilization. Two related schedulers, Cache Conscious Wavefront Scheduling (CCWS) and Divergence-Aware Warp Scheduling (DAWS), are also discussed.

Presentation Transcript


  1. CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads. S.-Y. Lee, A. Arunkumar, and C.-J. Wu, ISCA 2015

  2. Goal • Reduce warp divergence and hence increase throughput • The key is the identification of critical (lagging) warps • Manage resources and scheduling decisions to speed up the execution of critical warps, thereby reducing divergence

  3. Review: Resource Limits on Occupancy • Kernel distributor and thread block control: limit the #thread blocks per SM • Warp schedulers and warp contexts: limit the #threads per SM • Register file: limits the #threads • L1/shared memory: limits the #thread blocks and introduces locality effects • SM = Streaming Multiprocessor, SP = Stream Processor (each SM contains multiple SPs; SMs share DRAM)

  4. Evolution of Warps in a TB • Coupled lifetimes of warps in a TB • Start at the same time • Synchronization barriers • Kernel exit (implicit synchronization barrier) • Once some warps complete, the remaining warps execute in a region where latency hiding is less effective. Figure from P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"

  5. Warp Execution Time Disparity • Causes: branch divergence, interference in the memory system, scheduling policy, and workload imbalance. Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

  6. Workload Imbalance • Example from the figure: one warp executes ~10,000 dynamic instructions while another executes only ~10 • Imbalance exists even without control divergence
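
To make the bullet concrete, here is an illustrative CUDA kernel (not from the paper; the kernel and the work_per_warp array are assumptions): every lane of a warp reads the same per-warp work count, so there is no intra-warp branch divergence, yet one warp may execute ~10,000 loop iterations while another executes only ~10.

    // Illustrative sketch: work_per_warp holds one trip count per warp
    // (e.g., 10 for some warps, 10000 for others).
    __global__ void imbalanced_kernel(const int *work_per_warp, const float *in,
                                      float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;                      // n = number of output elements

        int warpId = tid / warpSize;               // all 32 lanes agree on this value
        int iters  = work_per_warp[warpId];        // uniform within the warp

        float acc = 0.0f;
        for (int i = 0; i < iters; ++i)            // trip count varies across warps only
            acc += in[(tid + i) % n];
        out[tid] = acc;
    }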

  7. Branch Divergence • Warp-to-warp variation in dynamic instruction count (even w/o branch divergence) • Intra-warp branch divergence • Example: traversal over constant-node-degree graphs

  8. Branch Divergence • The extent of serialization over a CFG traversal varies across warps

  9. Impact of the Memory System • Data in the cache footprints of executing warps could have been re-used, but the re-use arrives too slowly • Scheduling gaps are greater than the re-use distance. Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

  10. Impact of Warp Scheduler • Amplifies the critical warp effect. Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

  11. Criticality Predictor Logic • Two paths of a branch may execute n vs. m instructions; non-divergent branches can still generate large differences in dynamic instruction counts across warps (e.g., m >> n) • Update the CPL counter on a branch: estimate the remaining dynamic instruction count • Update the CPL counter on instruction commit
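
A minimal host-side sketch (written as plain C++ in the same CUDA code base) of how per-warp CPL bookkeeping could be organized, assuming a simplified model in which a branch adds an estimated instruction count for the chosen path and each commit retires one instruction; the exact counters and update rules in CAWA differ in detail.

    // One CPL record per warp; a hypothetical model of the two update events
    // named on the slide (branch and instruction commit).
    struct CplCounter {
        long long estimated_remaining_insts = 0;   // raised at branches
        long long committed_insts           = 0;   // raised at commit
        long long stall_cycles              = 0;   // accumulated memory stall cycles

        // Branch resolved: add an estimate of the instructions on the chosen path
        // (per-branch estimates are an assumption of this sketch).
        void on_branch(long long path_inst_estimate) {
            estimated_remaining_insts += path_inst_estimate;
        }

        // One instruction committed by this warp.
        void on_commit() {
            ++committed_insts;
            if (estimated_remaining_insts > 0) --estimated_remaining_insts;
        }
    };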

  12. Warp Criticality Problem • Completed (idle) warp contexts wait for the last warp in the TB • Registers allocated to completed warps in the TB: temporal underutilization • Available (unallocated) registers: spatial underutilization • Goal: manage resources and schedules around critical warps

  13. CPL Calculation • Inputs: the average instruction-count disparity between warps (e.g., the n vs. m instruction paths), the average warp CPI, and the inter-instruction memory stall cycles
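
Read together, the slide's terms suggest a criticality estimate of roughly the following form; this is a reconstruction consistent with the slide's labels rather than the paper's equation quoted verbatim:

    \mathrm{CPL}(w) \;\approx\; n_{\mathrm{Inst}}(w) \times \overline{\mathrm{CPI}}_{w} \;+\; n_{\mathrm{Stall}}(w)

where n_Inst(w) is the estimated instruction-count disparity of warp w (the n vs. m paths above), the CPI term is the average warp CPI, and n_Stall(w) is the accumulated inter-instruction memory stall cycles.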

  14. Scheduling Policy • Select a warp based on criticality • Execute it until no more instructions are available • A form of greedy-then-oldest (GTO) scheduling • Critical warps get higher priority and a larger time slice
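
A minimal sketch of this greedy, criticality-ordered selection (the WarpState structure and ready flags are hypothetical stand-ins for pipeline state, not the paper's hardware):

    #include <vector>

    struct WarpState {
        int       id;               // index into the warps vector
        long long cpl;              // criticality estimate (see slide 13)
        bool      has_ready_inst;   // can this warp issue this cycle?
    };

    // Keep issuing from the previously selected warp while it can still issue;
    // otherwise pick the most critical ready warp.
    int select_warp(const std::vector<WarpState>& warps, int last_issued) {
        if (last_issued >= 0 && warps[last_issued].has_ready_inst)
            return last_issued;                       // greedy: stay on the same warp

        int best = -1;
        for (const WarpState& w : warps)
            if (w.has_ready_inst && (best < 0 || w.cpl > warps[best].cpl))
                best = w.id;                          // most critical ready warp wins
        return best;                                  // -1 if nothing can issue
    }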

  15. Behavior of Critical Warp References. Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

  16. Criticality-Aware Cache Prioritization • Prediction: critical warp • Prediction: re-reference interval • Both are used to manage the cache footprint. Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
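
One way to picture the prioritization decision is the small sketch below; the priority levels and inputs are illustrative assumptions rather than CAWA's exact CACP mechanism:

    // Lines fetched by critical warps that are predicted to be re-referenced are
    // inserted with high protection; predicted-dead or non-critical data is not.
    enum class InsertPriority { Protect, Normal, EvictSoon };

    InsertPriority cache_insert_priority(bool fetched_by_critical_warp,
                                         bool predicted_rereference) {
        if (fetched_by_critical_warp && predicted_rereference)
            return InsertPriority::Protect;      // keep critical, reusable data resident
        if (predicted_rereference)
            return InsertPriority::Normal;
        return InsertPriority::EvictSoon;        // streaming / non-critical data
    }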

  17. Integration into GPU Microarchitecture • Criticality prediction • Cache management. Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015

  18. Performance • Prediction accuracy for critical warps is computed periodically • Some of the observed behavior is due in part to low miss rates

  19. Summary • Warp divergence leads to some lagging warps (critical warps) • Expose the performance impact of critical warps: throughput reduction • Coordinate the scheduler and cache management to reduce warp divergence

  20. Cache Conscious Wavefront Scheduling. T. Rogers, M. O'Connor, and T. Aamodt, MICRO 2012

  21. Goal • Understand the relationship between schedulers (warp/wavefront) and locality behaviors • Distinguish between inter-wavefront and intra-wavefront locality • Design a scheduler to match #scheduled wavefronts with the L1 cache size • Working set of the wavefronts fits in the cache • Emphasis on intra-wavefront locality

  22. Reference Locality • Scheduler decisions can affect locality • Need schedulers that are not oblivious to locality • Locality categories: intra-thread, inter-thread, intra-wavefront, inter-wavefront; intra-wavefront locality offers the greatest gain. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

  23. Key Idea • Ready wavefronts vs. executing wavefronts: should a new wavefront be issued? • Impact of issuing new wavefronts on the intra-wavefront locality of executing wavefronts • Intra-wavefront footprints in the cache • When will a new wavefront cause thrashing?

  24. Concurrency vs. Cache Behavior • With a round-robin scheduler, best performance can occur at less than maximum concurrency • Adding more wavefronts to hide latency is traded off against creating more long-latency memory references due to thrashing. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

  25. Additions to the GPU Microarchitecture • Keep track of locality behavior on a per-wavefront basis (feedback from the D-cache) • Use this feedback at the issue stage to control the issue of wavefronts (pipeline stages shown: I-Fetch, I-Buffer, Issue, Decode, RF, scalar pipelines, D-Cache, Writeback)

  26. Intra-Wavefront Locality • Intra-thread locality leads to intra-wavefront locality • The right #wavefronts is determined by their reference footprints and the cache size • The scheduler attempts to find the right #wavefronts • Example: each scalar thread traverses edges stored in successive memory locations; one thread vs. many (sketched below)
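
An illustrative kernel in this style (the array names and CSR-like layout are assumptions, not code from the paper): each scalar thread walks a contiguous edge list, so its own references exhibit strong spatial locality, and that locality is easily thrashed if too many wavefronts are scheduled at once.

    // Each thread sums the weights of its node's neighbors; successive edges[e]
    // accesses by one thread fall on the same cache lines (intra-thread locality).
    __global__ void sum_neighbor_weights(const int *row_offsets,  // CSR-style offsets
                                         const int *edges,        // neighbor indices
                                         const float *weight,     // per-node weights
                                         float *out, int num_nodes)
    {
        int node = blockIdx.x * blockDim.x + threadIdx.x;
        if (node >= num_nodes) return;

        float sum = 0.0f;
        for (int e = row_offsets[node]; e < row_offsets[node + 1]; ++e)
            sum += weight[edges[e]];
        out[node] = sum;
    }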

  27. Static Wavefront Limiting • Limit the #wavefronts based on the working set and cache size (based on profiling?) • Current schemes allocate #wavefronts based on resource consumption, not effectiveness of utilization • Seek to shorten the re-reference interval (a rough sizing example follows)
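
A rough, purely illustrative sizing calculation (the 2 KB per-wavefront footprint and 32 KB L1 are assumed numbers, not values from the paper):

    \#\text{wavefronts} \;\approx\; \frac{\text{L1 capacity}}{\text{per-wavefront footprint}} \;=\; \frac{32\,\text{KB}}{2\,\text{KB}} \;=\; 16

so a static limit near 16 wavefronts keeps the combined footprint resident, even if occupancy limits alone would allow far more wavefronts to be scheduled.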

  28. Cache Conscious Wavefront Scheduling • Basic steps: keep track of the lost locality of each wavefront; control the number of wavefronts that can be issued • Approach: maintain a list of wavefronts sorted by lost-locality score; stop issuing the wavefronts with the least lost locality (a selection sketch follows)
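
A minimal sketch of the issue-cutoff step (the data structures are placeholders, and the cutoff here is a simple count, whereas CCWS derives it from the cumulative scores):

    #include <algorithm>
    #include <vector>

    struct Wavefront {
        int id;
        int lost_locality_score;   // updated from victim-tag-array hits (slide 31)
    };

    // Returns the ids of the wavefronts permitted to issue this interval:
    // the highest scorers keep issuing, the least-lost-locality ones are throttled.
    std::vector<int> wavefronts_allowed_to_issue(std::vector<Wavefront> wfs,
                                                 int max_allowed) {
        std::sort(wfs.begin(), wfs.end(),
                  [](const Wavefront& a, const Wavefront& b) {
                      return a.lost_locality_score > b.lost_locality_score;
                  });
        std::vector<int> allowed;
        for (int i = 0; i < (int)wfs.size() && i < max_allowed; ++i)
            allowed.push_back(wfs[i].id);
        return allowed;
    }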

  29. Keeping Track of Locality • Keep track of the wavefront associated with each cache line • A victim structure, indexed by wavefront ID (WID), records evicted tags: victim hits indicate "lost locality" • Victim hits update the wavefront's lost-locality score. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

  30. Updating Estimates of Locality • Wavefronts are kept sorted by lost-locality score; note which are allowed to issue • A wavefront's lost-locality score is updated on a victim (VTA) hit • The estimates change as the #wavefronts changes. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

  31. Estimating the Feedback Gain • VTA hit: increase the wavefront's score (and prevent issue of the lowest-scoring wavefronts) • No VTA hits: let the scores decay • Scale the gain based on the percentage of the cutoff. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
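
A sketch of that feedback with made-up constants; the actual gain, decay, and cutoff scaling come from the paper's tuning, and the way the scaling interacts with the cutoff below is only an interpretation of the slide's wording.

    #include <vector>

    const int kBaseGain = 32;   // assumed constant, not the paper's tuned value
    const int kDecay    = 1;    // assumed per-interval decay

    // Wavefront 'wid' missed in L1 but hit in its victim tag array: lost locality.
    void on_vta_hit(std::vector<int>& score, int wid, double pct_of_cutoff) {
        score[wid] += (int)(kBaseGain * (1.0 + pct_of_cutoff));  // scaled gain
    }

    // An interval passed with no VTA hits anywhere: let the scores decay.
    void on_quiet_interval(std::vector<int>& score) {
        for (int& s : score)
            s = (s > kDecay) ? s - kDecay : 0;
    }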

  32. Schedulers • Loose Round Robin (LRR) • Greedy-Then-Oldest (GTO): execute one wavefront until it stalls, then switch to the oldest • 2LVL-GTO: two-level scheduler with GTO instead of RR • Best SWL (static wavefront limiting) • CCWS

  33. Performance. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

  34. Some General Observations • Performance is largely determined by the emphasis on the oldest wavefronts and by the distribution of references across cache lines (i.e., the extent of the lack of coalescing) • GTO works well for this reason: it prioritizes older wavefronts • LRR touches too much data (with too little reuse) for the working set to fit in the cache

  35. Locality Behaviors. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012

  36. Summary • Dynamic tracking of relationship between wavefronts and working sets in the cache • Modify scheduling decisions to minimize interference in the cache • Tunables: need profile information to create stable operation of the feedback control

  37. Divergence Aware Warp Scheduling. T. Rogers, M. O'Connor, and T. Aamodt, MICRO 2013

  38. Goal • Understand the relationship between schedulers (warp/wavefront) and control & memory locality behaviors • Distinguish between inter-wavefront and intra-wavefront locality • Design a scheduler to match the #scheduled wavefronts with the L1 cache size, so that the working set of the wavefronts fits in the cache • Emphasis on intra-wavefront locality • Couple in the effects of control flow divergence • Differs from CCWS in being proactive • Takes a deeper look at what happens inside loops

  39. Key Idea • Manage the relationship between control divergence, memory divergence, and scheduling. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

  40. Key Idea (2) • Couple control divergence and memory divergence • The former indicates a reduced cache footprint requirement (fewer active threads need less cache capacity) • "Learn" eviction patterns

  41. Key Idea (2) • Example: warp 0 and warp 1 execute a loop such as while (i < C[tid+1]) with intra-thread locality • Warp 0 fills the cache with 4 references, so delay warp 1 • When a divergent branch in warp 0 leaves available room in the cache, schedule warp 1 • Use warp 0's behavior to predict the interference due to warp 1. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

  42. Goal: Simpler Portable Version • Each thread computes a row • Structured similarly to a multithreaded CPU version • This is the desirable way to write the code (a sketch follows). Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
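
The motivating example in DAWS is sparse matrix-vector multiply; a thread-per-row CSR kernel in this simple, portable style might look like the following sketch (array names are illustrative):

    // Thread-per-row CSR SpMV: each scalar thread walks one row, much like the
    // inner loop of a multithreaded CPU version. Per-row trip counts differ, so
    // this form exposes both branch and memory divergence.
    __global__ void spmv_csr_scalar(const int *row_ptr, const int *col_idx,
                                    const float *val, const float *x,
                                    float *y, int num_rows)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows) return;

        float dot = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += val[j] * x[col_idx[j]];
        y[row] = dot;
    }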

  43. Goal: GPU-Optimized Version • Each warp handles a row • Renumber threads modulo the warp size to get a row-element index for each thread • The warp traverses the whole row, with each thread handling a few row elements (column entries) • The partial sums computed by each thread in a row must then be reduced (a sketch follows). Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
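
For contrast, a warp-per-row ("vector") CSR kernel along the lines the slide describes, with the per-lane partial sums reduced by warp shuffles; again an illustrative sketch, and it assumes blockDim.x is a multiple of the warp size so every lane of a warp works on the same row.

    __global__ void spmv_csr_vector(const int *row_ptr, const int *col_idx,
                                    const float *val, const float *x,
                                    float *y, int num_rows)
    {
        int lane = threadIdx.x % warpSize;                             // lane id within the warp
        int row  = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // one row per warp
        if (row >= num_rows) return;

        float partial = 0.0f;
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += warpSize)
            partial += val[j] * x[col_idx[j]];

        // Reduce the per-lane partial sums within the warp.
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            partial += __shfl_down_sync(0xffffffff, partial, offset);

        if (lane == 0) y[row] = partial;                               // lane 0 writes the result
    }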

  44. Optimizing the Vector Version • What do we need to know? TB & warp sizes, grid mappings • Tuning TB sizes as a function of machine size • Orchestrating the partial-product summation • The goal is simplicity (productivity) with no sacrifice in performance

  45. Goal • Make the performance of the simpler portable version equivalent to that of the GPU-optimized version. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

  46. Additions to the GPU Microarchitecture • Control the issue of warps at the issue stage • Implements PDOM (post-dominator reconvergence) for control divergence (pipeline stages shown: I-Fetch, I-Buffer/Scoreboard, Issue, Decode, RF, scalar pipelines, memory coalescer, D-Cache, Writeback)

  47. Observation: Loops • The bulk of the accesses in a loop come from a few static load instructions • The bulk of the locality in (these) applications is intra-loop

  48. Distribution of Locality • Hint: can we keep data from the last iteration? • The bulk of the locality comes from a few static loads in loops • Find temporal reuse. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

  49. A Solution • Prediction mechanisms for locality across iterations of a loop • Schedule such that data fetched in one iteration is still present at the next iteration • Combine with control flow divergence (how much of the footprint needs to be in the cache?). Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

  50. Classification of Dynamic Loads • Group static loads into equivalence classes that reference the same cache line • Identify these groups by a repetition ID • The prediction for each load can be made by the compiler or by hardware • Loads are classified by control flow divergence as diverged, somewhat diverged, or converged. Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
