
Tuning Stencils


Presentation Transcript


  1. Tuning Stencils Kaushik Datta Microsoft Site Visit April 29, 2008

  2. Stencil Code Overview • For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including the point itself) • A stencil code updates every point in a regular grid with a constant-weighted subset of its neighbors (“applying a stencil”) [Figures: example 2D and 3D stencils]

  3. Stencil Applications • Stencils are critical to many scientific applications: • Diffusion, Electromagnetics, Computational Fluid Dynamics • Both uniform and adaptive block-structured meshes • Many types of stencils • 1D, 2D, 3D meshes • Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt, …) • Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes) • Varying boundary conditions (constant vs. periodic)

  4. Naïve Stencil Code

    /* S0 and S1 are the stencil weights; values here are examples. */
    #define S0 1.0
    #define S1 1.0

    void stencil3d(double A[], double B[], int nx, int ny, int nz) {
      /* Sweep every interior point; z is the unit-stride dimension. */
      for (int i = 1; i < nx - 1; i++) {
        for (int j = 1; j < ny - 1; j++) {
          for (int k = 1; k < nz - 1; k++) {
            int c = (i * ny + j) * nz + k;                /* center */
            B[c] = S0 * A[c]
                 + S1 * (A[c - ny*nz] + A[c + ny*nz]      /* front/back  (x) */
                       + A[c - nz]    + A[c + nz]         /* top/bottom  (y) */
                       + A[c - 1]     + A[c + 1]);        /* left/right  (z) */
          }
        }
      }
    }

  5. Our Stencil Code • Executes a 3D, 7-point Jacobi iteration on a 256³ grid • Performs 8 flops (6 adds, 2 mults) per point • Parallelization performed with pthreads • Thread affinity: multithreading, then multicore, then multisocket • Flop:byte ratio: • 0.33 on write-allocate architectures (8 flops against 24 bytes of traffic per point: an 8-byte read, an 8-byte write, and an 8-byte write-allocate on the write miss) • 0.5 ideal (8 flops against 16 bytes once the write-allocate is eliminated)

  6. Cache-Based Architectures [Diagrams: Intel Clovertown, AMD Barcelona, and Sun Victoria Falls system architectures]

  7. Autotuning • Provides a portable and effective method for tuning • Limiting the search space: • Searching the entire space is intractable • Instead, we ordered the optimizations appropriately for a given platform • To find the best parameters for a given optimization, we performed an exhaustive search • Each optimization was applied on top of all previous optimizations • In general, heuristics/models can also be used to prune the search space
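
A minimal C sketch of the exhaustive-search step. The helper run_trial() is hypothetical; a real autotuner would time one run of the stencil kernel built with the candidate parameters:

    #include <stdio.h>

    /* Hypothetical stand-in for one timed run with candidate block
     * dimensions bx x by; a real harness would run the kernel and
     * return wall-clock seconds. */
    static double run_trial(int bx, int by) {
        return 1.0 / (double)(bx * by + 1);   /* placeholder timing model */
    }

    int main(void) {
        double best_t = 1e30;
        int best_bx = 0, best_by = 0;
        /* Exhaustive sweep over power-of-two candidates for one
         * optimization, applied on top of all previous ones. */
        for (int bx = 1; bx <= 256; bx *= 2)
            for (int by = 1; by <= 256; by *= 2) {
                double t = run_trial(bx, by);
                if (t < best_t) { best_t = t; best_bx = bx; best_by = by; }
            }
        printf("best parameters: %d x %d\n", best_bx, best_by);
        return 0;
    }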

  8. Naive Code [Figure: grid with axes x, y, and z (unit-stride)] • Naïve code is a simple, threaded stencil kernel • Domain partitioning performed only in least contiguous dimension • No optimizations or tuning was performed

  9. Naïve [Performance graphs: naive code on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  10. NUMA-Aware [Diagrams: Intel Clovertown, AMD Barcelona, and Sun Victoria Falls memory systems] • Exploited “first-touch” page mapping policy on NUMA architectures • Due to our affinity policy, benefit only seen when using both sockets
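
A minimal sketch of first-touch initialization with pthreads, assuming each thread has already been pinned according to the affinity policy (names and the 8-thread count are illustrative):

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct { double *grid; size_t lo, hi; } chunk_t;

    /* Each pinned thread writes the part of the grid it will later
     * update; under first-touch, the OS maps those pages to memory
     * local to that thread's socket. */
    static void *first_touch(void *p) {
        chunk_t *c = (chunk_t *)p;
        for (size_t i = c->lo; i < c->hi; i++)
            c->grid[i] = 0.0;              /* first write maps the page */
        return NULL;
    }

    int main(void) {
        enum { T = 8 };
        size_t n = 256UL * 256 * 256;
        double *grid = malloc(n * sizeof *grid);
        pthread_t tid[T];
        chunk_t arg[T];
        for (int t = 0; t < T; t++) {
            arg[t] = (chunk_t){ grid, n * t / T, n * (t + 1) / T };
            pthread_create(&tid[t], NULL, first_touch, &arg[t]);
        }
        for (int t = 0; t < T; t++)
            pthread_join(tid[t], NULL);
        free(grid);
        return 0;
    }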

  11. NUMA-Aware [Performance graphs: Naive and +NUMA-Aware on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  12. Loop Unrolling/Reordering • Allows for better use of registers and functional units • Best inner loop chosen by iterating many times over a grid size that fits into the L1 cache (x86 machines) or L2 cache (Victoria Falls), which should eliminate any effects from the memory subsystem • This optimization is independent of the later memory optimizations
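
A sketch of what a 4-way unrolled inner (unit-stride) loop might look like; sy and sx are assumed strides to the y- and x-neighbors, and len is assumed to be a multiple of 4:

    /* Four independent updates per iteration expose more instruction-
     * level parallelism and reuse loaded values (e.g. A[c] and A[c+1]
     * serve as both centers and neighbors). S0/S1 are the weights. */
    static void inner_unrolled(double *B, const double *A, int c, int len,
                               int sy, int sx, double S0, double S1) {
        for (int k = 0; k < len; k += 4, c += 4) {
            B[c]   = S0*A[c]   + S1*(A[c-1] + A[c+1] + A[c-sy]   + A[c+sy]   + A[c-sx]   + A[c+sx]);
            B[c+1] = S0*A[c+1] + S1*(A[c]   + A[c+2] + A[c+1-sy] + A[c+1+sy] + A[c+1-sx] + A[c+1+sx]);
            B[c+2] = S0*A[c+2] + S1*(A[c+1] + A[c+3] + A[c+2-sy] + A[c+2+sy] + A[c+2-sx] + A[c+2+sx]);
            B[c+3] = S0*A[c+3] + S1*(A[c+2] + A[c+4] + A[c+3-sy] + A[c+3+sy] + A[c+3-sx] + A[c+3+sx]);
        }
    }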

  13. Loop Unrolling/Reordering [Performance graphs: Naive, +NUMA-Aware, and +Loop Unrolling/Reordering on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  14. Padding [Figure: grid with axes x, y, and z (unit-stride), showing the padding amount in the unit-stride dimension] • Used to reduce conflict misses and DRAM bank conflicts • Drawback: Larger memory footprint • Performed search to determine best padding amount • Only padded in unit-stride dimension
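
A minimal sketch of padded allocation, assuming row-major storage with z unit-stride; pad is the tuned parameter from the search:

    #include <stdlib.h>

    /* Rows hold nz logical points but are spaced nz + pad doubles
     * apart, so successive rows map to different cache sets and DRAM
     * banks; *row_stride is the stride callers use for y-neighbors. */
    double *alloc_padded(int nx, int ny, int nz, int pad, int *row_stride) {
        *row_stride = nz + pad;
        return malloc((size_t)nx * ny * (nz + pad) * sizeof(double));
    }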

  15. Padding [Performance graphs: cumulative optimizations through +Padding on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  16. Thread/Cache Blocking [Figure: grid with axes x, y, and z (unit-stride), decomposed into thread blocks in x: 4, y: 2, z: 2, with cache blocks in y: 2] • Performed exhaustive search over all possible power-of-two parameter values • Every thread block is the same size and shape • Preserves load balancing • Did NOT cut in contiguous dimension on x86 machines • Avoids interrupting HW prefetchers • Only performed cache blocking in one dimension • Sufficient to fit three read planes and one write plane into cache
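
A sketch of cache blocking in one dimension (y), with the unit-stride z loop left uncut so the hardware prefetchers stay engaged; by is the tuned block size:

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Blocking only y keeps the working set (three read planes plus
     * one write plane) small enough to stay in cache. */
    void stencil3d_blocked(double *B, const double *A,
                           int nx, int ny, int nz, int by,
                           double S0, double S1) {
        for (int jj = 1; jj < ny - 1; jj += by)
            for (int i = 1; i < nx - 1; i++)
                for (int j = jj; j < MIN(jj + by, ny - 1); j++)
                    for (int k = 1; k < nz - 1; k++) {
                        int c = (i * ny + j) * nz + k;
                        B[c] = S0 * A[c]
                             + S1 * (A[c - 1]     + A[c + 1]        /* z */
                                   + A[c - nz]    + A[c + nz]       /* y */
                                   + A[c - ny*nz] + A[c + ny*nz]);  /* x */
                    }
    }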

  17. Thread/Cache Blocking [Performance graphs: cumulative optimizations through +Thread/Cache Blocking on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  18. Software Prefetching • Allows us to hide memory latency • Searched over varying prefetch distances and granularities (e.g. prefetch every register block, plane, or pencil)
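
A sketch of the inner loop with software prefetching, assuming a GCC-style compiler; DIST (the prefetch distance, in elements) is one of the tuned parameters:

    #define DIST 64   /* tuned prefetch distance */

    /* __builtin_prefetch(addr, rw, locality): rw=0 for reads, rw=1 for
     * writes; locality 0 hints the data need not be kept in cache. */
    static void inner_prefetched(double *B, const double *A, int c, int len,
                                 int sy, int sx, double S0, double S1) {
        for (int k = 0; k < len; k++, c++) {
            __builtin_prefetch(&A[c + DIST], 0, 0);   /* upcoming reads */
            __builtin_prefetch(&B[c + DIST], 1, 0);   /* upcoming writes */
            B[c] = S0*A[c] + S1*(A[c-1] + A[c+1] + A[c-sy] + A[c+sy]
                               + A[c-sx] + A[c+sx]);
        }
    }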

  19. Software Prefetching [Performance graphs: cumulative optimizations through +Prefetching on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  20. SIMDization • Requires complete code rewrite to utilize 128-bit SSE registers • Allows single instruction to add/multiply two doubles • Only possible on the x86 machines • Padding performed to achieve proper data alignment (not to avoid conflicts) • Searched over register block sizes and prefetch distances simultaneously
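
A minimal SSE2 sketch of the inner loop, processing two doubles per instruction; it assumes the alignment padding has made B and the stride-offset streams of A 16-byte aligned, while the +/-1 neighbors are loaded unaligned:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    static void inner_sse(double *B, const double *A, int c, int len,
                          int sy, int sx, double S0, double S1) {
        const __m128d s0 = _mm_set1_pd(S0), s1 = _mm_set1_pd(S1);
        for (int k = 0; k < len; k += 2, c += 2) {
            /* Six neighbor loads, one center load, then the weighted update. */
            __m128d sum = _mm_add_pd(_mm_loadu_pd(&A[c - 1]),
                                     _mm_loadu_pd(&A[c + 1]));
            sum = _mm_add_pd(sum, _mm_load_pd(&A[c - sy]));
            sum = _mm_add_pd(sum, _mm_load_pd(&A[c + sy]));
            sum = _mm_add_pd(sum, _mm_load_pd(&A[c - sx]));
            sum = _mm_add_pd(sum, _mm_load_pd(&A[c + sx]));
            _mm_store_pd(&B[c],
                _mm_add_pd(_mm_mul_pd(s0, _mm_load_pd(&A[c])),
                           _mm_mul_pd(s1, sum)));
        }
    }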

  21. SIMDization [Performance graphs: cumulative optimizations through +SIMDization on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  22. Cache Bypass • Writes data directly to write-back buffer • No data load on write miss • Changes stencil kernel’s flop:byte ratio from 1/3 to 1/2 • Reduces memory data traffic by 33% • Still requires the SIMDized code from the previous optimization • Searched over register block sizes and prefetch distances simultaneously
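
A sketch of the cache-bypass variant: the same SSE inner loop as above, with the final cached store replaced by a non-temporal (streaming) store so a write miss no longer allocates a cache line:

    #include <emmintrin.h>

    static void inner_sse_stream(double *B, const double *A, int c, int len,
                                 int sy, int sx, double S0, double S1) {
        const __m128d s0 = _mm_set1_pd(S0), s1 = _mm_set1_pd(S1);
        for (int k = 0; k < len; k += 2, c += 2) {
            __m128d sum = _mm_add_pd(_mm_loadu_pd(&A[c - 1]),
                                     _mm_loadu_pd(&A[c + 1]));
            sum = _mm_add_pd(sum, _mm_add_pd(_mm_load_pd(&A[c - sy]),
                                             _mm_load_pd(&A[c + sy])));
            sum = _mm_add_pd(sum, _mm_add_pd(_mm_load_pd(&A[c - sx]),
                                             _mm_load_pd(&A[c + sx])));
            _mm_stream_pd(&B[c],                  /* write around the cache */
                _mm_add_pd(_mm_mul_pd(s0, _mm_load_pd(&A[c])),
                           _mm_mul_pd(s1, sum)));
        }
        _mm_sfence();   /* make the streamed writes globally visible */
    }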

  23. Cache Bypass [Performance graphs: cumulative optimizations through +Cache Bypass on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  24. Collaborative Threading [Figures: thread-to-block assignment without collaboration (thread blocks in x: 4, y: 2, z: 2; cache blocks in y: 2) and with collaboration (large collaborative thread blocks in y: 4, z: 2; small collaborative thread blocks in y: 2, z: 4)] • Requires another complete code rewrite • CT allows for better L1 cache utilization when switching threads • Only effective on VF due to: • very small L1 cache (8 KB) shared by 8 HW threads • lack of hardware prefetchers (allows us to cut in the contiguous dimension) • Drawback: Parameter space becomes very large

  25. Collaborative Threading [Performance graphs: cumulative optimizations through +Collaborative Threading on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

  26. Autotuning Results [Performance graphs, all optimizations applied: 1.9x better on Intel Clovertown, 5.4x better on AMD Barcelona, and 10.4x better on Sun Victoria Falls]

  27. Architecture Comparison [Charts: double- and single-precision performance and power efficiency across architectures]

  28. Conclusions • Compilers alone fail to fully utilize system resources • Programmers may not even know that the system is being underutilized • Autotuning provides a portable and effective solution • Produces up to a 10.4x improvement over the compiler alone • To make autotuning tractable: • Choose the order of optimizations appropriately for the platform • Prune the search space intelligently for large searches • Power efficiency has become a valuable metric • Local store-based architectures (e.g. Cell and G80) are usually more efficient than cache-based machines

  29. Acknowledgements • Sam Williams for: • writing the Cell stencil code • guiding my work by autotuning SpMV and LBMHD • Vasily Volkov for writing the G80 CUDA code • Kathy Yelick and Jim Demmel for general advice and feedback
