This article provides an overview of stencil codes, their applications, and optimization techniques for improving performance on different architectures. It includes examples of optimizations such as loop unrolling, cache blocking, software prefetching, SIMDization, and cache bypassing. The optimizations are ordered appropriately for specific platforms, and an exhaustive search is performed to find the best parameters for each optimization. The article also discusses the benefits of NUMA-awareness and padding.
Tuning Stencils Kaushik Datta Microsoft Site Visit April 29, 2008
Stencil Code Overview • For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself) • A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”) [Figures: example 2D and 3D stencils]
Stencil Applications • Stencils are critical to many scientific applications: • Diffusion, Electromagnetics, Computational Fluid Dynamics • Both uniform and adaptive block-structured meshes • Many types of stencils • 1D, 2D, 3D meshes • Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt, …) • Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes) • Varying boundary conditions (constant vs. periodic)
Naïve Stencil Code

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  /* S0, S1: constant stencil weights, defined elsewhere */
  for (int i = 1; i < nx - 1; i++) {              /* x dimension */
    for (int j = 1; j < ny - 1; j++) {            /* y dimension */
      for (int k = 1; k < nz - 1; k++) {          /* z (unit-stride) dimension */
        int c = (i * ny + j) * nz + k;            /* linearized center index */
        B[c] = S0 * A[c]
             + S1 * (A[c - 1]       + A[c + 1]         /* front/back  (z) */
                   + A[c - nz]      + A[c + nz]        /* left/right  (y) */
                   + A[c - ny * nz] + A[c + ny * nz]); /* top/bottom  (x) */
      }
    }
  }
}
Our Stencil Code • Executes a 3D, 7-point, Jacobi iteration on a 256³ grid • Performs 8 flops (6 adds, 2 mults) per point • Parallelization performed with pthreads • Thread affinity: multithreading, then multicore, then multisocket • Flop:Byte Ratio • 0.33 (write-allocate architectures) • 0.5 (ideal)
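The two ratios follow from counting per-point memory traffic in 8-byte doubles (a sketch of the arithmetic, assuming only compulsory misses):

\[
\frac{8\ \text{flops}}{8\,\text{B read }A + 8\,\text{B write }B} = 0.5 \;(\text{ideal}),
\qquad
\frac{8\ \text{flops}}{8 + 8 + 8\,\text{B write-allocate fill of }B} = \frac{8}{24} \approx 0.33
\]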
Cache-Based Architectures [Diagrams: Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Autotuning • Provides a portable and effective method for tuning • Limiting the search space: • Searching the entire space is intractable • Instead, we ordered the optimizations appropriately for a given platform • To find best parameters for a given optimization, performed exhaustive search • Each optimization was applied on top of all previous optimizations • In general, can also use heuristics/models to prune search space
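As a minimal sketch of one such exhaustive sweep (the helper names run_stencil and wall_time are hypothetical, not from the original code):

extern double wall_time(void);                           /* assumed timer helper */
extern void run_stencil(const double *A, double *B,
                        int nx, int ny, int nz, int by); /* hypothetical kernel */

/* Exhaustively try every power-of-two cache-block size in y and keep
 * the fastest; each later optimization is tuned the same way, on top
 * of the settings already chosen.                                    */
int tune_cache_block(const double *A, double *B, int nx, int ny, int nz) {
    int best_by = 1;
    double best_time = 1e30;
    for (int by = 1; by <= ny; by *= 2) {        /* power-of-two candidates */
        double t0 = wall_time();
        run_stencil(A, B, nx, ny, nz, by);       /* kernel with this blocking */
        double elapsed = wall_time() - t0;
        if (elapsed < best_time) { best_time = elapsed; best_by = by; }
    }
    return best_by;                              /* best parameter found */
}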
Naive Code [Figure: grid axes — x, y, and unit-stride z] • Naïve code is a simple, threaded stencil kernel • Domain partitioning performed only in the least contiguous dimension • No optimizations or tuning was performed
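A minimal sketch of that partitioning, assuming x is the least contiguous dimension; stencil3d_range is a hypothetical variant of the kernel that sweeps only x indices [x_lo, x_hi), and remainder handling for non-divisible nx is omitted:

#include <pthread.h>

extern void stencil3d_range(double *A, double *B, int nx, int ny, int nz,
                            int x_lo, int x_hi);  /* hypothetical kernel slice */

typedef struct { double *A, *B; int nx, ny, nz, x_lo, x_hi; } work_t;

void *worker(void *arg) {
    work_t *w = (work_t *)arg;
    /* each thread updates only its own slab of x planes */
    stencil3d_range(w->A, w->B, w->nx, w->ny, w->nz, w->x_lo, w->x_hi);
    return NULL;
}

void run_threads(double *A, double *B, int nx, int ny, int nz, int nthreads) {
    pthread_t tid[64];
    work_t args[64];
    int slab = nx / nthreads;                     /* equal x-slabs per thread */
    for (int t = 0; t < nthreads; t++) {
        args[t] = (work_t){A, B, nx, ny, nz, t * slab, (t + 1) * slab};
        pthread_create(&tid[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}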
Naïve [Chart: naïve baseline performance on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
NUMA-Aware [Diagrams: Intel Clovertown, AMD Barcelona, and Sun Victoria Falls] • Exploited “first-touch” page mapping policy on NUMA architectures • Due to our affinity policy, benefit only seen when using both sockets
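A minimal first-touch sketch (assumes a Linux-style first-touch placement policy; work_t and the slab bounds are the same illustrative structures as in the partitioning sketch above):

/* Each thread initializes the same x-slab it will later compute on;
 * under first-touch, those pages are then mapped to the memory node
 * local to that thread's socket.                                    */
void *init_worker(void *arg) {
    work_t *w = (work_t *)arg;
    for (int i = w->x_lo; i < w->x_hi; i++)
        for (int j = 0; j < w->ny; j++)
            for (int k = 0; k < w->nz; k++)
                w->A[(i * w->ny + j) * w->nz + k] = 0.0;
    return NULL;
}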
NUMA-Aware [Chart: performance after adding NUMA-awareness to the naïve code, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Loop Unrolling/Reordering • Allows for better use of registers and functional units • Best inner loop chosen by iterating many times over a grid size that fits into L1 cache (x86 machines) or L2 cache (VF) • should eliminate any effects from memory subsystem • This optimization is independent of later memory optimizations
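For illustration, the unit-stride loop unrolled by two (a sketch only; the tuner searched many unroll depths and statement orderings, and this version assumes an even interior extent, with no remainder loop shown):

for (int k = 1; k < nz - 1; k += 2) {
    int c = (i * ny + j) * nz + k;
    /* two independent updates give the compiler more freedom to
     * schedule registers and functional units                    */
    B[c]     = S0 * A[c]     + S1 * (A[c - 1] + A[c + 1]
                                   + A[c - nz] + A[c + nz]
                                   + A[c - ny * nz] + A[c + ny * nz]);
    B[c + 1] = S0 * A[c + 1] + S1 * (A[c]     + A[c + 2]
                                   + A[c + 1 - nz] + A[c + 1 + nz]
                                   + A[c + 1 - ny * nz] + A[c + 1 + ny * nz]);
}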
Loop Unrolling/Reordering [Chart: performance after adding loop unrolling/reordering to the previous optimizations, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Padding [Figure: grid padded in the unit-stride (z) dimension by the padding amount] • Used to reduce conflict misses and DRAM bank conflicts • Drawback: larger memory footprint • Performed search to determine best padding amount • Only padded in unit-stride dimension
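A sketch of the padded allocation (PAD stands in for the searched parameter, and the IDX macro name is illustrative):

#include <stdlib.h>

/* pad only the unit-stride (z) dimension; the pad elements are
 * allocated but never computed on                               */
int nz_pad = nz + PAD;                   /* PAD chosen by the search */
double *A = malloc((size_t)nx * ny * nz_pad * sizeof(double));
double *B = malloc((size_t)nx * ny * nz_pad * sizeof(double));

/* all indexing then uses the padded leading dimension */
#define IDX(i, j, k) (((i) * (ny) + (j)) * nz_pad + (k))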
Padding [Chart: performance after adding padding to the previous optimizations, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Thread/Cache Blocking [Figure: example decomposition — thread blocks in x: 4, y: 2, z: 2; cache blocks in y: 2] • Performed exhaustive search over all possible power-of-two parameter values • Every thread block is the same size and shape • Preserves load balancing • Did NOT cut in contiguous dimension on x86 machines • Avoids interrupting HW prefetchers • Only performed cache blocking in one dimension • Sufficient to fit three read planes and one write plane into cache
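A sketch of the one-dimensional cache blocking within a thread's slab (BY stands in for the tuned block size; boundary handling is simplified):

/* block in y only: for each y-block, three read planes of A and one
 * write plane of B fit in cache as i sweeps                         */
for (int jj = 1; jj < ny - 1; jj += BY) {
    int j_hi = (jj + BY < ny - 1) ? jj + BY : ny - 1;
    for (int i = 1; i < nx - 1; i++)
        for (int j = jj; j < j_hi; j++)
            for (int k = 1; k < nz - 1; k++) {
                int c = (i * ny + j) * nz + k;
                B[c] = S0 * A[c] + S1 * (A[c - 1] + A[c + 1]
                                       + A[c - nz] + A[c + nz]
                                       + A[c - ny * nz] + A[c + ny * nz]);
            }
}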
Thread/Cache Blocking [Chart: performance after adding thread/cache blocking to the previous optimizations, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Software Prefetching • Allows us to hide memory latency • Searched over varying prefetch distances and granularities (e.g. prefetch every register block, plane, or pencil)
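A sketch using GCC's __builtin_prefetch (DIST stands in for the searched prefetch distance):

for (int k = 1; k < nz - 1; k++) {
    int c = (i * ny + j) * nz + k;
    /* request the cache line DIST elements ahead so it arrives
     * from DRAM before the loop reaches it                      */
    __builtin_prefetch(&A[c + DIST], 0, 0);
    B[c] = S0 * A[c] + S1 * (A[c - 1] + A[c + 1]
                           + A[c - nz] + A[c + nz]
                           + A[c - ny * nz] + A[c + ny * nz]);
}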
Software Prefetching [Chart: performance after adding software prefetching to the previous optimizations, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
SIMDization • Requires complete code rewrite to utilize 128-bit SSE registers • Allows single instruction to add/multiply two doubles • Only possible on the x86 machines • Padding performed to achieve proper data alignment (not to avoid conflicts) • Searched over register block sizes and prefetch distances simultaneously
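A minimal SSE2 sketch of the core update for two points at once (assumes the padding has made &A[c] 16-byte aligned and nz and ny*nz even, so the y- and x-neighbor loads stay aligned; only the z-neighbors need unaligned loads):

#include <emmintrin.h>

__m128d s0 = _mm_set1_pd(S0);                          /* broadcast weights */
__m128d s1 = _mm_set1_pd(S1);
for (int k = 1; k < nz - 1; k += 2) {
    int c = (i * ny + j) * nz + k;
    __m128d sum = _mm_add_pd(_mm_loadu_pd(&A[c - 1]),  /* z-1 pair (unaligned) */
                             _mm_loadu_pd(&A[c + 1])); /* z+1 pair (unaligned) */
    sum = _mm_add_pd(sum, _mm_load_pd(&A[c - nz]));       /* y-1 */
    sum = _mm_add_pd(sum, _mm_load_pd(&A[c + nz]));       /* y+1 */
    sum = _mm_add_pd(sum, _mm_load_pd(&A[c - ny * nz]));  /* x-1 */
    sum = _mm_add_pd(sum, _mm_load_pd(&A[c + ny * nz]));  /* x+1 */
    __m128d res = _mm_add_pd(_mm_mul_pd(s0, _mm_load_pd(&A[c])),
                             _mm_mul_pd(s1, sum));
    _mm_store_pd(&B[c], res);                          /* aligned store to B */
}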
SIMDization [Chart: performance after adding SIMDization to the previous optimizations, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Cache Bypass • Writes data directly to write-back buffer • No data load on write miss • Changes stencil kernel’s flop:byte ratio from 1/3 to 1/2 • Reduces memory data traffic by 33% • Still requires the SIMDized code from the previous optimization • Searched over register block sizes and prefetch distances simultaneously
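The bypass replaces the aligned store in the SIMD sketch above with SSE2's non-temporal store (a sketch; the fence is needed before other threads read B):

#include <emmintrin.h>

/* write B through the write-combining buffers, bypassing the cache:
 * no cache line is read (allocated) on the write miss               */
_mm_stream_pd(&B[c], res);

/* after the sweep, make the streamed stores globally visible */
_mm_sfence();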
Cache Bypass [Chart: performance after adding the cache bypass to the previous optimizations, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Collaborative Threading [Figures: thread-to-block assignment without collaboration (thread blocks in x: 4, y: 2, z: 2; cache blocks in y: 2) and with collaboration (large collaborative thread blocks in y: 4, z: 2; small collaborative thread blocks in y: 2, z: 4)] • Requires another complete code rewrite • CT allows for better L1 cache utilization when switching threads • Only effective on VF due to: • very small L1 cache (8 KB) shared by 8 HW threads • lack of hardware prefetchers (allows us to cut in the contiguous dimension) • Drawback: parameter space becomes very large
Collaborative Threading [Chart: performance after adding collaborative threading to the previous optimizations, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]
Autotuning Results [Chart: fully autotuned vs. naïve performance — 1.9x better on Intel Clovertown, 5.4x better on AMD Barcelona, 10.4x better on Sun Victoria Falls]
Architecture Comparison [Charts: double- and single-precision performance and power efficiency across architectures]
Conclusions • Compilers alone fail to fully utilize system resources • Programmers may not even know that the system is being underutilized • Autotuning provides a portable and effective solution • Produces up to a 10.4x improvement over the compiler alone • To make autotuning tractable: • Choose the order of optimizations appropriately for the platform • Prune the search space intelligently for large searches • Power efficiency has become a valuable metric • Local store-based architectures (e.g. Cell and G80) are usually more efficient than cache-based machines
Acknowledgements • Sam Williams for: • writing the Cell stencil code • guiding my work by autotuning SpMV and LBMHD • Vasily Volkov for writing the G80 CUDA code • Kathy Yelick and Jim Demmel for general advice and feedback