Tuning Stencils
Kaushik Datta
Microsoft Site Visit, April 29, 2008
Stencil Code Overview
• For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
• A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)
(Figures: 2D stencil, 3D stencil)
Stencil Applications
• Stencils are critical to many scientific applications:
  • Diffusion, Electromagnetics, Computational Fluid Dynamics
  • Both uniform and adaptive block-structured meshes
• Many types of stencils:
  • 1D, 2D, 3D meshes
  • Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt, …)
  • Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)
  • Varying boundary conditions (constant vs. periodic)
Naïve Stencil Code
void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  for all grid indices in x-dim {
    for all grid indices in y-dim {
      for all grid indices in z-dim {
        B[center] = S0 * A[center] +
                    S1 * (A[top] + A[bottom] + A[left] + A[right] + A[front] + A[back]);
      }
    }
  }
}
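The pseudocode above can be made concrete in several ways. What follows is a minimal sketch, not the code from the talk. It assumes a one-cell ghost layer around the grid, z as the unit-stride dimension, and the weights S0 and S1 passed as arguments (the slide does not say where they are defined). The helper idx is hypothetical.

```c
#include <stddef.h>

/* Hypothetical layout: (nx+2) x (ny+2) x (nz+2) doubles, including a
   one-cell ghost layer; z (index k) is the unit-stride dimension. */
static size_t idx(int i, int j, int k, int ny, int nz) {
    return ((size_t)i * (size_t)(ny + 2) + (size_t)j) * (size_t)(nz + 2) + (size_t)k;
}

/* One Jacobi sweep of the 7-point stencil: reads A, writes B. */
void stencil3d(const double *A, double *B, int nx, int ny, int nz,
               double S0, double S1) {
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++)
            for (int k = 1; k <= nz; k++) {
                size_t c = idx(i, j, k, ny, nz);
                /* 6 adds and 2 multiplies per point */
                B[c] = S0 * A[c]
                     + S1 * (A[idx(i - 1, j, k, ny, nz)] + A[idx(i + 1, j, k, ny, nz)]
                           + A[idx(i, j - 1, k, ny, nz)] + A[idx(i, j + 1, k, ny, nz)]
                           + A[idx(i, j, k - 1, ny, nz)] + A[idx(i, j, k + 1, ny, nz)]);
            }
}
```

Because the sweep is Jacobi-style, a caller would typically swap the A and B pointers between sweeps.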
Our Stencil Code
• Executes a 3D, 7-point, Jacobi iteration on a 256³ grid
• Performs 8 flops (6 adds, 2 mults) per point
• Parallelization performed with pthreads
• Thread affinity: multithreading, then multicore, then multisocket
• Flop:Byte Ratio
  • 0.33 (write-allocate architectures)
  • 0.5 (ideal)
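The two flop:byte ratios can be reproduced by counting compulsory memory traffic per grid point. The slide gives only the results, so the assumptions here are added: doubles are 8 bytes, and neighbor values are reused from cache so that each element of the read grid comes from DRAM once. On a write-allocate architecture, the destination cache line is also read before it is written.

```latex
\begin{align*}
\text{ideal:} \quad
  & \frac{8\ \text{flops}}{8\ \text{B (read)} + 8\ \text{B (write)}}
    = \frac{8}{16} = 0.5 \\
\text{write-allocate:} \quad
  & \frac{8\ \text{flops}}{8\ \text{B (read)} + 8\ \text{B (allocate)} + 8\ \text{B (write)}}
    = \frac{8}{24} \approx 0.33
\end{align*}
```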
Cache-Based Architectures
(Figure panels: Intel Clovertown, AMD Barcelona, Sun Victoria Falls)
Autotuning
• Provides a portable and effective method for tuning
• Limiting the search space:
  • Searching the entire space is intractable
  • Instead, we ordered the optimizations appropriately for a given platform
  • To find the best parameters for a given optimization, we performed an exhaustive search
  • Each optimization was applied on top of all previous optimizations
• In general, can also use heuristics/models to prune the search space
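The search strategy described above (ordered optimizations, an exhaustive search within each one, each applied on top of the earlier winners) can be sketched as a greedy loop. This is an illustration under stated assumptions, not the tuner from the talk. NUM_OPTS, NUM_VALUES, bench_fn, and example_cost are hypothetical names, and example_cost is a synthetic stand-in for actually timing the kernel.

```c
#include <float.h>
#include <stdlib.h>

#define NUM_OPTS   3   /* hypothetical: number of optimizations, in tuning order */
#define NUM_VALUES 4   /* hypothetical: candidate values per optimization */

/* Returns a cost (lower is better) for a full parameter vector.
   A real tuner would run and time the stencil kernel here. */
typedef double (*bench_fn)(const int params[NUM_OPTS]);

/* Synthetic cost for illustration only: minimized at params {4, 2, 8}. */
double example_cost(const int params[NUM_OPTS]) {
    return 1.0 + abs(params[0] - 4) + abs(params[1] - 2) + abs(params[2] - 8);
}

/* Tune one optimization at a time, keeping earlier winners fixed. */
void autotune(bench_fn bench, int candidates[NUM_OPTS][NUM_VALUES],
              int best[NUM_OPTS]) {
    for (int o = 0; o < NUM_OPTS; o++)
        best[o] = candidates[o][0];          /* start from the first candidate */

    for (int o = 0; o < NUM_OPTS; o++) {
        double best_cost = DBL_MAX;
        int best_value = best[o];
        for (int v = 0; v < NUM_VALUES; v++) {   /* exhaustive within this optimization */
            best[o] = candidates[o][v];
            double cost = bench(best);
            if (cost < best_cost) {
                best_cost = cost;
                best_value = candidates[o][v];
            }
        }
        best[o] = best_value;                /* later optimizations build on this choice */
    }
}
```

With these sizes the greedy loop evaluates 3 × 4 = 12 configurations, where a search over the full cross product would need 4³ = 64.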
Naive Code
• Naïve code is a simple, threaded stencil kernel
• Domain partitioning performed only in the least contiguous dimension
• No optimizations or tuning were performed
(Figure: grid with axes x, y, and z, where z is the unit-stride dimension)
Naïve
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive)
NUMA-Aware
• Exploited “first-touch” page mapping policy on NUMA architectures
• Due to our affinity policy, benefit only seen when using both sockets
(Figure panels: Intel Clovertown, AMD Barcelona, Sun Victoria Falls)
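One common way to exploit a first-touch policy is to have each thread initialize the part of the grid it will later update. The sketch below assumes that approach. The names touch_arg, touch_slab, and alloc_first_touch are hypothetical, and thread pinning is omitted because the portable pthreads API does not provide it. In practice the threads would need to be pinned, and the same partition reused in the compute phase, for the placement to matter.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    double *grid;
    size_t begin, end;   /* this thread owns elements [begin, end) */
} touch_arg;

/* Each thread writes its own slab first, so under a first-touch policy the
   backing pages are mapped to memory near the thread that touched them. */
static void *touch_slab(void *p) {
    touch_arg *a = (touch_arg *)p;
    for (size_t n = a->begin; n < a->end; n++)
        a->grid[n] = 0.0;
    return NULL;
}

/* Allocates n doubles and initializes them in parallel; NULL on failure. */
double *alloc_first_touch(size_t n, int nthreads) {
    if (nthreads < 1) return NULL;
    double *grid = malloc(n * sizeof *grid);   /* allocated; the slabs are first written by the threads below */
    pthread_t *tid = malloc((size_t)nthreads * sizeof *tid);
    touch_arg *arg = malloc((size_t)nthreads * sizeof *arg);
    if (!grid || !tid || !arg) { free(grid); free(tid); free(arg); return NULL; }

    for (int t = 0; t < nthreads; t++) {
        arg[t].grid  = grid;
        arg[t].begin = n * (size_t)t / (size_t)nthreads;
        arg[t].end   = n * (size_t)(t + 1) / (size_t)nthreads;
        if (pthread_create(&tid[t], NULL, touch_slab, &arg[t]) != 0) {
            touch_slab(&arg[t]);   /* fallback: touch from the calling thread */
            arg[t].grid = NULL;    /* mark: no thread to join */
        }
    }
    for (int t = 0; t < nthreads; t++)
        if (arg[t].grid) pthread_join(tid[t], NULL);

    free(tid);
    free(arg);
    return grid;
}
```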
NUMA-Aware
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware)
Loop Unrolling/Reordering
• Allows for better use of registers and functional units
• Best inner loop chosen by iterating many times over a grid size that fits into L1 cache (x86 machines) or L2 cache (Victoria Falls)
  • This should eliminate any effects from the memory subsystem
• This optimization is independent of later memory optimizations
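As an illustration of what one candidate inner loop might look like, here is the 7-point sweep unrolled by two in the unit-stride dimension. The unroll factor, the function name stencil3d_unroll2, and the ghost-layer layout are assumptions made for this sketch. The talk searched over many such variants rather than fixing one.

```c
#include <stddef.h>

/* 7-point Jacobi sweep, unrolled by 2 along z (unit stride).
   Layout assumption: (nx+2) x (ny+2) x (nz+2) doubles with ghost cells. */
void stencil3d_unroll2(const double *A, double *B, int nx, int ny, int nz,
                       double S0, double S1) {
    const size_t sy = (size_t)nz + 2;           /* distance between y neighbors */
    const size_t sx = ((size_t)ny + 2) * sy;    /* distance between x neighbors */
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++) {
            size_t c = (size_t)i * sx + (size_t)j * sy + 1;  /* first interior point */
            int k = 1;
            for (; k + 1 <= nz; k += 2, c += 2) {
                /* two independent updates per iteration give the compiler
                   more room to schedule loads and arithmetic */
                double r0 = S0 * A[c]
                          + S1 * (A[c - 1] + A[c + 1] + A[c - sy] + A[c + sy]
                                + A[c - sx] + A[c + sx]);
                double r1 = S0 * A[c + 1]
                          + S1 * (A[c] + A[c + 2] + A[c + 1 - sy] + A[c + 1 + sy]
                                + A[c + 1 - sx] + A[c + 1 + sx]);
                B[c] = r0;
                B[c + 1] = r1;
            }
            if (k <= nz)                         /* leftover point when nz is odd */
                B[c] = S0 * A[c]
                     + S1 * (A[c - 1] + A[c + 1] + A[c - sy] + A[c + sy]
                           + A[c - sx] + A[c + sx]);
        }
}
```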
Loop Unrolling/Reordering
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering)
Padding
• Used to reduce conflict misses and DRAM bank conflicts
• Drawback: Larger memory footprint
• Performed search to determine best padding amount
• Only padded in unit-stride dimension
(Figure: grid with axes x, y, and z, where z is unit-stride, showing the padding amount)
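Padding only the unit-stride dimension amounts to lengthening each z pencil by a tunable number of unused elements, which shifts where successive pencils and planes fall in cache sets and DRAM banks. A minimal sketch, assuming a ghost-layer layout; grid3d, grid_alloc, and grid_idx are hypothetical names.

```c
#include <stdlib.h>

typedef struct {
    double *data;
    int nx, ny, nz;   /* interior points in each dimension */
    size_t sy, sx;    /* strides (in doubles) between y and x neighbors */
} grid3d;

/* Allocates a zeroed grid whose unit-stride pencils carry `pad` extra,
   unused doubles. Returns 0 on success and -1 on failure. */
int grid_alloc(grid3d *g, int nx, int ny, int nz, int pad) {
    if (nx < 1 || ny < 1 || nz < 1 || pad < 0) return -1;
    g->nx = nx;
    g->ny = ny;
    g->nz = nz;
    g->sy = (size_t)nz + 2 + (size_t)pad;    /* padded pencil length */
    g->sx = ((size_t)ny + 2) * g->sy;        /* plane size grows with the padding */
    g->data = calloc(((size_t)nx + 2) * g->sx, sizeof(double));
    return g->data ? 0 : -1;
}

/* Linear index of point (i, j, k); the padding is never addressed. */
static size_t grid_idx(const grid3d *g, int i, int j, int k) {
    return (size_t)i * g->sx + (size_t)j * g->sy + (size_t)k;
}
```

A tuner would then time the kernel for each candidate pad value and keep the fastest, at the cost of the extra ((nx+2)·(ny+2)·pad) doubles of footprint.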
Padding
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding)
Thread/Cache Blocking
• Performed exhaustive search over all possible power-of-two parameter values
• Every thread block is the same size and shape
  • Preserves load balancing
• Did NOT cut in contiguous dimension on x86 machines
  • Avoids interrupting HW prefetchers
• Only performed cache blocking in one dimension
  • Sufficient to fit three read planes and one write plane into cache
(Figure: grid with axes x, y, and z, where z is unit-stride. Example: Thread Blocks in x: 4, Thread Blocks in y: 2, Thread Blocks in z: 2, Cache Blocks in y: 2)
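Cache blocking in a single dimension can be sketched as an extra outer loop over y blocks: within a block, the sweep in x works on planes of only by × nz points, so three read planes and one write plane occupy roughly 4 × by × nz × 8 bytes. The function name stencil3d_yblocked and the parameter by are assumptions of this sketch, and thread blocking is left out.

```c
#include <assert.h>
#include <stddef.h>

/* 7-point Jacobi sweep with cache blocking in y only.
   Layout assumption: (nx+2) x (ny+2) x (nz+2) doubles with ghost cells,
   z unit-stride. `by` is the cache block size in y (a tuning parameter). */
void stencil3d_yblocked(const double *A, double *B, int nx, int ny, int nz,
                        int by, double S0, double S1) {
    assert(by > 0);
    const size_t sy = (size_t)nz + 2;
    const size_t sx = ((size_t)ny + 2) * sy;
    for (int jj = 1; jj <= ny; jj += by) {             /* loop over y blocks */
        int jend = (jj + by - 1 < ny) ? jj + by - 1 : ny;
        for (int i = 1; i <= nx; i++)                  /* sweep planes in x */
            for (int j = jj; j <= jend; j++) {
                size_t c = (size_t)i * sx + (size_t)j * sy + 1;
                for (int k = 1; k <= nz; k++, c++)     /* unit stride left uncut */
                    B[c] = S0 * A[c]
                         + S1 * (A[c - 1] + A[c + 1] + A[c - sy] + A[c + sy]
                               + A[c - sx] + A[c + sx]);
            }
    }
}
```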
Thread/Cache Blocking
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking)
Software Prefetching
• Allows us to hide memory latency
• Searched over varying prefetch distances and granularities (e.g., prefetch every register block, plane, or pencil)
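A prefetch distance and granularity can be illustrated with the GCC/Clang builtin __builtin_prefetch. The sketch below issues one prefetch per eight doubles (assuming 64-byte cache lines) for the leading x+1 read stream, dist elements ahead. The macro PREFETCH_READ, the function name, and the choice of stream are assumptions. The talk does not show its prefetching code.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)  /* read, low temporal locality */
#else
#define PREFETCH_READ(p) ((void)0)                      /* no-op on other compilers */
#endif

/* 7-point Jacobi sweep with software prefetching of the x+1 plane.
   Layout assumption: (nx+2) x (ny+2) x (nz+2) doubles with ghost cells.
   `dist` is the prefetch distance in doubles (a tuning parameter). */
void stencil3d_prefetch(const double *A, double *B, int nx, int ny, int nz,
                        size_t dist, double S0, double S1) {
    const size_t sy = (size_t)nz + 2;
    const size_t sx = ((size_t)ny + 2) * sy;
    const size_t total = ((size_t)nx + 2) * sx;
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++) {
            size_t c = (size_t)i * sx + (size_t)j * sy + 1;
            for (int k = 1; k <= nz; k++, c++) {
                /* once per 8 doubles, and only for addresses inside the array */
                if ((k & 7) == 1 && c + sx + dist < total)
                    PREFETCH_READ(&A[c + sx + dist]);
                B[c] = S0 * A[c]
                     + S1 * (A[c - 1] + A[c + 1] + A[c - sy] + A[c + sy]
                           + A[c - sx] + A[c + sx]);
            }
        }
}
```

The prefetch is only a hint, so the computed values are identical to those of the unprefetched sweep. Only timing changes.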
Software Prefetching
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching)
SIMDization
• Requires complete code rewrite to utilize 128-bit SSE registers
• Allows single instruction to add/multiply two doubles
• Only possible on the x86 machines
• Padding performed to achieve proper data alignment (not to avoid conflicts)
• Searched over register block sizes and prefetch distances simultaneously
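A minimal sketch of the idea with SSE2 intrinsics, where each 128-bit register holds two doubles, so two grid points are updated per instruction. For brevity it uses unaligned loads and stores (_mm_loadu_pd, _mm_storeu_pd) instead of the alignment padding described above, and it falls back to scalar code when SSE2 is unavailable. The function name is an assumption, and this is not the rewritten kernel from the talk.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* 7-point Jacobi sweep, two points per iteration with SSE2.
   Layout assumption: (nx+2) x (ny+2) x (nz+2) doubles with ghost cells. */
void stencil3d_sse2(const double *A, double *B, int nx, int ny, int nz,
                    double S0, double S1) {
    const size_t sy = (size_t)nz + 2;
    const size_t sx = ((size_t)ny + 2) * sy;
#if defined(__SSE2__)
    const __m128d s0 = _mm_set1_pd(S0);   /* weights broadcast to both lanes */
    const __m128d s1 = _mm_set1_pd(S1);
#endif
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++) {
            size_t c = (size_t)i * sx + (size_t)j * sy + 1;
            int k = 1;
#if defined(__SSE2__)
            for (; k + 1 <= nz; k += 2, c += 2) {
                __m128d sum = _mm_add_pd(_mm_loadu_pd(A + c - 1), _mm_loadu_pd(A + c + 1));
                sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(A + c - sy),
                                                 _mm_loadu_pd(A + c + sy)));
                sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(A + c - sx),
                                                 _mm_loadu_pd(A + c + sx)));
                __m128d r = _mm_add_pd(_mm_mul_pd(s0, _mm_loadu_pd(A + c)),
                                       _mm_mul_pd(s1, sum));
                _mm_storeu_pd(B + c, r);
            }
#endif
            for (; k <= nz; k++, c++)      /* scalar remainder, or full fallback */
                B[c] = S0 * A[c]
                     + S1 * (A[c - 1] + A[c + 1] + A[c - sy] + A[c + sy]
                           + A[c - sx] + A[c + sx]);
        }
}
```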
SIMDization
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization)
Cache Bypass
• Writes data directly to write-back buffer
• No data load on write miss
• Changes stencil kernel’s flop:byte ratio from 1/3 to 1/2
• Reduces memory data traffic by 33%
• Still requires the SIMDized code from the previous optimization
• Searched over register block sizes and prefetch distances simultaneously
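On x86, one way to get this effect is the SSE2 non-temporal store _mm_stream_pd, which writes a 16-byte-aligned pair of doubles without first loading the destination line. That removes the write-allocate read, so traffic per point drops from 24 to 16 bytes, the 33% quoted above. The sketch below peels one scalar point when the destination is misaligned and falls back to scalar code without SSE2. The helper point7 and the function name are assumptions, not the code from the talk.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar 7-point update at linear index c. */
static double point7(const double *A, size_t c, size_t sx, size_t sy,
                     double S0, double S1) {
    return S0 * A[c]
         + S1 * (A[c - 1] + A[c + 1] + A[c - sy] + A[c + sy] + A[c - sx] + A[c + sx]);
}

/* 7-point Jacobi sweep whose results bypass the cache on the way to B.
   Layout assumption: (nx+2) x (ny+2) x (nz+2) doubles with ghost cells. */
void stencil3d_stream(const double *A, double *B, int nx, int ny, int nz,
                      double S0, double S1) {
    const size_t sy = (size_t)nz + 2;
    const size_t sx = ((size_t)ny + 2) * sy;
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++) {
            size_t c = (size_t)i * sx + (size_t)j * sy + 1;
            int k = 1;
#if defined(__SSE2__)
            const __m128d s0 = _mm_set1_pd(S0), s1 = _mm_set1_pd(S1);
            if (((uintptr_t)(B + c) & 15) != 0) {      /* peel to 16-byte alignment */
                B[c] = point7(A, c, sx, sy, S0, S1);
                k++;
                c++;
            }
            for (; k + 1 <= nz; k += 2, c += 2) {
                __m128d sum = _mm_add_pd(_mm_loadu_pd(A + c - 1), _mm_loadu_pd(A + c + 1));
                sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(A + c - sy),
                                                 _mm_loadu_pd(A + c + sy)));
                sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(A + c - sx),
                                                 _mm_loadu_pd(A + c + sx)));
                __m128d r = _mm_add_pd(_mm_mul_pd(s0, _mm_loadu_pd(A + c)),
                                       _mm_mul_pd(s1, sum));
                _mm_stream_pd(B + c, r);               /* non-temporal store */
            }
#endif
            for (; k <= nz; k++, c++)                  /* remainder or fallback */
                B[c] = point7(A, c, sx, sy, S0, S1);
        }
#if defined(__SSE2__)
    _mm_sfence();   /* make the streamed stores visible before B is read */
#endif
}
```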
Cache Bypass
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization, +Cache Bypass)
Collaborative Threading
• Requires another complete code rewrite
• CT allows for better L1 cache utilization when switching threads
• Only effective on VF due to:
  • very small L1 cache (8 KB) shared by 8 HW threads
  • lack of hardware prefetchers (allows us to cut in contiguous dimension)
• Drawback: Parameter space becomes very large
(Figure: assignment of threads t0 through t7 to the grid, without and with collaboration; axes x, y, and z, where z is unit-stride)
No Collaboration: Thread Blocks in x: 4, Thread Blocks in y: 2, Thread Blocks in z: 2, Cache Blocks in y: 2
With Collaboration: Large Coll. TBs in y: 4, Large Coll. TBs in z: 2, Small Coll. TBs in y: 2, Small Coll. TBs in z: 4
Collaborative Threading
(Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization, +Cache Bypass, +Collaborative Threading)
Autotuning Results
(Performance charts; legend: Naive, +NUMA-Aware, +Loop Unrolling/Reordering, +Padding, +Thread/Cache Blocking, +Prefetching, +SIMDization, +Cache Bypass, +Collaborative Threading)
• Intel Clovertown: 1.9x better
• AMD Barcelona: 5.4x better
• Sun Victoria Falls: 10.4x better
Architecture Comparison
(Charts: performance and power efficiency, each in double precision and single precision)
Conclusions
• Compilers alone fail to fully utilize system resources
  • Programmers may not even know that the system is being underutilized
• Autotuning provides a portable and effective solution
  • Produces up to a 10.4x improvement over compiler alone
• To make autotuning tractable:
  • Choose the order of optimizations appropriately for the platform
  • Prune the search space intelligently for large searches
• Power efficiency has become a valuable metric
  • Local store-based architectures (e.g., Cell and G80) are usually more efficient than cache-based machines
Acknowledgements
• Sam Williams for:
  • writing the Cell stencil code
  • guiding my work by autotuning SpMV and LBMHD
• Vasily Volkov for writing the G80 CUDA code
• Kathy Yelick and Jim Demmel for general advice and feedback