Memory Access Patterns for Cellular Automata Using GPGPUs • By: James M. Balasalle
Agenda • Background Information • Different Patterns and Techniques • Results • Case Study: Surface Water Flow • Conclusions • Questions
Background Info: Parallel Processing • How is parallel processing related to Moore’s Law? • Supercomputers • Multicore CPUs • Interconnected, independent machines • Clusters, MPI • Grid computing • GPUs
Background Info: Cellular Automata • A cellular automaton (CA) is a discrete mathematical model used to calculate the global behavior of a complex system using (ideally) simple local rules. • Usually a grid-based model of states • Values are determined by local neighbors • Wide range of applications
Background Info: Conway’s Game of Life • The Game of Life, showing several well-known patterns: crab, gliders, etc.
Background Info: Conway’s Game of Life • Cellular Automaton • Cell has two states: alive and dead • Next state is based on the surrounding 8 neighbors • Alive Cell: • 2 or 3 live neighbors: stay alive, else die • Dead Cell: • Exactly 3 live neighbors: come alive, else stay dead • Simple rules lead to complex patterns
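The rules above can be sketched as a CPU reference implementation (the thesis targets GPU kernels; this serial version only illustrates the update rule, with out-of-grid neighbors treated as dead):

```c
/* One generation of Conway's Game of Life on a width x height grid.
   grid/next hold 0 (dead) or 1 (alive); cells outside the grid count as dead. */
void life_step(const unsigned char *grid, unsigned char *next,
               int width, int height) {
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            int live = 0;
            /* count the 8 surrounding neighbors */
            for (int dr = -1; dr <= 1; dr++) {
                for (int dc = -1; dc <= 1; dc++) {
                    if (dr == 0 && dc == 0) continue;
                    int r = row + dr, c = col + dc;
                    if (r >= 0 && r < height && c >= 0 && c < width)
                        live += grid[r * width + c];
                }
            }
            unsigned char alive = grid[row * width + col];
            /* alive with 2 or 3 live neighbors survives; dead with exactly 3 is born */
            next[row * width + col] = alive ? (live == 2 || live == 3)
                                            : (live == 3);
        }
    }
}
```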
Background Info: SIFT • Scale Invariant Feature Transform • Calculation of robust features in an image • Features can then be used to identify images or portions of an image • Widely used in Computer Vision Applications • From: http://acmechimera.blogspot.com/2008/03/paper-review-distinctive-image-features.html
Background Info: SIFT • SIFT is a pipeline of successive operations • Initial Keypoint detection • Keypoint refinement, edge removal • Keypoint orientation calculation • Keypoint descriptor creation
Background Info: SIFT • Focus is on Step 1: initial keypoint detection • Scale Space creation: successive Gaussian blurring and downsampling • Difference of Gaussians, adjacent in scale space • Local extrema detection in DoG • Resulting extrema are initial candidate keypoints
Nvidia GPUs • External coprocessor card, connected to the system bus • Manages its own DRAM store • Made up of one or more Streaming Multiprocessors (SMs) • Each SM contains • 8 processing cores • 16KB of on-chip cache / storage • 2 Special Function Units for transcendentals, etc.
Nvidia GPUs • Memory Regions: • Global Memory – non-cached memory, similar to RAM for CPU • Shared Memory – user-managed, on-chip cache • Texture Memory – alternative access path for accessing global memory, hardware calculations supported • Constant Memory – immutable cached memory store
Patterns and Techniques • Two broad categories: • Resource Utilization • Different memory regions • Memory alignment and coalescence • Maximizing bus usage • Overhead Reduction • Instruction Reduction • Arithmetic intensity
Patterns and Techniques • Global Memory • Per-cell conditional bounds checks for boundary cells vs. a padded memory halo of ghost cells • The halo approach achieves an 18% speed increase
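One way to picture the halo trade-off (a sketch, not the thesis code): pad the grid with a one-cell ring of zeros, so interior cells can read all 8 neighbors without any branch. The branches saved per cell are what buys the speedup on the GPU, where divergent conditionals are costly.

```c
/* Neighbor sum over a grid padded with a one-cell halo of zeros.
   The halo removes per-cell boundary conditionals: every interior
   cell can read all 8 neighbors unconditionally.
   row, col index the interior; padded is (height+2) x pwidth,
   where pwidth = interior width + 2. */
int neighbor_sum_halo(const int *padded, int pwidth, int row, int col) {
    const int *c = padded + (row + 1) * pwidth + (col + 1);
    return c[-pwidth - 1] + c[-pwidth] + c[-pwidth + 1]
         + c[-1]                       + c[+1]
         + c[+pwidth - 1] + c[+pwidth] + c[+pwidth + 1];
}
```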
Patterns and Techniques • Shared vs. global memory • Utilize faster on-chip cache for frequently requested data • Shared memory is 30% faster
Patterns and Techniques Coalescence: when all memory access requests for a half-warp are aggregated into a single request. • Aligned memory: • Align data on a 64- or 128-byte boundary • Achieved by padding each row • For a half-warp, coalescence reduces the number of requests from 16 to 1 (or 2) • 8% performance improvement • May require significant host-side CPU preprocessing
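The row-padding arithmetic is simple (in CUDA this is what `cudaMallocPitch` does for you; the rounding shown here is a sketch of the same idea): round each row's byte length up to the next 64-byte multiple so every row starts on an aligned boundary and a half-warp's 16 consecutive 4-byte loads fall in one transaction.

```c
#include <stddef.h>

/* Pitch in bytes of one row after padding to a 64-byte boundary. */
size_t aligned_pitch(size_t width, size_t elem_size) {
    size_t bytes = width * elem_size;
    return (bytes + 63) & ~(size_t)63;   /* round up to a multiple of 64 */
}
```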
Patterns and Techniques • Memory Region Shape • Minimum bus transaction is 32 bytes, even for 4-byte requests • Some halo cells are unaligned, minimize these • 16% faster
Patterns and Techniques • Moving into overhead-reduction and arithmetic-intensity techniques • Index calculations, performed by every thread: • unsigned int row = blockIdx.y * blockDim.y + threadIdx.y; • unsigned int col = blockIdx.x * blockDim.x + threadIdx.x; • int idx = row * width + col; • Approximately 15 total instructions to compute idx • For 1M data elements, 15,000,000 instructions devoted to index calculation
Patterns and Techniques • Calculate 2 (or more) elements per thread • Calculate the first index, using ~15 instructions • Calculate the second index, relative to the first, in a single add instruction • For 1M elements, 8,000,000 instructions; a 46% reduction • 44% performance improvement over aligned memory alone
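The index scheme can be emulated on the CPU (a sketch: the variables `bx, by, tx, ty, bdimx, bdimy` stand in for CUDA's `blockIdx`, `threadIdx`, and `blockDim`; the 2-elements-per-thread layout is one plausible arrangement, with each thread owning two horizontally adjacent cells):

```c
/* Each "thread" computes its first flat index with the full calculation,
   then derives the second with a single add. With 2 columns per thread,
   the grid's x-dimension only needs to cover width/2 elements. */
void two_element_indices(int bx, int by, int tx, int ty,
                         int bdimx, int bdimy, int width,
                         int *idx0, int *idx1) {
    int row = by * bdimy + ty;         /* full calculation (~15 GPU instructions) */
    int col = (bx * bdimx + tx) * 2;   /* this thread's first column */
    *idx0 = row * width + col;
    *idx1 = *idx0 + 1;                 /* second index: one add */
}
```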
Patterns and Techniques • Arithmetic intensity: ratio of actual computation to memory loads and index calculations • Multiple elements per thread • Multi-generation implementations • Data packing / interleaving
Patterns and Techniques • Multi-generational kernel • Compute 2 generations in a single kernel launch • Reduces total index calculations • Reduces total memory loads • Uses shared memory for temporary storage
Patterns and Techniques • Multi-generational kernel • Results are poor • Instruction count is limiting factor • Index calculations!
Patterns and Techniques • Multi-generational kernel thread allocations • One thread per effective element • Results in many threads loading multiple elements • And computing multiple elements for each generation • Each load, computation requires index calculations • One thread for each element required to be loaded • Not implemented, future work
SIFT Results • The 2-element kernel is faster, by approximately 37% • Improvement due to instruction reduction • Gaussian Blur • Implemented as a non-separable convolution • Multiply a square matrix by each element and its neighbors • The square matrix is the result of the Gaussian function • Data elements are pixel values of the image in question
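A serial sketch of the non-separable convolution (3x3 for brevity; the thesis's kernel sizes and border policy may differ, and edge clamping here is an assumption):

```c
/* Non-separable 2D convolution: each output pixel is the weighted sum of
   its 3x3 neighborhood with kernel k (e.g. Gaussian-derived weights).
   Out-of-range neighbor coordinates are clamped to the image edge. */
void convolve3x3(const float *img, float *out, int width, int height,
                 const float k[9]) {
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int sx = x + dx, sy = y + dy;
                    if (sx < 0) sx = 0;
                    if (sx >= width) sx = width - 1;
                    if (sy < 0) sy = 0;
                    if (sy >= height) sy = height - 1;
                    acc += k[(dy + 1) * 3 + (dx + 1)] * img[sy * width + sx];
                }
            }
            out[y * width + x] = acc;
        }
    }
}
```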
SIFT Results • Difference of Gaussians • Simply subtract results of blurring kernel • Kernel is extremely simple: more index calculations than effective operations • Kernel utilizes data packing • Too simple to measure
SIFT Results • Extrema Detection • Each element compares itself to its neighbors • Minimum and maximum values are extrema
SIFT Results • Extrema Detection • 2-element kernel is fastest • Rectangular kernel not effective since algorithm has built-in bounds checking
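The per-element comparison can be sketched as follows (SIFT actually compares against 26 neighbors across three adjacent DoG scales; this in-plane 8-neighbor version shows the pattern, and the caller is assumed to skip border pixels, matching the algorithm's built-in bounds handling):

```c
/* A pixel is a local extremum if it is strictly greater than, or strictly
   less than, all 8 of its in-plane neighbors. (row, col) must be interior. */
int is_extremum(const float *dog, int width, int row, int col) {
    float v = dog[row * width + col];
    int gt = 1, lt = 1;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++) {
            if (dr == 0 && dc == 0) continue;
            float n = dog[(row + dr) * width + (col + dc)];
            gt &= (v > n);   /* still a candidate maximum? */
            lt &= (v < n);   /* still a candidate minimum? */
        }
    return gt || lt;
}
```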
Case Study: Surface Water Flow • Based on a master’s thesis by Jay Parsons • Using a digital elevation map, determine the amount and location of water during and after a rain event • Built upon a CA model that uses the elevation difference between cells to determine where water flows
Case Study: Surface Water Flow • Sample output
Case Study: Surface Water Flow • Initial Steps • Port from Java to C++ • Gain understanding • Create a baseline implementation for timing comparisons • Initial GPU implementation • Application of techniques
Case Study: Surface Water Flow • Problem: • During the processing of one cell, state values of its neighbors were updated • Design decision to make calculation of incoming water easier • Complicates CA implementation • Push vs. Pull methods
Case Study: Surface Water Flow • Modify implementation, simplify CA rules • New value is: • current value – outgoing volume + incoming volume • Incoming volume is more difficult to calculate • Dramatic improvement: 3.6x speedup • Reduced instruction count • Better usage of shared memory
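The pull formulation can be sketched with an illustrative flow rule (the half-to-steepest-lower-neighbor rule and the function names below are assumptions for the sketch, not the thesis's actual rules): a cell never writes to its neighbors; instead it recomputes each neighbor's outflow decision to determine its own inflow, giving new = current − outgoing + incoming.

```c
/* Index of the steepest lower 4-neighbor of (row, col), or -1 if none.
   Out-of-grid neighbors are ignored. */
static int outflow_target(const float *elev, int width, int height,
                          int row, int col) {
    static const int dr[4] = {-1, 1, 0, 0}, dc[4] = {0, 0, -1, 1};
    int best = -1;
    float bestdrop = 0.0f;
    for (int i = 0; i < 4; i++) {
        int r = row + dr[i], c = col + dc[i];
        if (r < 0 || r >= height || c < 0 || c >= width) continue;
        float drop = elev[row * width + col] - elev[r * width + c];
        if (drop > bestdrop) { bestdrop = drop; best = r * width + c; }
    }
    return best;
}

/* Pull-style update: each cell sends half its water to its steepest lower
   neighbor; incoming water is found by asking each neighbor where it sends. */
void water_step(const float *elev, const float *water, float *next,
                int width, int height) {
    static const int dr[4] = {-1, 1, 0, 0}, dc[4] = {0, 0, -1, 1};
    for (int row = 0; row < height; row++)
        for (int col = 0; col < width; col++) {
            int idx = row * width + col;
            float out = (outflow_target(elev, width, height, row, col) >= 0)
                        ? 0.5f * water[idx] : 0.0f;
            float in = 0.0f;
            for (int i = 0; i < 4; i++) {
                int r = row + dr[i], c = col + dc[i];
                if (r < 0 || r >= height || c < 0 || c >= width) continue;
                if (outflow_target(elev, width, height, r, c) == idx)
                    in += 0.5f * water[r * width + c];
            }
            next[idx] = water[idx] - out + in;   /* current - outgoing + incoming */
        }
}
```

Because every cell only reads its neighbors' state and writes its own, the pull method needs no synchronization between cells within a generation, which is what makes it a natural fit for the GPU.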
Recap • What worked • Shared memory, memory alignment, 2-element processing, rectangular regions • What didn’t work • Multi-generation kernels – more investigation needed • Future work • Data packing • Texture memory
Observations • Balance between instruction-bound and memory-bound • Strict CA rules help performance and implementation • Powerful analysis tools required • Compromises • Shared Memory • 2-element processing • Rectangular regions
Conclusion • GPUs are a great platform for cellular automata • And for other problems that exhibit spatial locality • Techniques presented have a real, measurable impact • Straightforward implementation • Applicable to a wide range of problems • Worthwhile area of research