
Memory Access Patterns For Cellular Automata Using GPGPUs


Presentation Transcript


  1. By: James M. Balasalle Memory Access Patterns For Cellular Automata Using GPGPUs

  2. Agenda • Background Information • Different Patterns and Techniques • Results • Case Study: Surface Water Flow • Conclusions • Questions

  3. Background Info: Parallel Processing • How is parallel processing related to Moore’s Law? • Supercomputers • Multicore CPUs • Interconnected, Independent Machines • Clusters, MPI • Grid Computing • GPUs

  4. Background Info: Cellular Automata • A cellular automaton (CA) is a discrete mathematical model used to calculate the global behavior of a complex system using (ideally) simple local rules. • Usually grid-based model of states • Values are determined by local neighbors • Wide range of applications

  5. Background Info: Conway’s Game of Life • The Game of Life, showing several well-known patterns: crab, gliders, etc.

  6. Background Info: Conway’s Game of Life • Cellular Automaton • Cell has two states: alive and dead • Next state is based on the surrounding 8 neighbors • Alive Cell: • 2 or 3 live neighbors: stay alive, else die • Dead Cell: • Exactly 3 live neighbors: come alive, else stay dead • Simple rules lead to complex patterns
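
To make the rule concrete, the sketch below shows a minimal CUDA kernel for one Game of Life generation (not taken from the thesis; the names, the unsigned char cell encoding, and the 16x16 launch shape are illustrative assumptions): one thread reads its eight neighbors from global memory and writes the cell's next state.

```cuda
// Minimal sketch of one Game of Life step: one thread per cell, pull-based.
// Grids are width x height, row-major, 1 = alive, 0 = dead (assumed encoding).
__global__ void lifeStep(const unsigned char* in, unsigned char* out,
                         int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= height || col >= width) return;

    int alive = 0;                       // count the eight surrounding neighbors
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            int r = row + dr, c = col + dc;
            if ((dr || dc) && r >= 0 && r < height && c >= 0 && c < width)
                alive += in[r * width + c];
        }

    unsigned char cur = in[row * width + col];
    // Alive cell survives with 2 or 3 live neighbors; dead cell is born with exactly 3.
    out[row * width + col] = cur ? (alive == 2 || alive == 3) : (alive == 3);
}

// Host-side launch, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   lifeStep<<<grid, block>>>(d_in, d_out, width, height);
```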

  7. Background Info: SIFT • Scale Invariant Feature Transform • Calculation of robust features in an image • Features can then be used to identify images or portions of an image • Widely used in Computer Vision Applications • From: http://acmechimera.blogspot.com/2008/03/paper-review-distinctive-image-features.html

  8. Background Info: SIFT • SIFT is a pipeline of successive operations • Initial Keypoint detection • Keypoint refinement, edge removal • Keypoint orientation calculation • Keypoint descriptor creation

  9. Background Info: SIFT • Focus is on Step 1: initial keypoint detection • Scale Space creation: successive Gaussian blurring and downsampling • Difference of Gaussians, adjacent in scale space • Local extrema detection in DoG • Resulting extrema are initial candidate keypoints

  10. Nvidia GPUs • External coprocessor card, connected to the system bus • Manages its own DRAM store • Made up of one or more Streaming Multiprocessors (SMs) • Each SM contains • 8 processing cores • 16KB of on-chip cache / storage • 2 Special Function Units for transcendentals, etc.

  11. Nvidia GPUs • Memory Regions: • Global Memory – non-cached memory, similar to RAM for CPU • Shared Memory – user-managed, on-chip cache • Texture Memory – alternative access path for accessing global memory, hardware calculations supported • Constant Memory – immutable cached memory store

  12. Patterns and Techniques • Two broad categories: • Resource Utilization • Different memory regions • Memory alignment and coalescence • Maximizing bus usage • Overhead Reduction • Instruction Reduction • Arithmetic intensity

  13. Patterns and Techniques • Global Memory • Two options for boundary cells: conditional logic in the kernel vs. a memory halo of padding cells (see the sketch below) • The halo approach achieves an 18% speed increase
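
One possible shape of the halo variant, assuming the grid is stored with a one-cell border of permanently dead cells (names and the unsigned char encoding are assumptions): because every interior cell now has all eight neighbors in memory, the per-cell boundary branches disappear.

```cuda
// Halo sketch: the grid is allocated as (width + 2) x (height + 2) with a border
// of dead cells, so interior threads never test for the grid edge.
__global__ void lifeStepHalo(const unsigned char* in, unsigned char* out,
                             int paddedWidth /* width + 2 */, int width, int height)
{
    // The +1 offsets skip the halo row and column at the top and left.
    int col = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int row = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (row > height || col > width) return;

    const unsigned char* c = in + row * paddedWidth + col;
    int alive = c[-paddedWidth - 1] + c[-paddedWidth] + c[-paddedWidth + 1]
              + c[-1]                                 + c[+1]
              + c[ paddedWidth - 1] + c[ paddedWidth] + c[ paddedWidth + 1];

    out[row * paddedWidth + col] = *c ? (alive == 2 || alive == 3) : (alive == 3);
}
```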

  14. Patterns and Techniques • Shared vs. global memory • Utilize faster on-chip cache for frequently requested data • Shared memory is 30% faster
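
One common way to apply this, sketched below under assumed names and a 16x16 block size: each block cooperatively stages its tile of the grid, plus a one-cell halo, into shared memory, and the eight neighbor reads then come from the on-chip cache instead of global memory.

```cuda
#define BLOCK 16   // assumed block size

__global__ void lifeStepShared(const unsigned char* in, unsigned char* out,
                               int width, int height)
{
    __shared__ unsigned char tile[BLOCK + 2][BLOCK + 2];

    // Cooperative load: the BLOCK x BLOCK threads stride over the padded tile.
    for (int i = threadIdx.y * BLOCK + threadIdx.x;
         i < (BLOCK + 2) * (BLOCK + 2);
         i += BLOCK * BLOCK) {
        int tr = i / (BLOCK + 2), tc = i % (BLOCK + 2);
        int gr = (int)(blockIdx.y * BLOCK) + tr - 1;   // global row, shifted for the halo
        int gc = (int)(blockIdx.x * BLOCK) + tc - 1;   // global column, shifted for the halo
        tile[tr][tc] = (gr >= 0 && gr < height && gc >= 0 && gc < width)
                     ? in[gr * width + gc] : 0;        // out-of-range cells count as dead
    }
    __syncthreads();

    int col = blockIdx.x * BLOCK + threadIdx.x;
    int row = blockIdx.y * BLOCK + threadIdx.y;
    if (row >= height || col >= width) return;

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
    int alive = tile[ty-1][tx-1] + tile[ty-1][tx] + tile[ty-1][tx+1]
              + tile[ty  ][tx-1]                  + tile[ty  ][tx+1]
              + tile[ty+1][tx-1] + tile[ty+1][tx] + tile[ty+1][tx+1];

    out[row * width + col] = tile[ty][tx] ? (alive == 2 || alive == 3) : (alive == 3);
}
```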

  15. Patterns and Techniques Coalescence: when all memory access requests for a half-warp are aggregated into a single request. • Aligned memory: • Align data on a 64 or 128-byte boundary • Achieved by padding each row • For a half-warp, coalescence reduces number of requests from 16 to 1 (or 2) • 8% performance improvement • Could possibly require significant host CPU processing
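
On the host side, the row padding can be obtained with cudaMallocPitch, which rounds every row up to the device's alignment requirement; the sketch below uses arbitrary example dimensions.

```cuda
#include <cuda_runtime.h>

int main()
{
    const int width = 1024, height = 1024;   // example dimensions
    unsigned char* d_grid = nullptr;
    size_t pitch = 0;                        // padded row length in bytes

    // Each row starts on an aligned boundary, so half-warp accesses can coalesce.
    cudaMallocPitch((void**)&d_grid, &pitch, width * sizeof(unsigned char), height);

    // Inside a kernel, rows are then addressed with the pitch, not the width:
    //   unsigned char* rowPtr = d_grid + row * pitch;
    //   unsigned char  cell   = rowPtr[col];

    cudaFree(d_grid);
    return 0;
}
```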

  16. Patterns and Techniques • Memory Region Shape • Minimum bus transaction is 32 bytes, even for 4-byte requests • Some halo cells are unaligned, minimize these • 16% faster

  17. Patterns and Techniques • Moving into overhead reduction and arithmetic intensity focused techniques • Index calculations, performed by every thread: • unsigned int row = blockIdx.y * blockDim.y + threadIdx.y; • unsigned int col = blockIdx.x * blockDim.x + threadIdx.x; • int idx = row * width + col; • Approximately 15 total instructions to compute idx • For 1M data elements, 15,000,000 instructions devoted to index calculation

  18. Patterns and Techniques • Calculate 2 (or more) elements per thread • Calculate the first index, using ~15 instructions • Calculate the second index, relative to the first, in a single add instruction • For 1M elements, 8,000,000 instructions; a 46% reduction • 44% performance improvement over aligned memory
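
A sketch of the two-elements-per-thread idea (names are placeholders, and nextState simply factors the per-cell rule into a __device__ helper): the full row/column index is computed once, and the second element's index is a single add.

```cuda
// Assumed helper: the per-cell Game of Life rule, factored out so a thread
// can apply it to more than one cell.
__device__ unsigned char nextState(const unsigned char* in, int width, int height,
                                   int row, int col)
{
    int alive = 0;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            int r = row + dr, c = col + dc;
            if ((dr || dc) && r >= 0 && r < height && c >= 0 && c < width)
                alive += in[r * width + c];
        }
    unsigned char cur = in[row * width + col];
    return cur ? (alive == 2 || alive == 3) : (alive == 3);
}

// Two horizontally adjacent cells per thread: the expensive index calculation
// is paid once, the second element costs one extra add.
__global__ void lifeStepTwoPerThread(const unsigned char* in, unsigned char* out,
                                     int width, int height)
{
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    int idx = row * width + col;            // full index calculation, paid once

    if (row < height && col < width)
        out[idx] = nextState(in, width, height, row, col);
    if (row < height && col + 1 < width)
        out[idx + 1] = nextState(in, width, height, row, col + 1);  // single add
}
```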

  19. Patterns and Techniques • Arithmetic intensity: ratio of actual computation to memory loads and index calculations • Multiple elements per thread • Multi-generation implementations • Data packing / interleaving

  20. Patterns and Techniques • Multi-generational kernel • Compute 2 generations in a single kernel launch • Reduces total index calculations • Reduces total memory loads • Uses shared memory for temporary storage

  21. Patterns and Techniques • Multi-generational kernel • Results are poor • Instruction count is limiting factor • Index calculations!

  22. Patterns and Techniques • Multi-generational kernel thread allocation strategies • Strategy 1: one thread per effective (output) element • Many threads then load multiple elements and compute multiple elements for each generation • Every such load and computation requires its own index calculation • Strategy 2: one thread for each element that must be loaded • Not implemented; future work

  23. SIFT Results • The 2-element kernel is faster, by approximately 37% • Improvement due to instruction reduction • Gaussian Blur • Implemented as a non-separable convolution • Multiply a square matrix by each element and its neighbors • The square matrix is the result of the Gaussian function • Data elements are the pixel values of the image in question
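
A non-separable blur of this kind might look like the sketch below (the kernel size, constant-memory storage of the weights, and float pixel format are assumptions, not details from the thesis): each output pixel is the weighted sum of its KSIZE x KSIZE neighborhood, with the image border clamped.

```cuda
#define KSIZE 5   // assumed kernel width

// Assumed to be filled from the host with cudaMemcpyToSymbol before launch.
__constant__ float d_gauss[KSIZE][KSIZE];

__global__ void gaussianBlur(const float* in, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= height || col >= width) return;

    float sum = 0.0f;
    for (int dr = -KSIZE / 2; dr <= KSIZE / 2; ++dr)
        for (int dc = -KSIZE / 2; dc <= KSIZE / 2; ++dc) {
            int r = min(max(row + dr, 0), height - 1);   // clamp at the image border
            int c = min(max(col + dc, 0), width - 1);
            sum += d_gauss[dr + KSIZE / 2][dc + KSIZE / 2] * in[r * width + c];
        }
    out[row * width + col] = sum;
}
```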

  24. SIFT Results • Difference of Gaussians • Simply subtract results of blurring kernel • Kernel is extremely simple: more index calculations than effective operations • Kernel utilizes data packing • Too simple to measure
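
The subtraction kernel really is this small, which is why it spends more instructions on indexing than on useful work; a minimal sketch with assumed names is shown below (the data-packing variant mentioned above would process several adjacent values per thread, while this shows only the plain element-wise form).

```cuda
// Element-wise Difference of Gaussians: subtract two adjacent scale-space images.
__global__ void differenceOfGaussians(const float* blurFine, const float* blurCoarse,
                                      float* dog, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        dog[idx] = blurFine[idx] - blurCoarse[idx];
}
```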

  25. SIFT Results • Extrema Detection • Each element compares itself to its neighbors • Minimum and maximum values are extrema

  26. SIFT Results • Extrema Detection • 2-element kernel is fastest • Rectangular kernel not effective since algorithm has built-in bounds checking
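
For reference, a single-scale sketch of the comparison (the full SIFT test also checks the nine neighbors in the scales above and below; the names and the output flag format are assumptions). The border test at the top is the "built-in bounds checking" the slide refers to.

```cuda
// Mark pixels that are strictly smaller or larger than all 8 in-plane neighbors.
__global__ void detectExtrema(const float* dog, unsigned char* isExtremum,
                              int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // Built-in bounds check: border pixels have no complete neighborhood.
    if (row < 1 || row >= height - 1 || col < 1 || col >= width - 1) return;

    float v = dog[row * width + col];
    bool isMin = true, isMax = true;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            float n = dog[(row + dr) * width + (col + dc)];
            isMin = isMin && (v < n);
            isMax = isMax && (v > n);
        }
    isExtremum[row * width + col] = (isMin || isMax) ? 1 : 0;
}
```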

  27. Case Study: Surface Water Flow • Based on a Master’s thesis by Jay Parsons • Using a digital elevation map, determine the amount and location of water during and after a rain event • Built upon a CA model that uses the elevation difference between cells to determine where water flows

  28. Case Study: Surface Water Flow • Sample output

  29. Case Study: Surface Water Flow • Initial Steps • Port from Java to C++ • Gain understanding • Create a baseline implementation for timing comparisons • Initial GPU implementation • Application of techniques

  30. Case Study: Surface Water Flow • Problem: • During the processing of one cell, the state values of its neighbors were updated • Design decision to make calculation of incoming water easier • Complicates the CA implementation • Push vs. pull methods: a push update writes into neighbors’ state, while a pull update reads from neighbors and writes only the cell itself

  31. Case Study: Surface Water Flow • Modify the implementation to simplify the CA rules • New value is: current value – outgoing volume + incoming volume (see the sketch below) • Incoming volume is more difficult to calculate • Dramatic improvement: 3.6x speedup • Reduced instruction count • Better usage of shared memory
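
A very rough sketch of the pull-style update (the actual flow rule in the thesis is more involved; flowFraction and all names here are placeholders): each cell computes its own outgoing volume and re-evaluates each neighbor's rule to find the incoming volume, so no thread ever writes to another cell's state.

```cuda
// Placeholder rule: a fixed share of the positive head difference moves.
__device__ float flowFraction(float headFrom, float headTo)
{
    return headFrom > headTo ? 0.25f * (headFrom - headTo) : 0.0f;
}

__global__ void waterStep(const float* elev, const float* water, float* waterNext,
                          int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= height || col >= width) return;

    int idx = row * width + col;
    float myHead = elev[idx] + water[idx];
    float outgoing = 0.0f, incoming = 0.0f;

    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            int r = row + dr, c = col + dc;
            if ((dr == 0 && dc == 0) || r < 0 || r >= height || c < 0 || c >= width)
                continue;
            float nHead = elev[r * width + c] + water[r * width + c];
            outgoing += flowFraction(myHead, nHead);   // water this cell sends out
            incoming += flowFraction(nHead, myHead);   // water the neighbor sends here
        }

    // New value: current value - outgoing volume + incoming volume.
    waterNext[idx] = water[idx] - outgoing + incoming;
}
```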

  32. Case Study: Surface Water Flow

  33. Recap • What worked • Shared memory, memory alignment, 2-element processing, rectangular regions • What didn’t work • Multi-generation kernels – more investigation needed • Future work • Data packing • Texture memory

  34. Observations • Balance between instruction-bound and memory-bound • Strict CA rules help performance and implementation • Powerful analysis tools required • Compromises • Shared Memory • 2-element processing • Rectangular regions

  35. Conclusion • GPUs are a great platform for cellular automata • And for other problems that exhibit spatial locality • The techniques presented have a real, measurable impact • Straightforward implementation • Applicable to a wide range of problems • Worthwhile area of research

  36. Questions??
