
nBody Simulation with CUDA

Joshua Brunner, Alexander Zdun, Erik Stadler. nBody Simulation with CUDA. Thread Synchronization: __syncthreads() keeps the threads in a block synchronized. It can hurt performance if called too often, especially when a block has many threads.


Presentation Transcript


  1. Joshua Brunner, Alexander Zdun, Erik Stadler nBody Simulation with CUDA

  2. Thread Synchronization • __syncthreads() • Barrier that keeps the threads in a block synchronized • Can hurt performance if called too often, especially when a block has many threads • One way to optimize: eliminate unneeded synchronization calls • cudaThreadSynchronize() (deprecated in later CUDA releases in favor of cudaDeviceSynchronize()) • Called from the host main function • Blocks the host so that results are printed only after all N steps have finished
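The two kinds of synchronization on this slide can be sketched as follows. This is a minimal illustration, not the presenters' code: the kernel `blockSum`, the host wrapper `sumOnDevice`, and the block size `kBlock` are hypothetical names, and the sketch uses `cudaDeviceSynchronize()`, the current replacement for `cudaThreadSynchronize()`. The barriers appear only where one thread reads a value written by another thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // assumed block size; must be a power of two here

// Each block sums kBlock input elements into one partial result.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float buf[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // every load must finish before a thread reads another slot

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        }
        // One barrier per tree level, placed outside the `if` so that
        // every thread in the block reaches it.
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

// Hypothetical host wrapper: sums a vector on the device.
float sumOnDevice(const std::vector<float>& h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kBlock - 1) / kBlock;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlock>>>(dIn, dOut, n);
    // Host-side barrier: block until the kernel has finished. The blocking
    // cudaMemcpy below would also wait; the explicit call makes the
    // synchronization point visible.
    cudaDeviceSynchronize();

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

In a simulation loop the host-side call would sit after the last of the N step launches, just before the results are copied back and printed.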

  3. Shared Memory • Reduce global memory traffic by staging data from global device memory in on-chip shared memory • Many global memory accesses are costly, but don't overuse registers either!
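A common way to apply this to an n-body force loop is tiling: each block copies a tile of bodies into shared memory once, and every thread in the block then reads that tile. The sketch below assumes this approach; the slides do not show the actual kernel. The names `accelTiled`, `computeAccel`, `kTile`, and `kSoftening2` are hypothetical, positions are stored as `float4` with the mass in `.w`, and the gravitational constant is taken as 1.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 128;              // assumed tile size, equal to the block size
constexpr float kSoftening2 = 1e-9f;    // assumed softening term, avoids division by zero

// pos[i] = (x, y, z, mass). Writes the acceleration on each body into acc.
__global__ void accelTiled(const float4* pos, float3* acc, int n) {
    __shared__ float4 tile[kTile];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    float3 a = make_float3(0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += kTile) {
        int j = base + threadIdx.x;
        // One global read per thread per tile; out-of-range slots get zero mass.
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < kTile; ++k) {
            float dx = tile[k].x - pi.x;
            float dy = tile[k].y - pi.y;
            float dz = tile[k].z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + kSoftening2;
            float inv = rsqrtf(r2);
            float s = tile[k].w * inv * inv * inv;  // m_j / r^3
            a.x += dx * s;
            a.y += dy * s;
            a.z += dz * s;
        }
        __syncthreads();  // everyone done reading before the tile is overwritten
    }
    if (i < n) acc[i] = a;
}

// Hypothetical host wrapper around the kernel.
std::vector<float3> computeAccel(const std::vector<float4>& bodies) {
    int n = static_cast<int>(bodies.size());
    std::vector<float3> acc(n);
    if (n == 0) return acc;

    float4* dPos = nullptr;
    float3* dAcc = nullptr;
    cudaMalloc(&dPos, n * sizeof(float4));
    cudaMalloc(&dAcc, n * sizeof(float3));
    cudaMemcpy(dPos, bodies.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    int blocks = (n + kTile - 1) / kTile;
    accelTiled<<<blocks, kTile>>>(dPos, dAcc, n);
    cudaMemcpy(acc.data(), dAcc, n * sizeof(float3), cudaMemcpyDeviceToHost);

    cudaFree(dPos);
    cudaFree(dAcc);
    return acc;
}
```

With this layout each body position is read from global memory once per block rather than once per thread, and the per-thread state is limited to one position and one accumulator, which keeps register use modest.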

  4. Register Optimization • Local variables are stored in registers • Faster than shared memory, but very limited • Use fewer registers per thread so that more thread blocks can be resident on each SM • This increases warp occupancy (100% occupancy is not always necessary) • Higher occupancy can hide memory latency (other warps can run while some wait on memory I/O)
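The effect of register use on occupancy can be checked with the runtime's occupancy query. The sketch below is an illustration under assumptions: the kernel `integrate`, the helper `theoreticalOccupancy`, and the block size `kBlockSize` are hypothetical, and the occupancy API used here comes from CUDA releases newer than the hardware discussed in these slides.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 128;  // assumed block size

// __launch_bounds__ tells the compiler the largest block size the kernel
// will be launched with, which lets it budget registers accordingly.
__global__ void __launch_bounds__(kBlockSize)
integrate(float4* pos, const float4* vel, float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

// Returns resident threads per SM divided by the SM's maximum,
// or -1.0f if a runtime call fails.
float theoreticalOccupancy() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;

    int blocksPerSM = 0;
    // The runtime accounts for the kernel's register and shared memory usage.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, integrate, kBlockSize, 0) != cudaSuccess) {
        return -1.0f;
    }
    return static_cast<float>(blocksPerSM * kBlockSize) /
           static_cast<float>(prop.maxThreadsPerMultiProcessor);
}
```

Compiling with `nvcc -Xptxas -v` prints the registers used per thread for each kernel, which is the number that limits how many blocks fit on an SM.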

  5. Performance Evaluation • CUDA vs. OpenMP (OMP) • OMP benchmarked on an AMD Phenom 9950 BE (4x512KB L2, 2MB L3) @ 3GHz and on University Linux systems (Data, Mikey) • CUDA averaged a 3x speedup over the 4-core AMD running OMP • CUDA benchmarked on a 2-SM ION (not benchmarked on G80/GT200)

  6. Performance Chart

  7. Performance Chart (2)

  8. ION vs G80 vs GF100 • nVidia ION • 8 cores per SM • 2 SMs, so 16 cores in total • G80 • 8 cores per SM • 16 SMs, 128 SPs (cores) • GF100 (Fermi) • 512 CUDA cores • 32 cores / SM • Much higher memory bandwidth (GDDR5)
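The core counts on this slide are the product of SM count and cores per SM. The short sketch below reproduces that arithmetic and shows how the SM count of the installed device can be read at run time. The helper names `totalCores` and `smCount` are hypothetical. The runtime reports the SM count but not the cores per SM, so that figure has to be supplied per architecture, as on the slide.

```cuda
#include <cuda_runtime.h>

// Total scalar cores = number of SMs * cores per SM.
int totalCores(int sms, int coresPerSM) {
    return sms * coresPerSM;
}

// Number of SMs on the given device, or -1 if the query fails.
int smCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}
```

Using the slide's figures, `totalCores(2, 8)` gives 16 for ION and `totalCores(16, 8)` gives 128 for G80, while GF100's 512 cores at 32 per SM correspond to 16 SMs.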

  9. Arch. Continued • Double precision vs single precision • The serial program used double precision • Opted for single precision for increased performance (GT200) • Side note: G80 demotes the double type to 32-bit floats in software • GT200 takes a large performance hit on double precision calculations (double precision performance is 1/8 of the maximum) • GF100 only drops to ½ of its maximum performance
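One way to keep the precision choice open is to template the kernels on the floating-point type, so the same source can be built for `float` or `double`. This is a sketch of that idea only; the slides do not say how the presenters switched precision, and the names `eulerStep` and `stepOnDevice` are hypothetical. The wrapper assumes `x` and `v` have the same length.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One explicit Euler position update: x[i] += v[i] * dt.
template <typename Real>
__global__ void eulerStep(Real* x, const Real* v, Real dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += v[i] * dt;
}

// Hypothetical host wrapper; returns the updated positions.
template <typename Real>
std::vector<Real> stepOnDevice(std::vector<Real> x, const std::vector<Real>& v, Real dt) {
    int n = static_cast<int>(x.size());
    if (n == 0) return x;

    Real* dX = nullptr;
    Real* dV = nullptr;
    cudaMalloc(&dX, n * sizeof(Real));
    cudaMalloc(&dV, n * sizeof(Real));
    cudaMemcpy(dX, x.data(), n * sizeof(Real), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, v.data(), n * sizeof(Real), cudaMemcpyHostToDevice);

    eulerStep<Real><<<(n + 255) / 256, 256>>>(dX, dV, dt, n);
    cudaMemcpy(x.data(), dX, n * sizeof(Real), cudaMemcpyDeviceToHost);

    cudaFree(dX);
    cudaFree(dV);
    return x;
}
```

Calling `stepOnDevice<float>(...)` selects the single precision path and `stepOnDevice<double>(...)` the double precision one, which makes it straightforward to time both on a given architecture.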
