Joshua Brunner, Alexander Zdun, Erik Stadler
nBody Simulation with CUDA
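Thread Synchronization
• __syncthreads()
  • Barrier that keeps the threads in a block synchronized: no thread continues until every thread in the block has reached the call
  • Can hurt performance if called too often, especially when a block has many threads
  • One way to optimize: eliminate unneeded synchronization calls
• cudaThreadSynchronize() (deprecated in later CUDA releases in favor of cudaDeviceSynchronize())
  • Called from the host main function
  • Blocks the host until the device has finished, so results are printed only after all N steps have completed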
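
The host-side blocking call can be sketched as follows. This is a minimal illustration, not the authors' code: the `advance` kernel, the `runSteps` wrapper, and the block size of 128 are all assumptions, and it uses cudaDeviceSynchronize(), the current replacement for cudaThreadSynchronize(). Error checks on allocation and copies are left out for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: moves each body one time step along its velocity.
__global__ void advance(float *pos, const float *vel, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i] += vel[i] * dt;
}

// Hypothetical host wrapper: runs `steps` time steps on the device, then
// copies the positions back. Returns the status of the synchronization call.
cudaError_t runSteps(float *hostPos, const float *hostVel,
                     int n, int steps, float dt)
{
    float *dPos = nullptr, *dVel = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dPos, bytes);
    cudaMalloc(&dVel, bytes);
    cudaMemcpy(dPos, hostPos, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dVel, hostVel, bytes, cudaMemcpyHostToDevice);

    int threads = 128;                        // assumed block size
    int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < steps; ++s) {
        // Launches return immediately; kernels queued on the same stream
        // still execute one after another, so no sync is needed per step.
        advance<<<blocks, threads>>>(dPos, dVel, n, dt);
    }

    // Block the host once, after all N steps have been queued.
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(hostPos, dPos, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dPos);
    cudaFree(dVel);
    return err;
}
```

Synchronizing once after the loop, rather than after every launch, is one instance of eliminating unneeded synchronization calls.
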
Shared Memory
• Optimize memory accesses by moving data from global device memory into on-chip shared memory
• Making many global memory accesses is bad, but don't overuse registers!
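
One common way to apply this to an n-body force loop is to stage bodies through shared memory one tile at a time. The sketch below assumes a float4 layout of (x, y, z, mass), a gravitational constant of 1, a tile size of 128, and a small softening term; the names `accelTiled` and `computeAccel` are hypothetical and none of this is taken from the original program.

```cuda
#include <cuda_runtime.h>

#define TILE 128            // assumed: threads per block = bodies per tile
#define SOFTENING 1e-9f     // assumed: keeps the self-interaction finite

// pos[i] = (x, y, z, mass). Writes the acceleration of body i into acc[i].
// Must be launched with blockDim.x == TILE.
__global__ void accelTiled(const float4 *pos, float4 *acc, int n)
{
    __shared__ float4 tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 me = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float ax = 0.f, ay = 0.f, az = 0.f;

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        // One global read per thread; padding slots get zero mass.
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();    // the whole tile is loaded before anyone reads it

        for (int k = 0; k < TILE; ++k) {
            float dx = tile[k].x - me.x;
            float dy = tile[k].y - me.y;
            float dz = tile[k].z - me.z;
            float inv = rsqrtf(dx * dx + dy * dy + dz * dz + SOFTENING);
            float s = tile[k].w * inv * inv * inv;   // m / r^3
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        __syncthreads();    // everyone is done before the tile is overwritten
    }
    if (i < n) acc[i] = make_float4(ax, ay, az, 0.f);
}

// Hypothetical host wrapper (error checks on malloc/copy omitted).
cudaError_t computeAccel(const float4 *hostPos, float4 *hostAcc, int n)
{
    float4 *dPos = nullptr, *dAcc = nullptr;
    size_t bytes = n * sizeof(float4);
    cudaMalloc(&dPos, bytes);
    cudaMalloc(&dAcc, bytes);
    cudaMemcpy(dPos, hostPos, bytes, cudaMemcpyHostToDevice);
    accelTiled<<<(n + TILE - 1) / TILE, TILE>>>(dPos, dAcc, n);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(hostAcc, dAcc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dPos);
    cudaFree(dAcc);
    return err;
}
```

Each body is read from global memory once per block instead of once per thread. Only two __syncthreads() calls are made per tile, and both are needed. All threads reach them because the bounds check guards the final store and not the loop.
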
Register Optimization
• Variables are stored in registers
• Faster than shared memory, but very limited
• Use fewer registers per thread so that more thread blocks can be resident on each SM
• This increases warp occupancy (100% occupancy is not always necessary)
• Higher occupancy can hide memory latency (other warps can run while some are waiting on memory I/O)
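
Register use and the resulting occupancy can be inspected from the host. The sketch below is an assumption-laden illustration: the `scale` kernel and the `theoreticalOccupancy` helper are hypothetical, and cudaOccupancyMaxActiveBlocksPerMultiprocessor needs CUDA 6.5 or later, which is newer than the toolkits available for the hardware discussed in these slides.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel. __launch_bounds__(128) promises the compiler that a
// block never has more than 128 threads, which it can use when deciding how
// many registers to give each thread.
__global__ void __launch_bounds__(128) scale(float *data, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

// Returns the theoretical occupancy of `scale` in (0, 1] for the given block
// size on the current device, or -1 on error. Also reports the number of
// registers the compiler assigned per thread.
float theoreticalOccupancy(int blockSize, int *regsPerThread)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale) != cudaSuccess) return -1.0f;
    *regsPerThread = attr.numRegs;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scale, blockSize, 0) != cudaSuccess) return -1.0f;

    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;

    // Resident threads per SM divided by the hardware maximum.
    return (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
}
```

If the reported register count is what limits the number of resident blocks, the nvcc option --maxrregcount is another way to cap it, at the risk of spilling registers to slower memory.
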
Performance Evaluation
• CUDA vs. OpenMP (OMP)
• OMP benchmarked on an AMD Phenom 9950 BE (4x512KB L2, 2MB L3) @ 3GHz and on University Linux systems (Data, Mikey)
• CUDA averaged a 3x speedup over the 4-core AMD running OMP
• CUDA benchmarked on a 2-SM ION (not benchmarked on G80/GT200)
ION vs G80 vs GF100
• nVidia ION has 8 cores per SM
  • 2 SMs, so 16 cores in total
• G80 has 8 cores per SM
  • 16 SMs, 128 SPs (cores)
• GF100 (Fermi)
  • 512 CUDA cores
  • 32 cores / SM
  • Much higher memory bandwidth (GDDR5)
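
The core counts above are the number of SMs multiplied by the cores per SM. The runtime reports the SM count and compute capability but not the cores per SM, so a small lookup is needed. The sketch below covers only the architectures named on this slide; `coresPerSM` and `printDeviceCores` are hypothetical helper names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Cores per SM for the architectures discussed here.
// Returns 0 for anything this sketch does not cover.
int coresPerSM(int major, int minor)
{
    if (major == 1) return 8;                  // G80, GT200, ION
    if (major == 2 && minor == 0) return 32;   // GF100 (Fermi)
    return 0;
}

// Prints the SM count and, where known, the total core count of a device.
// Returns false if the device query fails.
bool printDeviceCores(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);

    int perSM = coresPerSM(prop.major, prop.minor);
    if (perSM > 0) {
        printf("  %d cores/SM, %d cores total\n",
               perSM, perSM * prop.multiProcessorCount);
    }
    return true;
}
```
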
Arch. Continued
• Double precision vs single precision
  • The serial program used double precision
  • Opted for single precision for increased performance (GT200)
• Side note: on G80, the double type is demoted to 32-bit float in software
• GT200 takes a large performance hit for double precision calculations (double precision peak is 1/8 of the single precision peak)
• GF100 takes only a 1/2 performance hit
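
One way to switch a program between the two precisions is a single type alias chosen at compile time. This is a sketch under assumptions: the `USE_DOUBLE` macro, the `real` alias, and the kinetic-energy kernel are illustrative and are not taken from the original program.

```cuda
#include <cuda_runtime.h>

// Hypothetical switch: compile with -DUSE_DOUBLE for double precision.
#ifdef USE_DOUBLE
typedef double real;
#else
typedef float real;     // default: single precision
#endif

// Per-body kinetic energy, 0.5 * m * v^2.
__global__ void kineticEnergy(const real *mass, const real *speed,
                              real *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The cast matters: a bare 0.5 is a double literal and would promote the
    // whole expression to double precision even when real is float.
    if (i < n) out[i] = (real)0.5 * mass[i] * speed[i] * speed[i];
}

// Hypothetical host wrapper (error checks on malloc/copy omitted).
cudaError_t kineticEnergyHost(const real *mass, const real *speed,
                              real *out, int n)
{
    real *dMass = nullptr, *dSpeed = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(real);
    cudaMalloc(&dMass, bytes);
    cudaMalloc(&dSpeed, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dMass, mass, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dSpeed, speed, bytes, cudaMemcpyHostToDevice);

    kineticEnergy<<<(n + 127) / 128, 128>>>(dMass, dSpeed, dOut, n);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dMass);
    cudaFree(dSpeed);
    cudaFree(dOut);
    return err;
}
```

Stray double literals are an easy way to pay the double precision penalty by accident, which is why every constant in the kernel is cast to `real`.
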