ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 Synchronization.ppt

Synchronization These notes will introduce: • Ways to achieve thread synchronization. • __syncthreads() ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 Synchronization.ppt

Thread Barrier Synchronization When we divide a computation into parallel parts to be done concurrently by independent threads, often need all threads to do their computation before processing next stage of computation In parallel programming, we call this barrier synchronization – all threads wait when they reach the barrier until all the threads have reached that point and then they are all released to continue

CUDA synchronization CUDA provides a synchronization barrier routine for those threads within each block __syncthreads() This routine would be used within a kernel. Threads would waits at this point until all threads in the block have reached it and they are all released. NOTE only synchronizes with other threads in block

Threads only synchronize with other threads in the block Kernel code __global void mykernel () { . . . __syncthreads() . . . } Block n-1 Block 0 Barrier Barrier Continue Continue Separate barriers

__syncthreads() constraints All threads must reach a particular __syncthreads() routine or deadlock occurs. Multiple __syncthreads() can be used in a kernel but each one is unique. Hence cannot have: if ... __syncthreads() else … __syncthreads() and expect threads going thro different paths to be synchronized. They all must go through the if or all go through the else clause, ideally for efficiency reaching the __synthreads() at the same time

Global Kernel Barrier Unfortunately no global kernel barrier routine available in CUDA Often we ant to synchronized all threads in computation To do that, have to use workarounds such as returning from kernel and placing a barrier in CPU code

CUDA synchronzation in the CPU code The following could be used: cudaThreadSynchronize() waits until all preceding commands in all “streams” have completed.

Reasoning behind not having CUDA global synchronization Expensive to implement for a large number of GPU processors At the block level, allows blocks to be executed in any order on GPU Can use different sizes of blocks depending upon the resources of GPU – so-called “transparent scalability”

Achieving global synchronization through multiple kernel launches Each kernel launch can be used as a synchronization point. Note kernels are asynchronous so need a host synchronization call such as cudaMemcpy Kernel launches efficiently implemented: - Minimal hardware overhead - Little software overhead Recursion -- not allowed within kernel but can be used in host code to launch kernels

Code Example N-body problem Need to compute forces on each body in each time interval and then update positions and velocities of bodies and then repeat. for (t = 0; t < tmax; t++) { // for each time period, force calculation on all bodies cudaMemcpy(dev_A, A ,arraySize,cudaMemcpyHostToDevice); // data to GPU bodyCal<<<B,T>>>(dev_A); // kernel call cudaMemcpy(A,dev_A,arraySize,cudaMemcpyDeviceToHost); // updated data } // end of time period loop No specific synchronization needed in kernel routine

Other ways to achieve global synchronization (if it cannot be avoided) • CUDA memory fence __threadfence() that waits to memory operations to be visible to other threads but probably is not useable for synchronization • Write your own code for the kernel that implements global synchronization How? (Using atomics and critical sections see next)

Discussion points • Using writing to global memory to enforce synchronization expensive

Questions

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 Synchronization.ppt

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 Synchronization.ppt

Presentation Transcript

Intro to Banner HR

ITCS 6010

ITCS 6010 DATA INTEGRATION Presentation On Social Web

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 4, 2013 Streamsx

Types of Parallel Computers

Blandine Serve, Eric Gonnard ( OECD Statistics Directorate)

Grid Portals

ITCS 4010 Grid Computing, 2005, UNC-Charlotte, B. Wilkinson.

Lecture 7 Classification

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 12, 2012 Timing

The completion of OECD trade databases in OECD.STAT

ITCS 6265 Information Retrieval and Web Mining

ITCS 3153 Artificial Intelligence

Numerical Algorithms • Parallelizing matrix multiplication • Solving a system of linear equations

ITCS 6114 Dynamic programming Longest Common Subsequence

Anatomy of a disaster recovery

ITCS 6265/8265 Project

Globus Toolkit 4.0 Grid Resource Allocation Manager (GRAM) Job submission

Partitioning and Divide-and-Conquer Strategies

Book Drawing

Security II – Grid Computing Security and Globus Security

ITCS 6010