Intro To Cuda (part 6): Synchronization

Josh Holloway








Since threads are allowed to run in parallel, we can easily run into race conditions. The most common form of race condition occurs when a thread reads from a memory location before the correct data has been stored there, or, conversely, when a thread writes its result to a memory location before another thread has read the existing data from it. To avoid these race conditions, we need a way for our threads to synchronize, in a sense forcing portions of the device code to run serially.

To force the threads within a block to synchronize with each other, we can implement a barrier. A barrier is a point in a program where every thread executing the kernel within a block halts its execution when it reaches that point. For illustrative purposes, let's assume we have four threads within a block executing in parallel. At the point in the kernel where we need the threads to synchronize, we insert a barrier, which causes each thread within the block to pause until all of the threads within the block have reached it. Once all the threads within a given block have reached the barrier, we say the threads have been synchronized, and the threads in that block are allowed to continue their execution. We can explicitly implement a barrier with the __syncthreads() intrinsic. So let's take a look at an example of how to use thread synchronization.
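The barrier pattern described above splits a kernel into phases: work done before __syncthreads() is guaranteed visible to every thread in the block afterwards. Here is a minimal sketch of that shape; the kernel name two_phase and its arithmetic are purely illustrative, not from the lecture:

```cuda
__global__ void two_phase(int *data)
{
    int i = threadIdx.x;

    // Phase 1: each thread produces its own value.
    data[i] = i * i;

    __syncthreads();  // barrier: no thread proceeds until every
                      // thread in this block has finished phase 1

    // Phase 2: now it is safe to read values written by other threads.
    data[i] += data[(i + 1) % blockDim.x];
}
```

One caveat worth knowing: __syncthreads() must be reached by every thread in the block, so placing it inside a branch that only some threads of the block take leads to undefined behavior.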
Let’s assume that we want to shift the contents of an array to the left by one element. For this example, suppose the array, named a, is stored in global memory and is of length 4. We can implement this process in parallel using a CUDA kernel: a pointer to the memory address of the first element of the array in global memory is passed into the kernel as an argument. Each thread that executes this kernel first grabs the thread’s unique index within its block and stores that value in a local variable named i, which will actually live in a register. Then each thread performs the operation to shift its element of the array to the left by one. Just to be safe and ensure that we don’t step off the end of the array, we implement the shifting operation inside of an if statement.

This code won’t actually work as we intended, because the shifting operation is both a read and a write. What we need to do is ensure that all of the elements have been read before any of them are written in the global array variable a. We can solve this problem by creating another register variable, named temp, into which each thread copies the contents of element i + 1 of a. Since threads execute in parallel, we then implement a barrier at this point to ensure that every thread executing the kernel has read its element i + 1 of a before any other thread writes to that element.

Once we are certain that all of the data from the entire array has been stored in registers (or spilled over into local memory, if the array were very large), we can safely write the data back into the global array variable a: each thread grabs the contents of its temp variable and writes them into element i of the array. Again, just to be safe and ensure that all of the writes have occurred before any other thread reads from the array, we can implement a second barrier synchronization after the write.

In addition to the explicit barriers for thread synchronization we’ve discussed so far in this lecture, there is another major form of barrier, which is implemented implicitly between kernel launches. Before we discuss this, let’s take a step back and look at the asynchronous interaction between the host and device code. As we discussed in a previous lecture, the host does not wait for the completion of a kernel to continue its flow. To force the host code to wait for a kernel’s completion, we can call cudaDeviceSynchronize(): the host will pause at this call until the previously launched kernel has completed its execution.

So, as we’ve seen, the host and device operate asynchronously unless the host code is explicitly told to wait on the device. The execution of consecutive kernel launches, however, is synchronous, which can be thought of as an implicit barrier between kernel launches. For example, if we launch two kernels consecutively, we are guaranteed that the grid from the second kernel launch will not be scheduled to execute on the device until the first kernel has completed its execution. As described in the CUDA Programming Guide, there are three key abstractions at the core of CUDA: a hierarchy of computations, a corresponding memory hierarchy, and, finally, the barrier synchronization primitives we’ve discussed in this lecture. So this lecture concludes the absolute basics of CUDA.
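The shift-left example walked through above can be sketched as a complete program. The variable names a, i, and temp come from the lecture; the kernel name shift_left and the host-side test harness are assumptions added here for a self-contained sketch. Note how the two __syncthreads() calls separate the read phase from the write phase, and how cudaDeviceSynchronize() forces the host to wait for the kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4

// Shift the contents of a left by one element, in place.
__global__ void shift_left(int *a)
{
    int i = threadIdx.x;   // thread's unique index within its block
    int temp = 0;

    if (i < N - 1)
        temp = a[i + 1];   // read phase: copy element i + 1 into a register

    __syncthreads();       // barrier: all reads complete before any write

    if (i < N - 1)
        a[i] = temp;       // write phase: store the shifted value

    __syncthreads();       // barrier: all writes complete before any later read
}

int main(void)
{
    int h_a[N] = {10, 20, 30, 40};
    int *d_a;

    cudaMalloc(&d_a, N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);

    shift_left<<<1, N>>>(d_a);  // one block of N threads
    cudaDeviceSynchronize();    // host pauses until the kernel finishes

    cudaMemcpy(h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    for (int i = 0; i < N; ++i)
        printf("%d ", h_a[i]);  // the last element is left unchanged
    printf("\n");
    return 0;
}
```

For an array of this size a single block suffices, so __syncthreads() can synchronize every participating thread; across multiple blocks a block-level barrier would not be enough, which is one reason the implicit barrier between kernel launches matters.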
In subsequent lectures, we will learn how to optimize CUDA programs to fully utilize all of the resources that the GPU offers us.
