Transcript:
Section 6: Advanced CUDA Topics. In this section, we'll learn about some more specialized capabilities of CUDA. We'll learn how to maximize concurrency of kernel execution, memory transfers, and host execution; how to program with multiple CUDA devices; and how to dynamically allocate memory and launch kernels from within device code.

Concurrency and streams. In addition to the thread-level and instruction-level parallelism that we've learned about already, with CUDA we can concurrently execute multiple kernels, transfer data in memory, and run host code. This can help us get the most out of the hardware and increase the performance of our programs. There are basically three different areas of concurrency that we want to think about in CUDA programming. First, we can launch multiple independent kernels at the same time; if one kernel isn't fully utilizing the device, CUDA will schedule blocks from other kernels to run simultaneously. Second, while code is running on the device, we can also run code simultaneously on the host. Finally, while kernels are running, we can simultaneously transfer data to and from the device, or copy data on the device. We'll look at the first two of these in this video, and we'll learn about concurrent memory transfers in the next video.

Let's look at an example. This kernel implements a reduction, but it takes a different approach than we used back in Section 4: it runs with a single block. To start out, each thread loops through the input, summing up values until it gets to the end; notice how the indexing here ensures that all of our reads will coalesce. Then each thread takes its own sum and passes it to a block-level reduction function. This is the same function we wrote in Section 4; it uses shared memory to perform a reduction within the block, returning the final sum to all threads. Finally, the first thread of the block writes the result to global memory. In some ways this is an ideal kernel: all its reads coalesce, it doesn't have to store any temporary data in global memory, and it doesn't need any synchronization between blocks. But there's one big downside: it uses a single block to reduce the entire input, so our device is going to be mostly idle while it's running.

Let's see how this affects performance. Our original reduce kernel took about six milliseconds, and our new one takes 24 milliseconds for the same-sized input. So that's not great. But if we could use the idle parts of the GPU to do some other work concurrently, we might actually get really good throughput from this. For example, suppose I had a bunch of different arrays to reduce; I could do them all at the same time and fully utilize the device. I've set up this example to do just that: this constant, N_STREAMS, determines how many reductions I run concurrently. We'll look at the details of how that works in a minute. For now, let's increase it, say to eight. The runtime only went up by two milliseconds, but I did eight times as much work. If we ran the original reduce kernel eight times in a row on this size of input, it would probably take 40 or 50 milliseconds, so we've already got a significant speedup. Let's see what happens if we push it further: still only 31 milliseconds, and now we're doing 16 reductions in that time. On my device, performance seems to level off at this point; you may get different results on different hardware. At any rate, we're looking at more than twice the throughput of the original reduce kernel. So how exactly do we run all these kernels concurrently?
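Before we get into streams, here is a minimal sketch of the single-block reduction kernel described above. The names reduceSingleBlock and blockReduce, and the fixed block size of 256, are assumptions for illustration; blockReduce stands in for the shared-memory block-level reduction function written in Section 4, not the course's exact code.

```cuda
// Sketch only: blockReduce approximates the Section 4 shared-memory reduction.
// It assumes blockDim.x is a power of two and no larger than 256.
__device__ float blockReduce(float val)
{
    __shared__ float partials[256];
    partials[threadIdx.x] = val;
    __syncthreads();

    // Tree reduction within the block.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partials[threadIdx.x] += partials[threadIdx.x + s];
        __syncthreads();
    }
    return partials[0];   // every thread gets the final block sum
}

// Runs as a single block: each thread strides through the input, so
// consecutive threads read consecutive elements and every read coalesces.
__global__ void reduceSingleBlock(const float *in, float *out, unsigned n)
{
    float sum = 0.0f;
    for (unsigned i = threadIdx.x; i < n; i += blockDim.x)
        sum += in[i];

    float total = blockReduce(sum);
    if (threadIdx.x == 0)
        *out = total;     // first thread writes the result to global memory
}
```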
CUDA concurrency is controlled through the use of streams. Every kernel launch is associated with a stream; we can pass the stream as an optional argument to the execution configuration. If we don't specify that argument, CUDA uses the default stream, which is what all of our programs so far have used. Within a stream, all the commands will run in order, but if we have multiple streams, their commands can run concurrently.

Let's take a look at the code for that. We start by making an array of streams; this cudaStream_t type is just a handle to a stream. Now we create each stream with cudaStreamCreate, and we allocate a separate input and output array for each stream. Since they're going to run concurrently, they need their own independent data, so this is going to use a lot more memory than reducing one array at a time, but as long as we don't run out, that's not a problem. Now we launch all the kernels. Notice that in the execution configuration, I passed the stream as the last argument. Each kernel is going to run in its own stream, so they can all be scheduled concurrently if there's room on the device. When we call cudaDeviceSynchronize, it will wait for all the streams to finish.

So we've seen how to run kernels concurrently: we create separate streams and spread our kernels across them. Within each stream, we can launch multiple kernels and they'll run in order; this is the way to go when one kernel needs the results from a previous kernel in order to run. If your kernels are independent, you can run them in separate streams to get concurrent execution.

While all our streams are running, we can also do work on the host. For example, I could call std::accumulate here to run a reduction on the host. As long as this finishes faster than my kernels, it won't slow down the overall execution. On my hardware, reducing the whole array on the host does slow things down. Let's try reducing half the array, and now it runs just as fast. In practice, it's often tricky to make use of concurrent execution on the host because transferring data back and forth can become a bottleneck, but for some workloads, it can help you squeeze out a little extra performance.

When running concurrent kernels, sometimes we need to synchronize between streams. There are a few ways of doing this. cudaDeviceSynchronize, which we've already seen, will wait for all streams to finish. cudaStreamSynchronize takes a stream as an argument and will wait for that particular stream to finish. cudaStreamQuery lets you check whether a stream has finished without waiting. For more complicated synchronization needs, CUDA also supports events. You can create an event with cudaEventCreate; this just gives you a handle that you can use later. cudaEventRecord inserts an event into a stream, so when all previous commands in the stream have finished, the event will be recorded. You can wait for an event on the host with cudaEventSynchronize, and you can make a stream wait for an event at a particular spot with cudaStreamWaitEvent. For more details on events, check out the CUDA programming guide. Several of the CUDA sample programs also make use of events.
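Putting the pieces just described together, here is a sketch of the concurrent-reduction setup: an array of streams, a separate input and output buffer per stream, a kernel launch in each stream (the stream is the fourth execution-configuration argument, after the dynamic shared-memory size), optional host-side work with std::accumulate while the kernels run, and a final cudaDeviceSynchronize. The constant name N_STREAMS, the array size, and the host-side vector are assumptions rather than the course's exact code; reduceSingleBlock is the kernel sketched earlier.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <numeric>
#include <vector>

// Defined in the earlier sketch.
__global__ void reduceSingleBlock(const float *in, float *out, unsigned n);

int main()
{
    const int      N_STREAMS = 8;        // how many reductions run concurrently
    const unsigned N         = 1u << 24; // per-array element count (assumed)
    const int      BLOCK     = 256;

    cudaStream_t streams[N_STREAMS];
    float *d_in[N_STREAMS], *d_out[N_STREAMS];

    // One stream plus an independent input/output buffer per reduction.
    // (Inputs are left uninitialized here; a real program would fill them.)
    for (int i = 0; i < N_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_in[i], N * sizeof(float));
        cudaMalloc(&d_out[i], sizeof(float));
    }

    // Launch one single-block kernel per stream so they can be scheduled
    // concurrently; the stream is the last execution-configuration argument.
    for (int i = 0; i < N_STREAMS; ++i)
        reduceSingleBlock<<<1, BLOCK, 0, streams[i]>>>(d_in[i], d_out[i], N);

    // While the kernels run, the host is free to do its own work,
    // e.g. reduce a host-side array with std::accumulate.
    std::vector<float> hostData(N / 2, 1.0f);
    float hostSum = std::accumulate(hostData.begin(), hostData.end(), 0.0f);
    std::printf("host partial sum: %f\n", hostSum);

    cudaDeviceSynchronize();             // wait for every stream to finish

    for (int i = 0; i < N_STREAMS; ++i) {
        cudaFree(d_in[i]);
        cudaFree(d_out[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```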
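And here is a small, self-contained sketch of the synchronization calls mentioned at the end: an event recorded in one stream, a second stream made to wait on it, and the host-side cudaEventSynchronize, cudaStreamSynchronize, and cudaStreamQuery calls. The producer and consumer kernels are made-up placeholders standing in for two dependent stages of work.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels representing two dependent stages.
__global__ void producer(float *buf) { buf[threadIdx.x] = (float)threadIdx.x; }
__global__ void consumer(float *buf) { buf[threadIdx.x] *= 2.0f; }

int main()
{
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    float *d_buf;
    cudaMalloc(&d_buf, 256 * sizeof(float));

    cudaEvent_t done;
    cudaEventCreate(&done);             // just a handle until it is recorded

    producer<<<1, 256, 0, stream1>>>(d_buf);
    cudaEventRecord(done, stream1);     // recorded once prior work in stream1 finishes

    // consumer is queued in a different stream, but it won't start until the
    // event recorded in stream1 has completed.
    cudaStreamWaitEvent(stream2, done, 0);
    consumer<<<1, 256, 0, stream2>>>(d_buf);

    // Host-side options: block on the event, block on one stream, or poll.
    cudaEventSynchronize(done);                                  // wait for the event
    cudaStreamSynchronize(stream2);                              // wait for stream2 only
    bool finished = (cudaStreamQuery(stream2) == cudaSuccess);   // non-blocking check
    (void)finished;

    cudaEventDestroy(done);
    cudaFree(d_buf);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    return 0;
}
```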