

# Practical Course: GPU Programming in Computer Vision

Optimization

Björn Häfner, Benedikt Löwenhauser, Thomas Möllenhoff

Technische Universität München Department of Informatics Computer Vision Group

Summer Semester 2017 September 11 - October 8





### **Outline**

- Performance metrics of algorithms running on a GPU
  - occupancy
  - data bandwith and instruction throughput
- Maximize instruction throughput
  - branch divergence
- Maximize memory throughput
  - pitched allocation for images
- parallel reduction: an example of optimization





### **Outline**

- Performance metrics of algorithms running on a GPU
  - occupancy
  - data bandwith and instruction throughput
- - branch divergence
- - pitched allocation for images

$$\mbox{occupancy} = \frac{\mbox{active threads}}{\mbox{max. threads per SM}} \label{eq:max}$$

- Pool of limited resources per SM
- Occupancy determined by
  - Register usage per thread





$$occupancy = \frac{active \ threads}{max. \ threads \ per \ SM}$$

- Multiprocessors (SMs) can have many more active threads than there are CUDA Cores
- Pool of limited resources per SM
- Occupancy determined by
  - Register usage per thread



$$occupancy = \frac{active \ threads}{max. \ threads \ per \ SM}$$

- Multiprocessors (SMs) can have many more active threads than there are CUDA Cores
- High occupancy is important, because if some threads stall, the SM can switch to others
- Pool of limited resources per SM
- Occupancy determined by
  - Register usage per thread
  - Shared memory per block





$$occupancy = \frac{active \ threads}{max. \ threads \ per \ SM}$$

- Multiprocessors (SMs) can have many more active threads than there are CUDA Cores
- High occupancy is important, because if some threads stall, the SM can switch to others
- Pool of limited resources per SM
- Occupancy determined by
  - Register usage per thread
  - Shared memory per block



$$occupancy = \frac{active \ threads}{max. \ threads \ per \ SM}$$

- Multiprocessors (SMs) can have many more active threads than there are CUDA Cores
- High occupancy is important, because if some threads stall, the SM can switch to others
- Pool of limited resources per SM
- Occupancy determined by
  - Register usage per thread
  - Shared memory per block





### Resource limits



- Each block grabs registers and shared memory
- If one or the other is fully utilized: no more blocks per SM possible

### Find Out Resource Usage

- Compile with nvcc option -ptxas-options=-v
- Per kernel registers and (static) shared memory:

```
ptxas info: Compiling entry function '_Z10add_kernelPfPKfS1_i' for 'sm_10'
ptxas info: Used 4 registers, 44 bytes smem
```

Amount of resources per multiprocessor:

### Find Out Resource Usage

- Compile with nvcc option -ptxas-options=-v
- Per kernel registers and (static) shared memory:

```
ptxas info: Compiling entry function '_Z10add kernelPfPKfS1_i' for 'sm_10'
ptxas info: Used 4 registers, 44 bytes smem
```

Amount of resources per multiprocessor: ./deviceQuery

#### data bandwidth: How much data do we process per second?

- Make use of the different types of memory
- Align your 2D array to make use of coalescing

- Trade precision for speed
- Minimize branch divergence

data bandwidth: How much data do we process per second?

- Minimize data transfers with low bandwidth (host device. global memory - device)
- Make use of the different types of memory
- Align your 2D array to make use of coalescing

- Trade precision for speed
- Minimize branch divergence



data bandwidth: How much data do we process per second?

- Minimize data transfers with low bandwidth (host device. global memory - device)
- Make use of the different types of memory
- Align your 2D array to make use of coalescing

instruction throughput: How many instructions do we execute per second?

- Trade precision for speed
- Minimize branch divergence



data bandwidth: How much data do we process per second?

- Minimize data transfers with low bandwidth (host device. global memory - device)
- Make use of the different types of memory
- Align your 2D array to make use of coalescing

**instruction throughput:** How many instructions do we execute per second?

- Trade precision for speed
- Minimize branch divergence



### Outline

- 1 Performance metrics of algorithms running on a GPU
  - occupancy
  - data bandwith and instruction throughput
- 2 Maximize instruction throughput
  - branch divergence
- 3 Maximize memory throughput
  - pitched allocation for images
- 4 parallel reduction: an example of optimization





### Reminder: All 32 threads of a warp execute the same instruction. *Always!*

```
1 __global__ void kernel (float *result, float *input)
2 {
3    int i = threadIdx.x + blockDim.x*blockIdx.x;
4    if (input[i]>0)
5        result[i] = 1.f;
6    else
7        result[i] = 0.f;
8 }
```



### branch divergence

# Reminder: All 32 threads of a warp execute the *same* instruction. *Always!*

```
1 __global__ void kernel (float *result, float *input)
2 {
3    int i = threadIdx.x + blockDim.x*blockIdx.x;
4    if (input[i]>0)
5        result[i] = 1.f;
6    else
7        result[i] = 0.f;
8 }
```





# Reminder: All 32 threads of a warp execute the *same* instruction. *Always!*

```
1 __global__ void kernel (float *result, float *input)
2 {
3    int i = threadIdx.x + blockDim.x*blockIdx.x;
4    if (input[i]>0)
5       result[i] = 1.f;
6    else
7       result[i] = 0.f;
8 }
```





### branch divergence

# Reminder: All 32 threads of a warp execute the same instruction. Always!

```
1 __global__ void kernel (float *result, float *input)
2 {
3    int i = threadIdx.x + blockDim.x*blockIdx.x;
4    if (input[i]>0)
5       result[i] = 1.f;
6    else
7       result[i] = 0.f;
8 }
```







- Each path is taken by each thread.
- The execution of the warp is serialized.







- Each path is taken by each thread.
- The execution of the warp is serialized.







- Each path is taken by each thread.
- Threads that should take an other path are marked inactive.
- The execution of the warp is serialized.







- Each path is taken by each thread.
- Threads that should take an other path are marked inactive.
- The execution of the warp is serialized.



### Serialization cont.

- Also happens with the following statements: for, while, switch
- No divergence if all threads take the same path.



### Serialization cont.

- Also happens with the following statements: for, while, switch
- Worst case: 1 active thread, 31 inactive ⇒ performance is reduced to  $1/32 \approx 3\%$
- No divergence if all threads take the same path.



### Serialization cont.

- Also happens with the following statements: for, while, switch
- Worst case: 1 active thread, 31 inactive ⇒ performance is reduced to  $1/32 \approx 3\%$
- No divergence if all threads take the same path.

if 
$$(tid/32 == 0) {...}$$

### **Outline**

- - occupancy
  - data bandwith and instruction throughput
- - branch divergence
- Maximize memory throughput
  - pitched allocation for images





- one can allocate 2d images as 1d arrays and access in a linearized way: img[x+w\*y]
- for a 6\*3 float image, the addresses & img[x+6\*y] are

| 48 | 52 | 56 | 60 | 64 | 68 |
|----|----|----|----|----|----|
| 24 | 28 | 32 | 36 | 40 | 44 |
| 0  | 4  | 8  | 12 | 16 | 20 |



- one can allocate 2d images as 1d arrays and access in a linearized way: img[x+w\*y]
- this works, but is in general suboptimal for CUDA
- for a 6\*3 float image, the addresses & img[x+6\*y] are

| I | 48 | 52 | 56 | 60 | 64 | 68 |
|---|----|----|----|----|----|----|
| ı | 24 | 28 | 32 | 36 | 40 | 44 |
| ı | 0  | 4  | 8  | 12 | 16 | 20 |



- one can allocate 2d images as 1d arrays and access in a linearized way: img[x+w\*v]
- this works, but is in general suboptimal for CUDA
- for a 6\*3 float image, the addresses &img[x+6\*y] are

| 48 | 52 | 56 | 60 | 64 | 68 |
|----|----|----|----|----|----|
| 24 | 28 | 32 | 36 | 40 | 44 |
| 0  | 4  | 8  | 12 | 16 | 20 |

- one can allocate 2d images as 1d arrays and access in a linearized way: img[x+w\*y]
- this works, but is in general suboptimal for CUDA
- for a 6\*3 float image, the addresses &img[x+6\*y] are

| 48 | 52 | 56 | 60 | 64 | 68 |
|----|----|----|----|----|----|
| 24 | 28 | 32 | 36 | 40 | 44 |
| 0  | 4  | 8  | 12 | 16 | 20 |

read/write accesses are fastest when the starting address of each row is a multiple of a big power of 2. (most common: 128)





■ the total new width in bytes is called pitch

| 64 | 68 | 72 | 76 | 80 | 84 | 88 | 92 |
|----|----|----|----|----|----|----|----|
| 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 |
| 0  | 4  | 8  | 12 | 16 | 20 | 24 | 28 |

- here: pitch = 32 bytes (=8\*sizeof(float))
- in general pitch != multiple of element size
  - float3 array

adding padding bytes at the end of each row resolves this





on host: float \*d a: size t pitch; cudaMallocPitch(&d\_a, &pitch, w\*sizeof(float), h); in kernel: ■ Copying: cudaMemcpy2D(...)

For 3D-Data: cudaMalloc3D(...)





```
on host:
  float *d a:
  size t pitch;
  cudaMallocPitch(&d_a, &pitch, w*sizeof(float), h);
in kernel:
  float value =
  *((float*)( (char*)a + x*sizeof(float) + pitch*y) );
■ Copying: cudaMemcpy2D(...)
```

For 3D-Data: cudaMalloc3D(...)





#### on host:

```
float *d a:
size t pitch;
cudaMallocPitch(&d_a, &pitch, w*sizeof(float), h);
```

#### in kernel:

```
float value =
*((float*)( (char*)a + x*sizeof(float) + pitch*y) );
```

- Copying: cudaMemcpy2D(...)
- For 3D-Data: cudaMalloc3D(...)





### Outline

- 1 Performance metrics of algorithms running on a GPU
  - occupancy
  - data bandwith and instruction throughput
- Maximize instruction throughput
  - branch divergence
- 3 Maximize memory throughput
  - pitched allocation for images
- 4 parallel reduction: an example of optimization



Keep all the SM busy

each block reduces a part of the array but how do we communicate the partial results in an efficient way?



Want to process very large arrays

Keep all the SM busy

each block reduces a part of the array but how do we



Want to process very large arrays

Keep all the SM busy

each block reduces a part of the array but how do we communicate the partial results in an efficient way?



- solution: decompose into multiple kernels and use launch as synchronization point
- two different metrics of performance: bandwidth and GFLOP/s
- Reductions have low arithmetric intensity ⇒ bandwidth is the proper metric





- solution: decompose into multiple kernels and use launch as synchronization point
- two different metrics of performance: bandwidth and GFLOP/s
- Reductions have low arithmetric intensity ⇒ bandwidth is the proper metric

- CUDA does not have global synchronization
- solution: decompose into multiple kernels and use launch as synchronization point
- two different metrics of performance: bandwidth and GFLOP/s
- Reductions have low arithmetric intensity ⇒ bandwidth is the proper metric

- CUDA does not have global synchronization
- solution: decompose into multiple kernels and use launch as synchronization point
- two different metrics of performance: bandwidth and GFLOP/s
- Reductions have low arithmetric intensity ⇒ bandwidth is the proper metric



### A first implementation

```
__global__ void reduce0(int *g_idata, int *g_odata) {
    extern shared int sdata[];
3
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
9
10
    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) {</pre>
11
     if (tid % (2*s) == 0) {
       sdata[tid] += sdata[tid + s];
13
14
     __syncthreads();
15
16
17
    // write result for this block to global mem
18
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
19
    }
20
```



# A first implementation







# how can we accelerate the code?

hint: branch divergence



# how can we accelerate the code?

hint: branch divergence



#### Just replace divergent branch in inner loop:

```
for (unsigned int s=1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}</pre>
```

#### With strided index and non-divergent branch:

```
for (unsigned int s=1; s < blockDim.x; s *= 2) {
  int index = 2 * s * tid;

  if (index < blockDim.x) {
     sdata[index] += sdata[index + s];
  }
  __syncthreads();
}</pre>
```



- This is already better, but still we can improve a lot.
- Let's take a closer look at the shared memory:
  - On modern GPUs the shared memory is divided into 32 banks.
  - Adresses in different banks can be read at the same time.
  - If different threads within a warp want to read different adresses from a single bank, the accesses are executed in serial.
  - This is commonly referred to as a bank conflict



- This is already better, but still we can improve a lot.
- Let's take a closer look at the shared memory:
  - On modern GPUs the shared memory is divided into 32 banks.
  - Adresses in different banks can be read at the same time.
  - If different threads within a warp want to read different adresses from a single bank, the accesses are executed in serial.
  - This is commonly referred to as a bank conflict







### After a few additional optimizations, this is the final speed up:

|                                                                 | Time (2 <sup>22</sup> ints) | Bandwidth   | Step<br>Speedup | Cumulative<br>Speedup |
|-----------------------------------------------------------------|-----------------------------|-------------|-----------------|-----------------------|
| Kernel 1:<br>interleaved addressing<br>with divergent branching | 8.054 ms                    | 2.083 GB/s  |                 |                       |
| Kernel 2:<br>interleaved addressing<br>with bank conflicts      | 3.456 ms                    | 4.854 GB/s  | 2.33x           | 2.33x                 |
| Kernel 3:<br>sequential addressing                              | 1.722 ms                    | 9.741 GB/s  | 2.01x           | 4.68x                 |
| Kernel 4:<br>first add during global load                       | 0.965 ms                    | 17.377 GB/s | 1.78x           | 8.34x                 |
| Kernel 5:<br>unroll last warp                                   | 0.536 ms                    | 31.289 GB/s | 1.8x            | 15.01x                |
| Kernel 6:<br>completely unrolled                                | 0.381 ms                    | 43.996 GB/s | 1.41x           | 21.16x                |
| Kernel 7:<br>multiple elements per thread                       | 0.268 ms                    | 62.671 GB/s | 1.42x           | 30.04x                |

#### for the full details see:

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86\_website/projects/reduction/doc/reduction.pdf