CUDA from entry to mastery: performance profiling and Visual Profiler


After getting started, the next thing to learn is how to optimize your own code. Our previous examples deliberately ignored performance so that the basic concepts would stand out. From this section on, we will look at problems from a performance perspective and keep optimizing the code, since improving execution speed is the whole point of parallel processing.

There are many ways to measure how fast code runs. The C standard library provides timing APIs such as clock(): record the time at two points and take the difference. CUDA provides dedicated APIs for measuring the running time of device code, which are introduced one by one below.


Keep the CUDA_Toolkit_Reference_Manual open, ready for looking up any API you don't know. Around the kernel launch, we do the following:

    cudaEvent_t start, stop;                         // event objects
    cudaEventCreate(&start);                         // create the start event
    cudaEventCreate(&stop);                          // create the stop event
    cudaEventRecord(start, stream);                  // record the start event
    myKernel<<<dimg, dimb, size_smem, stream>>>(/* parameter list */); // launch the kernel
    cudaEventRecord(stop, stream);                   // record the stop event
    cudaEventSynchronize(stop);                      // wait for the event; all device work before it has completed
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // time between the two events, in ms



The kernel execution time is then available in the variable elapsedTime, which lets us evaluate the performance of the algorithm. Here is an example showing how to use this timing facility.

The previous example was tiny: only 5 elements, far too little work to time meaningfully. Here we expand the size to 1024 elements and run the kernel 1000 times in a loop, timing the total so the measurement is less sensitive to random disturbance. We use this example to compare the performance of thread parallelism against block parallelism. The code is as follows:

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <stdio.h>

    cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);

    // Block-parallel vector addition: one block (of one thread) per element.
    __global__ void addKernel_blk(int *c, const int *a, const int *b)
    {
        int i = blockIdx.x;
        c[i] = a[i] + b[i];
    }

    // Thread-parallel vector addition: one thread per element, in a single block.
    __global__ void addKernel_thd(int *c, const int *a, const int *b)
    {
        int i = threadIdx.x;
        c[i] = a[i] + b[i];
    }

    int main()
    {
        const int arraySize = 1024;
        int a[arraySize] = {0};
        int b[arraySize] = {0};
        for (int i = 0; i < arraySize; i++)
        {
            a[i] = i;
            b[i] = arraySize - i;
        }
        int c[arraySize] = {0};

        // Add vectors in parallel.
        cudaError_t cudaStatus;
        int num = 0;
        cudaDeviceProp prop;
        cudaStatus = cudaGetDeviceCount(&num);
        for (int i = 0; i < num; i++)
        {
            cudaGetDeviceProperties(&prop, i);
        }
        cudaStatus = addWithCuda(c, a, b, arraySize);
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "addWithCuda failed!");
            return 1;
        }

        // cudaThreadExit must be called before exiting in order for profiling and
        // tracing tools such as Nsight and Visual Profiler to show complete traces.
        cudaStatus = cudaThreadExit();
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaThreadExit failed!");
            return 1;
        }

        // Verify the result on the host.
        for (int i = 0; i < arraySize; i++)
        {
            if (c[i] != (a[i] + b[i]))
            {
                printf("Error in %d\n", i);
            }
        }
        return 0;
    }

    // Helper function for using CUDA to add vectors in parallel.
    cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
    {
        int *dev_a = 0;
        int *dev_b = 0;
        int *dev_c = 0;
        cudaError_t cudaStatus;

        // Choose which GPU to run on; change this on a multi-GPU system.
        cudaStatus = cudaSetDevice(0);
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
            goto Error;
        }

        // Allocate GPU buffers for three vectors (two input, one output).
        cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaMalloc failed!");
            goto Error;
        }
        cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaMalloc failed!");
            goto Error;
        }
        cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaMalloc failed!");
            goto Error;
        }

        // Copy input vectors from host memory to GPU buffers.
        cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }
        cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }

        // Time 1000 kernel launches with CUDA events.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        for (int i = 0; i < 1000; i++)
        {
            // addKernel_blk<<<size, 1>>>(dev_c, dev_a, dev_b);  // block-parallel version
            addKernel_thd<<<1, size>>>(dev_c, dev_a, dev_b);     // thread-parallel version
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float tm;
        cudaEventElapsedTime(&tm, start, stop);
        printf("GPU Elapsed time: %.6f ms.\n", tm);

        // cudaThreadSynchronize waits for the kernel to finish, and returns
        // any errors encountered during the launch.
        cudaStatus = cudaThreadSynchronize();
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
            goto Error;
        }

        // Copy output vector from GPU buffer to host memory.
        cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
        if (cudaStatus != cudaSuccess)
        {
            fprintf(stderr, "cudaMemcpy failed!");
            goto Error;
        }

    Error:
        cudaFree(dev_c);
        cudaFree(dev_a);
        cudaFree(dev_b);
        return cudaStatus;
    }


addKernel_blk implements the vector addition with block parallelism, and addKernel_thd implements it with thread parallelism. Running each version in turn gives the results shown in the figures below:

Thread parallelism:

Block parallelism:


The performance differs by nearly a factor of 16! So when choosing a parallelization scheme for a problem that is not very large, thread parallelism is the more appropriate choice. When a large problem is split across multiple thread blocks, the number of threads per block should not be too small: the block-parallel version here uses only a single thread per block, which is a great waste of hardware resources. A better plan is to divide the problem into N thread blocks of, say, 512 threads each; this is usually far more efficient than either a single block of many threads or many blocks of one thread. Striking that balance is the essence of CUDA programming.


The timing method above is fairly coarse: we only learn the approximate total running time, with no insight into how long each part of the device program takes. That leaves a question unanswered: if we want to optimize the code, which part should we optimize? Should we adjust the number of threads, or switch to shared memory? The best tool for answering this is the Visual Profiler. The following is taken from the CUDA_Profiler_Users_Guide:

"The Visual Profiler is a graphical profiling tool that displays the CPU and GPU activity of your application, and uses an analysis engine to help you find optimization opportunities."

In addition to the graphical interface, NVIDIA also provides a command-line profiler, nvprof. The graphical tool is easier for beginners, so this section uses the Visual Profiler.


Open the Visual Profiler, which can be found in the CUDA Toolkit menu after installation. The main interface is shown below:

Click File -> New Session to open the new-session dialog, as shown below:

Fill in the File field with the application (.exe) we want to analyze; the other fields can be left blank (if command-line arguments are required, they go in the third field). Click Next, as shown below:

The first field is the application execution timeout, which can be left blank. Check the last three check boxes to enable profiling, enable concurrent kernel profiling, and run the analyzer.

Click Finish to start running our application and analyzing its performance.

In the figure above, the CPU and GPU sections show the hardware and what was executed. Clicking an item highlights the corresponding span on the timeline for easy observation, and the details pane on the right shows its timing information. From the timeline, cudaMalloc occupies a large share of the total time. Below, the analyzer lists some key opportunities for performance improvement: low compute utilization (computation is only 1.8% of the total time; no surprise, since addition is computationally very cheap); low memory-copy/compute overlap (no overlap at all, strictly copy, compute, copy); small memory-copy sizes (the input is tiny, like buying a diary on Taobao and paying more for shipping than for the item); and low memory-copy throughput (only 1.55 GB/s). All of this is very helpful for further optimizing the program.


Now click the Details tab, next to the Analysis tab. The results are as follows:


This window shows, for each kernel launch, the execution time, the grid and block dimensions, the number of registers used, and the static and dynamic shared memory sizes, as well as the memory-copy operations. It measures time more precisely than the cudaEvent approach used earlier: you can read off the execution time of each step directly, down to nanosecond resolution.

There is also a Console tab after Details; click it to have a look.

This is simply the command-line window showing the program's output. With profiling enabled, the total execution time becomes longer (the original thread-parallel version ran in only about 4 ms). This is a kind of "uncertainty principle" of measurement: to measure the fine-grained times, the total time becomes inaccurate; to measure the total time accurately, the fine-grained times are lost.


The Settings tab that follows holds the parameters we configured when creating the session, and will not be described in detail.


After this section, we should have some ideas about where CUDA performance can be improved. In the next section, we will discuss how to optimize CUDA programs.

Tags: CUDA

Posted by slyte33 on Tue, 24 May 2022 00:16:45 +0300