加速计算基础-CUDA C/C++

CUDA

发布日期: 2023-10-13

文章字数: 2.1k

阅读时长: 8 分

阅读次数:

一、利用基本的CUDA内存管理技术来优化加速应用程序

使用 CUDA C/C++ 加速应用程序

编写、编译及运行既可调用 CPU 函数也可启动 GPU 核函数 的 C/C++ 程序。
使用执行配置控制并行线程层次结构。
重构串行循环以在 GPU 上并行执行其迭代。
分配和释放可用于 CPU 和 GPU 的内存。
处理 CUDA 代码生成的错误。
加速 CPU 应用程序。

加速系统又称异构系统，由 CPU 和 GPU 组成。加速系统会运行 CPU 程序，这些程序也会转而启动将受益于 GPU 大规模并行计算能力的函数。
nvidia-smi **(*Systems Management Interface*) 命令行命令查询有关此 GPU 的信息。

!nvidia-smi

.cu 文件（.cu 是 CUDA 加速程序的文件扩展名）。

void CPUFunction()
{
  printf("This function is defined to run on the CPU.\n");
}

__global__ void GPUFunction()
{
  printf("This function is defined to run on the GPU.\n");
}

int main()
{
  CPUFunction();

  GPUFunction<<<1, 1>>>();
  cudaDeviceSynchronize();
}

__global__ void GPUFunction()

__global__ 关键字表明以下函数将在 GPU 上运行并可全局调用，而在此种情况下，则指由 CPU 或 GPU 调用。
通常，我们将在 CPU 上执行的代码称为主机代码，而将在 GPU 上运行的代码称为设备代码。
注意返回类型为 void。使用 __global__ 关键字定义的函数需要返回 void 类型。

GPUFunction<<<1, 1>>>(); 
<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>
//<<<第一个参数 线程块数  第二个参数 每个线程块的线程数>>>

通常，当调用要在 GPU 上运行的函数时，我们将此种函数称为已启动的核函数。
启动核函数时，我们必须提供执行配置，即在向核函数传递任何预期参数之前使用 <<< ... >>> 语法完成的配置。
在宏观层面，程序员可通过执行配置为核函数启动指定线程层次结构，从而定义线程组（称为线程块）的数量，以及要在每个线程块中执行的线程数量。现在请注意正在使用包含 1 线程（第二个配置参数）的 1 线程块（第一个执行配置参数）启动核函数。

cudaDeviceSynchronize(); //同步

与许多 C/C++ 代码不同，核函数启动方式为异步：CPU 代码将继续执行而无需等待核函数完成启动。
调用 CUDA 运行时提供的函数 cudaDeviceSynchronize 将导致主机 (CPU) 代码暂作等待，直至设备 (GPU) 代码执行完成，才能在 CPU 上恢复执行。

编写一个Hello GPU核函数

#include <stdio.h>

void helloCPU()
{
  printf("Hello from the CPU.\n");
}

/*
 * Refactor the `helloGPU` definition to be a kernel
 * that can be launched on the GPU. Update its message
 * to read "Hello from the GPU!"
 */

__global__ void helloGPU()
{
  printf("Hello also from the GPU.\n");
}

int main()
{
  helloCPU();
  /*
   * Refactor this call to `helloGPU` so that it launches
   * as a kernel on the GPU.
   */

  helloGPU<<<1,1>>>();
  cudaDeviceSynchronize();

  /*
   * Add code below to synchronize on the completion of the
   * `helloGPU` kernel completion before continuing the CPU
   * thread.
   */
}

!nvcc -arch=sm_70 -o hello-gpu 01-hello/01-hello-gpu.cu -run

编译 `some-CUDA.cu` 文件：

nvcc -arch=sm_70 -o out some-CUDA.cu -run

nvcc 是使用 nvcc 编译器的命令行命令。
将 some-CUDA.cu 作为文件传递以进行编译。
o 标志用于指定编译程序的输出文件。
arch 标志表示该文件必须编译为哪个架构类型。本示例中，sm_70 将用于专门针对本实验运行的 Volta GPU 进行编译，但有意深究的用户可以参阅有关 arch 标志、虚拟架构特性和 GPU特性的文档。
为方便起见，提供 run 标志将执行已成功编译的二进制文件。

启动核函数时，核函数代码由每个已配置的线程块中的每个线程执行。

因此，如果假设已定义一个名为 someKernel 的核函数，则下列情况为真：

someKernel<<<1, 1>>() 配置为在具有单线程的单个线程块中运行后，将只运行一次。
someKernel<<<1, 10>>() 配置为在具有 10 线程的单个线程块中运行后，将运行 10 次。
someKernel<<<10, 1>>() 配置为在 10 个线程块（每个均具有单线程）中运行后，将运行 10 次。
someKernel<<<10, 10>>() 配置为在 10 个线程块（每个均具有 10 线程）中运行后，将运行 100 次。

线程和块的索引

每个线程在其线程块内部均会被分配一个索引，从 0 开始。此外，每个线程块也会被分配一个索引，并从 0 开始。正如线程组成线程块，线程块又会组成网格，而网格是 CUDA 线程层次结构中级别最高的实体。简言之，CUDA 核函数在由一个或多个线程块组成的网格中执行，且每个线程块中均包含相同数量的一个或多个线程。

CUDA 核函数可以访问能够识别如下两种索引的特殊变量：正在执行核函数的线程（位于线程块内）索引和线程所在的线程块（位于网格内）索引。这两种变量分别为 threadIdx.x 和 blockIdx.x。

下图中，blockId.x=1,threadIx.x=3

加速for循环

并非要顺次运行循环的每次迭代，而是让每次迭代都在自身线程中并行运行

#include <stdio.h>

/*
 * Refactor `loop` to be a CUDA Kernel. The new kernel should
 * only do the work of 1 iteration of the original loop.
 */

void loop(int N)
{
  for (int i = 0; i < N; ++i)
  {
    printf("This is iteration number %d\n", i);
  }
}

int main()
{
  /*
   * When refactoring `loop` to launch as a kernel, be sure
   * to use the execution configuration to control how many
   * "iterations" to perform.
   *
   * For this exercise, only use 1 block of threads.
   */

  int N = 10;
  loop(N);
}

#include <stdio.h>
/*
 * Notice the absence of the previously expected argument `N`.
 */
__global__ void loop()
{
  /*
   * This kernel does the work of only 1 iteration
   * of the original for loop. Indication of which
   * "iteration" is being executed by this kernel is
   * still available via `threadIdx.x`.
   */
  printf("This is iteration number %d\n", threadIdx.x);
}

int main()
{
  /*
   * It is the execution context that sets how many "iterations"
   * of the "loop" will be done.
   */

  loop<<<1, 10>>>();
  cudaDeviceSynchronize();
}

协调并行线程

调整线程块的大小以实现更多的并行化

线程块包含的线程具有数量限制：确切地说是 1024 个。为增加加速应用程序中的并行量，我们必须要能在多个线程块之间进行协调。

CUDA 核函数可以访问给出块中线程数的特殊变量：blockDim.x。通过将此变量与 blockIdx.x 和 threadIdx.x 变量结合使用，并借助惯用表达式 threadIdx.x + blockIdx.x * blockDim.x 在包含多个线程的多个线程块之间组织并行执行，并行性将得以提升。以下是详细示例。

执行配置 <<<10, 10>>> 将启动共计拥有 100 个线程的网格，这些线程均包含在由 10 个线程组成的 10 个线程块中。因此，我们希望每个线程（0 至 99 之间）都能计算该线程的某个唯一索引。

如果线程块 blockIdx.x 等于 0，则 blockIdx.x * blockDim.x 为 0。向 0 添加可能的 threadIdx.x 值（0 至 9），之后便可在包含 100 个线程的网格内生成索引 0 至 9。
如果线程块 blockIdx.x 等于 1，则 blockIdx.x * blockDim.x 为 10。向 10 添加可能的 threadIdx.x 值（0 至 9），之后便可在包含 100 个线程的网格内生成索引 10 至 19。
如果线程块 blockIdx.x 等于 5，则 blockIdx.x * blockDim.x 为 50。向 50 添加可能的 threadIdx.x 值（0 至 9），之后便可在包含 100 个线程的网格内生成索引 50 至 59。
如果线程块 blockIdx.x 等于 9，则 blockIdx.x * blockDim.x 为 90。向 90 添加可能的 threadIdx.x 值（0 至 9），之后便可在包含 100 个线程的网格内生成索引 90 至 99。

加速具有多个线程块的For循环

#include <stdio.h>

__global__ void loop()
{
  /*
   * This idiomatic expression gives each thread
   * a unique index within the entire grid.
   */

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  printf("%d\n", i);
}

int main()
{
  /*
   * Additional execution configurations that would
   * work and meet the exercises contraints are:
   *
   * <<<5, 2>>>
   * <<<10, 1>>>
   */

  loop<<<2, 5>>>();
  cudaDeviceSynchronize();
}