CUDA Stream和CUDA Event

CUDA Context

在介绍Stream和event之前，首先应该介绍一下Context。Context与Runtime API或Driver API并无直接关联。Context的描述可以在CUDA手册的Driver API章节找到。Context在CS领域一般译为上下文，在学操作系统的时候有所谓的上下文切换，这里的上下文指的是操作系统为进程或者线程(对于Linux来说，单位是task)维护的一套数据结构，可以包括PCB、计数器地址、文件描述符(UNIX)等，使得CPU可以在切换后完成该线程或进程的执行。CUDA program类似，也一样有其所谓的上下文，只不过这里的context是GPU维护的一套维持CUDA程序运行的数据结构，可能包括GPU分配的内存地址、内核函数运行状态、Host/Device内存地址映射关系等。与操作系统的context相同的是，一般来说CUDA context无需程序员自行创建与管理，在runtime API内也没有显式的context控制相关接口，GPU驱动可以为我们自动维护程序所需的上下文，只有使用更为底层的driver api才能获得对context完全控制权。

在使用Runtime API时，当我们使用cudaInitDevice()和cudaSetDevice()这类API时，runtime会为我们创建一个与设备对应的primary context，这个primary context是与进程相关的。即对所有的host thread共享，不论该thread使用了多少stream，也不论有多少线程在同时并发执行。只要其PID一致，操作的设备一致，runtime API只会维护一份context。

但是使用Driver API，我们可以创建多份context，但对于每个线程(每个task)。有且仅有一个current context，当我们在一个线程内使用cuCtxCreate()创建context时，会立刻将该创建的context作为current context，如果后续我们操作了非currrent context的上下文，cuda程序会报错。可以想象为，每个host thread都有一个context栈，有且仅有一个栈顶context即current context，我们可以手动的使用Driver API为一个线程配置不同的context。

简单来说，使用常用的Runtime API，每个进程、每个设备对应一份Primary Context。使用更为底层的Driver API，每个task可以有多个Context但有且只有一个Current Context。

使用[知乎大佬](https://zhuanlan.zhihu.com/p/694214348)画的一张图展示

CUDA Stream

首先，Stream是Runtime API内的一个概念。理解Stream，我们才能理解如何使用CUDA在Host侧完成并发控制或异步执行，而这些也正是提高CUDA进程吞吐量与效率的关键之一。在CUDA手册中，一个CUDA Stream被定义为：允许程序员去完成一序列指令的抽象概念。

1	`At the most basic level, a CUDA stream is an abstraction which allows the programmer to express a sequence of operations.`

Stream可以类比为一个队列，编程人员可以向其中入队指令(例如memcpy、kernel launch等)，先入队的指令会先被执行并出队，单个Stream内的指令执行顺序与其入队顺序相同。当然，我们可以同时使用多个Stream，Stream可以被设定有不同的优先级，但最终是CUDA Runtime来根据GPU当前的资源与Stream优先级情况来决定哪个Stream的指令应该被执行。换句话说，Runtime为我们实现Stream Schedule，Runtime并不能保证优先级高的Stream内指令一定先于低的Stream执行。

Default Stream

当我们没有显式声明Stream时，CUDA会默认使用Default Stream。Default Stream非常重要，它也是之后如何区分阻塞型Stream和非阻塞型Stream概念的关键。如果在使用nvcc时没有传递特殊的编译flag，那么CUDA会启用Legacy Default Stream(也可以称作NULL stream)。所有Host线程共享这个Legacy Default Stream。并且，该Legacy Default Stream是默认阻塞的。所有在Legacy Default Stream内的发出的指令，都会与其他所有阻塞型stream内的指令进行同步。可以参考cuda手册内提供的示例代码，如下：

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block>>>(...);
kernel3<<<grid, block, 0, stream2>>>(...);
cudaDeviceSynchronize();

当我们不带参数创建stream时，其默认与Legacy Default Stream阻塞(同步)，也就是所谓的阻塞型Stream。上述代码内的kernel1、kernel2、kernel3并不会并发运行，会等待上一个kernel结束后下一个kernel才会正常launch。不过当我们带上cudaStreamNonBlocking的flag之后，创建的Stream也就不会与Legacy Default Stream同步，kernel也就可以并发运行，如下述代码所示:

cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
kernel1<<<grid, block, 0, stream1>>>(...);
kernel2<<<grid, block>>>(...);
kernel3<<<grid, block, 0, stream2>>>(...);
cudaDeviceSynchronize();

需要特别指明的是，Legacy Default Stream是所有Host线程共享的。当我们在不同线程不显式声明Stream时，launch的指令并不会并发，哪怕host侧的代码是并发的，但GPU执行并不会并发。可以在编译时启用--default-stream per-thread或使用MacroCUDA_API_PER_THREAD_DEFAULT_STREAM来启用per-thread default stream，这时每个host线程都有一个自己的非阻塞型默认stream，在不同线程launch的指令会默认进行并发。

Stream间的同步

可以使用一些显式同步API，例如：

cudaDeviceSynchronize(): 等待所有Stream、所有Host线程的默认Stream(Legacy或非Lagacy)的所有指令完成；
cudaStreamSynchronize()：传入一个Stream作为参数，等待该Stream的所有指令完成；
cudaStreamWaitEvent()：传入Stream和Event作为参数，在使用该api后的所有指令都必须等待传入的event完成；
cudaStreamQuery()：查询一个Stream内的指令是否已经完成执行；

当然，我们在代码中使用Legacy Default Stream也可以做到隐式同步，正如之前的代码示例一样。当多个kernel可以并发执行时，可以创建一些阻塞型stream，并使用Legacy Default Stream的代码进行同步，等待这些并发的kernel执行结束再进行有顺序的kenrel发射。
再次强调，Stream可以保证在一个Stream内的指令是按顺序进行的。使用多个Stream的用途是让多个指令在Device侧overlap，GPU会考虑overlap资源的分配。而我们使用一些异步api(例如memcpyAsync)的目的一般是在host侧代码执行无需等待device侧的完成。例如下列代码：

1 2	`cudaMemcpyAsync(...); kernel1<<<grid, block>>>(...);`

kernel1的执行一定会在memcpy完成之后进行，但是使用async，我们可以在memcpy后进行一些host侧的代码执行，host侧的代码执行可能会和GPU的memcpy overlap，从而提高代码运行效率。

1
2
3

cudaMemcpyAsync(...);
doSomeHostStuff(...);  //overlap with device memcpy
kernel1<<<grid, block>>>(...);

CUDA Event

CUDA event是和Stream配合使用的一个重要工具。event为runtime API提供了Stream内指令级别颗粒度的追踪与同步方式。当我们使用Stream时，可以通过上面介绍的一些显式同步API来确定一个Stream内的指令是否全部完成，无法确定某一条特定的指令是否完成。虽然Stream可以保证指令的执行顺序，但在host侧可能需要指令颗粒度级别的追踪(例如，在某条指令结束后进行一些Host侧的操作，单独使用Stream无法实现对某单独一条指令的完成情况进行追踪)。

以下述代码为例，当我们在kernel1 launch之后，在Stream中添加一个event，之后使用cudaEventSynchronize()同步API，可以做到在Stream内完成对kernel1的同步操作，只有当kernel1完成执行后，后续的dependentCPUtask()才会运行，此时host侧的代码可以和kernel2完成overlap。

cudaEvent_t event;
cudaStream_t stream;

// create the stream
cudaStreamCreate(&stream);
// create the event
cudaEventCreate(&event);
// launch a kernel into the stream
kernel1<<<grid, block, 0, stream>>>(...);
// Record the event
cudaEventRecord(event, stream);
// launch a kernel into the stream
kernel2<<<grid, block, 0, stream>>>(...);

// Wait for the event to complete
// Kernel 1 will be guaranteed to have completed
// and we can launch the dependent task.
cudaEventSynchronize(event);
dependentCPUtask();
// Wait for the stream to be empty
// Kernel 2 is guaranteed to have completed
cudaStreamSynchronize(stream);
// destroy the event
cudaEventDestroy(event);
// destroy the stream
cudaStreamDestroy(stream);

当然，上述代码也可以使用cudaLaunchHostFunc()完成，参考CUDA Runtime API手册，cudaLaunchHostFunc()可以在Stream以指令形式插入一则host函数，该函数内不应该有任何CUDA Runtime API的调用。此种方法相比上述方法更灵活并且不需要在Host侧显式的完成同步，但代价是无法让HostFunc与后续的GPU指令overlap，Runtime API会完成GPU指令与该Host函数的同步与执行顺序分配。关于该函数的更多信息请参考API手册，如果使用该API改写上述使用event的代码，如下：

cudaStream_t stream;

// create the stream
cudaStreamCreate(&stream);
// launch a kernel into the stream
kernel1<<<grid, block, 0, stream>>>(...);
// launch the host callback function, this host function will execute after kernel1 had completed.
// In other words, stream will be joined after kenrel1 complete
// and when the host function executing, stream could be treated as idle.
cudaLaunchHostFunc(stream, dependentCPUtask, somedata);
// lanmch kernel2， Warning: kernel2 will be execute after host task had been done.
kernel2<<<grid, block, 0, stream>>>(...);

// Wait for the stream to be empty
// Kernel 2 is guaranteed to have completed
cudaStreamSynchronize(stream);
// destroy the event
cudaEventDestroy(event);
// destroy the stream
cudaStreamDestroy(stream);

同时，enevt还能用于追踪kernel的执行时间，或者用于在ncu/nsys内追踪内核执行情况等。

综合使用Stream与Event来创建DAG

显然，Stream和Event给了我们一个在代码内创建一系列按照顺序执行指令的机会，可以使用多条Stream并伴随Stream间的Event同步，我们就可以创建一个以GPU指令(或使用相关API创建的Host Function)为节点的有向无环图(DAG)，一张DAG可能会在代码内执行多次。CUDA为这类DAG设计专门设计了CUDA Graph功能，用于在单个应用内捕获一次DAG并允许其搞笑的执行多次。CUDA Graph比较复杂，cuda手册甚至单独是用了一章特地介绍，在此我们只讨论如何使用Stream和Event来搭建一个类似DAG的流图代码。

一个很经典的DAG，有多个节点与通路

上图展示了一个DAG，假设每个节点都是一个kernel函数(不考虑内存分配、内存拷贝)，首先需要确定stream的数量。图中同时并发的节点最多有3个，因此我们需要3个Stream。单个stream内的节点执行顺序不需要我们关心，只要按照顺序来执行kernel即可。但对于不同stream之间的依赖与先后关系，需要靠event来进行同步。话不多说直接上代码：


cudaStream_t stream1; // left path (A->X->Y->End)
cudaStream_t stream2; // middle path (...->D->...)
cudaStream_t stream3; // right path (...->B->C->E->...)

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaStreamCreate(&stream3);

kernelA<<<grid, block, 0, stream1>>>(...);

// insert a event into stream1
cudaEvent_t event_wait_A;
cudaEventCreate(&event_wait_A);
cudaEventRecord(event_wait_A, stream1);

// X and Y node need to execute after A node
// but they are in the same stream with A
// so it don't need events to sychronize
kernelX<<<grid, block, 0, stream1>>>(...);
kernelY<<<grid, block, 0, stream1>>>(...);

cudaStreamWaitEvent(stream3, event_wait_A);

kernelB<<<grid, block, 0, stream3>>>(...);
kernelC<<<grid, block, 0, stream3>>>(...);
// insert a event into stream3
cudaEvent_t event_wait_B;
cudaEventCreate(&event_wait_B);
cudaEventRecord(event_wait_B, stream3);

cudaStreamWaitEvent(stream2, event_wait_B);
kernelD<<<grid, block, 0, stream2>>>(...);

cudaEvent_t event_wait_D;
cudaEventCreate(&event_wait_D);
cudaEventRecord(event_wait_D, stream2);

cudaStreamWaitEvent(stream3, event_wait_D);
kernelE<<<grid, block, 0, stream3>>>(...);

cudaEvent_t event_wait_E;
cudaEventCreate(&event_wait_E);
cudaEventRecord(event_wait_E, stream3);

cudaStreamWaitEvent(stream1, event_wait_E);
kernelEnd<<<grid, block, 0, stream1>>>(...);

编程随笔 > AI相关

#Artificial Intelligence #CUDA

CUDA Stream和CUDA Event

https://blog.bakeneko-kuro.com/2025/12/15/hpc/cuda-stream-and-cuda-events/

作者

迷途黑猫

发布于

2025年12月15日

许可协议

使用NVIDIA modelopt代替pt-quantization进行模型量化并使用TRT导出INT8模型上一篇

CUDA中的Bank Conflict与Memory Coalesce 下一篇