CUDA streams, texture binding and async memcpy

CUDA

GPU programming

parallel computing

texture binding

asynchronous memory transfer

CUDA streams, texture binding and async memcpy

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

CUDA streams, texture access, and asynchronous copies are often discussed separately, but they only help when their ordering rules are understood together. The usual goal is to overlap host-to-device copies with kernel execution while still guaranteeing that a kernel reads valid data, whether it uses global memory or a texture object.

Streams control ordering

A CUDA stream is an ordered queue of work. Operations submitted to the same stream run in issue order. Operations in different streams may overlap if the GPU and memory subsystem support it.

The most common overlap pattern is:

copy input in stream A
launch kernel in stream A
copy output in stream A

while stream B is doing the same thing for a different chunk of data.

That only works if the host buffers are pinned. With normal pageable memory, cudaMemcpyAsync often behaves like a blocking copy from the host point of view.

cpp

1#include <cuda_runtime.h>
2#include <iostream>
3
4__global__ void square_kernel(const float* in, float* out, int n) {
5    int i = blockIdx.x * blockDim.x + threadIdx.x;
6    if (i < n) {
7        out[i] = in[i] * in[i];
8    }
9}
10
11int main() {
12    const int n = 1 << 20;
13    const size_t bytes = n * sizeof(float);
14
15    float* h_in = nullptr;
16    float* h_out = nullptr;
17    float* d_in = nullptr;
18    float* d_out = nullptr;
19    cudaStream_t stream;
20
21    cudaMallocHost(&h_in, bytes);
22    cudaMallocHost(&h_out, bytes);
23    cudaMalloc(&d_in, bytes);
24    cudaMalloc(&d_out, bytes);
25    cudaStreamCreate(&stream);
26
27    for (int i = 0; i < n; ++i) {
28        h_in[i] = static_cast<float>(i);
29    }
30
31    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
32    square_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
33    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
34    cudaStreamSynchronize(stream);
35
36    std::cout << h_out[10] << "\n";
37
38    cudaStreamDestroy(stream);
39    cudaFree(d_in);
40    cudaFree(d_out);
41    cudaFreeHost(h_in);
42    cudaFreeHost(h_out);
43}

Texture binding is about how the kernel reads memory

Texture memory is not a separate storage space in the way many beginners imagine. A texture object is a read interface layered over existing device memory or CUDA arrays. You bind or create the texture on the host, then pass the handle to a kernel.

A simple linear-memory texture object looks like this:

cpp

1cudaResourceDesc res_desc = {};
2res_desc.resType = cudaResourceTypeLinear;
3res_desc.res.linear.devPtr = d_in;
4res_desc.res.linear.desc = cudaCreateChannelDesc<float>();
5res_desc.res.linear.sizeInBytes = bytes;
6
7cudaTextureDesc tex_desc = {};
8tex_desc.readMode = cudaReadModeElementType;
9
10cudaTextureObject_t tex = 0;
11cudaCreateTextureObject(&tex, &res_desc, &tex_desc, nullptr);

The kernel can then read through the texture path:

cpp

1__global__ void texture_kernel(cudaTextureObject_t tex, float* out, int n) {
2    int i = blockIdx.x * blockDim.x + threadIdx.x;
3    if (i < n) {
4        out[i] = tex1Dfetch<float>(tex, i);
5    }
6}

This is useful when the access pattern benefits from the texture cache or from texture addressing features. For plain sequential reads, global memory with normal loads is often enough on modern GPUs.

Async copies and texture reads must still be synchronized

Texture access does not magically make an asynchronous copy safe. If stream A copies data into d_in and stream B launches a kernel that reads d_in through a texture object, you must create an explicit dependency.

One clean pattern is an event:

cpp

1cudaEvent_t ready;
2cudaEventCreate(&ready);
3
4cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copy_stream);
5cudaEventRecord(ready, copy_stream);
6
7cudaStreamWaitEvent(compute_stream, ready, 0);
8texture_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(tex, d_out, n);

If you skip that dependency, the compute stream may start before the copy is finished and the kernel will read incomplete data.

Common Pitfalls

Expecting cudaMemcpyAsync to overlap when host memory is pageable. Use cudaMallocHost or another pinned allocation path.
Launching work in the default stream and assuming it will overlap with everything else. The default stream has special synchronization behavior.
Binding a texture object to memory that is still being filled by another stream without an event or stream-level dependency.
Using texture objects for every read path. They help in specific access patterns, not universally.
Forgetting to destroy texture objects and streams, which makes long-running programs harder to debug.

Summary

Streams define execution order and make overlap possible when work is independent.
True asynchronous copies require pinned host memory.
Texture objects are a read interface over device memory or CUDA arrays, not a separate magical buffer.
If one stream copies and another stream computes, add an event or use the same stream.
Optimize with measurements; streams and textures are useful tools, but only when the access pattern justifies them.