CUDA streams, texture binding and async memcpy
Introduction
CUDA streams, texture access, and asynchronous copies are often discussed separately, but they only help when their ordering rules are understood together. The usual goal is to overlap host-to-device copies with kernel execution while still guaranteeing that a kernel reads valid data, whether it uses global memory or a texture object.
Streams control ordering
A CUDA stream is an ordered queue of work. Operations submitted to the same stream run in issue order. Operations in different streams may overlap if the GPU and memory subsystem support it.
The most common overlap pattern is:
- copy input in stream A
- launch kernel in stream A
- copy output in stream A
while stream B is doing the same thing for a different chunk of data.
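That only works if the host buffers are pinned (page-locked). With ordinary pageable memory, the runtime stages the transfer through an internal pinned buffer, so cudaMemcpyAsync may behave like a blocking copy from the host's point of view and generally will not overlap with kernel execution.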
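The copy, launch, copy pattern above can be sketched as follows. This is a minimal illustration rather than a tuned pipeline: the kernel doubleInPlace, the helper runTwoStreamPipeline, and the two-chunk split are assumptions made for the example, and n is assumed to be even.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void doubleInPlace(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes n floats in two chunks, one chunk per stream.
// Returns the number of wrong results, or -1 if a CUDA call failed.
int runTwoStreamPipeline(int n) {
    const int kStreams = 2;
    const int chunk = n / kStreams;                  // assumes n is even
    const size_t chunkBytes = chunk * sizeof(float);

    float* h_buf = nullptr;                          // pinned host buffer
    float* d_buf = nullptr;
    if (cudaMallocHost(&h_buf, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }
    for (int i = 0; i < n; ++i) h_buf[i] = static_cast<float>(i);

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream runs copy-in, kernel, copy-out in issue order.
    // Work in different streams is free to overlap if the hardware allows it.
    for (int s = 0; s < kStreams; ++s) {
        float* h = h_buf + s * chunk;
        float* d = d_buf + s * chunk;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        doubleInPlace<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for both streams before the host reads the results.
    cudaError_t err = cudaDeviceSynchronize();

    int wrong = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; ++i) {
            if (h_buf[i] != 2.0f * static_cast<float>(i)) ++wrong;
        }
    }

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return err == cudaSuccess ? wrong : -1;
}
```

Whether the two streams actually overlap depends on the device and driver, so it is worth confirming on a profiler timeline before relying on it.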
Texture binding is about how the kernel reads memory
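Texture memory is not a separate storage space in the way many beginners imagine. A texture object is a read interface layered over existing device memory or CUDA arrays. You create the texture object on the host, then pass the handle to a kernel as an ordinary argument. The older texture reference API, which relied on explicit bind calls, is deprecated and was removed in CUDA 12.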
A simple linear-memory texture object looks like this:
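As a hedged sketch, assuming d_in points to n floats of linear device memory, the host-side setup could be written like this. The helper name makeLinearTexture is hypothetical; the descriptor fields and cudaCreateTextureObject come from the CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Creates a texture object that reads n floats from linear device memory.
// The texture does not copy the data; it only describes how to read d_in.
cudaError_t makeLinearTexture(const float* d_in, size_t n,
                              cudaTextureObject_t* tex) {
    cudaResourceDesc resDesc;
    std::memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = const_cast<float*>(d_in);
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc;
    std::memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;  // return raw floats

    // No resource view is needed for linear memory, hence the nullptr.
    return cudaCreateTextureObject(tex, &resDesc, &texDesc, nullptr);
}
```

The object should later be released with cudaDestroyTextureObject, and d_in must stay allocated for as long as the texture object is in use.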
The kernel can then read through the texture path:
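A minimal sketch, assuming the texture object was created over a linear float buffer with element-type read mode. The kernel scaleFromTexture and the wrapper launchScale are hypothetical names; tex1Dfetch is the fetch function for textures backed by linear memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = scale * in[i], where the input is read
// through the texture object instead of a raw global-memory pointer.
__global__ void scaleFromTexture(cudaTextureObject_t tex, float* out,
                                 int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale * tex1Dfetch<float>(tex, i);
    }
}

// Launches the kernel on the given stream and reports launch errors.
cudaError_t launchScale(cudaTextureObject_t tex, float* d_out, int n,
                        float scale, cudaStream_t stream) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleFromTexture<<<blocks, threads, 0, stream>>>(tex, d_out, n, scale);
    return cudaGetLastError();
}
```

The texture handle is passed by value like any other kernel argument, so no global state is involved in the launch.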
This is useful when the access pattern benefits from the texture cache or from texture addressing features. For plain sequential reads, global memory with normal loads is often enough on modern GPUs.
Async copies and texture reads must still be synchronized
Texture access does not magically make an asynchronous copy safe. If stream A copies data into d_in and stream B launches a kernel that reads d_in through a texture object, you must create an explicit dependency.
One clean pattern is an event:
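A minimal sketch of that dependency, assuming h_in is pinned and the two streams already exist. The kernel addOne and the helper copyThenCompute are hypothetical, and the kernel reads d_in through a plain pointer to keep the example short; the same record-and-wait sequence applies when the kernel reads d_in through a texture object.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: out[i] = in[i] + 1.
__global__ void addOne(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Copies h_in to d_in on copyStream, then runs the kernel on computeStream
// only after the copy has finished. Does not block the host.
cudaError_t copyThenCompute(const float* h_in, float* d_in, float* d_out,
                            int n, cudaStream_t copyStream,
                            cudaStream_t computeStream) {
    cudaEvent_t copyDone;
    cudaError_t err = cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);
    if (err != cudaSuccess) return err;

    cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    // Mark the point in copyStream right after the copy.
    cudaEventRecord(copyDone, copyStream);
    // Make computeStream wait on the GPU side until that point is reached.
    cudaStreamWaitEvent(computeStream, copyDone, 0);

    addOne<<<(n + 255) / 256, 256, 0, computeStream>>>(d_in, d_out, n);
    err = cudaGetLastError();

    // Destroying a pending event is allowed; its resources are released
    // once the event completes.
    cudaEventDestroy(copyDone);
    return err;
}
```

The host still has to synchronize with computeStream, for example with cudaStreamSynchronize, before it reads d_out back.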
If you skip that dependency, the compute stream may start before the copy is finished and the kernel will read incomplete data.
Common Pitfalls
- Expecting cudaMemcpyAsync to overlap when host memory is pageable. Use cudaMallocHost or another pinned allocation path.
- Launching work in the default stream and assuming it will overlap with everything else. The default stream has special synchronization behavior.
- Binding a texture object to memory that is still being filled by another stream without an event or stream-level dependency.
- Using texture objects for every read path. They help in specific access patterns, not universally.
- Forgetting to destroy texture objects and streams, which makes long-running programs harder to debug.
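Two of these pitfalls, the default stream and forgotten cleanup, can be reduced with small owning wrappers. The sketch below is one possible approach, not an established API: ScopedStream and ScopedTexture are hypothetical helper types. The stream is created with cudaStreamNonBlocking so that it does not implicitly synchronize with the legacy default stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical RAII wrapper: owns one non-blocking stream.
struct ScopedStream {
    cudaStream_t stream = nullptr;

    ScopedStream() {
        if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
            stream = nullptr;  // creation failed; nothing to destroy later
        }
    }
    ~ScopedStream() {
        if (stream != nullptr) cudaStreamDestroy(stream);
    }
    // Non-copyable so the stream is destroyed exactly once.
    ScopedStream(const ScopedStream&) = delete;
    ScopedStream& operator=(const ScopedStream&) = delete;
};

// Hypothetical RAII wrapper: takes ownership of an existing texture object.
struct ScopedTexture {
    cudaTextureObject_t tex;
    bool owned;

    explicit ScopedTexture(cudaTextureObject_t t) : tex(t), owned(true) {}
    ~ScopedTexture() {
        if (owned) cudaDestroyTextureObject(tex);
    }
    ScopedTexture(const ScopedTexture&) = delete;
    ScopedTexture& operator=(const ScopedTexture&) = delete;
};
```

With wrappers like these, a stream or texture object is released when it goes out of scope, including on early returns from error paths.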
Summary
- Streams define execution order and make overlap possible when work is independent.
- True asynchronous copies require pinned host memory.
- Texture objects are a read interface over device memory or CUDA arrays, not a separate magical buffer.
- If one stream copies and another stream computes, add an event or use the same stream.
- Optimize with measurements; streams and textures are useful tools, but only when the access pattern justifies them.

