Acquire/release semantics with non-temporal stores on x64

Introduction

On x64, ordinary write-back loads and stores already give you the usual acquire-like and release-like ordering guarantees that C and C++ rely on. Non-temporal stores are the exception: Intel documents them as weakly ordered, so you cannot assume that a stream store by itself behaves like a normal release store.

Why Normal x64 Stores Feel Easy

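For ordinary write-back memory, x86-64 has a strong memory model: the only reordering it allows is a load completing ahead of an earlier store to a different location. In practice, this means: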

normal stores are not reordered with other stores
normal loads are not reordered with other loads
a normal store is not reordered with an earlier load

That is why an atomic store with release ordering and an atomic load with acquire ordering usually compile to plain mov instructions on x64 for ordinary memory.
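As a minimal sketch of that baseline, the following publishes one integer through an ordinary release store and reads it back through an acquire load. The names `payload`, `payload_ready`, `publish`, and `try_consume` are hypothetical and exist only for illustration.

```cpp
#include <atomic>

// Hypothetical names, for illustration only.
int payload = 0;                         // ordinary write-back memory
std::atomic<bool> payload_ready{false};  // publication flag

void publish(int value) {
    payload = value;  // plain store
    // Release store: on x64, compilers typically emit a plain mov here.
    payload_ready.store(true, std::memory_order_release);
}

// Returns true and fills `out` once the payload has been published.
bool try_consume(int& out) {
    // Acquire load: also typically a plain mov on x64.
    if (!payload_ready.load(std::memory_order_acquire)) {
        return false;
    }
    out = payload;  // ordered after the flag load
    return true;
}
```

Because both sides use ordinary stores and loads, no explicit fence instruction is needed in this version.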

Non-Temporal Stores Change the Story

Non-temporal stores such as MOVNTI or the _mm_stream_* intrinsics are designed for streaming writes that should avoid polluting the cache. Intel describes them as similar to write-combining stores, and the important part for synchronization is that they are weakly ordered.

So if a producer thread does this:

writes a data buffer with non-temporal stores
sets a ready flag

you cannot assume another core will see the buffer writes before the flag unless you add the right ordering step.
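For reference, the data-writing half of that scenario might look like the sketch below. `stream_fill` is a hypothetical helper, not a library function; it issues MOVNTI through the `_mm_stream_si32` intrinsic and deliberately leaves out any fence or flag, so on its own it gives another core no ordering guarantee.

```cpp
#include <cstddef>
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI), SSE2

// Hypothetical helper: fills `count` ints at `dst` using non-temporal stores.
// It does NOT fence or publish anything; ordering is left to the caller.
void stream_fill(int* dst, std::size_t count, int value) {
    for (std::size_t i = 0; i < count; ++i) {
        _mm_stream_si32(dst + i, value);  // weakly ordered streaming store
    }
}
```

A caller that wants to hand the buffer to another thread still has to order these stores before it sets any flag.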

The Usual Correct Pattern

The standard solution is:

perform the non-temporal stores
execute SFENCE
publish a normal release flag

Example in C++:

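```cpp
#include <atomic>
#include <emmintrin.h>  // _mm_stream_si128, _mm_set_epi32; pulls in xmmintrin.h for _mm_sfence

alignas(16) int buffer[4];  // 16-byte aligned, as _mm_stream_si128 requires
std::atomic<int> ready{0};

void producer() {
    // _mm_set_epi32 takes the highest lane first, so buffer becomes {1, 2, 3, 4}.
    __m128i data = _mm_set_epi32(4, 3, 2, 1);
    _mm_stream_si128(reinterpret_cast<__m128i*>(buffer), data);  // non-temporal store

    _mm_sfence();                               // order the stream store before later stores
    ready.store(1, std::memory_order_release);  // then publish the flag
}

void consumer() {
    // Spin until the producer publishes the flag.
    while (ready.load(std::memory_order_acquire) == 0) {
    }

    int first = buffer[0];   // 1
    int second = buffer[1];  // 2
}
```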

The reason this works is:

SFENCE guarantees that every store before it, including the weakly ordered non-temporal stores, is globally visible before any store after it
the later release store publishes "data is ready"
the acquire load on the consumer side prevents the compiler from moving dependent reads before the flag load, and x64 already gives the needed hardware ordering for normal loads

Why `SFENCE` Matters

Intel's optimization guidance explicitly warns that streaming stores are weakly ordered and require fencing for coherent visibility. Without SFENCE, the writes may still be sitting in write-combining buffers when the flag becomes visible to another thread.

That creates the exact bug acquire/release synchronization is supposed to prevent: the consumer observes the signal but not all the data the signal is meant to publish.

What You Do Not Usually Need

On the consumer side, you normally do not need LFENCE just to pair with a standard acquire load from the ready flag, assuming the data is later read with ordinary loads from normal cacheable memory.

The tricky part is the producer side because non-temporal stores are the weakly ordered operation. Once the producer has fenced them and then performed a normal release store to the flag, the consumer can usually stick to standard acquire logic.
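A consumer along those lines could be sketched as follows. `shared_buffer`, `data_ready`, and `wait_and_sum` are hypothetical names, and the sketch assumes the producer has already fenced its stream stores before setting the flag.

```cpp
#include <atomic>
#include <cstddef>
#include <immintrin.h>  // _mm_pause

// Hypothetical shared state, for illustration only.
alignas(16) int shared_buffer[4];
std::atomic<int> data_ready{0};

// Spins until the flag is set, then reads the data with ordinary loads.
// No LFENCE or MFENCE: the acquire load is the only ordering the consumer adds.
int wait_and_sum() {
    while (data_ready.load(std::memory_order_acquire) == 0) {
        _mm_pause();  // spin-wait hint to the CPU
    }
    int sum = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        sum += shared_buffer[i];  // normal cacheable loads
    }
    return sum;
}
```

The `_mm_pause` call is only a politeness hint for the spin loop; it plays no part in the ordering.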

A Useful Rule of Thumb

Treat non-temporal stores as performance tools, not synchronization tools.

If you are publishing data to another thread:

write the data with non-temporal stores if streaming makes sense
fence the stream stores with SFENCE
publish readiness with a normal atomic release store
consume readiness with a normal atomic acquire load

That keeps synchronization attached to ordinary atomics while using non-temporal stores only for the bulk data movement.
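Put together, the four steps might look like the sketch below, which streams a larger buffer on one thread and sums it on another. The buffer size `kCount` and all function and variable names are assumptions made for this example, and it may need to be built with thread support enabled (for example `-pthread` on some toolchains).

```cpp
#include <atomic>
#include <cstddef>
#include <emmintrin.h>  // _mm_stream_si128, _mm_set1_epi32, _mm_sfence, _mm_pause
#include <thread>

constexpr std::size_t kCount = 1024;    // hypothetical size, a multiple of 4
alignas(16) int stream_buffer[kCount];  // _mm_stream_si128 needs 16-byte alignment
std::atomic<bool> stream_ready{false};

void stream_producer(int value) {
    const __m128i v = _mm_set1_epi32(value);
    for (std::size_t i = 0; i < kCount; i += 4) {
        // Step 1: bulk data goes out through non-temporal stores.
        _mm_stream_si128(reinterpret_cast<__m128i*>(stream_buffer + i), v);
    }
    _mm_sfence();                                         // Step 2: fence the stream stores
    stream_ready.store(true, std::memory_order_release);  // Step 3: normal release store
}

long long stream_consumer() {
    // Step 4: normal acquire load, then ordinary reads of the data.
    while (!stream_ready.load(std::memory_order_acquire)) {
        _mm_pause();
    }
    long long sum = 0;
    for (std::size_t i = 0; i < kCount; ++i) {
        sum += stream_buffer[i];
    }
    return sum;
}

// Runs one producer and one consumer; returns the sum the consumer observed.
long long run_stream_demo(int value) {
    stream_ready.store(false, std::memory_order_relaxed);
    long long seen = 0;
    std::thread consumer([&seen] { seen = stream_consumer(); });
    std::thread producer(stream_producer, value);
    producer.join();
    consumer.join();
    return seen;
}
```

With this layout, `run_stream_demo(3)` should return `3 * kCount`, because the consumer cannot leave its spin loop until the fenced data has been published.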

Common Pitfalls

The most common mistake is assuming that "x64 has release ordering by default" automatically covers non-temporal stores. It does not. Intel documents those stores as weakly ordered.

Another error is setting the ready flag immediately after the stream stores with no fence. That can let another core observe the flag before the data is globally visible.

People also sometimes over-fence on the consumer side with LFENCE or MFENCE when the real missing step is the producer-side SFENCE.

Finally, keep compiler ordering in mind. If you write low-level synchronization code without atomics or compiler barriers, the compiler can still rearrange source-level operations even when the hardware would have been strong enough.
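One way to keep the compiler's view of the ordering explicit is to use standard fences around a relaxed flag, as in this sketch. The names are hypothetical, and it assumes the payload is written with ordinary stores: mainstream compilers typically emit no instruction for a release fence on x64, so it does not stand in for SFENCE after stream stores.

```cpp
#include <atomic>

// Hypothetical names, for illustration only.
int message = 0;
std::atomic<int> message_flag{0};

void post_message(int value) {
    message = value;
    // Release fence: the compiler may not sink the payload store below
    // the flag store that follows.
    std::atomic_thread_fence(std::memory_order_release);
    message_flag.store(1, std::memory_order_relaxed);
}

// Returns true and fills `out` once a message has been posted.
bool poll_message(int& out) {
    if (message_flag.load(std::memory_order_relaxed) == 0) {
        return false;
    }
    // Acquire fence: the compiler may not hoist the payload read above
    // the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    out = message;
    return true;
}
```

A plain `int` flag with no fence or atomic would leave the compiler free to reorder or cache these accesses, which is the situation this paragraph warns about.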

Summary

Ordinary x64 loads and stores already support the usual acquire/release-style ordering for normal memory.
Non-temporal stores are weakly ordered and should not be treated as release stores.
Use SFENCE after non-temporal stores before publishing a ready flag.
Publish readiness with a normal atomic release store and observe it with an atomic acquire load.
Use non-temporal stores for streaming performance, not as a replacement for proper synchronization.
