Memory Fences
onAcc::memFence is a visibility and ordering primitive inside kernels.
It is not a barrier and therefore does not wait for other threads to reach the same point.
Instead, it tells the backend how data writes before the fence must become visible relative to data reads and writes after the fence.
With a scope you define the level of the thread hierarchy across which the visibility guarantee applies:
onAcc::scope::block for communication inside one thread block,
onAcc::scope::device for communication across blocks on the same device.
With an order you specify the ordering guarantee of the memory operations.
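To illustrate the difference between a fence and a barrier described above, here is a minimal host-side sketch in standard C++. It uses std::atomic_thread_fence as a stand-in for onAcc::memFence; this is an analogy, not alpaka API, and the function name fenceDoesNotWait is hypothetical. The worker thread issues a fence although no other thread ever reaches a matching point. With a barrier this would block; with a fence the worker simply continues.

```cpp
#include <atomic>
#include <thread>

// Standard C++ analogy, not alpaka API: std::atomic_thread_fence stands in
// for onAcc::memFence. The function name is hypothetical.
inline int fenceDoesNotWait()
{
    std::atomic<int> data{0};
    std::atomic<int> stepsDone{0};

    std::thread worker(
        [&]
        {
            data.store(7, std::memory_order_relaxed);
            // Orders the store above before the store below.
            // It does not wait for any other thread to arrive here.
            std::atomic_thread_fence(std::memory_order_release);
            stepsDone.store(1, std::memory_order_relaxed);
        });

    // The main thread never issues a fence; it only waits for the worker to exit.
    worker.join();

    return data.load(std::memory_order_relaxed) + stepsDone.load(std::memory_order_relaxed);
}
```

If the worker had to wait for a partner at the fence, join() would never return. Where all threads of a block must reach the same point, onAcc::syncBlockThreads is the primitive used in the examples on this page.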
Block-Scope Ordering
The first example orders accesses to data shared within a thread block without using atomics.
One thread publishes two values in shared memory.
The fence guarantees that the write to shared[0] becomes visible before the later write to shared[1] is observed as published.
On a device with API host and device kind cpu, wrong values can never be read because a thread block always contains exactly one thread.
On parallel devices, e.g. a GPU, the ordering guarantee is required.
template<uint32_t T_chunkSize>
struct BlockFenceKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto successFlag) const
    {
        auto shared = onAcc::declareSharedMdArray<int, uniqueId()>(acc, CVec<uint32_t, T_chunkSize>{});

        // only one thread is setting initial conditions
        for([[maybe_unused]] auto _ : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{1u}))
        {
            shared[0] = 1;
            shared[1] = 2;
        }

        // guarantee that all threads see the initialized shared data
        onAcc::syncBlockThreads(acc);

        for(auto [tid] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{T_chunkSize}))
        {
            // producer
            if(tid == 0u)
            {
                shared[0] = 10;
                onAcc::memFence(acc, onAcc::scope::block, onAcc::order::release);
                shared[1] = 20;
            }

            auto observedB = shared[1];
            onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
            auto observedA = shared[0];

            // allowed (observedA, observedB): (1, 2), (10, 2) or (10, 20); forbidden: (1, 20)
            if(observedA == 1 && observedB == 20)
            {
                // The following case can never happen
                onAcc::atomicExch(acc, &successFlag[0u], 0u);
            }
        }
    }
};
The kernel launch with a frame spec and the initialization of the success flag are shown below.
auto successFlag = onHost::allocUnified<uint32_t>(device, 1u);
successFlag[0u] = 1u;

queue.enqueue(onHost::FrameSpec{1u, 2u, exec}, KernelBundle{BlockFenceKernel<2u>{}, successFlag});
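The ordering argument of the block-scope example can also be tried on the host. The following is a sketch in standard C++, not alpaka code: std::atomic_thread_fence stands in for onAcc::memFence, and relaxed atomics stand in for the shared-memory slots, because a plain data race would be undefined behavior in standard C++. The function name countForbiddenOutcomes is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Host-side analogy of the block-scope example (standard C++, not alpaka API).
// Returns how often the forbidden outcome (observedA == 1 && observedB == 20)
// was seen; with the two fences in place this must always be zero.
inline int countForbiddenOutcomes(int rounds)
{
    int forbidden = 0;
    for(int i = 0; i < rounds; ++i)
    {
        std::atomic<int> a{1}; // plays the role of shared[0]
        std::atomic<int> b{2}; // plays the role of shared[1]
        int observedA = 0;
        int observedB = 0;

        std::thread producer(
            [&]
            {
                a.store(10, std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_release);
                b.store(20, std::memory_order_relaxed);
            });
        std::thread consumer(
            [&]
            {
                observedB = b.load(std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_acquire);
                observedA = a.load(std::memory_order_relaxed);
            });

        producer.join();
        consumer.join();

        // (1, 2), (10, 2) and (10, 20) are all legal interleavings.
        if(observedA == 1 && observedB == 20)
        {
            ++forbidden;
        }
    }
    return forbidden;
}
```

A call such as countForbiddenOutcomes(200) should return 0. Seeing the new value of b implies that the earlier write to a is visible after the acquire fence, which is the same reasoning the kernel relies on.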
Device-Scope Publication
The second example shows the classic producer/consumer publication pattern in global memory.
The producer writes the payload, issues a release fence, and only then atomically sets a ready flag.
The consumer spins on the atomic ready flag, issues an acquire fence, and then reads the payload.
This example intentionally uses ThreadSpec instead of FrameSpec because the algorithm needs an exact guarantee
about how many thread blocks and threads are launched.
struct ProducerConsumerFenceKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto payload,
        concepts::IMdSpan auto readyFlag,
        concepts::IMdSpan auto mismatchCounter) const
    {
        auto [tid] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);

        if(!(tid == 0u || tid == 2u))
        {
            return;
        }

        if(tid == 0u)
        {
            payload[0u] = 42u;
            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
            onAcc::atomicExch(acc, &readyFlag[0u], 1u);
        }
        else
        {
            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
            {
            }

            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::acquire);
            if(payload[0u] != 42u)
            {
                onAcc::atomicAdd(acc, &mismatchCounter[0u], 1u);
            }
        }
    }
};

queue.enqueue(
    onHost::ThreadSpec{3u, 1u, exec},
    KernelBundle{ProducerConsumerFenceKernel{}, payload, readyFlag, mismatchCounter});
Used pattern:
producer: write data, memFence(..., scope::device, order::release), then atomically publish the flag
consumer: atomically observe the flag, memFence(..., scope::device, order::acquire), then read the data
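The same publication pattern can be written with standard C++ threads, which may help when reasoning about it away from a device. This is a sketch under the assumption that std::atomic_thread_fence behaves analogously to onAcc::memFence; it is not alpaka code, and the function name publishAndConsume is hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Producer/consumer publication in standard C++ (analogy, not alpaka API).
// Returns the payload value the consumer observed after seeing the flag.
inline std::uint32_t publishAndConsume()
{
    std::uint32_t payload = 0u; // plain data, protected only by the protocol
    std::atomic<std::uint32_t> readyFlag{0u};
    std::uint32_t observed = 0u;

    std::thread consumer(
        [&]
        {
            // 1. atomically observe the flag
            while(readyFlag.load(std::memory_order_relaxed) == 0u)
            {
                std::this_thread::yield();
            }
            // 2. acquire fence
            std::atomic_thread_fence(std::memory_order_acquire);
            // 3. read the data
            observed = payload;
        });

    std::thread producer(
        [&]
        {
            // 1. write the data
            payload = 42u;
            // 2. release fence
            std::atomic_thread_fence(std::memory_order_release);
            // 3. atomically publish the flag
            readyFlag.store(1u, std::memory_order_relaxed);
        });

    producer.join();
    consumer.join();
    return observed;
}
```

The payload itself is a plain variable. It is only touched by the consumer after the flag has been observed and the acquire fence has been issued, so the two accesses never conflict.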
Practical Advice
A fence orders memory operations; it does not make conflicting non-atomic writes safe.
Keep the publication protocol simple: payload first, fence second, atomic flag update last.
For best performance use scope::block over scope::device when block-local visibility is enough.
Use the weakest memory order that expresses the algorithm clearly.
release/acquire is often the right pair for producer/consumer publication.
The meaning stays the same across backends, but the runtime cost can differ.
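The first piece of advice deserves a small illustration: when several threads update the same location, a fence around a plain increment does not help, and an atomic read-modify-write is needed instead. The sketch below uses standard C++ rather than alpaka (on the device, onAcc::atomicAdd from the example above plays this role), and the function name countWithAtomics is hypothetical. It also shows the weakest order that still expresses the algorithm: a pure counter only needs relaxed ordering.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Conflicting updates to one location need atomics, not fences
// (standard C++ analogy, not alpaka API).
inline std::uint32_t countWithAtomics(std::uint32_t numThreads, std::uint32_t incrementsPerThread)
{
    std::atomic<std::uint32_t> counter{0u};

    std::vector<std::thread> workers;
    workers.reserve(numThreads);
    for(std::uint32_t t = 0u; t < numThreads; ++t)
    {
        workers.emplace_back(
            [&counter, incrementsPerThread]
            {
                for(std::uint32_t i = 0u; i < incrementsPerThread; ++i)
                {
                    // Relaxed is enough: only the count matters, no other
                    // data is published through this variable.
                    counter.fetch_add(1u, std::memory_order_relaxed);
                }
            });
    }

    for(auto& worker : workers)
    {
        worker.join();
    }

    return counter.load(std::memory_order_relaxed);
}
```

With four threads and 1000 increments each, the function returns 4000. Replacing fetch_add with a plain increment surrounded by fences would still be a data race.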
Complete Source File
180_memFence.cpp
/* Copyright 2026 René Widera
 * SPDX-License-Identifier: ISC
 */
#include "docsTest.hpp"

#include <alpaka/alpaka.hpp>

#include <catch2/catch_template_test_macros.hpp>
#include <catch2/catch_test_macros.hpp>

using namespace alpaka;

template<uint32_t T_chunkSize>
struct BlockFenceKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto successFlag) const
    {
        auto shared = onAcc::declareSharedMdArray<int, uniqueId()>(acc, CVec<uint32_t, T_chunkSize>{});

        // only one thread is setting initial conditions
        for([[maybe_unused]] auto _ : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{1u}))
        {
            shared[0] = 1;
            shared[1] = 2;
        }

        // guarantee that all threads see the initialized shared data
        onAcc::syncBlockThreads(acc);

        for(auto [tid] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{T_chunkSize}))
        {
            // producer
            if(tid == 0u)
            {
                shared[0] = 10;
                onAcc::memFence(acc, onAcc::scope::block, onAcc::order::release);
                shared[1] = 20;
            }

            auto observedB = shared[1];
            onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
            auto observedA = shared[0];

            // allowed (observedA, observedB): (1, 2), (10, 2) or (10, 20); forbidden: (1, 20)
            if(observedA == 1 && observedB == 20)
            {
                // The following case can never happen
                onAcc::atomicExch(acc, &successFlag[0u], 0u);
            }
        }
    }
};


struct ProducerConsumerFenceKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto payload,
        concepts::IMdSpan auto readyFlag,
        concepts::IMdSpan auto mismatchCounter) const
    {
        auto [tid] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);

        if(!(tid == 0u || tid == 2u))
        {
            return;
        }

        if(tid == 0u)
        {
            payload[0u] = 42u;
            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
            onAcc::atomicExch(acc, &readyFlag[0u], 1u);
        }
        else
        {
            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
            {
            }

            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::acquire);
            if(payload[0u] != 42u)
            {
                onAcc::atomicAdd(acc, &mismatchCounter[0u], 1u);
            }
        }
    }
};


TEMPLATE_LIST_TEST_CASE("tutorial memFence block scope", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    auto successFlag = onHost::allocUnified<uint32_t>(device, 1u);
    successFlag[0u] = 1u;

    queue.enqueue(onHost::FrameSpec{1u, 2u, exec}, KernelBundle{BlockFenceKernel<2u>{}, successFlag});

    onHost::wait(queue);
    CHECK(successFlag[0u] == 1u);
}

TEMPLATE_LIST_TEST_CASE("tutorial memFence device scope", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    auto payload = onHost::alloc<uint32_t>(device, Vec{1u});
    auto readyFlag = onHost::alloc<uint32_t>(device, Vec{1u});
    auto mismatchCounter = onHost::alloc<uint32_t>(device, Vec{1u});

    auto readyInit = onHost::allocHostLike(readyFlag);
    auto mismatchInit = onHost::allocHostLike(mismatchCounter);
    readyInit[0u] = 0u;
    mismatchInit[0u] = 0u;

    onHost::memcpy(queue, readyFlag, readyInit);
    onHost::memcpy(queue, mismatchCounter, mismatchInit);

    queue.enqueue(
        onHost::ThreadSpec{3u, 1u, exec},
        KernelBundle{ProducerConsumerFenceKernel{}, payload, readyFlag, mismatchCounter});

    onHost::memcpy(queue, mismatchInit, mismatchCounter);
    onHost::wait(queue);

    CHECK(mismatchInit[0u] == 0u);
}