Memory Fences

onAcc::memFence is a visibility and ordering primitive for use inside kernels. It is not a barrier, so it does not wait for other threads to reach the same point. Instead, it tells the backend how memory operations issued before the fence must be ordered, and become visible, relative to the reads and writes after the fence.

The scope defines the level of the thread hierarchy within which the visibility guarantee applies:

  • onAcc::scope::block for communication inside one thread block,

  • onAcc::scope::device for communication across blocks on the same device.

The order specifies which ordering guarantee the fence provides for the surrounding memory operations, for example onAcc::order::release or onAcc::order::acquire.

Block-Scope Ordering

The first example orders accesses to block-shared data without using atomics. One thread publishes two values in shared memory. The release fence guarantees that the write to shared[0] becomes visible before the later write to shared[1]; a thread that reads shared[1], issues an acquire fence, and then reads shared[0] can therefore never combine the new shared[1] with the old shared[0]. On a device with API host and device kind cpu, wrong values can never be read because a thread block always contains exactly one thread. On parallel devices, e.g. a GPU, the ordering guarantee is required.

template<uint32_t T_chunkSize>
struct BlockFenceKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto successFlag) const
    {
        auto shared = onAcc::declareSharedMdArray<int, uniqueId()>(acc, CVec<uint32_t, T_chunkSize>{});

        // only one thread sets the initial conditions
        for([[maybe_unused]] auto _ : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{1u}))
        {
            shared[0] = 1;
            shared[1] = 2;
        }

        // guarantee that all threads see the initialized shared data
        onAcc::syncBlockThreads(acc);

        for(auto [tid] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{T_chunkSize}))
        {
            // producer
            if(tid == 0u)
            {
                shared[0] = 10;
                onAcc::memFence(acc, onAcc::scope::block, onAcc::order::release);
                shared[1] = 20;
            }

            auto observedB = shared[1];
            onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
            auto observedA = shared[0];

            // (observedA, observedB) can be (1, 2), (10, 2) or (10, 20), but never (1, 20)
            if(observedA == 1 && observedB == 20)
            {
                // The following case can never happen
                onAcc::atomicExch(acc, &successFlag[0u], 0u);
            }
        }
    }
};

The initialization of the success flag and the kernel launch with a FrameSpec are shown below.

auto successFlag = onHost::allocUnified<uint32_t>(device, 1u);
successFlag[0u] = 1u;

queue.enqueue(onHost::FrameSpec{1u, 2u, exec}, KernelBundle{BlockFenceKernel<2u>{}, successFlag});
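
To see which value pairs the two fences permit, the following standalone sketch enumerates every interleaving of the producer's two writes with the consumer's two reads. It is plain C++ and not alpaka code. It assumes a simplified model in which each thread's two operations stay in program order (the property the release and acquire fences are meant to enforce) and memory otherwise behaves sequentially. The function name reachableOutcomes is hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <set>
#include <utility>

// Hypothetical model, not alpaka code.
// Producer (program order): write A = 10, then write B = 20.
// Consumer (program order): read B, then read A.
// Returns every (observedA, observedB) pair reachable by interleaving both sequences.
std::set<std::pair<int, int>> reachableOutcomes()
{
    std::set<std::pair<int, int>> outcomes;

    // A 4-bit mask with exactly two set bits selects the time slots used by the producer.
    for(unsigned mask = 0u; mask < 16u; ++mask)
    {
        if(std::bitset<4>(mask).count() != 2u)
        {
            continue;
        }

        int a = 1; // models the initial shared[0]
        int b = 2; // models the initial shared[1]
        int observedA = -1;
        int observedB = -1;
        int producerStep = 0;
        int consumerStep = 0;

        for(unsigned slot = 0u; slot < 4u; ++slot)
        {
            bool const producerTurn = ((mask >> slot) & 1u) != 0u;
            if(producerTurn)
            {
                // first producer step writes A, second writes B
                if(producerStep++ == 0)
                    a = 10;
                else
                    b = 20;
            }
            else
            {
                // first consumer step reads B, second reads A
                if(consumerStep++ == 0)
                    observedB = b;
                else
                    observedA = a;
            }
        }

        // both threads must have executed exactly two operations
        assert(producerStep == 2 && consumerStep == 2);
        outcomes.insert(std::make_pair(observedA, observedB));
    }
    return outcomes;
}
```

Under this model the reachable pairs are (1, 2), (10, 2) and (10, 20). The pair (1, 20) is not reachable, which matches the condition the kernel checks for.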

Device-Scope Publication

The second example shows the classic producer/consumer publication pattern in global memory. The producer writes the payload, issues a release fence, and only then atomically sets a ready flag. The consumer spins on the atomic ready flag, issues an acquire fence, and then reads the payload. This example intentionally uses ThreadSpec instead of FrameSpec because the algorithm needs an exact guarantee about how many thread blocks and threads are launched.

struct ProducerConsumerFenceKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto payload,
        concepts::IMdSpan auto readyFlag,
        concepts::IMdSpan auto mismatchCounter) const
    {
        auto [tid] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);

        if(!(tid == 0u || tid == 2u))
        {
            return;
        }

        if(tid == 0u)
        {
            payload[0u] = 42u;
            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
            onAcc::atomicExch(acc, &readyFlag[0u], 1u);
        }
        else
        {
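            // atomicCas with identical compare and swap values never changes the flag;
            // it only returns the current value, so it serves as an atomic read
            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)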
            {
            }

            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::acquire);
            if(payload[0u] != 42u)
            {
                onAcc::atomicAdd(acc, &mismatchCounter[0u], 1u);
            }
        }
    }
};

queue.enqueue(
    onHost::ThreadSpec{3u, 1u, exec},
    KernelBundle{ProducerConsumerFenceKernel{}, payload, readyFlag, mismatchCounter});

The pattern used here:

  • producer: write data, memFence(..., scope::device, order::release), then atomically publish the flag

  • consumer: atomically observe the flag, memFence(..., scope::device, order::acquire), then read the data
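
For readers who know the C++ memory model, the same publication pattern can be sketched on the host with std::atomic_thread_fence. This is an analogy in standard C++, not alpaka code, and it does not claim that memFence is implemented this way on any backend. The function name runPublication is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Host-side analogy in standard C++, not alpaka code.
// Returns the number of mismatches seen by the consumer.
std::uint32_t runPublication()
{
    std::uint32_t payload = 0u; // plain, non-atomic data
    std::atomic<std::uint32_t> readyFlag{0u}; // atomic publication flag
    std::uint32_t mismatches = 0u; // written by the consumer only

    std::thread consumer(
        [&]
        {
            // atomically observe the flag; the fence below provides the ordering
            while(readyFlag.load(std::memory_order_relaxed) == 0u)
            {
                std::this_thread::yield();
            }
            std::atomic_thread_fence(std::memory_order_acquire);
            // the payload read happens after the acquire fence
            if(payload != 42u)
            {
                ++mismatches;
            }
        });

    std::thread producer(
        [&]
        {
            payload = 42u; // payload first
            std::atomic_thread_fence(std::memory_order_release); // fence second
            readyFlag.store(1u, std::memory_order_relaxed); // atomic flag update last
        });

    producer.join();
    consumer.join();

    assert(readyFlag.load() == 1u);
    return mismatches;
}
```

Calling runPublication() is expected to return 0, because the release fence before the flag store and the acquire fence after the flag load together order the payload write before the payload read.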

Practical Advice

  • A fence orders memory operations; it does not make conflicting non-atomic writes safe.

  • Keep the publication protocol simple: payload first, fence second, atomic flag update last.

  • For best performance, prefer scope::block over scope::device when block-local visibility is enough.

  • Use the weakest memory order that expresses the algorithm clearly. release / acquire is often the right pair for producer/consumer publication.

  • The meaning stays the same across backends, but the runtime cost can differ.
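
The first point can be illustrated with a host-side sketch in standard C++ (an analogy, not alpaka code): when several threads update the same location, an atomic read-modify-write is needed, which is the role onAcc::atomicAdd plays in the device-scope kernel. A fence alone would not turn concurrent plain increments into a correct count. The function name countWithAtomics is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side analogy in standard C++, not alpaka code.
// Every thread increments the same counter; the atomic read-modify-write
// is what makes the conflicting updates safe, not a fence.
int countWithAtomics(int numThreads, int incrementsPerThread)
{
    assert(numThreads >= 0 && incrementsPerThread >= 0);

    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(numThreads));

    for(int t = 0; t < numThreads; ++t)
    {
        workers.emplace_back(
            [&counter, incrementsPerThread]
            {
                for(int i = 0; i < incrementsPerThread; ++i)
                {
                    // relaxed order is sufficient here: only the count itself matters
                    counter.fetch_add(1, std::memory_order_relaxed);
                }
            });
    }

    for(auto& worker : workers)
    {
        worker.join();
    }
    return counter.load();
}
```

With four threads and 10000 increments each, countWithAtomics(4, 10000) returns 40000. Replacing the atomic counter with a plain int would be a data race, regardless of any fences placed around the increment.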

Complete Source File

180_memFence.cpp
/* Copyright 2026 René Widera
 * SPDX-License-Identifier: ISC
 */
#include "docsTest.hpp"

#include <alpaka/alpaka.hpp>

#include <catch2/catch_template_test_macros.hpp>
#include <catch2/catch_test_macros.hpp>

using namespace alpaka;

template<uint32_t T_chunkSize>
struct BlockFenceKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto successFlag) const
    {
        auto shared = onAcc::declareSharedMdArray<int, uniqueId()>(acc, CVec<uint32_t, T_chunkSize>{});

        // only one thread sets the initial conditions
        for([[maybe_unused]] auto _ : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{1u}))
        {
            shared[0] = 1;
            shared[1] = 2;
        }

        // guarantee that all threads see the initialized shared data
        onAcc::syncBlockThreads(acc);

        for(auto [tid] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{T_chunkSize}))
        {
            // producer
            if(tid == 0u)
            {
                shared[0] = 10;
                onAcc::memFence(acc, onAcc::scope::block, onAcc::order::release);
                shared[1] = 20;
            }

            auto observedB = shared[1];
            onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
            auto observedA = shared[0];

            // (observedA, observedB) can be (1, 2), (10, 2) or (10, 20), but never (1, 20)
            if(observedA == 1 && observedB == 20)
            {
                // The following case can never happen
                onAcc::atomicExch(acc, &successFlag[0u], 0u);
            }
        }
    }
};


struct ProducerConsumerFenceKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto payload,
        concepts::IMdSpan auto readyFlag,
        concepts::IMdSpan auto mismatchCounter) const
    {
        auto [tid] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);

        if(!(tid == 0u || tid == 2u))
        {
            return;
        }

        if(tid == 0u)
        {
            payload[0u] = 42u;
            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
            onAcc::atomicExch(acc, &readyFlag[0u], 1u);
        }
        else
        {
            // atomicCas with identical compare and swap values never changes the flag;
            // it only returns the current value, so it serves as an atomic read
            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
            {
            }

            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::acquire);
            if(payload[0u] != 42u)
            {
                onAcc::atomicAdd(acc, &mismatchCounter[0u], 1u);
            }
        }
    }
};


TEMPLATE_LIST_TEST_CASE("tutorial memFence block scope", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    auto successFlag = onHost::allocUnified<uint32_t>(device, 1u);
    successFlag[0u] = 1u;

    queue.enqueue(onHost::FrameSpec{1u, 2u, exec}, KernelBundle{BlockFenceKernel<2u>{}, successFlag});

    onHost::wait(queue);
    CHECK(successFlag[0u] == 1u);
}

TEMPLATE_LIST_TEST_CASE("tutorial memFence device scope", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    auto payload = onHost::alloc<uint32_t>(device, Vec{1u});
    auto readyFlag = onHost::alloc<uint32_t>(device, Vec{1u});
    auto mismatchCounter = onHost::alloc<uint32_t>(device, Vec{1u});

    auto readyInit = onHost::allocHostLike(readyFlag);
    auto mismatchInit = onHost::allocHostLike(mismatchCounter);
    readyInit[0u] = 0u;
    mismatchInit[0u] = 0u;

    onHost::memcpy(queue, readyFlag, readyInit);
    onHost::memcpy(queue, mismatchCounter, mismatchInit);

    queue.enqueue(
        onHost::ThreadSpec{3u, 1u, exec},
        KernelBundle{ProducerConsumerFenceKernel{}, payload, readyFlag, mismatchCounter});

    onHost::memcpy(queue, mismatchInit, mismatchCounter);
    onHost::wait(queue);

    CHECK(mismatchInit[0u] == 0u);
}