Memory Fences

onAcc::memFence is a visibility and ordering primitive for use inside kernels. It is not a barrier, so it does not wait for other threads to reach the same point. Instead, it tells the backend how writes issued before the fence must become visible relative to reads and writes issued after the fence.
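The difference from a barrier can be sketched on the host with standard C++. This is only an analogy under stated assumptions: std::atomic_thread_fence stands in for onAcc::memFence, and the function name fencePassesWithoutPartner is hypothetical. No alpaka API is used.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Host-side analogy (assumption): std::atomic_thread_fence stands in for
// onAcc::memFence. Returns true if the calling thread passed its fence
// before any second thread existed.
inline bool fencePassesWithoutPartner()
{
    int payload = 0;
    bool passedFence = false;

    payload = 7;
    // The fence only orders memory operations; it never blocks.
    std::atomic_thread_fence(std::memory_order_release);
    // Reached although no other thread has been started yet.
    passedFence = true;

    // Only now is a second thread created. A two-party barrier placed where
    // the fence is would have blocked forever at that point.
    int observed = 0;
    std::thread late([&] { observed = payload; });
    late.join();

    // Thread creation and join already synchronize, so the value is visible.
    assert(observed == 7);
    return passedFence && observed == 7;
}
```

Calling fencePassesWithoutPartner() returns without waiting for a partner thread, which is the behavior that distinguishes a fence from a barrier such as onAcc::syncBlockThreads.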

The scope defines the level of the thread hierarchy at which the visibility guarantee applies:

onAcc::scope::block for communication inside one thread block,
onAcc::scope::device for communication across blocks on the same device.

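The order specifies which ordering guarantee applies to the memory operations around the fence. The examples on this page use onAcc::order::release and onAcc::order::acquire.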

Block-Scope Ordering

The first example orders accesses to thread-block shared data without atomics. One thread publishes two values in shared memory. The fences guarantee that a thread which observes the new value of shared[1] also observes the new value of shared[0]. On a device with API host and device kind cpu, a wrong combination can never be read because a thread block always contains exactly one thread. On parallel devices such as GPUs, the ordering guarantee is required.

template<uint32_t T_chunkSize>
struct BlockFenceKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto successFlag) const
    {
        auto shared = onAcc::declareSharedMdArray<int, uniqueId()>(acc, CVec<uint32_t, T_chunkSize>{});

        // only one thread sets the initial conditions
        for([[maybe_unused]] auto _ : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{1u}))
        {
            shared[0] = 1;
            shared[1] = 2;
        }

        // guarantee that all threads see the initialized shared data
        onAcc::syncBlockThreads(acc);

        for(auto [tid] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{T_chunkSize}))
        {
            // producer
            if(tid == 0u)
            {
                shared[0] = 10;
                onAcc::memFence(acc, onAcc::scope::block, onAcc::order::release);
                shared[1] = 20;
            }

            auto observedB = shared[1];
            onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
            auto observedA = shared[0];

            // (observedA, observedB) may be (1, 2), (10, 2) or (10, 20), but never (1, 20)
            if(observedA == 1 && observedB == 20)
            {
                // The following case can never happen
                onAcc::atomicExch(acc, &successFlag[0u], 0u);
            }
        }
    }
};

The kernel launch with a frame spec and the initialization of the success flag are shown below.

auto successFlag = onHost::allocUnified<uint32_t>(device, 1u);
successFlag[0u] = 1u;

queue.enqueue(onHost::FrameSpec{1u, 2u, exec}, KernelBundle{BlockFenceKernel<2u>{}, successFlag});
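The outcome rule checked by the kernel can be reproduced on the host as a small litmus test. The sketch below is an analogy under stated assumptions, not alpaka code: relaxed std::atomic<int> objects stand in for the two shared-memory cells (plain int accesses from two host threads would be a data race in standard C++), std::atomic_thread_fence stands in for onAcc::memFence, and the name countForbiddenOutcomes is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Host-side analogy (assumption): relaxed atomics replace the shared-memory
// cells, std::atomic_thread_fence replaces onAcc::memFence.
// Returns how often the forbidden outcome (observedA == 1, observedB == 20)
// was seen.
inline int countForbiddenOutcomes(int iterations)
{
    assert(iterations >= 0);
    int forbidden = 0;
    for(int i = 0; i < iterations; ++i)
    {
        std::atomic<int> a{1}; // plays the role of shared[0]
        std::atomic<int> b{2}; // plays the role of shared[1]
        int observedA = 0;
        int observedB = 0;

        std::thread producer([&] {
            a.store(10, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_release);
            b.store(20, std::memory_order_relaxed);
        });
        std::thread consumer([&] {
            observedB = b.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            observedA = a.load(std::memory_order_relaxed);
        });
        producer.join();
        consumer.join();

        // Allowed: (1, 2), (10, 2), (10, 20). Forbidden: (1, 20).
        if(observedA == 1 && observedB == 20)
        {
            ++forbidden;
        }
    }
    return forbidden;
}
```

With both fences in place, the C++ memory model rules out the forbidden pair, so countForbiddenOutcomes(200) returns 0. A consumer that reads the old second value and the new first value, i.e. (10, 2), is still a legal outcome.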

Device-Scope Publication

The second example shows the classic producer/consumer publication pattern in global memory. The producer writes the payload, issues a release fence, and only then atomically sets a ready flag. The consumer spins on the atomic ready flag, issues an acquire fence, and then reads the payload. This example intentionally uses ThreadSpec instead of FrameSpec because the algorithm needs an exact guarantee about how many thread blocks and threads are launched.

struct ProducerConsumerFenceKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto payload,
        concepts::IMdSpan auto readyFlag,
        concepts::IMdSpan auto mismatchCounter) const
    {
        auto [tid] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);

        if(!(tid == 0u || tid == 2u))
        {
            return;
        }

        if(tid == 0u)
        {
            payload[0u] = 42u;
            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
            onAcc::atomicExch(acc, &readyFlag[0u], 1u);
        }
        else
        {
            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
            {
            }

            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::acquire);
            if(payload[0u] != 42u)
            {
                onAcc::atomicAdd(acc, &mismatchCounter[0u], 1u);
            }
        }
    }
};

queue.enqueue(
    onHost::ThreadSpec{3u, 1u, exec},
    KernelBundle{ProducerConsumerFenceKernel{}, payload, readyFlag, mismatchCounter});

The pattern used:

producer: write data, memFence(..., scope::device, order::release), then atomically publish the flag
consumer: atomically observe the flag, memFence(..., scope::device, order::acquire), then read the data
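The same two-step protocol can be tried out on the host with standard C++ threads. This is a minimal sketch under stated assumptions: std::atomic_thread_fence stands in for onAcc::memFence, a relaxed std::atomic flag stands in for the atomic ready flag, and the function name publishAndConsume is hypothetical. Scopes have no counterpart in this analogy.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Host-side analogy (assumption) of the publication pattern.
// Returns the payload value seen by the consumer.
inline std::uint32_t publishAndConsume()
{
    std::uint32_t payload = 0u;           // plain, non-atomic data
    std::atomic<std::uint32_t> ready{0u}; // atomic publication flag
    std::uint32_t observed = 0u;

    std::thread consumer([&] {
        // 1. atomically observe the flag
        while(ready.load(std::memory_order_relaxed) == 0u)
        {
            std::this_thread::yield();
        }
        // 2. acquire fence, 3. read the data
        std::atomic_thread_fence(std::memory_order_acquire);
        observed = payload;
    });

    std::thread producer([&] {
        // 1. write the data
        payload = 42u;
        // 2. release fence, 3. atomically publish the flag
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(1u, std::memory_order_relaxed);
    });

    producer.join();
    consumer.join();
    assert(ready.load() == 1u);
    return observed;
}
```

publishAndConsume() returns 42 because the release fence before the flag store and the acquire fence after the flag load synchronize with each other. Swapping the payload write and the flag store in the producer would break that guarantee.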

Practical Advice

A fence orders memory operations; it does not make conflicting non-atomic writes safe.
Keep the publication protocol simple: payload first, fence second, atomic flag update last.
For best performance, prefer scope::block over scope::device when block-local visibility is enough.
Use the weakest memory order that expresses the algorithm clearly. release / acquire is often the right pair for producer/consumer publication.
The meaning stays the same across backends, but the runtime cost can differ.
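The first point can be illustrated on the host: when several threads update the same location, each update has to be atomic, and no fence placement replaces that. The sketch below is a standard C++ analogy with a hypothetical function name (countWithAtomics); inside a kernel, the corresponding tool would be an atomic operation such as the onAcc::atomicAdd call used in the device-scope example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side analogy (assumption): several threads increment one counter.
// The increments conflict, so they must be atomic read-modify-write
// operations; surrounding a plain `++counter` with fences would still race.
inline int countWithAtomics(int numThreads, int incrementsPerThread)
{
    assert(numThreads > 0 && incrementsPerThread >= 0);
    std::atomic<int> counter{0};

    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(numThreads));
    for(int t = 0; t < numThreads; ++t)
    {
        workers.emplace_back([&] {
            for(int i = 0; i < incrementsPerThread; ++i)
            {
                // Relaxed order is enough: only the final sum matters here.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for(auto& worker : workers)
    {
        worker.join();
    }
    return counter.load();
}
```

countWithAtomics(4, 1000) returns 4000 on every run, because no increment can be lost.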

Complete Source File

180_memFence.cpp

/* Copyright 2026 René Widera
 * SPDX-License-Identifier: ISC
 */
#include "docsTest.hpp"

#include <alpaka/alpaka.hpp>

#include <catch2/catch_template_test_macros.hpp>
#include <catch2/catch_test_macros.hpp>

using namespace alpaka;

template<uint32_t T_chunkSize>
struct BlockFenceKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto successFlag) const
    {
        auto shared = onAcc::declareSharedMdArray<int, uniqueId()>(acc, CVec<uint32_t, T_chunkSize>{});

        // only one thread sets the initial conditions
        for([[maybe_unused]] auto _ : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{1u}))
        {
            shared[0] = 1;
            shared[1] = 2;
        }

        // guarantee that all threads see the initialized shared data
        onAcc::syncBlockThreads(acc);

        for(auto [tid] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInBlock, IdxRange{T_chunkSize}))
        {
            // producer
            if(tid == 0u)
            {
                shared[0] = 10;
                onAcc::memFence(acc, onAcc::scope::block, onAcc::order::release);
                shared[1] = 20;
            }

            auto observedB = shared[1];
            onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
            auto observedA = shared[0];

            // (observedA, observedB) may be (1, 2), (10, 2) or (10, 20), but never (1, 20)
            if(observedA == 1 && observedB == 20)
            {
                // The following case can never happen
                onAcc::atomicExch(acc, &successFlag[0u], 0u);
            }
        }
    }
};


struct ProducerConsumerFenceKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto payload,
        concepts::IMdSpan auto readyFlag,
        concepts::IMdSpan auto mismatchCounter) const
    {
        auto [tid] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);

        if(!(tid == 0u || tid == 2u))
        {
            return;
        }

        if(tid == 0u)
        {
            payload[0u] = 42u;
            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
            onAcc::atomicExch(acc, &readyFlag[0u], 1u);
        }
        else
        {
            while(onAcc::atomicCas(acc, &readyFlag[0u], 0u, 0u) == 0u)
            {
            }

            onAcc::memFence(acc, onAcc::scope::device, onAcc::order::acquire);
            if(payload[0u] != 42u)
            {
                onAcc::atomicAdd(acc, &mismatchCounter[0u], 1u);
            }
        }
    }
};


TEMPLATE_LIST_TEST_CASE("tutorial memFence block scope", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    auto successFlag = onHost::allocUnified<uint32_t>(device, 1u);
    successFlag[0u] = 1u;

    queue.enqueue(onHost::FrameSpec{1u, 2u, exec}, KernelBundle{BlockFenceKernel<2u>{}, successFlag});

    onHost::wait(queue);
    CHECK(successFlag[0u] == 1u);
}

TEMPLATE_LIST_TEST_CASE("tutorial memFence device scope", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    auto payload = onHost::alloc<uint32_t>(device, Vec{1u});
    auto readyFlag = onHost::alloc<uint32_t>(device, Vec{1u});
    auto mismatchCounter = onHost::alloc<uint32_t>(device, Vec{1u});

    auto readyInit = onHost::allocHostLike(readyFlag);
    auto mismatchInit = onHost::allocHostLike(mismatchCounter);
    readyInit[0u] = 0u;
    mismatchInit[0u] = 0u;

    onHost::memcpy(queue, readyFlag, readyInit);
    onHost::memcpy(queue, mismatchCounter, mismatchInit);

    queue.enqueue(
        onHost::ThreadSpec{3u, 1u, exec},
        KernelBundle{ProducerConsumerFenceKernel{}, payload, readyFlag, mismatchCounter});

    onHost::memcpy(queue, mismatchInit, mismatchCounter);
    onHost::wait(queue);

    CHECK(mismatchInit[0u] == 0u);
}