Warp and Subgroup Functions

A warp is a hardware-scheduled group of threads that share an execution context and execute instructions together. Individual threads can be active or inactive (masked) when control flow diverges. Threads in the same warp can exchange values with warp shuffle functions without going through shared memory; threads in different warps cannot. A thread block may contain multiple warps, and the number of threads in a block is not required to be a multiple of the warp size. A warp is always a one-dimensional group of threads, even within n-dimensional kernels.
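
The relationship between block size, warp count, and lane index can be sketched in plain C++. This is a host-side illustration, not alpaka API: the names WarpCoord, toWarpCoord, warpsPerBlock, and activeLanesInWarp are hypothetical, and the sketch assumes that warps are formed from consecutive linearized thread indices within a block, which may differ between backends.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration only (not part of alpaka).
struct WarpCoord
{
    std::uint32_t warpIdx; // which warp inside the block
    std::uint32_t laneIdx; // position of the thread inside that warp
};

// Split a linearized thread index within a block into (warp, lane).
// Assumption: warps are built from consecutive linear thread indices.
inline WarpCoord toWarpCoord(std::uint32_t linearThreadIdx, std::uint32_t warpSize)
{
    assert(warpSize > 0u);
    return WarpCoord{linearThreadIdx / warpSize, linearThreadIdx % warpSize};
}

// Number of warps needed to hold blockSize threads.
// The last warp is only partially populated when blockSize is not a multiple of warpSize.
inline std::uint32_t warpsPerBlock(std::uint32_t blockSize, std::uint32_t warpSize)
{
    assert(warpSize > 0u);
    return (blockSize + warpSize - 1u) / warpSize;
}

// Number of threads that actually exist in the given warp of the block.
inline std::uint32_t activeLanesInWarp(std::uint32_t blockSize, std::uint32_t warpSize, std::uint32_t warpIdx)
{
    assert(warpSize > 0u);
    std::uint32_t const begin = warpIdx * warpSize;
    if(begin >= blockSize)
        return 0u;
    std::uint32_t const remaining = blockSize - begin;
    return remaining < warpSize ? remaining : warpSize;
}
```

Under these assumptions, a block of 100 threads with a warp size of 32 occupies four warps, and the last warp has only four populated lanes. Code that shuffles across a full warp has to account for such partially filled warps.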

When to Reach for Warp Functions

Use warp functions when:

you want fast communication among threads that execute in lock-step
you are implementing a reduction or prefix-style pattern inside a warp
you need ballot-style voting or lane-to-lane value exchange

A Warp Reduction With `shflDown`

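The following example reduces one value per lane to one value per warp. makeIdxMap is called with the worker group linearWarpsInGrid, so the index range is distributed across warps directly instead of first being mapped at the thread-block level. Each iteration therefore yields the base index of one warp-sized chunk of the input.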

struct WarpSumKernel
{
    /** Warp kernel
     *
     * This kernel assumes that `in` and `out` are one-dimensional.
     * The requires clause enforces this constraint.
     */
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IDataSource auto const& in,
        concepts::IMdSpan auto out) const
        requires(concepts::Dim<ALPAKA_TYPEOF(in), 1u> && concepts::Dim<ALPAKA_TYPEOF(out), 1u>)
    {
        auto const warpSize = onAcc::warp::getSize(acc);
        auto const idxInWarp = onAcc::warp::getLaneIdx(acc);
        auto const workSize = pCast<uint32_t>(in.getExtents());

        // This example requires that the work size is a multiple of the warp size.
        ALPAKA_ASSERT_ACC((workSize.x() % warpSize) == 0u);

        for(auto [blockBase] :
            onAcc::makeIdxMap(acc, onAcc::worker::linearWarpsInGrid, IdxRange{0u, workSize, warpSize}))
        {
            auto value = in[Vec{blockBase + idxInWarp}];
            for(uint32_t offset = warpSize / 2u; offset > 0u; offset /= 2u)
                value += onAcc::warp::shflDown(acc, value, offset);

            if(onAcc::warp::getLaneIdx(acc) == 0u)
            {
                out[blockBase / warpSize] = value;
            }
        }
    }
};
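
The data movement of the shuffle-down loop can be traced on the host with a small model. This is a sketch in plain C++, not alpaka code: simulateShflDown and simulateWarpSum are hypothetical names, a std::vector stands in for the lanes of one warp, and the model assumes that a lane whose source lies outside the warp keeps its own value (the behavior known from CUDA-style shuffles; other backends may differ).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a shuffle-down: lane i receives the value held by lane i + offset.
// Assumption: lanes whose source would lie outside the warp receive their own value.
inline std::vector<std::uint32_t> simulateShflDown(std::vector<std::uint32_t> const& lanes, std::uint32_t offset)
{
    std::vector<std::uint32_t> received(lanes.size());
    for(std::size_t lane = 0; lane < lanes.size(); ++lane)
    {
        std::size_t const src = lane + offset;
        received[lane] = src < lanes.size() ? lanes[src] : lanes[lane];
    }
    return received;
}

// Same loop structure as the kernel: halve the offset until it reaches zero.
// Returns the value held by lane 0, which is the sum over all lanes.
inline std::uint32_t simulateWarpSum(std::vector<std::uint32_t> lanes)
{
    auto const warpSize = static_cast<std::uint32_t>(lanes.size());
    // The halving scheme covers every lane only for power-of-two warp sizes.
    assert(warpSize > 0u && (warpSize & (warpSize - 1u)) == 0u);

    for(std::uint32_t offset = warpSize / 2u; offset > 0u; offset /= 2u)
    {
        // All lanes shuffle first, then all lanes accumulate, as in lock-step execution.
        auto const received = simulateShflDown(lanes, offset);
        for(std::size_t lane = 0; lane < lanes.size(); ++lane)
            lanes[lane] += received[lane];
    }
    return lanes[0];
}
```

For four lanes holding 1, 2, 3, 4 the model leaves 10 in lane 0. The other lanes end up with partial sums only, which is why the kernel writes its result from lane 0. With a warp size of 1 the loop body never runs and the single lane keeps its input value.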

Important rules:

All participating threads must call the same warp intrinsic in a compatible control-flow region.
Use the actual warp size reported by the accelerator instead of hard-coding 32. A warp size of 32 is typical for NVIDIA devices, but it is not universal.
On host devices, the warp size can be 1. The code still compiles and runs, but the subgroup behavior is naturally trivial there.

Other warp functions:

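onAcc::warp::shfl reads the value from a chosen lane, for example to broadcast it to the whole warp
onAcc::warp::shflUp reads the value from a lane with a lower index (own lane index minus an offset)
onAcc::warp::shflXor reads the value from the lane whose index is the own lane index XOR-ed with a mask
onAcc::warp::all and onAcc::warp::any vote on a predicate across the participating warp threads
onAcc::warp::ballot collects the predicate of each participating lane into a bit mask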
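
The lane-selection rules of these functions can be modeled on the host as well. The following is a plain C++ sketch, not the alpaka implementation: all simulate... names are hypothetical, a std::vector stands in for the lanes of one warp, and lanes without a valid source are assumed to keep their own value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Lanes = std::vector<std::uint32_t>;

// shfl-like: every lane reads the value of one chosen source lane (broadcast).
inline Lanes simulateShfl(Lanes const& lanes, std::size_t srcLane)
{
    assert(srcLane < lanes.size());
    return Lanes(lanes.size(), lanes[srcLane]);
}

// shflUp-like: lane i reads from lane i - offset.
// Assumption: lanes with i < offset have no source and keep their own value.
inline Lanes simulateShflUp(Lanes const& lanes, std::size_t offset)
{
    Lanes out(lanes);
    for(std::size_t lane = offset; lane < lanes.size(); ++lane)
        out[lane] = lanes[lane - offset];
    return out;
}

// shflXor-like: lane i reads from lane (i XOR mask).
// Only the lane index is XOR-ed; the value itself is copied unchanged.
inline Lanes simulateShflXor(Lanes const& lanes, std::size_t mask)
{
    Lanes out(lanes);
    for(std::size_t lane = 0; lane < lanes.size(); ++lane)
    {
        std::size_t const src = lane ^ mask;
        if(src < lanes.size())
            out[lane] = lanes[src];
    }
    return out;
}

// ballot-like: bit i of the result is set when the predicate of lane i is true.
inline std::uint64_t simulateBallot(std::vector<bool> const& predicates)
{
    assert(predicates.size() <= 64u);
    std::uint64_t mask = 0u;
    for(std::size_t lane = 0; lane < predicates.size(); ++lane)
        if(predicates[lane])
            mask |= (std::uint64_t{1} << lane);
    return mask;
}

// any-like: true when at least one lane's predicate is true.
inline bool simulateAny(std::vector<bool> const& predicates)
{
    return simulateBallot(predicates) != 0u;
}

// all-like: true when every lane's predicate is true.
inline bool simulateAll(std::vector<bool> const& predicates)
{
    for(bool const p : predicates)
        if(!p)
            return false;
    return true;
}
```

With a mask of 1, the XOR variant swaps the values of neighboring lane pairs. Repeating it with masks 1, 2, 4, and so on up to half the warp size and accumulating after each step gives a butterfly reduction that leaves the total in every lane, whereas the shuffle-down scheme leaves it only in lane 0.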

Complete Source File

190_warp.cpp

/* Copyright 2026 René Widera
 * SPDX-License-Identifier: ISC
 */

#include "docsTest.hpp"

#include <alpaka/alpaka.hpp>

#include <catch2/catch_template_test_macros.hpp>
#include <catch2/catch_test_macros.hpp>

#include <vector>

using namespace alpaka;

struct WarpSumKernel
{
    /** Warp kernel
     *
     * This kernel assumes that `in` and `out` are one-dimensional.
     * The requires clause enforces this constraint.
     */
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IDataSource auto const& in,
        concepts::IMdSpan auto out) const
        requires(concepts::Dim<ALPAKA_TYPEOF(in), 1u> && concepts::Dim<ALPAKA_TYPEOF(out), 1u>)
    {
        auto const warpSize = onAcc::warp::getSize(acc);
        auto const idxInWarp = onAcc::warp::getLaneIdx(acc);
        auto const workSize = pCast<uint32_t>(in.getExtents());

        // This example requires that the work size is a multiple of the warp size.
        ALPAKA_ASSERT_ACC((workSize.x() % warpSize) == 0u);

        for(auto [blockBase] :
            onAcc::makeIdxMap(acc, onAcc::worker::linearWarpsInGrid, IdxRange{0u, workSize, warpSize}))
        {
            auto value = in[Vec{blockBase + idxInWarp}];
            for(uint32_t offset = warpSize / 2u; offset > 0u; offset /= 2u)
                value += onAcc::warp::shflDown(acc, value, offset);

            if(onAcc::warp::getLaneIdx(acc) == 0u)
            {
                out[blockBase / warpSize] = value;
            }
        }
    }
};


TEMPLATE_LIST_TEST_CASE("tutorial warp shuffle reduction", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);
    auto const warpSize = device.getDeviceProperties().warpSize;

    auto const blocks = 2u;

    std::vector<uint32_t> hostInput(blocks * warpSize);
    std::vector<uint32_t> hostOutput(blocks, 0u);
    std::vector<uint32_t> expectedOutput(blocks, 0u);

    for(uint32_t blockIdx = 0; blockIdx < blocks; ++blockIdx)
    {
        for(uint32_t laneIdx = 0; laneIdx < warpSize; ++laneIdx)
        {
            auto const value = blockIdx * warpSize + laneIdx + 1u;
            hostInput[blockIdx * warpSize + laneIdx] = value;
            expectedOutput[blockIdx] += value;
        }
    }

    auto inputBuffer = onHost::allocLike(device, hostInput);
    auto outputBuffer = onHost::allocLike(device, hostOutput);

    onHost::memcpy(queue, inputBuffer, hostInput);
    onHost::memset(queue, outputBuffer, 0x00);

    onHost::concepts::FrameSpec auto frameSpec = onHost::FrameSpec{Vec{blocks}, Vec{warpSize}, exec};
    queue.enqueue(frameSpec, KernelBundle{WarpSumKernel{}, inputBuffer, outputBuffer});

    onHost::memcpy(queue, hostOutput, outputBuffer);
    onHost::wait(queue);

    for(uint32_t blockIdx = 0; blockIdx < blocks; ++blockIdx)
        CHECK(hostOutput[blockIdx] == expectedOutput[blockIdx]);
}
