Kernel Function

In the kernel tutorial you learned that an alpaka kernel is represented by a functor. Internally, a kernel launch always invokes a functor, but alpaka also provides a convenient way to implement kernels as free functions. This approach offers a simple syntax for specializing a kernel implementation for a specific API or device kind while still using the same kernel interface from the application code.

This functionality is provided by alpaka’s generic function interface.

Writing a Kernel as a Function

This example uses the simple one-dimensional vector-add kernel from the kernel tutorial.

Instead of defining a functor with operator(), you implement a function named fnDispatch. The first argument must be an alpaka function symbol or a function-symbol specialization, which allows alpaka to select the correct implementation at compile time. Function symbols and all corresponding fnDispatch overloads must be placed inside a namespace. alpaka relies on Argument-Dependent Lookup (ADL) to find matching overloads, and functions located in the global namespace cannot be found reliably.

The following example defines a generic vector-add kernel implementation that can run on any supported backend.

namespace tutorial
{

    /* The function symbol is only defined without specifying the argument signature.
     * You need to provide at least a generic function dispatch signature for the symbol.     */
    ALPAKA_FN_SYMBOL(VectorAdd);

    // Genic function dispatch signature which is used if no more specific specification for the symbol is provided.
    ALPAKA_FN_ACC void fnDispatch(
        VectorAdd,
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        concepts::IDataSource auto const& lhs,
        concepts::IDataSource auto const& rhs)
    {
        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());

        auto [globalThreadIdx] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);
        if(globalThreadIdx == 0u)
            printf("The alpaka kernel implementation is running\n");

        for(auto [i] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            out[i] = lhs[i] + rhs[i];
        }
    }
} // namespace tutorial

Launching a kernel function is identical to launching a kernel functor. The function symbol behaves like a stateless functor with a default constructor and can therefore be passed directly to a KernelBundle.

onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, Vec{numElements});
// An alpaka function symbol behaves like a functor with a default constructor.
queue.enqueue(frameSpec, KernelBundle{tutorial::VectorAdd{}, resultBuffer, lhsBuffer, rhsBuffer});
onHost::wait(queue);

Specializing a Kernel Function for CUDA

In some situations you may want to provide a backend-specific implementation. Typical reasons include:

using native backend features that are not available through the portable abstraction
achieving maximum performance for a specific platform
gradually porting existing CUDA code to alpaka

This can be achieved by overloading fnDispatch for a specific API and device kind. When a kernel is launched on a matching backend, alpaka automatically selects the specialized overload.

The example below provides a CUDA-specific implementation using native CUDA thread and grid indices.

Because this implementation depends on CUDA-specific language features such as blockIdx and threadIdx, it must be guarded with a pre-compiler macro ALPAKA_LANG_CUDA. Without this guard, compilation would fail on systems where CUDA support is not enabled.

#if ALPAKA_LANG_CUDA
namespace tutorial
{
    /* If the kernel is enqueued to a CUDA queue this overload is used as kernel.
     * It is required to guard the code with preprocessor guard ALPAKA_LANG_CUDA because the kernel is not genericdue
     * to the usage of CUDA build in variables.
     */
    template<typename T_DeviceKind>
    ALPAKA_FN_ACC void fnDispatch(
        VectorAdd::Spec<api::Cuda, T_DeviceKind>,
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        concepts::IDataSource auto const& lhs,
        concepts::IDataSource auto const& rhs)
    {
        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());

        std::size_t const globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
        std::size_t const stride = blockDim.x * gridDim.x;

        if(globalThreadIdx == 0u)
            printf("The native CUDA kernel implementation is running\n");


        for(std::size_t i = globalThreadIdx; i < out.getExtents().x(); i += stride)
        {
            out[i] = lhs[i] + rhs[i];
        }
    }
} // namespace tutorial
#endif

Overload Resolution

When a kernel is launched, alpaka selects the most specialized matching fnDispatch overload.

For example:

fnDispatch(VectorAdd, ...) provides a generic implementation.
fnDispatch(VectorAdd::Spec<api::Cuda, T_DeviceKind>, ...) provides a CUDA-specific implementation.
fnDispatch(VectorAdd::Spec<api::OneApi, deviceKind::IntelGpu>, ...) provides a implementation for the Intel GPU accessed via OneApi. (not used in this example)

If the kernel is executed on a queue for an CUDA device, the CUDA specialization is selected. For all other backends, the generic implementation is used automatically. This allows application code to remain unchanged while backend-specific optimizations are added where needed.

Complete Source File

230_kernelFn.cpp

/* Copyright 2026 René Widera
 * SPDX-License-Identifier: ISC
 */

#include "docsTest.hpp"

#include <alpaka/alpaka.hpp>

#include <catch2/catch_template_test_macros.hpp>
#include <catch2/catch_test_macros.hpp>

#include <numeric>
#include <vector>

using namespace alpaka;

namespace tutorial
{

    /* The function symbol is only defined without specifying the argument signature.
     * You need to provide at least a generic function dispatch signature for the symbol.     */
    ALPAKA_FN_SYMBOL(VectorAdd);

    // Genic function dispatch signature which is used if no more specific specification for the symbol is provided.
    ALPAKA_FN_ACC void fnDispatch(
        VectorAdd,
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        concepts::IDataSource auto const& lhs,
        concepts::IDataSource auto const& rhs)
    {
        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());

        auto [globalThreadIdx] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);
        if(globalThreadIdx == 0u)
            printf("The alpaka kernel implementation is running\n");

        for(auto [i] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            out[i] = lhs[i] + rhs[i];
        }
    }
} // namespace tutorial


#if ALPAKA_LANG_CUDA
namespace tutorial
{
    /* If the kernel is enqueued to a CUDA queue this overload is used as kernel.
     * It is required to guard the code with preprocessor guard ALPAKA_LANG_CUDA because the kernel is not genericdue
     * to the usage of CUDA build in variables.
     */
    template<typename T_DeviceKind>
    ALPAKA_FN_ACC void fnDispatch(
        VectorAdd::Spec<api::Cuda, T_DeviceKind>,
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        concepts::IDataSource auto const& lhs,
        concepts::IDataSource auto const& rhs)
    {
        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());

        std::size_t const globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
        std::size_t const stride = blockDim.x * gridDim.x;

        if(globalThreadIdx == 0u)
            printf("The native CUDA kernel implementation is running\n");


        for(std::size_t i = globalThreadIdx; i < out.getExtents().x(); i += stride)
        {
            out[i] = lhs[i] + rhs[i];
        }
    }
} // namespace tutorial
#endif

TEMPLATE_LIST_TEST_CASE("tutorial kernel as function vector add", "[docs]", docs::test::TestBackends)
{
    auto selector = onHost::makeDeviceSelector(TestType::makeDict()[object::deviceSpec]);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue();

    /* When you use `makeIdxMap`, your algorithm is not subject to any restrictions regarding data size, such as the
     * requirement that the data must be a power of two or a multiple of 100.
     */
    size_t numElements = 293u;
    auto lhsBuffer = onHost::alloc<int>(device, numElements);
    // Allocate buffers on the compute device.
    auto rhsBuffer = onHost::allocLike(device, lhsBuffer);
    auto resultBuffer = onHost::allocLike(device, lhsBuffer);

    std::vector<int> lhs(numElements);
    std::vector<int> rhs(numElements);
    std::iota(lhs.begin(), lhs.end(), 0);
    std::iota(rhs.begin(), rhs.end(), 1000);
    std::vector<int> result(lhs.size(), -1);

    // Copy input data to the device.
    onHost::memcpy(queue, lhsBuffer, lhs);
    onHost::memcpy(queue, rhsBuffer, rhs);
    onHost::memset(queue, resultBuffer, 0x00);

    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, Vec{numElements});
    // An alpaka function symbol behaves like a functor with a default constructor.
    queue.enqueue(frameSpec, KernelBundle{tutorial::VectorAdd{}, resultBuffer, lhsBuffer, rhsBuffer});
    onHost::wait(queue);

    // Copy the result back and wait for completion before reading it.
    onHost::memcpy(queue, result, resultBuffer);
    onHost::wait(queue);

    for(size_t i = 0; i < result.size(); ++i)
    {
        CHECK(result[i] == lhs[i] + rhs[i]);
    }
}