Kernel Function

In the kernel tutorial you learned that an alpaka kernel is represented by a functor. Internally, a kernel launch always invokes a functor, but alpaka also provides a convenient way to implement kernels as free functions. This approach offers a simple syntax for specializing a kernel implementation for a specific API or device kind while still using the same kernel interface from the application code.

This functionality is provided by alpaka’s generic function interface.

Writing a Kernel as a Function

This example uses the simple one-dimensional vector-add kernel from the kernel tutorial.

Instead of defining a functor with operator(), you implement a function named fnDispatch. The first argument must be an alpaka function symbol or a function-symbol specialization, which allows alpaka to select the correct implementation at compile time. Function symbols and all corresponding fnDispatch overloads must be placed inside a namespace. alpaka relies on Argument-Dependent Lookup (ADL) to find matching overloads, and functions located in the global namespace cannot be found reliably.

The following example defines a generic vector-add kernel implementation that can run on any supported backend.

namespace tutorial
{

    /* The function symbol is only defined without specifying the argument signature.
     * You need to provide at least a generic function dispatch signature for the symbol.     */
    ALPAKA_FN_SYMBOL(VectorAdd);

    // Genic function dispatch signature which is used if no more specific specification for the symbol is provided.
    ALPAKA_FN_ACC void fnDispatch(
        VectorAdd,
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        concepts::IDataSource auto const& lhs,
        concepts::IDataSource auto const& rhs)
    {
        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());

        auto [globalThreadIdx] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);
        if(globalThreadIdx == 0u)
            printf("The alpaka kernel implementation is running\n");

        for(auto [i] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            out[i] = lhs[i] + rhs[i];
        }
    }
} // namespace tutorial

Launching a kernel function is identical to launching a kernel functor. The function symbol behaves like a stateless functor with a default constructor and can therefore be passed directly to a KernelBundle.

onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, Vec{numElements});
// An alpaka function symbol behaves like a functor with a default constructor.
queue.enqueue(frameSpec, KernelBundle{tutorial::VectorAdd{}, resultBuffer, lhsBuffer, rhsBuffer});
onHost::wait(queue);

Specializing a Kernel Function for CUDA

In some situations you may want to provide a backend-specific implementation. Typical reasons include:

  • using native backend features that are not available through the portable abstraction

  • achieving maximum performance for a specific platform

  • gradually porting existing CUDA code to alpaka

This can be achieved by overloading fnDispatch for a specific API and device kind. When a kernel is launched on a matching backend, alpaka automatically selects the specialized overload.

The example below provides a CUDA-specific implementation using native CUDA thread and grid indices.

Because this implementation depends on CUDA-specific language features such as blockIdx and threadIdx, it must be guarded with a pre-compiler macro ALPAKA_LANG_CUDA. Without this guard, compilation would fail on systems where CUDA support is not enabled.

#if ALPAKA_LANG_CUDA
namespace tutorial
{
    /* If the kernel is enqueued to a CUDA queue this overload is used as kernel.
     * It is required to guard the code with preprocessor guard ALPAKA_LANG_CUDA because the kernel is not genericdue
     * to the usage of CUDA build in variables.
     */
    template<typename T_DeviceKind>
    ALPAKA_FN_ACC void fnDispatch(
        VectorAdd::Spec<api::Cuda, T_DeviceKind>,
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        concepts::IDataSource auto const& lhs,
        concepts::IDataSource auto const& rhs)
    {
        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());

        std::size_t const globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
        std::size_t const stride = blockDim.x * gridDim.x;

        if(globalThreadIdx == 0u)
            printf("The native CUDA kernel implementation is running\n");


        for(std::size_t i = globalThreadIdx; i < out.getExtents().x(); i += stride)
        {
            out[i] = lhs[i] + rhs[i];
        }
    }
} // namespace tutorial
#endif

Overload Resolution

When a kernel is launched, alpaka selects the most specialized matching fnDispatch overload.

For example:

  • fnDispatch(VectorAdd, ...) provides a generic implementation.

  • fnDispatch(VectorAdd::Spec<api::Cuda, T_DeviceKind>, ...) provides a CUDA-specific implementation.

  • fnDispatch(VectorAdd::Spec<api::OneApi, deviceKind::IntelGpu>, ...) provides a implementation for the Intel GPU accessed via OneApi. (not used in this example)

If the kernel is executed on a queue for an CUDA device, the CUDA specialization is selected. For all other backends, the generic implementation is used automatically. This allows application code to remain unchanged while backend-specific optimizations are added where needed.

Complete Source File

230_kernelFn.cpp
  1/* Copyright 2026 René Widera
  2 * SPDX-License-Identifier: ISC
  3 */
  4
  5#include "docsTest.hpp"
  6
  7#include <alpaka/alpaka.hpp>
  8
  9#include <catch2/catch_template_test_macros.hpp>
 10#include <catch2/catch_test_macros.hpp>
 11
 12#include <numeric>
 13#include <vector>
 14
 15using namespace alpaka;
 16
 17namespace tutorial
 18{
 19
 20    /* The function symbol is only defined without specifying the argument signature.
 21     * You need to provide at least a generic function dispatch signature for the symbol.     */
 22    ALPAKA_FN_SYMBOL(VectorAdd);
 23
 24    // Genic function dispatch signature which is used if no more specific specification for the symbol is provided.
 25    ALPAKA_FN_ACC void fnDispatch(
 26        VectorAdd,
 27        onAcc::concepts::Acc auto const& acc,
 28        concepts::IMdSpan auto out,
 29        concepts::IDataSource auto const& lhs,
 30        concepts::IDataSource auto const& rhs)
 31    {
 32        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
 33        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());
 34
 35        auto [globalThreadIdx] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);
 36        if(globalThreadIdx == 0u)
 37            printf("The alpaka kernel implementation is running\n");
 38
 39        for(auto [i] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
 40        {
 41            out[i] = lhs[i] + rhs[i];
 42        }
 43    }
 44} // namespace tutorial
 45
 46
 47#if ALPAKA_LANG_CUDA
 48namespace tutorial
 49{
 50    /* If the kernel is enqueued to a CUDA queue this overload is used as kernel.
 51     * It is required to guard the code with preprocessor guard ALPAKA_LANG_CUDA because the kernel is not genericdue
 52     * to the usage of CUDA build in variables.
 53     */
 54    template<typename T_DeviceKind>
 55    ALPAKA_FN_ACC void fnDispatch(
 56        VectorAdd::Spec<api::Cuda, T_DeviceKind>,
 57        onAcc::concepts::Acc auto const& acc,
 58        concepts::IMdSpan auto out,
 59        concepts::IDataSource auto const& lhs,
 60        concepts::IDataSource auto const& rhs)
 61    {
 62        ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
 63        ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());
 64
 65        std::size_t const globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
 66        std::size_t const stride = blockDim.x * gridDim.x;
 67
 68        if(globalThreadIdx == 0u)
 69            printf("The native CUDA kernel implementation is running\n");
 70
 71
 72        for(std::size_t i = globalThreadIdx; i < out.getExtents().x(); i += stride)
 73        {
 74            out[i] = lhs[i] + rhs[i];
 75        }
 76    }
 77} // namespace tutorial
 78#endif
 79
 80TEMPLATE_LIST_TEST_CASE("tutorial kernel as function vector add", "[docs]", docs::test::TestBackends)
 81{
 82    auto selector = onHost::makeDeviceSelector(TestType::makeDict()[object::deviceSpec]);
 83    if(!selector.isAvailable())
 84        return;
 85    onHost::concepts::Device auto device = selector.makeDevice(0);
 86    onHost::Queue queue = device.makeQueue();
 87
 88    /* When you use `makeIdxMap`, your algorithm is not subject to any restrictions regarding data size, such as the
 89     * requirement that the data must be a power of two or a multiple of 100.
 90     */
 91    size_t numElements = 293u;
 92    auto lhsBuffer = onHost::alloc<int>(device, numElements);
 93    // Allocate buffers on the compute device.
 94    auto rhsBuffer = onHost::allocLike(device, lhsBuffer);
 95    auto resultBuffer = onHost::allocLike(device, lhsBuffer);
 96
 97    std::vector<int> lhs(numElements);
 98    std::vector<int> rhs(numElements);
 99    std::iota(lhs.begin(), lhs.end(), 0);
100    std::iota(rhs.begin(), rhs.end(), 1000);
101    std::vector<int> result(lhs.size(), -1);
102
103    // Copy input data to the device.
104    onHost::memcpy(queue, lhsBuffer, lhs);
105    onHost::memcpy(queue, rhsBuffer, rhs);
106    onHost::memset(queue, resultBuffer, 0x00);
107
108    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, Vec{numElements});
109    // An alpaka function symbol behaves like a functor with a default constructor.
110    queue.enqueue(frameSpec, KernelBundle{tutorial::VectorAdd{}, resultBuffer, lhsBuffer, rhsBuffer});
111    onHost::wait(queue);
112
113    // Copy the result back and wait for completion before reading it.
114    onHost::memcpy(queue, result, resultBuffer);
115    onHost::wait(queue);
116
117    for(size_t i = 0; i < result.size(); ++i)
118    {
119        CHECK(result[i] == lhs[i] + rhs[i]);
120    }
121}