Kernel Function
In the kernel tutorial you learned that an alpaka kernel is represented by a functor. Internally, a kernel launch always invokes a functor, but alpaka also provides a convenient way to implement kernels as free functions. This approach offers a simple syntax for specializing a kernel implementation for a specific API or device kind while still using the same kernel interface from the application code.
This functionality is provided by alpaka’s generic function interface.
Writing a Kernel as a Function
This example uses the simple one-dimensional vector-add kernel from the kernel tutorial.
Instead of defining a functor with operator(), you implement a function named fnDispatch.
The first argument must be an alpaka function symbol or a function-symbol specialization, which allows alpaka to select the correct implementation at compile time.
Function symbols and all corresponding fnDispatch overloads must be placed inside a namespace.
alpaka relies on Argument-Dependent Lookup (ADL) to find matching overloads, and functions located in the global namespace cannot be found reliably.
The following example defines a generic vector-add kernel implementation that can run on any supported backend.
namespace tutorial
{
/* The function symbol is only defined without specifying the argument signature.
* You need to provide at least a generic function dispatch signature for the symbol. */
ALPAKA_FN_SYMBOL(VectorAdd);
// Genic function dispatch signature which is used if no more specific specification for the symbol is provided.
ALPAKA_FN_ACC void fnDispatch(
VectorAdd,
onAcc::concepts::Acc auto const& acc,
concepts::IMdSpan auto out,
concepts::IDataSource auto const& lhs,
concepts::IDataSource auto const& rhs)
{
ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());
auto [globalThreadIdx] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);
if(globalThreadIdx == 0u)
printf("The alpaka kernel implementation is running\n");
for(auto [i] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
{
out[i] = lhs[i] + rhs[i];
}
}
} // namespace tutorial
Launching a kernel function is identical to launching a kernel functor.
The function symbol behaves like a stateless functor with a default constructor and can therefore be passed directly to a KernelBundle.
onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, Vec{numElements});
// An alpaka function symbol behaves like a functor with a default constructor.
queue.enqueue(frameSpec, KernelBundle{tutorial::VectorAdd{}, resultBuffer, lhsBuffer, rhsBuffer});
onHost::wait(queue);
Specializing a Kernel Function for CUDA
In some situations you may want to provide a backend-specific implementation. Typical reasons include:
using native backend features that are not available through the portable abstraction
achieving maximum performance for a specific platform
gradually porting existing CUDA code to alpaka
This can be achieved by overloading fnDispatch for a specific API and device kind.
When a kernel is launched on a matching backend, alpaka automatically selects the specialized overload.
The example below provides a CUDA-specific implementation using native CUDA thread and grid indices.
Because this implementation depends on CUDA-specific language features such as blockIdx and threadIdx, it must be guarded with a pre-compiler macro ALPAKA_LANG_CUDA.
Without this guard, compilation would fail on systems where CUDA support is not enabled.
#if ALPAKA_LANG_CUDA
namespace tutorial
{
/* If the kernel is enqueued to a CUDA queue this overload is used as kernel.
* It is required to guard the code with preprocessor guard ALPAKA_LANG_CUDA because the kernel is not genericdue
* to the usage of CUDA build in variables.
*/
template<typename T_DeviceKind>
ALPAKA_FN_ACC void fnDispatch(
VectorAdd::Spec<api::Cuda, T_DeviceKind>,
onAcc::concepts::Acc auto const& acc,
concepts::IMdSpan auto out,
concepts::IDataSource auto const& lhs,
concepts::IDataSource auto const& rhs)
{
ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());
std::size_t const globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
std::size_t const stride = blockDim.x * gridDim.x;
if(globalThreadIdx == 0u)
printf("The native CUDA kernel implementation is running\n");
for(std::size_t i = globalThreadIdx; i < out.getExtents().x(); i += stride)
{
out[i] = lhs[i] + rhs[i];
}
}
} // namespace tutorial
#endif
Overload Resolution
When a kernel is launched, alpaka selects the most specialized matching fnDispatch overload.
For example:
fnDispatch(VectorAdd, ...)provides a generic implementation.fnDispatch(VectorAdd::Spec<api::Cuda, T_DeviceKind>, ...)provides a CUDA-specific implementation.fnDispatch(VectorAdd::Spec<api::OneApi, deviceKind::IntelGpu>, ...)provides a implementation for the Intel GPU accessed via OneApi. (not used in this example)
If the kernel is executed on a queue for an CUDA device, the CUDA specialization is selected. For all other backends, the generic implementation is used automatically. This allows application code to remain unchanged while backend-specific optimizations are added where needed.
Complete Source File
230_kernelFn.cpp
1/* Copyright 2026 René Widera
2 * SPDX-License-Identifier: ISC
3 */
4
5#include "docsTest.hpp"
6
7#include <alpaka/alpaka.hpp>
8
9#include <catch2/catch_template_test_macros.hpp>
10#include <catch2/catch_test_macros.hpp>
11
12#include <numeric>
13#include <vector>
14
15using namespace alpaka;
16
17namespace tutorial
18{
19
20 /* The function symbol is only defined without specifying the argument signature.
21 * You need to provide at least a generic function dispatch signature for the symbol. */
22 ALPAKA_FN_SYMBOL(VectorAdd);
23
24 // Genic function dispatch signature which is used if no more specific specification for the symbol is provided.
25 ALPAKA_FN_ACC void fnDispatch(
26 VectorAdd,
27 onAcc::concepts::Acc auto const& acc,
28 concepts::IMdSpan auto out,
29 concepts::IDataSource auto const& lhs,
30 concepts::IDataSource auto const& rhs)
31 {
32 ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
33 ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());
34
35 auto [globalThreadIdx] = acc.getIdxWithin(onAcc::origin::grid, onAcc::unit::threads);
36 if(globalThreadIdx == 0u)
37 printf("The alpaka kernel implementation is running\n");
38
39 for(auto [i] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
40 {
41 out[i] = lhs[i] + rhs[i];
42 }
43 }
44} // namespace tutorial
45
46
47#if ALPAKA_LANG_CUDA
48namespace tutorial
49{
50 /* If the kernel is enqueued to a CUDA queue this overload is used as kernel.
51 * It is required to guard the code with preprocessor guard ALPAKA_LANG_CUDA because the kernel is not genericdue
52 * to the usage of CUDA build in variables.
53 */
54 template<typename T_DeviceKind>
55 ALPAKA_FN_ACC void fnDispatch(
56 VectorAdd::Spec<api::Cuda, T_DeviceKind>,
57 onAcc::concepts::Acc auto const& acc,
58 concepts::IMdSpan auto out,
59 concepts::IDataSource auto const& lhs,
60 concepts::IDataSource auto const& rhs)
61 {
62 ALPAKA_ASSERT_ACC(out.getExtents() == lhs.getExtents());
63 ALPAKA_ASSERT_ACC(out.getExtents() == rhs.getExtents());
64
65 std::size_t const globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
66 std::size_t const stride = blockDim.x * gridDim.x;
67
68 if(globalThreadIdx == 0u)
69 printf("The native CUDA kernel implementation is running\n");
70
71
72 for(std::size_t i = globalThreadIdx; i < out.getExtents().x(); i += stride)
73 {
74 out[i] = lhs[i] + rhs[i];
75 }
76 }
77} // namespace tutorial
78#endif
79
80TEMPLATE_LIST_TEST_CASE("tutorial kernel as function vector add", "[docs]", docs::test::TestBackends)
81{
82 auto selector = onHost::makeDeviceSelector(TestType::makeDict()[object::deviceSpec]);
83 if(!selector.isAvailable())
84 return;
85 onHost::concepts::Device auto device = selector.makeDevice(0);
86 onHost::Queue queue = device.makeQueue();
87
88 /* When you use `makeIdxMap`, your algorithm is not subject to any restrictions regarding data size, such as the
89 * requirement that the data must be a power of two or a multiple of 100.
90 */
91 size_t numElements = 293u;
92 auto lhsBuffer = onHost::alloc<int>(device, numElements);
93 // Allocate buffers on the compute device.
94 auto rhsBuffer = onHost::allocLike(device, lhsBuffer);
95 auto resultBuffer = onHost::allocLike(device, lhsBuffer);
96
97 std::vector<int> lhs(numElements);
98 std::vector<int> rhs(numElements);
99 std::iota(lhs.begin(), lhs.end(), 0);
100 std::iota(rhs.begin(), rhs.end(), 1000);
101 std::vector<int> result(lhs.size(), -1);
102
103 // Copy input data to the device.
104 onHost::memcpy(queue, lhsBuffer, lhs);
105 onHost::memcpy(queue, rhsBuffer, rhs);
106 onHost::memset(queue, resultBuffer, 0x00);
107
108 onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, Vec{numElements});
109 // An alpaka function symbol behaves like a functor with a default constructor.
110 queue.enqueue(frameSpec, KernelBundle{tutorial::VectorAdd{}, resultBuffer, lhsBuffer, rhsBuffer});
111 onHost::wait(queue);
112
113 // Copy the result back and wait for completion before reading it.
114 onHost::memcpy(queue, result, resultBuffer);
115 onHost::wait(queue);
116
117 for(size_t i = 0; i < result.size(); ++i)
118 {
119 CHECK(result[i] == lhs[i] + rhs[i]);
120 }
121}