Vendor and Third-Party Interop

At some point you will ask how to use native vendor implementations, for example for BLAS, FFT, or other functions, together with alpaka. You might want to call thrust::transform on CUDA, rocPRIM on HIP, a oneAPI library on SYCL, or even a CPU-side library function on the host backend. alpaka provides a function-symbol interface for this purpose. It keeps the code that calls vendor functions clean and readable without wrapping the function calls in preprocessor macros.

The following steps are required:

1. Define an alpaka function symbol with ALPAKA_FN_SYMBOL(symbolName).
2. Optionally implement a generic alpaka fallback for any device.
3. Specialize implementations for the backends that have a special vendor path.

Then call the symbol via symbolName::call(queue, ...) or symbolName{}(queue, ...). The correct implementation is dispatched based on the API and device kind derived from the first argument, in this case the queue.
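To make the dispatch idea more concrete, here is a minimal sketch in plain C++ that does not use alpaka at all. It only illustrates one way a tag type can select an overload; it is not how alpaka implements ALPAKA_FN_SYMBOL. All names (DemoHostApi, DemoGpuApi, DemoQueue, DemoSymbol, DemoSpec, demoDispatch, demoCall) are hypothetical and exist only for this illustration.

```cpp
#include <string>

// Hypothetical backend tags for this sketch; they are not alpaka types.
struct DemoHostApi {};
struct DemoGpuApi {};

// Hypothetical queue that only records its backend as a nested type.
template<typename T_Api>
struct DemoQueue
{
    using Api = T_Api;
};

// The "function symbol" is an empty tag type.
struct DemoSymbol {};

// A backend-specific tag that derives from the symbol, so that it still
// matches overloads written for the plain symbol.
template<typename T_Symbol, typename T_Api>
struct DemoSpec : T_Symbol {};

// Generic fallback: matches every DemoSpec through derived-to-base conversion.
template<typename T_Queue>
std::string demoDispatch(DemoSymbol, T_Queue const&)
{
    return "generic fallback";
}

// Backend-specific overload: an exact match for the host tag, so overload
// resolution prefers it over the fallback.
template<typename T_Queue>
std::string demoDispatch(DemoSpec<DemoSymbol, DemoHostApi>, T_Queue const&)
{
    return "host-specific path";
}

// Entry point: derives the tag from the queue type and lets overload
// resolution pick the implementation.
template<typename T_Symbol, typename T_Queue>
std::string demoCall(T_Queue const& queue)
{
    return demoDispatch(DemoSpec<T_Symbol, typename T_Queue::Api>{}, queue);
}
```

In this sketch, demoCall<DemoSymbol>(DemoQueue<DemoHostApi>{}) selects the host-specific overload, while a queue with DemoGpuApi falls through to the generic one. The call site never names a backend.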

The example uses a tiny image-processing operation. Each input value is read as a pixel intensity, and the operation computes scale * value + shift. The affine operation itself is a small functor.

struct AffineTransformOp
{
    float scale;
    float shift;

    ALPAKA_FN_ACC auto operator()(float const& value) const -> float
    {
        return scale * value + shift;
    }
};

For the API host and the device kind cpu, we will fall back to the C++ standard library implementation via std::transform.
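As a reference for what the operation computes, the following sketch applies the same arithmetic with std::transform in plain C++. HostAffineOp is a copy of the functor without the ALPAKA_FN_ACC attribute so that the block compiles without alpaka; it and affineReference are hypothetical helper names, not part of the tutorial source.

```cpp
#include <algorithm>
#include <array>

// Plain C++ copy of the tutorial functor, without ALPAKA_FN_ACC,
// so that this sketch compiles without alpaka.
struct HostAffineOp
{
    float scale;
    float shift;

    auto operator()(float const& value) const -> float
    {
        return scale * value + shift;
    }
};

// Applies scale * value + shift to every element of `input`.
inline std::array<float, 5u> affineReference(std::array<float, 5u> const& input, float scale, float shift)
{
    std::array<float, 5u> output{};
    std::transform(input.begin(), input.end(), output.begin(), HostAffineOp{scale, shift});
    return output;
}
```

With scale 2.0 and shift 0.5, the inputs 1, 2, 3, 4, 5 map to 2.5, 4.5, 6.5, 8.5, 10.5. These are the values the test case in the complete source file checks.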

Defining a Dispatchable Function

/* The function symbol is only defined without specifying the argument signature.
 * You need to provide at least a generic function dispatch signature for the symbol.
 */
ALPAKA_FN_SYMBOL(AffineTransform);

You may later declare dispatch functions with different signatures for the same function symbol. The dispatch order follows the C++ rules for function overloading.
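The point about several signatures per symbol can be sketched with ordinary C++ overloads. The example below is plain C++ under the assumption that a symbol behaves like a tag type in the first parameter; ScaleSymbol, scaleDispatch, and scaleCall are hypothetical names and not alpaka API.

```cpp
// Hypothetical symbol tag for this sketch.
struct ScaleSymbol {};

// First signature: multiply only.
inline float scaleDispatch(ScaleSymbol, float value, float factor)
{
    return value * factor;
}

// Second signature for the same symbol: multiply, then shift.
inline float scaleDispatch(ScaleSymbol, float value, float factor, float shift)
{
    return value * factor + shift;
}

// Entry point that forwards whatever arguments it receives; the compiler
// selects the overload whose parameter list matches.
template<typename... T_Args>
float scaleCall(T_Args... args)
{
    return scaleDispatch(ScaleSymbol{}, args...);
}
```

Calling scaleCall with two arguments reaches the first overload, and calling it with three reaches the second, without any change to the entry point.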

Registering a Generic Fallback

/* Generic dispatch function signature, used if no more specific specialization for the symbol is provided.
 * `input` and `output` must be one-dimensional, enforced by the requires clause, to provide a unified interface,
 * because std::transform used in the host CPU overload only supports one-dimensional memory.
 */
template<typename T_Queue, alpaka::concepts::IMdSpan T_Output, alpaka::concepts::IMdSpan T_Input>
constexpr void alpakaFnDispatch(
    AffineTransform,
    T_Queue&& queue,
    T_Output&& output,
    float scale,
    float shift,
    T_Input&& input) requires(concepts::Dim<ALPAKA_TYPEOF(input), 1u> && concepts::Dim<ALPAKA_TYPEOF(output), 1u>)
{
    // Forward all arguments because the function signature chosen here matches the alpaka transform interface.
    alpaka::onHost::transform(
        ALPAKA_FORWARD(queue),
        ALPAKA_FORWARD(output),
        ScalarFunc{AffineTransformOp{scale, shift}},
        ALPAKA_FORWARD(input));
}

This overload is the portable baseline. It works on every backend that can run the normal alpaka algorithm path, so it is a good default even when you later add CUDA-, HIP-, or SYCL-specific overloads.

Registering an API-Device-Specific Overload

/* This overload is used if the queue API is `api::Host` and the device kind is `deviceKind::Cpu`.
 * `input` and `output` must be one-dimensional, enforced by the requires clause, due to the limitations of
 * std::transform used for the implementation.
 */
template<typename T_Queue, alpaka::concepts::IMdSpan T_Output, alpaka::concepts::IMdSpan T_Input>
constexpr void alpakaFnDispatch(
    AffineTransform::Spec<alpaka::api::Host, alpaka::deviceKind::Cpu>,
    T_Queue&& queue,
    T_Output&& output,
    float scale,
    float shift,
    T_Input&& input) requires(concepts::Dim<ALPAKA_TYPEOF(input), 1u> && concepts::Dim<ALPAKA_TYPEOF(output), 1u>)
{
    /* Enqueue supports only const lambdas/functors, but the pointer must be writable. Therefore, copy the
     * pointer before the multidimensional span becomes const inside the lambda. For IMdSpan, constness is
     * propagated to the data.
     */
    auto outPtr = output.data();
    // Enqueue the operation as a host function to preserve the execution order within a non-blocking queue.
    queue.enqueueHostFn(
        [=]()
        {
            std::transform(
                input.data(),
                input.data() + input.getExtents().x(),
                outPtr,
                AffineTransformOp{scale, shift});
        });
}

This example uses std::transform as a small stand-in for a third-party backend function. The pattern is the same when the backend-specific code comes from a GPU vendor library. On CUDA, for example, this is where you would pass queue.getNativeHandle() to a library that expects a CUDA stream and then call the vendor routine there. This host-specific overload is intentionally constrained to 1D spans because the example forwards to std::transform over a single contiguous range.
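The pointer copy in the host overload can look surprising, so here is a small plain C++ sketch of the underlying language rule. PropagatingSpan is a hypothetical minimal type that propagates constness to its data, similar in spirit to what the code comment says about IMdSpan; it is not an alpaka type, and fillWithOnes is an illustrative helper.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical minimal span that propagates constness to its data.
struct PropagatingSpan
{
    float* ptr;
    std::size_t size;

    float* data() { return ptr; }
    float const* data() const { return ptr; }
};

inline void fillWithOnes(PropagatingSpan output)
{
    // Copy the writable pointer while `output` is still non-const.
    float* outPtr = output.data();

    // By-value captures are const inside a lambda that is not marked mutable.
    auto task = [=]()
    {
        // Inside the lambda `output` is const, so data() returns a pointer to const.
        auto readOnly = output.data();
        static_assert(std::is_same_v<decltype(readOnly), float const*>);

        // The captured copy of outPtr is `float* const`: the pointer itself is
        // const, but the memory it points to is still writable.
        for(std::size_t i = 0; i < output.size; ++i)
            outPtr[i] = 1.0f;
    };
    task();
}
```

Writing through output.data() inside the lambda would not compile in this sketch, which is why the writable pointer is taken before the lambda is created.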

The important part is the Spec<api, deviceKind> type:

- it states which backend the overload belongs to,
- it keeps the backend choice out of the call site,
- and it lets the same public function symbol dispatch differently for different queues and devices.

Calling the Function

The call itself stays simple. You pass the queue and the ordinary data arguments. alpaka looks at the queue’s specification and forwards the call to the best matching overload.

std::array<float, 5u> hostInput{1.f, 2.f, 3.f, 4.f, 5.f};
std::array<float, 5u> hostOutput{};

auto inputBuffer = onHost::allocLike(device, hostInput);
auto outputBuffer = onHost::allocLike(device, hostOutput);

onHost::memcpy(queue, inputBuffer, hostInput);

/* Call the function; the overload is dispatched based on the properties of the queue.
 *
 * You can also create an instance of the alpaka function symbol instead of using ::call().
 * This allows using a function symbol as an argument of a method.
 *
 * example: `vendorTutorial::AffineTransform{}(....)`
 */
vendorTutorial::AffineTransform::call(queue, outputBuffer, 2.0f, 0.5f, inputBuffer);

onHost::memcpy(queue, hostOutput, outputBuffer);
onHost::wait(queue);
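The comment in the snippet mentions that a symbol instance can be passed as an argument. The following plain C++ sketch shows why a type that offers both a static call() and an operator() is convenient for that. Doubler and applyTwice are hypothetical names chosen for illustration; the sketch does not reproduce what ALPAKA_FN_SYMBOL generates.

```cpp
// Hypothetical symbol-like type: callable through a static function
// and through an instance.
struct Doubler
{
    static int call(int value)
    {
        return 2 * value;
    }

    int operator()(int value) const
    {
        return call(value);
    }
};

// Generic helper that accepts any callable, including a symbol instance.
template<typename T_Fn>
int applyTwice(T_Fn fn, int value)
{
    return fn(fn(value));
}
```

Here applyTwice(Doubler{}, 3) doubles the value twice, while Doubler::call(3) is the direct form for call sites that do not need to pass the function around.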

alpaka’s function interface is not limited to host-side use. As shown in the function kernel tutorial, you can write kernels that can be specialized for a device kind and/or API. You can also use it to call vendor or third-party functions from within a kernel (onAcc). In this case, remember to mark the function with ALPAKA_FN_ACC or constexpr; otherwise some device compilers may fail to compile it.

Complete Source File

200_vendorInterop.cpp

/* Copyright 2026 René Widera
 * SPDX-License-Identifier: ISC
 */

#include "docsTest.hpp"

#include <alpaka/alpaka.hpp>

#include <catch2/catch_template_test_macros.hpp>
#include <catch2/catch_test_macros.hpp>

#include <algorithm>
#include <array>

using namespace alpaka;

namespace vendorTutorial
{
    struct AffineTransformOp
    {
        float scale;
        float shift;

        ALPAKA_FN_ACC auto operator()(float const& value) const -> float
        {
            return scale * value + shift;
        }
    };


    /* The function symbol is only defined without specifying the argument signature.
     * You need to provide at least a generic function dispatch signature for the symbol.
     */
    ALPAKA_FN_SYMBOL(AffineTransform);


    /* Generic dispatch function signature, used if no more specific specialization for the symbol is provided.
     * `input` and `output` must be one-dimensional, enforced by the requires clause, to provide a unified interface,
     * because std::transform used in the host CPU overload only supports one-dimensional memory.
     */
    template<typename T_Queue, alpaka::concepts::IMdSpan T_Output, alpaka::concepts::IMdSpan T_Input>
    constexpr void alpakaFnDispatch(
        AffineTransform,
        T_Queue&& queue,
        T_Output&& output,
        float scale,
        float shift,
        T_Input&& input) requires(concepts::Dim<ALPAKA_TYPEOF(input), 1u> && concepts::Dim<ALPAKA_TYPEOF(output), 1u>)
    {
        // Forward all arguments because the function signature chosen here matches the alpaka transform interface.
        alpaka::onHost::transform(
            ALPAKA_FORWARD(queue),
            ALPAKA_FORWARD(output),
            ScalarFunc{AffineTransformOp{scale, shift}},
            ALPAKA_FORWARD(input));
    }


    /* This overload is used if the queue API is `api::Host` and the device kind is `deviceKind::Cpu`.
     * `input` and `output` must be one-dimensional, enforced by the requires clause, due to the limitations of
     * std::transform used for the implementation.
     */
    template<typename T_Queue, alpaka::concepts::IMdSpan T_Output, alpaka::concepts::IMdSpan T_Input>
    constexpr void alpakaFnDispatch(
        AffineTransform::Spec<alpaka::api::Host, alpaka::deviceKind::Cpu>,
        T_Queue&& queue,
        T_Output&& output,
        float scale,
        float shift,
        T_Input&& input) requires(concepts::Dim<ALPAKA_TYPEOF(input), 1u> && concepts::Dim<ALPAKA_TYPEOF(output), 1u>)
    {
        /* Enqueue supports only const lambdas/functors, but the pointer must be writable. Therefore, copy the
         * pointer before the multidimensional span becomes const inside the lambda. For IMdSpan, constness is
         * propagated to the data.
         */
        auto outPtr = output.data();
        // Enqueue the operation as a host function to preserve the execution order within a non-blocking queue.
        queue.enqueueHostFn(
            [=]()
            {
                std::transform(
                    input.data(),
                    input.data() + input.getExtents().x(),
                    outPtr,
                    AffineTransformOp{scale, shift});
            });
    }

} // namespace vendorTutorial

TEMPLATE_LIST_TEST_CASE("tutorial vendor interop dispatch", "[docs]", docs::test::TestBackends)
{
    auto selector = onHost::makeDeviceSelector(TestType::makeDict()[object::deviceSpec]);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    std::array<float, 5u> hostInput{1.f, 2.f, 3.f, 4.f, 5.f};
    std::array<float, 5u> hostOutput{};

    auto inputBuffer = onHost::allocLike(device, hostInput);
    auto outputBuffer = onHost::allocLike(device, hostOutput);

    onHost::memcpy(queue, inputBuffer, hostInput);

    /* Call the function; the overload is dispatched based on the properties of the queue.
     *
     * You can also create an instance of the alpaka function symbol instead of using ::call().
     * This allows using a function symbol as an argument of a method.
     *
     * example: `vendorTutorial::AffineTransform{}(....)`
     */
    vendorTutorial::AffineTransform::call(queue, outputBuffer, 2.0f, 0.5f, inputBuffer);

    onHost::memcpy(queue, hostOutput, outputBuffer);
    onHost::wait(queue);

    CHECK(hostOutput[0] == 2.5f);
    CHECK(hostOutput[1] == 4.5f);
    CHECK(hostOutput[2] == 6.5f);
    CHECK(hostOutput[3] == 8.5f);
    CHECK(hostOutput[4] == 10.5f);
}