Random Numbers

Parallel codes often need random numbers for Monte Carlo methods, randomized initialization, sampling, or synthetic test data. alpaka provides random engines and distributions that can be used directly inside kernels.

Using Random Numbers in a Kernel

To avoid correlations between the random numbers generated by different threads, each thread should get its own deterministic engine state, created from a unique seed. The easiest way to do that is to derive the seed from a base seed and the thread index within the grid, as the kernels on this page do. The engine can then be used to generate as many random samples as needed from the chosen distribution.
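The seeding scheme can be tried out on the host before moving to a kernel. The following is a minimal sketch that uses the C++ standard library as a stand-in for the alpaka engine and distribution. The function name fillWithWorkers and the use of std::mt19937 are assumptions made for illustration only and are not part of alpaka.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical host-side stand-in for a kernel: every "worker" owns one engine
// seeded with baseSeed + workerIdx and fills the elements it is responsible for.
std::vector<float> fillWithWorkers(std::uint32_t baseSeed, std::uint32_t numWorkers, std::size_t numElements)
{
    assert(numWorkers > 0u);
    std::vector<float> out(numElements);
    for(std::uint32_t workerIdx = 0u; workerIdx < numWorkers; ++workerIdx)
    {
        // one engine per worker, created once and reused for all of its samples
        std::mt19937 engine(baseSeed + workerIdx);
        std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
        // strided loop: worker w handles elements w, w + numWorkers, w + 2 * numWorkers, ...
        for(std::size_t idx = workerIdx; idx < numElements; idx += numWorkers)
        {
            out[idx] = uniform(engine);
        }
    }
    return out;
}
```

Calling fillWithWorkers(1234u, 4u, 32u) twice yields identical vectors, which is what makes a run reproducible. One consequence of any seed-plus-index scheme is visible in the sketch: base seeds that differ by less than the number of workers reuse engine states. Worker 0 of a run with base seed 1235 starts from the same state as worker 1 of a run with base seed 1234. Independent runs should therefore use base seeds that are at least the worker count apart.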

alpaka supports the following distributions:

UniformReal<float or double> for samples in a bounded floating-point interval where all values can occur with equal probability.
NormalReal<float or double> for Gaussian-distributed samples with a chosen mean and standard deviation.

Uniform Random Numbers

struct UniformRandomKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto out, uint32_t seed)
        const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
        auto distribution = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            out[idx] = distribution(engine);
        }
    }
};

This example uses the engine rand::engine::Philox4x32x10 and the distribution rand::distribution::UniformReal<float> with the half-open interval [0, 1), selected by rand::interval::co. The configuration of intervals is explained in the Intervals section. The distribution is used to create one random value per element in the output data container.

Launching the Kernel:

onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
queue.enqueue(frameSpec, KernelBundle{UniformRandomKernel{}, randomBuffer, 1234u});

Uniform distributions are mostly used for probabilities, random offsets, randomized initialization, and rejection sampling. Normal distributions are used for noise models, perturbations around a mean value, and many Monte Carlo methods.

Normal Distribution

NormalReal generates Gaussian noise with a chosen mean and standard deviation. Unlike the uniform distribution, it keeps internal state, so distribution objects must not be shared between threads.

struct NormalNoiseKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        uint32_t seed,
        float mean,
        float stdDev) const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            rand::distribution::NormalReal<float> normal(mean, stdDev);
            out[idx] = normal(engine);
        }
    }
};

Launching the kernel is the same as before; only the kernel logic changes.

onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
queue.enqueue(frameSpec, KernelBundle{NormalNoiseKernel{}, randomBuffer, 2025u, 5.0f, 2.0f});
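To illustrate why a normal distribution may carry internal state, here is a minimal host-side sketch of the Box-Muller transform, which produces two Gaussian values per pair of uniform inputs and caches the second one. Whether NormalReal is implemented this way is an assumption. The class BoxMullerNormal is hypothetical and uses std::mt19937 instead of an alpaka engine.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Hypothetical Box-Muller sampler for illustration; this is NOT alpaka's NormalReal.
class BoxMullerNormal
{
public:
    BoxMullerNormal(float mean, float stdDev) : m_mean(mean), m_stdDev(stdDev)
    {
        assert(stdDev >= 0.0f);
    }

    float operator()(std::mt19937& engine)
    {
        if(m_hasCached)
        {
            // second value of the previous pair, no engine call needed
            m_hasCached = false;
            return m_mean + m_stdDev * m_cached;
        }
        constexpr float inv2Pow24 = 1.0f / 16777216.0f;
        constexpr float twoPi = 6.2831853f;
        // u1 lies in (0, 1] so that std::log stays finite, u2 lies in [0, 1)
        float const u1 = static_cast<float>((engine() >> 8) + 1u) * inv2Pow24;
        float const u2 = static_cast<float>(engine() >> 8) * inv2Pow24;
        float const radius = std::sqrt(-2.0f * std::log(u1));
        // keep the sine branch for the next call, return the cosine branch now
        m_cached = radius * std::sin(twoPi * u2);
        m_hasCached = true;
        return m_mean + m_stdDev * radius * std::cos(twoPi * u2);
    }

private:
    float m_mean;
    float m_stdDev;
    float m_cached = 0.0f;
    bool m_hasCached = false;
};
```

In this sketch the cached value belongs to exactly one engine stream. If two threads shared one object, they would race on the cache and could return the same value twice. A long-lived BoxMullerNormal consumes one engine output per sample on average, while a freshly constructed object per sample consumes two because its cache is always empty.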

Intervals

UniformReal supports four interval tags:

rand::interval::co gives [a, b)
rand::interval::oc gives (a, b]
rand::interval::cc gives [a, b]
rand::interval::oo gives (a, b)

The following kernel shows all four forms side by side.

struct IntervalExamplesKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto coValues,
        concepts::IMdSpan auto ocValues,
        concepts::IMdSpan auto ccValues,
        concepts::IMdSpan auto ooValues,
        uint32_t seed) const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{coValues.getExtents()}))
        {
            // 0 <= val <  1
            coValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}(engine);
            // 0 <  val <= 1
            ocValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oc}(engine);
            // 0 <= val <= 1
            ccValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::cc}(engine);
            // 0 <  val <  1
            ooValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oo}(engine);
        }
    }
};
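The difference between the four interval tags comes down to which endpoints a raw random integer can be mapped onto. The following host-side sketch shows one possible mapping from a 32-bit integer to each interval form. The helper names toCo, toOc, toCc, toOo and scaleTo are hypothetical, and the formulas are an assumption chosen for clarity, not alpaka's implementation.

```cpp
#include <cassert>
#include <cstdint>

// 2^-32 is a power of two, so multiplying by it is exact in double precision
constexpr double inv2Pow32 = 1.0 / 4294967296.0;

// [0, 1): largest result is (2^32 - 1) / 2^32, which is below 1
inline double toCo(std::uint32_t r)
{
    return static_cast<double>(r) * inv2Pow32;
}

// (0, 1]: shifting by one excludes 0 and lets the largest input reach 1
inline double toOc(std::uint32_t r)
{
    return (static_cast<double>(r) + 1.0) * inv2Pow32;
}

// [0, 1]: dividing by the largest input lets both endpoints occur
inline double toCc(std::uint32_t r)
{
    return static_cast<double>(r) / 4294967295.0;
}

// (0, 1): shifting by one half keeps every result strictly inside the interval
inline double toOo(std::uint32_t r)
{
    return (static_cast<double>(r) + 0.5) * inv2Pow32;
}

// Stretch a unit-interval value to the range between a and b.
// For general a and b, rounding can land on an endpoint the unit value excluded.
inline double scaleTo(double unit, double a, double b)
{
    assert(a < b);
    return a + (b - a) * unit;
}
```

Feeding the smallest and largest 32-bit values into each helper is a quick way to check the endpoint behavior: toCo(0u) is exactly 0, toOc(0xFFFFFFFFu) is exactly 1, and toOo never returns either endpoint.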

Monte Carlo Pi

A classic example is Monte Carlo estimation of pi. Draw points in the square [0, 1) x [0, 1), count how many land inside the unit quarter circle, and estimate pi from that ratio. The half-open interval matches array-style data access and avoids awkward endpoint corner cases.

In the example, each worker draws one point per output element and writes 1 if the point falls inside the quarter circle and 0 otherwise. A reduction then adds up all hits.

struct MonteCarloPiKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto hits, uint32_t seed)
        const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
        auto uniform = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{hits.getExtents()}))
        {
            auto x = uniform(engine);
            auto y = uniform(engine);
            hits[idx] = (x * x + y * y <= 1.0f) ? 1u : 0u;
        }
    }
};

The reduction happens on the same queue right after the kernel.

queue.enqueue(frameSpec, KernelBundle{MonteCarloPiKernel{}, hitBuffer, 2026u});
onHost::reduce(queue, exec, 0u, hitCountBuffer, std::plus{}, hitBuffer);

After copying back the single reduction result, the estimate itself is just the usual Monte Carlo formula.

auto estimatedPi = 4.0f * static_cast<float>(hostHitCount[0]) / static_cast<float>(numSamples);
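The formula follows from the area ratio: the quarter circle covers pi / 4 of the unit square, so hits / numSamples approximates pi / 4. A serial host-side reference is useful for cross-checking the kernel result. The sketch below assumes std::mt19937 as the engine, and the function name estimatePi is hypothetical. It will not reproduce the exact hit count of the Philox-based kernel, only a statistically comparable estimate.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical serial reference: draw numSamples points in [0, 1) x [0, 1)
// and count how many fall inside the unit quarter circle.
float estimatePi(std::uint32_t seed, std::uint32_t numSamples)
{
    assert(numSamples > 0u);
    std::mt19937 engine(seed);
    // 24 random bits scaled by 2^-24 give a float in the half-open interval [0, 1)
    constexpr float inv2Pow24 = 1.0f / 16777216.0f;
    std::uint32_t hits = 0u;
    for(std::uint32_t i = 0u; i < numSamples; ++i)
    {
        float const x = static_cast<float>(engine() >> 8) * inv2Pow24;
        float const y = static_cast<float>(engine() >> 8) * inv2Pow24;
        hits += (x * x + y * y <= 1.0f) ? 1u : 0u;
    }
    // the quarter circle covers pi / 4 of the unit square
    return 4.0f * static_cast<float>(hits) / static_cast<float>(numSamples);
}
```

The statistical error of the estimate shrinks with the square root of the sample count. With 16384 samples the standard deviation is roughly 0.013, so a test tolerance around 0.15 leaves a wide safety margin.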

Complete Source File

170_random.cpp

/* Copyright 2026 René Widera
 * SPDX-License-Identifier: ISC
 */

#include "docsTest.hpp"

#include <alpaka/alpaka.hpp>

#include <catch2/catch_approx.hpp>
#include <catch2/catch_template_test_macros.hpp>
#include <catch2/catch_test_macros.hpp>

#include <array>
#include <cmath>
#include <cstdint>
#include <functional>
#include <numeric>

using namespace alpaka;

struct UniformRandomKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto out, uint32_t seed)
        const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
        auto distribution = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            out[idx] = distribution(engine);
        }
    }
};


struct IntervalExamplesKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto coValues,
        concepts::IMdSpan auto ocValues,
        concepts::IMdSpan auto ccValues,
        concepts::IMdSpan auto ooValues,
        uint32_t seed) const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{coValues.getExtents()}))
        {
            // 0 <= val <  1
            coValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}(engine);
            // 0 <  val <= 1
            ocValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oc}(engine);
            // 0 <= val <= 1
            ccValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::cc}(engine);
            // 0 <  val <  1
            ooValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oo}(engine);
        }
    }
};


struct NormalNoiseKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        uint32_t seed,
        float mean,
        float stdDev) const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            rand::distribution::NormalReal<float> normal(mean, stdDev);
            out[idx] = normal(engine);
        }
    }
};


TEMPLATE_LIST_TEST_CASE("tutorial random numbers", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    std::array<float, 8u> hostValues{};
    auto randomBuffer = onHost::allocLike(device, hostValues);

    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
    queue.enqueue(frameSpec, KernelBundle{UniformRandomKernel{}, randomBuffer, 1234u});

    onHost::memcpy(queue, hostValues, randomBuffer);
    onHost::wait(queue);

    float sum = 0.0f;
    for(auto value : hostValues)
    {
        CHECK(value >= 0.0f);
        CHECK(value < 1.0f);
        sum += value;
    }

    CHECK(sum > 0.0f);
    CHECK(sum < 8.0f);
}

TEMPLATE_LIST_TEST_CASE("tutorial random intervals", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    std::array<float, 16u> hostCo{};
    std::array<float, 16u> hostOc{};
    std::array<float, 16u> hostCc{};
    std::array<float, 16u> hostOo{};

    auto coBuffer = onHost::allocLike(device, hostCo);
    auto ocBuffer = onHost::allocLike(device, hostOc);
    auto ccBuffer = onHost::allocLike(device, hostCc);
    auto ooBuffer = onHost::allocLike(device, hostOo);

    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, coBuffer.getExtents());
    queue.enqueue(frameSpec, KernelBundle{IntervalExamplesKernel{}, coBuffer, ocBuffer, ccBuffer, ooBuffer, 999u});

    onHost::memcpy(queue, hostCo, coBuffer);
    onHost::memcpy(queue, hostOc, ocBuffer);
    onHost::memcpy(queue, hostCc, ccBuffer);
    onHost::memcpy(queue, hostOo, ooBuffer);
    onHost::wait(queue);

    for(size_t i = 0; i < hostCo.size(); ++i)
    {
        CHECK(hostCo[i] >= 0.0f);
        CHECK(hostCo[i] < 1.0f);
        CHECK(hostOc[i] > 0.0f);
        CHECK(hostOc[i] <= 1.0f);
        CHECK(hostCc[i] >= 0.0f);
        CHECK(hostCc[i] <= 1.0f);
        CHECK(hostOo[i] > 0.0f);
        CHECK(hostOo[i] < 1.0f);
    }
}

TEMPLATE_LIST_TEST_CASE("tutorial random normal distribution", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    std::array<float, 64u> hostValues{};
    auto randomBuffer = onHost::allocLike(device, hostValues);

    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
    queue.enqueue(frameSpec, KernelBundle{NormalNoiseKernel{}, randomBuffer, 2025u, 5.0f, 2.0f});

    onHost::memcpy(queue, hostValues, randomBuffer);
    onHost::wait(queue);

    float mean = std::accumulate(hostValues.begin(), hostValues.end(), 0.0f) / static_cast<float>(hostValues.size());
    CHECK(mean > 4.0f);
    CHECK(mean < 6.0f);

    bool foundBelowMean = false;
    bool foundAboveMean = false;
    for(auto value : hostValues)
    {
        foundBelowMean = foundBelowMean || value < 5.0f;
        foundAboveMean = foundAboveMean || value > 5.0f;
    }
    CHECK(foundBelowMean);
    CHECK(foundAboveMean);
}

struct MonteCarloPiKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto hits, uint32_t seed)
        const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
        auto uniform = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{hits.getExtents()}))
        {
            auto x = uniform(engine);
            auto y = uniform(engine);
            hits[idx] = (x * x + y * y <= 1.0f) ? 1u : 0u;
        }
    }
};


TEMPLATE_LIST_TEST_CASE("tutorial monte carlo pi", "[docs]", docs::test::TestBackends)
{
    auto cfg = TestType::makeDict();
    auto deviceSpec = cfg[object::deviceSpec];
    auto exec = cfg[object::exec];

    auto selector = onHost::makeDeviceSelector(deviceSpec);
    if(!selector.isAvailable())
        return;
    onHost::concepts::Device auto device = selector.makeDevice(0);
    onHost::Queue queue = device.makeQueue(queueKind::blocking);

    constexpr uint32_t numSamples = 16384u;
    auto hitBuffer = onHost::alloc<uint32_t>(device, Vec{numSamples});
    auto hitCountBuffer = onHost::alloc<uint32_t>(device, Vec{1u});
    auto hostHitCount = onHost::allocHostLike(hitCountBuffer);

    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, hitBuffer.getExtents());

    queue.enqueue(frameSpec, KernelBundle{MonteCarloPiKernel{}, hitBuffer, 2026u});
    onHost::reduce(queue, exec, 0u, hitCountBuffer, std::plus{}, hitBuffer);

    onHost::memcpy(queue, hostHitCount, hitCountBuffer);
    onHost::wait(queue);

    auto estimatedPi = 4.0f * static_cast<float>(hostHitCount[0]) / static_cast<float>(numSamples);

    CHECK(estimatedPi == Catch::Approx(3.14159f).margin(0.15f));
}