Random Numbers

Parallel codes often need random numbers for Monte Carlo methods, randomized initialization, sampling, or synthetic test data. alpaka provides random engines and distributions that can be used directly inside kernels.

Using Random Numbers in a Kernel

To avoid correlations between random numbers generated by different threads each thread should get its own deterministic engine state generated by a unique seed. The easiest way to do that is to derive the seed from the loop index. The unique engine state can then be used to generate as many random samples as needed from the chosen distribution.

alpaka supports the following distributions:

  • UniformReal<float or double> for samples in a bounded floating-point interval where all values can occur with equal probability.

  • NormalReal<float or double> for Gaussian-distributed samples with a chosen mean and standard deviation.

Uniform Random Numbers

struct UniformRandomKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto out, uint32_t seed)
        const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
        auto distribution = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            out[idx] = distribution(engine);
        }
    }
};

This example the engine rand::engine::Philox4x32x10 and used the distribution rand::distribution::UniformReal<float> with the half-open interval [0, 1) interval rand::interval::co. The configuration of intervals is explained in the Intervals section. The distribution is used to create a random index per element in the output data container.

Launching the Kernel:

onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
queue.enqueue(frameSpec, KernelBundle{UniformRandomKernel{}, randomBuffer, 1234u});

Uniform distributions are mostly used for probabilities, random offsets, randomized initialization, and rejection sampling. Normal distributions are used for noise models, perturbations around a mean value, and many Monte Carlo methods.

Normal Distribution

NormalReal generates Gaussian noise with a chosen mean and standard deviation. Unlike the uniform distribution, it keeps internal state, therefore you should not share the distribution objects between threads.

struct NormalNoiseKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto out,
        uint32_t seed,
        float mean,
        float stdDev) const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
        {
            rand::distribution::NormalReal<float> normal(mean, stdDev);
            out[idx] = normal(engine);
        }
    }
};

Launching the kernel is the same as before; only the kernel logic changes.

onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
queue.enqueue(frameSpec, KernelBundle{NormalNoiseKernel{}, randomBuffer, 2025u, 5.0f, 2.0f});

Intervals

UniformReal supports four interval tags:

  • rand::interval::co gives [a, b)

  • rand::interval::oc gives (a, b]

  • rand::interval::cc gives [a, b]

  • rand::interval::oo gives (a, b)

The following kernel shows all four forms side by side.

struct IntervalExamplesKernel
{
    ALPAKA_FN_ACC void operator()(
        onAcc::concepts::Acc auto const& acc,
        concepts::IMdSpan auto coValues,
        concepts::IMdSpan auto ocValues,
        concepts::IMdSpan auto ccValues,
        concepts::IMdSpan auto ooValues,
        uint32_t seed) const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{coValues.getExtents()}))
        {
            // 0 <= val <  1
            coValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}(engine);
            // 0 <  val <= 1
            ocValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oc}(engine);
            // 0 <= val <= 1
            ccValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::cc}(engine);
            // 0 <  val <  1
            ooValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oo}(engine);
        }
    }
};

Monte Carlo Pi

A classic example is Monte Carlo estimation of pi. Draw points in the square [0, 1) x [0, 1), count how many land inside the unit quarter circle, and estimate pi from that ratio. The half-open interval matches array-style data access avoids awkward endpoint corner cases.

In the example each worker draws one point, writes 1 if the point falls inside the quarter circle, and then a reduction adds up all hits.

struct MonteCarloPiKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto hits, uint32_t seed)
        const
    {
        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
        // globally unique seed created from a base seed and the thread index within the grid
        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
        auto uniform = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};

        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{hits.getExtents()}))
        {
            auto x = uniform(engine);
            auto y = uniform(engine);
            hits[idx] = (x * x + y * y <= 1.0f) ? 1u : 0u;
        }
    }
};

The reduction happens on the same queue right after the kernel.

queue.enqueue(frameSpec, KernelBundle{MonteCarloPiKernel{}, hitBuffer, 2026u});
onHost::reduce(queue, exec, 0u, hitCountBuffer, std::plus{}, hitBuffer);

After copying back the single reduction result, the estimate itself is just the usual Monte Carlo formula.

auto estimatedPi = 4.0f * static_cast<float>(hostHitCount[0]) / static_cast<float>(numSamples);

Complete Source File

170_random.cpp
  1/* Copyright 2026 René Widera
  2 * SPDX-License-Identifier: ISC
  3 */
  4
  5#include "docsTest.hpp"
  6
  7#include <alpaka/alpaka.hpp>
  8
  9#include <catch2/catch_approx.hpp>
 10#include <catch2/catch_template_test_macros.hpp>
 11#include <catch2/catch_test_macros.hpp>
 12
 13#include <array>
 14#include <cmath>
 15#include <numeric>
 16
 17using namespace alpaka;
 18
 19struct UniformRandomKernel
 20{
 21    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto out, uint32_t seed)
 22        const
 23    {
 24        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
 25        // globally unique seed created from a base seed and the thread index within the grid
 26        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
 27        auto distribution = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};
 28
 29        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
 30        {
 31            out[idx] = distribution(engine);
 32        }
 33    }
 34};
 35
 36
 37struct IntervalExamplesKernel
 38{
 39    ALPAKA_FN_ACC void operator()(
 40        onAcc::concepts::Acc auto const& acc,
 41        concepts::IMdSpan auto coValues,
 42        concepts::IMdSpan auto ocValues,
 43        concepts::IMdSpan auto ccValues,
 44        concepts::IMdSpan auto ooValues,
 45        uint32_t seed) const
 46    {
 47        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
 48        // globally unique seed created from a base seed and the thread index within the grid
 49        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
 50
 51        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{coValues.getExtents()}))
 52        {
 53            // 0 <= val <  1
 54            coValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}(engine);
 55            // 0 <  val <= 1
 56            ocValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oc}(engine);
 57            // 0 <= val <= 1
 58            ccValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::cc}(engine);
 59            // 0 <  val <  1
 60            ooValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oo}(engine);
 61        }
 62    }
 63};
 64
 65
 66struct NormalNoiseKernel
 67{
 68    ALPAKA_FN_ACC void operator()(
 69        onAcc::concepts::Acc auto const& acc,
 70        concepts::IMdSpan auto out,
 71        uint32_t seed,
 72        float mean,
 73        float stdDev) const
 74    {
 75        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
 76        // globally unique seed created from a base seed and the thread index within the grid
 77        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
 78
 79        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
 80        {
 81            rand::distribution::NormalReal<float> normal(mean, stdDev);
 82            out[idx] = normal(engine);
 83        }
 84    }
 85};
 86
 87
 88TEMPLATE_LIST_TEST_CASE("tutorial random numbers", "[docs]", docs::test::TestBackends)
 89{
 90    auto cfg = TestType::makeDict();
 91    auto deviceSpec = cfg[object::deviceSpec];
 92    auto exec = cfg[object::exec];
 93
 94    auto selector = onHost::makeDeviceSelector(deviceSpec);
 95    if(!selector.isAvailable())
 96        return;
 97    onHost::concepts::Device auto device = selector.makeDevice(0);
 98    onHost::Queue queue = device.makeQueue(queueKind::blocking);
 99
100    std::array<float, 8u> hostValues{};
101    auto randomBuffer = onHost::allocLike(device, hostValues);
102
103    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
104    queue.enqueue(frameSpec, KernelBundle{UniformRandomKernel{}, randomBuffer, 1234u});
105
106    onHost::memcpy(queue, hostValues, randomBuffer);
107    onHost::wait(queue);
108
109    float sum = 0.0f;
110    for(auto value : hostValues)
111    {
112        CHECK(value >= 0.0f);
113        CHECK(value < 1.0f);
114        sum += value;
115    }
116
117    CHECK(sum > 0.0f);
118    CHECK(sum < 8.0f);
119}
120
121TEMPLATE_LIST_TEST_CASE("tutorial random intervals", "[docs]", docs::test::TestBackends)
122{
123    auto cfg = TestType::makeDict();
124    auto deviceSpec = cfg[object::deviceSpec];
125    auto exec = cfg[object::exec];
126
127    auto selector = onHost::makeDeviceSelector(deviceSpec);
128    if(!selector.isAvailable())
129        return;
130    onHost::concepts::Device auto device = selector.makeDevice(0);
131    onHost::Queue queue = device.makeQueue(queueKind::blocking);
132
133    std::array<float, 16u> hostCo{};
134    std::array<float, 16u> hostOc{};
135    std::array<float, 16u> hostCc{};
136    std::array<float, 16u> hostOo{};
137
138    auto coBuffer = onHost::allocLike(device, hostCo);
139    auto ocBuffer = onHost::allocLike(device, hostOc);
140    auto ccBuffer = onHost::allocLike(device, hostCc);
141    auto ooBuffer = onHost::allocLike(device, hostOo);
142
143    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, coBuffer.getExtents());
144    queue.enqueue(frameSpec, KernelBundle{IntervalExamplesKernel{}, coBuffer, ocBuffer, ccBuffer, ooBuffer, 999u});
145
146    onHost::memcpy(queue, hostCo, coBuffer);
147    onHost::memcpy(queue, hostOc, ocBuffer);
148    onHost::memcpy(queue, hostCc, ccBuffer);
149    onHost::memcpy(queue, hostOo, ooBuffer);
150    onHost::wait(queue);
151
152    for(size_t i = 0; i < hostCo.size(); ++i)
153    {
154        CHECK(hostCo[i] >= 0.0f);
155        CHECK(hostCo[i] < 1.0f);
156        CHECK(hostOc[i] > 0.0f);
157        CHECK(hostOc[i] <= 1.0f);
158        CHECK(hostCc[i] >= 0.0f);
159        CHECK(hostCc[i] <= 1.0f);
160        CHECK(hostOo[i] > 0.0f);
161        CHECK(hostOo[i] < 1.0f);
162    }
163}
164
165TEMPLATE_LIST_TEST_CASE("tutorial random normal distribution", "[docs]", docs::test::TestBackends)
166{
167    auto cfg = TestType::makeDict();
168    auto deviceSpec = cfg[object::deviceSpec];
169    auto exec = cfg[object::exec];
170
171    auto selector = onHost::makeDeviceSelector(deviceSpec);
172    if(!selector.isAvailable())
173        return;
174    onHost::concepts::Device auto device = selector.makeDevice(0);
175    onHost::Queue queue = device.makeQueue(queueKind::blocking);
176
177    std::array<float, 64u> hostValues{};
178    auto randomBuffer = onHost::allocLike(device, hostValues);
179
180    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
181    queue.enqueue(frameSpec, KernelBundle{NormalNoiseKernel{}, randomBuffer, 2025u, 5.0f, 2.0f});
182
183    onHost::memcpy(queue, hostValues, randomBuffer);
184    onHost::wait(queue);
185
186    float mean = std::accumulate(hostValues.begin(), hostValues.end(), 0.0f) / static_cast<float>(hostValues.size());
187    CHECK(mean > 4.0f);
188    CHECK(mean < 6.0f);
189
190    bool foundBelowMean = false;
191    bool foundAboveMean = false;
192    for(auto value : hostValues)
193    {
194        foundBelowMean = foundBelowMean || value < 5.0f;
195        foundAboveMean = foundAboveMean || value > 5.0f;
196    }
197    CHECK(foundBelowMean);
198    CHECK(foundAboveMean);
199}
200
201struct MonteCarloPiKernel
202{
203    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto hits, uint32_t seed)
204        const
205    {
206        auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
207        // globally unique seed created from a base seed and the thread index within the grid
208        rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
209        auto uniform = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};
210
211        for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{hits.getExtents()}))
212        {
213            auto x = uniform(engine);
214            auto y = uniform(engine);
215            hits[idx] = (x * x + y * y <= 1.0f) ? 1u : 0u;
216        }
217    }
218};
219
220
221TEMPLATE_LIST_TEST_CASE("tutorial monte carlo pi", "[docs]", docs::test::TestBackends)
222{
223    auto cfg = TestType::makeDict();
224    auto deviceSpec = cfg[object::deviceSpec];
225    auto exec = cfg[object::exec];
226
227    auto selector = onHost::makeDeviceSelector(deviceSpec);
228    if(!selector.isAvailable())
229        return;
230    onHost::concepts::Device auto device = selector.makeDevice(0);
231    onHost::Queue queue = device.makeQueue(queueKind::blocking);
232
233    constexpr uint32_t numSamples = 16384u;
234    auto hitBuffer = onHost::alloc<uint32_t>(device, Vec{numSamples});
235    auto hitCountBuffer = onHost::alloc<uint32_t>(device, Vec{1u});
236    auto hostHitCount = onHost::allocHostLike(hitCountBuffer);
237
238    onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, hitBuffer.getExtents());
239
240    queue.enqueue(frameSpec, KernelBundle{MonteCarloPiKernel{}, hitBuffer, 2026u});
241    onHost::reduce(queue, exec, 0u, hitCountBuffer, std::plus{}, hitBuffer);
242
243    onHost::memcpy(queue, hostHitCount, hitCountBuffer);
244    onHost::wait(queue);
245
246    auto estimatedPi = 4.0f * static_cast<float>(hostHitCount[0]) / static_cast<float>(numSamples);
247
248    CHECK(estimatedPi == Catch::Approx(3.14159f).margin(0.15f));
249}