Random Numbers
Parallel codes often need random numbers for Monte Carlo methods, randomized initialization, sampling, or synthetic test data. alpaka provides random engines and distributions that can be used directly inside kernels.
Using Random Numbers in a Kernel
To avoid correlations between random numbers generated by different threads each thread should get its own deterministic engine state generated by a unique seed. The easiest way to do that is to derive the seed from the loop index. The unique engine state can then be used to generate as many random samples as needed from the chosen distribution.
alpaka supports the following distributions:
UniformReal<float or double>for samples in a bounded floating-point interval where all values can occur with equal probability.NormalReal<float or double>for Gaussian-distributed samples with a chosen mean and standard deviation.
Uniform Random Numbers
struct UniformRandomKernel { ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto out, uint32_t seed) const { auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads); // globally unique seed created from a base seed and the thread index within the grid rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid); auto distribution = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}; for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()})) { out[idx] = distribution(engine); } } };
This example the engine rand::engine::Philox4x32x10 and used the distribution rand::distribution::UniformReal<float> with the half-open interval [0, 1) interval rand::interval::co.
The configuration of intervals is explained in the Intervals section.
The distribution is used to create a random index per element in the output data container.
Launching the Kernel:
onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents()); queue.enqueue(frameSpec, KernelBundle{UniformRandomKernel{}, randomBuffer, 1234u});
Uniform distributions are mostly used for probabilities, random offsets, randomized initialization, and rejection sampling. Normal distributions are used for noise models, perturbations around a mean value, and many Monte Carlo methods.
Normal Distribution
NormalReal generates Gaussian noise with a chosen mean and standard deviation.
Unlike the uniform distribution, it keeps internal state, therefore you should not share the distribution objects between threads.
struct NormalNoiseKernel { ALPAKA_FN_ACC void operator()( onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto out, uint32_t seed, float mean, float stdDev) const { auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads); // globally unique seed created from a base seed and the thread index within the grid rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid); for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()})) { rand::distribution::NormalReal<float> normal(mean, stdDev); out[idx] = normal(engine); } } };
Launching the kernel is the same as before; only the kernel logic changes.
onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents()); queue.enqueue(frameSpec, KernelBundle{NormalNoiseKernel{}, randomBuffer, 2025u, 5.0f, 2.0f});
Intervals
UniformReal supports four interval tags:
rand::interval::cogives[a, b)rand::interval::ocgives(a, b]rand::interval::ccgives[a, b]rand::interval::oogives(a, b)
The following kernel shows all four forms side by side.
struct IntervalExamplesKernel { ALPAKA_FN_ACC void operator()( onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto coValues, concepts::IMdSpan auto ocValues, concepts::IMdSpan auto ccValues, concepts::IMdSpan auto ooValues, uint32_t seed) const { auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads); // globally unique seed created from a base seed and the thread index within the grid rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid); for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{coValues.getExtents()})) { // 0 <= val < 1 coValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}(engine); // 0 < val <= 1 ocValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oc}(engine); // 0 <= val <= 1 ccValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::cc}(engine); // 0 < val < 1 ooValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oo}(engine); } } };
Monte Carlo Pi
A classic example is Monte Carlo estimation of pi.
Draw points in the square [0, 1) x [0, 1), count how many land inside the unit quarter circle, and estimate pi from that ratio.
The half-open interval matches array-style data access avoids awkward endpoint corner cases.
In the example each worker draws one point, writes 1 if the point falls inside the quarter circle, and then a reduction adds up all hits.
struct MonteCarloPiKernel { ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto hits, uint32_t seed) const { auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads); // globally unique seed created from a base seed and the thread index within the grid rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid); auto uniform = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}; for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{hits.getExtents()})) { auto x = uniform(engine); auto y = uniform(engine); hits[idx] = (x * x + y * y <= 1.0f) ? 1u : 0u; } } };
The reduction happens on the same queue right after the kernel.
queue.enqueue(frameSpec, KernelBundle{MonteCarloPiKernel{}, hitBuffer, 2026u}); onHost::reduce(queue, exec, 0u, hitCountBuffer, std::plus{}, hitBuffer);
After copying back the single reduction result, the estimate itself is just the usual Monte Carlo formula.
auto estimatedPi = 4.0f * static_cast<float>(hostHitCount[0]) / static_cast<float>(numSamples);
Complete Source File
170_random.cpp
1/* Copyright 2026 René Widera
2 * SPDX-License-Identifier: ISC
3 */
4
5#include "docsTest.hpp"
6
7#include <alpaka/alpaka.hpp>
8
9#include <catch2/catch_approx.hpp>
10#include <catch2/catch_template_test_macros.hpp>
11#include <catch2/catch_test_macros.hpp>
12
13#include <array>
14#include <cmath>
15#include <numeric>
16
17using namespace alpaka;
18
19struct UniformRandomKernel
20{
21 ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto out, uint32_t seed)
22 const
23 {
24 auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
25 // globally unique seed created from a base seed and the thread index within the grid
26 rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
27 auto distribution = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};
28
29 for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
30 {
31 out[idx] = distribution(engine);
32 }
33 }
34};
35
36
37struct IntervalExamplesKernel
38{
39 ALPAKA_FN_ACC void operator()(
40 onAcc::concepts::Acc auto const& acc,
41 concepts::IMdSpan auto coValues,
42 concepts::IMdSpan auto ocValues,
43 concepts::IMdSpan auto ccValues,
44 concepts::IMdSpan auto ooValues,
45 uint32_t seed) const
46 {
47 auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
48 // globally unique seed created from a base seed and the thread index within the grid
49 rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
50
51 for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{coValues.getExtents()}))
52 {
53 // 0 <= val < 1
54 coValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co}(engine);
55 // 0 < val <= 1
56 ocValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oc}(engine);
57 // 0 <= val <= 1
58 ccValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::cc}(engine);
59 // 0 < val < 1
60 ooValues[idx] = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::oo}(engine);
61 }
62 }
63};
64
65
66struct NormalNoiseKernel
67{
68 ALPAKA_FN_ACC void operator()(
69 onAcc::concepts::Acc auto const& acc,
70 concepts::IMdSpan auto out,
71 uint32_t seed,
72 float mean,
73 float stdDev) const
74 {
75 auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
76 // globally unique seed created from a base seed and the thread index within the grid
77 rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
78
79 for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{out.getExtents()}))
80 {
81 rand::distribution::NormalReal<float> normal(mean, stdDev);
82 out[idx] = normal(engine);
83 }
84 }
85};
86
87
88TEMPLATE_LIST_TEST_CASE("tutorial random numbers", "[docs]", docs::test::TestBackends)
89{
90 auto cfg = TestType::makeDict();
91 auto deviceSpec = cfg[object::deviceSpec];
92 auto exec = cfg[object::exec];
93
94 auto selector = onHost::makeDeviceSelector(deviceSpec);
95 if(!selector.isAvailable())
96 return;
97 onHost::concepts::Device auto device = selector.makeDevice(0);
98 onHost::Queue queue = device.makeQueue(queueKind::blocking);
99
100 std::array<float, 8u> hostValues{};
101 auto randomBuffer = onHost::allocLike(device, hostValues);
102
103 onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
104 queue.enqueue(frameSpec, KernelBundle{UniformRandomKernel{}, randomBuffer, 1234u});
105
106 onHost::memcpy(queue, hostValues, randomBuffer);
107 onHost::wait(queue);
108
109 float sum = 0.0f;
110 for(auto value : hostValues)
111 {
112 CHECK(value >= 0.0f);
113 CHECK(value < 1.0f);
114 sum += value;
115 }
116
117 CHECK(sum > 0.0f);
118 CHECK(sum < 8.0f);
119}
120
121TEMPLATE_LIST_TEST_CASE("tutorial random intervals", "[docs]", docs::test::TestBackends)
122{
123 auto cfg = TestType::makeDict();
124 auto deviceSpec = cfg[object::deviceSpec];
125 auto exec = cfg[object::exec];
126
127 auto selector = onHost::makeDeviceSelector(deviceSpec);
128 if(!selector.isAvailable())
129 return;
130 onHost::concepts::Device auto device = selector.makeDevice(0);
131 onHost::Queue queue = device.makeQueue(queueKind::blocking);
132
133 std::array<float, 16u> hostCo{};
134 std::array<float, 16u> hostOc{};
135 std::array<float, 16u> hostCc{};
136 std::array<float, 16u> hostOo{};
137
138 auto coBuffer = onHost::allocLike(device, hostCo);
139 auto ocBuffer = onHost::allocLike(device, hostOc);
140 auto ccBuffer = onHost::allocLike(device, hostCc);
141 auto ooBuffer = onHost::allocLike(device, hostOo);
142
143 onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, coBuffer.getExtents());
144 queue.enqueue(frameSpec, KernelBundle{IntervalExamplesKernel{}, coBuffer, ocBuffer, ccBuffer, ooBuffer, 999u});
145
146 onHost::memcpy(queue, hostCo, coBuffer);
147 onHost::memcpy(queue, hostOc, ocBuffer);
148 onHost::memcpy(queue, hostCc, ccBuffer);
149 onHost::memcpy(queue, hostOo, ooBuffer);
150 onHost::wait(queue);
151
152 for(size_t i = 0; i < hostCo.size(); ++i)
153 {
154 CHECK(hostCo[i] >= 0.0f);
155 CHECK(hostCo[i] < 1.0f);
156 CHECK(hostOc[i] > 0.0f);
157 CHECK(hostOc[i] <= 1.0f);
158 CHECK(hostCc[i] >= 0.0f);
159 CHECK(hostCc[i] <= 1.0f);
160 CHECK(hostOo[i] > 0.0f);
161 CHECK(hostOo[i] < 1.0f);
162 }
163}
164
165TEMPLATE_LIST_TEST_CASE("tutorial random normal distribution", "[docs]", docs::test::TestBackends)
166{
167 auto cfg = TestType::makeDict();
168 auto deviceSpec = cfg[object::deviceSpec];
169 auto exec = cfg[object::exec];
170
171 auto selector = onHost::makeDeviceSelector(deviceSpec);
172 if(!selector.isAvailable())
173 return;
174 onHost::concepts::Device auto device = selector.makeDevice(0);
175 onHost::Queue queue = device.makeQueue(queueKind::blocking);
176
177 std::array<float, 64u> hostValues{};
178 auto randomBuffer = onHost::allocLike(device, hostValues);
179
180 onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, randomBuffer.getExtents());
181 queue.enqueue(frameSpec, KernelBundle{NormalNoiseKernel{}, randomBuffer, 2025u, 5.0f, 2.0f});
182
183 onHost::memcpy(queue, hostValues, randomBuffer);
184 onHost::wait(queue);
185
186 float mean = std::accumulate(hostValues.begin(), hostValues.end(), 0.0f) / static_cast<float>(hostValues.size());
187 CHECK(mean > 4.0f);
188 CHECK(mean < 6.0f);
189
190 bool foundBelowMean = false;
191 bool foundAboveMean = false;
192 for(auto value : hostValues)
193 {
194 foundBelowMean = foundBelowMean || value < 5.0f;
195 foundAboveMean = foundAboveMean || value > 5.0f;
196 }
197 CHECK(foundBelowMean);
198 CHECK(foundAboveMean);
199}
200
201struct MonteCarloPiKernel
202{
203 ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc, concepts::IMdSpan auto hits, uint32_t seed)
204 const
205 {
206 auto const [threadIdxInGrid] = acc.getIdxWithin(alpaka::onAcc::origin::grid, alpaka::onAcc::unit::threads);
207 // globally unique seed created from a base seed and the thread index within the grid
208 rand::engine::Philox4x32x10 engine(seed + threadIdxInGrid);
209 auto uniform = rand::distribution::UniformReal{0.0f, 1.0f, rand::interval::co};
210
211 for(auto [idx] : onAcc::makeIdxMap(acc, onAcc::worker::threadsInGrid, IdxRange{hits.getExtents()}))
212 {
213 auto x = uniform(engine);
214 auto y = uniform(engine);
215 hits[idx] = (x * x + y * y <= 1.0f) ? 1u : 0u;
216 }
217 }
218};
219
220
221TEMPLATE_LIST_TEST_CASE("tutorial monte carlo pi", "[docs]", docs::test::TestBackends)
222{
223 auto cfg = TestType::makeDict();
224 auto deviceSpec = cfg[object::deviceSpec];
225 auto exec = cfg[object::exec];
226
227 auto selector = onHost::makeDeviceSelector(deviceSpec);
228 if(!selector.isAvailable())
229 return;
230 onHost::concepts::Device auto device = selector.makeDevice(0);
231 onHost::Queue queue = device.makeQueue(queueKind::blocking);
232
233 constexpr uint32_t numSamples = 16384u;
234 auto hitBuffer = onHost::alloc<uint32_t>(device, Vec{numSamples});
235 auto hitCountBuffer = onHost::alloc<uint32_t>(device, Vec{1u});
236 auto hostHitCount = onHost::allocHostLike(hitCountBuffer);
237
238 onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec, hitBuffer.getExtents());
239
240 queue.enqueue(frameSpec, KernelBundle{MonteCarloPiKernel{}, hitBuffer, 2026u});
241 onHost::reduce(queue, exec, 0u, hitCountBuffer, std::plus{}, hitBuffer);
242
243 onHost::memcpy(queue, hostHitCount, hitCountBuffer);
244 onHost::wait(queue);
245
246 auto estimatedPi = 4.0f * static_cast<float>(hostHitCount[0]) / static_cast<float>(numSamples);
247
248 CHECK(estimatedPi == Catch::Approx(3.14159f).margin(0.15f));
249}