Cheatsheet

General

Getting alpaka: https://github.com/alpaka-group/alpaka3
Issue tracker, questions, support: https://github.com/alpaka-group/alpaka3/issues
All alpaka names are in namespace alpaka and header file alpaka/alpaka.hpp

This document assumes

#include <alpaka/alpaka.hpp>
namespace myProject {
    using namespace alpaka;
}

Warning

Placing using namespace alpaka; in the global namespace should be avoided because of possible side effects with other libraries.

All methods and classes in the alpaka namespace can be called from the cpu controller thread (named host) and from the compute device.

alpaka::onHost can only be called from host.
alpaka::onAcc can only be called from within a kernel running on the compute device.

Methods starting with onHost::make (e.g., onHost::makeHostDevice()) create handles to instances where the copy is only a shallow copy and not a deep copy. Methods starting with get (e.g., onHost::getExtents(…)) provide access to properties of an instance.
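The shallow-copy behaviour of handles can be pictured with a standard-library analogy. This is only a sketch: Handle, makeHandle and sharesStorage are hypothetical names built on std::shared_ptr, not alpaka types, and alpaka's internal implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a handle type; not an alpaka class.
struct Handle
{
    std::shared_ptr<std::vector<int>> storage;
};

// Creates a handle that owns freshly allocated, zero-initialized storage.
inline Handle makeHandle(std::size_t n)
{
    return Handle{std::make_shared<std::vector<int>>(n, 0)};
}

// Copying a handle copies only the pointer, so both handles refer to the same data.
inline bool sharesStorage(Handle const& a, Handle const& b)
{
    assert(a.storage && b.storage);
    return a.storage == b.storage;
}
```

Writing through a copy of such a handle is visible through the original, which is the practical consequence of a shallow copy.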

Accelerator, Platform and Device

Define in-kernel thread indexing type

// 1u, 2u, 3u, ...
constexpr uint32_t dim = 2u;
// uint32_t, size_t
using IdxType = size_t;
using DataType = int;

Usage of multi-dimensional vectors required for extents or indexing

// Use alpaka vector as a static array for the extents
concepts::Vector auto extent1D = Vec{value};
concepts::Vector auto extent2D = Vec{valueY, valueX};
// truly compile time known values
concepts::CVector auto extent3D = CVec<IdxType, valueZ, valueY, valueX>{};

Access components of and destructure multi-dimensional indices and extents

auto extentZ = extent3D[0];
auto [z, y, x] = extent3D;

Linearize multi-dimensional vectors

std::integral auto linearIdx = linearize(extent3D, idx3D);

Map linear index to multi-dimensional index

concepts::Vector auto idxMd = mapToND(extent4D, scalar);
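The arithmetic behind linearization and its inverse can be sketched in plain C++. The sketch assumes a row-major layout in which the last component (x) varies fastest, matching the {z, y, x} ordering used above; Vec3, linearize3D and mapTo3D are hypothetical names, not alpaka functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 3D index/extent type ordered {z, y, x}; not alpaka's Vec.
using Vec3 = std::array<std::size_t, 3>;

// Row-major linearization: the last component (x) varies fastest.
inline std::size_t linearize3D(Vec3 const& extent, Vec3 const& idx)
{
    assert(idx[0] < extent[0] && idx[1] < extent[1] && idx[2] < extent[2]);
    return (idx[0] * extent[1] + idx[1]) * extent[2] + idx[2];
}

// Inverse mapping: recover {z, y, x} from a linear index.
inline Vec3 mapTo3D(Vec3 const& extent, std::size_t linear)
{
    assert(linear < extent[0] * extent[1] * extent[2]);
    std::size_t const x = linear % extent[2];
    std::size_t const y = (linear / extent[2]) % extent[1];
    std::size_t const z = linear / (extent[2] * extent[1]);
    return Vec3{z, y, x};
}
```

For an extent of {2, 3, 4}, the index {1, 2, 3} maps to 23, and mapping 23 back yields {1, 2, 3} again.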

Available apis

api::host
api::cuda
api::hip
api::oneApi

Device kinds

deviceKind::cpu
deviceKind::amdGpu
deviceKind::nvidiaGpu
deviceKind::intelGpu

Executors

exec::cpuSerial
exec::cpuOmpBlocks
exec::cpuTbbBlocks
exec::gpuCuda
exec::gpuHip
exec::oneApi

Create device selector and select a device by index

auto devSelector = onHost::makeDeviceSelector(api, deviceKind);
if(devSelector.getDeviceCount() == 0)
    throw std::runtime_error("No device found!");
auto device = devSelector.makeDevice(index);

Queue and Events

Create a queue for a device

// default queue is non blocking
auto queue = device.makeQueue();
auto nonBlockingQueue = device.makeQueue(queueKind::nonBlocking);
auto blockingQueue = device.makeQueue(queueKind::blocking);

Put a task for execution

queue.enqueueHostFn(task);
queue.enqueueHostFnDeferred(task);

Wait for all operations in the queue

onHost::wait(queue);

Check if a queue is empty

bool isQueueEmpty = queue.isEmpty();

Create an event

auto event = device.makeEvent();

Put an event to the queue

queue.enqueue(event);

Check if the event is completed

event.isComplete();

Wait for the event (and all operations put to the same queue before it)

onHost::wait(event);
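The ordering guarantee of events (an event completes only after everything enqueued before it) can be modelled with a small single-threaded FIFO. ToyQueue and its members are hypothetical and only illustrate the semantics; this is not how alpaka implements queues, which may run asynchronously.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical single-threaded model of an in-order queue; not alpaka's implementation.
class ToyQueue
{
public:
    // Store the task; it runs later, in FIFO order.
    void enqueue(std::function<void()> task)
    {
        m_tasks.push_back(std::move(task));
    }

    // An event is modelled as a flag that is set once the queue reaches it.
    std::shared_ptr<bool> enqueueEvent()
    {
        auto done = std::make_shared<bool>(false);
        m_tasks.push_back([done] { *done = true; });
        return done;
    }

    bool isEmpty() const
    {
        return m_tasks.empty();
    }

    // Run tasks in order until the given event has completed.
    void waitFor(std::shared_ptr<bool> const& event)
    {
        assert(event);
        while(!*event && !m_tasks.empty())
            runNext();
    }

    // Run every remaining task.
    void wait()
    {
        while(!m_tasks.empty())
            runNext();
    }

private:
    void runNext()
    {
        auto task = std::move(m_tasks.front());
        m_tasks.pop_front();
        task();
    }

    std::deque<std::function<void()>> m_tasks;
};
```

Waiting on the event runs only the tasks enqueued before it; tasks enqueued afterwards stay pending until the whole queue is waited on.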

Memory

Memory allocation and transfers are symmetric for host and devices; both are done via the alpaka API.

Allocate a shared buffer in host memory

// Allocate memory for the alpaka buffer, which is a dynamic 3-dimensional array
// Memory allocations support any dimensionality
concepts::IBuffer auto hostBuffer = onHost::allocHost<DataType>(extent3D);

Create a view to host memory represented by a pointer

auto extent = Vec{numElements};
DataType* ptr = externPtr;
concepts::IView auto hostView = makeView(api::host, ptr, extent);

Create a view to host std::vector

std::vector vec = std::vector<DataType>(42u);
// the api is not required, std::vector is assumed to be api::host
// a non-owning view is usable within a kernel and on the host, therefore no namespace 'onHost' is required
auto hostView = makeView(vec);

Create a view to host std::array

std::array array = std::array<DataType, 2>{42u, 23};
// call within host code: api::host is automatically assumed
concepts::IView auto hostView = makeView(array);
// call from within a cuda kernel: api::cuda is automatically assumed
concepts::IView auto deviceView = makeView(array);

Get a raw pointer to the data of a buffer or view

DataType* rawPtr = onHost::data(buffer);

Get the pitches of a view

// number of bytes to the next element along the pitch dimension
concepts::Vector auto bufferPitches = onHost::getPitches(buffer);
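As a rough illustration of what pitches mean, the following sketch computes byte pitches for a densely packed 3D array and uses them to locate an element. It assumes no row padding, whereas real device allocations may pad rows; densePitches and byteOffset are hypothetical helpers, not alpaka functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Byte pitches of a densely packed 3D array with extents {z, y, x}.
// pitches[i] is the number of bytes to step by one element along dimension i.
// Assumption: no row padding; real device allocations may pad rows.
template<typename T>
std::array<std::size_t, 3> densePitches(std::array<std::size_t, 3> const& extent)
{
    assert(extent[0] > 0 && extent[1] > 0 && extent[2] > 0);
    std::size_t const pitchX = sizeof(T);
    std::size_t const pitchY = pitchX * extent[2];
    std::size_t const pitchZ = pitchY * extent[1];
    return {pitchZ, pitchY, pitchX};
}

// Byte offset of element {z, y, x} computed from the pitches.
inline std::size_t byteOffset(std::array<std::size_t, 3> const& pitches, std::array<std::size_t, 3> const& idx)
{
    return idx[0] * pitches[0] + idx[1] * pitches[1] + idx[2] * pitches[2];
}
```

Because padding changes the row pitch, code that walks raw pointers should use the queried pitches rather than recomputing them from the extents.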

View initialization, etc.

// The buffer can have any dimensionality.
// Memory manipulation functions support views too.
// set all bytes to zero
onHost::memset(queue, buffer, uint8_t{0});
// element-wise fill with value
onHost::fill(queue, buffer, 42);

Allocate a buffer

// the allocation provides a shared buffer which is
// automatically freed when the last handle goes out of scope
concepts::IBuffer auto devBuffer = onHost::alloc<DataType>(device, extentMd);
// allocate memory which lives on the host but is accessible from the device too
concepts::IBuffer auto devMappedBuffer = onHost::allocMapped<DataType>(device, extentMd);
// allocate memory that can be accessed from host and device (unified memory),
// the real location depends on the native backend e.g. CUDA, OneApi, ...
concepts::IBuffer auto devUnifiedBuffer = onHost::allocUnified<DataType>(device, extentMd);
// allocate memory that is accessible after it is processed in the queue
concepts::IBuffer auto devDeferredBuffer = onHost::allocDeferred<DataType>(queue, extentMd);
// allocate memory accessible from host
concepts::IBuffer auto hostBuffer = onHost::allocHost<DataType>(extentMd);
// Data will not be automatically freed, user must take care that
// the original data life-time is longer than the non-owning view.
concepts::IView auto devNonOwningView = devBuffer.getView();

Copy multidimensional buffer/view or span data

// Memory manipulation functions support views too.
onHost::memcpy(queue, dstBuffer, srcBuffer);
// Providing the extent is optional and allows partial copies.
onHost::memcpy(queue, dstBuffer, srcBuffer, extentMd);
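What a partial copy involves can be sketched for the two-dimensional case: only a sub-region is copied, row by row, while source and destination keep their own row widths. The sketch assumes the region starts at the origin of both arrays; copyRegion2D is a hypothetical helper and not alpaka's memcpy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a rows x cols region from the origin of src into the origin of dst.
// srcWidth and dstWidth are the full row lengths (in elements) of each array.
// Hypothetical helper for illustration; not alpaka's memcpy.
template<typename T>
void copyRegion2D(
    std::vector<T>& dst,
    std::size_t dstWidth,
    std::vector<T> const& src,
    std::size_t srcWidth,
    std::size_t rows,
    std::size_t cols)
{
    assert(cols <= dstWidth && cols <= srcWidth);
    assert(rows * dstWidth <= dst.size() && rows * srcWidth <= src.size());
    for(std::size_t y = 0; y < rows; ++y)
        std::copy_n(src.begin() + y * srcWidth, cols, dst.begin() + y * dstWidth);
}
```

Elements of the destination outside the copied region are left untouched.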

Allocate a buffer with the same extents from a std::vector or std::array

// This allocLike + memcpy pattern is not specific to std::vector; it also works with std::array
// and with alpaka Buffers/Views.
// Construct a host container (here: std::vector) with arbitrary values.
std::vector vec(42u, DataType{10});
// Create a one-dimensional deviceBuffer with the same extent as 'vec'
auto deviceBuffer = alpaka::onHost::allocLike(device, vec);
// Copy host -> device directly from the vector into the allocated device buffer.
// Note: if the queue is asynchronous, ensure the source memory container stays alive until the copy
// completes.
alpaka::onHost::memcpy(blockingQueue, deviceBuffer, vec);

Kernel Execution

Manually set a kernel launch configuration

onHost::concepts::FrameSpec auto frameSpec = onHost::FrameSpec{numFramesMd, frameExtentMd};

Automatically select a valid kernel launch configuration

// Provides kernel start parameters suitable for the device and executor
onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, extentMd);
// DataType is used to optimize the kernel parameters for working on data of this type
onHost::concepts::FrameSpec auto simdFrameSpec
    = onHost::getSimdFrameSpec<DataType>(device, exec::anyExecutor, extentMd);
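When setting the launch configuration manually, one plausible way to choose the number of frames is to round up so that the frames cover the whole problem extent. This is a sketch under the assumption that frames tile the index space; framesToCover is a hypothetical helper, not part of alpaka.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec2 = std::array<std::size_t, 2>;

// Number of frames needed per dimension so that numFrames * frameExtent covers the problem extent.
inline Vec2 framesToCover(Vec2 const& extent, Vec2 const& frameExtent)
{
    assert(frameExtent[0] > 0 && frameExtent[1] > 0);
    return Vec2{
        (extent[0] + frameExtent[0] - 1) / frameExtent[0],
        (extent[1] + frameExtent[1] - 1) / frameExtent[1]};
}
```

Because the count is rounded up, the last frame in a dimension may extend past the problem extent, so kernels typically need a bounds check.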

Kernel Implementation

Define a kernel as a C++ functor

// ALPAKA_FN_ACC is required for kernels and for functions called inside them.
// acc is the mandatory first parameter; its type is a template parameter.
// acc must be passed as a constant reference.
struct MyKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const&, [[maybe_unused]] auto... kernelArgs) const
    {
    }
};

Instantiate a kernel (does not launch it yet)

// The acc parameter of the kernel is provided automatically and does not need to be specified here.
Kernel kernel{argumentsForConstructor};

Put the kernel for execution

// automatically deduce a fast executor for the given device
queue.enqueue(frameSpec, KernelBundle{kernel, kernelArgs...});
// or use a specific executor
auto executor = exec::cpuSerial;
queue.enqueue(
    onHost::FrameSpec{frameSpec.getNumFrames(), frameSpec.getFrameExtents(), executor},
    KernelBundle{kernel, kernelArgs...});

Access multi-dimensional indices and extents of blocks, threads, and elements

// origin: grid, block
// unit: blocks, threads
auto idxMd = acc.getIdxWithin(onAcc::origin::*, onAcc::unit::*);
auto extentMd = acc.getExtentsOf(onAcc::origin::*, alpaka::onAcc::unit::*);
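The relation between the different origins and units can be illustrated with plain index arithmetic. The sketch assumes the usual grid/block decomposition, in which a thread's position in the grid follows from its block's position and its position inside the block; threadIdxInGrid is a hypothetical helper, not an alpaka function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec2 = std::array<std::size_t, 2>;

// Position of a thread within the whole grid, derived from its block's position in the grid,
// its own position in the block, and the block extent in threads (all ordered {y, x}).
inline Vec2 threadIdxInGrid(Vec2 const& blockIdxInGrid, Vec2 const& threadIdxInBlock, Vec2 const& blockExtent)
{
    assert(threadIdxInBlock[0] < blockExtent[0] && threadIdxInBlock[1] < blockExtent[1]);
    return Vec2{
        blockIdxInGrid[0] * blockExtent[0] + threadIdxInBlock[0],
        blockIdxInGrid[1] * blockExtent[1] + threadIdxInBlock[1]};
}
```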

Allocate static shared memory variable

// two-dimensional matrix with 4 columns, 3 rows with elements of the type float
concepts::IMdSpan auto sharedMdArray
    = alpaka::onAcc::declareSharedMdArray<float, alpaka::uniqueId()>(acc, CVec<uint32_t, 3, 4>{});
// or with a preprocessor unique id
concepts::IMdSpan auto sharedMdArray2
    = alpaka::onAcc::declareSharedMdArray<float, __COUNTER__>(acc, CVec<uint32_t, 3, 4>{});
// a single scalar
float& scalar = alpaka::onAcc::declareSharedVar<float, alpaka::uniqueId()>(acc);

Get dynamic shared memory pool, requires the kernel to have a data member with the size in bytes

struct DynMemKernel
{
    uint32_t dynSharedMemBytes = 32u;

    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc) const
    {
        // Access within the kernel, it is a plain pointer.
        // You are responsible for guaranteeing in-bounds accesses.
        [[maybe_unused]] DataType* dynS = onAcc::getDynSharedMem<DataType>(acc);
    }
};

Alternatively, specialize a trait for the kernel

struct DynSharedMemTrait
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc) const
    {
        // Access within the kernel, it is a plain pointer.
        // You are responsible for guaranteeing in-bounds accesses.
        [[maybe_unused]] int* dynS = onAcc::getDynSharedMem<int>(acc);
    }
};

// specialization within the host code
namespace alpaka::onHost::trait
{
    template<concepts::ThreadSpec T_ThreadSpec>
    struct BlockDynSharedMemBytes<DynSharedMemTrait, T_ThreadSpec>
    {
        BlockDynSharedMemBytes(DynSharedMemTrait const& kernel, T_ThreadSpec const& spec)
        {
            alpaka::unused(kernel, spec);
        }

        // the signature is very similar to the kernel operator() signature with the difference that no accelerator is
        // provided.
        uint32_t operator()([[maybe_unused]] auto const&... args) const
        {
            return 32;
        }
    };
} // namespace alpaka::onHost::trait

Synchronize threads of the same block

onAcc::syncBlockThreads(acc);

Atomic operations

// Operation: operation::Add, operation::Sub, operation::Min, operation::Max, operation::Exch,
//            operation::Inc, operation::Dec, operation::And, operation::Or, operation::Xor,
//            operation::Cas
using Operation = operation::Add;
auto result = atomicOp<Operation>(acc, ptr, 1);
// Also dedicated functions available, e.g.:
auto old = onAcc::atomicAdd(acc, ptr, 1);
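The semantics of these operations can be tried out on the host with std::atomic. The sketch assumes, as the variable name old above suggests, that the atomic functions return the value stored before the operation; fetchAdd and fetchMax are hypothetical helpers built on the standard library, not alpaka calls.

```cpp
#include <atomic>
#include <cassert>

// Fetch-and-add: returns the value stored before the addition.
inline int fetchAdd(std::atomic<int>& target, int value)
{
    return target.fetch_add(value);
}

// Atomic max built from a compare-and-swap loop; returns the previous value.
inline int fetchMax(std::atomic<int>& target, int value)
{
    int old = target.load();
    // compare_exchange_weak reloads 'old' on failure, so the loop retries with fresh data.
    while(old < value && !target.compare_exchange_weak(old, value))
    {
    }
    return old;
}
```

The compare-and-swap loop is the general pattern for building an atomic operation that has no dedicated function.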

Memory fences on block-, device- or system level (guarantees LoadLoad and StoreStore ordering)

// Scopes: all threads of the block, the device, and the system (host and peer devices)
onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
onAcc::memFence(acc, onAcc::scope::system, onAcc::order::acq_rel);

Math functions

[[maybe_unused]] auto sinValue = math::sin(argument);
[[maybe_unused]] auto powValue = math::pow(base, exp);

Similar for other math functions.