Cheatsheet
General
Getting alpaka: https://github.com/alpaka-group/alpaka3
Issue tracker, questions, support: https://github.com/alpaka-group/alpaka3/issues
All alpaka names are in namespace alpaka and header file alpaka/alpaka.hpp
This document assumes:
#include <alpaka/alpaka.hpp>

namespace myProject
{
    using namespace alpaka;
}
Warning
Placing using namespace alpaka; in the global namespace should be avoided because it can cause side effects with other libraries.
All methods and classes in the alpaka namespace can be called from the CPU controller thread (named host) and from the compute device.
Functionality in alpaka::onHost can only be called from the host.
Functionality in alpaka::onAcc can only be called from within a kernel running on the compute device.
Methods starting with onHost::make (e.g., onHost::makeHostDevice()) create handles to instances; copying a handle is a shallow copy, not a deep copy. Methods starting with get (e.g., onHost::getExtents(…)) provide access to properties of an instance.
Accelerator, Platform and Device
Define in-kernel thread indexing type
// 1u, 2u, 3u, ...
constexpr uint32_t dim = 2u;
// uint32_t, size_t
using IdxType = size_t;
using DataType = int;
Usage of multi-dimensional vectors required for extents or indexing
// Use alpaka vector as a static array for the extents
concepts::Vector auto extent1D = Vec{value};
concepts::Vector auto extent2D = Vec{valueY, valueX};
// truly compile time known values
concepts::CVector auto extent3D = CVec<IdxType, valueZ, valueY, valueX>{};
Access components of and destructure multi-dimensional indices and extents
// component 0 is the first (z) component of a {z, y, x} vector
auto extentZ = extent3D[0];
auto [z, y, x] = extent3D;
Linearize multi-dimensional vectors
std::integral auto linearIdx = linearize(extent3D, idx3D);
Map linear index to multi-dimensional index
concepts::Vector auto idxMd = mapToND(extent4D, scalar);
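To make the relation between a multi-dimensional index and its linear index concrete, here is a minimal standalone sketch in plain C++. It assumes a row-major layout in which the last component varies fastest, matching the {z, y, x} ordering used above. linearizeSketch and mapToNDSketch are hypothetical helpers written for illustration; they are not alpaka's implementation of linearize and mapToND.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of alpaka): row-major linearization,
// the last component of the index varies fastest.
template<std::size_t N>
std::size_t linearizeSketch(std::array<std::size_t, N> const& extent, std::array<std::size_t, N> const& idx)
{
    std::size_t linear = 0;
    for(std::size_t d = 0; d < N; ++d)
    {
        assert(idx[d] < extent[d]); // the index must lie inside the extent
        linear = linear * extent[d] + idx[d];
    }
    return linear;
}

// Hypothetical inverse (not part of alpaka): recover the N-dimensional index
// from a linear index by peeling off the fastest dimension first.
template<std::size_t N>
std::array<std::size_t, N> mapToNDSketch(std::array<std::size_t, N> const& extent, std::size_t linear)
{
    std::array<std::size_t, N> idx{};
    for(std::size_t d = N; d-- > 0;)
    {
        idx[d] = linear % extent[d];
        linear /= extent[d];
    }
    return idx;
}
```

Under these assumptions, the extent {4, 3, 2} and the index {1, 2, 1} give the linear index (1 * 3 + 2) * 2 + 1 = 11, and mapping 11 back yields {1, 2, 1} again.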
Available APIs
api::host api::cuda api::hip api::oneApi
Device kinds
deviceKind::cpu deviceKind::amdGpu deviceKind::nvidiaGpu deviceKind::intelGpu
Executors
exec::cpuSerial exec::cpuOmpBlocks exec::cpuTbbBlocks exec::gpuCuda exec::gpuHip exec::oneApi
Create device selector and select a device by index
auto devSelector = onHost::makeDeviceSelector(api, deviceKind);
if(devSelector.getDeviceCount() == 0)
    throw std::runtime_error("No device found!");
auto device = devSelector.makeDevice(index);
Queue and Events
Create a queue for a device
// default queue is non blocking
auto queue = device.makeQueue();
auto nonBlockingQueue = device.makeQueue(queueKind::nonBlocking);
auto blockingQueue = device.makeQueue(queueKind::blocking);
Enqueue a task for execution
queue.enqueueHostFn(task);
queue.enqueueHostFnDeferred(task);
Wait for all operations in the queue
onHost::wait(queue);
Check if a queue is empty
bool isQueueEmpty = queue.isEmpty();
Create an event
auto event = device.makeEvent();
Enqueue an event into the queue
queue.enqueue(event);
Check if the event is completed
event.isComplete();
Wait for the event (and all operations enqueued in the same queue before it)
onHost::wait(event);
Memory
Memory allocation and transfers are symmetric for host and devices; both are done via the alpaka API.
Create a view to host memory represented by a pointer
auto extent = Vec{numElements};
DataType* ptr = externPtr;
concepts::IView auto hostView = makeView(api::host, ptr, extent);
Create a view to host std::vector
std::vector vec = std::vector<DataType>(42u);
// the api is not required, std::vector is assumed to be api::host
// a non-owning view is usable within a kernel and on the host, therefore no namespace 'onHost' is required
auto hostView = makeView(vec);
Create a view to host std::array
std::array array = std::array<DataType, 2>{42u, 23};
// call within host code: api::host is automatically assumed
concepts::IView auto hostView = makeView(array);
// call from within a cuda kernel: api::cuda is automatically assumed
concepts::IView auto deviceView = makeView(array);
Get a raw pointer to the data of a view
DataType* rawPtr = onHost::data(buffer);
Get the pitches of a view
// number of bytes to the next element along the pitch dimension
concepts::Vector auto bufferPitches = onHost::getPitches(buffer);
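As a rough illustration of what a pitch vector contains, the following standalone C++ sketch computes byte pitches for a densely packed row-major array. densePitchesSketch is a hypothetical helper, not an alpaka function. It also assumes no padding, whereas real buffers may pad rows for alignment, so the values returned by onHost::getPitches can be larger.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not part of alpaka): byte distance to the next element
// along each dimension of a densely packed row-major array.
// Assumption: no row padding; real device allocations may add padding.
template<std::size_t N>
std::array<std::size_t, N> densePitchesSketch(std::array<std::size_t, N> const& extent, std::size_t elemBytes)
{
    std::array<std::size_t, N> pitch{};
    std::size_t bytes = elemBytes;
    // Walk from the fastest (last) dimension to the slowest (first) one.
    for(std::size_t d = N; d-- > 0;)
    {
        pitch[d] = bytes;
        bytes *= extent[d];
    }
    return pitch;
}
```

For an extent of {4, 3, 2} and 4-byte elements this gives {24, 8, 4}: stepping in x moves 4 bytes, stepping in y moves one row of 2 elements, and stepping in z moves one plane of 3 rows.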
View initialization, etc.
// The buffer can have any dimensionality.
// Memory manipulation functions support views too.
// set all bytes to zero
onHost::memset(queue, buffer, uint8_t{0});
// element-wise fill with value
onHost::fill(queue, buffer, 42);
Allocate a buffer
// the allocation provides a shared buffer which is
// automatically freed when the lifetime of the last handle ends
concepts::IBuffer auto devBuffer = onHost::alloc<DataType>(device, extentMd);
// allocate memory which lives on the host but is accessible from the device too
concepts::IBuffer auto devMappedBuffer = onHost::allocMapped<DataType>(device, extentMd);
// allocate memory that can be accessed from host and device (unified memory),
// the real location depends on the native backend e.g. CUDA, OneApi, ...
concepts::IBuffer auto devUnifiedBuffer = onHost::allocUnified<DataType>(device, extentMd);
// allocate memory that is accessible after it is processed in the queue
concepts::IBuffer auto devDeferredBuffer = onHost::allocDeferred<DataType>(queue, extentMd);
// allocate memory accessible from host
concepts::IBuffer auto hostBuffer = onHost::allocHost<DataType>(extentMd);
// Data will not be automatically freed, the user must ensure that
// the lifetime of the original data is longer than that of the non-owning view.
concepts::IView auto devNonOwningView = devBuffer.getView();
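The handle semantics described in the comments above (shared ownership, shallow copies, non-owning views) can be pictured with standard C++ smart pointers. This is only an analogy under stated assumptions. BufferHandleSketch, ViewSketch and allocSketch are hypothetical names, and nothing here claims that alpaka buffers are implemented this way. Note also that std::weak_ptr can detect a freed buffer, while a real non-owning view would simply dangle.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an owning buffer handle (not an alpaka type).
using BufferHandleSketch = std::shared_ptr<std::vector<int>>;
// Hypothetical stand-in for a non-owning view (not an alpaka type).
using ViewSketch = std::weak_ptr<std::vector<int>>;

// Allocate storage and return a shared handle to it.
inline BufferHandleSketch allocSketch(std::size_t numElements)
{
    assert(numElements > 0);
    return std::make_shared<std::vector<int>>(numElements);
}
```

Copying such a handle copies only the reference: a write through one copy is visible through the other, and the storage is released once the last handle is destroyed or reset.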
Copy multidimensional buffer/view or span data
// Memory manipulation functions support views too.
onHost::memcpy(queue, dstBuffer, srcBuffer);
// Providing the extent is optional and allows partial copies.
onHost::memcpy(queue, dstBuffer, srcBuffer, extentMd);
Allocate a buffer with the same extents from a std::vector or std::array
// This allocLike + memcpy pattern is not specific to std::vector; it also works with std::array
// and with alpaka Buffers/Views.
// Construct a host container (here: std::vector) with arbitrary values.
std::vector vec(42u, DataType{10});
// Create a one-dimensional deviceBuffer with the same extent as 'vec'
auto deviceBuffer = alpaka::onHost::allocLike(device, vec);
// Copy host -> device directly from the vector into the allocated device buffer.
// Note: if the queue is asynchronous, ensure the source memory container stays alive until the copy
// completes.
alpaka::onHost::memcpy(blockingQueue, deviceBuffer, vec);
Kernel Execution
Manually set a kernel launch configuration
onHost::concepts::FrameSpec auto frameSpec = onHost::FrameSpec{numFramesMd, frameExtentMd};
Automatically select a valid kernel launch configuration
// Provides kernel start parameters suitable for the device and executor
onHost::concepts::FrameSpec auto frameSpec = onHost::getFrameSpec(device, exec::anyExecutor, extentMd);
// DataType is used to optimize the kernel parameters for working on data of this type
onHost::concepts::FrameSpec auto simdFrameSpec = onHost::getSimdFrameSpec<DataType>(device, exec::anyExecutor, extentMd);
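When setting numFramesMd and frameExtentMd by hand, one common choice is to pick a frame extent and then use enough frames per dimension to cover the data extent. The standalone sketch below shows that ceiling division in plain C++. numFramesToCoverSketch is a hypothetical helper. Whether the frames need to cover the whole extent depends on how the kernel iterates over the data, and onHost::getFrameSpec may choose different values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of alpaka): number of frames per dimension
// so that numFrames[d] * frameExtent[d] >= extent[d].
template<std::size_t N>
std::array<std::size_t, N> numFramesToCoverSketch(
    std::array<std::size_t, N> const& extent,
    std::array<std::size_t, N> const& frameExtent)
{
    std::array<std::size_t, N> numFrames{};
    for(std::size_t d = 0; d < N; ++d)
    {
        assert(frameExtent[d] > 0); // avoid division by zero
        // ceiling division without floating point
        numFrames[d] = (extent[d] + frameExtent[d] - 1) / frameExtent[d];
    }
    return numFrames;
}
```

For a data extent of {10, 1000} and a frame extent of {4, 256} this yields {3, 4} frames, which together span {12, 1024} elements.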
Kernel Implementation
Define a kernel as a C++ functor
ALPAKA_FN_ACC is required for kernels and for functions called inside them. acc is the mandatory first parameter; its type is a template parameter. acc must be a constant reference.
struct MyKernel
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const&, [[maybe_unused]] auto... kernelArgs) const
    {
    }
};
Instantiate a kernel (does not launch it yet)
The acc parameter of the kernel is provided automatically and does not need to be specified here.
Kernel kernel{argumentsForConstructor};
Enqueue the kernel for execution
// automatically deduce a fast executor for the given device
queue.enqueue(frameSpec, KernelBundle{kernel, kernelArgs...});
// or use a specific executor
auto executor = exec::cpuSerial;
queue.enqueue(
    onHost::FrameSpec{frameSpec.getNumFrames(), frameSpec.getFrameExtents(), executor},
    KernelBundle{kernel, kernelArgs...});
Access multi-dimensional indices and extents of blocks, threads, and elements
// origin: grid, block
// unit: blocks, threads
auto idxMd = acc.getIdxWithin(onAcc::origin::*, onAcc::unit::*);
auto extentMd = acc.getExtentsOf(onAcc::origin::*, onAcc::unit::*);
Provide the dynamic shared memory size by specializing a trait for the kernel
struct DynSharedMemTrait
{
    ALPAKA_FN_ACC void operator()(onAcc::concepts::Acc auto const& acc) const
    {
        // Access within the kernel, it is a plain pointer.
        // You are responsible to guarantee in bounds accesses.
        [[maybe_unused]] int* dynS = onAcc::getDynSharedMem<int>(acc);
    }
};

// specialization within the host code
namespace alpaka::onHost::trait
{
    template<concepts::ThreadSpec T_ThreadSpec>
    struct BlockDynSharedMemBytes<DynSharedMemTrait, T_ThreadSpec>
    {
        BlockDynSharedMemBytes(DynSharedMemTrait const& kernel, T_ThreadSpec const& spec)
        {
            alpaka::unused(kernel, spec);
        }

        // the signature is very similar to the kernel operator() signature with the difference that no accelerator is
        // provided.
        uint32_t operator()([[maybe_unused]] auto const&... args) const
        {
            return 32;
        }
    };
} // namespace alpaka::onHost::trait
Synchronize threads of the same block
onAcc::syncBlockThreads(acc);
Atomic operations
// Operation: operation::Add, operation::Sub, operation::Min, operation::Max, operation::Exch,
//            operation::Inc, operation::Dec, operation::And, operation::Or, operation::Xor,
//            operation::Cas
using Operation = operation::Add;
auto result = atomicOp<Operation>(acc, ptr, 1);
// Also dedicated functions available, e.g.:
auto old = onAcc::atomicAdd(acc, ptr, 1);
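The variable name old in the snippet above suggests that an atomic operation hands back the value stored before the update. The following standalone sketch shows that behaviour with std::atomic on the host, and also how an operation such as Max can be built from compare-and-swap. atomicAddSketch and atomicMaxSketch are hypothetical helpers for illustration only; they do not use or describe alpaka's implementation.

```cpp
#include <atomic>

// Hypothetical helper: add 'value' and return the value stored before the addition.
inline int atomicAddSketch(std::atomic<int>& target, int value)
{
    return target.fetch_add(value);
}

// Hypothetical helper: atomic maximum built from compare-and-swap.
// Returns the value observed before the (possible) update.
inline int atomicMaxSketch(std::atomic<int>& target, int value)
{
    int old = target.load();
    // Retry while our value is larger and another thread changed 'target' in between.
    // On failure, compare_exchange_weak reloads the current value into 'old'.
    while(old < value && !target.compare_exchange_weak(old, value))
    {
    }
    return old;
}
```

Starting from a counter of 5, atomicAddSketch(counter, 1) returns 5 and leaves 6 in the counter. A following atomicMaxSketch(counter, 10) returns 6 and stores 10, while atomicMaxSketch(counter, 3) returns 10 and changes nothing.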
Memory fences on block-, device- or system level (guarantees LoadLoad and StoreStore ordering)
// Scopes: all threads of the block, the device, and the system (host and peer devices)
onAcc::memFence(acc, onAcc::scope::block, onAcc::order::acquire);
onAcc::memFence(acc, onAcc::scope::device, onAcc::order::release);
onAcc::memFence(acc, onAcc::scope::system, onAcc::order::acq_rel);
Math functions
[[maybe_unused]] auto sinValue = math::sin(argument);
[[maybe_unused]] auto powValue = math::pow(base, exp);
Similar for other math functions.