Pixel

This page describes the implementation of the Pixel backend in core/simulate/src/pixelsim.cpp, core/simulate/src/pixelsim_impl.cpp and core/simulate/src/cudapixelsim.cpp.

It intentionally avoids repeating the numerical derivations. For the PDE, units and discretization details, see Pixel simulator and Maths.

CPU backend

State layout

PixelSim builds two runtime object lists:

  • one SimCompartment for each simulated compartment

  • one SimMembrane for each membrane that has reactions and touches at least one simulated compartment

Each SimCompartment stores flattened voxel-major arrays, so the state for voxel ix and species is lives at ix * nSpecies + is in conc, dcdt and the Runge-Kutta work buffers.

If the model reactions depend on time and/or on the spatial coordinates, the extra variables are appended to every voxel state in this order:

  • time if needed

  • x, y, z if needed

That padding is created once in the constructor and then carried through the same symbolic evaluation path as the real species.

Compartment reactions

The compartment-local reaction path is built in two steps:

  1. ReacExpr constructs one right-hand-side expression per species from the model reactions in that compartment, plus any optional time/x/y/ z variables.

  2. SimCompartment compiles those expressions into a common::Symbolic object so that each voxel update becomes a direct sym.eval(...) call.

At runtime, PixelSim::calculateDcdt() evaluates each compartment in this order:

  1. evaluateReactionsAndDiffusion() or evaluateReactionsAndDiffusion_tbb()

  2. membrane contributions

  3. spatiallyAverageDcdt() for non-spatial species

  4. applyStorage() or applyStorage_tbb()

Inside SimCompartment, the reaction contribution is the first pass over the voxel state. evaluateReactions(begin, end) overwrites the corresponding slice of dcdt with the compiled reaction result for each voxel. The later diffusion and membrane passes add onto the same dcdt entries.

The diffusion pass is separate because it uses the compartment topology (up_x(), dn_x() and friends) rather than the symbolic reaction system. The implementation has two variants:

  • a fast uniform-diffusion stencil when every species has a uniform diffusion constant

  • a face-averaged conservative stencil when diffusion varies per voxel

If cross-diffusion is enabled, SimCompartment adds two extra passes:

  • evaluate the cross-diffusion coefficients from symbolic expressions

  • apply the cross-diffusion stencil and update the explicit stability bound

Zero-storage species reuse the same evaluation path. Before every Runge-Kutta stage, solveZeroStorageConstraints() repeatedly calls the compartment and membrane evaluators, then relaxes only the zero-storage species until the residual is below the configured error tolerances.

Membrane reactions

SimMembrane implements membrane reactions as per-face updates into the two adjacent compartments.

During construction it:

  • builds the variable list as [species in A][species in B][extra vars]

  • rescales the symbolic membrane reaction by volOverL3 so the compiled expression returns delta concentration * voxel length in the flux direction

  • compiles one common::Symbolic evaluator for the membrane

At runtime, evaluateMembranePairRange(...) processes a range of membrane face pairs:

  1. gather the concentrations from voxel ixA in compartment A and voxel ixB in compartment B into a temporary input buffer

  2. evaluate the compiled symbolic membrane expression

  3. add the resulting source terms into the two compartment dcdt arrays

The final division by voxel length is done per face direction at runtime:

  • XP and XM divide by the voxel width

  • YP and YM divide by the voxel height

  • ZP and ZM divide by the voxel depth

That is why the membrane evaluator iterates over six signed face-direction batches instead of using one merged pair list.

CPU parallelism

The CPU implementation uses TBB inside compartments and inside one membrane face-direction batch, but it does not currently parallelize every outer loop.

Parallel today:

  • SimCompartment::evaluateReactionsAndDiffusion_tbb() parallelizes the voxel loops for reaction evaluation, diffusion, cross-diffusion coefficient evaluation and cross-diffusion application.

  • applyStorage_tbb(), the Runge-Kutta update helpers and concentration clamping are parallelized over voxel ranges.

  • SimMembrane::evaluateReactions_tbb() parallelizes the pair loop for one signed face direction at a time.

Serial today:

  • PixelSim::calculateDcdt() iterates over compartments one after another.

  • membranes are evaluated one membrane after another

  • each membrane still processes XP, XM, YP, YM, ZP and ZM sequentially

  • updateCrossDiffusionMaxStableTimestep() is a serial reduction over the already-evaluated coefficient field

The important race-avoidance rule is that the CPU membrane path writes directly into the shared compartment dcdt buffers without atomics. Parallelism is therefore limited to batches where each worker writes to disjoint output rows. One signed face-direction batch has that property because a voxel can only have one neighbour in a given signed Cartesian direction. Different face directions do not have that guarantee, so they stay serial.

GPU backend

Current scope of the GPU backends

The GPU backend currently has two implementations with the same user-facing feature set:

  • CudaPixelSim on CUDA-enabled Linux and Windows systems

  • MetalPixelSim on Apple silicon macOS systems

Both GPU backends currently have stricter input requirements than the CPU implementation. Setup fails early if the selected model uses any of the following:

  • cross-diffusion

  • time-dependent reaction terms

  • spatially dependent reaction terms

  • non-spatial species

  • non-uniform diffusion constants

  • storage values other than 1

That reduced feature set is why the current GPU code paths can use a simpler state layout with no extra time or x/y/z padding.

Compartment kernels

For each simulated compartment, the GPU backend allocates:

  • host buffers for concentrations, dcdt and lower-order RK data

  • device buffers dConc, dDcdt, dLowerOrder and dOldConc

  • device buffers for neighbour indices and uniform diffusion coefficients

  • backend-specific execution state:

    • CUDA uses one non-blocking stream and one dcdtReady event per compartment

    • Metal uses MTL::Buffer allocations in shared storage and submits work through a single command queue

The symbolic reaction expressions are converted to backend source with SymEngine::cudacode(...) or SymEngine::metalcode(...) and compiled at runtime with NVRTC or the Metal runtime compiler.

The generated compartment kernels follow the same split as the CPU path:

  • reaction_kernel writes the per-voxel reaction RHS into dDcdt

  • diffusion_uniform_kernel adds the uniform-diffusion stencil to dDcdt

Membrane kernels

Each membrane also gets its own generated backend kernel. The setup step uploads one device array of voxel index pairs for every non-empty face direction and stores the inverse face length for that batch.

The generated membrane kernel:

  1. reads one face pair (ixA, ixB)

  2. gathers the species values from dConc in the two adjacent compartments

  3. evaluates the generated membrane reaction function

  4. uses floating-point atomics to accumulate the result into dDcdtA and dDcdtB

CUDA uses atomicAdd and Metal uses atomic_float with relaxed ordering. Because both backends use atomics, they can update shared membrane-adjacent voxels without staging the accumulation through the CPU.

The CUDA membrane launches are pre-assigned round-robin across the compartment streams. For each stream, membraneStreamDeps stores the compartments whose dcdtReady events must be satisfied before that stream can start its assigned membrane kernels. The current Metal implementation follows the same kernel split, but uses a simpler submit-and-wait command-buffer flow instead of stream/event scheduling.

CUDA step flow

The CUDA step flow tries to avoid synchronisation / waiting for kernels or transfers wherever possible. In particular, all compartment kernels run in parallel as they are independent of each other. The membrane kernels also run in parallel, with atomic updates since multiple kernels can modify the same voxel. Data transfer is non-blocking, so the next solver step can start while the results of the previous step are still transferring to the host.

Additional details:

  • evaluateDcdt() starts by recording mainStreamReady on the main compute stream. Each compartment stream waits on that event before launching its own reaction and diffusion kernels.

  • After the compartment kernels finish, each stream records dcdtReady. Streams assigned membrane work then wait only on the specific compartment events that their membrane batches touch.

  • Once all compartment streams have recorded their completion events, the main stream waits on those events and then launches the Runge-Kutta update, clamping and error-estimation kernels.

  • downloadStateToHost() records a computeDone event on the main stream, makes the dedicated transfer stream wait for it, then issues cuMemcpyDtoHAsync(...) copies for each compartment and records downloadComplete.

  • Future compute work calls ensureDownloadComplete() before writing to dConc again, so an in-flight download cannot race with the next kernel.

  • Rejected adaptive steps restore dOldConc into dConc with cuMemcpyDtoDAsync(...) on the main compute stream.

Metal step flow

The current Metal implementation keeps the same high-level kernel breakdown as CUDA, but uses a simpler execution strategy:

  • each solver stage builds one command buffer, commits it and waits for completion before the next stage

  • Apple silicon shared-memory buffers avoid explicit device-to-host copies, although concentrations and error buffers are still converted between host double storage and device float storage

  • this is intentionally less aggressive than the CUDA overlap/stream model, but keeps the control flow close enough that the solver logic stays aligned