Pixel ===== This page describes the implementation of the Pixel backend in ``core/simulate/src/pixelsim.cpp``, ``core/simulate/src/pixelsim_impl.cpp`` and ``core/simulate/src/cudapixelsim.cpp``. It intentionally avoids repeating the numerical derivations. For the PDE, units and discretization details, see :doc:`../reference/pixel` and :doc:`../reference/maths`. CPU backend ----------- State layout ^^^^^^^^^^^^ ``PixelSim`` builds two runtime object lists: * one ``SimCompartment`` for each simulated compartment * one ``SimMembrane`` for each membrane that has reactions and touches at least one simulated compartment Each ``SimCompartment`` stores flattened voxel-major arrays, so the state for voxel ``ix`` and species ``is`` lives at ``ix * nSpecies + is`` in ``conc``, ``dcdt`` and the Runge-Kutta work buffers. If the model reactions depend on time and/or on the spatial coordinates, the extra variables are appended to every voxel state in this order: * ``time`` if needed * ``x``, ``y``, ``z`` if needed That padding is created once in the constructor and then carried through the same symbolic evaluation path as the real species. Compartment reactions ^^^^^^^^^^^^^^^^^^^^^ The compartment-local reaction path is built in two steps: 1. ``ReacExpr`` constructs one right-hand-side expression per species from the model reactions in that compartment, plus any optional ``time``/``x``/``y``/ ``z`` variables. 2. ``SimCompartment`` compiles those expressions into a ``common::Symbolic`` object so that each voxel update becomes a direct ``sym.eval(...)`` call. At runtime, ``PixelSim::calculateDcdt()`` evaluates each compartment in this order: 1. ``evaluateReactionsAndDiffusion()`` or ``evaluateReactionsAndDiffusion_tbb()`` 2. membrane contributions 3. ``spatiallyAverageDcdt()`` for non-spatial species 4. ``applyStorage()`` or ``applyStorage_tbb()`` Inside ``SimCompartment``, the reaction contribution is the first pass over the voxel state. ``evaluateReactions(begin, end)`` overwrites the corresponding slice of ``dcdt`` with the compiled reaction result for each voxel. The later diffusion and membrane passes add onto the same ``dcdt`` entries. The diffusion pass is separate because it uses the compartment topology (``up_x()``, ``dn_x()`` and friends) rather than the symbolic reaction system. The implementation has two variants: * a fast uniform-diffusion stencil when every species has a uniform diffusion constant * a face-averaged conservative stencil when diffusion varies per voxel If cross-diffusion is enabled, ``SimCompartment`` adds two extra passes: * evaluate the cross-diffusion coefficients from symbolic expressions * apply the cross-diffusion stencil and update the explicit stability bound Zero-storage species reuse the same evaluation path. Before every Runge-Kutta stage, ``solveZeroStorageConstraints()`` repeatedly calls the compartment and membrane evaluators, then relaxes only the zero-storage species until the residual is below the configured error tolerances. Membrane reactions ^^^^^^^^^^^^^^^^^^ ``SimMembrane`` implements membrane reactions as per-face updates into the two adjacent compartments. During construction it: * builds the variable list as ``[species in A][species in B][extra vars]`` * rescales the symbolic membrane reaction by ``volOverL3`` so the compiled expression returns ``delta concentration * voxel length in the flux direction`` * compiles one ``common::Symbolic`` evaluator for the membrane At runtime, ``evaluateMembranePairRange(...)`` processes a range of membrane face pairs: 1. gather the concentrations from voxel ``ixA`` in compartment A and voxel ``ixB`` in compartment B into a temporary input buffer 2. evaluate the compiled symbolic membrane expression 3. add the resulting source terms into the two compartment ``dcdt`` arrays The final division by voxel length is done per face direction at runtime: * ``XP`` and ``XM`` divide by the voxel width * ``YP`` and ``YM`` divide by the voxel height * ``ZP`` and ``ZM`` divide by the voxel depth That is why the membrane evaluator iterates over six signed face-direction batches instead of using one merged pair list. CPU parallelism ^^^^^^^^^^^^^^^ The CPU implementation uses TBB inside compartments and inside one membrane face-direction batch, but it does not currently parallelize every outer loop. Parallel today: * ``SimCompartment::evaluateReactionsAndDiffusion_tbb()`` parallelizes the voxel loops for reaction evaluation, diffusion, cross-diffusion coefficient evaluation and cross-diffusion application. * ``applyStorage_tbb()``, the Runge-Kutta update helpers and concentration clamping are parallelized over voxel ranges. * ``SimMembrane::evaluateReactions_tbb()`` parallelizes the pair loop for one signed face direction at a time. Serial today: * ``PixelSim::calculateDcdt()`` iterates over compartments one after another. * membranes are evaluated one membrane after another * each membrane still processes ``XP``, ``XM``, ``YP``, ``YM``, ``ZP`` and ``ZM`` sequentially * ``updateCrossDiffusionMaxStableTimestep()`` is a serial reduction over the already-evaluated coefficient field The important race-avoidance rule is that the CPU membrane path writes directly into the shared compartment ``dcdt`` buffers without atomics. Parallelism is therefore limited to batches where each worker writes to disjoint output rows. One signed face-direction batch has that property because a voxel can only have one neighbour in a given signed Cartesian direction. Different face directions do not have that guarantee, so they stay serial. GPU backend ----------- Current scope of the GPU backends ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The GPU backend currently has two implementations with the same user-facing feature set: * ``CudaPixelSim`` on CUDA-enabled Linux and Windows systems * ``MetalPixelSim`` on Apple silicon macOS systems Both GPU backends currently have stricter input requirements than the CPU implementation. Setup fails early if the selected model uses any of the following: * cross-diffusion * time-dependent reaction terms * spatially dependent reaction terms * non-spatial species * non-uniform diffusion constants * storage values other than ``1`` That reduced feature set is why the current GPU code paths can use a simpler state layout with no extra ``time`` or ``x``/``y``/``z`` padding. Compartment kernels ^^^^^^^^^^^^^^^^^^^ For each simulated compartment, the GPU backend allocates: * host buffers for concentrations, ``dcdt`` and lower-order RK data * device buffers ``dConc``, ``dDcdt``, ``dLowerOrder`` and ``dOldConc`` * device buffers for neighbour indices and uniform diffusion coefficients * backend-specific execution state: * CUDA uses one non-blocking stream and one ``dcdtReady`` event per compartment * Metal uses ``MTL::Buffer`` allocations in shared storage and submits work through a single command queue The symbolic reaction expressions are converted to backend source with ``SymEngine::cudacode(...)`` or ``SymEngine::metalcode(...)`` and compiled at runtime with NVRTC or the Metal runtime compiler. The generated compartment kernels follow the same split as the CPU path: * ``reaction_kernel`` writes the per-voxel reaction RHS into ``dDcdt`` * ``diffusion_uniform_kernel`` adds the uniform-diffusion stencil to ``dDcdt`` Membrane kernels ^^^^^^^^^^^^^^^^ Each membrane also gets its own generated backend kernel. The setup step uploads one device array of voxel index pairs for every non-empty face direction and stores the inverse face length for that batch. The generated membrane kernel: 1. reads one face pair ``(ixA, ixB)`` 2. gathers the species values from ``dConc`` in the two adjacent compartments 3. evaluates the generated membrane reaction function 4. uses floating-point atomics to accumulate the result into ``dDcdtA`` and ``dDcdtB`` CUDA uses ``atomicAdd`` and Metal uses ``atomic_float`` with relaxed ordering. Because both backends use atomics, they can update shared membrane-adjacent voxels without staging the accumulation through the CPU. The CUDA membrane launches are pre-assigned round-robin across the compartment streams. For each stream, ``membraneStreamDeps`` stores the compartments whose ``dcdtReady`` events must be satisfied before that stream can start its assigned membrane kernels. The current Metal implementation follows the same kernel split, but uses a simpler submit-and-wait command-buffer flow instead of stream/event scheduling. CUDA step flow ^^^^^^^^^^^^^^ The CUDA step flow tries to avoid synchronisation / waiting for kernels or transfers wherever possible. In particular, all compartment kernels run in parallel as they are independent of each other. The membrane kernels also run in parallel, with atomic updates since multiple kernels can modify the same voxel. Data transfer is non-blocking, so the next solver step can start while the results of the previous step are still transferring to the host. Additional details: * ``evaluateDcdt()`` starts by recording ``mainStreamReady`` on the main compute stream. Each compartment stream waits on that event before launching its own reaction and diffusion kernels. * After the compartment kernels finish, each stream records ``dcdtReady``. Streams assigned membrane work then wait only on the specific compartment events that their membrane batches touch. * Once all compartment streams have recorded their completion events, the main stream waits on those events and then launches the Runge-Kutta update, clamping and error-estimation kernels. * ``downloadStateToHost()`` records a ``computeDone`` event on the main stream, makes the dedicated transfer stream wait for it, then issues ``cuMemcpyDtoHAsync(...)`` copies for each compartment and records ``downloadComplete``. * Future compute work calls ``ensureDownloadComplete()`` before writing to ``dConc`` again, so an in-flight download cannot race with the next kernel. * Rejected adaptive steps restore ``dOldConc`` into ``dConc`` with ``cuMemcpyDtoDAsync(...)`` on the main compute stream. Metal step flow ^^^^^^^^^^^^^^^ The current Metal implementation keeps the same high-level kernel breakdown as CUDA, but uses a simpler execution strategy: * each solver stage builds one command buffer, commits it and waits for completion before the next stage * Apple silicon shared-memory buffers avoid explicit device-to-host copies, although concentrations and error buffers are still converted between host ``double`` storage and device ``float`` storage * this is intentionally less aggressive than the CUDA overlap/stream model, but keeps the control flow close enough that the solver logic stays aligned