
PIF, Higher Order and Optimized Scatter/Gather, #501

Open
PaulFisch wants to merge 34 commits into
IPPL-framework:master from PaulFisch:pif-pr

Conversation

@PaulFisch
Collaborator

This PR adds the Particle-in-Fourier method to IPPL, together with restructured FFT, scatter/gather, particle layout, communication, and timing infrastructure. A list of changes:

Build system

  • New IPPL_ENABLE_FINUFFT (CPU finufft / GPU cufinufft via FetchContent)
    and IPPL_ENABLE_CUFFTMP (cuFFTMp + NVSHMEM) build flags, both default OFF.
  • Kokkos default bumped to 5.0.0
  • cmake/AddIpplTest.cmake passes TEST_LINK_LIBS through to non-integration
    tests.
  • cmake/AutoTunePresets.cmake, IpplAutoTunePresets.h.in, and pre-tuned
    sweeps under cmake/auto_tune/sm_90/ ship the autotuner data alongside
    the build (for new scatter/gather kernels).

FFT refactor

  • The old FFT.h/.hpp is split into a Backend layer
    (Backend/{Backend,CuFFT,CuFFTMp,Heffte}.h) handling plan
    management, and a Transform layer
    (Transform/{CC,RC,PrunedCC,PrunedRC,Trig,NUFFT}.h) exposing per-transform
    front-ends.
  • A new native NUFFT (Type-1 and Type-2) lives under src/FFT/NUFFT/:
    • NativeNUFFT.h drives the upsampling-FFT pipeline.
    • ESKernel.h is an exponential-of-semicircle spreading kernel with
      width-specialised optimized evaluation.
    • Correction.h provides deconvolution factors, NUFFTUtilities.h
      provides helpers.
  • The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp}
    and supports both the native engine and the (cu)FINUFFT backend behind
    IPPL_ENABLE_FINUFFT.

Interpolation

  • New scatter dispatch in Interpolation/Scatter/: Scatter.h selects
    between AtomicScatter, GridParallelScatter (output-focused), and
    TiledScatter back-ends through ScatterConfig and a runtime
    TileSizeCache loaded from the autotune presets.
  • Matching gather front-end (AtomicGather, AtomicSort variants) under
    Interpolation/Gather/.
  • Binning.h bin-sorts particles by tile so hot tiles avoid atomic
    contention. Kernels.h, WidthDispatcher.h, and CoordinateTransform.h
    are shared with the native NUFFT.
  • AutoTune.{h,cpp} and TileSizeCache.cpp provide the runtime
    cache-lookup that lets the same binary pick a different tile size on
    different hardware.

Particle layout

  • ParticleSpatialLayout::update is rewritten from the per-iteration
    vector-of-vectors model into a packed locate + single-pass send/recv
    pipeline with reusable scratch (rankSendCount, sendOffsets, sendIds,
    cursor). Buffers are reused across updates and host-side hashes are
    avoided.
  • ParticleSort.h / SortBuffer.h provide a bin-sort with a shared
    device-side buffer pool (BinSortBuffers) so repeated sorts in
    scatter/gather hot paths reuse storage.
  • FieldLayout now stores nghost; HaloCells.{h,hpp} thread it through
    accumulateHalo / fillHalo, enabling kernel-width-aware halos (>1
    ghost cell). (Before this PR, nghosts>1 was broken in IPPL)
  • ParticleAttrib exposes scatterPIFNUFFT / gatherPIFNUFFT guarded by
    IPPL_ENABLE_FFT.
  • ParticleAttrib::scatter(...) / ParticleAttrib::gather(...) (CIC overloads used by
    every alpine PIC example) now
    forward to the kernel-aware path with LinearKernel<MeshType> and a
    default ScatterConfig / GatherConfig. PIC paths therefore inherit the
    same atomic / atomic-sort / tiled / output-focused dispatch and
    TileSizeCache lookup as NUFFT.

Communication and Particle Update

This PR removes a per-attribute buf_m send/recv staging view for every
ParticleAttrib. The old update path went
attribute view -> per-attribute buf_m -> MPI buffer -> per-attribute buf_m -> attribute view, with one buf_m per attribute per exchange and copies on
both ends. The new path serializes attribute data directly into a shared,
reused MPI buffer managed by a singleton BufferHandler, and
deserializes straight back into the destination attribute's storage at a
caller-supplied offset. The per-attribute buf_m is not needed for the common path.

  • ParticleAttrib::serialize / deserialize now operate on dview_m
    directly and accept a hash view (gather-on-the-fly of the live
    particle indices going to one rank) and a destination offset. There is
    no intermediate per-attribute buffer to size, allocate, copy into, or
    copy out of. getView() returns a subview over the live range only, so
    attribute kernels never see the overallocated tail.
  • A granularity-aware GPU allocator is added to Archive. Device
    archives are allocated with raw cudaMalloc / hipMalloc so the
    pointer is page-aligned. Host archives keep using a regular
    Kokkos::View.

Utilities

  • IpplTimings.{h,cpp}: track per-call measurements alongside cumulative
    wall time, support resetAllTimers() (for warmup), and add
    dumpToCSV() for offline analysis. print() reports measurement counts
    and per-occurrence statistics (mean, stddev, min, max).
  • BufferView.h: lightweight, type-erased view over a contiguous device
    byte buffer, used by the new send/recv path.
  • Tuning.h: TileSizeTuner interface to Kokkos Tools autotuner that
    enumerates tile-size candidates for the new scatter/gather back-ends.
  • ParallelDispatch.h: helpers for selecting parallel-for vs.
    parallel-reduce policies based on view rank/extents and execution space.

Alpine

  • alpine/ElectrostaticPIF/{LandauDampingPIF, LandauDampingPIFPruned, BumponTailInstabilityPIF, PenningTrapPIF}.cpp: added as Alpine PIF examples.
  • alpine/ElectrostaticPIF/{ChargedParticlesPIF, ChargedParticlesPIFPruned}.hpp:
    particle-bunch container specialised for the PIF schemes

Unit tests

  • FFT/NUFFT.cpp + NUFFTTestUtils.h: test matrix for NUFFT Type-1
    and Type-2 (small / medium grid, with / without upsampling, three spread
    methods, tolerance sweep at 1e-4 / 1e-7 / 1e-10, FINUFFT cross-check if compiled,
    three gather methods).
  • FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the
    native NUFFT against a direct DFT reference.
  • Interpolation/KernelGatherScatterTest.cpp: charge conservation,
    periodicity, gather-of-constant-field, scatter/sort comparison,
    roundtrip; covers Atomic, Tiled, OutputFocused
  • Interpolation/Binning.cpp: bin-sort correctness and stability tests.
  • Particle/ParticleUpdate.cpp: spatial-layout update tests under uniform
    decomposition.
  • Particle/ParticleUpdateNonuniform.cpp: same for non-uniform domain
    decompositions

Integration tests

  • test/FFT/TestFFTCC.cpp extended for the new pruned variants and the
    heFFTe backend trait.
  • test/particle/benchmarkParticleUpdate.cpp switched to
    Kokkos::Random_XorShift64_Pool device init (avoids host->device copies
    between iterations) and adapted to the new ParticleSpatialLayout
    signature.

PaulFisch added 15 commits May 4, 2026 11:40
- Top-level CMakeLists.txt: add option(IPPL_ENABLE_FINUFFT) and
  option(IPPL_ENABLE_CUFFTMP), both defaulting OFF.
- cmake/Dependencies.cmake: add FetchContent block for (CU)FINUFFT and
  the cuFFTMp / NVSHMEM discovery block. Bump Kokkos default to 5.0.0
  and enable Kokkos_ENABLE_COMPILE_AS_CMAKE_LANGUAGE.
- cmake/InstallIppl.cmake: install finufft / cufinufft targets when
  built in-tree.
- cmake/AddIpplTest.cmake: pass TEST_LINK_LIBS through to non-integration
  tests as well.
- src/CMakeLists.txt: link Heffte/finufft/cufinufft only when FFT and
  the corresponding option are enabled; embed build metadata as an
  opt-in to reduce rebuild churn.
- .gitignore: exclude .idea*, cmake-build-*.
- Split the original FFT.h/FFT.hpp into:
  - Backend/{Backend,CuFFT,CuFFTMp,Heffte,Interface}.h: per-library
    backend selection and plan management.
  - Transform/{CC,RC,PrunedCC,PrunedRC,Trig,Common,Transform,
    NUFFT.{h,hpp}}.h: per-transform front-ends (complex-complex,
    real-complex, pruned variants, trigonometric, NUFFT).
  - Traits.h: backend availability traits and selection helpers.
- Add a native Kokkos NUFFT (Type-1 / Type-2) engine under FFT/NUFFT/:
  NativeNUFFT.h driving the upsampling-FFT pipeline, ESKernel.h
  exponential-of-semicircle spreading kernel with width-specialised
  Estrin polynomial evaluators, Correction.h deconvolution factors,
  Quadrature.h Gauss-Legendre quadrature, NUFFTUtilities.h helpers.
- The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp}
  and supports both the native engine and the (cu)FINUFFT backend
  behind IPPL_ENABLE_FINUFFT.
- Add Scatter/{Scatter,GridParallelScatter,AtomicScatter,TiledScatter,
  ScatterConfig,ScatterArgumentsBase,TileSizeCache}.h: a configurable
  scatter front-end that selects between atomic, output-focused, and
  tiled grid-parallel back-ends, with an optional tile-size cache for
  hardware-specific tuning data loaded at runtime.
- Add Gather/{Gather,AtomicGather,GatherConfig,GatherArgumentsBase}.h:
  matching gather front-end (Atomic / AtomicSort variants).
- Add Binning.h: bin-sort utilities used to group particles by tile so
  scatter/gather kernels can avoid atomics on hot tiles.
- Add Kernels.h: shared kernel-evaluation helpers consumed by the new
  scatter/gather kernels.
- Add WidthDispatcher.h: compile-time switch from runtime kernel width
  to the corresponding template specialization.
- Add CoordinateTransform.h: physical <-> grid coordinate helpers used
  by both scatter/gather and the native NUFFT.
- CIC.hpp: minor signature alignment to the new kernel concept.
- Particle/ParticleSpatialLayout.{h,hpp}: rewrite the spatial layout's
  update path to use a packed locate + single-pass send/recv pipeline
  with reusable scratch (rankSendCount, sendOffsets, sendIds, cursor)
  instead of the per-iteration vector-of-vectors approach. Reuses
  buffers across updates and avoids host-side hashes for the common
  case.
- Particle/ParticleSort.h, SortBuffer.h: bin-sort by spatial layout
  with a shared device-side buffer pool (BinSortBuffers) so repeated
  sorts in scatter/gather hot paths reuse storage.
- Particle/ParticleAttrib.{h,hpp}, ParticleAttribBase.h, ParticleBase.{h,
  hpp}: thread the new sort/bin-aware scatter and gather entry points
  through the attribute interface; add scatterPIFNUFFT/gatherPIFNUFFT
  guarded by IPPL_ENABLE_FFT.
- FieldLayout/{FieldLayout.h,FieldLayout.hpp,SubFieldLayout.hpp}: store
  nghost on the layout so the halo and neighbor lookup pick up the
  correct ghost width when more than one cell is needed.
- Field/HaloCells.{h,hpp}: thread nghost through accumulateHalo and
  fillHalo so the kernel-width-aware halo path works.
- IpplTimings.{h,cpp}: track per-call measurements alongside the
  cumulative wall time, support resetAllTimers() (for warmup), and
  add dumpToCSV() for offline analysis. Print() now also reports
  measurement counts and per-occurrence statistics (mean, stddev,
  min, max).
- BufferView.h: lightweight, type-erased view over a contiguous
  device byte buffer used by the new send/recv path.
- Tuning.h: TileSizeTuner — Kokkos-tools-driven autotuner that
  enumerates tile-size candidates for the new scatter/gather back-ends.
- ParallelDispatch.h: helpers for selecting parallel-for vs
  parallel-reduce policies based on view rank/extents and execution
  space.
- TypeUtils.h: detail::is_complex_v / MultispaceContainer helpers
  used by the buffer-handler-aware logging path and by the field
  inner-product (Hermitian for complex T).
- ParameterList.h, Timer.cpp, ViewUtils.h: small follow-on tweaks.
- Archive.{h,hpp}: extend the device-side serialize/deserialize path
  to handle Vector<T,Dim> + hash views (used by the new packed
  particle send/recv pipeline) and add a granularity-aware GPU buffer
  allocator. On HIP, the GPU page granularity (64 KiB on MI250X /
  MI300X) is required by HSA IPC for large-message GPU-aware MPI.
- BufferHandler.hpp: round buffer requests up to a 4 KiB page so we
  do not churn the GPU-aware MPI registration cache with tiny
  allocations; allocate a fresh buffer on grow rather than realloc to
  avoid stale-IPC issues with reused virtual addresses.
- Communicator.{h,cpp}: add archive-only isend/recv overloads (used
  by the new layout); switch buffer handlers to a static singleton
  via getBufferHandler() so all communicators share one pool.
- CommunicatorLogging.cpp: gatherLocalLogs() now uses an is_a_logger
  SFINAE guard so we only call getLogs() on a LoggingBufferHandler.
- Buffers.{cpp,hpp}: route deleteAllBuffers/freeAllBuffers/getBuffer
  through the singleton handler.
- Environment.{h,cpp}: initialize MPI with MPI_THREAD_MULTIPLE and
  expose threadMultiple() so callers know whether multithreaded sends
  are safe.
- Field/BareField.{h,hpp}: drop the nghost default on initialize() /
  updateLayout() now that callers have to pick the correct width;
  use ippl::detail::infinity<T>() for the Vector min/max identities
  so reductions with integer T compile.
- Field/BareFieldOperations.hpp: support complex-valued inner products
  via Hermitian conjugation and split allreduce of real/imag parts
  (std::plus<complex> is not a standard MPI Op).
- Field/Field.hpp: small helper for the new scatter integration.
- PoissonSolvers/FFT{Open,Periodic,TruncatedGreenPeriodic}PoissonSolver.h:
  pull heffteBackend from the new fft::HeffteBackend trait instead of
  FFT_t::heffteBackend.
- Types/IpplTypes.h, Vector.hpp: expose detail::infinity<T> for both
  floating-point and integer T (used by the BareField reduction
  identities).
- FEM/LagrangeSpace.hpp: simplify a parameter type (point_t alias).
- Ippl.cpp: call detail::finalizeBinSortBuffers() on shutdown to
  release the new sort scratch.
- Ippl.h: drop the implicit ParallelDispatch include; consumers that
  need it (HaloCells, ParticleSpatialLayout) include it directly.
- alpine/ElectrostaticPIF/{LandauDampingPIF,LandauDampingPIFPruned,
  BumponTailInstabilityPIF,PenningTrapPIF}.cpp: physics drivers using
  the new NUFFT engine to scatter charge onto Fourier modes and
  gather the resulting field at particle locations.
- alpine/ElectrostaticPIF/{ChargedParticlesPIF,ChargedParticlesPIFPruned}.hpp:
  particle-bunch container specialized for the PIF schemes
  (NUFFT-based density and force evaluation, energy diagnostics).
- alpine/ElectrostaticPIF/CMakeLists.txt: build the four executables;
  link (cu)finufft when IPPL_ENABLE_FINUFFT is on.
- alpine/CMakeLists.txt: register ElectrostaticPIF as an alpine
  subdirectory.
- alpine/{BumponTailInstabilityManager,LandauDampingManager,
  PenningTrapManager}.h, ParticleContainer.hpp: small adjustments
  picked up via the new ParticleSpatialLayout signatures.
…ticleUpdate

- FFT/NUFFT.cpp + NUFFTTestUtils.h: TYPED_TEST matrix for NUFFT
  Type-1 and Type-2 (small/medium grid, with/without upsampling,
  three spread methods, tolerance sweep at 1e-4/1e-7/1e-10, FINUFFT
  cross-check, three gather methods).
- FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the
  native NUFFT against a direct DFT reference.
- Interpolation/KernelGatherScatterTest.cpp + Utils.h: charge
  conservation, periodicity, gather-of-constant-field, scatter/sort
  comparison, roundtrip; covers Atomic, Tiled, OutputFocused
  variants (incl. Z-batched).
- Interpolation/Binning.cpp: bin-sort correctness and stability tests
  for the new bin_sort path.
- Particle/ParticleUpdate.cpp: spatial-layout update tests under
  uniform decomposition.
- Particle/ParticleUpdateNonuniform.cpp: same but for non-uniform
  domain decompositions to exercise the rebinning logic.
- Particle/ParticleSendRecv.cpp: minor updates for the new
  Communicator API.
- Wire all new tests into the CMakeLists.txts for unit_tests/{FFT,
  Interpolation,Particle}.
- test/FFT/TestFFTCC.cpp: extend the existing FFT correctness check
  to exercise the new pruned variants and the heFFTe backend trait.
- test/particle/benchmarkParticleUpdate.cpp: switch warmup to
  Kokkos::Random_XorShift64_Pool device init (avoids host->device
  copies between iterations) and adjust to the new
  ParticleSpatialLayout signature.
- test/particle/benchmarkParticleUpdateScaling.cpp: new strong-scaling
  benchmark that exercises ParticleSpatialLayout::update across a
  range of particle counts and ranks.
- test/particle/CMakeLists.txt: register the new scaling benchmark.
- test/{kokkos/TestVectorField{,2,3,4},field/TestFieldBC,particle/PICnd,
  serialization/serialize01}.cpp: minor signature touch-ups
  (host_mirror_type, isAllPeriodic flag, etc.) so the existing
  integration tests still build against the merged-in upstream and
  the new layout APIs.
- test/CMakeLists.txt: minor adjustment to keep all subdirectories
  registering correctly under the new structure.
- Correction.h: applyPreCorrectionType2 now multiplies by factor (not
  conj(factor)), matching the working pruned path.
- AtomicScatter.h: clamp team_size and vector_length to 1 on Serial.
- BufferHandler test: use page-aligned expected sizes; resized larger
  test now expects the smaller free buffer to be retained.
- NUFFT test: zero the non-corner-DC band of the upsampled input.
- ParticleBase test: re-acquire host mirror after destroy().
- LandauDampingManager: open the manager CSV in OVERWRITE on the first
  step and APPEND afterwards, so re-running the test does not produce a
  file with multiple header blocks.
- NUFFTAccuracy: call computeDFTReference before the owned-mode early-
  exit so every rank participates in the same MPI_Allreduce sequence.
- GatherScatterTest::ScatterCustomHashTest: size the host fill loop by
  the post-update local count (getHostMirror() now returns a mirror of
  the live-particle range).
- ParticleUpdateNonuniform::ParticleInjectionBetweenOrbRepartitions:
  the test creates a fresh bunch each cycle, so compare against
  injectPerCycle rather than a cumulative count.
- unit_tests/FFT/CMakeLists.txt: bump NUFFTAccuracy timeout to 600s.
- CMakeLists.txt: add -fno-lto for the CUDA platform to avoid host-LTO
  conflicts with the duplicate fatbinData symbols nvcc emits.
- AddIpplTest.cmake: introduce IPPL_DEFAULT_INTEGRATION_TIMEOUT=300 and
  use it for INTEGRATION tests (the unit-test default of 60s was too
  tight on slower OpenMP runs).
- BareFieldOperations.hpp: pre-capture view1/view2 with (void)-casts so
  nvcc accepts the extended host/device lambda when the constexpr-if
  branches first-use them.
- NUFFT.cpp: in runType2Test, compare absolute error against the
  L-infinity-norm-scaled tolerance instead of a pointwise relative
  error. The pointwise refReal can be arbitrarily close to zero for
  random fields, which made the test sensitive to the (rank, thread)
  layout that drives the random seeding.
- Bump TIMEOUTs that were overshooting on multi-thread OpenMP:
  unit_tests/FFT NUFFTAccuracy (1800), Interpolation
  KernelGatherScatterTest (600), PIC ORB (300), test/solver/fem
  TestNonhomDirichlet_1d_preconditioned/TestPeriodicBC_sin{,_preconditioned}
  (600). Also fix the misplaced TIMEOUT keyword inside LABELS lists in
  the solver/fem CMakeLists.
…ives

Archive<Properties...> previously took the cudaMalloc/hipMalloc path
unconditionally whenever KOKKOS_ENABLE_CUDA or KOKKOS_ENABLE_HIP was
defined. That meant Archive<Kokkos::HostSpace> — used by host-side
exec spaces such as Kokkos::Serial / Kokkos::OpenMP — was backed by a
device pointer, but its serialize() runs std::memcpy on the host and
therefore segfaulted on the very first MPI exchange under a CUDA
build with more than one rank.

Switch the storage choice to depend on the archive's memory space
rather than the build configuration. Device archives still get the raw
cuda/hipMalloc path (preserving the page-aligned / IPC-stable behaviour
needed for the Cray-MPICH HIP IPC bug); host-accessible archives use a
regular Kokkos::View<char*, MemorySpace> and the host-side memcpy in
serialize() now writes to a host pointer.
…work

ParticleAttrib::scatter(...) and ParticleAttrib::gather(...) (the
kernel-less, atomic-CIC overloads used by every alpine PIC example)
now forward to the kernel-aware scatter_kernel/gather path with
LinearKernel<MeshType> and a default ScatterConfig/GatherConfig. This
gives the PIC paths the same atomic / atomic-sort / tiled / output-
focused dispatch (and TileSizeCache lookup) as NUFFT, while keeping
the legacy hash-and-custom-range overload available for
TestHashedScatter / GatherScatterTest::ScatterCustomHashTest.

Higher-dimensional fields (Dim > 3) still use the direct CIC kernel
because the new framework's AtomicScatter/AtomicGather only have
1-, 2-, 3-D specialised paths.

Interpolation::Gather/Scatter::DeduceTypes now derives RealType from
the kernel's value_type instead of from the position attribute. With
mixed precisions (float positions on a double mesh, e.g. PICTest with
float T) this avoids silently downcasting the mesh spacing and
preserves bit-identical results vs. the legacy CIC path
(ASSERT_DOUBLE_EQ in PICTest::Gather is now exact again, no
relaxation needed).
@aaadelmann aaadelmann requested review from aaadelmann, aliemen and srikrrish and removed request for aliemen May 4, 2026 10:31
@aaadelmann aaadelmann added enhancement New feature or request gitlab-mirror labels May 4, 2026
Comment thread alpine/ParticleContainer.hpp Outdated
Comment thread cmake/Dependencies.cmake
Comment thread src/FFT/Backend/Heffte.h Outdated
Comment thread src/FFT/NUFFT/ESKernel.h
Comment thread unit_tests/FFT/NUFFT.cpp Outdated
@srikrrish
Member

cscs-ci run cscs-ci-gh200, cscs-ci-mi300, cscs-ci-openmp

@PaulFisch
Collaborator Author

I still have a commit fixing some of the compile failures on juwels, I will also address your comments @srikrrish

@aaadelmann aaadelmann self-assigned this May 6, 2026
Member

@aaadelmann aaadelmann left a comment


Complexity is super high for one PR. The PR combines FFT backend refactor, NUFFT, scatter/gather redesign, particle migration rewrite, autotuning, timing changes, build-system changes, examples, and many tests. That makes review and regression attribution difficult.

Would it be possible to have the FFT as a separate PR?

@@ -0,0 +1,610 @@
// ChargedParticlesPIF header file
Member


git diff --check reports trailing whitespace in ChargedParticlesPIF.hpp (line 535). There are also long lines and local comments like “Could definiitely be improved” in Scatter.h (line 121), which should not go into production code.

@PaulFisch
Collaborator Author

Complexity is super high for one PR. The PR combines FFT backend refactor, NUFFT, scatter/gather redesign, particle migration rewrite, autotuning, timing changes, build-system changes, examples, and many tests. That makes review and regression attribution difficult.

Would it be possible to have the FFT as a separate PR?

Thanks for the feedback, and I do see the point about the PR being large. Before going through the individual pieces, some context that I hope is useful: I have put a considerable amount of time into preparing this PR, unpaid and without other compensation, upon specific and repeated request from your team. I was very glad to do so, but I do also have a job and a master thesis running in parallel. If there is no interest in merging the changes after all, I would really have appreciated hearing that sooner, before I spent tens of hours on restructuring.

Going through the components:

  • The FFT (where I assume the NUFFT is meant) relies quite directly on the Scatter/Gather redesign, as higher-order Scatter/Gather is at the core of a NUFFT. Without it, there is no NUFFT. The algorithm for width-1 (i.e. PIC) does not change. Since IPPL previously had no support for higher-order Scatter/Gather, it is unfortunately not possible to merge the NUFFT without also adding the Scatter/Gather work.
  • The FFT refactor itself can easily be undone by merging all the currently included headers in FFT.h back into a single ~2000-LOC file, since I did not significantly change the FFT code itself but only broke it up into pieces. Happy to do that if it would help review.
  • The particleUpdate is indeed separate and I can pull it out of the PR relatively easily. As shown in the plot above, it gives a substantial performance boost, and I also added a significant amount of tests to the particle update which surfaced a few minor bugs in the previous version. Happy to revert if that is preferred.
  • The Auto-Tuning feature is used solely for the higher-order Scatter/Gather kernels, which do not currently exist in IPPL. It could be removed, at the cost of noticeably decreased performance. It selects optimal scatter/gather strategies based on local particle density, and was something Sri was particularly interested in for imbalanced distributions.
  • I can remove the timing changes if desired. I would still mention that timings generated without warmup, which was the case before, can be quite misleading. For PIC, a significant share is spent in IPC buffer registration on Alps.
  • The build system changes are quite limited: only what is needed for the higher-order Scatter/Gather kernels, a Kokkos default bump to 5.0.0 (which I can revert), and the optional cuFINUFFT dependency. I can also remove the cuFINUFFT path entirely, though it does perform better than the alternative in some cases.
  • I can certainly remove tests from the PR, though I would push back on that one. Especially for a PR of this size, the test coverage should be kept.
  • Regarding the "this could be better" comments in Scatter.h, I am happy to remove them if preferred. To clarify what they are about: they refer to strategies for selecting between scatter algorithms, not to the algorithms themselves. Every algorithm in there is faster than what was in IPPL before. The comments are simply meant as hints for future contributors who might find even better selection heuristics, since I am sure better ones exist. If the view is that the code is not fast enough for IPPL without that further improvement, I would just note that by the same standard, the previous scatter/gather and particle update paths would also not have qualified. Either way, happy to take the comments out.

@aaadelmann
Member

aaadelmann commented May 6, 2026

I totally understand your boundary conditions. No one has said that we will not merge or that you should undo anything; again, it is simply a complexity problem, in both reviewing and testing. Consider my comment on breaking up as "resolved". If we really need this, we can split up the merge ourselves.

@srikrrish
Member

Currently Jülich's CI/CD has some token issues (pushing the final results to the web page), and my colleague and I are looking into it. But I was able to see the results of the tests directly in GitLab and they were all passing. I now see that CSCS's CI is also passing. @PaulFisch We greatly appreciate your work and time on this, and if you find a bit of time to improve the documentation that would be great. But otherwise we understand.

@aliemen Could you maybe pull this branch and check the open solver or whatever is relevant for you/OPALX? @aaadelmann Could you maybe run some tests on LUMI as you mentioned before?

Since Paul has already posted results on Alps, we will then be good to merge in my opinion.

@PaulFisch
Collaborator Author

I can perform some LUMI scalings, I already have the binaries compiled there. It would probably indeed be good to check some applications before merging this.

@srikrrish
Member

I can perform some LUMI scalings, I already have the binaries compiled there. It would probably indeed be good to check some applications before merging this.

That would be great. Thanks @PaulFisch. As I mentioned before, @aliemen and/or @aaadelmann can check some OPALX-relevant things, I think.

@aliemen
Collaborator

aliemen commented May 7, 2026

@srikrrish yes, I will compile OPALX against the branch and try to run our regression tests. Will come back with the results probably later today or tomorrow morning.

@aaadelmann
Member

Sonali will also run all her solver tests on high node counts.

@PaulFisch
Collaborator Author

I also just found a bug in both my branch and master, where the overallocation is cast to an integer everywhere. In the Alps PIC run I used a non-integer overallocation for both master and my branch. I will push that together with the documentation and add an updated scaling here when the runs finish.

On LUMI, both master and my branch segfault after a few iterations. I am unsure if this is due to FS issues today (https://lumi-supercomputer.eu/lumi-service-status/information-lustre-filesystem-performance/), the heffte issue, OOM, or something else. I am trying to triage.

Happy to help/discuss in case there are regressions anywhere and we need to split up the PR in some way. I unfortunately had conflicting meetings on Tuesdays the last few weeks, but I am also happy to set up a meeting if required.

Comment on lines +127 to +133
# Bake into a generated header consumed by TileSizeCache / GatherCache.
set(IPPL_AUTOTUNE_PRESET_DIR "${_dst_dir}")
set(IPPL_AUTOTUNE_ARCH_TAG "${_tag}")
configure_file(
"${CMAKE_SOURCE_DIR}/cmake/IpplAutoTunePresets.h.in"
"${CMAKE_BINARY_DIR}/include/IpplAutoTunePresets.h"
@ONLY)
Collaborator


This breaks when IPPL is consumed via FetchContent from OPALX because CMAKE_SOURCE_DIR points to the top-level OPALX source directory, not the IPPL source directory. In that case AutoTunePresets.cmake tries to read:

${CMAKE_SOURCE_DIR}/cmake/IpplAutoTunePresets.h.in

which resolves to opalx/cmake/IpplAutoTunePresets.h.in, but the template actually lives in IPPL's own cmake/ directory.

A fix is to resolve paths relative to AutoTunePresets.cmake itself, e.g.

set(_ippl_cmake_dir "${CMAKE_CURRENT_FUNCTION_LIST_DIR}")
set(_src_dir "${_ippl_cmake_dir}/auto_tune/${_tag}")

configure_file(
  "${_ippl_cmake_dir}/IpplAutoTunePresets.h.in"
  "${CMAKE_BINARY_DIR}/include/IpplAutoTunePresets.h"
  @ONLY)
...

With this suggestion, the cmake runs fine. Still doesn't compile - but that's the Kokkos 5 issue. Will comment again when I succeeded.

Collaborator Author


I will add that. Is OPALX incompatible with Kokkos 5.x, or is there something specific I do with it? I don't think I am using any Kokkos 5.x features, so I should be able to bump down the version easily.

Collaborator Author


That is, I was bumping the version because Kokkos 5.x improves performance in some relevant parts (in particular atomics), so long-term it is probably nice to have Kokkos 5.x in OPALX. But it should not block merging this; I can just bump the version back down.

Collaborator


You're right, I thought most of the other errors were Kokkos 5 issues, but it works with Kokkos 4. With my fixes below (using size_type everywhere) I was able to compile it. OPALX still crashed, but I am looking into it.

```cpp
using execution_space = typename Types::execution_space;

struct Arguments : GatherArgumentsBase<Arguments, Types> {
    using PermuteView = Kokkos::View<uint64_t*, memory_space>;
```
Collaborator

In some places, `Kokkos::View<uint64_t*, memory_space>` and `Kokkos::View<ippl::detail::size_type*, memory_space>` are mixed (maybe that's only a problem because OPALX uses size_type more often...). I think it would be nice to use size_type everywhere.

On macOS this fails because std::size_t is unsigned long, while uint64_t is unsigned long long, so these are different Kokkos view value types and cannot be assigned to each other.

Collaborator Author

I will try to address this. The problem is that size_type is not size_t: on some execution spaces it is a 32-bit integer, and the particle number can easily exceed the maximum 32-bit value.

Collaborator Author

Ah, but the uint64_t is definitely a bug; it should be size_t everywhere to get the wider integer type.

Collaborator Author

Actually, do you mean ippl::size_type or the Kokkos size_type that is a typedef of the execution spaces? I can try to use ippl::size_type uniformly; that is surely the cleanest solution, and probably what you meant.

Collaborator
@aliemen May 8, 2026

Ah yes, I was talking about ippl's size_type. In OPALX we aliased something like `using size_type = ippl::detail::size_type;`, so I was talking about that one.

Comment thread src/Particle/ParticleAttrib.h
@aliemen
Collaborator

aliemen commented May 7, 2026

Alright, after the size_type fix, the CMake path fix, and the R.extent(0) vs R.size() fix, I was able to compile and run OPALX. Then I had to fix a unit test that couldn't handle the temporary subview returned by getView, but that was an OPALX problem again.

Now, all of our unit and regression tests passed cleanly on CPU and GPU (GH200 on Daint). Our regression tests only test single rank. But from my local runs, all of them looked good with up to 8 ranks (CPU).

Note that I pushed a branch to IPPL, test-pr-501, that I used to compile OPALX against, where I applied my fixes. Feel free to look at it if you want to see what works with OPALX. The changes I made to OPALX are in 412-...; I will also make a PR as soon as possible.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:

TOTAL_scalings_1024_paper_alps.pdf
TOTAL_scalings_512_paper_alps.pdf

I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

@srikrrish
Member

> Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:
>
> TOTAL_scalings_1024_paper_alps.pdf TOTAL_scalings_512_paper_alps.pdf
>
> I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

> Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

Yeah sorry I mean only the PCG solver.

@srikrrish
Member

> Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.
>
> Yeah sorry I mean only the PCG solver.

Could you maybe resubmit the run corresponding to that data point only, to see if it is a one-time effect?

@PaulFisch
Collaborator Author

Generally the PCG performance decreased, right? Can you send me the script you submitted with? Then I can try to analyze the cause.

After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCm/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce it in a standalone project without Kokkos. The segfault has nothing to do with my branch; it just seems to appear more often in my branch than in master for some reason. With enough HSA debugging environment variables enabled, the segfault happens reliably in the first warmup FFT. I will try building in different environments to see if that helps. It is unclear whether this is a heffte, Cray MPICH, HSA, or ROCm compiler bug.

@srikrrish
Member

> Generally the PCG performance decreased, right? Can you send me the script you submitted with? Then I can try to analyze the cause.
>
> After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCm/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce it in a standalone project without Kokkos. The segfault has nothing to do with my branch; it just seems to appear more often in my branch than in master for some reason. With enough HSA debugging environment variables enabled, the segfault happens reliably in the first warmup FFT. I will try building in different environments to see if that helps. It is unclear whether this is a heffte, Cray MPICH, HSA, or ROCm compiler bug.

Thanks a lot. This seems like something @paubauer may be able to help with; the standalone reproducer could be quite useful for him. No need to spend a lot of time on this, especially if it seems to lead down a rabbit hole.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

I launched the job for that specific data point; it's in the queue. As for the heffte issue, it seems related to issue #481.

Comment on lines +217 to +220
```cpp
if constexpr (Types::KernelType::has_width_template)
    kw_ptr[D * W + i] = args.kernel.template eval<W>(xi);
else
    kw_ptr[D * W + i] = args.kernel(xi);
```
Collaborator

For some reason, this is a problem with CUDA 12.1 and GCC 12 on Gwendolen. I cannot compile OPALX against it:

```
.../cuda-ampere80/build/_deps/ippl-src/src/Interpolation/Scatter/AtomicScatter.h(218): error: class "ippl::Interpolation::LinearKernel<double>" has no member "eval"
                      kw_ptr[D * W + i] = args.kernel.template eval<W>(xi);
                                                               ^
```

CUDA 12.1 with GCC 12 is diagnosing the discarded eval<W> expression inside a Kokkos device lambda/team kernel. Newer CUDA, like 12.9 with GCC 14, apparently handles that template/device-lambda case correctly...

We could either find a workaround (like making all interpolation kernels provide a harmless width-templated wrapper) or we say "doesn't work with cuda/12.1" and ignore it. But that would mean we need a new module on Gwendolen (which is the case for Kokkos 5 anyways). @aaadelmann what's your opinion?

Collaborator Author

There were still a lot of compiler bugs with this kind of expression even in CUDA 12.9. It is probably not too difficult to build a workaround on my side; I will check.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

The new run still produced a similar timing, with the PCG flattening at 512 GPUs.

@aaadelmann
Member

@s-mayani if you still have time until departure, send me all timing files.

@PaulFisch
Collaborator Author

I tried to look a bit at the PCG profiles: most time is spent in the cudaMalloc and cudaFree calls in every iteration. In my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: (a) my branch allocates other things differently, and this affects the timing of allocations in the PCG; (b) my branch needs more iterations to converge for some reason. I will investigate further. The best long-term solution is likely to pre-allocate the views in the PCG.

@srikrrish
Member

> I tried to look a bit at the PCG profiles: most time is spent in the cudaMalloc and cudaFree calls in every iteration. In my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: (a) my branch allocates other things differently, and this affects the timing of allocations in the PCG; (b) my branch needs more iterations to converge for some reason. I will investigate further. The best long-term solution is likely to pre-allocate the views in the PCG.

Needing more iterations to converge would be the most concerning thing for me. @s-mayani do you know whether you are seeing the same behaviour in your tests?

@PaulFisch
Collaborator Author

I should clarify that I did not actually observe my branch needing more iterations to converge; I was just speculating about what might cause the difference in performance. If it is the mallocs, the cleaner fix is to refactor the PCG so that it does not malloc in every CG iteration, which should give a massive performance boost. Since I did not observe a slowdown within an iteration in my tests, it could in theory be that my branch needs more iterations, which would also explain the longer runtime; that is something I would need to investigate if it turns out to be the case.

@aaadelmann
Member

If you could paste the timing profiles into this PR, I will make an issue out of it.

Labels

enhancement (New feature or request), gitlab-mirror

5 participants