
PIF, Higher Order and Optimized Scatter/Gather, #501

Open
PaulFisch wants to merge 34 commits into
IPPL-framework:master from PaulFisch:pif-pr

Conversation

@PaulFisch
Collaborator

This PR adds the Particle-in-Fourier method to IPPL, together with restructured FFT, scatter/gather, particle layout, communication, and timing infrastructure. A list of changes:

Build system

  • New IPPL_ENABLE_FINUFFT (CPU finufft / GPU cufinufft via FetchContent)
    and IPPL_ENABLE_CUFFTMP (cuFFTMp + NVSHMEM) build flags, both default OFF.
  • Kokkos default bumped to 5.0.0
  • cmake/AddIpplTest.cmake passes TEST_LINK_LIBS through to non-integration
    tests.
  • cmake/AutoTunePresets.cmake, IpplAutoTunePresets.h.in, and pre-tuned
    sweeps under cmake/auto_tune/sm_90/ ship the autotuner data alongside
    the build (for new scatter/gather kernels).

FFT refactor

  • The old FFT.h/.hpp is split into a Backend layer
    (Backend/{Backend,CuFFT,CuFFTMp,Heffte}.h) handling plan
    management, and a Transform layer
    (Transform/{CC,RC,PrunedCC,PrunedRC,Trig,NUFFT}.h) exposing per-transform
    front-ends.
  • A new native NUFFT (Type-1 and Type-2) lives under src/FFT/NUFFT/:
    • NativeNUFFT.h drives the upsampling-FFT pipeline.
    • ESKernel.h is an exponential-of-semicircle spreading kernel with
      width-specialised optimized evaluation.
    • Correction.h provides deconvolution factors, NUFFTUtilities.h
      provides helpers.
  • The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp}
    and supports both the native engine and the (cu)FINUFFT backend behind
    IPPL_ENABLE_FINUFFT.

Interpolation

  • New scatter dispatch in Interpolation/Scatter/: Scatter.h selects
    between AtomicScatter, GridParallelScatter (output-focused), and
    TiledScatter back-ends through ScatterConfig and a runtime
    TileSizeCache loaded from the autotune presets.
  • Matching gather front-end (AtomicGather, AtomicSort variants) under
    Interpolation/Gather/.
  • Binning.h bin-sorts particles by tile so hot tiles avoid atomic
    contention. Kernels.h, WidthDispatcher.h, and CoordinateTransform.h
    are shared with the native NUFFT.
  • AutoTune.{h,cpp} and TileSizeCache.cpp provide the runtime
    cache-lookup that lets the same binary pick a different tile size on
    different hardware.

Particle layout

  • ParticleSpatialLayout::update is rewritten from the per-iteration
    vector-of-vectors model into a packed locate + single-pass send/recv
    pipeline with reusable scratch (rankSendCount, sendOffsets, sendIds,
    cursor). Buffers are reused across updates and host-side hashes are
    avoided.
  • ParticleSort.h / SortBuffer.h provide a bin-sort with a shared
    device-side buffer pool (BinSortBuffers) so repeated sorts in
    scatter/gather hot paths reuse storage.
  • FieldLayout now stores nghost; HaloCells.{h,hpp} thread it through
    accumulateHalo / fillHalo, enabling kernel-width-aware halos (>1
    ghost cell). (Before this PR, nghosts>1 was broken in IPPL)
  • ParticleAttrib exposes scatterPIFNUFFT / gatherPIFNUFFT guarded by
    IPPL_ENABLE_FFT.
  • ParticleAttrib::scatter(...) / ParticleAttrib::gather(...) (CIC overloads used by
    every alpine PIC example) now
    forward to the kernel-aware path with LinearKernel<MeshType> and a
    default ScatterConfig / GatherConfig. PIC paths therefore inherit the
    same atomic / atomic-sort / tiled / output-focused dispatch and
    TileSizeCache lookup as NUFFT.

Communication and Particle Update

This PR removes a per-attribute buf_m send/recv staging view for every
ParticleAttrib. The old update path went
attribute view -> per-attribute buf_m -> MPI buffer -> per-attribute buf_m -> attribute view, with one buf_m per attribute per exchange and copies on
both ends. The new path serializes attribute data directly into a shared,
reused MPI buffer managed by a singleton BufferHandler, and
deserializes straight back into the destination attribute's storage at a
caller-supplied offset. The per-attribute buf_m is not needed for the common path.

  • ParticleAttrib::serialize / deserialize now operate on dview_m
    directly and accept a hash view (gather-on-the-fly of the live
    particle indices going to one rank) and a destination offset. There is
    no intermediate per-attribute buffer to size, allocate, copy into, or
    copy out of. getView() returns a subview over the live range only, so
    attribute kernels never see the overallocated tail.
  • A granularity-aware GPU allocator is added to Archive. Device
    archives are allocated with raw cudaMalloc / hipMalloc so the
    pointer is page-aligned. Host archives keep using a regular
    Kokkos::View.

Utilities

  • IpplTimings.{h,cpp}: track per-call measurements alongside cumulative
    wall time, support resetAllTimers() (for warmup), and add
    dumpToCSV() for offline analysis. print() reports measurement counts
    and per-occurrence statistics (mean, stddev, min, max).
  • BufferView.h: lightweight, type-erased view over a contiguous device
    byte buffer, used by the new send/recv path.
  • Tuning.h: TileSizeTuner interface to Kokkos Tools autotuner that
    enumerates tile-size candidates for the new scatter/gather back-ends.
  • ParallelDispatch.h: helpers for selecting parallel-for vs.
    parallel-reduce policies based on view rank/extents and execution space.

Alpine

  • alpine/ElectrostaticPIF/{LandauDampingPIF, LandauDampingPIFPruned, BumponTailInstabilityPIF, PenningTrapPIF}.cpp: added as Alpine PIF examples.
  • alpine/ElectrostaticPIF/{ChargedParticlesPIF, ChargedParticlesPIFPruned}.hpp:
    particle-bunch container specialised for the PIF schemes

Unit tests

  • FFT/NUFFT.cpp + NUFFTTestUtils.h: test matrix for NUFFT Type-1
    and Type-2 (small / medium grid, with / without upsampling, three spread
    methods, tolerance sweep at 1e-4 / 1e-7 / 1e-10, FINUFFT cross-check if compiled,
    three gather methods).
  • FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the
    native NUFFT against a direct DFT reference.
  • Interpolation/KernelGatherScatterTest.cpp: charge conservation,
    periodicity, gather-of-constant-field, scatter/sort comparison,
    roundtrip; covers Atomic, Tiled, OutputFocused
  • Interpolation/Binning.cpp: bin-sort correctness and stability tests.
  • Particle/ParticleUpdate.cpp: spatial-layout update tests under uniform
    decomposition.
  • Particle/ParticleUpdateNonuniform.cpp: same for non-uniform domain
    decompositions

Integration tests

  • test/FFT/TestFFTCC.cpp extended for the new pruned variants and the
    heFFTe backend trait.
  • test/particle/benchmarkParticleUpdate.cpp switched to
    Kokkos::Random_XorShift64_Pool device init (avoids host->device copies
    between iterations) and adapted to the new ParticleSpatialLayout
    signature.

PaulFisch added 15 commits May 4, 2026 11:40
- Top-level CMakeLists.txt: add option(IPPL_ENABLE_FINUFFT) and
  option(IPPL_ENABLE_CUFFTMP), both defaulting OFF.
- cmake/Dependencies.cmake: add FetchContent block for (CU)FINUFFT and
  the cuFFTMp / NVSHMEM discovery block. Bump Kokkos default to 5.0.0
  and enable Kokkos_ENABLE_COMPILE_AS_CMAKE_LANGUAGE.
- cmake/InstallIppl.cmake: install finufft / cufinufft targets when
  built in-tree.
- cmake/AddIpplTest.cmake: pass TEST_LINK_LIBS through to non-integration
  tests as well.
- src/CMakeLists.txt: link Heffte/finufft/cufinufft only when FFT and
  the corresponding option are enabled; embed build metadata as an
  opt-in to reduce rebuild churn.
- .gitignore: exclude .idea*, cmake-build-*.
- Split the original FFT.h/FFT.hpp into:
  - Backend/{Backend,CuFFT,CuFFTMp,Heffte,Interface}.h: per-library
    backend selection and plan management.
  - Transform/{CC,RC,PrunedCC,PrunedRC,Trig,Common,Transform,
    NUFFT.{h,hpp}}.h: per-transform front-ends (complex-complex,
    real-complex, pruned variants, trigonometric, NUFFT).
  - Traits.h: backend availability traits and selection helpers.
- Add a native Kokkos NUFFT (Type-1 / Type-2) engine under FFT/NUFFT/:
  NativeNUFFT.h driving the upsampling-FFT pipeline, ESKernel.h
  exponential-of-semicircle spreading kernel with width-specialised
  Estrin polynomial evaluators, Correction.h deconvolution factors,
  Quadrature.h Gauss-Legendre quadrature, NUFFTUtilities.h helpers.
- The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp}
  and supports both the native engine and the (cu)FINUFFT backend
  behind IPPL_ENABLE_FINUFFT.
- Add Scatter/{Scatter,GridParallelScatter,AtomicScatter,TiledScatter,
  ScatterConfig,ScatterArgumentsBase,TileSizeCache}.h: a configurable
  scatter front-end that selects between atomic, output-focused, and
  tiled grid-parallel back-ends, with an optional tile-size cache for
  hardware-specific tuning data loaded at runtime.
- Add Gather/{Gather,AtomicGather,GatherConfig,GatherArgumentsBase}.h:
  matching gather front-end (Atomic / AtomicSort variants).
- Add Binning.h: bin-sort utilities used to group particles by tile so
  scatter/gather kernels can avoid atomics on hot tiles.
- Add Kernels.h: shared kernel-evaluation helpers consumed by the new
  scatter/gather kernels.
- Add WidthDispatcher.h: compile-time switch from runtime kernel width
  to the corresponding template specialization.
- Add CoordinateTransform.h: physical <-> grid coordinate helpers used
  by both scatter/gather and the native NUFFT.
- CIC.hpp: minor signature alignment to the new kernel concept.
- Particle/ParticleSpatialLayout.{h,hpp}: rewrite the spatial layout's
  update path to use a packed locate + single-pass send/recv pipeline
  with reusable scratch (rankSendCount, sendOffsets, sendIds, cursor)
  instead of the per-iteration vector-of-vectors approach. Reuses
  buffers across updates and avoids host-side hashes for the common
  case.
- Particle/ParticleSort.h, SortBuffer.h: bin-sort by spatial layout
  with a shared device-side buffer pool (BinSortBuffers) so repeated
  sorts in scatter/gather hot paths reuse storage.
- Particle/ParticleAttrib.{h,hpp}, ParticleAttribBase.h, ParticleBase.{h,
  hpp}: thread the new sort/bin-aware scatter and gather entry points
  through the attribute interface; add scatterPIFNUFFT/gatherPIFNUFFT
  guarded by IPPL_ENABLE_FFT.
- FieldLayout/{FieldLayout.h,FieldLayout.hpp,SubFieldLayout.hpp}: store
  nghost on the layout so the halo and neighbor lookup pick up the
  correct ghost width when more than one cell is needed.
- Field/HaloCells.{h,hpp}: thread nghost through accumulateHalo and
  fillHalo so the kernel-width-aware halo path works.
- IpplTimings.{h,cpp}: track per-call measurements alongside the
  cumulative wall time, support resetAllTimers() (for warmup), and
  add dumpToCSV() for offline analysis. Print() now also reports
  measurement counts and per-occurrence statistics (mean, stddev,
  min, max).
- BufferView.h: lightweight, type-erased view over a contiguous
  device byte buffer used by the new send/recv path.
- Tuning.h: TileSizeTuner — Kokkos-tools-driven autotuner that
  enumerates tile-size candidates for the new scatter/gather back-ends.
- ParallelDispatch.h: helpers for selecting parallel-for vs
  parallel-reduce policies based on view rank/extents and execution
  space.
- TypeUtils.h: detail::is_complex_v / MultispaceContainer helpers
  used by the buffer-handler-aware logging path and by the field
  inner-product (Hermitian for complex T).
- ParameterList.h, Timer.cpp, ViewUtils.h: small follow-on tweaks.
- Archive.{h,hpp}: extend the device-side serialize/deserialize path
  to handle Vector<T,Dim> + hash views (used by the new packed
  particle send/recv pipeline) and add a granularity-aware GPU buffer
  allocator. On HIP, the GPU page granularity (64 KiB on MI250X /
  MI300X) is required by HSA IPC for large-message GPU-aware MPI.
- BufferHandler.hpp: round buffer requests up to a 4 KiB page so we
  do not churn the GPU-aware MPI registration cache with tiny
  allocations; allocate a fresh buffer on grow rather than realloc to
  avoid stale-IPC issues with reused virtual addresses.
- Communicator.{h,cpp}: add archive-only isend/recv overloads (used
  by the new layout); switch buffer handlers to a static singleton
  via getBufferHandler() so all communicators share one pool.
- CommunicatorLogging.cpp: gatherLocalLogs() now uses an is_a_logger
  SFINAE guard so we only call getLogs() on a LoggingBufferHandler.
- Buffers.{cpp,hpp}: route deleteAllBuffers/freeAllBuffers/getBuffer
  through the singleton handler.
- Environment.{h,cpp}: initialize MPI with MPI_THREAD_MULTIPLE and
  expose threadMultiple() so callers know whether multithreaded sends
  are safe.
- Field/BareField.{h,hpp}: drop the nghost default on initialize() /
  updateLayout() now that callers have to pick the correct width;
  use ippl::detail::infinity<T>() for the Vector min/max identities
  so reductions with integer T compile.
- Field/BareFieldOperations.hpp: support complex-valued inner products
  via Hermitian conjugation and split allreduce of real/imag parts
  (std::plus<complex> is not a standard MPI Op).
- Field/Field.hpp: small helper for the new scatter integration.
- PoissonSolvers/FFT{Open,Periodic,TruncatedGreenPeriodic}PoissonSolver.h:
  pull heffteBackend from the new fft::HeffteBackend trait instead of
  FFT_t::heffteBackend.
- Types/IpplTypes.h, Vector.hpp: expose detail::infinity<T> for both
  floating-point and integer T (used by the BareField reduction
  identities).
- FEM/LagrangeSpace.hpp: simplify a parameter type (point_t alias).
- Ippl.cpp: call detail::finalizeBinSortBuffers() on shutdown to
  release the new sort scratch.
- Ippl.h: drop the implicit ParallelDispatch include; consumers that
  need it (HaloCells, ParticleSpatialLayout) include it directly.
- alpine/ElectrostaticPIF/{LandauDampingPIF,LandauDampingPIFPruned,
  BumponTailInstabilityPIF,PenningTrapPIF}.cpp: physics drivers using
  the new NUFFT engine to scatter charge onto Fourier modes and
  gather the resulting field at particle locations.
- alpine/ElectrostaticPIF/{ChargedParticlesPIF,ChargedParticlesPIFPruned}.hpp:
  particle-bunch container specialized for the PIF schemes
  (NUFFT-based density and force evaluation, energy diagnostics).
- alpine/ElectrostaticPIF/CMakeLists.txt: build the four executables;
  link (cu)finufft when IPPL_ENABLE_FINUFFT is on.
- alpine/CMakeLists.txt: register ElectrostaticPIF as an alpine
  subdirectory.
- alpine/{BumponTailInstabilityManager,LandauDampingManager,
  PenningTrapManager}.h, ParticleContainer.hpp: small adjustments
  picked up via the new ParticleSpatialLayout signatures.
…ticleUpdate

- FFT/NUFFT.cpp + NUFFTTestUtils.h: TYPED_TEST matrix for NUFFT
  Type-1 and Type-2 (small/medium grid, with/without upsampling,
  three spread methods, tolerance sweep at 1e-4/1e-7/1e-10, FINUFFT
  cross-check, three gather methods).
- FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the
  native NUFFT against a direct DFT reference.
- Interpolation/KernelGatherScatterTest.cpp + Utils.h: charge
  conservation, periodicity, gather-of-constant-field, scatter/sort
  comparison, roundtrip; covers Atomic, Tiled, OutputFocused
  variants (incl. Z-batched).
- Interpolation/Binning.cpp: bin-sort correctness and stability tests
  for the new bin_sort path.
- Particle/ParticleUpdate.cpp: spatial-layout update tests under
  uniform decomposition.
- Particle/ParticleUpdateNonuniform.cpp: same but for non-uniform
  domain decompositions to exercise the rebinning logic.
- Particle/ParticleSendRecv.cpp: minor updates for the new
  Communicator API.
- Wire all new tests into the CMakeLists.txts for unit_tests/{FFT,
  Interpolation,Particle}.
- test/FFT/TestFFTCC.cpp: extend the existing FFT correctness check
  to exercise the new pruned variants and the heFFTe backend trait.
- test/particle/benchmarkParticleUpdate.cpp: switch warmup to
  Kokkos::Random_XorShift64_Pool device init (avoids host->device
  copies between iterations) and adjust to the new
  ParticleSpatialLayout signature.
- test/particle/benchmarkParticleUpdateScaling.cpp: new strong-scaling
  benchmark that exercises ParticleSpatialLayout::update across a
  range of particle counts and ranks.
- test/particle/CMakeLists.txt: register the new scaling benchmark.
- test/{kokkos/TestVectorField{,2,3,4},field/TestFieldBC,particle/PICnd,
  serialization/serialize01}.cpp: minor signature touch-ups
  (host_mirror_type, isAllPeriodic flag, etc.) so the existing
  integration tests still build against the merged-in upstream and
  the new layout APIs.
- test/CMakeLists.txt: minor adjustment to keep all subdirectories
  registering correctly under the new structure.
- Correction.h: applyPreCorrectionType2 now multiplies by factor (not
  conj(factor)), matching the working pruned path.
- AtomicScatter.h: clamp team_size and vector_length to 1 on Serial.
- BufferHandler test: use page-aligned expected sizes; resized larger
  test now expects the smaller free buffer to be retained.
- NUFFT test: zero the non-corner-DC band of the upsampled input.
- ParticleBase test: re-acquire host mirror after destroy().
- LandauDampingManager: open the manager CSV in OVERWRITE on the first
  step and APPEND afterwards, so re-running the test does not produce a
  file with multiple header blocks.
- NUFFTAccuracy: call computeDFTReference before the owned-mode early-
  exit so every rank participates in the same MPI_Allreduce sequence.
- GatherScatterTest::ScatterCustomHashTest: size the host fill loop by
  the post-update local count (getHostMirror() now returns a mirror of
  the live-particle range).
- ParticleUpdateNonuniform::ParticleInjectionBetweenOrbRepartitions:
  the test creates a fresh bunch each cycle, so compare against
  injectPerCycle rather than a cumulative count.
- unit_tests/FFT/CMakeLists.txt: bump NUFFTAccuracy timeout to 600s.
- CMakeLists.txt: add -fno-lto for the CUDA platform to avoid host-LTO
  conflicts with the duplicate fatbinData symbols nvcc emits.
- AddIpplTest.cmake: introduce IPPL_DEFAULT_INTEGRATION_TIMEOUT=300 and
  use it for INTEGRATION tests (the unit-test default of 60s was too
  tight on slower OpenMP runs).
- BareFieldOperations.hpp: pre-capture view1/view2 with (void)-casts so
  nvcc accepts the extended host/device lambda when the constexpr-if
  branches first-use them.
- NUFFT.cpp: in runType2Test, compare absolute error against the
  L-infinity-norm-scaled tolerance instead of a pointwise relative
  error. The pointwise refReal can be arbitrarily close to zero for
  random fields, which made the test sensitive to the (rank, thread)
  layout that drives the random seeding.
- Bump TIMEOUTs that were overshooting on multi-thread OpenMP:
  unit_tests/FFT NUFFTAccuracy (1800), Interpolation
  KernelGatherScatterTest (600), PIC ORB (300), test/solver/fem
  TestNonhomDirichlet_1d_preconditioned/TestPeriodicBC_sin{,_preconditioned}
  (600). Also fix the misplaced TIMEOUT keyword inside LABELS lists in
  the solver/fem CMakeLists.
…ives

Archive<Properties...> previously took the cudaMalloc/hipMalloc path
unconditionally whenever KOKKOS_ENABLE_CUDA or KOKKOS_ENABLE_HIP was
defined. That meant Archive<Kokkos::HostSpace> — used by host-side
exec spaces such as Kokkos::Serial / Kokkos::OpenMP — was backed by a
device pointer, but its serialize() runs std::memcpy on the host and
therefore segfaulted on the very first MPI exchange under a CUDA
build with more than one rank.

Switch the storage choice to depend on the archive's memory space
rather than the build configuration. Device archives still get the raw
cuda/hipMalloc path (preserving the page-aligned / IPC-stable behaviour
needed for the Cray-MPICH HIP IPC bug); host-accessible archives use a
regular Kokkos::View<char*, MemorySpace> and the host-side memcpy in
serialize() now writes to a host pointer.
…work

ParticleAttrib::scatter(...) and ParticleAttrib::gather(...) (the
kernel-less, atomic-CIC overloads used by every alpine PIC example)
now forward to the kernel-aware scatter_kernel/gather path with
LinearKernel<MeshType> and a default ScatterConfig/GatherConfig. This
gives the PIC paths the same atomic / atomic-sort / tiled / output-
focused dispatch (and TileSizeCache lookup) as NUFFT, while keeping
the legacy hash-and-custom-range overload available for
TestHashedScatter / GatherScatterTest::ScatterCustomHashTest.

Higher-dimensional fields (Dim > 3) still use the direct CIC kernel
because the new framework's AtomicScatter/AtomicGather only have
1-, 2-, 3-D specialised paths.

Interpolation::Gather/Scatter::DeduceTypes now derives RealType from
the kernel's value_type instead of from the position attribute. With
mixed precisions (float positions on a double mesh, e.g. PICTest with
float T) this avoids silently downcasting the mesh spacing and
preserves bit-identical results vs. the legacy CIC path
(ASSERT_DOUBLE_EQ in PICTest::Gather is now exact again, no
relaxation needed).
@aaadelmann aaadelmann requested review from aaadelmann, aliemen and srikrrish and removed request for aliemen May 4, 2026 10:31
@aaadelmann aaadelmann added enhancement New feature or request gitlab-mirror labels May 4, 2026
Comment thread alpine/ParticleContainer.hpp Outdated
Comment thread cmake/Dependencies.cmake
Comment thread src/FFT/Backend/Heffte.h Outdated
Comment thread src/FFT/NUFFT/ESKernel.h
Comment thread unit_tests/FFT/NUFFT.cpp Outdated
@srikrrish
Member

cscs-ci run cscs-ci-gh200, cscs-ci-mi300, cscs-ci-openmp

@PaulFisch
Collaborator Author

I still have a commit fixing some of the compile failures on juwels, I will also address your comments @srikrrish

@aaadelmann aaadelmann self-assigned this May 6, 2026
Member

@aaadelmann aaadelmann left a comment


Complexity is super high for one PR. The PR combines FFT backend refactor, NUFFT, scatter/gather redesign, particle migration rewrite, autotuning, timing changes, build-system changes, examples, and many tests. That makes review and regression attribution difficult.

Would it be possible to have the FFT as a separate PR?

@@ -0,0 +1,610 @@
// ChargedParticlesPIF header file
Member


git diff --check reports trailing whitespace in ChargedParticlesPIF.hpp (line 535). There are also long lines and local comments like “Could definiitely be improved” in Scatter.h (line 121), which should not go into production code.

@PaulFisch
Collaborator Author

Complexity is super high for one PR. The PR combines FFT backend refactor, NUFFT, scatter/gather redesign, particle migration rewrite, autotuning, timing changes, build-system changes, examples, and many tests. That makes review and regression attribution difficult.

Would it be possible to have the FFT as a separate PR?

Thanks for the feedback, and I do see the point about the PR being large. Before going through the individual pieces, some context that I hope is useful: I have put a considerable amount of time into preparing this PR, unpaid and without other compensation, upon specific and repeated request from your team. I was very glad to do so, but I do also have a job and a master thesis running in parallel. If there is no interest in merging the changes after all, I would really have appreciated hearing that sooner, before I spent tens of hours on restructuring.

Going through the components:

  • The FFT (where I assume the NUFFT is meant) relies quite directly on the Scatter/Gather redesign, as higher-order Scatter/Gather is at the core of a NUFFT. Without it, there is no NUFFT. The algorithm for width-1 (i.e. PIC) does not change. Since IPPL previously had no support for higher-order Scatter/Gather, it is unfortunately not possible to merge the NUFFT without also adding the Scatter/Gather work.
  • The FFT refactor itself can easily be undone by merging all the currently included headers in FFT.h back into a single ~2000-LOC file, since I did not significantly change the FFT code itself but only broke it up into pieces. Happy to do that if it would help review.
  • The particleUpdate is indeed separate and I can pull it out of the PR relatively easily. As shown in the plot above, it gives a substantial performance boost, and I also added a significant amount of tests to the particle update which surfaced a few minor bugs in the previous version. Happy to revert if that is preferred.
  • The Auto-Tuning feature is used solely for the higher-order Scatter/Gather kernels, which do not currently exist in IPPL. It could be removed, at the cost of noticeably decreased performance. It selects optimal scatter/gather strategies based on local particle density, and was something Sri was particularly interested in for imbalanced distributions.
  • I can remove the timing changes if desired. I would still mention that timings generated without warmup, which was the case before, can be quite misleading. For PIC, a significant share is spent in IPC buffer registration on Alps.
  • The build system changes are quite limited: only what is needed for the higher-order Scatter/Gather kernels, a Kokkos default bump to 5.0.0 (which I can revert), and the optional cuFINUFFT dependency. I can also remove the cuFINUFFT path entirely, though it does perform better than the alternative in some cases.
  • I can certainly remove tests from the PR, though I would push back on that one. Especially for a PR of this size, the test coverage should be kept.
  • Regarding the "this could be better" comments in Scatter.h, I am happy to remove them if preferred. To clarify what they are about: they refer to strategies for selecting between scatter algorithms, not to the algorithms themselves. Every algorithm in there is faster than what was in IPPL before. The comments are simply meant as hints for future contributors who might find even better selection heuristics, since I am sure better ones exist. If the view is that the code is not fast enough for IPPL without that further improvement, I would just note that by the same standard, the previous scatter/gather and particle update paths would also not have qualified. Either way, happy to take the comments out.

@aaadelmann
Member

aaadelmann commented May 6, 2026

I totally understand your boundary conditions. No one has said that we will not merge or that you should undo anything; again, it is simply a complexity problem, in both reviewing and testing. Consider my comment on breaking up as "resolved". If we really need this, we can split up the merge ourselves.

@srikrrish
Member

Currently Jülich's CI/CD has some token issues (pushing the final results to the web page), and my colleague and I are looking into it. But I was able to see the results of the tests directly in GitLab and they were all passing. I now see that CSCS's CI is also passing. @PaulFisch We greatly appreciate your work and time on this, and if you find a bit of time to improve the documentation that would be great. But otherwise we understand.

@aliemen Could you maybe pull this branch and check the open solver or whatever is relevant for you/OPALX? @aaadelmann Could you maybe run some tests on LUMI as you mentioned before?

Since Paul has already posted results on Alps, we will then be good to merge in my opinion.

@PaulFisch
Collaborator Author

I can perform some LUMI scalings, I already have the binaries compiled there. It would probably indeed be good to check some applications before merging this.

@srikrrish
Member

I can perform some LUMI scalings, I already have the binaries compiled there. It would probably indeed be good to check some applications before merging this.

That would be great. Thanks @PaulFisch. As I mentioned before, @aliemen and/or @aaadelmann can check some OPALX-relevant things, I think.

@aliemen
Collaborator

aliemen commented May 7, 2026

@srikrrish yes, I will compile OPALX against the branch and try to run our regression tests. Will come back with the results probably later today or tomorrow morning.

@aaadelmann
Member

Sonali will also run all her solver tests on high node counts.

@PaulFisch
Collaborator Author

I also just found a bug in both my branch and master, where the overallocation is cast to an integer everywhere. In the Alps PIC run I used a non-integer overallocation for both master and my branch. I will push that together with the documentation and add an updated scaling here when the runs finish.

On LUMI, both master and my branch segfault after a few iterations. I am unsure if this is due to FS issues today (https://lumi-supercomputer.eu/lumi-service-status/information-lustre-filesystem-performance/), the heffte issue, OOM, or something else. I am trying to triage.

Happy to help/discuss in case there are regressions anywhere and we need to split up the PR in some way. I unfortunately had conflicting meetings on Tuesdays the last few weeks, but I am also happy to set up a meeting if required.

Comment on lines +127 to +133
# Bake into a generated header consumed by TileSizeCache / GatherCache.
set(IPPL_AUTOTUNE_PRESET_DIR "${_dst_dir}")
set(IPPL_AUTOTUNE_ARCH_TAG "${_tag}")
configure_file(
"${CMAKE_SOURCE_DIR}/cmake/IpplAutoTunePresets.h.in"
"${CMAKE_BINARY_DIR}/include/IpplAutoTunePresets.h"
@ONLY)
Collaborator


This breaks when IPPL is consumed via FetchContent from OPALX because CMAKE_SOURCE_DIR points to the top-level OPALX source directory, not the IPPL source directory. In that case AutoTunePresets.cmake tries to read:

${CMAKE_SOURCE_DIR}/cmake/IpplAutoTunePresets.h.in

which resolves to opalx/cmake/IpplAutoTunePresets.h.in, but the template actually lives in IPPL's own cmake/ directory.

A fix is to resolve paths relative to AutoTunePresets.cmake itself, e.g.

set(_ippl_cmake_dir "${CMAKE_CURRENT_FUNCTION_LIST_DIR}")
set(_src_dir "${_ippl_cmake_dir}/auto_tune/${_tag}")

configure_file(
  "${_ippl_cmake_dir}/IpplAutoTunePresets.h.in"
  "${CMAKE_BINARY_DIR}/include/IpplAutoTunePresets.h"
  @ONLY)
...

With this suggestion, the cmake runs fine. Still doesn't compile - but that's the Kokkos 5 issue. Will comment again when I succeeded.

Collaborator Author


I will add that. Is OPALX incompatible with Kokkos 5.x, or is there something specific I do with it? I don't think I am using any Kokkos 5.x features, so I should be able to bump down the version easily.

Collaborator Author


That is, I was bumping the version because Kokkos 5.x improves performance in some relevant parts (in particular atomics), so long-term it is probably nice to have Kokkos 5.x in OPALX. But it should not block merging this; I can just bump the version back down.

Collaborator


You're right, I thought most of the other errors were Kokkos 5 issues, but it works with Kokkos 4. With my fixes below (using size_type everywhere) I was able to compile it. OPALX still crashed, but I am looking into it.

```cpp
using execution_space = typename Types::execution_space;

struct Arguments : GatherArgumentsBase<Arguments, Types> {
    using PermuteView = Kokkos::View<uint64_t*, memory_space>;
```
Collaborator

In some places, `Kokkos::View<uint64_t*, memory_space>` and `Kokkos::View<ippl::detail::size_type*, memory_space>` are mixed (maybe that's only a problem because OPALX uses size_type more often...). I think it would be nice to use size_type everywhere.

On macOS this fails because std::size_t is unsigned long, while uint64_t is unsigned long long, so these are different Kokkos view value types and cannot be assigned to each other.

Collaborator Author

I will try to address this. The problem is that size_type is not size_t: on some execution spaces it is a 32-bit integer, and the particle number can easily exceed the maximum 32-bit value.

Collaborator Author

Ah, but the uint64_t is definitely a bug; it should be size_t everywhere to get the wider integer type.

Collaborator Author

Actually, do you mean ippl::size_type or the Kokkos size_type that is a typedef of the execution spaces? I can try to use ippl::size_type uniformly; that is surely the cleanest solution, and probably what you meant.

Collaborator
@aliemen May 8, 2026

Ah yes, I was talking about ippl's size_type. In OPALX we aliased something like `using size_type = ippl::detail::size_type;`, so I was talking about that one.

Comment thread src/Particle/ParticleAttrib.h
@aliemen
Collaborator

aliemen commented May 7, 2026

Alright, after the size_type fix, the CMake path fix, and the R.extent(0) vs R.size() fix, I was able to compile and run OPALX. Then I had to fix a unit test that couldn't handle the temporary subview returned by getView, but that was an OPALX problem again.

Now, all of our unit and regression tests passed cleanly on CPU and GPU (GH200 on Daint). Our regression tests only test single rank. But from my local runs, all of them looked good with up to 8 ranks (CPU).

Note that I pushed a branch to IPPL, test-pr-501, that I used to compile OPALX against, where I applied my fixes. Feel free to look at it if you want to see what works with OPALX. The changes I made to OPALX are in 412-...; I will also make a PR as soon as possible.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:

TOTAL_scalings_1024_paper_alps.pdf
TOTAL_scalings_512_paper_alps.pdf

I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

@srikrrish
Member

> Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:
>
> TOTAL_scalings_1024_paper_alps.pdf TOTAL_scalings_512_paper_alps.pdf
>
> I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

> Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

Yeah sorry I mean only the PCG solver.

@srikrrish
Member

> Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.
>
> Yeah sorry I mean only the PCG solver.

Could you maybe resubmit the run corresponding to that data point only, to see if it is a one-time effect?

@PaulFisch
Collaborator Author

Generally the PCG performance decreased, right? Can you send me the script you submitted with? Then I can try to analyze the cause.

After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCm/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce it in a standalone project without Kokkos. The segfault has nothing to do with my branch; it just seems to appear more often in my branch than in master for some reason. With enough HSA debugging environment variables enabled, the segfault happens reliably in the first warmup FFT. I will try building in different environments to see if that helps. It is unclear whether this is a heffte, Cray MPICH, HSA, or ROCm compiler bug.

@srikrrish
Member

> Generally the PCG performance decreased, right? Can you send me the script you submitted with? Then I can try to analyze the cause.
>
> After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCm/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce it in a standalone project without Kokkos. The segfault has nothing to do with my branch; it just seems to appear more often in my branch than in master for some reason. With enough HSA debugging environment variables enabled, the segfault happens reliably in the first warmup FFT. I will try building in different environments to see if that helps. It is unclear whether this is a heffte, Cray MPICH, HSA, or ROCm compiler bug.

Thanks a lot. This seems like something @paubauer may be able to help with; the standalone reproducer could be quite useful for him. No need to spend a lot of time on this, especially if it seems to lead down a rabbit hole.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

I launched the job for that specific data point; it's in the queue. As for the heffte issue, it seems related to issue #481.

Comment on lines +217 to +220
```cpp
if constexpr (Types::KernelType::has_width_template)
    kw_ptr[D * W + i] = args.kernel.template eval<W>(xi);
else
    kw_ptr[D * W + i] = args.kernel(xi);
```
Collaborator

For some reason, this is a problem with CUDA 12.1 and GCC 12 on Gwendolen. I cannot compile OPALX against it:

```
.../cuda-ampere80/build/_deps/ippl-src/src/Interpolation/Scatter/AtomicScatter.h(218): error: class "ippl::Interpolation::LinearKernel<double>" has no member "eval"
                      kw_ptr[D * W + i] = args.kernel.template eval<W>(xi);
                                                               ^
```

CUDA 12.1 with GCC 12 is diagnosing the discarded eval<W> expression inside a Kokkos device lambda/team kernel. Newer CUDA, like 12.9 with GCC 14, apparently handles that template/device-lambda case correctly...

We could either find a workaround (like making all interpolation kernels provide a harmless width-templated wrapper) or we say "doesn't work with cuda/12.1" and ignore it. But that would mean we need a new module on Gwendolen (which is the case for Kokkos 5 anyways). @aaadelmann what's your opinion?

Collaborator Author

There were still a lot of compiler bugs with this kind of expression even in CUDA 12.9. It is probably not too difficult to build a workaround on my side; I will check.

@s-mayani
Collaborator

s-mayani commented May 8, 2026

The new run still produced a similar timing, with the PCG flattening at 512 GPUs.

@aaadelmann
Member

@s-mayani if you still have time until departure, send me all timing files.

@PaulFisch
Collaborator Author

I tried to look a bit at the PCG profiles: most time is spent in the cudaMalloc and cudaFree calls in every iteration. In my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: (a) my branch allocates other things differently, and this affects the timing of allocations in the PCG; (b) my branch needs more iterations to converge for some reason. I will investigate further. The best long-term solution is likely to pre-allocate the views in the PCG.

@srikrrish
Member

> I tried to look a bit at the PCG profiles: most time is spent in the cudaMalloc and cudaFree calls in every iteration. In my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: (a) my branch allocates other things differently, and this affects the timing of allocations in the PCG; (b) my branch needs more iterations to converge for some reason. I will investigate further. The best long-term solution is likely to pre-allocate the views in the PCG.

Needing more iterations to converge would be the most concerning thing for me. @s-mayani do you know whether you are seeing the same behaviour in your tests?

@PaulFisch
Collaborator Author

I should clarify that I did not actually observe my branch needing more iterations to converge; I was just speculating about what might cause the difference in performance. If it is the mallocs, the cleaner fix is to refactor the PCG so that it does not malloc in every CG iteration, which should give a massive performance boost. Since I did not observe a slowdown within an iteration in my tests, it could in theory be that my branch needs more iterations, which would also explain the longer runtime; that is something I would need to investigate if it turns out to be the case.

@aaadelmann
Member

If you could paste the timing profiles into this PR, I will make an issue out of it.

Labels

enhancement (New feature or request), gitlab-mirror

5 participants